Reading & writing

Two words cover every way in and out of dpyr: read() and .write(). read() takes a path, a URL, or any tabular object in memory and gives you a dataframe (or a database catalog); .write() sends a result to any of the same places. Whatever the source, the data ends up as Arrow in RAM, and your verbs run on polars or duckdb — you never pick a parser.

There are deliberately almost no options to learn. read() takes the thing (a path, a URL, an object) and, when the source contains more than one table, a name — the sheet of a spreadsheet, the table of a database, the split of a dataset:

read("trees.csv")             # one-table sources: just the path
read("survey.xlsx", "2024")   # the sheet called "2024"
read("forest.db", "plots")    # the table called "plots"

Multi-table sources opened without a name give you a catalog you can explore — a duckdb file lists its tables, a multi-sheet workbook its sheets — so print(read("mystery.xlsx")) is always a safe first move.

That's the whole API. Everything format-specific — what the second argument means, what can go wrong, how to fix it — lives on one page per format below.

Every source and destination

Source / destination	`read()`	`.write()`	Details
`.csv`, `.tsv` (and `.gz`)	✓	✓	CSV & TSV
`.parquet` / `.pq`	✓	✓	Parquet
`.xlsx`, Google Sheets URLs	✓	✓ (`.xlsx`)	Excel & Google Sheets
`.json`, `.jsonl` / `.ndjson`	✓	✓	JSON
`.arrow` / `.feather` / `.ipc`	✓	✓	Arrow IPC
`.db` / `.duckdb` / `.ddb`, `.sqlite` / `.sqlite3`, live connections	✓	✓	Databases
`https://`, `s3://`, `hf://` URLs	✓	—	Remote data
dict, polars, pandas, arrow, numpy, torch/jax, 🤗 datasets	✓	n/a	In-memory objects

An unknown extension fails with the list of what's supported, so the error message is also the documentation. And every source joins every other source — see Joins.

Files round-trip

import tempfile, pathlib
from dpyr import read, col, n

tmp = pathlib.Path(tempfile.mkdtemp())

trees = read({
    "species": ["sugar maple", "red oak", "white pine", "sugar maple"],
    "height_m": [24.0, 19.5, 31.0, 12.5],
    "tapped":  [True, False, False, True],
})

trees.write(str(tmp / "trees.parquet"))
trees.write(str(tmp / "trees.jsonl"))
trees.write(str(tmp / "trees.tsv"))

tall = read(str(tmp / "trees.parquet")).filter(col.height_m > 15)
print(tall.collect()["species"].to_list())
print(read(str(tmp / "trees.jsonl")).collect().height)

['sugar maple', 'red oak', 'white pine']
4

File reads are scans: nothing is parsed until something materializes, and only the columns and rows your chain needs are touched.

Where write() runs

On duckdb-backed chains, write() to parquet, csv, or jsonl compiles to an in-engine COPY (<query>) TO ... — the rows go straight from the database to the file without entering Python. To land results inside an engine instead of a file, see to_table() and to_view() in the backends guide.