Reading & writing
Two words cover every way in and out of dpyr: read() and .write().
read() takes a path, a URL, or any tabular object in memory and gives
you a dataframe (or a database catalog); .write() sends a result to any of
the same places. Whatever the source, the data ends up as Arrow in RAM,
and your verbs run on polars or duckdb — you never pick a parser.
There are deliberately almost no options to learn. read() takes the
thing (a path, a URL, an object) and, when the source contains more
than one table, a name — the sheet of a spreadsheet, the table of a
database, the split of a dataset:
read("trees.csv") # one-table sources: just the path
read("survey.xlsx", "2024") # the sheet called "2024"
read("forest.db", "plots") # the table called "plots"
Multi-table sources opened without a name give you a catalog you can
explore — a duckdb file lists its tables, a multi-sheet workbook its
sheets — so print(read("mystery.xlsx")) is always a safe first move.
That's the whole API. Everything format-specific — what the second argument means, what can go wrong, how to fix it — lives on one page per format below.
Every source and destination
| Source / destination | read() |
.write() |
Details |
|---|---|---|---|
.csv, .tsv (and .gz) |
✓ | ✓ | CSV & TSV |
.parquet / .pq |
✓ | ✓ | Parquet |
.xlsx, Google Sheets URLs |
✓ | ✓ (.xlsx) |
Excel & Google Sheets |
.json, .jsonl / .ndjson |
✓ | ✓ | JSON |
.arrow / .feather / .ipc |
✓ | ✓ | Arrow IPC |
.db / .duckdb / .ddb, .sqlite / .sqlite3, live connections |
✓ | ✓ | Databases |
https://, s3://, hf:// URLs |
✓ | — | Remote data |
| dict, polars, pandas, arrow, numpy, torch/jax, 🤗 datasets | ✓ | n/a | In-memory objects |
An unknown extension fails with the list of what's supported, so the error message is also the documentation. And every source joins every other source — see Joins.
Files round-trip
import tempfile, pathlib
from dpyr import read, col, n
tmp = pathlib.Path(tempfile.mkdtemp())
trees = read({
"species": ["sugar maple", "red oak", "white pine", "sugar maple"],
"height_m": [24.0, 19.5, 31.0, 12.5],
"tapped": [True, False, False, True],
})
trees.write(str(tmp / "trees.parquet"))
trees.write(str(tmp / "trees.jsonl"))
trees.write(str(tmp / "trees.tsv"))
tall = read(str(tmp / "trees.parquet")).filter(col.height_m > 15)
print(tall.collect()["species"].to_list())
print(read(str(tmp / "trees.jsonl")).collect().height)
['sugar maple', 'red oak', 'white pine']
4
File reads are scans: nothing is parsed until something materializes, and only the columns and rows your chain needs are touched.
Where write() runs
On duckdb-backed chains, write() to parquet, csv, or jsonl compiles to
an in-engine COPY (<query>) TO ... — the rows go straight from the
database to the file without entering Python. To land results inside
an engine instead of a file, see to_table() and to_view() in the
backends guide.