Parquet
The format to standardize on. Parquet files are compressed, store the column types (dates stay dates), and can be read partially: dpyr only touches the columns and rows your chain needs. If your project reads the same data many times, convert it to Parquet once:
read("survey.csv").write("survey.parquet")
Reading
from dpyr import read, col
trees = read("trees.parquet") # .pq works too
trees.filter(col.species == "sugar maple").collect()
No options, no second argument — a parquet file is one table, complete
with its schema. Reads are lazy scans: opening the file is instant, and
work happens at collect().
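To make "work happens at collect()" concrete, here is a toy model of a lazy scan in plain Python. It is not dpyr's implementation: `LazyScan`, `load_trees`, and the sample rows are hypothetical names and data invented for this sketch, and the "file" is just an in-memory list. The point is only that building the chain records steps, and the source is read when `collect()` runs.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable

Row = dict


@dataclass
class LazyScan:
    """Toy lazy frame (hypothetical): records steps, runs them in collect()."""
    source: Callable[[], Iterable[Row]]        # called only at collect()
    steps: list = field(default_factory=list)  # pending filter predicates

    def filter(self, predicate: Callable[[Row], bool]) -> "LazyScan":
        # Return a new plan with one more step; no rows are read here.
        return LazyScan(self.source, self.steps + [predicate])

    def collect(self) -> list:
        rows = list(self.source())             # the "read" happens now
        for predicate in self.steps:
            rows = [r for r in rows if predicate(r)]
        return rows


reads = []  # records each time the source is actually read


def load_trees():
    # Stand-in for opening a file; the rows are made-up sample data.
    reads.append("read")
    return [
        {"species": "sugar maple", "height_m": 21},
        {"species": "red oak", "height_m": 18},
    ]


plan = LazyScan(load_trees).filter(lambda r: r["species"] == "sugar maple")
reads_before_collect = list(reads)  # still empty: nothing has been read
maples = plan.collect()             # now the source is read and filtered
```

A real engine goes further than this toy: it can inspect the recorded steps before reading and skip columns or row groups it does not need.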
Many files at once
Globs work, which is how partitioned datasets are usually stored:
read("plots/*.parquet") # every file in the folder, one frame
read("logs/2026-*/**/*.parquet") # nested folders too
All files must share the same columns; they're read as one big table.
Writing
result.write("summary.parquet")
On the duckdb backend this compiles to an in-engine
COPY (<query>) TO ... (FORMAT PARQUET) — the rows go straight from
the database to the file. On polars it streams, so results larger than
RAM can still be written.
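As a rough picture of the DuckDB path, the sketch below builds the kind of statement described above from a query string and an output path. `copy_to_parquet_sql` is a hypothetical helper written for illustration, not part of dpyr or DuckDB, and the `trees` table in the example query is assumed. The statement shape follows DuckDB's `COPY (query) TO 'file' (FORMAT PARQUET)` syntax.

```python
def copy_to_parquet_sql(query: str, path: str) -> str:
    """Build a DuckDB-style COPY statement that writes `query` to `path`."""
    # SQL string literals escape a single quote by doubling it.
    quoted_path = "'" + path.replace("'", "''") + "'"
    # Drop a trailing semicolon so the query can sit inside parentheses.
    inner = query.strip().rstrip(";").strip()
    return f"COPY ({inner}) TO {quoted_path} (FORMAT PARQUET)"


sql = copy_to_parquet_sql(
    "SELECT species, count(*) AS n FROM trees GROUP BY species;",
    "summary.parquet",
)
print(sql)
```

If the `duckdb` package is installed and a `trees` table exists, a string like this could be run with `duckdb.connect().execute(sql)`. The rows would then go from the engine to the file without passing through Python objects.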
When things go wrong
- File doesn't exist: the error names the path; check spelling and working directory.
- read(table=...) does not apply to parquet files: you passed a second argument. Parquet has no sheets or tables; just the path.
- Glob matches nothing: you'll get an empty-source error from the engine. Test the pattern with
import glob; glob.glob("plots/*.parquet")
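The glob check above can be wrapped in a small pre-flight helper that fails with a readable message before the engine is involved. This is a sketch using only the standard library; `matching_files` is a hypothetical name, and Python's glob rules may differ in detail from the engine's, so treat it as a sanity check rather than a guarantee.

```python
import glob
import os
import tempfile


def matching_files(pattern: str) -> list:
    """Return sorted paths matching `pattern`; raise if there are none."""
    # recursive=True lets "**" span nested folders in Python's glob.
    matches = sorted(glob.glob(pattern, recursive=True))
    if not matches:
        raise FileNotFoundError(
            f"no files match {pattern!r} (cwd: {os.getcwd()})"
        )
    return matches


# Demo on a throwaway folder with two empty placeholder files.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "plots"))
    for name in ("a.parquet", "b.parquet"):
        open(os.path.join(root, "plots", name), "w").close()

    found = matching_files(os.path.join(root, "plots", "*.parquet"))
    names = [os.path.basename(p) for p in found]

    try:
        # Misspelled folder ("plot"), so nothing matches.
        matching_files(os.path.join(root, "plot", "*.parquet"))
        typo_caught = False
    except FileNotFoundError:
        typo_caught = True
```

Including the working directory in the message covers the most common cause of an empty match: a relative pattern resolved from the wrong folder.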
Good to know
- Parquet + dpyr is the lazy combination: read() costs nothing, and filters and column selections are pushed into the file read itself.
- This is also the best format for remote data: over HTTP or S3, only the needed byte ranges are downloaded.