CSV & TSV
The format your collaborators email you. dpyr reads and writes both
comma-separated (.csv) and tab-separated (.tsv) files, plus their
gzip-compressed variants.
Reading
from dpyr import read, col
field = read("field_measurements.csv")
field.filter(col.height_m > 15).collect()
No options needed: the delimiter comes from the extension (.csv →
comma, .tsv → tab), the header row is detected, and column types are
inferred from the data. The second argument to read() doesn't apply
here — a CSV is always exactly one table.
Reading is a scan: the file isn't parsed until you collect(), and
only the columns your chain actually uses are read. A 2 GB CSV filtered
down to one species costs far less than 2 GB of work.
Compressed files
.csv.gz and .tsv.gz open exactly the same way:
read("survey_2019.csv.gz")
One difference: compressed files are decompressed up front rather than
scanned lazily, so very large .gz files use memory proportional to
their full size. If a file is big enough for that to hurt, convert it
once to parquet and work from that.
Writing
result.write("summary.csv")
result.write("summary.tsv")
Writing always includes a header row. There is no .csv.gz writer —
write a plain .csv, or better, a parquet file, which is
smaller than gzipped CSV and keeps the column types.
When things go wrong
- File doesn't exist — the error names the missing path. Check the
spelling and your working directory (
import os; os.getcwd()). - Columns read as the wrong type (e.g. an ID column of
00123codes becomes an integer) — type inference looks at the data, and ambiguous columns can guess wrong. Fix the type in the chain:read("f.csv").mutate(id=col.id.cast(str)). - A non-standard delimiter (semicolons, pipes) — dpyr trusts the
extension. For exotic files, parse with polars directly and hand the
result to dpyr:
read(pl.read_csv("odd.txt", separator=";")).
Good to know
- CSV stores no types — every read re-infers them, and dates often arrive as text. If you read the same file more than twice, write it to parquet once and read that instead: faster, smaller, and the types stick.
- On the duckdb backend,
.write("out.csv")runs inside the database engine (COPY ... TO), so big results never pass through Python.