Backends: polars and duckdb
Every dpyr chain builds the same logical plan + expression IR
(DESIGN §4); the backend is decided by where the data came
from. A polars-backed source compiles to a LazyFrame, a duckdb-backed
source compiles to SQL — and the same chain returns the same rows on both.
That promise is enforced by a Hypothesis fuzzer that runs random verb chains
on both engines and compares bit-for-bit, plus differential tests against
goldens produced by real dplyr (the oracle). You pick a backend for data
location and performance, never for behavior.
The polars backend
Anything that starts from in-process data or local files runs on polars:
read() on a dict, a polars or pandas dataframe, or a .parquet/.csv path.
import polars as pl
import dpyr
from dpyr import read, col, n, desc
sales = read({
"city": ["Hull", "Hull", "Aylmer", "Aylmer", "Wakefield", "Wakefield"],
"month": [1, 2, 1, 2, 1, 2],
"units": [40, 35, 21, 28, 12, None],
"price": [2.5, 2.5, 3.0, 3.0, 4.0, 4.0],
})
print(sales)
# dpyr dataframe · source: polars · showing 6 of 6 rows
shape: (6, 4)
┌───────────┬───────┬───────┬───────┐
│ city ┆ month ┆ units ┆ price │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 ┆ f64 │
╞═══════════╪═══════╪═══════╪═══════╡
│ Hull ┆ 1 ┆ 40 ┆ 2.5 │
│ Hull ┆ 2 ┆ 35 ┆ 2.5 │
│ Aylmer ┆ 1 ┆ 21 ┆ 3.0 │
│ Aylmer ┆ 2 ┆ 28 ┆ 3.0 │
│ Wakefield ┆ 1 ┆ 12 ┆ 4.0 │
│ Wakefield ┆ 2 ┆ null ┆ 4.0 │
└───────────┴───────┴───────┴───────┘
read("x.parquet") and read("x.csv") are scans, not loads — the file is
only read when something materializes, and only the columns/rows the plan
needs:
import tempfile, pathlib
tmp = pathlib.Path(tempfile.mkdtemp())
sales.write(str(tmp / "sales.parquet"))
sales_pq = read(str(tmp / "sales.parquet"))
print(sales_pq.schema)
{'city': Str, 'month': Int64, 'units': Int64, 'price': Float64}
A whole chain compiles to one polars LazyFrame, so polars' optimizer
sees everything at once — filters get pushed into the parquet scan, unused
columns are never decoded, and no intermediate dataframe is allocated between
verbs:
revenue = (
sales_pq
.filter(~col.units.is_na())
.mutate(revenue = col.units * col.price)
.group_by(col.city)
.summarize(total = col.revenue.sum())
.arrange(desc(col.total))
)
print(revenue)
# dpyr dataframe · source: polars · showing 3 of 3 rows
shape: (3, 2)
┌───────────┬───────┐
│ city ┆ total │
│ --- ┆ --- │
│ str ┆ f64 │
╞═══════════╪═══════╡
│ Hull ┆ 187.5 │
│ Aylmer ┆ 147.0 │
│ Wakefield ┆ 48.0 │
└───────────┴───────┘
The duckdb backend
For data that already lives in a database — or is too big to pull into
memory — point dpyr at a duckdb connection. duckdb.connect() gives an
in-memory database; duckdb.connect("warehouse.db") opens a file.
read(con, table) wraps an existing table, read(con).sql(query)
wraps an arbitrary query as a source:
import duckdb
con = duckdb.connect() # in-memory; pass a path for a persistent file
con.execute("""
CREATE TABLE deliveries (city VARCHAR, month INTEGER, km DOUBLE);
INSERT INTO deliveries VALUES
('Hull', 1, 12.0), ('Hull', 2, 15.5),
('Aylmer', 1, 30.2), ('Aylmer', 2, 28.9),
('Wakefield', 1, 55.0), ('Wakefield', 2, 51.3);
""")
deliveries = read(con, "deliveries")
by_city = (
deliveries
.group_by(col.city)
.summarize(trips = n(), total_km = col.km.sum())
.arrange(desc(col.total_km))
)
print(by_city)
feb = read(con).sql("SELECT city, km FROM deliveries WHERE month = 2")
print(feb.schema)
# dpyr dataframe · source: duckdb · showing 3 of 3 rows
shape: (3, 3)
┌───────────┬───────┬──────────┐
│ city ┆ trips ┆ total_km │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ f64 │
╞═══════════╪═══════╪══════════╡
│ Wakefield ┆ 2 ┆ 106.3 │
│ Aylmer ┆ 2 ┆ 59.1 │
│ Hull ┆ 2 ┆ 27.5 │
└───────────┴───────┴──────────┘
{'city': Str, 'km': Float64}
The verbs compile to a single SQL statement that duckdb executes entirely on
its side; only the result crosses back into Python. The generated SQL is an
implementation detail with no public API — it carries semantics shims
(NULLS-LAST stable sorts, BIGINT counts, floor-mod: SEMANTICS S3/S13/S23)
that make it noisier than hand-written SQL. Peeking via the internal
compiler entry point, by_city would send:
from dpyr.duckdb_backend import final_sql # private, may change
print(final_sql(by_city.plan))
SELECT *, row_number() OVER () AS "__rn1" FROM (SELECT "city", CAST(count(*) AS BIGINT) AS "trips", COALESCE(sum("km"), 0.0) AS "total_km" FROM (SELECT "city", "month", "km" FROM "deliveries") t GROUP BY "city") t ORDER BY "total_km" DESC NULLS LAST, "__rn1" ASC
One duckdb-side effect is observable: .persist() checkpoints a dataframe as
a TEMP TABLE on your connection (polars dataframes persist as in-memory
dataframes instead). Since 1.2.0 this runs as a single
CREATE TEMP TABLE ... AS <query> inside the engine — the rows never
pass through Python. Temp tables vanish when the connection closes:
snap = by_city.persist()
print(con.execute(
"SELECT table_name, temporary FROM duckdb_tables() ORDER BY table_name"
).fetchall())
[('deliveries', False), ('dpyr_persist_2', True)]
Landing results in the engine
collect() pulls rows out; these verbs leave them in. to_table()
materializes a chain as a real table (in-engine, no Python round trip) and
returns a dataframe bound to it; to_view() saves the lazy plan itself as a
named view — zero materialization, and any SQL client on that connection
can query it; write("x.parquet") compiles to an in-engine COPY ... TO:
top = by_city.filter(col.total_km > 100)
gold = top.to_table("gold_cities") # CREATE OR REPLACE TABLE ... AS <sql>
live = by_city.to_view("city_stats") # a saved query, always fresh
top.write("/tmp/gold_cities.parquet") # COPY (<sql>) TO '...' in-engine
print(sorted(t for t in con.execute(
"SELECT table_name FROM information_schema.tables").fetchall()))
Use show_query() whenever you want to see exactly what will reach the
engine — the SQL for duckdb plans, the optimized plan for polars ones:
print(top.show_query()[:80], "…")
In-memory dataframes can land too — give to_table() a connection
(to_table("name", con=con)), or skip connections entirely with
df.write(path, table), which creates the database file if needed.
Opening files the readr way
Two functions cover all file IO: read(path) and df.write(path), both
dispatching on the extension — every format from CSV to Excel to sqlite
has its own page in the reading guide. A duckdb file
needs a table name when writing, and opens as a catalog when reading:
gold.write("/tmp/shop.db", "gold_cities") # a table inside a duckdb file
db = read("/tmp/shop.db") # the whole catalog
print(db.tables)
city_frame = db.gold_cities # a lazy frame, schema known
one_table = read("/tmp/shop.db", "gold_cities") # shortcut
raw = db.sql("SELECT count(*) AS n FROM gold_cities") # escape hatch
Tab completion works on db. (table names come from the live catalog),
and a misspelled table gets a did-you-mean, just like columns do. Arrow
IPC files round-trip with write("x.arrow") / read("x.arrow") and are
memory-mapped on read, so opening even a huge file is instant. For a
fast structural look at any dataframe, glimpse() prints one line per column
with its dtype and leading values.
ML data: Hugging Face datasets, numpy, tensors
read() also ingests the machine-learning world. Hugging Face datasets
objects are Arrow tables under the hood — the same format dpyr lives on —
so they come in without conversion. A DatasetDict asks you to pick a
split (with a did-you-mean if you typo it). Bare numpy arrays become
columns; a 1-D array becomes a single value column:
import numpy as np
import datasets
embeddings = read(np.array([[0.1, 0.9], [0.4, 0.6], [0.8, 0.2]]))
print(embeddings.columns)
ds = datasets.Dataset.from_dict({
"text": ["great movie", "terrible plot", "fine I guess"],
"label": [1, 0, 1],
})
reviews = read(ds)
print(reviews.filter(col.label == 1).collect()["text"].to_list())
['column_0', 'column_1']
['great movie', 'fine I guess']
For datasets published on the Hub you often don't need the datasets
library at all: the Hub serves them as parquet, and both engines scan
hf://datasets/... parquet paths directly (set HF_TOKEN for gated
data), with the usual column/predicate pushdown.
Going the other way, to_numpy() / to_torch() / to_jax() collect a
dataframe into the array world (the latter two need torch or jax installed),
and read() accepts CPU torch tensors and jax arrays directly:
features = (reviews
.mutate(length = col.text.str_len())
.select(col.label, col.length)
.to_torch()) # torch.Tensor, shape (3, 2)
Both backends hand you polars
collect() returns a polars.DataFrame regardless of engine — polars is the
interchange format. to_pandas() converts from there (it needs pandas
installed: pip install 'dpyr[pandas]'):
print(type(revenue.collect()), type(by_city.collect()))
print(type(by_city.to_pandas()))
<class 'polars.dataframe.frame.DataFrame'> <class 'polars.dataframe.frame.DataFrame'>
<class 'pandas.DataFrame'>
The result cache (and when to clear it)
Results are cached keyed by plan hash, so re-displaying a dataframe in a notebook
costs nothing. The flip side: a duckdb table mutated outside dpyr has the
same plan hash, so cached results go stale. dpyr.cache_clear() drops the
cache; .persist() is the explicit "snapshot now" alternative:
print(len(deliveries)) # collects, caches
con.execute("INSERT INTO deliveries VALUES ('Chelsea', 2, 18.0)")
print(len(deliveries)) # cache hit: stale!
dpyr.cache_clear()
print(len(deliveries)) # recomputed
6
6
7
File-backed sources are not exempt. read() on a file path tags the
source with the file's path + mtime + size — but only once, at construction
(_file_token in dataframe.py). A dataframe you are already holding keeps that
original tag, so editing the file on disk does not change its plan hash:
it keeps returning the cached rows. cache_clear() doesn't rescue it
either — the captured scan pinned the old file's metadata, and collecting
now raises a polars ComputeError. To see the new contents, construct a
new source with read(path): the fresh mtime/size give it a
fresh plan hash that never collides with the stale entry. (This is why
re-running the notebook cell that creates the source picks up file edits —
that's reconstruction, not cache invalidation.)
pq = str(tmp / "sales.parquet")
held = read(pq)
print(len(held)) # collects and caches
sales.slice_head(n=2).write(pq) # rewrite the file: 2 rows now
print(len(held)) # same tag, same hash: stale
print(len(read(pq))) # new source, new tag: fresh
dpyr.cache_clear()
try:
len(held) # the held frame is unrecoverable
except pl.exceptions.ComputeError:
print("held frame: scan pinned the old file's metadata")
6
6
2
held frame: scan pinned the old file's metadata
Opting out of eager display
By default a dataframe's repr collects a preview (display-eager, DESIGN §3).
.lazy() returns a dataframe that never executes implicitly — only .collect()
(or another explicit export) runs it. .eager() flips it back, and
dpyr.options.interactive = False turns implicit execution off globally for
production pipelines:
prod = by_city.lazy()
print(prod) # schema only, nothing executed
fresh = prod.collect() # explicit is fine (now includes Chelsea: 4 rows)
dpyr.options.interactive = False # same effect, process-wide
print(by_city)
dpyr.options.interactive = True
# dpyr dataframe · source: duckdb (lazy)
# columns: city <Str>, trips <Int64>, total_km <Float64>
# dpyr dataframe · source: duckdb (lazy)
# columns: city <Str>, trips <Int64>, total_km <Float64>
One exception to pushdown: pivot_wider
pivot_wider's output columns come from data values, so its schema can't be
known from metadata alone. On either backend, dpyr implicitly persists the
input, then pivots with polars. It works the same on a duckdb dataframe — just
know that this step runs in-process, not in the database:
wide = deliveries.pivot_wider(names_from=col.month, values_from=col.km)
print(wide)
# dpyr dataframe · source: duckdb · showing 4 of 4 rows
shape: (4, 3)
┌───────────┬──────┬──────┐
│ city ┆ 1 ┆ 2 │
│ --- ┆ --- ┆ --- │
│ str ┆ f64 ┆ f64 │
╞═══════════╪══════╪══════╡
│ Hull ┆ 12.0 ┆ 15.5 │
│ Aylmer ┆ 30.2 ┆ 28.9 │
│ Wakefield ┆ 55.0 ┆ 51.3 │
│ Chelsea ┆ null ┆ 18.0 │
└───────────┴──────┴──────┘
Duplicate keys warn and keep the first value (see SEMANTICS S26).
Mixing engines: the bridge
A plan that touches both an in-memory dataframe and a duckdb table runs inside duckdb — and the in-memory side travels for free. duckdb scans the dataframe's Arrow data in place (zero copy), so joining your RAM against a warehouse table is just a join (SEMANTICS S34):
mixed = sales.inner_join(deliveries, on=[col.city, col.month]).arrange(col.city)
print(mixed.collect())
shape: (6, 5)
┌───────────┬───────┬───────┬───────┬──────┐
│ city ┆ month ┆ units ┆ price ┆ km │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 ┆ f64 ┆ f64 │
╞═══════════╪═══════╪═══════╪═══════╪══════╡
│ Aylmer ┆ 1 ┆ 21 ┆ 3.0 ┆ 30.2 │
│ Aylmer ┆ 2 ┆ 28 ┆ 3.0 ┆ 28.9 │
│ Hull ┆ 1 ┆ 40 ┆ 2.5 ┆ 12.0 │
│ Hull ┆ 2 ┆ 35 ┆ 2.5 ┆ 15.5 │
│ Wakefield ┆ 1 ┆ 12 ┆ 4.0 ┆ 55.0 │
│ Wakefield ┆ 2 ┆ null ┆ 4.0 ┆ 51.3 │
└───────────┴───────┴───────┴───────┴──────┘
Engine choice is automatic — duckdb whenever a duckdb table is involved,
polars otherwise — and collect(engine=...) overrides it. The one
asymmetry: duckdb can read our RAM, but polars cannot reach into a duckdb
catalog, so engine="polars" on duckdb-resident data raises:
try:
deliveries.collect(engine="polars")
except dpyr.DpyrError as e:
print(e)
engine='polars' cannot read duckdb-resident tables; duckdb can read in-memory frames, not vice versa
Even tables from different connections meet (since 1.7.0): the foreign
side streams through Arrow onto the plan's primary connection. Because
that stream copies the foreign table's rows, dpyr emits a warning so
large-table copies never happen invisibly — land big tables once with
to_table("name", con=...) instead (SEMANTICS S27):
import warnings
con2 = duckdb.connect()
con2.execute(
"CREATE TABLE depots AS SELECT * FROM (VALUES ('Hull', 4), ('Aylmer', 2)) t(city, docks)"
)
with warnings.catch_warnings():
warnings.simplefilter("ignore") # "streaming ... through arrow"
cross = (deliveries.inner_join(read(con2, "depots"), on=col.city)
.arrange(col.city, col.month))
print(cross.collect().columns)
['city', 'month', 'km', 'docks']
This is what makes joins work from anything to anything: same-engine
pairs are free, RAM bridges into duckdb zero-copy, and cross-connection
pairs (a second .db file, a sqlite file, separate in-memory databases)
cost one visible streamed copy.
One dtype system
dpyr pins six data dtypes (plus Null for all-missing literals), and both
backends are normalized to them on ingest:
| dpyr dtype | from polars | from duckdb |
|---|---|---|
Int64 |
all int widths cast up (Int8…UInt64) |
TINYINT…HUGEINT, unsigned ints |
Float64 |
Float32 cast up |
FLOAT, DOUBLE, DECIMAL(…) |
Bool |
Boolean |
BOOLEAN |
Str |
String |
VARCHAR |
Date |
Date |
DATE |
Datetime |
cast to microsecond precision | TIMESTAMP |
odd = read(pl.DataFrame({
"tiny": pl.Series([1, 2], dtype=pl.Int8),
"f32": pl.Series([1.5, 2.5], dtype=pl.Float32),
}))
print(odd.schema)
print(deliveries.schema) # INTEGER and DOUBLE arrived as Int64/Float64 too
{'tiny': Int64, 'f32': Float64}
{'city': Str, 'month': Int64, 'km': Float64}
Columns outside this set are rejected up front rather than half-supported —
read(con, table) on a table with a BLOB or nested column raises
DpyrError: column 'payload' has unsupported duckdb type BLOB. Int64 is the
one integer type (S5), counts are Int64 (S13), int / int gives Float64
(S4), and datetimes are microsecond-precision (S16).
Which backend?
- polars — data already in memory, parquet/CSV files that fit on one machine, tight notebook loops. Lowest overhead per query.
- duckdb — data already in a
.dbfile or larger than RAM, sources you want to define in SQL (read(con).sql(...)), or pipelines where the heavy lifting should stay inside the database and only summaries come back.
Since both return polars dataframes and obey the same semantics, switching is a one-line change at the source constructor — the chain below it stays identical. The few places where engine behavior genuinely differs (and what dpyr does about each) are cataloged row by row in SEMANTICS.md.