Skip to content

Backends: polars and duckdb

Every dpyr chain builds the same logical plan + expression IR (DESIGN §4); the backend is decided by where the data came from. A polars-backed source compiles to a LazyFrame, a duckdb-backed source compiles to SQL — and the same chain returns the same rows on both. That promise is enforced by a Hypothesis fuzzer that runs random verb chains on both engines and compares bit-for-bit, plus differential tests against goldens produced by real dplyr (the oracle). You pick a backend for data location and performance, never for behavior.

The polars backend

Anything that starts from in-process data or local files runs on polars: read() on a dict, a polars or pandas dataframe, or a .parquet/.csv path.

import polars as pl
import dpyr
from dpyr import read, col, n, desc

sales = read({
    "city":  ["Hull", "Hull", "Aylmer", "Aylmer", "Wakefield", "Wakefield"],
    "month": [1, 2, 1, 2, 1, 2],
    "units": [40, 35, 21, 28, 12, None],
    "price": [2.5, 2.5, 3.0, 3.0, 4.0, 4.0],
})
print(sales)
# dpyr dataframe · source: polars · showing 6 of 6 rows
shape: (6, 4)
┌───────────┬───────┬───────┬───────┐
│ city      ┆ month ┆ units ┆ price │
│ ---       ┆ ---   ┆ ---   ┆ ---   │
│ str       ┆ i64   ┆ i64   ┆ f64   │
╞═══════════╪═══════╪═══════╪═══════╡
│ Hull      ┆ 1     ┆ 40    ┆ 2.5   │
│ Hull      ┆ 2     ┆ 35    ┆ 2.5   │
│ Aylmer    ┆ 1     ┆ 21    ┆ 3.0   │
│ Aylmer    ┆ 2     ┆ 28    ┆ 3.0   │
│ Wakefield ┆ 1     ┆ 12    ┆ 4.0   │
│ Wakefield ┆ 2     ┆ null  ┆ 4.0   │
└───────────┴───────┴───────┴───────┘

read("x.parquet") and read("x.csv") are scans, not loads — the file is only read when something materializes, and only the columns/rows the plan needs:

import tempfile, pathlib

tmp = pathlib.Path(tempfile.mkdtemp())
sales.write(str(tmp / "sales.parquet"))

sales_pq = read(str(tmp / "sales.parquet"))
print(sales_pq.schema)
{'city': Str, 'month': Int64, 'units': Int64, 'price': Float64}

A whole chain compiles to one polars LazyFrame, so polars' optimizer sees everything at once — filters get pushed into the parquet scan, unused columns are never decoded, and no intermediate dataframe is allocated between verbs:

revenue = (
    sales_pq
    .filter(~col.units.is_na())
    .mutate(revenue = col.units * col.price)
    .group_by(col.city)
    .summarize(total = col.revenue.sum())
    .arrange(desc(col.total))
)
print(revenue)
# dpyr dataframe · source: polars · showing 3 of 3 rows
shape: (3, 2)
┌───────────┬───────┐
│ city      ┆ total │
│ ---       ┆ ---   │
│ str       ┆ f64   │
╞═══════════╪═══════╡
│ Hull      ┆ 187.5 │
│ Aylmer    ┆ 147.0 │
│ Wakefield ┆ 48.0  │
└───────────┴───────┘

The duckdb backend

For data that already lives in a database — or is too big to pull into memory — point dpyr at a duckdb connection. duckdb.connect() gives an in-memory database; duckdb.connect("warehouse.db") opens a file. read(con, table) wraps an existing table, read(con).sql(query) wraps an arbitrary query as a source:

import duckdb

con = duckdb.connect()   # in-memory; pass a path for a persistent file
con.execute("""
    CREATE TABLE deliveries (city VARCHAR, month INTEGER, km DOUBLE);
    INSERT INTO deliveries VALUES
        ('Hull', 1, 12.0),      ('Hull', 2, 15.5),
        ('Aylmer', 1, 30.2),    ('Aylmer', 2, 28.9),
        ('Wakefield', 1, 55.0), ('Wakefield', 2, 51.3);
""")
deliveries = read(con, "deliveries")

by_city = (
    deliveries
    .group_by(col.city)
    .summarize(trips = n(), total_km = col.km.sum())
    .arrange(desc(col.total_km))
)
print(by_city)

feb = read(con).sql("SELECT city, km FROM deliveries WHERE month = 2")
print(feb.schema)
# dpyr dataframe · source: duckdb · showing 3 of 3 rows
shape: (3, 3)
┌───────────┬───────┬──────────┐
│ city      ┆ trips ┆ total_km │
│ ---       ┆ ---   ┆ ---      │
│ str       ┆ i64   ┆ f64      │
╞═══════════╪═══════╪══════════╡
│ Wakefield ┆ 2     ┆ 106.3    │
│ Aylmer    ┆ 2     ┆ 59.1     │
│ Hull      ┆ 2     ┆ 27.5     │
└───────────┴───────┴──────────┘
{'city': Str, 'km': Float64}

The verbs compile to a single SQL statement that duckdb executes entirely on its side; only the result crosses back into Python. The generated SQL is an implementation detail with no public API — it carries semantics shims (NULLS-LAST stable sorts, BIGINT counts, floor-mod: SEMANTICS S3/S13/S23) that make it noisier than hand-written SQL. Peeking via the internal compiler entry point, by_city would send:

from dpyr.duckdb_backend import final_sql   # private, may change
print(final_sql(by_city.plan))
SELECT *, row_number() OVER () AS "__rn1" FROM (SELECT "city", CAST(count(*) AS BIGINT) AS "trips", COALESCE(sum("km"), 0.0) AS "total_km" FROM (SELECT "city", "month", "km" FROM "deliveries") t GROUP BY "city") t ORDER BY "total_km" DESC NULLS LAST, "__rn1" ASC

One duckdb-side effect is observable: .persist() checkpoints a dataframe as a TEMP TABLE on your connection (polars dataframes persist as in-memory dataframes instead). Since 1.2.0 this runs as a single CREATE TEMP TABLE ... AS <query> inside the engine — the rows never pass through Python. Temp tables vanish when the connection closes:

snap = by_city.persist()
print(con.execute(
    "SELECT table_name, temporary FROM duckdb_tables() ORDER BY table_name"
).fetchall())
[('deliveries', False), ('dpyr_persist_2', True)]

Landing results in the engine

collect() pulls rows out; these verbs leave them in. to_table() materializes a chain as a real table (in-engine, no Python round trip) and returns a dataframe bound to it; to_view() saves the lazy plan itself as a named view — zero materialization, and any SQL client on that connection can query it; write("x.parquet") compiles to an in-engine COPY ... TO:

top = by_city.filter(col.total_km > 100)

gold = top.to_table("gold_cities")          # CREATE OR REPLACE TABLE ... AS <sql>
live = by_city.to_view("city_stats")        # a saved query, always fresh
top.write("/tmp/gold_cities.parquet")       # COPY (<sql>) TO '...' in-engine

print(sorted(t for t in con.execute(
    "SELECT table_name FROM information_schema.tables").fetchall()))

Use show_query() whenever you want to see exactly what will reach the engine — the SQL for duckdb plans, the optimized plan for polars ones:

print(top.show_query()[:80], "…")

In-memory dataframes can land too — give to_table() a connection (to_table("name", con=con)), or skip connections entirely with df.write(path, table), which creates the database file if needed.

Opening files the readr way

Two functions cover all file IO: read(path) and df.write(path), both dispatching on the extension — every format from CSV to Excel to sqlite has its own page in the reading guide. A duckdb file needs a table name when writing, and opens as a catalog when reading:

gold.write("/tmp/shop.db", "gold_cities")  # a table inside a duckdb file

db = read("/tmp/shop.db")                  # the whole catalog
print(db.tables)
city_frame = db.gold_cities                # a lazy frame, schema known
one_table  = read("/tmp/shop.db", "gold_cities")              # shortcut
raw        = db.sql("SELECT count(*) AS n FROM gold_cities")  # escape hatch

Tab completion works on db. (table names come from the live catalog), and a misspelled table gets a did-you-mean, just like columns do. Arrow IPC files round-trip with write("x.arrow") / read("x.arrow") and are memory-mapped on read, so opening even a huge file is instant. For a fast structural look at any dataframe, glimpse() prints one line per column with its dtype and leading values.

ML data: Hugging Face datasets, numpy, tensors

read() also ingests the machine-learning world. Hugging Face datasets objects are Arrow tables under the hood — the same format dpyr lives on — so they come in without conversion. A DatasetDict asks you to pick a split (with a did-you-mean if you typo it). Bare numpy arrays become columns; a 1-D array becomes a single value column:

import numpy as np
import datasets

embeddings = read(np.array([[0.1, 0.9], [0.4, 0.6], [0.8, 0.2]]))
print(embeddings.columns)

ds = datasets.Dataset.from_dict({
    "text":  ["great movie", "terrible plot", "fine I guess"],
    "label": [1, 0, 1],
})
reviews = read(ds)
print(reviews.filter(col.label == 1).collect()["text"].to_list())
['column_0', 'column_1']
['great movie', 'fine I guess']

For datasets published on the Hub you often don't need the datasets library at all: the Hub serves them as parquet, and both engines scan hf://datasets/... parquet paths directly (set HF_TOKEN for gated data), with the usual column/predicate pushdown.

Going the other way, to_numpy() / to_torch() / to_jax() collect a dataframe into the array world (the latter two need torch or jax installed), and read() accepts CPU torch tensors and jax arrays directly:

features = (reviews
            .mutate(length = col.text.str_len())
            .select(col.label, col.length)
            .to_torch())          # torch.Tensor, shape (3, 2)

Both backends hand you polars

collect() returns a polars.DataFrame regardless of engine — polars is the interchange format. to_pandas() converts from there (it needs pandas installed: pip install 'dpyr[pandas]'):

print(type(revenue.collect()), type(by_city.collect()))
print(type(by_city.to_pandas()))
<class 'polars.dataframe.frame.DataFrame'> <class 'polars.dataframe.frame.DataFrame'>
<class 'pandas.DataFrame'>

The result cache (and when to clear it)

Results are cached keyed by plan hash, so re-displaying a dataframe in a notebook costs nothing. The flip side: a duckdb table mutated outside dpyr has the same plan hash, so cached results go stale. dpyr.cache_clear() drops the cache; .persist() is the explicit "snapshot now" alternative:

print(len(deliveries))                                      # collects, caches
con.execute("INSERT INTO deliveries VALUES ('Chelsea', 2, 18.0)")
print(len(deliveries))                                      # cache hit: stale!
dpyr.cache_clear()
print(len(deliveries))                                      # recomputed
6
6
7

File-backed sources are not exempt. read() on a file path tags the source with the file's path + mtime + size — but only once, at construction (_file_token in dataframe.py). A dataframe you are already holding keeps that original tag, so editing the file on disk does not change its plan hash: it keeps returning the cached rows. cache_clear() doesn't rescue it either — the captured scan pinned the old file's metadata, and collecting now raises a polars ComputeError. To see the new contents, construct a new source with read(path): the fresh mtime/size give it a fresh plan hash that never collides with the stale entry. (This is why re-running the notebook cell that creates the source picks up file edits — that's reconstruction, not cache invalidation.)

pq = str(tmp / "sales.parquet")
held = read(pq)
print(len(held))                            # collects and caches
sales.slice_head(n=2).write(pq)             # rewrite the file: 2 rows now
print(len(held))                            # same tag, same hash: stale
print(len(read(pq)))                        # new source, new tag: fresh

dpyr.cache_clear()
try:
    len(held)                               # the held frame is unrecoverable
except pl.exceptions.ComputeError:
    print("held frame: scan pinned the old file's metadata")
6
6
2
held frame: scan pinned the old file's metadata

Opting out of eager display

By default a dataframe's repr collects a preview (display-eager, DESIGN §3). .lazy() returns a dataframe that never executes implicitly — only .collect() (or another explicit export) runs it. .eager() flips it back, and dpyr.options.interactive = False turns implicit execution off globally for production pipelines:

prod = by_city.lazy()
print(prod)            # schema only, nothing executed
fresh = prod.collect() # explicit is fine (now includes Chelsea: 4 rows)

dpyr.options.interactive = False   # same effect, process-wide
print(by_city)
dpyr.options.interactive = True
# dpyr dataframe · source: duckdb (lazy)
# columns: city <Str>, trips <Int64>, total_km <Float64>
# dpyr dataframe · source: duckdb (lazy)
# columns: city <Str>, trips <Int64>, total_km <Float64>

One exception to pushdown: pivot_wider

pivot_wider's output columns come from data values, so its schema can't be known from metadata alone. On either backend, dpyr implicitly persists the input, then pivots with polars. It works the same on a duckdb dataframe — just know that this step runs in-process, not in the database:

wide = deliveries.pivot_wider(names_from=col.month, values_from=col.km)
print(wide)
# dpyr dataframe · source: duckdb · showing 4 of 4 rows
shape: (4, 3)
┌───────────┬──────┬──────┐
│ city      ┆ 1    ┆ 2    │
│ ---       ┆ ---  ┆ ---  │
│ str       ┆ f64  ┆ f64  │
╞═══════════╪══════╪══════╡
│ Hull      ┆ 12.0 ┆ 15.5 │
│ Aylmer    ┆ 30.2 ┆ 28.9 │
│ Wakefield ┆ 55.0 ┆ 51.3 │
│ Chelsea   ┆ null ┆ 18.0 │
└───────────┴──────┴──────┘

Duplicate keys warn and keep the first value (see SEMANTICS S26).

Mixing engines: the bridge

A plan that touches both an in-memory dataframe and a duckdb table runs inside duckdb — and the in-memory side travels for free. duckdb scans the dataframe's Arrow data in place (zero copy), so joining your RAM against a warehouse table is just a join (SEMANTICS S34):

mixed = sales.inner_join(deliveries, on=[col.city, col.month]).arrange(col.city)
print(mixed.collect())
shape: (6, 5)
┌───────────┬───────┬───────┬───────┬──────┐
│ city      ┆ month ┆ units ┆ price ┆ km   │
│ ---       ┆ ---   ┆ ---   ┆ ---   ┆ ---  │
│ str       ┆ i64   ┆ i64   ┆ f64   ┆ f64  │
╞═══════════╪═══════╪═══════╪═══════╪══════╡
│ Aylmer    ┆ 1     ┆ 21    ┆ 3.0   ┆ 30.2 │
│ Aylmer    ┆ 2     ┆ 28    ┆ 3.0   ┆ 28.9 │
│ Hull      ┆ 1     ┆ 40    ┆ 2.5   ┆ 12.0 │
│ Hull      ┆ 2     ┆ 35    ┆ 2.5   ┆ 15.5 │
│ Wakefield ┆ 1     ┆ 12    ┆ 4.0   ┆ 55.0 │
│ Wakefield ┆ 2     ┆ null  ┆ 4.0   ┆ 51.3 │
└───────────┴───────┴───────┴───────┴──────┘

Engine choice is automatic — duckdb whenever a duckdb table is involved, polars otherwise — and collect(engine=...) overrides it. The one asymmetry: duckdb can read our RAM, but polars cannot reach into a duckdb catalog, so engine="polars" on duckdb-resident data raises:

try:
    deliveries.collect(engine="polars")
except dpyr.DpyrError as e:
    print(e)
engine='polars' cannot read duckdb-resident tables; duckdb can read in-memory frames, not vice versa

Even tables from different connections meet (since 1.7.0): the foreign side streams through Arrow onto the plan's primary connection. Because that stream copies the foreign table's rows, dpyr emits a warning so large-table copies never happen invisibly — land big tables once with to_table("name", con=...) instead (SEMANTICS S27):

import warnings

con2 = duckdb.connect()
con2.execute(
    "CREATE TABLE depots AS SELECT * FROM (VALUES ('Hull', 4), ('Aylmer', 2)) t(city, docks)"
)
with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # "streaming ... through arrow"
    cross = (deliveries.inner_join(read(con2, "depots"), on=col.city)
             .arrange(col.city, col.month))
print(cross.collect().columns)
['city', 'month', 'km', 'docks']

This is what makes joins work from anything to anything: same-engine pairs are free, RAM bridges into duckdb zero-copy, and cross-connection pairs (a second .db file, a sqlite file, separate in-memory databases) cost one visible streamed copy.

One dtype system

dpyr pins six data dtypes (plus Null for all-missing literals), and both backends are normalized to them on ingest:

dpyr dtype from polars from duckdb
Int64 all int widths cast up (Int8UInt64) TINYINTHUGEINT, unsigned ints
Float64 Float32 cast up FLOAT, DOUBLE, DECIMAL(…)
Bool Boolean BOOLEAN
Str String VARCHAR
Date Date DATE
Datetime cast to microsecond precision TIMESTAMP
odd = read(pl.DataFrame({
    "tiny": pl.Series([1, 2], dtype=pl.Int8),
    "f32":  pl.Series([1.5, 2.5], dtype=pl.Float32),
}))
print(odd.schema)
print(deliveries.schema)   # INTEGER and DOUBLE arrived as Int64/Float64 too
{'tiny': Int64, 'f32': Float64}
{'city': Str, 'month': Int64, 'km': Float64}

Columns outside this set are rejected up front rather than half-supported — read(con, table) on a table with a BLOB or nested column raises DpyrError: column 'payload' has unsupported duckdb type BLOB. Int64 is the one integer type (S5), counts are Int64 (S13), int / int gives Float64 (S4), and datetimes are microsecond-precision (S16).

Which backend?

  • polars — data already in memory, parquet/CSV files that fit on one machine, tight notebook loops. Lowest overhead per query.
  • duckdb — data already in a .db file or larger than RAM, sources you want to define in SQL (read(con).sql(...)), or pipelines where the heavy lifting should stay inside the database and only summaries come back.

Since both return polars dataframes and obey the same semantics, switching is a one-line change at the source constructor — the chain below it stays identical. The few places where engine behavior genuinely differs (and what dpyr does about each) are cataloged row by row in SEMANTICS.md.