Expressions and autocompletion
Everything you pass to a verb — col.kg > 5, col.yield_kg.mean(),
if_else(...) — is an expression: a small immutable tree describing a
computation. Nothing runs when you build one. The dataframe validates it against
its schema the moment a verb receives it (so mistakes surface instantly), and
the work happens later, on polars or duckdb, when the dataframe materializes.
This page tours the expression toolkit and the three tiers of autocompletion
built on it: col, df.c, and dpyr stubgen.
col describes, the dataframe executes
col.<name> is a free-floating column reference — dpyr's counterpart of
pl.col("name") — whose operators and methods keep growing the tree. Print
one and you see the IR, not data:
from dpyr import col
bmi = (col.mass / (col.height / 100) ** 2).round(1)
print(repr(bmi))
round((col.mass / pow((col.height / lit(100)), lit(2))), lit(1))
That repr is canonical: dpyr hashes it to fingerprint plans and cache
results. Plain Python values (100, "kale", date(...)) become literals
automatically, and the same bmi object works on any dataframe with mass and
height columns, on either backend.
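To make "describes, not executes" concrete, here is a minimal sketch of a deferred expression tree in plain Python. It is not dpyr's implementation: `ToyExpr`, `ToyCol`, `ToyLit`, `ToyOp`, `wrap` and `evaluate` are hypothetical names, and the repr only imitates the style printed above.

```python
from dataclasses import dataclass


def wrap(value):
    """Turn plain Python values into literal nodes; leave existing nodes alone."""
    return value if isinstance(value, ToyExpr) else ToyLit(value)


@dataclass(frozen=True)
class ToyExpr:
    """Base node: operators build a bigger tree and compute nothing."""

    def __truediv__(self, other):
        return ToyOp("/", self, wrap(other))

    def __pow__(self, other):
        return ToyOp("pow", self, wrap(other))


@dataclass(frozen=True)
class ToyCol(ToyExpr):
    name: str

    def __repr__(self):
        return f"col.{self.name}"


@dataclass(frozen=True)
class ToyLit(ToyExpr):
    value: object

    def __repr__(self):
        return f"lit({self.value!r})"


@dataclass(frozen=True)
class ToyOp(ToyExpr):
    op: str
    left: ToyExpr
    right: ToyExpr

    def __repr__(self):
        if self.op == "pow":
            return f"pow({self.left!r}, {self.right!r})"
        return f"({self.left!r} {self.op} {self.right!r})"


def evaluate(expr, row):
    """Walk the tree against one row (a dict); this is the only place work happens."""
    if isinstance(expr, ToyCol):
        return row[expr.name]
    if isinstance(expr, ToyLit):
        return expr.value
    left, right = evaluate(expr.left, row), evaluate(expr.right, row)
    return left / right if expr.op == "/" else left ** right


bmi = ToyCol("mass") / (ToyCol("height") / 100) ** 2
print(repr(bmi))  # (col.mass / pow((col.height / lit(100)), lit(2)))
print(evaluate(bmi, {"mass": 70, "height": 175}))
```

Because the nodes are frozen dataclasses, two trees built the same way compare equal and hash alike, which is the property a plan fingerprint relies on.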
&, |, ~ — not and, or, not
Python's keyword operators force their operands through bool(), and a
description of a computation has no truth value. dpyr makes the failure loud
and names the fix:
try:
    col.kg > 5 and col.organic
except TypeError as e:
    print(f"{type(e).__name__}: {e}")
both = (col.kg > 5) & col.organic      # AND
either = (col.kg > 5) | ~col.organic   # OR, NOT
ExprTypeError: a dpyr expression is not a Python boolean. Use & | ~ instead of and/or/not, and is_in() instead of `in`.
Parenthesize comparisons next to &/| — the bitwise operators bind
tighter, so col.kg > 5 & col.organic parses as col.kg > (5 & col.organic).
And use .is_in([...]) where you'd reach for Python's in, which also
routes through bool() and hits the same error.
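Both claims above can be checked with the standard library alone. The sketch below uses a hypothetical `ToyExpr` class (not dpyr's) to show `and` calling `bool()`, then asks Python's own parser how it groups the unparenthesized comparison.

```python
import ast


class ToyExpr:
    """Hypothetical expression node with no truth value."""

    def __bool__(self):
        raise TypeError("a toy expression is not a Python boolean; use & | ~")


try:
    ToyExpr() and ToyExpr()  # `and` calls bool() on its left operand
except TypeError as err:
    print(f"{type(err).__name__}: {err}")

# Parse both spellings without evaluating them (ast.unparse needs Python 3.9+).
loose = ast.parse("kg > 5 & organic", mode="eval").body
tight = ast.parse("(kg > 5) & organic", mode="eval").body

# The loose form is a comparison whose right-hand side is `5 & organic`.
print(type(loose).__name__, "with right side:", ast.unparse(loose.comparators[0]))
# The parenthesized form is a bitwise AND whose left-hand side is `kg > 5`.
print(type(tight).__name__, "with left side:", ast.unparse(tight.left))
```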
A dataframe to play with
from datetime import date
from dpyr import read
plants = read({
    "plant": ["Roma tomato", "Cherry tomato", "Basil", "Squash", "Pepper"],
    "family": ["nightshade", "nightshade", "herb", "cucurbit", "nightshade"],
    "sown": [date(2026, 3, 14), date(2026, 3, 20), date(2026, 4, 2),
             date(2026, 5, 1), date(2026, 3, 18)],
    "rows": [4, 6, 2, 3, 5],
    "yield_kg": [41.5, 33.2, None, float("nan"), 12.9],
})
print(plants)
# dpyr dataframe · source: polars · showing 5 of 5 rows
shape: (5, 5)
┌───────────────┬────────────┬────────────┬──────┬──────────┐
│ plant ┆ family ┆ sown ┆ rows ┆ yield_kg │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ date ┆ i64 ┆ f64 │
╞═══════════════╪════════════╪════════════╪══════╪══════════╡
│ Roma tomato ┆ nightshade ┆ 2026-03-14 ┆ 4 ┆ 41.5 │
│ Cherry tomato ┆ nightshade ┆ 2026-03-20 ┆ 6 ┆ 33.2 │
│ Basil ┆ herb ┆ 2026-04-02 ┆ 2 ┆ null │
│ Squash ┆ cucurbit ┆ 2026-05-01 ┆ 3 ┆ NaN │
│ Pepper ┆ nightshade ┆ 2026-03-18 ┆ 5 ┆ 12.9 │
└───────────────┴────────────┴────────────┴──────┴──────────┘
yield_kg carries both a null (a missing value) and a NaN (a real float,
the not-a-number value). dpyr keeps the two distinct (SEMANTICS S1) — they
behave differently below.
Methods follow the column's type
| Works on | Per-row | Aggregating |
|---|---|---|
| numeric | .abs() .round(digits) .floor() .ceiling() .log() .exp() .sqrt() | .mean() .median() .sum() .std() .var() |
| string | .str_detect(pat) .str_replace(pat, repl) .str_to_lower() .str_to_upper() .str_len() | |
| date / datetime | .year() .month() .day() | |
| any | .is_na() .is_in(values) .between(lo, hi) .cast(dtype) | .min() .max() .first() .last() .n_unique() |
Aggregates skip missing values by default; pass na_rm=False to propagate
them instead (SEMANTICS S2). String patterns are regular expressions on both
backends, and str_replace rewrites the first match (stringr-style, unlike
Python's replace-all str.replace). Temporal accessors return integers. A
quick pass over each family — note how the null and NaN rows flow through
arithmetic untouched:
print(plants.mutate(per_row = (col.yield_kg / col.rows).round(2)).pull(col.per_row))
print(plants.filter(col.plant.str_detect("tomato")).pull(col.plant))
print(plants.mutate(p = col.plant.str_replace(" tomato", "")).pull(col.p))
print(plants.filter(col.sown.month() == 3).pull(col.plant))
[10.38, 5.53, None, nan, 2.58]
['Roma tomato', 'Cherry tomato']
['Roma', 'Cherry', 'Basil', 'Squash', 'Pepper']
['Roma tomato', 'Cherry tomato', 'Pepper']
(.pull(col.x) collects one column as a Python list — handy for compact
output here.)
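The first-match rule for str_replace, and the fact that patterns are regular expressions, are easy to get wrong when coming from Python strings. The sketch below uses Python's `re` module as a stand-in to illustrate both points; it does not call dpyr, and the backends' regex dialects may differ from `re` in finer details.

```python
import re

label = "tomato, tomato, basil"

# First match only, which is the stringr-style behavior described above.
print(re.sub("tomato", "kale", label, count=1))  # kale, tomato, basil

# Python's str.replace rewrites every occurrence instead.
print(label.replace("tomato", "kale"))           # kale, kale, basil

# Patterns are regexes, so an unescaped "." matches any character.
print(re.sub(".", "-", "3.5 kg", count=1))             # -.5 kg
print(re.sub(re.escape("."), "-", "3.5 kg", count=1))  # 3-5 kg
```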
Missing values, membership, ranges, casts
.is_na() uses R's definition of "missing": true for null and for
NaN on float columns (SEMANTICS S1), on both backends — so it catches both
oddball rows:
print(plants.filter(col.yield_kg.is_na()).pull(col.plant))
['Basil', 'Squash']
Outside of .is_na(), though, NaN is an ordinary float that compares
greater than every number on both engines, while comparisons against null
yield null, which filter drops (SEMANTICS S12). A threshold filter
therefore keeps the NaN row and silently sheds the null one:
print(plants.filter(col.yield_kg > 30).pull(col.plant))
['Roma tomato', 'Cherry tomato', 'Squash']
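The comparison rules just described can be modeled in a few lines of plain Python. This is only a sketch of the stated semantics, with hypothetical helpers `toy_gt` and `toy_is_na`. Note that bare Python evaluates `float("nan") > 30` as `False`, so the engines' NaN ordering has to be modeled explicitly here.

```python
import math


def toy_gt(value, threshold):
    """Model of the comparison above: null in, null out; NaN sorts above every number."""
    if value is None:
        return None
    if math.isnan(value):
        return True
    return value > threshold


def toy_is_na(value):
    """Model of .is_na(): true for null and for a float NaN."""
    return value is None or (isinstance(value, float) and math.isnan(value))


plants = ["Roma tomato", "Cherry tomato", "Basil", "Squash", "Pepper"]
yields = [41.5, 33.2, None, float("nan"), 12.9]

# A filter keeps a row only when the condition is true, so null results drop out.
kept = [p for p, y in zip(plants, yields) if toy_gt(y, 30) is True]
print(kept)  # ['Roma tomato', 'Cherry tomato', 'Squash']

# Guarding with the missing-value test removes the NaN row as well.
guarded = [p for p, y in zip(plants, yields)
           if not toy_is_na(y) and toy_gt(y, 30) is True]
print(guarded)  # ['Roma tomato', 'Cherry tomato']
```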
If your floats may contain NaN, combine the threshold with a missing-value
guard, e.g. (col.yield_kg > 30) & ~col.yield_kg.is_na(). The remaining utilities:
from dpyr import FLOAT64
print(plants.filter(col.family.is_in(["herb", "cucurbit"])).pull(col.plant))
print(plants.filter(col.rows.between(3, 5)).pull(col.plant)) # inclusive ends
print(plants.mutate(rows = col.rows.cast(FLOAT64)).schema["rows"])
['Basil', 'Squash']
['Roma tomato', 'Squash', 'Pepper']
Float64
.is_in() on a missing value returns null rather than R's FALSE
(SEMANTICS S24), and .cast() takes the dtype constants dpyr exports:
INT64, FLOAT64, BOOL, STR, DATE, DATETIME.
Conditionals: if_else, case_when, coalesce, replace_na
case_when takes (condition, value) pairs, first match wins, and
default= covers the rest (no match and no default gives a missing value,
SEMANTICS S15). Branch dtypes must unify — mixing strings and ints across
branches is a build-time ExprTypeError, not a runtime surprise.
from dpyr import if_else, case_when, coalesce, replace_na
graded = plants.mutate(
    scale = if_else(col.rows >= 5, "big", "small"),
    grade = case_when(
        (col.yield_kg >= 30, "great"),
        (col.yield_kg >= 10, "fine"),
        default = "unweighed",
    ),
    filled = replace_na(col.yield_kg, 0.0),
    capped = coalesce(col.yield_kg, col.rows * 5.0),  # estimate when missing
)
print(graded.select(col.plant, col.scale, col.grade, col.filled, col.capped))
# dpyr dataframe · source: polars · showing 5 of 5 rows
shape: (5, 5)
┌───────────────┬───────┬───────────┬────────┬────────┐
│ plant ┆ scale ┆ grade ┆ filled ┆ capped │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str ┆ f64 ┆ f64 │
╞═══════════════╪═══════╪═══════════╪════════╪════════╡
│ Roma tomato ┆ small ┆ great ┆ 41.5 ┆ 41.5 │
│ Cherry tomato ┆ big ┆ great ┆ 33.2 ┆ 33.2 │
│ Basil ┆ small ┆ unweighed ┆ 0.0 ┆ 10.0 │
│ Squash ┆ small ┆ great ┆ NaN ┆ NaN │
│ Pepper ┆ big ┆ fine ┆ 12.9 ┆ 12.9 │
└───────────────┴───────┴───────────┴────────┴────────┘
Look at the Squash row: it graded "great" (NaN ≥ 30 is true, as above), and
neither replace_na nor coalesce touched it — both fill nulls only,
while NaN is a value. To treat NaN as missing in a fill, route through
.is_na():
print(plants.mutate(
    y0 = if_else(col.yield_kg.is_na(), 0.0, col.yield_kg),
).pull(col.y0))
[41.5, 33.2, 0.0, 0.0, 12.9]
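The case_when rules in this section (first match wins, a null condition does not match, branch types are checked before any data is touched) can be sketched row by row in plain Python. `toy_case_when` and `toy_ge` are hypothetical helpers that model the described behavior; they are not how dpyr evaluates anything.

```python
import math


def toy_ge(value, threshold):
    """Model of >= as described earlier: null stays null, NaN sorts above every number."""
    if value is None:
        return None
    if math.isnan(value):
        return True
    return value >= threshold


def toy_case_when(*pairs, default=None):
    """Return a per-row function; branch value types are checked at build time."""
    values = [value for _, value in pairs]
    if default is not None:
        values.append(default)
    if len({type(v) for v in values}) > 1:
        raise TypeError("case_when branches must share one type")

    def apply(row_value):
        for condition, value in pairs:
            if condition(row_value) is True:  # null conditions do not match
                return value
        return default  # None stands in for a missing value

    return apply


grade = toy_case_when(
    (lambda y: toy_ge(y, 30), "great"),
    (lambda y: toy_ge(y, 10), "fine"),
    default="unweighed",
)
grades = [grade(y) for y in [41.5, 33.2, None, float("nan"), 12.9]]
print(grades)  # ['great', 'great', 'unweighed', 'great', 'fine']
```

Mixing a string branch with an integer branch, as in `toy_case_when((cond, "a"), (cond, 1))`, raises before `apply` is ever called, mirroring the build-time check described above.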
The same expressions on duckdb
Expressions are backend-agnostic; the duckdb compiler turns the identical
tree into SQL (case_when → CASE WHEN, .is_na() →
IS NULL OR isnan(...)):
import duckdb
con = duckdb.connect() # in-memory
con.execute("""
CREATE TABLE sales AS SELECT * FROM (VALUES
('Roma tomato', 41.5), ('Basil', NULL), ('Pepper', 12.9)
) AS t(plant, yield_kg)
""")
sales = read(con, "sales")
print(sales.mutate(
    grade = case_when(
        (col.yield_kg >= 30, "great"),
        (col.yield_kg >= 10, "fine"),
        default = "unweighed",
    ),
    missing = col.yield_kg.is_na(),
))
# dpyr dataframe · source: duckdb · showing 3 of 3 rows
shape: (3, 4)
┌─────────────┬──────────┬───────────┬─────────┐
│ plant ┆ yield_kg ┆ grade ┆ missing │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ f64 ┆ str ┆ bool │
╞═════════════╪══════════╪═══════════╪═════════╡
│ Roma tomato ┆ 41.5 ┆ great ┆ false │
│ Basil ┆ null ┆ unweighed ┆ true │
│ Pepper ┆ 12.9 ┆ fine ┆ false │
└─────────────┴──────────┴───────────┴─────────┘
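As a rough illustration of what "turning the tree into SQL" involves, the sketch below renders condition/value pairs as a CASE expression and runs it. Everything here is an assumption made for illustration: `toy_case_when_sql` is hypothetical, the query runs on sqlite3 only because it ships with Python, and the isnan(...) half of the missing-value test is left out. The SQL dpyr actually generates for duckdb may look different.

```python
import sqlite3


def toy_case_when_sql(pairs, default):
    """Render (condition_sql, value) pairs as CASE WHEN ... END.

    Values are inlined as string literals, which is fine for a fixed demo
    but not safe for untrusted input.
    """
    branches = " ".join(f"WHEN {cond} THEN '{value}'" for cond, value in pairs)
    return f"CASE {branches} ELSE '{default}' END"


grade_sql = toy_case_when_sql(
    [("yield_kg >= 30", "great"), ("yield_kg >= 10", "fine")],
    default="unweighed",
)
query = (
    f"SELECT plant, {grade_sql} AS grade, yield_kg IS NULL AS missing "
    "FROM sales ORDER BY rowid"
)

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (plant TEXT, yield_kg REAL)")
con.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("Roma tomato", 41.5), ("Basil", None), ("Pepper", 12.9)],
)
rows = con.execute(query).fetchall()

print(query)
for row in rows:
    print(row)
```

The null row falls through to the ELSE branch because a comparison against NULL is itself NULL in SQL, which matches the behavior shown in the table above.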
Mistakes surface on your line
Verbs validate every expression against the schema before returning — pure metadata work, so it's instant. A wrong column name raises with a did-you-mean suggestion, and dpyr strips its internals from the traceback: exactly two stack frames, your call plus one re-raise inside dpyr (paths below come from running this guide as a script):
import traceback
try:
    plants.filter(col.yeild_kg > 10)
except Exception:
    traceback.print_exc(chain=False)
Traceback (most recent call last):
File "/tmp/expr_full2.py", line 17, in <module>
plants.filter(col.yeild_kg > 10)
~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^
File "/home/maxime/Projects/r_ports_to_py/dpyr/src/dpyr/frame.py", line 51, in verb
raise err.with_traceback(None) from None
dpyr.errors.ColumnNotFoundError: column 'yeild_kg' not found in filter(). Did you mean 'yield_kg'? Available columns: plant, family, sown, rows, yield_kg
Type mistakes get the same treatment: comparing a string to an int, summing a
date, or mixing incompatible case_when branches all raise ExprTypeError
on the offending verb call.
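A did-you-mean check of this kind, together with the traceback trimming visible in the re-raise line above, can be approximated with the standard library. This is a sketch under assumptions: `ToyColumnNotFound`, `check_columns` and `toy_filter` are hypothetical, and using `difflib` for the suggestion is a guess, not a statement about how dpyr picks its hint.

```python
import difflib
import traceback

SCHEMA = ["plant", "family", "sown", "rows", "yield_kg"]


class ToyColumnNotFound(KeyError):
    """Hypothetical lookup error; __str__ avoids KeyError's quoted repr."""

    def __str__(self):
        return self.args[0]


def check_columns(used, schema, verb):
    """Pure metadata check: compare referenced names against the schema."""
    for name in used:
        if name in schema:
            continue
        guess = difflib.get_close_matches(name, schema, n=1)
        hint = f" Did you mean {guess[0]!r}?" if guess else ""
        raise ToyColumnNotFound(
            f"column {name!r} not found in {verb}().{hint} "
            f"Available columns: {', '.join(schema)}"
        )


def toy_filter(schema, *used):
    try:
        check_columns(used, schema, "filter")
    except ToyColumnNotFound as err:
        # Dropping the old traceback hides check_columns from the report.
        raise err.with_traceback(None) from None


try:
    toy_filter(SCHEMA, "yeild_kg")
except ToyColumnNotFound as err:
    message = str(err)
    depth = len(traceback.extract_tb(err.__traceback__))

print(message)
print("frames in traceback:", depth)
```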
df.c: the schema-bound, type-aware proxy
col accepts any name and any method, deferring all checks to the verb. The
dataframe-bound proxy df.c knows the live schema: in Jupyter or any REPL,
plants.c.<TAB> completes real column names, and what comes back is a
typed expression class:
print(type(plants.c.yield_kg).__name__, "/", type(plants.c.plant).__name__,
      "/", type(plants.c.sown).__name__)
NumExpr / StrExpr / TemporalExpr
NumExpr has no .str_detect, StrExpr has no .mean — completion menus
only offer methods that make sense, and a wrong-type method fails immediately
at expression-build time, before any verb or backend is involved. Typos fail
at attribute access with the same did-you-mean:
try:
    plants.c.plant.mean()
except TypeError as e:
    print(f"{type(e).__name__}: {e}")
try:
    plants.c.famly
except KeyError as e:
    print(f"{type(e).__name__}: {e}")
ExprTypeError: .mean() is not available on a StrExpr
ColumnNotFoundError: column 'famly' not found in df.c. Did you mean 'family'? Available columns: plant, family, sown, rows, yield_kg
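One plausible way to build such a proxy is to combine `__getattr__` with `__dir__`, since completers such as the standard library's rlcompleter consult `dir()` when offering attribute names. The classes below are toy stand-ins with hypothetical names, and dpyr's real proxy may work differently.

```python
class ToyNumExpr:
    def __init__(self, name):
        self.name = name

    def mean(self):
        return f"mean(col.{self.name})"


class ToyStrExpr:
    def __init__(self, name):
        self.name = name

    def str_len(self):
        return f"str_len(col.{self.name})"


# Hypothetical mapping from schema dtype to the expression class handed back.
TOY_CLASSES = {"i64": ToyNumExpr, "f64": ToyNumExpr, "str": ToyStrExpr}


class ToyColsProxy:
    def __init__(self, schema):
        self._schema = dict(schema)

    def __dir__(self):
        # Advertise the live column names so TAB completion can list them.
        return list(self._schema)

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails, i.e. for column names.
        schema = self.__dict__.get("_schema", {})
        if name not in schema:
            raise AttributeError(f"column {name!r} not found")
        return TOY_CLASSES[schema[name]](name)


c = ToyColsProxy({"plant": "str", "rows": "i64", "yield_kg": "f64"})
print(dir(c))
print(type(c.yield_kg).__name__, "/", type(c.plant).__name__)
print(hasattr(c.plant, "mean"))  # False: the string class simply lacks the method
```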
Lambda verbs
filter, mutate, and summarize accept callables and pass them df.c, so
you get the typed proxy without naming the dataframe twice — handy mid-chain,
where the intermediate dataframe has no variable name:
print(plants.filter(lambda c: c.rows >= 5).pull(col.plant))
print(plants.mutate(per_row = lambda c: (c.yield_kg / c.rows).round(1))
      .slice_head(2).pull(col.per_row))
['Cherry tomato', 'Pepper']
[10.4, 5.5]
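The dispatch behind a lambda verb can be quite small: if the argument is callable, call it with the proxy and use whatever expression comes back. The sketch below shows that idea on a list of dicts. All names (`ToyPredicate`, `ToyColumn`, `ToyProxy`, `toy_filter`) are hypothetical, and it supports only `>=` to stay short.

```python
class ToyPredicate:
    """Deferred row test; deliberately not callable, so it is never mistaken for a lambda."""

    def __init__(self, column, minimum):
        self.column, self.minimum = column, minimum

    def test(self, row):
        return row[self.column] >= self.minimum


class ToyColumn:
    def __init__(self, name):
        self.name = name

    def __ge__(self, minimum):
        return ToyPredicate(self.name, minimum)


class ToyProxy:
    def __init__(self, columns):
        self._columns = set(columns)

    def __getattr__(self, name):
        if name not in self.__dict__.get("_columns", ()):
            raise AttributeError(name)
        return ToyColumn(name)


def toy_filter(rows, condition):
    # A callable receives the proxy and must return an expression;
    # anything else is assumed to be an expression already.
    if callable(condition):
        condition = condition(ToyProxy(rows[0]))
    return [row for row in rows if condition.test(row)]


rows = [
    {"plant": "Roma tomato", "rows": 4},
    {"plant": "Cherry tomato", "rows": 6},
    {"plant": "Pepper", "rows": 5},
]
by_lambda = toy_filter(rows, lambda c: c.rows >= 5)
by_expr = toy_filter(rows, ToyColumn("rows") >= 5)
print([r["plant"] for r in by_lambda])  # ['Cherry tomato', 'Pepper']
print(by_lambda == by_expr)             # True
```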
Static completion anywhere: dpyr stubgen
Runtime completion needs a live kernel. For static completion and
type-checking in any editor, the dpyr stubgen CLI reads parquet/csv schemas
and writes a typed module: one ColsProxy subclass per file with a typed
attribute per column, plus a loader returning DFrame[YourCols]. Shell usage
is dpyr stubgen data/*.parquet -o schemas.py; here it is end-to-end (temp
paths will differ on your machine):
import subprocess, sys, tempfile
from pathlib import Path
workdir = Path(tempfile.mkdtemp())
plants.write(workdir / "plants.parquet")
subprocess.run(["dpyr", "stubgen", str(workdir / "plants.parquet"),
                "-o", str(workdir / "garden_schemas.py")], check=True)
print((workdir / "garden_schemas.py").read_text())
"""Generated by `dpyr stubgen` — do not edit by hand."""
from typing import cast
from dpyr import DFrame, read
from dpyr.expr import BoolExpr, NumExpr, StrExpr, TemporalExpr
from dpyr.frame import ColsProxy
class PlantsCols(ColsProxy):
    plant: StrExpr
    family: StrExpr
    sown: TemporalExpr
    rows: NumExpr
    yield_kg: NumExpr

def load_plants() -> DFrame[PlantsCols]:
    return cast(DFrame[PlantsCols], read('/tmp/tmpvbdbwibt/plants.parquet'))

plants: DFrame[PlantsCols] = load_plants()
Import from that module and the DFrame[PlantsCols] annotation flows through
the chain: pyright/mypy infer c: PlantsCols inside lambda verbs, so
c.yie<TAB> completes and c.plant.mean() is flagged in the editor,
before anything runs — and it works at runtime too:
sys.path.insert(0, str(workdir))
from garden_schemas import plants as typed_plants
print(typed_plants.filter(lambda c: c.rows >= 5).pull("plant"))
['Cherry tomato', 'Pepper']
dpyr ships a py.typed marker, so type checkers pick up its inline
annotations with zero configuration — generated schema modules, lambda verbs,
and ordinary chains all type-check out of the box.
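The way an annotation like DFrame[PlantsCols] reaches the parameter of a lambda is ordinary generic typing, and can be sketched with `typing` alone. `ToyFrame`, `ToyPlantsCols`, `ToyNum` and `pick` are hypothetical stand-ins, not dpyr's classes. The point is only that a `Callable[[ColsT], ...]` parameter lets a checker type the lambda's argument.

```python
from typing import Callable, Generic, TypeVar


class ToyNum:
    def __init__(self, name: str) -> None:
        self.name = name


class ToyPlantsCols:
    """One typed attribute per column, like a generated schema class."""

    rows: ToyNum
    yield_kg: ToyNum

    def __init__(self) -> None:
        self.rows = ToyNum("rows")
        self.yield_kg = ToyNum("yield_kg")


ColsT = TypeVar("ColsT")


class ToyFrame(Generic[ColsT]):
    def __init__(self, cols: ColsT) -> None:
        self.cols = cols

    def pick(self, chooser: Callable[[ColsT], ToyNum]) -> str:
        # A checker solves ColsT from the frame's annotation, so the
        # lambda's parameter is typed as ColsT inside the call.
        return chooser(self.cols).name


frame: ToyFrame[ToyPlantsCols] = ToyFrame(ToyPlantsCols())

# With the annotation in place, a type checker would flag c.yeild_kg here.
picked = frame.pick(lambda c: c.yield_kg)
print(picked)  # yield_kg
```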
Where next
- Grouped data: aggregates and windows per group
- Window functions: lag, ranks, cumulative sums
- Backends: polars vs duckdb, caching, persist()