Grouped data
In pandas and polars, groupby/group_by is the entrance to an aggregation:
group, aggregate, back to a flat table. dpyr follows dplyr instead:
group_by attaches a persistent grouping to the dataframe, and every verb you
call afterwards interprets itself per group — mutate becomes a window
computation, filter tests rows against group-level values, slice_head
takes the first rows of each group. The grouping sticks until you
ungroup() or summarize consumes it. All blocks on this page run top to
bottom as one script.
from dpyr import read, col, n, across, where, is_numeric
sales = read({
"region": ["north", "north", "north", "south", "south", "south", "east", "east"],
"rep": ["ana", "ana", "bo", "cy", "cy", "dee", "ed", "ed"],
"amount": [100.0, 200.0, 150.0, 300.0, 250.0, 75.0, 500.0, 500.0],
"units": [1, 2, 2, 3, 2, 1, 5, 5],
})
g = sales.group_by(col.region)
print(type(g).__name__, g.groups)
print(repr(g).splitlines()[0])
print(type(g.ungroup()).__name__)
GroupedDFrame ('region',)
# dpyr dataframe · groups: region · source: polars · showing 8 of 8 rows
DFrame
group_by returns a GroupedDFrame — a distinct type, not a flag on the
dataframe. Your IDE's completion surface changes with it, the active keys are
always visible in the repr header and on the .groups property (a tuple of
column names), and ungroup() hands back a plain DFrame. No data moved in
any of this: like every dpyr verb, group_by only added a node to the lazy
plan.
summarize peels one grouping level
summarize collapses each group to a single row — and then drops only the
innermost grouping key, keeping the rest (see SEMANTICS S9). With two keys,
one summarize gives you per-(region, rep) rows still grouped by region:
by_rep = sales.group_by(col.region, col.rep).summarize(total = col.amount.sum())
print(type(by_rep).__name__, by_rep.groups)
print(by_rep.collect())
GroupedDFrame ('region',)
shape: (5, 3)
┌────────┬─────┬────────┐
│ region ┆ rep ┆ total │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ f64 │
╞════════╪═════╪════════╡
│ east ┆ ed ┆ 1000.0 │
│ north ┆ ana ┆ 300.0 │
│ north ┆ bo ┆ 150.0 │
│ south ┆ cy ┆ 550.0 │
│ south ┆ dee ┆ 75.0 │
└────────┴─────┴────────┘
So layered roll-ups are just chained summarize calls. A second one
aggregates the per-rep totals within each region and, having peeled the last
key, returns an ungrouped DFrame:
by_region = by_rep.summarize(best = col.total.max(), n_reps = n())
print(type(by_region).__name__)
print(by_region.collect())
DFrame
shape: (3, 3)
┌────────┬────────┬────────┐
│ region ┆ best ┆ n_reps │
│ --- ┆ --- ┆ --- │
│ str ┆ f64 ┆ i64 │
╞════════╪════════╪════════╡
│ east ┆ 1000.0 ┆ 1 │
│ north ┆ 300.0 ┆ 2 │
│ south ┆ 550.0 ┆ 2 │
└────────┴────────┴────────┘
Two pinned details visible above: grouped results come back sorted by the
group keys on both backends (S7), and n() always produces Int64 (S13).
Grouped mutate and filter are window operations
On a grouped dataframe, aggregates inside mutate are evaluated per group and
broadcast back to every row — what SQL calls a window function, and what
pandas spells transform. The dataframe keeps its original height:
dev = sales.group_by(col.region).mutate(
region_mean = col.amount.mean(),
deviation = col.amount - col.amount.mean(),
)
print(dev.collect())
shape: (8, 6)
┌────────┬─────┬────────┬───────┬─────────────┬─────────────┐
│ region ┆ rep ┆ amount ┆ units ┆ region_mean ┆ deviation │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ f64 ┆ i64 ┆ f64 ┆ f64 │
╞════════╪═════╪════════╪═══════╪═════════════╪═════════════╡
│ north ┆ ana ┆ 100.0 ┆ 1 ┆ 150.0 ┆ -50.0 │
│ north ┆ ana ┆ 200.0 ┆ 2 ┆ 150.0 ┆ 50.0 │
│ north ┆ bo ┆ 150.0 ┆ 2 ┆ 150.0 ┆ 0.0 │
│ south ┆ cy ┆ 300.0 ┆ 3 ┆ 208.333333 ┆ 91.666667 │
│ south ┆ cy ┆ 250.0 ┆ 2 ┆ 208.333333 ┆ 41.666667 │
│ south ┆ dee ┆ 75.0 ┆ 1 ┆ 208.333333 ┆ -133.333333 │
│ east ┆ ed ┆ 500.0 ┆ 5 ┆ 500.0 ┆ 0.0 │
│ east ┆ ed ┆ 500.0 ┆ 5 ┆ 500.0 ┆ 0.0 │
└────────┴─────┴────────┴───────┴─────────────┴─────────────┘
Same rule for filter: predicates can compare each row against a per-group
aggregate. East's tie produces two rows — this is a row predicate, not a
"take 1":
best_rows = sales.group_by(col.region).filter(col.amount == col.amount.max())
print(best_rows.collect())
shape: (4, 4)
┌────────┬─────┬────────┬───────┐
│ region ┆ rep ┆ amount ┆ units │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ f64 ┆ i64 │
╞════════╪═════╪════════╪═══════╡
│ north ┆ ana ┆ 200.0 ┆ 2 │
│ south ┆ cy ┆ 300.0 ┆ 3 │
│ east ┆ ed ┆ 500.0 ┆ 5 │
│ east ┆ ed ┆ 500.0 ┆ 5 │
└────────┴─────┴────────┴───────┘
Both stay grouped, so you can keep chaining grouped verbs (or ungroup()).
The other verbs respect groups too
count(...) on a grouped dataframe adds its columns to the existing keys,
counts, and — via the peeling rule — comes back grouped by the original
keys, exactly as in dplyr:
per_rep = sales.group_by(col.region).count(col.rep)
print(type(per_rep).__name__, per_rep.groups)
print(per_rep.collect())
GroupedDFrame ('region',)
shape: (5, 3)
┌────────┬─────┬─────┐
│ region ┆ rep ┆ n │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ i64 │
╞════════╪═════╪═════╡
│ east ┆ ed ┆ 2 │
│ north ┆ ana ┆ 2 │
│ north ┆ bo ┆ 1 │
│ south ┆ cy ┆ 2 │
│ south ┆ dee ┆ 1 │
└────────┴─────┴─────┘
slice_head(n) takes the first n rows of each group, with the
original column order restored even where the engine would move keys first
(S25). slice_min / slice_max rank within each group and keep ties by
default (with_ties=False for exactly n rows); their output is arranged by
the slicing key, which is why east's tied pair lands at the bottom below:
print(sales.group_by(col.region).slice_head(1).collect())
print(sales.group_by(col.region).slice_min(col.amount, n = 1).collect())
shape: (3, 4)
┌────────┬─────┬────────┬───────┐
│ region ┆ rep ┆ amount ┆ units │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ f64 ┆ i64 │
╞════════╪═════╪════════╪═══════╡
│ north ┆ ana ┆ 100.0 ┆ 1 │
│ south ┆ cy ┆ 300.0 ┆ 3 │
│ east ┆ ed ┆ 500.0 ┆ 5 │
└────────┴─────┴────────┴───────┘
shape: (4, 4)
┌────────┬─────┬────────┬───────┐
│ region ┆ rep ┆ amount ┆ units │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ f64 ┆ i64 │
╞════════╪═════╪════════╪═══════╡
│ south ┆ dee ┆ 75.0 ┆ 1 │
│ north ┆ ana ┆ 100.0 ┆ 1 │
│ east ┆ ed ┆ 500.0 ┆ 5 │
│ east ┆ ed ┆ 500.0 ┆ 5 │
└────────┴─────┴────────┴───────┘
distinct(...) automatically adds the group keys to the deduplication
set: distinct(col.rep) below really means "distinct (region, rep) pairs",
and region stays in the output — 5 unique pairs, not 5 unique reps:
pairs = sales.group_by(col.region).distinct(col.rep)
print(pairs.columns, len(pairs))
['region', 'rep'] 5
select(...) refuses to drop a group key — a dataframe can't end up grouped
by a column it no longer has. And across(...) goes the other way: its
selector never matches the keys, so where(is_numeric) means "every numeric
column except region" and you can't aggregate your own keys by accident:
print(sales.group_by(col.region).select(col.amount).columns)
print(sales.group_by(col.region).summarize(across(where(is_numeric), "mean")).collect())
['region', 'amount']
shape: (3, 3)
┌────────┬────────────┬──────────┐
│ region ┆ amount ┆ units │
│ --- ┆ --- ┆ --- │
│ str ┆ f64 ┆ f64 │
╞════════╪════════════╪══════════╡
│ east ┆ 500.0 ┆ 5.0 │
│ north ┆ 150.0 ┆ 1.666667 │
│ south ┆ 208.333333 ┆ 2.0 │
└────────┴────────────┴──────────┘
persist() keeps the grouping
persist() materializes the plan so far and rebinds the dataframe to the result
(on duckdb, a temp table). It's a checkpoint, not a semantic boundary: the
grouping is reattached to the snapshot, mirroring dplyr's compute().
checkpoint = sales.group_by(col.region).mutate(
deviation = col.amount - col.amount.mean(),
).persist()
print(type(checkpoint).__name__, checkpoint.groups)
GroupedDFrame ('region',)
The classic gotcha: forgetting ungroup()
Because grouping is sticky, an aggregate that you meant to run over the
whole dataframe quietly runs per group instead. The same mutate produces shares
of the region total or of the grand total depending on whether the
grouping is still active:
grouped_share = g.mutate(share = col.amount / col.amount.sum()).pull(col.share)
global_share = g.ungroup().mutate(share = col.amount / col.amount.sum()).pull(col.share)
print([round(s, 3) for s in grouped_share]) # sums to 1.0 per region
print([round(s, 3) for s in global_share]) # sums to 1.0 overall
[0.222, 0.444, 0.333, 0.48, 0.4, 0.12, 0.5, 0.5]
[0.048, 0.096, 0.072, 0.145, 0.12, 0.036, 0.241, 0.241]
The same trap exists for slice_head(5) (5 rows per group, not 5 rows
total). dpyr softens it where it can: the repr header always shows
groups: ..., GroupedDFrame is a separate type so it shows up in type
hints and tracebacks, and ambiguous operations fail loudly — passing a
still-grouped dataframe as the right-hand side of a join raises
GroupError: joining a grouped dataframe is not supported; ungroup() it first.
Same semantics on duckdb
When the source is a duckdb table, the same chains compile to SQL window
functions and aggregates, and the grouped behaviors are identical — enforced
by backend-agreement fuzzing. One caveat: row order after distinct is
unspecified on both backends, so pin it with arrange() (S21).
import duckdb
con = duckdb.connect() # in-memory
sales_pl = sales.collect() # polars DataFrame
con.execute("CREATE TABLE sales AS SELECT * FROM sales_pl")
tbl = read(con, "sales")
print(
tbl.group_by(col.region)
.filter(col.amount == col.amount.max())
.ungroup()
.distinct()
.arrange(col.region)
.collect()
)
shape: (3, 4)
┌────────┬─────┬────────┬───────┐
│ region ┆ rep ┆ amount ┆ units │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ f64 ┆ i64 │
╞════════╪═════╪════════╪═══════╡
│ east ┆ ed ┆ 500.0 ┆ 5 │
│ north ┆ ana ┆ 200.0 ┆ 2 │
│ south ┆ cy ┆ 300.0 ┆ 3 │
└────────┴─────┴────────┴───────┘