Skip to content

Column-wise operations

Sooner or later every pipeline hits the "do this to twenty columns" problem. In pandas you reach for df.filter(regex=...) or a dict passed to .agg; in polars, pl.col("^bill_.*$") or cs.numeric(). dpyr splits the problem into two composable pieces, both borrowed from dplyr:

  • selectors (starts_with, where(is_numeric), ...) answer which columns,
  • across() answers what to do to each of them.

Selectors work anywhere a column set is expected — select(), relocate(), and inside across() — and they resolve against the dataframe's schema at the moment the verb is called, so a bad pattern fails on that line, not at collect() time.

Everything below is one runnable script. Setup:

import duckdb
from dpyr import (
    read, col, n, across,
    starts_with, ends_with, contains, matches, where, everything,
    is_numeric, is_string,
)

penguins = read({
    "species":           ["Adelie", "Adelie", "Gentoo", "Gentoo", "Chinstrap", "Chinstrap"],
    "island":            ["Torgersen", "Dream", "Biscoe", "Biscoe", "Dream", "Dream"],
    "bill_length_mm":    [39.1, 37.8, 47.5, 49.9, 46.5, 51.3],
    "bill_depth_mm":     [18.7, 18.3, 15.0, 16.1, 17.9, 19.2],
    "flipper_length_mm": [181.0, 174.0, 217.0, 213.0, 192.0, 198.0],
    "year":              [2007, 2008, 2007, 2008, 2007, 2008],
})
print(penguins.columns)
['species', 'island', 'bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'year']

Picking columns: tidyselect in select()

select() takes any mix of plain column references and selectors:

Selector Keeps columns whose...
starts_with("bill") name starts with the prefix
ends_with("_mm") name ends with the suffix
contains("length") name contains the substring
matches(r"^bill_.*_mm$") name matches the regex (Python re.search)
where(is_numeric) dtype satisfies the predicate (is_string, is_bool, or your own DType -> bool)
everything() always — useful for "and then the rest"
print(penguins.select(starts_with("bill")))
# dpyr dataframe · source: polars · showing 6 of 6 rows
shape: (6, 2)
┌────────────────┬───────────────┐
│ bill_length_mm ┆ bill_depth_mm │
│ ---            ┆ ---           │
│ f64            ┆ f64           │
╞════════════════╪═══════════════╡
│ 39.1           ┆ 18.7          │
│ 37.8           ┆ 18.3          │
│ 47.5           ┆ 15.0          │
│ 49.9           ┆ 16.1          │
│ 46.5           ┆ 17.9          │
│ 51.3           ┆ 19.2          │
└────────────────┴───────────────┘

To keep the page short we'll print .columns for the rest — the data comes along too, of course:

print(penguins.select(col.species, ends_with("_mm")).columns)
print(penguins.select(contains("length")).columns)
print(penguins.select(matches(r"^bill_.*_mm$")).columns)
print(penguins.select(where(is_numeric)).columns)
print(penguins.select(where(is_string)).columns)
print(penguins.select(col.year, everything()).columns)
['species', 'bill_length_mm', 'bill_depth_mm', 'flipper_length_mm']
['bill_length_mm', 'flipper_length_mm']
['bill_length_mm', 'bill_depth_mm']
['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'year']
['species', 'island']
['year', 'species', 'island', 'bill_length_mm', 'bill_depth_mm', 'flipper_length_mm']

Two ordering rules worth knowing: a selector expands its matches in schema order, and the first mention of a column fixes its position — that's why select(col.year, everything()) puts year first without duplicating it.

Dropping columns with -

Negate a column reference or a whole selector with the unary minus. A call made only of negations means "everything except":

print(penguins.select(-col.island).columns)
print(penguins.select(-starts_with("bill"), -col.year).columns)
['species', 'bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'year']
['species', 'island', 'flipper_length_mm']

Reordering with relocate()

select() reorders but forces you to spell out what to keep. relocate() moves the columns you name and leaves everything else alone. With no anchor the moved columns go to the front; before= / after= place them relative to another column (passing both raises an error):

print(penguins.relocate(col.year).columns)
print(penguins.relocate(starts_with("bill"), after=col.flipper_length_mm).columns)
print(penguins.relocate(col.island, before=col.year).columns)
['year', 'species', 'island', 'bill_length_mm', 'bill_depth_mm', 'flipper_length_mm']
['species', 'island', 'flipper_length_mm', 'bill_length_mm', 'bill_depth_mm', 'year']
['species', 'bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'island', 'year']

across(): one recipe, many columns

across(selector, fns) goes inside mutate() or summarize() as a positional argument (everything else in those verbs must be a keyword). The simplest form pairs a selector with a string shortcut:

print(penguins.summarize(across(where(is_numeric), "mean")))
# dpyr dataframe · source: polars · showing 1 of 1 rows
shape: (1, 4)
┌────────────────┬───────────────┬───────────────────┬────────┐
│ bill_length_mm ┆ bill_depth_mm ┆ flipper_length_mm ┆ year   │
│ ---            ┆ ---           ┆ ---               ┆ ---    │
│ f64            ┆ f64           ┆ f64               ┆ f64    │
╞════════════════╪═══════════════╪═══════════════════╪════════╡
│ 45.35          ┆ 17.533333     ┆ 195.833333        ┆ 2007.5 │
└────────────────┴───────────────┴───────────────────┴────────┘

Aggregations skip missing values by default, dplyr-style via na_rm=True (see SEMANTICS S2). The shortcut must be one of a fixed list — anything else fails immediately, while you're still building the expression:

try:
    across(everything(), "avg")
except Exception as e:
    print(e)
across() unknown function shortcut 'avg'; known: first, last, max, mean, median, min, n_unique, std, sum, var

For several functions per column, pass a list. Output columns are then named {col}_{fn} by default:

print(penguins.summarize(across(starts_with("bill"), ["mean", "std"])).columns)
['bill_length_mm_mean', 'bill_length_mm_std', 'bill_depth_mm_mean', 'bill_depth_mm_std']

Anywhere a shortcut fits, a lambda fits too — it receives a typed column expression. A dict picks your own {fn} labels:

print(penguins.summarize(across(
    starts_with("bill"),
    {"avg": "mean", "spread": lambda c: c.max() - c.min()},
)))
# dpyr dataframe · source: polars · showing 1 of 1 rows
shape: (1, 4)
┌─────────────────────┬───────────────────────┬────────────────────┬──────────────────────┐
│ bill_length_mm_mean ┆ bill_length_mm_spread ┆ bill_depth_mm_mean ┆ bill_depth_mm_spread │
│ ---                 ┆ ---                   ┆ ---                ┆ ---                  │
│ f64                 ┆ f64                   ┆ f64                ┆ f64                  │
╞═════════════════════╪═══════════════════════╪════════════════════╪══════════════════════╡
│ 45.35               ┆ 13.5                  ┆ 17.533333          ┆ 4.2                  │
└─────────────────────┴───────────────────────┴────────────────────┴──────────────────────┘

names= overrides the naming template entirely; {col} and {fn} are the available placeholders:

print(penguins.summarize(across(where(is_numeric), "mean", names="mean_of_{col}")).columns)
['mean_of_bill_length_mm', 'mean_of_bill_depth_mm', 'mean_of_flipper_length_mm', 'mean_of_year']

across() in mutate()

With a single function the default template is just {col}, so mutate transforms the columns in place. Note year below: it was Int64, and int / int promotes to float (SEMANTICS S4), so a 2007 becomes 200.7 — set names= if you want to keep the originals:

print(penguins.mutate(across(where(is_numeric), lambda c: c / 10)))
print(penguins.mutate(across(ends_with("_mm"), lambda c: c / 25.4, names="{col}_in")).columns)
# dpyr dataframe · source: polars · showing 6 of 6 rows
shape: (6, 6)
┌───────────┬───────────┬────────────────┬───────────────┬───────────────────┬───────┐
│ species   ┆ island    ┆ bill_length_mm ┆ bill_depth_mm ┆ flipper_length_mm ┆ year  │
│ ---       ┆ ---       ┆ ---            ┆ ---           ┆ ---               ┆ ---   │
│ str       ┆ str       ┆ f64            ┆ f64           ┆ f64               ┆ f64   │
╞═══════════╪═══════════╪════════════════╪═══════════════╪═══════════════════╪═══════╡
│ Adelie    ┆ Torgersen ┆ 3.91           ┆ 1.87          ┆ 18.1              ┆ 200.7 │
│ Adelie    ┆ Dream     ┆ 3.78           ┆ 1.83          ┆ 17.4              ┆ 200.8 │
│ Gentoo    ┆ Biscoe    ┆ 4.75           ┆ 1.5           ┆ 21.7              ┆ 200.7 │
│ Gentoo    ┆ Biscoe    ┆ 4.99           ┆ 1.61          ┆ 21.3              ┆ 200.8 │
│ Chinstrap ┆ Dream     ┆ 4.65           ┆ 1.79          ┆ 19.2              ┆ 200.7 │
│ Chinstrap ┆ Dream     ┆ 5.13           ┆ 1.92          ┆ 19.8              ┆ 200.8 │
└───────────┴───────────┴────────────────┴───────────────┴───────────────────┴───────┘
['species', 'island', 'bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'year', 'bill_length_mm_in', 'bill_depth_mm_in', 'flipper_length_mm_in']

Grouping columns are off-limits

across() never touches a grouping column, even when the selector matches it. Here year is numeric, but because it's the grouping key it comes through untouched instead of being averaged into nonsense. Positional across() results come first, then keyword aggregates like n=n(); grouped results are sorted by key (SEMANTICS S7):

print(penguins.group_by(col.year).summarize(across(where(is_numeric), "mean"), n=n()))
# dpyr dataframe · source: polars · showing 2 of 2 rows
shape: (2, 5)
┌──────┬────────────────┬───────────────┬───────────────────┬─────┐
│ year ┆ bill_length_mm ┆ bill_depth_mm ┆ flipper_length_mm ┆ n   │
│ ---  ┆ ---            ┆ ---           ┆ ---               ┆ --- │
│ i64  ┆ f64            ┆ f64           ┆ f64               ┆ i64 │
╞══════╪════════════════╪═══════════════╪═══════════════════╪═════╡
│ 2007 ┆ 44.366667      ┆ 17.2          ┆ 196.666667        ┆ 3   │
│ 2008 ┆ 46.333333      ┆ 17.866667     ┆ 195.0             ┆ 3   │
└──────┴────────────────┴───────────────┴───────────────────┴─────┘

The same protection applies in a grouped mutate, where each function runs as a window per group — handy for centering within groups (the result is still grouped, as the header shows):

centered = (
    penguins
    .group_by(col.species)
    .mutate(across(starts_with("bill"), lambda c: c - c.mean(), names="{col}_centered"))
    .select(col.species, col.bill_length_mm_centered, col.bill_depth_mm_centered)
)
print(centered)
# dpyr dataframe · groups: species · source: polars · showing 6 of 6 rows
shape: (6, 3)
┌───────────┬─────────────────────────┬────────────────────────┐
│ species   ┆ bill_length_mm_centered ┆ bill_depth_mm_centered │
│ ---       ┆ ---                     ┆ ---                    │
│ str       ┆ f64                     ┆ f64                    │
╞═══════════╪═════════════════════════╪════════════════════════╡
│ Adelie    ┆ 0.65                    ┆ 0.2                    │
│ Adelie    ┆ -0.65                   ┆ -0.2                   │
│ Gentoo    ┆ -1.2                    ┆ -0.55                  │
│ Gentoo    ┆ 1.2                     ┆ 0.55                   │
│ Chinstrap ┆ -2.4                    ┆ -0.65                  │
│ Chinstrap ┆ 2.4                     ┆ 0.65                   │
└───────────┴─────────────────────────┴────────────────────────┘

rename()

Keyword form, new_name=old_column; the old column can be a col reference or a plain string. Unmentioned columns keep their names and positions:

print(penguins.rename(bill_len=col.bill_length_mm, group="species").columns)
['group', 'island', 'bill_len', 'bill_depth_mm', 'flipper_length_mm', 'year']

The same code on duckdb

None of the above is polars-specific. Point a dataframe at a duckdb table and the identical chain compiles to SQL instead — note source: duckdb in the header, and the group keys still sorted (S7):

con = duckdb.connect()          # in-memory database
con.register("penguins", penguins.collect())
tbl = read(con, "penguins")

print(tbl.group_by(col.species).summarize(across(starts_with("bill"), "mean"), n=n()))
# dpyr dataframe · source: duckdb · showing 3 of 3 rows
shape: (3, 4)
┌───────────┬────────────────┬───────────────┬─────┐
│ species   ┆ bill_length_mm ┆ bill_depth_mm ┆ n   │
│ ---       ┆ ---            ┆ ---           ┆ --- │
│ str       ┆ f64            ┆ f64           ┆ i64 │
╞═══════════╪════════════════╪═══════════════╪═════╡
│ Adelie    ┆ 38.45          ┆ 18.5          ┆ 2   │
│ Chinstrap ┆ 48.9           ┆ 18.55         ┆ 2   │
│ Gentoo    ┆ 48.7           ┆ 15.55         ┆ 2   │
└───────────┴────────────────┴───────────────┴─────┘

Where to next: Grouped data covers everything else group_by changes, and Expressions & autocompletion explains the typed column expressions that lambdas receive.