Skip to content

Joins

dpyr gives you dplyr's two join families as methods on any dataframe:

  • Mutating joinsinner_join, left_join, right_join, full_join — pull columns from a second table into the first.
  • Filtering joinssemi_join, anti_join — use the second table only to decide which rows of the first to keep. No columns are added, and rows are never duplicated.

Everything below runs identically on the polars and duckdb backends. We'll use two small tables: one row per customer, and one row per order (a customer can place several orders, and an order can reference a customer we don't know about).

from dpyr import read, col

customers = read({
    "cust_id": [1, 2, 3, 4],
    "name":    ["Ada", "Grace", "Linus", "Guido"],
    "city":    ["London", "Arlington", "Helsinki", "Haarlem"],
})

orders = read({
    "cust_id": [1, 1, 3, 5],
    "item":    ["keyboard", "monitor", "laptop", "mouse"],
    "total":   [55.0, 210.0, 1450.0, 25.0],
})

Mutating joins

Keys are passed as on=col.name (one key) or on=[col.a, col.b] (several). The key column appears once in the result, and left columns keep their original positions (see SEMANTICS S22).

inner_join keeps only rows whose key exists on both sides. Customer 1 matched two orders, so Ada appears twice:

customers.inner_join(orders, on=col.cust_id)
# dpyr dataframe · source: polars · showing 3 of 3 rows
shape: (3, 5)
┌─────────┬───────┬──────────┬──────────┬────────┐
│ cust_id ┆ name  ┆ city     ┆ item     ┆ total  │
╞═════════╪═══════╪══════════╪══════════╪════════╡
│ 1       ┆ Ada   ┆ London   ┆ keyboard ┆ 55.0   │
│ 1       ┆ Ada   ┆ London   ┆ monitor  ┆ 210.0  │
│ 3       ┆ Linus ┆ Helsinki ┆ laptop   ┆ 1450.0 │
└─────────┴───────┴──────────┴──────────┴────────┘

left_join keeps every left row; right columns fill with null where nothing matched. This is the workhorse — "decorate my table with whatever extra info exists":

customers.left_join(orders, on=col.cust_id)
shape: (5, 5)
┌─────────┬───────┬───────────┬──────────┬────────┐
│ cust_id ┆ name  ┆ city      ┆ item     ┆ total  │
╞═════════╪═══════╪═══════════╪══════════╪════════╡
│ 1       ┆ Ada   ┆ London    ┆ keyboard ┆ 55.0   │
│ 1       ┆ Ada   ┆ London    ┆ monitor  ┆ 210.0  │
│ 2       ┆ Grace ┆ Arlington ┆ null     ┆ null   │
│ 3       ┆ Linus ┆ Helsinki  ┆ laptop   ┆ 1450.0 │
│ 4       ┆ Guido ┆ Haarlem   ┆ null     ┆ null   │
└─────────┴───────┴───────────┴──────────┴────────┘

right_join is the mirror image (every order survives, including the one from unknown customer 5), and full_join keeps everything from both sides:

customers.right_join(orders, on=col.cust_id)   # 4 rows: cust_id 5 has null name/city
customers.full_join(orders, on=col.cust_id)    # 6 rows: union of both sides

Running that full join, we got cust_ids in the order 1, 1, 3, 5, 2, 4 — matched pairs first, unmatched leftovers appended. That ordering is not a guarantee; see below.

Multiple keys

Pass a list to match on a composite key:

sales = read({
    "year":    [2024, 2024, 2025],
    "region":  ["east", "west", "east"],
    "revenue": [120.0, 95.0, 140.0],
})
targets = read({
    "year":   [2024, 2024, 2025, 2025],
    "region": ["east", "west", "east", "west"],
    "target": [100.0, 100.0, 130.0, 110.0],
})

sales.inner_join(targets, on=[col.year, col.region])
# 3 rows × 4 cols (year, region, revenue, target);
# the 2025/west target had no matching sales row, so it's dropped

Overlapping column names get .x / .y suffixes

When both tables have a non-key column with the same name, dpyr keeps both and disambiguates with dplyr's suffixes: .x for the left table, .y for the right (SEMANTICS S11) — not pandas' _x/_y or polars' _right:

jan = read({"sku": ["A1", "B2"], "price": [9.99, 24.00]})
feb = read({"sku": ["A1", "B2"], "price": [10.49, 22.50]})

jan.inner_join(feb, on=col.sku)
shape: (2, 3)
┌─────┬─────────┬─────────┐
│ sku ┆ price.x ┆ price.y │
╞═════╪═════════╪═════════╡
│ A1  ┆ 9.99    ┆ 10.49   │
│ B2  ┆ 24.0    ┆ 22.5    │
└─────┴─────────┴─────────┘

Dots in column names are awkward to type with col., so for downstream work you may prefer your own suffixes:

jan.inner_join(feb, on=col.sku, suffix=("_jan", "_feb"))
# same two rows, columns now: sku, price_jan, price_feb

Missing values in join keys: na_matches

SQL says NULL = NULL is unknown, so SQL joins silently drop rows whose key is missing. dplyr instead treats two missing keys as equal — and dpyr follows dplyr by default on both backends (SEMANTICS S10):

left  = read({"k": ["x", None], "lv": [1, 2]})
right = read({"k": ["x", None], "rv": [10, 20]})

left.inner_join(right, on=col.k)              # default: na_matches="na"
shape: (2, 3)
┌──────┬─────┬─────┐
│ k    ┆ lv  ┆ rv  │
╞══════╪═════╪═════╡
│ x    ┆ 1   ┆ 10  │
│ null ┆ 2   ┆ 20  │
└──────┴─────┴─────┘

Pass na_matches="never" to get the SQL behavior — the null keys no longer pair up:

left.inner_join(right, on=col.k, na_matches="never")
# 1 row: only k="x" survives

On duckdb, the default compiles to IS NOT DISTINCT FROM so both modes are available there too. All four mutating joins accept na_matches (and suffix); the filtering joins below take only on=.

Filtering joins: semi_join and anti_join

A semi_join keeps the left rows that have at least one match; an anti_join keeps the ones that have none. Neither adds columns:

customers.semi_join(orders, on=col.cust_id)
# 2 rows, same 3 columns as customers: Ada and Linus (they ordered)

customers.anti_join(orders, on=col.cust_id)
# 2 rows: Grace and Guido (they never ordered)

The key difference from inner_join + select: a semi join never multiplies rows. Customer 1 has two orders but still appears once. Prefer semi_join whenever the question is "which rows of A are in B?" rather than "what does B say about A?".

Duplicate keys multiply rows

Mutating joins emit one output row per pair of matching rows. If a key is duplicated on both sides, you get a per-key cross product — 2 × 2 = 4 rows here:

runs   = read({"day": ["mon", "mon"], "run":   ["r1", "r2"]})
alerts = read({"day": ["mon", "mon"], "alert": ["disk", "cpu"]})

runs.inner_join(alerts, on=col.day)
shape: (4, 3)
┌─────┬─────┬───────┐
│ day ┆ run ┆ alert │
╞═════╪═════╪═══════╡
│ mon ┆ r1  ┆ disk  │
│ mon ┆ r2  ┆ disk  │
│ mon ┆ r1  ┆ cpu   │
│ mon ┆ r2  ┆ cpu   │
└─────┴─────┴───────┘

This is correct join algebra, but when it's a surprise it usually means the right table wasn't unique per key. Compare tbl.distinct(col.k).count() against tbl.count() to check, or reach for semi_join if you only wanted existence.

Row order is unspecified — pin it with arrange()

dpyr deliberately does not promise any row order after a join (SEMANTICS S21). dplyr preserves left order, but hash joins in polars and duckdb don't, and forcing it would cost a sort on every join. The 1, 1, 3, 5, 2, 4 order of the earlier full join is just what the engine emitted that time. If order matters, say so explicitly:

from dpyr import desc

(
    customers
    .full_join(orders, on=col.cust_id)
    .arrange(col.cust_id, desc(col.total))
)
shape: (6, 5)
┌─────────┬───────┬───────────┬──────────┬────────┐
│ cust_id ┆ name  ┆ city      ┆ item     ┆ total  │
╞═════════╪═══════╪═══════════╪══════════╪════════╡
│ 1       ┆ Ada   ┆ London    ┆ monitor  ┆ 210.0  │
│ 1       ┆ Ada   ┆ London    ┆ keyboard ┆ 55.0   │
│ 2       ┆ Grace ┆ Arlington ┆ null     ┆ null   │
│ 3       ┆ Linus ┆ Helsinki  ┆ laptop   ┆ 1450.0 │
│ 4       ┆ Guido ┆ Haarlem   ┆ null     ┆ null   │
│ 5       ┆ null  ┆ null      ┆ mouse    ┆ 25.0   │
└─────────┴───────┴───────────┴──────────┴────────┘

Joins on duckdb: one connection per plan

Joining two duckdb-backed dataframes pushes the whole thing down as a single SQL query — as long as both dataframes come from the same connection:

import duckdb

con = duckdb.connect()
con.execute("CREATE TABLE people AS SELECT * FROM (VALUES (1, 'Ada'), (2, 'Grace')) t(pid, name)")
con.execute("CREATE TABLE badges AS SELECT * FROM (VALUES (1, 'gold'), (2, 'silver'), (2, 'bronze')) t(pid, badge)")

people = read(con, "people")
badges = read(con, "badges")

people.inner_join(badges, on=col.pid).arrange(col.pid, col.badge)
# dpyr dataframe · source: duckdb · showing 3 of 3 rows
shape: (3, 3)
┌─────┬───────┬────────┐
│ pid ┆ name  ┆ badge  │
╞═════╪═══════╪════════╡
│ 1   ┆ Ada   ┆ gold   │
│ 2   ┆ Grace ┆ bronze │
│ 2   ┆ Grace ┆ silver │
└─────┴───────┴────────┘

Mixing a duckdb dataframe with an in-memory dataframe just works: the plan runs inside duckdb, which scans the in-memory dataframe's Arrow data in place — no copy, no staging step:

from dpyr import read

scores = read({"pid": [1, 2], "score": [9.5, 8.0]})  # plain in-memory frame
people.inner_join(scores, on=col.pid).arrange(col.pid)
# dpyr dataframe · source: duckdb · showing 2 of 2 rows
shape: (2, 3)
┌─────┬───────┬───────┐
│ pid ┆ name  ┆ score │
╞═════╪═══════╪═══════╡
│ 1   ┆ Ada   ┆ 9.5   │
│ 2   ┆ Grace ┆ 8.0   │
└─────┴───────┴───────┘

And since 1.7.0, even dataframes from different connections — a second duckdb file, a sqlite file, separate in-memory databases — join too. The foreign side is streamed through Arrow onto the plan's primary connection. That stream is a copy of that table's rows, so dpyr warns to keep it visible; for a large table, land it once with to_table("name", con=...) and join the landed copy instead:

other_con = duckdb.connect()
other_con.execute("CREATE TABLE badges2 AS SELECT * FROM (VALUES (1, 'gold')) t(pid, badge)")
stray = read(other_con, "badges2")

import warnings
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    out = people.inner_join(stray, on=col.pid).collect()
print(out.to_dicts())
print(caught[0].message)
[{'pid': 1, 'name': 'Ada', 'badge': 'gold'}]
joining across database connections: streaming "badges2" through arrow onto the primary connection; persist()/to_table() it there first if it is large

In short: any source joins any source (SEMANTICS S27/S34). Same engine is free, RAM into duckdb is zero-copy, and across database connections costs one streamed copy of the smaller side — which you can see, because it warns.

Cheat sheet

Verb Rows kept Columns added Can duplicate rows
inner_join matches only yes yes
left_join all left yes (null-filled) yes
right_join all right yes (null-filled) yes
full_join all of both yes (null-filled) yes
semi_join left rows with a match no no
anti_join left rows without a match no no

All four mutating joins accept suffix=(".x", ".y") and na_matches="na" | "never"; the filtering joins take only on=.