JSON
Two different things share the name, and the extension tells them apart:
- .json: one JSON document, typically a list of records: [{"plot": "north", "acres": 3.2}, ...]
- .jsonl / .ndjson: one JSON object per line. This is what logging systems and APIs usually emit, and it's the better format for data: it can be scanned lazily, line by line.
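To make the difference concrete, here is a minimal sketch using only Python's standard library, not dpyr. The helper names (write_json, write_jsonl, iter_jsonl) and the second record are made up for illustration. It writes the same records in both layouts and shows why the line-delimited one can be consumed lazily.

```python
import json
import tempfile
from pathlib import Path

records = [
    {"plot": "north", "acres": 3.2},
    {"plot": "south", "acres": 1.5},
]

def write_json(path, rows):
    """Write one JSON document: an array of objects."""
    Path(path).write_text(json.dumps(rows), encoding="utf-8")

def write_jsonl(path, rows):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")

def iter_jsonl(path):
    """Yield records one at a time, parsing a single line per step."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():  # skip blank lines
                yield json.loads(line)

tmp = Path(tempfile.mkdtemp())
write_json(tmp / "plots.json", records)
write_jsonl(tmp / "plots.jsonl", records)

# A .json document has to be parsed in full before the first record is usable.
whole = json.loads((tmp / "plots.json").read_text(encoding="utf-8"))

# A .jsonl file can stop early: here only the first line is parsed.
first = next(iter_jsonl(tmp / "plots.jsonl"))
```

Both files hold the same data. The only difference is whether a reader can hand back a record before reaching the end of the file.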
Reading
from dpyr import read, col
read("plots.json") # a single document, parsed up front
read("events.jsonl") # newline-delimited, lazy scan
read("events.ndjson") # same thing, other common extension
No second argument for either — each file is one table. Nested fields arrive as struct/list columns, which work but are nicer to flatten early in the chain.
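As an illustration of what "flatten" means here, the sketch below turns a nested record into a single level of dotted column names. It is plain Python, not dpyr's API (this page does not show how dpyr itself flattens struct columns), and the field names soil and tags are hypothetical.

```python
def flatten(record, parent="", sep="."):
    """Flatten nested dicts into one level, joining keys with `sep`.

    Lists are left untouched; only dict values are expanded.
    """
    flat = {}
    for key, value in record.items():
        name = f"{parent}{sep}{key}" if parent else key
        if isinstance(value, dict):
            flat.update(flatten(value, name, sep))
        else:
            flat[name] = value
    return flat

# Hypothetical nested record: "soil" would arrive as a struct, "tags" as a list.
event = {"plot": "north", "soil": {"ph": 6.4, "type": "loam"}, "tags": ["a", "b"]}
flat_event = flatten(event)
```

After flattening, soil.ph and soil.type are ordinary scalar fields, while tags is still a list.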
Writing
result.write("plots.json") # one document: a JSON array of objects
result.write("events.jsonl") # one object per line
When things go wrong
- read() can't infer a format from 'data.json5': only .json, .jsonl, and .ndjson are recognized; the error lists every readable extension.
- A .json file that's actually line-delimited (or vice versa): the parse error will look cryptic. Peek at the file: starts with [ → it's .json; one {...} per line → rename to .jsonl.
- Inconsistent records (a field present in some objects, missing in others): missing fields become nulls, which is usually what you want. Records whose types disagree ("x": 1 then "x": "high") fail the parse; fix the producer or pre-clean the file.
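The "peek at the file" check and the hunt for disagreeing types can both be scripted. The following is a rough sketch using only the standard library. sniff_json_flavor and find_type_conflicts are hypothetical helpers, not part of dpyr, and treating ints and floats as a single numeric type is an assumption about what a reader would tolerate.

```python
import json
import tempfile
from collections import defaultdict
from pathlib import Path

def sniff_json_flavor(path):
    """Guess 'json' or 'jsonl' from the first non-blank line (a heuristic)."""
    with open(path, encoding="utf-8-sig") as f:
        for line in f:
            stripped = line.lstrip()
            if not stripped:
                continue  # blank line, keep looking
            if stripped[0] == "[":
                return "json"
            if stripped[0] == "{":
                try:
                    json.loads(stripped)
                except json.JSONDecodeError:
                    return "json"  # one object spread over several lines
                return "jsonl"  # a complete object on a single line
            break
    return "unknown"

def json_type(value):
    """Coarse JSON type name. Assumption: int and float are one 'number' type."""
    if isinstance(value, bool):
        return "bool"
    if isinstance(value, (int, float)):
        return "number"
    return type(value).__name__  # str, list, dict

def find_type_conflicts(path):
    """Return {field: types} for fields seen with more than one non-null type."""
    seen = defaultdict(set)
    with open(path, encoding="utf-8-sig") as f:
        for line in f:
            if not line.strip():
                continue
            for key, value in json.loads(line).items():
                if value is not None:  # nulls and missing fields are fine
                    seen[key].add(json_type(value))
    return {key: types for key, types in seen.items() if len(types) > 1}

# A line-delimited file saved under the wrong extension, with one bad field.
tmp = Path(tempfile.mkdtemp())
mislabeled = tmp / "events.json"
mislabeled.write_text(
    '{"x": 1, "y": "a"}\n{"x": "high"}\n{"x": 2, "y": null}\n',
    encoding="utf-8",
)

flavor = sniff_json_flavor(mislabeled)       # what the content looks like
conflicts = find_type_conflicts(mislabeled)  # fields to fix before reading
```

Here flavor suggests the file should be renamed to .jsonl, and conflicts names x as the field to clean up at the producer or in a pre-processing pass. find_type_conflicts expects line-delimited input with one object per line.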
Good to know
- .jsonl scans lazily and, on duckdb, writes in-engine via COPY: it behaves like CSV with better types.
- .json documents are read eagerly, all at once.
- For data you'll re-read, parquet remains the better resting place.