Query Planner
How jsonata_query decomposes expressions into streaming accumulators — matching native SQL performance
jsonata_query() includes a query planner that decomposes JSONata expressions into streaming accumulators at compile time. Result: 83ms vs 84ms native SQL on 100K rows.
- Full optimization story with benchmarks: sqlite/BLOGPOST.md
- Complete pattern reference: sqlite/OPTIMIZATION.md
AST classification
The planner walks the JSONata AST and classifies every node:
| Classification | Example | Per-row cost | Memory |
|---|---|---|---|
| CONSTANT | "Q1 Report", 42 | None | O(1) |
| ACCUMULATOR | $sum(path), $count() | Extract + accumulate | O(1) |
| FILTERED ACCUMULATOR | $sum($filter($, fn).path) | Predicate + conditional accumulate | O(1) |
| DERIVED | $sum(x) - $count($) | None (post-aggregate) | O(1) |
| OPAQUE | $sort(...), $reduce(...) | Accumulate row | O(n) |
Example decomposition:

```jsonata
{
  "revenue": $sum($filter($, fn).amount),
  "refunds": $sum($filter($, fn).amount),
  "net": $sum(...) - $sum(...),
  "avg": $round($average(amount), 2),
  "customers": $count($distinct(customer))
}
```

→ 5 streaming accumulators + post-aggregate arithmetic. `net` reuses the same accumulators as `revenue` and `refunds` (common subexpression elimination). The final object is assembled once.
Key optimizations
Batch GJSON extraction (4.2x improvement)
Inspired by DuckDB's JSON shredding. All unique field paths collected at plan time. Per row, one gjson.GetManyBytes call extracts all values in a single scan.
- Before: 5 accumulators x 1 GJSON scan = 5 scans/row
- After: 1 scan/row, results distributed by index
Predicate sharing
Inspired by ClickHouse's -If combinator. Identical predicates deduplicated at plan time. Each unique predicate evaluates once per row. Multiple accumulators reference the boolean result by index.
Constant folding
Inspired by Spark Catalyst.
- `$sum(amount * 1.1)` → `$sum(amount) * 1.1` (multiplicative constant moved to post-aggregate)
- `$sum(amount + 5)` → `$sum(amount) + 5 * $count($)` (algebraic identity for additive constants)
Streaming reference
| Pattern | Streams | Memory |
|---|---|---|
| $sum(path), $count(), $max, $min, $average | yes | O(1) |
| $sum($filter($, fn).path) | yes | O(1) |
| $count($distinct(path)) | yes | O(unique) |
| $sum(path * constant) | yes (constant folded) | O(1) |
| { key: agg, key: agg, ... } | yes (parallel) | O(1) |
| $round($average(x), 2) | yes (finalize once) | O(1) |
| $sum(x) - $count($) | yes (post-aggregate) | O(1) |
| "Q1 Report" | yes (constant) | O(1) |
| $sort(...) | no | O(n) |
| $reduce($, fn, init) | no | O(n) |
Mixed expressions: only opaque subtrees pay O(n). Streaming accumulators run in constant memory.
Why it matches native SQL
Native SQL calls json_extract once per aggregate per row, plus extractions for filter predicates. With 5 aggregates: 8+ json_extract calls/row.
jsonata_query does 1 gjson.GetManyBytes per row for all fields, evaluates each predicate once, updates all accumulators from pre-extracted values.
At 9+ aggregates, jsonata_query is faster (95ms vs 102ms) — SQL's per-aggregate extraction grows linearly while batch extraction stays nearly flat.
Performance progression
| Method | Time (100K rows) | Memory |
|---|---|---|
| jsonata_query v2 (current) | 83ms | O(unique) |
| SQL (single-scan FILTER/CASE) | 84ms | O(1) |
| SQL (5 subqueries) | 150ms | O(1) |
| jsonata_query v1 | 349ms | O(unique) |
| jsonata_query (accumulate) | 439ms | O(100K) |