gnata-sqlite
How-to Guides

Optimizing Queries

How to write jsonata_query expressions that stream in constant memory — and what happens when they don't

Optimizing Queries

jsonata_query includes a query planner that decomposes expressions into streaming accumulators at compile time. When it can decompose, queries run in constant memory with a single table scan — matching native SQL performance. When it can't, it falls back to accumulating all rows in memory.

The difference is significant: 83ms vs 439ms on 100K rows for the same 5-aggregate report. On larger datasets, the gap widens further since streaming is O(1) memory while accumulation is O(n).

What streams

These patterns are recognized at compile time and never buffer rows:

PatternExampleMemory
Simple aggregates$sum(amount), $count($), $max(price)O(1)
Filtered aggregates$sum($filter($, function($v){$v.status = "completed"}).amount)O(1)
Count distinct$count($distinct(region))O(unique)
Object/array constructors{ "a": $sum(x), "b": $max(y) }O(1)
Post-aggregate arithmetic$sum(x) - $count($)O(1)
Finalizer functions$round($average(x), 2)O(1)
Constants"Q1 Report", 42O(1)
Constant folding$sum(amount * 1.1)$sum(amount) * 1.1O(1)

What falls back to O(n)

These patterns require all rows in memory. They work correctly, but memory and time scale linearly with row count:

PatternWhy it can't stream
$sort($, function($a,$b){...})Needs all rows to determine order
$reduce($, function($a,$v){...}, init)Each step depends on all previous rows
$map($, function($v){...})Output is one element per row — O(n) by definition
Variable bindings + nested lambdas($x := $sum(amount); $map($, function($v){$v.amount / $x})) — two-pass dependency

What O(n) costs in practice

On 100K rows with a 5-aggregate report:

ModeTimeMemory
Streaming83msO(1)
Accumulating439msO(100K rows)

At 1M rows, accumulation means holding every row in memory before evaluation begins. Streaming processes each row once and discards it.

Mixed expressions: partial fallback

When streaming and opaque patterns coexist, only the opaque keys pay the O(n) cost:

jsonata_query('{
  "total":  $sum(amount),         -- streams: O(1)
  "avg":    $average(amount),     -- streams: O(1)
  "top_5":  $sort($, fn)[0..4]   -- accumulates: O(n)
}', data)

total and avg run in constant memory regardless. The planner doesn't give up on the entire expression because one key is expensive.

Keeping expressions on the fast path

Use identical predicate text for shared filters

Predicates are deduplicated by string equality. Identical text shares one evaluation per row; rephrased predicates evaluate separately:

-- Shared: one predicate evaluation per row
$sum($filter($, function($v){$v.status = "completed"}).amount)
$average($filter($, function($v){$v.status = "completed"}).amount)

-- NOT shared: different parameter name → two evaluations per row
$sum($filter($, function($v){$v.status = "completed"}).amount)
$average($filter($, function($row){$row.status = "completed"}).amount)

Push sorting and filtering into SQL

If you need the top N results, filter in SQL before the expression touches rows:

-- Instead of jsonata_query('$sort($, fn)[0..4]', data) over 100K rows:
SELECT jsonata('...', data) FROM orders
ORDER BY json_extract(data, '$.amount') DESC LIMIT 5;

Use json_each for simple array expansion

jsonata_each evaluates a full JSONata expression per row. For simple array expansion, json_each is ~6x faster:

-- Simple expand: prefer json_each
SELECT j.value FROM events, json_each(data, '$.items') j;

-- Filter + transform: jsonata_each earns its cost
SELECT * FROM events, jsonata_each('items[price > 100].{
  "name": product, "total": price * qty
}', data);

Use json_set for simple mutations

jsonata_set re-parses the entire document. For simple path updates, json_set is 5-7x faster:

-- Simple: prefer json_set
SELECT json_set(data, '$.status', 'done') FROM events;

-- Nested creation: jsonata_set earns its cost (creates intermediate objects)
SELECT jsonata_set(data, 'meta.source.type', '"import"') FROM events;

Watch for format functions

$base64, $urlencode, $htmlescape, and other format functions bypass the GJSON fast path, requiring full JSONata evaluation (~8-18 us/row vs ~0.25 us/row for simple paths). In mixed expressions, only the key using the format function pays this cost.

Quick reference

ExpressionStreams?Notes
$sum(amount)yesSimple path accumulator
$sum(amount * 1.1)yesConstant folded
$sum($filter($, fn).amount)yesPredicate + conditional accumulator
$count($distinct(region))yesO(unique) memory
{ "a": $sum(x), "b": $max(y) }yesParallel accumulators, batch extraction
$round($average(x), 2)yesFinalizer on streaming average
$sum(x) - $count($)yesPost-aggregate arithmetic
$sort(...)noO(n) — needs all data
$reduce($, fn, init)noO(n) — cross-row state
$map($, fn)noO(n) — output is one element per row

See the query planner for the full decomposition model and internal optimization details.

On this page