One harness, two engines, identical workload: 91 measured dimensions across ingest, every query / aggregation / pipeline family, mixed read-under-write, kNN, and disk. As of 2026-07-04 the score is 43 WIN · 33 LOSE · 15 N/A for XERJ. The losses are published next to the wins — this run was truncated by a search_after defect that the benchmark itself uncovered, and every casualty row is shown below, scored against us.
Both engines run as a single node on the same machine, security off, queried over localhost. No containers, no network hop, no cluster coordination on either side. Whatever this box gives one engine, it gives the other.
The harness is demo/playbooks/bench-matrix.mjs — one file, Node builtins only, checked into the repo. Its rules:
Earlier head-to-heads showed XERJ winning reads by 1.3–2.2×, and the numbers were a mirage. Those benchmarks repeated the same query against a static index, so XERJ's result cache served every call after the first. We were measuring cache hits, not query execution: uncached, a match_all with size:10 actually took 2.28 seconds, because hit materialization scanned every matching document instead of only the top from+size hits. We published the finding (demo/playbooks/CRITICAL_FINDING_read_perf_cache_mirage.md), fixed the O(N) path to O(from+size), and hardened the harness with the mismatch detection and identical-work rules above. That is the point of printing the LOSE column: a benchmark that can embarrass you is the only kind that can be trusted when it doesn't.
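The difference between the two materialization paths can be sketched in a few lines. This is an illustration only, not XERJ's actual code: `materializeAll`, `materializeTop`, `countingLoader`, and the shape of the `matches` array are hypothetical, and the match list is assumed to be already ranked.

```javascript
// Hypothetical sketch: names and data shapes are illustrative assumptions,
// not XERJ internals. `matches` is assumed to be ranked already.

// Slow path: load every matching document, then cut the page. O(N) loads.
function materializeAll(matches, from, size, loadDoc) {
  return matches.map((m) => loadDoc(m.id)).slice(from, from + size);
}

// Fixed path: cut the page first, then load only what is returned.
// Work is bounded by from + size, however many documents matched.
function materializeTop(matches, from, size, loadDoc) {
  return matches.slice(from, from + size).map((m) => loadDoc(m.id));
}

// A loader that counts how many documents it was asked to materialize.
function countingLoader() {
  const state = { loads: 0 };
  const loadDoc = (id) => {
    state.loads += 1;
    return { _id: id };
  };
  return { state, loadDoc };
}

// 100,000 synthetic matches, as a match_all over a small index might produce.
const matches = Array.from({ length: 100000 }, (_, i) => ({ id: i }));

const slow = countingLoader();
const slowPage = materializeAll(matches, 0, 10, slow.loadDoc);

const fast = countingLoader();
const fastPage = materializeTop(matches, 0, 10, fast.loadDoc);

console.log(slow.state.loads, fast.state.loads); // 100000 10
```

Both functions return the same ten hits. Only the number of documents loaded differs, and a cached repeat of the query hides that cost entirely.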
All 91 rows from demo/playbooks/SCORECARD.md, unedited. Latency rows are p50 in milliseconds; ingest rows are docs/s. Read this run's caveat first: partway through, the search_after family (its 9,924 ms row below) triggered a defect that drove XERJ to an out-of-memory kill. Rows recorded as collapsed or unsupported on the XERJ side after that point are casualties of that crash — the engine was down or dying when those families ran, not measurably slower. They are scored LOSE anyway, because an engine that dies mid-benchmark loses those rows. A fix is in flight (see What's Next); the matrix will be re-run and this page updated when it lands.
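To make the scoring concrete, here is a minimal sketch of how a p50 and a verdict could be derived from raw measurements. The authoritative rules live in demo/playbooks/bench-matrix.mjs; everything below is an assumption made for illustration: the function names, the median method, the absence of a tie band (the page scores 0.97× as a loss), and the use of `null` for a collapsed measurement. The sample numbers are invented.

```javascript
// Hypothetical scoring sketch; not the harness's real code.

// Median (p50) of latency samples in milliseconds.
function p50(samples) {
  const sorted = [...samples].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1
    ? sorted[mid]
    : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Orient the ratio so that a value above 1 favours XERJ for both row kinds:
//   latency rows (ms, lower is better):    other / xerj
//   ingest rows (docs/s, higher is better): xerj / other
function ratio(kind, xerj, other) {
  return kind === "latency" ? other / xerj : xerj / other;
}

// Assumption: a ratio below 1 is a LOSE, and a collapsed XERJ measurement
// (represented here as null) is a LOSE as well, as the page describes.
function verdict(kind, xerj, other) {
  if (xerj === null) return "LOSE";
  return ratio(kind, xerj, other) >= 1 ? "WIN" : "LOSE";
}

// Invented sample values, for illustration only.
console.log(p50([12, 10, 11]));            // 11
console.log(verdict("latency", 10, 13));   // WIN  (ratio 1.3)
console.log(verdict("ingest", 55, 100));   // LOSE (ratio 0.55)
console.log(verdict("latency", null, 13)); // LOSE (engine was down)
```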
Beyond the crash casualties, the honest losses in this run are the two 1M-doc ingest throughput rows, 1 client (0.94×) and 8 clients (0.55×), plus three latency rows: rare_terms (0.97×), auto_date_histogram (0.97×), and search_after itself. All five are tracked work items.
Everything on this page regenerates from the repo. No hosted harness, no private dataset, no hand-tuned engine flags.
$ git clone https://github.com/xerj-org/xerj && cd xerj
$ cargo build --release --manifest-path engine/Cargo.toml
$ bash scratchpad/es_up.sh
$ bash scratchpad/run_scorecard.sh --docs 100k,1m --clients 1,8 --knn --mixed
Your absolute numbers will differ with hardware; the ratios and verdicts are the claim. If your run disagrees with this page, file an issue with your SCORECARD.md — that is precisely what the harness is for.
The deep-pagination search_after family exposed a defect that ballooned memory until the kernel killed XERJ mid-run — the single largest distortion in this scorecard (28 of the 33 LOSE rows are its casualties). The fix is in flight on a dedicated branch; the matrix re-runs, and this page is republished, when it lands. Finding this class of bug is what the benchmark exists to do.
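As a quick consistency check, the figures quoted on this page add up as follows. The sketch only restates numbers already given in the text; the variable names are illustrative.

```javascript
// Figures copied from this page; nothing here is newly measured.
const score = { WIN: 43, LOSE: 33, "N/A": 15 };
const crashCasualties = 28;
const honestLosses = [
  "ingest 1M x 1 client",
  "ingest 1M x 8 clients",
  "rare_terms",
  "auto_date_histogram",
  "search_after",
];

// All verdicts together should cover every measured dimension.
const totalRows = Object.values(score).reduce((a, b) => a + b, 0);

// Crash casualties plus honest losses should account for every LOSE row.
const losesAccounted = crashCasualties + honestLosses.length;

// Share of the LOSE column attributable to the crash.
const casualtyShare = ((crashCasualties / score.LOSE) * 100).toFixed(1);

console.log(totalRows);                     // 91
console.log(losesAccounted === score.LOSE); // true
console.log(casualtyShare + "%");           // 84.8%
```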
XERJ wins ingest at 100k docs (1.32× single-client, 1.17× at 8 clients) but loses at 1M: 0.94× single-client and 0.55× at 8 clients. The phased plan (demo/playbooks/BEAT_ES_MASTER_PLAN.md): a no-reparse flush that threads already-parsed documents into segment building instead of re-parsing them twice; search-pool isolation so background flush and merge can't starve foreground queries; and a freeze-and-swap flush that swaps in a fresh memtable atomically so writers never stall behind a drain. Definition of done: every cell in the scorecard green, enforced by CI — any new LOSE fails the build.
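The freeze-and-swap idea can be sketched as follows. This is a single-threaded illustration under stated assumptions, not the planned implementation: `Memtable`, `Writer`, `swap`, and `drain` are hypothetical names, and a real multi-threaded engine would need an atomic pointer swap or a short lock where this sketch simply reassigns a field.

```javascript
// Hypothetical sketch of a freeze-and-swap flush; all names are illustrative.
class Memtable {
  constructor() {
    this.docs = [];
    this.frozen = false;
  }
  add(doc) {
    if (this.frozen) throw new Error("memtable is frozen");
    this.docs.push(doc);
  }
}

class Writer {
  constructor() {
    this.active = new Memtable();
    this.segments = [];
  }

  // Writers only ever touch the active memtable.
  index(doc) {
    this.active.add(doc);
  }

  // Freeze the current memtable and install a fresh one in one step.
  // After this returns, new writes land in the fresh memtable.
  swap() {
    const frozen = this.active;
    frozen.frozen = true;
    this.active = new Memtable();
    return frozen;
  }

  // Drain the frozen memtable into a segment. Because it is frozen and no
  // longer active, this can run without blocking index().
  drain(frozen) {
    this.segments.push({ docs: frozen.docs.slice() });
  }
}

const w = new Writer();
["a", "b", "c"].forEach((id) => w.index({ id }));

const frozen = w.swap();            // flush begins
["d", "e"].forEach((id) => w.index({ id })); // writes continue meanwhile
w.drain(frozen);                    // flush completes

console.log(w.segments[0].docs.length, w.active.docs.length); // 3 2
```

The two writes issued between `swap()` and `drain()` never wait on the flush, which is the property the plan is after.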
Skipped, not hidden — each needs a purpose-built index the flat telemetry corpus lacks: geo_* queries and aggregations (no geo_point/geo_shape field), ip_range / ip_prefix (no ip field), nested / has_child / has_parent (flat corpus, no join mapping), span_* (needs a positional text field), significant_text (corpus fields are keyword), semantic / hybrid retriever (needs a dense_vector field — kNN is covered separately by --knn on a purpose-built index), and percolate (parses but no-ops — not benchmarkable for correctness). Purpose-built corpora for these families are planned follow-ups.
Every performance change must hold ES-YAML REST conformance at 1326 passed / 0 failed. Speed bought with correctness is not a win, and it does not merge.
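Taken together with the rule that any new LOSE fails the build, the merge condition might be expressed as a single gate. This is a sketch under assumptions: `mergeGate`, the row shape, and the `conformance` object are hypothetical, the row names are placeholders, and treating 1326 as a minimum pass count rather than an exact one is a guess.

```javascript
// Hypothetical merge gate; function name and data shapes are assumptions.
const REQUIRED_PASSED = 1326; // conformance pass count quoted on this page

function mergeGate(baseline, current, conformance) {
  const reasons = [];

  // Rows that were already LOSE in the baseline are known work items;
  // only rows that newly became LOSE block the merge.
  const wasLose = new Set(
    baseline.filter((r) => r.verdict === "LOSE").map((r) => r.row)
  );
  for (const r of current) {
    if (r.verdict === "LOSE" && !wasLose.has(r.row)) {
      reasons.push(`new LOSE: ${r.row}`);
    }
  }

  // Speed must not be bought with correctness.
  if (conformance.failed !== 0 || conformance.passed < REQUIRED_PASSED) {
    reasons.push(
      `conformance: ${conformance.passed} passed / ${conformance.failed} failed`
    );
  }

  return { ok: reasons.length === 0, reasons };
}

// Placeholder rows, for illustration only.
const baseline = [
  { row: "row_a", verdict: "LOSE" },
  { row: "row_b", verdict: "WIN" },
];
const regressed = [
  { row: "row_a", verdict: "LOSE" },
  { row: "row_b", verdict: "LOSE" },
];

const clean = mergeGate(baseline, baseline, { passed: 1326, failed: 0 });
const blocked = mergeGate(baseline, regressed, { passed: 1325, failed: 1 });

console.log(clean.ok);        // true
console.log(blocked.reasons); // one new LOSE plus one conformance failure
```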