Benchmarks

For the raw research artifacts and patch backlog that produced the current public numbers, see:

  • docs/benchmarks/BENCHMARK_RESEARCH_2026-03-24.md
  • docs/benchmarks/BENCHMARK_PATCH_PLAN_2026-03-24.md
  • docs/benchmarks/BENCHMARK_HARNESS_SPEC_2026-03-24.md

This page is the short product-truth layer for the current benchmark corpus.

What We Measure Now

The current benchmark system measures more than a token proxy. It also tracks whether m1nd improves workflow behavior:

  • token proxy / context churn
  • false_starts
  • guided follow-through
  • recovery loops
  • proof-state progression
  • progress observability on long-running writes
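
For concreteness, a hypothetical per-run record with these metrics as fields. This is an illustrative sketch only, not the harness's actual schema; the field names simply mirror the list above.

from dataclasses import dataclass

@dataclass
class RunMetrics:
    # Illustrative shape only -- not the actual benchmark harness schema.
    token_proxy: int             # context churn proxy; lower is better
    false_starts: int            # runs that began down a wrong path
    guided_follow_throughs: int  # suggested next steps actually taken
    recovery_loops: int          # repair loops that closed successfully
    proof_state_steps: int       # proof-state progression events
    write_progress_events: int   # observability signals on long-running writes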

That matters because m1nd’s strongest wins are often continuity, repair, and execution clarity rather than raw compression; not every scenario shows a large token reduction.

Current Warm-Graph Corpus

The current recorded aggregate warm-graph corpus shows:

Metric                    | Manual | Warm  | Result
Aggregate token proxy     | 10,518 | 5,182 | 50.73% reduction
False starts              | 14     | 0     | m1nd eliminates the recorded false starts
Guided follow-throughs    | 0      | 31    | guided next-step behavior is being followed in real runs
Successful recovery loops | 0      | 12    | repair loops are closing instead of restarting from scratch
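
The headline reduction is plain arithmetic over the aggregate token-proxy totals above:

manual, warm = 10518, 5182
reduction = (manual - warm) / manual
print(f"{reduction:.2%}")  # 50.73%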

These are the public numbers reflected in the README and on the landing page, and the benchmark source of truth to mirror across the rest of the docs.

Representative Engine Timings

The underlying engine remains fast on the measured production backend (~335 files, ~52K lines, 9,767 nodes, 26,557 edges):

Operation       | Time   | Notes
Full ingest     | ~910ms | Walk + extract + resolve + finalize
Activate query  | ~31ms  | Four-dimensional ranking
Impact analysis | ~5ms   | Blast-radius path
Trace analysis  | ~3.5ms | Stacktrace to suspects
Trail resume    | ~0.2ms | Continuity restore + hints
Apply batch     | ~165ms | Atomic multi-file write before deeper verification

These timings are useful, but they are no longer the whole story. Current m1nd is also measured on guided behavior and recovery quality.

Where m1nd Wins

m1nd wins most clearly when the task is structural, stateful, or risky:

  • stacktrace triage with trace
  • blast-radius analysis with impact
  • continuity restoration with trail_resume
  • edit preparation with surgical_context_v2 and validate_plan
  • long-running writes with apply_batch
  • repair loops after invalid regex, stale route sets, stale trails, protected writes, and stale edit previews

Where Plain Tools Still Win

m1nd is not the headline tool for:

  • exact text search
  • one-file lookup when you already know the file
  • compiler truth
  • runtime logs and debugger work

Use rg, the compiler, the test runner, and logs when execution truth is the question. Use m1nd when navigation and connected structure are the bottleneck.

Why The Corpus Matters

The benchmark corpus is now part of product development, not just a marketing appendix.

Recent runtime and UX improvements were driven directly by measured benchmark pain:

  • proof_state and next-step guidance on core flows
  • more actionable trail_resume
  • better seek handling for natural-language prompts
  • reduced validate_plan noise
  • more useful surgical_context_v2
  • observable apply_batch progress and SSE handoff
  • recovery-oriented error payloads for invalid or stale tool calls

Reproducibility

To inspect the current benchmark system:

git clone https://github.com/maxkle1nz/m1nd.git
cd m1nd
cargo build --release --workspace
python3 scripts/benchmark/run_benchmark.py --help
python3 scripts/benchmark/summarize_benchmarks.py --help

The versioned scenarios, events, and run outputs live under docs/benchmarks/.

Memory

Component                                  | Size
Graph (9,767 nodes, 26,557 edges)          | ~2MB
Plasticity state                           | ~500KB
Perspective state (per active perspective) | ~100KB
Lock baselines (per lock)                  | ~200KB
Trail storage (per saved trail)            | ~50KB
JSON-RPC server overhead                   | ~5MB
Typical total                              | ~50MB

Memory scales linearly with graph size. A 100K-node graph would use approximately 20MB for the graph alone, with similar overhead for the server.
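
That estimate is a straight linear extrapolation from the measured graph; a quick sketch, using only the numbers from the table above:

measured_nodes, measured_mb = 9_767, 2.0  # from the Memory table
target_nodes = 100_000
graph_mb = measured_mb * target_nodes / measured_nodes
print(f"~{graph_mb:.0f}MB for the graph alone")  # ~20MB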

CPU

m1nd is single-threaded for graph operations (no lock contention, deterministic results). Ingest uses Rayon for parallel file parsing. During query serving, CPU usage is negligible between queries and spikes briefly during activation (31-77ms of computation).

On an Apple M2, the server at idle uses <0.1% CPU. During a burst of queries, it peaks at ~5% of a single core.

Scaling Characteristics

Ingest Time vs Codebase Size

Ingest scales linearly with file count. Reference resolution is roughly O(n log n) where n is the number of cross-file references.

Files   | Estimated Ingest Time | Estimated Nodes
100     | ~270ms                | ~3,000
335     | 910ms (measured)      | 9,767 (measured)
1,000   | ~2.7s                 | ~29,000
10,000  | ~27s                  | ~290,000
100,000 | ~4.5min               | ~2,900,000
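
The estimated rows are linear extrapolations from the single measured point (335 files, 910ms, 9,767 nodes). A sketch of the same arithmetic, ignoring the O(n log n) resolution term:

MEASURED_FILES, MEASURED_MS, MEASURED_NODES = 335, 910, 9_767

def estimate_ingest(files: int) -> tuple[float, int]:
    # Pure linear scaling from the measured run; estimates only.
    ms = MEASURED_MS * files / MEASURED_FILES
    nodes = round(MEASURED_NODES * files / MEASURED_FILES)
    return ms, nodes

for files in (100, 1_000, 10_000, 100_000):
    ms, nodes = estimate_ingest(files)
    print(f"{files:>7} files: ~{ms / 1000:.1f}s, ~{nodes:,} nodes")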

Activation Time vs Graph Size

Spreading activation is bounded by the number of edges traversed, which depends on graph density and query specificity rather than total graph size. Activation in a 100K-node graph is estimated at 100-200ms.

Persistence Time vs State Size

JSON serialization scales linearly with state size. A 10K-node graph persists in under 100ms. A 100K-node graph would take approximately 1 second.
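
You can sanity-check the linearity claim on your own machine by timing json.dumps over synthetic payloads of increasing size (a generic probe, not m1nd's actual state format):

import json, time

def time_dump(n_nodes: int) -> float:
    # Synthetic node map standing in for persisted graph state.
    state = {f"node_{i}": {"id": i, "edges": [i - 1, i + 1]} for i in range(n_nodes)}
    start = time.perf_counter()
    json.dumps(state)
    return time.perf_counter() - start

for n in (10_000, 100_000):
    print(f"{n:>7} nodes: {time_dump(n) * 1000:.0f}ms")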

Reproducing the Engine Timings

To reproduce these benchmarks:

git clone https://github.com/maxkle1nz/m1nd.git
cd m1nd
cargo build --release

# Start the server
./target/release/m1nd-mcp

Then send the JSON-RPC calls from the Examples document against your own codebase. Times will vary based on:

  • Hardware (CPU speed, memory bandwidth)
  • Codebase size and language
  • Graph density (codebases with many cross-references produce denser graphs)
  • Plasticity state (learned weights affect activation propagation paths)
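
A minimal timing harness, assuming the server speaks newline-delimited JSON-RPC 2.0 over stdio as MCP servers typically do; the method and params below are placeholders, so substitute real calls from the Examples document (real MCP sessions also begin with an initialize handshake, elided here):

import json, subprocess, time

proc = subprocess.Popen(
    ["./target/release/m1nd-mcp"],
    stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True,
)
# Placeholder request -- replace with a call from the Examples document.
request = {"jsonrpc": "2.0", "id": 1, "method": "activate",
           "params": {"query": "auth flow"}}
start = time.perf_counter()
proc.stdin.write(json.dumps(request) + "\n")
proc.stdin.flush()
response = json.loads(proc.stdout.readline())
print(f"{(time.perf_counter() - start) * 1000:.1f}ms", response.get("result"))
proc.terminate()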

Report your benchmarks via GitHub Issues with the benchmark label.