# Benchmarks

For the raw research artifacts and patch backlog that produced the current public numbers, see:

- docs/benchmarks/BENCHMARK_RESEARCH_2026-03-24.md
- docs/benchmarks/BENCHMARK_PATCH_PLAN_2026-03-24.md
- docs/benchmarks/BENCHMARK_HARNESS_SPEC_2026-03-24.md

This page is the short product-truth layer for the current benchmark corpus.
## What We Measure Now

The current benchmark system is not only about the token proxy. It also tracks whether m1nd improves workflow behavior:

- token proxy / context churn
- `false_starts`
- guided follow-through
- recovery loops
- proof-state progression
- progress observability on long-running writes

That matters because some of m1nd's strongest wins are continuity, repair, and execution clarity rather than raw compression in every single scenario.
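As a rough illustration, metrics like these can be derived from per-run event logs. The sketch below is hypothetical: the event shape, the field names (`tokens`, `kind`), and the kind values are illustrative stand-ins, not the real harness schema.

```python
# Hypothetical sketch: derive workflow metrics from a list of per-run
# event dicts. Field names and kind values are illustrative only.
def summarize(events):
    return {
        "token_proxy": sum(e.get("tokens", 0) for e in events),
        "false_starts": sum(1 for e in events if e.get("kind") == "false_start"),
        "guided_follow_throughs": sum(
            1 for e in events if e.get("kind") == "guided_follow_through"
        ),
        "recovery_loops": sum(1 for e in events if e.get("kind") == "recovery_closed"),
    }
```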
## Current Warm-Graph Corpus

The current recorded aggregate warm-graph corpus shows:

| Metric | Manual | m1nd (warm) | Result |
|---|---|---|---|
| Aggregate token proxy | 10,518 | 5,182 | 50.73% reduction |
| False starts | 14 | 0 | m1nd eliminates the recorded false starts |
| Guided follow-throughs | 0 | 31 | guided next-step behavior is being followed in real runs |
| Successful recovery loops | 0 | 12 | repair loops are closing instead of restarting from scratch |

These are the public numbers reflected in the README and landing page. They are the benchmark truth to mirror across docs.
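The headline reduction is simple arithmetic over the aggregate token proxies in the table:

```python
manual, warm = 10_518, 5_182
print(f"{(manual - warm) / manual:.2%}")  # 50.73%
```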
## Representative Engine Timings

The underlying engine remains fast on the measured production backend (~335 files, ~52K lines, 9,767 nodes, 26,557 edges):
| Operation | Time | Notes |
|---|---|---|
| Full ingest | ~910ms | Walk + extract + resolve + finalize |
| Activate query | ~31ms | Four-dimensional ranking |
| Impact analysis | ~5ms | Blast-radius path |
| Trace analysis | ~3.5ms | Stacktrace to suspects |
| Trail resume | ~0.2ms | Continuity restore + hints |
| Apply batch | ~165ms | Atomic multi-file write before deeper verification |
These timings are useful, but they are no longer the whole story. Current m1nd is also measured on guided behavior and recovery quality.
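These are plain wall-clock measurements. If you want comparable figures from your own client, a minimal timing wrapper is enough; the callable you pass is your own, and nothing below is m1nd API:

```python
import time

# Generic wall-clock timer: wrap any client call (for example, a JSON-RPC
# request to the server) to collect numbers comparable to the table above.
def timed(label, fn, *args, **kwargs):
    t0 = time.perf_counter()
    result = fn(*args, **kwargs)
    print(f"{label}: {(time.perf_counter() - t0) * 1000:.1f}ms")
    return result
```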
## Where m1nd Wins

m1nd wins most clearly when the task is structural, stateful, or risky:

- stacktrace triage with `trace`
- blast-radius analysis with `impact`
- continuity restoration with `trail_resume`
- edit preparation with `surgical_context_v2` and `validate_plan`
- long-running writes with `apply_batch`
- repair loops after invalid regex, stale route sets, stale trails, protected writes, and stale edit previews
## Where Plain Tools Still Win

m1nd is not the headline tool for:

- exact text search
- one-file lookup when you already know the file
- compiler truth
- runtime logs and debugger work

Use `rg`, the compiler, the test runner, and logs when execution truth is the question. Use m1nd when navigation and connected structure are the bottleneck.
## Why The Corpus Matters

The benchmark corpus is now part of product development, not just a marketing appendix. Recent runtime and UX improvements were driven directly by measured benchmark pain:

- `proof_state` and next-step guidance on core flows
- more actionable `trail_resume`
- better `seek` handling for natural-language prompts
- reduced `validate_plan` noise
- more useful `surgical_context_v2`
- observable `apply_batch` progress and SSE handoff
- recovery-oriented error payloads for invalid or stale tool calls
## Reproducibility

To inspect the current benchmark system:

```bash
git clone https://github.com/maxkle1nz/m1nd.git
cd m1nd
cargo build --release --workspace
python3 scripts/benchmark/run_benchmark.py --help
python3 scripts/benchmark/summarize_benchmarks.py --help
```

The versioned scenarios, events, and run outputs live under docs/benchmarks/.
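To skim those run outputs programmatically, something like the following works; note that the glob and the `token_proxy` field are hypothetical stand-ins, so check BENCHMARK_HARNESS_SPEC_2026-03-24.md for the real schema:

```python
import json
import pathlib

# Hypothetical skim of run outputs under docs/benchmarks/; the file
# pattern and the "token_proxy" key are illustrative, not the real schema.
for path in sorted(pathlib.Path("docs/benchmarks").rglob("*.json")):
    data = json.loads(path.read_text())
    if isinstance(data, dict) and "token_proxy" in data:
        print(path.name, data["token_proxy"])
```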
## Memory
| Component | Size |
|---|---|
| Graph (9,767 nodes, 26,557 edges) | ~2MB |
| Plasticity state | ~500KB |
| Perspective state (per active perspective) | ~100KB |
| Lock baselines (per lock) | ~200KB |
| Trail storage (per saved trail) | ~50KB |
| JSON-RPC server overhead | ~5MB |
| Typical total | ~50MB |
Memory scales linearly with graph size. A 100K-node graph would use approximately 20MB for the graph alone, with similar overhead for the server.
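A back-of-envelope check of that linear claim, using the measured ~2MB for 9,767 nodes as the baseline:

```python
bytes_per_node = 2_000_000 / 9_767                  # ~205 bytes per node
print(f"~{100_000 * bytes_per_node / 1e6:.0f}MB")   # ~20MB for a 100K-node graph
```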
## CPU
m1nd is single-threaded for graph operations (no lock contention, deterministic results). Ingest uses Rayon for parallel file parsing. During query serving, CPU usage is negligible between queries and spikes briefly during activation (31-77ms of computation).
On an Apple M2, the server at idle uses <0.1% CPU. During a burst of queries, it peaks at ~5% of a single core.
## Scaling Characteristics

### Ingest Time vs Codebase Size

Ingest scales linearly with file count. Reference resolution is roughly O(n log n), where n is the number of cross-file references.
| Files | Estimated Ingest Time | Estimated Nodes |
|---|---|---|
| 100 | ~270ms | ~3,000 |
| 335 | 910ms (measured) | 9,767 (measured) |
| 1,000 | ~2.7s | ~29,000 |
| 10,000 | ~27s | ~290,000 |
| 100,000 | ~4.5min | ~2,900,000 |
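The estimated rows extrapolate linearly from the measured 335-file baseline, which you can reproduce in a few lines:

```python
ms_per_file = 910 / 335        # measured full ingest
nodes_per_file = 9_767 / 335   # measured node yield
for files in (100, 1_000, 10_000, 100_000):
    secs = files * ms_per_file / 1000
    print(f"{files:>7} files: ~{secs:.1f}s, ~{files * nodes_per_file:,.0f} nodes")
```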
### Activation Time vs Graph Size

Spreading activation is bounded by the number of edges traversed, which depends on graph density and query specificity rather than total graph size. Activation in a 100K-node graph is estimated at 100-200ms.
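A conceptual sketch of why that holds, assuming a plain adjacency-list graph. This is not m1nd's four-dimensional ranking, just the propagation skeleton: activation decays per hop, and the threshold cuts off traversal, so cost tracks the edges actually visited rather than the whole graph.

```python
from collections import deque

# Conceptual spreading activation over {node: [neighbors]}. For example,
# activate({"a": ["b"], "b": ["c"]}, ["a"]) -> {"a": 1.0, "b": 0.5, "c": 0.25}.
def activate(graph, seeds, decay=0.5, threshold=0.05):
    activation = {s: 1.0 for s in seeds}
    frontier = deque(seeds)
    while frontier:
        node = frontier.popleft()
        spread = activation[node] * decay
        if spread < threshold:
            continue  # below threshold: stop propagating from this node
        for neighbor in graph.get(node, ()):
            if spread > activation.get(neighbor, 0.0):
                activation[neighbor] = spread
                frontier.append(neighbor)
    return activation
```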
### Persistence Time vs State Size

JSON serialization scales linearly with state size. A 10K-node graph persists in under 100ms. A 100K-node graph would take approximately 1 second.
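You can see the linear trend with a synthetic payload; the node shape below is illustrative only:

```python
import json
import time

# Time JSON serialization of a synthetic 10K-entry graph-like dict.
nodes = [{"id": i, "name": f"node_{i}", "kind": "fn"} for i in range(10_000)]
t0 = time.perf_counter()
blob = json.dumps({"nodes": nodes})
print(f"{len(blob):,} bytes in {(time.perf_counter() - t0) * 1000:.0f}ms")
```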
## Reproducing Engine Timings

To reproduce the engine timings above:

```bash
git clone https://github.com/maxkle1nz/m1nd.git
cd m1nd
cargo build --release

# Start the server
./target/release/m1nd-mcp
```
Then send the JSON-RPC calls from the Examples document against your own codebase. Times will vary based on:
- Hardware (CPU speed, memory bandwidth)
- Codebase size and language
- Graph density (codebases with many cross-references produce denser graphs)
- Plasticity state (learned weights affect activation propagation paths)
Report your benchmarks via GitHub Issues with the benchmark label.