Ingestion Pipeline (m1nd-ingest)
m1nd-ingest transforms codebases and structured documents into the property graph consumed by m1nd-core. It handles file discovery, language-specific code extraction, cross-file reference resolution, incremental diff computation, and memory/markdown ingestion.
Source: m1nd-ingest/src/
Module Map
| Module | Purpose |
|---|---|
| `lib.rs` | `Ingestor` pipeline, `IngestAdapter` trait, config, stats |
| `walker.rs` | `DirectoryWalker`, binary detection, git enrichment |
| `extract/mod.rs` | `Extractor` trait, comment stripping, `CommentSyntax` |
| `extract/python.rs` | Python: classes, functions, decorators, imports |
| `extract/typescript.rs` | TypeScript/JS: classes, functions, interfaces, imports |
| `extract/rust_lang.rs` | Rust: structs, enums, impls, traits, functions, mods |
| `extract/go.rs` | Go: structs, interfaces, functions, packages |
| `extract/java.rs` | Java: classes, interfaces, methods, packages |
| `extract/generic.rs` | Fallback: file-level node with tag extraction |
| `resolve.rs` | `ReferenceResolver`, proximity disambiguation |
| `cargo_workspace.rs` | Cargo workspace/crate/dependency enrichment for Rust repos |
| `cross_file.rs` | Python-weighted cross-file enrichment (imports, tests, registers) |
| `diff.rs` | `GraphDiff` for incremental updates |
| `json_adapter.rs` | Generic JSON-to-graph adapter |
| `memory_adapter.rs` | Markdown/memory document adapter |
| `canonical.rs` | Canonical document substrate used by the universal lane |
| `merge.rs` | Graph merge utilities |
| `patent_adapter.rs` | USPTO/EPO patent XML ingestion |
| `jats_adapter.rs` | PubMed/JATS scientific article XML ingestion |
| `bibtex_adapter.rs` | BibTeX bibliography file ingestion |
| `rfc_adapter.rs` | IETF RFC XML v3 ingestion |
| `crossref_adapter.rs` | CrossRef API JSON (DOI metadata) ingestion |
| `document_router.rs` | Auto-detects document format and routes to the correct adapter |
| `universal_adapter.rs` | Best-effort document canonicalization, provider routing, and graphification |
| `cross_domain.rs` | Cross-domain edge resolution (DOI, ORCID, keyword bridges) |
Pipeline Overview
```mermaid
flowchart TD
    subgraph "Phase 1: Walk"
        DIR["Directory Root"]
        WALK["DirectoryWalker<br/>walkdir + skip rules"]
        BIN["Binary Detection<br/>NUL in first 8KB"]
        GIT["Git Enrichment<br/>commit counts + timestamps"]
        FILES["DiscoveredFile[]"]
    end
    subgraph "Phase 2: Extract (parallel)"
        RAYON["rayon::par_iter"]
        PY["PythonExtractor"]
        TS["TypeScriptExtractor"]
        RS["RustExtractor"]
        GO["GoExtractor"]
        JAVA["JavaExtractor"]
        GEN["GenericExtractor"]
        RESULTS["ExtractionResult[]"]
    end
    subgraph "Phase 3: Build Graph"
        NODES["Add all nodes<br/>with provenance + timestamps"]
        EDGES["Add resolved edges<br/>with causal strengths"]
        CAUSAL["Causal strength assignment<br/>contains=0.8, imports=0.6, etc."]
    end
    subgraph "Phase 4: Resolve + Enrich"
        REFS["ReferenceResolver<br/>ref:: → actual nodes"]
        PROX["Proximity Disambiguation<br/>same file > same dir > same project"]
        HINTS["Import Hints<br/>module path → target preference"]
        CARGO["Cargo Workspace<br/>workspace/crate/dependency graph"]
        XFILE["Cross-File Enrichment<br/>currently strongest for Python"]
    end
    subgraph "Phase 5: Finalize"
        CSR["Build CSR<br/>sort edges, compact arrays"]
        BIDIR["Expand Bidirectional<br/>contains, implements"]
        REV["Build Reverse CSR"]
        PR["PageRank<br/>power iteration"]
    end
    DIR --> WALK --> BIN --> GIT --> FILES
    FILES --> RAYON
    RAYON --> PY & TS & RS & GO & JAVA & GEN
    PY & TS & RS & GO & JAVA & GEN --> RESULTS
    RESULTS --> NODES --> EDGES --> CAUSAL
    CAUSAL --> REFS --> PROX
    PROX --> HINTS --> CARGO --> XFILE
    XFILE --> CSR --> BIDIR --> REV --> PR
```
Phase 1: Directory Walking
DirectoryWalker uses the walkdir crate to traverse the filesystem. It applies skip rules, detects binary files, and enriches results with git metadata.
Skip Rules
Default skip rules (configured via `IngestConfig`):

```rust
skip_dirs: vec![
    ".git", "node_modules", "__pycache__", ".venv",
    "target", "dist", "build", ".next", "vendor",
],
skip_files: vec![
    "package-lock.json", "yarn.lock", "Cargo.lock", "poetry.lock",
],
```
Hidden directories (starting with .) are skipped unless they are the root. Symlinks are not followed.
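The skip decision can be sketched in a few lines; `should_skip_dir` is an illustrative helper, not the walker's actual API:

```rust
/// Sketch of the walker's directory pruning: named skip dirs are always
/// pruned, and hidden directories (leading '.') are pruned unless they
/// are the traversal root.
fn should_skip_dir(name: &str, is_root: bool, skip_dirs: &[&str]) -> bool {
    if skip_dirs.contains(&name) {
        return true;
    }
    !is_root && name.starts_with('.')
}
```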
Binary Detection (FM-ING-004)
After discovering a file, the walker reads the first 8KB and checks for NUL bytes (0x00). Any file containing NUL is classified as binary and skipped. This prevents feeding compiled binaries, images, or other non-text files into the language extractors.
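The heuristic itself is tiny; `looks_binary` is an illustrative name, not the crate's actual API:

```rust
/// FM-ING-004 sketch: a file is treated as binary if its first 8 KB
/// contain a NUL byte (0x00).
fn looks_binary(bytes: &[u8]) -> bool {
    bytes.iter().take(8192).any(|&b| b == 0)
}
```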
Git Enrichment
If the root is inside a git repository, the walker runs git log --format=%at --name-only to extract:
- Commit count per file: Used to compute `change_frequency` (1 commit = 0.1, 50+ = 1.0, capped).
- Most recent commit timestamp: Used for temporal decay scoring.
- Commit groups: Sets of files that changed together in the same commit. Fed into the `CoChangeMatrix` after graph finalization.
The result is a WalkResult containing Vec<DiscoveredFile> and Vec<Vec<String>> commit groups.
```rust
pub struct DiscoveredFile {
    pub path: PathBuf,
    pub relative_path: String,
    pub extension: Option<String>,
    pub size_bytes: u64,
    pub last_modified: f64,
    pub commit_count: u32,
    pub last_commit_time: f64,
}
```
Phase 2: Parallel Extraction
Files are distributed across rayon’s thread pool for concurrent extraction. Each file is assigned a language-specific extractor based on its extension:
| Extension | Extractor | Extracted Entities |
|---|---|---|
| `.py`, `.pyi` | `PythonExtractor` | Classes, functions, decorators, imports, global assignments |
| `.ts`, `.tsx`, `.js`, `.jsx`, `.mjs`, `.cjs` | `TypeScriptExtractor` | Classes, functions, interfaces, type aliases, imports, exports |
| `.rs` | `RustExtractor` | Structs, enums, traits, impls, functions, modules, macros |
| `.go` | `GoExtractor` | Structs, interfaces, functions, methods, packages |
| `.java` | `JavaExtractor` | Classes, interfaces, methods, fields, packages |
| everything else | `GenericExtractor` | File-level node with tag extraction from content |
This stack is hybrid:
- native/manual extractors for Python, TypeScript/JavaScript, Rust, Go, and Java
- tree-sitter-backed tiers for additional languages
- generic fallback for unsupported files
Extractor Interface
All extractors implement the Extractor trait:
```rust
pub trait Extractor: Send + Sync {
    fn extract(&self, content: &[u8], file_id: &str) -> M1ndResult<ExtractionResult>;
    fn extensions(&self) -> &[&str];
}
```
An ExtractionResult contains:
```rust
pub struct ExtractionResult {
    pub nodes: Vec<ExtractedNode>,
    pub edges: Vec<ExtractedEdge>,
    pub unresolved_refs: Vec<String>,
}
```
Each extracted node carries:
```rust
pub struct ExtractedNode {
    pub id: String,         // e.g. "file::backend/chat_handler.py::ChatHandler"
    pub label: String,      // e.g. "ChatHandler"
    pub node_type: NodeType,
    pub tags: Vec<String>,
    pub line: u32,
    pub end_line: u32,
}
```
Comment and String Stripping
Before extraction, each file’s content passes through strip_comments_and_strings() which removes comments and string literals to prevent false-positive matches from regex extractors. The function preserves import line string content (so from "react" still resolves) but strips string bodies elsewhere.
Comment syntax is per-language:
```rust
pub struct CommentSyntax {
    pub line: &'static str,        // e.g. "//" or "#"
    pub block_open: &'static str,  // e.g. "/*"
    pub block_close: &'static str, // e.g. "*/"
}
```
Supported: Rust (//, /* */), Python (#, """ """), C-style (//, /* */), Go (//, /* */), Generic (#, none).
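As a toy illustration of the idea, a line-comment-only stripper might look like this. It is deliberately naive: the real `strip_comments_and_strings()` also handles block comments and string literals (and preserves import-line strings), whereas this sketch would incorrectly split on markers inside strings:

```rust
/// Toy sketch: drop everything after the first line-comment marker on
/// each line, then trim trailing whitespace.
fn strip_line_comments(src: &str, marker: &str) -> String {
    src.lines()
        .map(|l| l.split(marker).next().unwrap_or(""))
        .map(|l| l.trim_end())
        .collect::<Vec<_>>()
        .join("\n")
}
```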
Node ID Format
Extracted nodes use a hierarchical ID scheme: file::{relative_path}::{entity_name}. For example:
- `file::backend/chat_handler.py` (file node)
- `file::backend/chat_handler.py::ChatHandler` (class node)
- `file::backend/chat_handler.py::ChatHandler::handle_message` (method node)

Unresolved references use the `ref::` prefix: `ref::Config`, `ref::react`. These are resolved to actual nodes in Phase 4.
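A hypothetical helper showing how such IDs compose (`node_id` is illustrative, not the crate's actual API):

```rust
/// Build a hierarchical ID in the file::{relative_path}::{entity} scheme.
/// Passing None produces the file node's own ID.
fn node_id(path: &str, entity: Option<&str>) -> String {
    match entity {
        Some(name) => format!("file::{}::{}", path, name),
        None => format!("file::{}", path),
    }
}
```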
Phase 3: Graph Building
After parallel extraction completes, results are collected and processed sequentially (graph mutation is single-threaded).
Node Creation
For each ExtractedNode:
- Look up the file timestamp from git enrichment (or filesystem mtime).
- Compute `change_frequency` from git commit count: `(commits / 50).clamp(0.1, 1.0)`. Default 0.3 for non-git repos.
- Call `graph.add_node()` with the external ID, label, node type, tags, timestamp, and change frequency.
- Set provenance: `source_path`, `line_start`, `line_end`, `namespace="code"`.
- On `DuplicateNode` error, increment the collision counter and continue.
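The `change_frequency` formula above can be sketched directly (function name illustrative):

```rust
/// commits/50, clamped to [0.1, 1.0]: 1 commit ≈ 0.1, 50+ commits = 1.0.
fn change_frequency(commits: u32) -> f64 {
    (commits as f64 / 50.0).clamp(0.1, 1.0)
}
```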
Edge Creation
For each ExtractedEdge (skipping `ref::` targets, which are deferred):
- Resolve source and target IDs to `NodeId`.
- Assign causal strength by relation type:
| Relation | Causal Strength | Direction |
|---|---|---|
| `contains` | 0.8 | Bidirectional |
| `implements` | 0.7 | Bidirectional |
| `imports` | 0.6 | Forward |
| `calls` | 0.5 | Forward |
| `references` | 0.3 | Forward |
| other | 0.4 | Forward |
contains and implements edges are bidirectional so that both parent-to-child and child-to-parent navigation work.
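The strength table maps naturally onto a match expression; `causal_strength` is an illustrative name, not the crate's actual API:

```rust
/// Causal strength per relation type, per the table above; unknown
/// relations fall through to 0.4.
fn causal_strength(relation: &str) -> f32 {
    match relation {
        "contains" => 0.8,
        "implements" => 0.7,
        "imports" => 0.6,
        "calls" => 0.5,
        "references" => 0.3,
        _ => 0.4,
    }
}
```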
Safety Guards (FM-ING-002)
Two budget checks run between sequential file processing:
- Timeout: If `start.elapsed() > config.timeout` (default 300s), stop processing.
- Node budget: If `nodes >= config.max_nodes` (default 500K), stop processing.
Both log a warning and break from the build loop, producing a partial but consistent graph from whatever was processed.
Phase 4: Reference Resolution And Enrichment
ReferenceResolver
Unresolved references (ref::Config, ref::FastAPI, etc.) are resolved to actual graph nodes using the ReferenceResolver.
Multi-value label index (FM-ING-008): The resolver builds a HashMap from labels to lists of matching NodeIds. When multiple nodes share a label (e.g., multiple files define a Config class), proximity disambiguation selects the best match:
| Proximity | Score | Condition |
|---|---|---|
| Same file | 100 | Source and target share the same file:: prefix |
| Same directory | 50 | Source and target share the same directory |
| Same project | 10 | Default (both exist in the graph) |
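The scoring can be sketched under the `file::{path}::{entity}` ID scheme; `proximity_score` and its ID parsing are illustrative, not the resolver's actual code:

```rust
/// Proximity disambiguation sketch: same file (100) beats same
/// directory (50) beats same project (10).
fn proximity_score(source_id: &str, candidate_id: &str) -> u32 {
    // Extract the {path} component of a file::{path}::{entity} ID.
    let file_of = |id: &str| id.splitn(3, "::").nth(1).map(str::to_string);
    // Extract the directory part of that path ("" for root-level files).
    let dir_of = |id: &str| {
        file_of(id).map(|f| f.rsplitn(2, '/').nth(1).unwrap_or("").to_string())
    };
    match (file_of(source_id), file_of(candidate_id)) {
        (Some(a), Some(b)) if a == b => 100,
        _ => match (dir_of(source_id), dir_of(candidate_id)) {
            (Some(a), Some(b)) if a == b => 50,
            _ => 10,
        },
    }
}
```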
Import hint disambiguation: When the extractor sees from foo.bar import Baz, it records an import hint mapping (source_file, "ref::Baz") to the module path foo.bar. The resolver uses this hint to prefer the Baz node under foo/bar/ over a same-named node elsewhere.
Resolution outcome per reference:
- Resolved: Exactly one match (or best proximity match). Edge created with the resolved `NodeId`.
- Ambiguous: Multiple matches with equal proximity. Best guess selected, counted in stats.
- Unresolved: No match found. Counted in stats, no edge created.
Cargo Workspace Enrichment
For Rust repos, cargo_workspace.rs adds a workspace-aware layer before finalization:
- workspace nodes
- crate nodes
- crate -> file `contains` edges
- crate -> crate `depends_on` edges for internal workspace dependencies
- external dependency nodes for non-workspace dependencies
This means Rust repos are no longer represented only as file graphs.
Cross-File Enrichment
After reference resolution and Cargo enrichment, cross_file.rs adds a narrower set of shipped cross-file edges.
Today this pass is strongest for Python and focuses on:
- `imports`
- `tests`
- `registers`
It should not be described as a language-uniform cross-file engine yet.
Phase 5: Finalization
Graph.finalize() transforms the mutable graph into its read-optimized CSR form:
- Sort edges by source: All pending edges are sorted by `source.0` (node index).
- Build forward CSR: Compute the offsets array; pack targets/weights/relations/etc. into parallel arrays.
- Expand bidirectional edges: For each bidirectional edge `(A, B)`, ensure both `A->B` and `B->A` exist in the CSR.
- Build reverse CSR: Sort edges by target; build `rev_offsets`, `rev_sources`, `rev_edge_idx` (mapping back to forward array indices).
- Rebuild plasticity arrays: Allocate a `PlasticityNode` for each node with the default ceiling.
- Compute PageRank: Power iteration with damping 0.85, max 50 iterations, convergence 1e-6.
After finalization, the graph is immutable (except for atomic weight updates by plasticity).
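The PageRank step can be sketched as a plain power iteration with the stated parameters (damping 0.85, convergence 1e-6). This standalone version uses a nested-Vec adjacency list rather than the real CSR arrays:

```rust
/// Power-iteration PageRank over an out-edge adjacency list.
fn pagerank(out_edges: &[Vec<usize>], max_iter: usize) -> Vec<f64> {
    let n = out_edges.len();
    let damping = 0.85;
    let mut rank = vec![1.0 / n as f64; n];
    for _ in 0..max_iter {
        // Teleportation term, then distribute damped rank along edges.
        let mut next = vec![(1.0 - damping) / n as f64; n];
        for (src, targets) in out_edges.iter().enumerate() {
            if targets.is_empty() {
                // Dangling node: spread its rank uniformly.
                for r in next.iter_mut() {
                    *r += damping * rank[src] / n as f64;
                }
            } else {
                let share = damping * rank[src] / targets.len() as f64;
                for &t in targets {
                    next[t] += share;
                }
            }
        }
        let delta: f64 = rank.iter().zip(&next).map(|(a, b)| (a - b).abs()).sum();
        rank = next;
        if delta < 1e-6 {
            break; // converged
        }
    }
    rank
}
```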
Incremental Ingestion (GraphDiff)
diff.rs enables incremental updates without full re-ingestion.
```rust
pub enum DiffAction {
    AddNode(ExtractedNode),
    RemoveNode(String),
    // Field types below are inferred from the surrounding structs.
    ModifyNode {
        external_id: String,
        new_label: String,
        new_tags: Vec<String>,
        new_last_modified: f64,
    },
    AddEdge(ExtractedEdge),
    RemoveEdge { source_id: String, target_id: String, relation: String },
    ModifyEdgeWeight { source_id: String, target_id: String, relation: String, new_weight: f32 },
}
```
GraphDiff::compute() compares old and new extraction results by indexing both into HashMaps by external ID, then classifying each node/edge as added, removed, or modified.
GraphDiff::apply() executes the diff against a live graph. Note: CSR does not support true node/edge removal. “Removed” nodes are tombstoned (zero weight, empty label) rather than physically deleted. A full re-ingest is needed to reclaim space.
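The compute step's classification can be sketched on simplified `(id, label)` pairs; `classify` is illustrative, not the actual `GraphDiff::compute()` signature:

```rust
use std::collections::HashMap;

/// Index old and new node sets by external ID, then classify each ID as
/// added, removed, or (label changed) modified.
fn classify(
    old: &[(String, String)],
    new: &[(String, String)],
) -> (Vec<String>, Vec<String>, Vec<String>) {
    let old_map: HashMap<String, String> = old.iter().cloned().collect();
    let new_map: HashMap<String, String> = new.iter().cloned().collect();
    let mut added: Vec<String> =
        new_map.keys().filter(|k| !old_map.contains_key(*k)).cloned().collect();
    let mut removed: Vec<String> =
        old_map.keys().filter(|k| !new_map.contains_key(*k)).cloned().collect();
    let mut modified: Vec<String> = new_map
        .iter()
        .filter(|(k, v)| old_map.get(*k).map_or(false, |ov| ov != *v))
        .map(|(k, _)| (*k).clone())
        .collect();
    // Sort for deterministic output (HashMap iteration order is arbitrary).
    added.sort();
    removed.sort();
    modified.sort();
    (added, removed, modified)
}
```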
When to use incremental vs full:
| Scenario | Strategy |
|---|---|
| Single file changed | Incremental diff (fast, ~10ms) |
| Many files changed (>20%) | Full re-ingest (cleaner CSR, correct PageRank) |
| New codebase | Full ingest |
| Plasticity state important | Full ingest + plasticity reimport (triple matching) |
Memory Adapter
MemoryIngestAdapter converts markdown documents into graph nodes. It implements the IngestAdapter trait with domain "memory".
Supported Formats
Files with extensions .md, .markdown, or .txt are accepted. The adapter walks a directory of memory files and parses each one into:
- Section nodes (`NodeType::Concept`): Created from markdown headings (`#`, `##`, etc.).
- Entry nodes (`NodeType::Process`): Created from list items under sections.
- File reference nodes (`NodeType::Reference`): Created from file paths mentioned in content.
Entry Classification
List items are classified by content patterns:
| Classification | Pattern | Example |
|---|---|---|
| Task | Contains “TODO”, “FIXME”, “pending”, “implement” | “- TODO: add tests” |
| Decision | Contains “decision:”, “decided:”, “chose” | “- Decision: use CSR format” |
| State | Contains “status:”, “state:”, “current:” | “- Status: in progress” |
| Event | Contains date pattern (YYYY-MM-DD) | “- 2026-03-12: deployed” |
| Note | Default | “- Config lives in settings.py” |
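A minimal sketch of the pattern checks above, in priority order with Note as the default; `classify_entry` is an illustrative name and the pattern lists are abbreviated:

```rust
/// Classify a memory list item by content patterns (case-insensitive).
fn classify_entry(text: &str) -> &'static str {
    let t = text.to_lowercase();
    if t.contains("todo") || t.contains("fixme") || t.contains("pending") || t.contains("implement") {
        "task"
    } else if t.contains("decision:") || t.contains("decided:") || t.contains("chose") {
        "decision"
    } else if t.contains("status:") || t.contains("state:") || t.contains("current:") {
        "state"
    } else if has_date(&t) {
        "event"
    } else {
        "note"
    }
}

/// Crude YYYY-MM-DD detector (no regex crate in this sketch).
fn has_date(t: &str) -> bool {
    t.as_bytes().windows(10).any(|w| {
        w[0].is_ascii_digit() && w[1].is_ascii_digit() && w[2].is_ascii_digit()
            && w[3].is_ascii_digit() && w[4] == b'-'
            && w[5].is_ascii_digit() && w[6].is_ascii_digit() && w[7] == b'-'
            && w[8].is_ascii_digit() && w[9].is_ascii_digit()
    })
}
```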
Edges
The adapter creates:
- `contains` edges from section to child entries.
- `references` edges from entries to file reference nodes.
- `follows` edges between sequential entries in the same section.
This allows the activation engine to traverse from a concept (“plasticity”) through memory entries to referenced code files, bridging the semantic gap between human notes and source code.
IngestAdapter Trait
The IngestAdapter trait enables domain-specific ingestion beyond code:
```rust
pub trait IngestAdapter: Send + Sync {
    fn domain(&self) -> &str;
    fn ingest(&self, root: &Path) -> M1ndResult<(Graph, IngestStats)>;
}
```
Implemented adapters:
| Adapter | Domain | Input | MCP `adapter=` |
|---|---|---|---|
| `Ingestor` | `"code"` | Source code directories | `code` |
| `MemoryIngestAdapter` | `"memory"` | Markdown/text documents | `memory` |
| `JsonIngestAdapter` | `"generic"` | Arbitrary JSON with `nodes[]` and `edges[]` | `json` |
| `PatentIngestAdapter` | `"patent"` | USPTO/EPO patent XML | `patent` |
| `JatsArticleAdapter` | `"article"` | PubMed NLM / JATS Z39.96 XML | `article` |
| `BibTexAdapter` | `"bibtex"` | BibTeX bibliography files | `bibtex`, `bib` |
| `RfcAdapter` | `"rfc"` | IETF RFC XML v3 | `rfc` |
| `CrossRefAdapter` | `"crossref"` | CrossRef API JSON (DOI metadata) | `crossref`, `doi` |
| `L1ghtIngestAdapter` | `"light"` | L1GHT protocol Markdown | `light` |
The JSON adapter is the escape hatch for importing graphs from external tools. It expects a JSON document with nodes (array of {id, label, type, tags}) and edges (array of {source, target, relation, weight}).
Universal Document Lane
The universal lane is the best-effort document path for sources that are not authored in L1GHT and are not already handled by a stronger native structured adapter.
Its flow is:
- detect document family
- normalize into a
CanonicalDocument - graphify sections, blocks, tables, citations, entities, and claims
Optional providers can enrich the lane when available:
- `Trafilatura` for HTML/wiki/article extraction
- `Docling` for office and broad document canonicalization
- `MarkItDown` as a lightweight fallback lane
- `GROBID` for scholarly PDFs
This provider stack is intentionally optional. The default green path does not require these providers; richer extraction appears only when the environment supports it.
Document Router (Auto-Detection)
DocumentRouter inspects file content and extension to auto-detect the correct adapter:
```rust
let (format, adapter) = DocumentRouter::detect(path);
let (format, adapter) = DocumentRouter::detect_directory(root); // samples ≤20 files
```
| Detection Method | Format | Heuristic |
|---|---|---|
| Extension `.bib` / `.bibtex` | BibTeX | Extension only |
| Extension `.md` + `Protocol: L1GHT` | L1GHT | Content check |
| Extension `.md` without L1GHT, `.txt`, `.rst`, `.adoc`, `.html`, `.pdf`, `.docx`, `.pptx`, `.xlsx` | Universal | Extension + universal lane |
| Extension `.xml` / `.nxml` | Patent, JATS, or RFC | Root element inspection |
| Extension `.json` | CrossRef | Checks for DOI + publisher + type keys |
| Fallback | Code | Default pipeline |
Used via MCP: `ingest(adapter="auto")`, `adapter="document"`, or `adapter="universal"` when you want best-effort document normalization directly.
For directory detection, the router samples up to 20 files and returns the dominant format.
Cross-Domain Resolution
CrossDomainResolver merges multiple adapter outputs and discovers cross-domain connections automatically.
Bridge Strategies
| Bridge | Weight | Source | Description |
|---|---|---|---|
| `same_as` | 1.0 | DOI/PMID | Same identifier in different domains → identity edge |
| `cross_cites` | 0.95 | Citation edges | Citation target exists as a full node in another domain |
| `same_orcid` | 0.95 | ORCID tags | Same researcher ORCID across different domains |
| `same_author` | 0.7 | Author name | Same author name across different namespaces |
| `shared_keyword` | 0.6 | Keyword tags | Shared `keyword:`, `article:keyword:`, or `subject:` tags |
| `citation_chain` | 0.5 | Citation adjacency | Transitive A→B→C bridging with decayed weight |
Safety Guards
- Keyword cap: Keywords shared by >20 nodes are ignored to prevent hub explosion.
- Cross-domain only: All bridges require nodes from ≥2 different namespaces. Same-domain matches are skipped.
- Self-loop prevention: Citation chains A→B→A do not generate self-loop edges.
- Deduplication: Nodes with identical external IDs are deduplicated (first wins).
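The keyword cap can be sketched as a counting filter; `bridgeable_keywords` and its `(node, keyword)` pair input are illustrative, not the resolver's actual API:

```rust
use std::collections::HashMap;

/// Keep only keywords that can form bridges: shared by at least 2 nodes
/// (so an edge exists to create) but by no more than 20 (the hub cap).
fn bridgeable_keywords(node_keywords: &[(u32, &str)]) -> Vec<String> {
    let mut counts: HashMap<&str, u32> = HashMap::new();
    for &(_, kw) in node_keywords {
        *counts.entry(kw).or_insert(0) += 1;
    }
    let mut out: Vec<String> = counts
        .into_iter()
        .filter(|&(_, c)| c >= 2 && c <= 20)
        .map(|(k, _)| k.to_string())
        .collect();
    out.sort(); // deterministic output
    out
}
```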
Resolution Statistics
```rust
pub struct ResolutionStats {
    pub graphs_merged: usize,
    pub total_nodes: u32,
    pub total_edges: usize,
    pub cross_edges_created: usize,
    pub identity_matches: usize,
    pub author_bridges: usize,
    pub keyword_bridges: usize,
    pub orcid_bridges: usize,
    pub citation_chains: usize,
}
```
Configuration Reference
```rust
pub struct IngestConfig {
    pub root: PathBuf,
    pub timeout: Duration,       // default: 300s
    pub max_nodes: u64,          // default: 500_000
    pub skip_dirs: Vec<String>,  // default: [".git", "node_modules", ...]
    pub skip_files: Vec<String>, // default: ["package-lock.json", ...]
    pub parallelism: usize,      // default: 8 (rayon threads)
}
```
Statistics
Every ingest run produces IngestStats:
```rust
pub struct IngestStats {
    pub files_scanned: u64,
    pub files_parsed: u64,
    pub files_skipped_binary: u64,
    pub files_skipped_encoding: u64,
    pub nodes_created: u64,
    pub edges_created: u64,
    pub references_resolved: u64,
    pub references_unresolved: u64,
    pub label_collisions: u64,
    pub elapsed_ms: f64,
    pub commit_groups: Vec<Vec<String>>,
}
```
commit_groups is passed to the CoChangeMatrix in m1nd-core after graph finalization, seeding the temporal co-change model with real git history.