
Ingestion Pipeline (m1nd-ingest)

m1nd-ingest transforms codebases and structured documents into the property graph consumed by m1nd-core. It handles file discovery, language-specific code extraction, cross-file reference resolution, incremental diff computation, and memory/markdown ingestion.

Source: m1nd-ingest/src/

Module Map

| Module | Purpose |
|---|---|
| lib.rs | Ingestor pipeline, IngestAdapter trait, config, stats |
| walker.rs | DirectoryWalker, binary detection, git enrichment |
| extract/mod.rs | Extractor trait, comment stripping, CommentSyntax |
| extract/python.rs | Python: classes, functions, decorators, imports |
| extract/typescript.rs | TypeScript/JS: classes, functions, interfaces, imports |
| extract/rust_lang.rs | Rust: structs, enums, impls, traits, functions, mods |
| extract/go.rs | Go: structs, interfaces, functions, packages |
| extract/java.rs | Java: classes, interfaces, methods, packages |
| extract/generic.rs | Fallback: file-level node with tag extraction |
| resolve.rs | ReferenceResolver, proximity disambiguation |
| cargo_workspace.rs | Cargo workspace/crate/dependency enrichment for Rust repos |
| cross_file.rs | Python-weighted cross-file enrichment (imports, tests, registers) |
| diff.rs | GraphDiff for incremental updates |
| json_adapter.rs | Generic JSON-to-graph adapter |
| memory_adapter.rs | Markdown/memory document adapter |
| canonical.rs | Canonical document substrate used by the universal lane |
| merge.rs | Graph merge utilities |
| patent_adapter.rs | USPTO/EPO patent XML ingestion |
| jats_adapter.rs | PubMed/JATS scientific article XML ingestion |
| bibtex_adapter.rs | BibTeX bibliography file ingestion |
| rfc_adapter.rs | IETF RFC XML v3 ingestion |
| crossref_adapter.rs | CrossRef API JSON (DOI metadata) ingestion |
| document_router.rs | Auto-detect document format and route to the correct adapter |
| universal_adapter.rs | Best-effort document canonicalization, provider routing, and graphification |
| cross_domain.rs | Cross-domain edge resolution (DOI, ORCID, keyword bridges) |

Pipeline Overview

flowchart TD
    subgraph "Phase 1: Walk"
        DIR["Directory Root"]
        WALK["DirectoryWalker<br/>walkdir + skip rules"]
        BIN["Binary Detection<br/>NUL in first 8KB"]
        GIT["Git Enrichment<br/>commit counts + timestamps"]
        FILES["DiscoveredFile[]"]
    end

    subgraph "Phase 2: Extract (parallel)"
        RAYON["rayon::par_iter"]
        PY["PythonExtractor"]
        TS["TypeScriptExtractor"]
        RS["RustExtractor"]
        GO["GoExtractor"]
        JAVA["JavaExtractor"]
        GEN["GenericExtractor"]
        RESULTS["ExtractionResult[]"]
    end

    subgraph "Phase 3: Build Graph"
        NODES["Add all nodes<br/>with provenance + timestamps"]
        EDGES["Add resolved edges<br/>with causal strengths"]
        CAUSAL["Causal strength assignment<br/>contains=0.8, imports=0.6, etc."]
    end

    subgraph "Phase 4: Resolve + Enrich"
        REFS["ReferenceResolver<br/>ref:: → actual nodes"]
        PROX["Proximity Disambiguation<br/>same file > same dir > same project"]
        HINTS["Import Hints<br/>module path → target preference"]
        CARGO["Cargo Workspace<br/>workspace/crate/dependency graph"]
        XFILE["Cross-File Enrichment<br/>currently strongest for Python"]
    end

    subgraph "Phase 5: Finalize"
        CSR["Build CSR<br/>sort edges, compact arrays"]
        BIDIR["Expand Bidirectional<br/>contains, implements"]
        REV["Build Reverse CSR"]
        PR["PageRank<br/>power iteration"]
    end

    DIR --> WALK --> BIN --> GIT --> FILES
    FILES --> RAYON
    RAYON --> PY & TS & RS & GO & JAVA & GEN
    PY & TS & RS & GO & JAVA & GEN --> RESULTS
    RESULTS --> NODES --> EDGES --> CAUSAL
    CAUSAL --> REFS --> PROX
    PROX --> HINTS --> CARGO --> XFILE
    XFILE --> CSR --> BIDIR --> REV --> PR

Phase 1: Directory Walking

DirectoryWalker uses the walkdir crate to traverse the filesystem. It applies skip rules, detects binary files, and enriches results with git metadata.

Skip Rules

Default skip directories (configured via IngestConfig):

skip_dirs: vec![
    ".git", "node_modules", "__pycache__", ".venv",
    "target", "dist", "build", ".next", "vendor",
],
skip_files: vec![
    "package-lock.json", "yarn.lock", "Cargo.lock", "poetry.lock",
],

Hidden directories (starting with .) are skipped unless they are the root. Symlinks are not followed.

Binary Detection (FM-ING-004)

After discovering a file, the walker reads the first 8KB and checks for NUL bytes (0x00). Any file containing NUL is classified as binary and skipped. This prevents feeding compiled binaries, images, or other non-text files into the language extractors.
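The NUL-sniff heuristic is simple enough to sketch in a few lines. The function name below is illustrative, not the walker's actual API:

```rust
/// A file is treated as binary if any byte in its first 8 KB is NUL (0x00).
fn looks_binary(prefix: &[u8]) -> bool {
    prefix.iter().take(8 * 1024).any(|&b| b == 0)
}

fn main() {
    assert!(!looks_binary(b"fn main() {}"));  // plain text: no NUL bytes
    assert!(looks_binary(b"\x7fELF\x00\x01")); // ELF header contains NUL
    println!("ok");
}
```

In practice the walker only needs to read the first 8 KB of each file, so this check is cheap even on large trees.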

Git Enrichment

If the root is inside a git repository, the walker runs git log --format=%at --name-only to extract:

  • Commit count per file: Used to compute change_frequency (1 commit = 0.1, 50+ = 1.0, capped).
  • Most recent commit timestamp: Used for temporal decay scoring.
  • Commit groups: Sets of files that changed together in the same commit. Fed into the CoChangeMatrix after graph finalization.

The result is a WalkResult containing Vec<DiscoveredFile> and Vec<Vec<String>> commit groups.

pub struct DiscoveredFile {
    pub path: PathBuf,
    pub relative_path: String,
    pub extension: Option<String>,
    pub size_bytes: u64,
    pub last_modified: f64,
    pub commit_count: u32,
    pub last_commit_time: f64,
}

Phase 2: Parallel Extraction

Files are distributed across rayon’s thread pool for concurrent extraction. Each file is assigned a language-specific extractor based on its extension:

| Extension | Extractor | Extracted Entities |
|---|---|---|
| .py, .pyi | PythonExtractor | Classes, functions, decorators, imports, global assignments |
| .ts, .tsx, .js, .jsx, .mjs, .cjs | TypeScriptExtractor | Classes, functions, interfaces, type aliases, imports, exports |
| .rs | RustExtractor | Structs, enums, traits, impls, functions, modules, macros |
| .go | GoExtractor | Structs, interfaces, functions, methods, packages |
| .java | JavaExtractor | Classes, interfaces, methods, fields, packages |
| everything else | GenericExtractor | File-level node with tag extraction from content |

This stack is hybrid:

  • native/manual extractors for Python, TypeScript/JavaScript, Rust, Go, and Java
  • tree-sitter-backed tiers for additional languages
  • generic fallback for unsupported files

Extractor Interface

All extractors implement the Extractor trait:

pub trait Extractor: Send + Sync {
    fn extract(&self, content: &[u8], file_id: &str) -> M1ndResult<ExtractionResult>;
    fn extensions(&self) -> &[&str];
}

An ExtractionResult contains:

pub struct ExtractionResult {
    pub nodes: Vec<ExtractedNode>,
    pub edges: Vec<ExtractedEdge>,
    pub unresolved_refs: Vec<String>,
}

Each extracted node carries:

pub struct ExtractedNode {
    pub id: String,        // e.g. "file::backend/chat_handler.py::ChatHandler"
    pub label: String,     // e.g. "ChatHandler"
    pub node_type: NodeType,
    pub tags: Vec<String>,
    pub line: u32,
    pub end_line: u32,
}

Comment and String Stripping

Before extraction, each file’s content passes through strip_comments_and_strings() which removes comments and string literals to prevent false-positive matches from regex extractors. The function preserves import line string content (so from "react" still resolves) but strips string bodies elsewhere.

Comment syntax is per-language:

pub struct CommentSyntax {
    pub line: &'static str,         // e.g. "//" or "#"
    pub block_open: &'static str,   // e.g. "/*"
    pub block_close: &'static str,  // e.g. "*/"
}

Supported: Rust (//, /* */), Python (#, """ """), C-style (//, /* */), Go (//, /* */), Generic (#, none).
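A minimal sketch of the line-comment half of this stripping, assuming a single line marker per language. The real strip_comments_and_strings() also handles block comments and string literals (preserving import-line strings); the function name here is illustrative:

```rust
/// Drop everything from the line-comment marker to end of line.
/// Sketch only: does not handle block comments, strings, or markers
/// that appear inside string literals.
fn strip_line_comments(source: &str, marker: &str) -> String {
    source
        .lines()
        .map(|line| line.split(marker).next().unwrap_or(line))
        .collect::<Vec<_>>()
        .join("\n")
}

fn main() {
    let src = "x = 1  # counter\n# pure comment\ny = 2";
    assert_eq!(strip_line_comments(src, "#"), "x = 1  \n\ny = 2");
    println!("ok");
}
```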

Node ID Format

Extracted nodes use a hierarchical ID scheme: file::{relative_path}::{entity_name}. For example:

  • file::backend/chat_handler.py (file node)
  • file::backend/chat_handler.py::ChatHandler (class node)
  • file::backend/chat_handler.py::ChatHandler::handle_message (method node)

Unresolved references use the ref:: prefix: ref::Config, ref::react. These are resolved to actual nodes in Phase 4.
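The ID scheme can be sketched as simple string concatenation over the entity path (the helper name below is an assumption, not the crate's API):

```rust
/// Build a hierarchical node ID: file::{relative_path}::{entity}::{entity}...
fn node_id(relative_path: &str, entity_path: &[&str]) -> String {
    let mut id = format!("file::{}", relative_path);
    for entity in entity_path {
        id.push_str("::");
        id.push_str(entity);
    }
    id
}

fn main() {
    assert_eq!(
        node_id("backend/chat_handler.py", &[]),
        "file::backend/chat_handler.py"
    );
    assert_eq!(
        node_id("backend/chat_handler.py", &["ChatHandler", "handle_message"]),
        "file::backend/chat_handler.py::ChatHandler::handle_message"
    );
    println!("ok");
}
```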

Phase 3: Graph Building

After parallel extraction completes, results are collected and processed sequentially (graph mutation is single-threaded).

Node Creation

For each ExtractedNode:

  1. Look up the file timestamp from git enrichment (or filesystem mtime).
  2. Compute change_frequency from git commit count: (commits / 50).clamp(0.1, 1.0). Default 0.3 for non-git repos.
  3. Call graph.add_node() with the external ID, label, node type, tags, timestamp, and change frequency.
  4. Set provenance: source_path, line_start, line_end, namespace="code".
  5. On DuplicateNode error, increment collision counter and continue.
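The change-frequency mapping in step 2 is a one-liner (the function name is illustrative):

```rust
/// Map a git commit count to change_frequency: scale by 1/50, clamp to [0.1, 1.0].
/// One commit scores the 0.1 floor; 50 or more commits saturate at 1.0.
fn change_frequency(commits: u32) -> f64 {
    (commits as f64 / 50.0).clamp(0.1, 1.0)
}

fn main() {
    assert_eq!(change_frequency(1), 0.1);   // floor
    assert_eq!(change_frequency(25), 0.5);  // linear middle
    assert_eq!(change_frequency(200), 1.0); // cap
    println!("ok");
}
```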

Edge Creation

For each ExtractedEdge (skipping ref:: targets, which are deferred):

  1. Resolve source and target IDs to NodeId.
  2. Assign causal strength by relation type:
| Relation | Causal Strength | Direction |
|---|---|---|
| contains | 0.8 | Bidirectional |
| implements | 0.7 | Bidirectional |
| imports | 0.6 | Forward |
| calls | 0.5 | Forward |
| references | 0.3 | Forward |
| other | 0.4 | Forward |

contains and implements edges are bidirectional so that both parent-to-child and child-to-parent navigation work.
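The strength table above can be sketched as a single match (function name and return shape are assumptions; the values mirror the table):

```rust
/// Map a relation name to (causal strength, bidirectional flag).
fn causal_strength(relation: &str) -> (f32, bool) {
    match relation {
        "contains" => (0.8, true),
        "implements" => (0.7, true),
        "imports" => (0.6, false),
        "calls" => (0.5, false),
        "references" => (0.3, false),
        _ => (0.4, false), // any other relation gets the default forward strength
    }
}

fn main() {
    assert_eq!(causal_strength("contains"), (0.8, true));
    assert_eq!(causal_strength("imports"), (0.6, false));
    assert_eq!(causal_strength("unknown"), (0.4, false));
    println!("ok");
}
```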

Safety Guards (FM-ING-002)

Two budget checks run between sequential file processing:

  • Timeout: If start.elapsed() > config.timeout (default 300s), stop processing.
  • Node budget: If nodes >= config.max_nodes (default 500K), stop processing.

Both log a warning and break from the build loop, producing a partial but consistent graph from whatever was processed.
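The check itself is a two-condition predicate evaluated between files; names here are assumptions based on the config fields described below:

```rust
use std::time::{Duration, Instant};

/// True when either budget is exhausted: wall-clock timeout exceeded,
/// or the node count has reached the configured maximum.
fn over_budget(start: Instant, timeout: Duration, nodes: u64, max_nodes: u64) -> bool {
    start.elapsed() > timeout || nodes >= max_nodes
}

fn main() {
    let start = Instant::now();
    // Fresh run, few nodes: keep going.
    assert!(!over_budget(start, Duration::from_secs(300), 10, 500_000));
    // Node budget hit: stop, even though the timeout has not elapsed.
    assert!(over_budget(start, Duration::from_secs(300), 500_000, 500_000));
    println!("ok");
}
```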

Phase 4: Reference Resolution And Enrichment

ReferenceResolver

Unresolved references (ref::Config, ref::FastAPI, etc.) are resolved to actual graph nodes using the ReferenceResolver.

Multi-value label index (FM-ING-008): The resolver builds a HashMap from labels to lists of matching NodeIds. When multiple nodes share a label (e.g., multiple files define a Config class), proximity disambiguation selects the best match:

| Proximity | Score | Condition |
|---|---|---|
| Same file | 100 | Source and target share the same file:: prefix |
| Same directory | 50 | Source and target share the same directory |
| Same project | 10 | Default (both exist in the graph) |
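A sketch of that scoring over the hierarchical IDs (the helper names and exact string handling are assumptions; only the 100/50/10 tiers come from the table):

```rust
/// Score a candidate node's proximity to the referencing node.
fn proximity(source: &str, candidate: &str) -> u32 {
    // "file::{path}::{entity}" -> "file::{path}"
    let file = |id: &str| id.splitn(3, "::").take(2).collect::<Vec<_>>().join("::");
    // "file::{dir}/{name}" -> Some("file::{dir}"), None for top-level files
    let dir = |id: &str| file(id).rsplitn(2, '/').nth(1).map(str::to_owned);
    if file(source) == file(candidate) {
        100 // same file
    } else if dir(source).is_some() && dir(source) == dir(candidate) {
        50 // same directory
    } else {
        10 // same project (both exist in the graph)
    }
}

fn main() {
    assert_eq!(proximity("file::a/b.py::Foo", "file::a/b.py::Config"), 100);
    assert_eq!(proximity("file::a/b.py::Foo", "file::a/c.py::Config"), 50);
    assert_eq!(proximity("file::a/b.py::Foo", "file::x/y.py::Config"), 10);
    println!("ok");
}
```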

Import hint disambiguation: When the extractor sees from foo.bar import Baz, it records an import hint mapping (source_file, "ref::Baz") to the module path foo.bar. The resolver uses this hint to prefer the Baz node under foo/bar/ over a same-named node elsewhere.

Resolution outcome per reference:

  • Resolved: Exactly one match (or best proximity match). Edge created with resolved NodeId.
  • Ambiguous: Multiple matches with equal proximity. Best guess selected, counted in stats.
  • Unresolved: No match found. Counted in stats, no edge created.

Cargo Workspace Enrichment

For Rust repos, cargo_workspace.rs adds a workspace-aware layer before finalization:

  • workspace nodes
  • crate nodes
  • crate -> file contains edges
  • crate -> crate depends_on edges for internal workspace dependencies
  • external dependency nodes for non-workspace dependencies

This means Rust repos are no longer represented only as file graphs.

Cross-File Enrichment

After reference resolution and Cargo enrichment, cross_file.rs adds a narrower set of cross-file edges that ship today.

This pass is currently strongest for Python, where it focuses on:

  • imports
  • tests
  • registers

It is not yet a language-uniform cross-file engine.

Phase 5: Finalization

Graph.finalize() transforms the mutable graph into its read-optimized CSR form:

  1. Sort edges by source: All pending edges are sorted by source.0 (node index).
  2. Build forward CSR: Compute offsets array, pack targets/weights/relations/etc into parallel arrays.
  3. Expand bidirectional edges: For each bidirectional edge (A, B), ensure both A->B and B->A exist in the CSR.
  4. Build reverse CSR: Sort edges by target, build rev_offsets, rev_sources, rev_edge_idx (mapping back to forward array indices).
  5. Rebuild plasticity arrays: Allocate PlasticityNode for each node with default ceiling.
  6. Compute PageRank: Power iteration with damping 0.85, max 50 iterations, convergence 1e-6.

After finalization, the graph is immutable (except for atomic weight updates by plasticity).
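Step 6 can be illustrated with a minimal power-iteration PageRank over an adjacency list, using the same constants the pipeline uses (damping 0.85, at most 50 iterations, tolerance 1e-6). This is a sketch, not the CSR implementation:

```rust
/// Power-iteration PageRank. `out[u]` lists the nodes u links to.
fn pagerank(out: &[Vec<usize>], damping: f64, max_iter: usize, tol: f64) -> Vec<f64> {
    let n = out.len();
    let mut rank = vec![1.0 / n as f64; n];
    for _ in 0..max_iter {
        // Teleport mass distributed uniformly.
        let mut next = vec![(1.0 - damping) / n as f64; n];
        for (u, targets) in out.iter().enumerate() {
            if targets.is_empty() {
                // Dangling node: spread its rank over all nodes.
                for r in next.iter_mut() {
                    *r += damping * rank[u] / n as f64;
                }
            } else {
                for &v in targets {
                    next[v] += damping * rank[u] / targets.len() as f64;
                }
            }
        }
        let delta: f64 = rank.iter().zip(&next).map(|(a, b)| (a - b).abs()).sum();
        rank = next;
        if delta < tol {
            break; // converged
        }
    }
    rank
}

fn main() {
    // 3-cycle: 0 -> 1 -> 2 -> 0, so ranks converge to uniform 1/3.
    let ranks = pagerank(&[vec![1], vec![2], vec![0]], 0.85, 50, 1e-6);
    assert!((ranks.iter().sum::<f64>() - 1.0).abs() < 1e-6);
    assert!((ranks[0] - ranks[1]).abs() < 1e-3);
    println!("ok");
}
```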

Incremental Ingestion (GraphDiff)

diff.rs enables incremental updates without full re-ingestion.

pub enum DiffAction {
    AddNode(ExtractedNode),
    RemoveNode(String),
    // variant field types elided for brevity
    ModifyNode { external_id, new_label, new_tags, new_last_modified },
    AddEdge(ExtractedEdge),
    RemoveEdge { source_id, target_id, relation },
    ModifyEdgeWeight { source_id, target_id, relation, new_weight },
}

GraphDiff::compute() compares old and new extraction results by indexing both into HashMaps by external ID, then classifying each node/edge as added, removed, or modified.

GraphDiff::apply() executes the diff against a live graph. Note: CSR does not support true node/edge removal. “Removed” nodes are tombstoned (zero weight, empty label) rather than physically deleted. A full re-ingest is needed to reclaim space.
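The added/removed classification at the heart of compute() is a set difference over external IDs; this sketch shows the idea (function name and signature are assumptions):

```rust
use std::collections::HashSet;

/// Classify external IDs as added (in new only) or removed (in old only).
/// IDs present in both sides would be checked field-by-field for modification.
fn diff_ids<'a>(old: &'a [&'a str], new: &'a [&'a str]) -> (Vec<&'a str>, Vec<&'a str>) {
    let old_set: HashSet<&str> = old.iter().copied().collect();
    let new_set: HashSet<&str> = new.iter().copied().collect();
    let added = new.iter().copied().filter(|id| !old_set.contains(id)).collect();
    let removed = old.iter().copied().filter(|id| !new_set.contains(id)).collect();
    (added, removed)
}

fn main() {
    let (added, removed) =
        diff_ids(&["file::a.py", "file::b.py"], &["file::b.py", "file::c.py"]);
    assert_eq!(added, vec!["file::c.py"]);
    assert_eq!(removed, vec!["file::a.py"]);
    println!("ok");
}
```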

When to use incremental vs full:

| Scenario | Strategy |
|---|---|
| Single file changed | Incremental diff (fast, ~10ms) |
| Many files changed (>20%) | Full re-ingest (cleaner CSR, correct PageRank) |
| New codebase | Full ingest |
| Plasticity state important | Full ingest + plasticity reimport (triple matching) |

Memory Adapter

MemoryIngestAdapter converts markdown documents into graph nodes. It implements the IngestAdapter trait with domain "memory".

Supported Formats

Files with extensions .md, .markdown, or .txt are accepted. The adapter walks a directory of memory files and parses each one into:

  • Section nodes (NodeType::Concept): Created from markdown headings (#, ##, etc.).
  • Entry nodes (NodeType::Process): Created from list items under sections.
  • File reference nodes (NodeType::Reference): Created from file paths mentioned in content.

Entry Classification

List items are classified by content patterns:

| Classification | Pattern | Example |
|---|---|---|
| Task | Contains “TODO”, “FIXME”, “pending”, “implement” | “- TODO: add tests” |
| Decision | Contains “decision:”, “decided:”, “chose” | “- Decision: use CSR format” |
| State | Contains “status:”, “state:”, “current:” | “- Status: in progress” |
| Event | Contains date pattern (YYYY-MM-DD) | “- 2026-03-12: deployed” |
| Note | Default | “- Config lives in settings.py” |
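The classifier can be sketched as an ordered chain of substring checks; the check ordering and function name here are assumptions, and the real adapter may differ:

```rust
/// Classify a memory list item by its content patterns.
fn classify(entry: &str) -> &'static str {
    let lower = entry.to_lowercase();
    // Scan for a YYYY-MM-DD window: digits everywhere except '-' at offsets 4 and 7.
    let has_date = lower.as_bytes().windows(10).any(|w| {
        w[4] == b'-'
            && w[7] == b'-'
            && w.iter()
                .enumerate()
                .all(|(i, &b)| matches!(i, 4 | 7) || b.is_ascii_digit())
    });
    if ["todo", "fixme", "pending", "implement"].iter().any(|p| lower.contains(p)) {
        "task"
    } else if ["decision:", "decided:", "chose"].iter().any(|p| lower.contains(p)) {
        "decision"
    } else if ["status:", "state:", "current:"].iter().any(|p| lower.contains(p)) {
        "state"
    } else if has_date {
        "event"
    } else {
        "note"
    }
}

fn main() {
    assert_eq!(classify("- TODO: add tests"), "task");
    assert_eq!(classify("- Decision: use CSR format"), "decision");
    assert_eq!(classify("- 2026-03-12: deployed"), "event");
    assert_eq!(classify("- Config lives in settings.py"), "note");
    println!("ok");
}
```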

Edges

The adapter creates:

  • contains edges from section to child entries.
  • references edges from entries to file reference nodes.
  • follows edges between sequential entries in the same section.

This allows the activation engine to traverse from a concept (“plasticity”) through memory entries to referenced code files, bridging the semantic gap between human notes and source code.

IngestAdapter Trait

The IngestAdapter trait enables domain-specific ingestion beyond code:

pub trait IngestAdapter: Send + Sync {
    fn domain(&self) -> &str;
    fn ingest(&self, root: &Path) -> M1ndResult<(Graph, IngestStats)>;
}

Implemented adapters:

| Adapter | Domain | Input | MCP adapter= |
|---|---|---|---|
| Ingestor | "code" | Source code directories | code |
| MemoryIngestAdapter | "memory" | Markdown/text documents | memory |
| JsonIngestAdapter | "generic" | Arbitrary JSON with nodes[] and edges[] | json |
| PatentIngestAdapter | "patent" | USPTO/EPO patent XML | patent |
| JatsArticleAdapter | "article" | PubMed NLM / JATS Z39.96 XML | article |
| BibTexAdapter | "bibtex" | BibTeX bibliography files | bibtex, bib |
| RfcAdapter | "rfc" | IETF RFC XML v3 | rfc |
| CrossRefAdapter | "crossref" | CrossRef API JSON (DOI metadata) | crossref, doi |
| L1ghtIngestAdapter | "light" | L1GHT protocol Markdown | light |

The JSON adapter is the escape hatch for importing graphs from external tools. It expects a JSON document with nodes (array of {id, label, type, tags}) and edges (array of {source, target, relation, weight}).
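A minimal document of that shape might look like the following. The field names come from the description above; the IDs, labels, and tag values are purely illustrative:

```json
{
  "nodes": [
    { "id": "ext::svc-auth", "label": "AuthService", "type": "concept", "tags": ["external"] },
    { "id": "ext::svc-db", "label": "Database", "type": "concept", "tags": [] }
  ],
  "edges": [
    { "source": "ext::svc-auth", "target": "ext::svc-db", "relation": "references", "weight": 0.5 }
  ]
}
```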

Universal Document Lane

The universal lane is the best-effort document path for sources that are not authored in L1GHT and are not already handled by a stronger native structured adapter.

Its flow is:

  1. detect document family
  2. normalize into a CanonicalDocument
  3. graphify sections, blocks, tables, citations, entities, and claims

Optional providers can enrich the lane when available:

  • Trafilatura for HTML/wiki/article extraction
  • Docling for office and broad document canonicalization
  • MarkItDown as a lightweight fallback lane
  • GROBID for scholarly PDFs

This provider stack is intentionally optional. The default green path does not require these providers; richer extraction appears only when the environment supports it.

Document Router (Auto-Detection)

DocumentRouter inspects file content and extension to auto-detect the correct adapter:

let (format, adapter) = DocumentRouter::detect(path);
let (format, adapter) = DocumentRouter::detect_directory(root); // samples ≤20 files

| Detection Method | Format | Heuristic |
|---|---|---|
| Extension .bib / .bibtex | BibTeX | Extension only |
| Extension .md + Protocol: L1GHT | L1GHT | Content check |
| Extension .md without L1GHT, .txt, .rst, .adoc, .html, .pdf, .docx, .pptx, .xlsx | Universal | Extension + universal lane |
| Extension .xml / .nxml | Patent, JATS, or RFC | Root element inspection |
| Extension .json | CrossRef | Checks for DOI + publisher + type keys |
| Fallback | Code | Default pipeline |

Via MCP: ingest(adapter="auto") routes automatically; adapter="document" or adapter="universal" requests best-effort document normalization directly.

For directory detection, the router samples up to 20 files and returns the dominant format.

Cross-Domain Resolution

CrossDomainResolver merges multiple adapter outputs and discovers cross-domain connections automatically.

Bridge Strategies

| Bridge | Weight | Source | Description |
|---|---|---|---|
| same_as | 1.0 | DOI/PMID | Same identifier in different domains → identity edge |
| cross_cites | 0.95 | Citation edges | Citation target exists as a full node in another domain |
| same_orcid | 0.95 | ORCID tags | Same researcher ORCID across different domains |
| same_author | 0.7 | Author name | Same author name across different namespaces |
| shared_keyword | 0.6 | Keyword tags | Shared keyword:, article:keyword:, or subject: tags |
| citation_chain | 0.5 | Citation adjacency | Transitive A→B→C bridging with decayed weight |

Safety Guards

  • Keyword cap: Keywords shared by >20 nodes are ignored to prevent hub explosion.
  • Cross-domain only: All bridges require nodes from ≥2 different namespaces. Same-domain matches are skipped.
  • Self-loop prevention: Citation chains A→B→A do not generate self-loop edges.
  • Deduplication: Nodes with identical external IDs are deduplicated (first wins).
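The keyword cap and cross-domain-only guards compose naturally; this sketch shows both on a toy node list (names and the (namespace, keyword) representation are assumptions):

```rust
use std::collections::HashMap;

/// Bridge nodes that share a keyword, skipping hub keywords (> cap members)
/// and same-namespace pairs. Returns index pairs for the bridge edges.
fn keyword_bridges(nodes: &[(&str, &str)], cap: usize) -> Vec<(usize, usize)> {
    let mut by_keyword: HashMap<&str, Vec<usize>> = HashMap::new();
    for (i, &(_, kw)) in nodes.iter().enumerate() {
        by_keyword.entry(kw).or_default().push(i);
    }
    let mut edges = Vec::new();
    for members in by_keyword.values() {
        if members.len() > cap {
            continue; // hub keyword: skip to prevent edge explosion
        }
        for (a, &i) in members.iter().enumerate() {
            for &j in &members[a + 1..] {
                if nodes[i].0 != nodes[j].0 {
                    edges.push((i, j)); // cross-domain only
                }
            }
        }
    }
    edges
}

fn main() {
    let nodes = [
        ("patent", "graphene"),
        ("article", "graphene"),
        ("article", "graphene"),
    ];
    // Node 0 bridges to 1 and 2; the 1-2 pair is same-domain and skipped.
    assert_eq!(keyword_bridges(&nodes, 20).len(), 2);
    println!("ok");
}
```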

Resolution Statistics

pub struct ResolutionStats {
    pub graphs_merged: usize,
    pub total_nodes: u32,
    pub total_edges: usize,
    pub cross_edges_created: usize,
    pub identity_matches: usize,
    pub author_bridges: usize,
    pub keyword_bridges: usize,
    pub orcid_bridges: usize,
    pub citation_chains: usize,
}

Configuration Reference

pub struct IngestConfig {
    pub root: PathBuf,
    pub timeout: Duration,          // default: 300s
    pub max_nodes: u64,             // default: 500_000
    pub skip_dirs: Vec<String>,     // default: [".git", "node_modules", ...]
    pub skip_files: Vec<String>,    // default: ["package-lock.json", ...]
    pub parallelism: usize,         // default: 8 (rayon threads)
}

Statistics

Every ingest run produces IngestStats:

pub struct IngestStats {
    pub files_scanned: u64,
    pub files_parsed: u64,
    pub files_skipped_binary: u64,
    pub files_skipped_encoding: u64,
    pub nodes_created: u64,
    pub edges_created: u64,
    pub references_resolved: u64,
    pub references_unresolved: u64,
    pub label_collisions: u64,
    pub elapsed_ms: f64,
    pub commit_groups: Vec<Vec<String>>,
}

commit_groups is passed to the CoChangeMatrix in m1nd-core after graph finalization, seeding the temporal co-change model with real git history.