The Lang Graph Librarian

Hive Mind Document Processing Pipeline

Pipeline Active — 10 Stages — Real-time Enrichment
📥
INGEST
🔍
PARSE
🏷
ENTITY
📝
SUMMARIZE
📂
CLASSIFY
🔗
RELATE
🧠
EMBED
🚦
ROUTE
💾
STORE
✅
CONFIRM

Multi-Source Document Ingestion

Documents enter the pipeline from anywhere — chat, files, webhooks, or scheduled jobs

💬
Chat Messages
Telegram, Discord, direct input — text captured from conversations
📄
File Upload
PDF, MD, TXT, DOCX — parsed by /opt/librarian/server.py
🪝
Webhooks
External systems push data via HTTP webhook endpoints
⏰
Cron Jobs
Scheduled enrichment — periodic doc refresh & re-scoring
🔄
API Calls
Direct enrichment-v2 API at /opt/enrichment-v2/server.py

The 10-Stage Enrichment Pipeline

Each stage transforms, validates, and passes data forward — with error handling at every step

Stage 01
📥 INGEST
Raw input arrives from any source: chat message, uploaded file, webhook payload, or cron trigger. The Librarian normalizes it into a standard document envelope.
chat file webhook cron
~5ms
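The envelope schema is not spelled out here; a minimal sketch of the Stage 1 normalization, where `make_envelope` and every field name are illustrative assumptions:

```python
import uuid
from datetime import datetime, timezone

def make_envelope(raw: str, source: str) -> dict:
    """Wrap raw input in a standard document envelope (field names assumed)."""
    return {
        "doc_id": str(uuid.uuid4()),
        "source": source,  # "chat" | "file" | "webhook" | "cron"
        "text": raw,
        "received_at": datetime.now(timezone.utc).isoformat(),
        "stage": "ingest",
    }

env = make_envelope("Meeting notes about [[Project X]]", "chat")
```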
Stage 02
🔍 PARSE
Extract plain text, structure metadata (author, date, language, mime-type). Strip formatting, normalize whitespace, detect charset.
text metadata parse_fail
~20ms
Stage 03
🏷 ENTITY EXTRACT
Scan for [[Entity Name]] wiki-style patterns and extract all named entities with their types into a structured list.
entities[] regex_err
~15ms
Stage 04
📝 SUMMARIZE
Generate a structured summary with key points. Optional: bullet extraction, TL;DR generation, key quotes identification.
summary bullets[] llm_timeout
~800ms
Stage 05
📂 CLASSIFY (PARA)
Categorize into Projects, Areas, Resources, or Archive. Uses content heuristics + prior routing history to pick the right bucket.
P A R Archive ambig
~50ms
Stage 06
🔗 RELATE
Map [[A]] VERB [[B]] relationship patterns. Builds the knowledge graph edges between entities.
edges[] triplets[] ambiguous_rel
~100ms
Stage 07
🧠 EMBED
Send text to Qwen3-4B embedder at /opt/qwen3-embedder/server.py. Produces a 2560-dimensional vector.
text vector[2560] embedder_timeout
~300ms
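The call to the embedder server might look like the sketch below; the port, path, and JSON shape are assumptions, only the server location (/opt/qwen3-embedder/server.py) and the 2560-dim output come from the source:

```python
import json
import urllib.request

EMBEDDER_URL = "http://localhost:8000/embed"  # port and path are assumptions

def embed_request(text: str) -> urllib.request.Request:
    """Build the POST request for the Qwen3-4B embedder (body shape assumed)."""
    body = json.dumps({"input": text}).encode()
    return urllib.request.Request(
        EMBEDDER_URL,
        data=body,
        headers={"Content-Type": "application/json"},
    )

# resp = urllib.request.urlopen(embed_request(doc_text))
# -> assumed to return {"vector": [...]} with 2560 floats
req = embed_request("hello embedder")
```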
Stage 08
🚦 ROUTE
Determine which Qdrant collection(s) to store in. Checks loom confidence scores (Facts 0.95, Decisions 0.90, Reports 0.85, Predictions 0.50).
collections[] no_match
~10ms
Stage 09
💾 STORE
Write to Qdrant at localhost:6333. Upserts point with vector + full payload (text, metadata, entities, summary, edges).
point_id collections[] qdrant_err
~25ms
Stage 10
✅ CONFIRM
Return the stored point ID, update document tracking, fire webhooks for downstream consumers. Pipeline complete.
doc_id point_ids[] darwin_score
~5ms
Stages 3 & 4 run in PARALLEL — Entity extraction + Summarization

Parallel Processing & Swim Lanes

Where the pipeline splits, processes concurrently, and rejoins

STAGE
PARALLEL TRACK A
PARALLEL TRACK B
PARSE (Stage 2)
TRACK A
Text Extraction
Strip markup, extract body text, normalize encoding
TRACK B
Metadata Extraction
Author, date, language, mime-type, file size
TRACK A
→ PASSES TO
Entity Extract (Stage 3) + Embed (Stage 7)
TRACK B
→ PASSES TO
Classify (Stage 5) + Store payload
ENTITY + SUMMARY
(Stages 3 & 4)
TRACK A — ENTITY
[[Entity]] Extraction
Regex scan for [[Name]] patterns → entity list with types
TRACK B — SUMMARY
LLM Summarization
Key points + bullets via enrichment-v2 LLM call
REJOIN
⟵ JOIN Entity list + Summary merge into payload
JOIN ⟶ Full payload → Stage 7 Embed
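The fork-and-join above can be sketched with `asyncio.gather`; the two stage functions here are simplified stubs standing in for the real Stage 3/4 implementations:

```python
import asyncio

async def extract_entities(text: str) -> list:
    # Stage 3 stub — the real version runs the [[Entity]] regex scan
    return [w.strip("[]") for w in text.split() if w.startswith("[[")]

async def summarize(text: str) -> str:
    # Stage 4 stub — the real version calls the enrichment-v2 LLM
    return text[:60]

async def enrich(text: str) -> dict:
    # Tracks A and B run concurrently, then rejoin into one payload
    entities, summary = await asyncio.gather(
        extract_entities(text), summarize(text)
    )
    return {"entities": entities, "summary": summary}

payload = asyncio.run(enrich("[[Alice]] owns [[Widget]]"))
```

The merged payload then flows to Stage 7 for embedding, exactly as the JOIN row describes.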

Core Pipeline Logic

Actual pseudocode from our enrichment system

entity_extractor.py python
# [[Entity Name]] pattern extraction — Stage 3
import re
from typing import List

ENTITY_PATTERN = r'\[\[([^\]]+)\]\]'

def extract_entities(text: str) -> List[dict]:
    # Find all [[Entity]] mentions
    raw_matches = re.findall(ENTITY_PATTERN, text)

    entities = []
    for name in raw_matches:
        entity = {
            "name": name.strip(),
            "type": infer_type(name),
            "confidence": 0.95,
            "aliases": []
        }
        entities.append(entity)

    return entities

# [[A]] VERB [[B]] relationship extraction — Stage 6
RELATION_PATTERN = r'\[\[([^\]]+)\]\]\s+(\w+)\s+\[\[([^\]]+)\]\]'

def extract_relations(text: str) -> List[dict]:
    matches = re.finditer(RELATION_PATTERN, text)
    triplets = []
    for m in matches:
        triplets.append({
            "subject": m.group(1),
            "verb": m.group(2),
            "object": m.group(3),
            "loom": classify_verb(m.group(2))
        })
    return triplets
para_classifier.py python
# PARA classification — Stage 5
from enum import Enum

class PARA(Enum):
    PROJECT = "P"
    AREA     = "A"
    RESOURCE = "R"
    ARCHIVE  = "Archive"

def classify_para(doc: Document) -> PARA:
    score = { PARA.PROJECT: 0, PARA.AREA: 0, PARA.RESOURCE: 0 }

    # Heuristic scoring
    if doc.has_verbs("plan,build,launch,create,ship"):
        score[PARA.PROJECT] += 3
    if doc.has_verbs("manage,maintain,support,oversee"):
        score[PARA.AREA] += 3
    if doc.has_keywords("guide,reference,tutorial,docs"):
        score[PARA.RESOURCE] += 3
    if doc.is_stale(): # >90 days no update
        return PARA.ARCHIVE

    best = max(score, key=score.get)
    return best
router.py python
# Collection routing — Stage 8
from dataclasses import dataclass
from typing import List

@dataclass
class Loom:
    name: str
    collection: str
    threshold: float

LOOMS = [
    Loom("facts",       "exocortex-facts",       0.95),
    Loom("decisions",   "exocortex-decisions",   0.90),
    Loom("reports",     "exocortex-reports",     0.85),
    Loom("predictions", "exocortex-predictions", 0.50),
]

def route_collections(doc: Document) -> List[str]:
    scores = score_looms(doc)   # LLM classifies loom scores
    hits = []
    for loom in LOOMS:
        if scores.get(loom.name, 0) >= loom.threshold:
            hits.append(loom.collection)

    # Always store in PARA collection too
    hits.append(para_collection(doc.para))
    return list(set(hits))   # deduplicate
darwinism.py python
# Darwinism scoring — quality gate at end of pipeline
def darwinism_score(doc: Document) -> float:
    score = 1.0

    # Penalize missing components
    if not doc.summary:     score -= 0.20
    if not doc.entities:    score -= 0.10
    if not doc.edges:       score -= 0.15

    # Penalize short content (penalties stack: under 50 chars loses 0.40)
    if len(doc.text) < 100:    score -= 0.25
    if len(doc.text) < 50:     score -= 0.15

    # Bonus for high loom confidence
    if doc.loom_score and doc.loom_score >= 0.90:
        score += 0.1

    return max(0.0, min(1.0, score))

The Storage Layer — Qdrant Collections

Each document is vectorized and stored in one or more collections based on its loom + PARA classification

exocortex-decisions
Decisions Loom
Stores resolved decisions with context, alternatives considered, and outcome. Confidence threshold: 0.90
2,847 points
2560 dims
HNSW index
🎯 Confidence: 0.90+
para-projects
PARA: Projects
Active projects with goals, milestones, owners, and status. Short-lived, high-action items.
1,204 points
2560 dims
PQ index
📁 PARA Category P
para-areas
PARA: Areas
Ongoing responsibilities and focus areas — work, health, finances, relationships. Medium-term horizon.
3,521 points
2560 dims
HNSW index
📁 PARA Category A
para-resources
PARA: Resources
Reference material, topics of interest, tutorials, documentation. Long-term, rarely changes.
8,943 points
2560 dims
HNSW index
📁 PARA Category R
exocortex-facts
Facts Loom
Verifiable facts, definitions, measurements, and objective data. Highest confidence threshold: 0.95
12,456 points
2560 dims
HNSW index
🎯 Confidence: 0.95+
para-archive
PARA: Archive
Completed, abandoned, or stale documents. Kept for reference but not actively surfaced.
5,678 points
2560 dims
plain index
📁 PARA Category Archive

Qdrant Point Structure — Vector + Payload

Every stored point contains the 2560-dim embedding plus the full enriched document payload

QDRANT POINT STRUCTURE
vector [-0.0231, 0.0847, -0.0412, 0.0193, ... x2556 more] 2560 dims
id uuid-v4 string string
text Full document text (raw + normalized) text
summary LLM-generated summary + key bullets text
entities [{"name":"...","type":"person","conf":0.95}, ...] array
edges [{"from":"A","verb":"owns","to":"B"}, ...] array
para "P" | "A" | "R" | "Archive" enum
loom "facts" | "decisions" | "reports" | "predictions" string
darwin_score 0.0 — 1.0 quality gate score float
created_at ISO-8601 timestamp datetime
source "telegram" | "file" | "webhook" | "cron" string
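The payload half of a point, written out as a literal Python dict matching the schema above (all values are illustrative):

```python
from datetime import datetime, timezone

# Example payload for one Qdrant point; field names follow the table above
point_payload = {
    "text": "Alice owns the Widget project.",
    "summary": "Alice is the Widget project owner.",
    "entities": [{"name": "Alice", "type": "person", "conf": 0.95}],
    "edges": [{"from": "Alice", "verb": "owns", "to": "Widget"}],
    "para": "P",
    "loom": "decisions",
    "darwin_score": 0.85,
    "created_at": datetime.now(timezone.utc).isoformat(),
    "source": "telegram",
}
```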
DARWINISM SCORING
0.85+
🌟 Excellent
Full enrichment, all components present
0.70–0.84
✅ Good
Minor gaps, still high quality
0.50–0.69
⚠️ Fair
Missing components, consider enrich
<0.50
❌ Poor
Quarantine — needs manual review
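The bands above reduce to a small threshold ladder; a sketch, with the function name and band labels as assumptions:

```python
def darwin_band(score: float) -> str:
    """Map a Darwinism score to its quality band (cutoffs from the table above)."""
    if score >= 0.85:
        return "excellent"
    if score >= 0.70:
        return "good"
    if score >= 0.50:
        return "fair"
    return "quarantine"  # below 0.50 — needs manual review
```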
THE 4 LOOMS
🔬
Facts threshold: 0.95
Verifiable, objective, measurable data
⚖️
Decisions threshold: 0.90
Resolved choices with context and outcome
📊
Reports threshold: 0.85
Analysis, summaries, structured findings
🔮
Predictions threshold: 0.50
Forecasts, hypotheses, speculative content

Error Handling & Resilience

Every stage has a fallback path — nothing silently fails

🔁
Retry with Exponential Backoff
Transient failures (network blip, Qdrant timeout, LLM rate limit) trigger a retry: 1s → 2s → 4s → 8s → 16s max. After 5 attempts, escalate.
RETRY
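The backoff schedule described above might look like this in code; `with_retry` is a sketch, not the pipeline's actual helper:

```python
import time

def with_retry(fn, max_attempts: int = 5, base_delay: float = 1.0, sleep=time.sleep):
    """Retry fn with exponential backoff (1s, 2s, 4s, 8s), then escalate."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of retries — escalate
            sleep(base_delay * 2 ** attempt)

# Demo: a function that fails twice, then succeeds on the third attempt
attempts = []
def flaky():
    attempts.append(1)
    if len(attempts) < 3:
        raise TimeoutError("transient network blip")
    return "ok"

result = with_retry(flaky, sleep=lambda s: None)  # no-op sleep keeps the demo instant
```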
🚫
Quarantine — Manual Review
Documents that fail parse (corrupt file, encoding error) or score below 0.50 Darwinism are moved to quarantine. An alert is fired to notify operators.
QUARANTINE
📮
Dead Letter Queue (DLQ)
Unrecoverable failures after max retries go to DLQ at /opt/enrichment-v2/dlq/. Stored as JSON with full error context for later reprocessing or investigation.
DEAD LETTER
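A sketch of what writing to the DLQ might look like; the function name and record fields are assumptions, and the demo writes to a temp directory rather than /opt/enrichment-v2/dlq/:

```python
import json
import os
import tempfile
import time

def to_dlq(doc_id: str, error: Exception, payload: dict, dlq_dir: str) -> str:
    """Persist an unrecoverable failure as JSON with full error context."""
    os.makedirs(dlq_dir, exist_ok=True)
    record = {
        "doc_id": doc_id,
        "error": repr(error),      # full error context for later investigation
        "payload": payload,
        "failed_at": time.time(),
    }
    path = os.path.join(dlq_dir, f"{doc_id}.json")
    with open(path, "w") as f:
        json.dump(record, f)
    return path

path = to_dlq("doc-42", ValueError("unrecoverable parse error"),
              {"text": "corrupt"}, tempfile.mkdtemp())
```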
🛡️
Circuit Breaker
If Qwen3 embedder or Qdrant exceed 50% error rate in a 30s window, the circuit trips. New requests are fast-failed immediately without hammering the service.
CIRCUIT BREAKER
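A minimal sliding-window breaker implementing the 50%-error-in-30s rule above; the `min_calls` floor is an added assumption so a single early failure can't trip the circuit:

```python
import time
from typing import List, Optional, Tuple

class CircuitBreaker:
    """Trips when the error rate inside a sliding window exceeds a threshold."""

    def __init__(self, error_rate: float = 0.5, window_s: float = 30.0,
                 min_calls: int = 10):
        self.error_rate = error_rate
        self.window_s = window_s
        self.min_calls = min_calls               # floor is an assumption
        self.events: List[Tuple[float, bool]] = []  # (timestamp, succeeded)

    def record(self, ok: bool, now: Optional[float] = None) -> None:
        now = time.monotonic() if now is None else now
        cutoff = now - self.window_s
        self.events = [(t, o) for t, o in self.events if t >= cutoff]
        self.events.append((now, ok))

    def is_open(self) -> bool:
        """True means fast-fail new requests instead of hammering the service."""
        if len(self.events) < self.min_calls:
            return False
        errors = sum(1 for _, ok in self.events if not ok)
        return errors / len(self.events) > self.error_rate

# Demo: 3 failures out of 4 calls inside the window trips the breaker
cb = CircuitBreaker(min_calls=4)
for i, ok in enumerate([True, False, False, False]):
    cb.record(ok, now=float(i))
```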

Pipeline Monitoring & Observability

Real-time health metrics from the running enrichment system

Success Rate (24h)
97.3%
▼ 0.2% vs yesterday
Avg Latency (p50)
1.3s
p95: 3.2s · p99: 6.8s
Queue Backlog
24
↑ 8 since 1h ago
Error Rate (24h)
2.7%
↑ 0.2% vs yesterday
Docs Processed Today
1,847
↑ 12% vs same time yesterday
Embedder Latency
~300ms
Qwen3-4B @ /opt/qwen3-embedder
STAGE LATENCY BREAKDOWN
INGEST
~5ms
PARSE
~20ms
ENTITY
~15ms
SUMMARIZE
~800ms
CLASSIFY
~50ms
RELATE
~100ms
EMBED
~300ms
ROUTE
~10ms
STORE
~25ms
CONFIRM
~5ms
STATUS ACTIVE
DOCS TODAY 1,847
AVG LATENCY 1.3s
QUEUE 24
SUCCESS 97.3%
QDRANT localhost:6333
EMBEDDER Qwen3-4B
TOTAL POINTS 34,649