The Lang Graph Librarian

Hive Mind Document Processing Pipeline

Pipeline Active — 10 Stages — Real-time Enrichment
📥
INGEST
🔍
PARSE
🏷
ENTITY
📝
SUMMARIZE
📂
CLASSIFY
🔗
RELATE
🧠
EMBED
🚦
ROUTE
💾
STORE
✅
CONFIRM

Multi-Source Document Ingestion

Documents enter the pipeline from anywhere — chat, files, webhooks, or scheduled jobs

💬
Chat Messages
Telegram, Discord, direct input — text captured from conversations
📄
File Upload
PDF, MD, TXT, DOCX — parsed by /opt/librarian/server.py
🪝
Webhooks
External systems push data via HTTP webhook endpoints
⏰
Cron Jobs
Scheduled enrichment — periodic doc refresh & re-scoring
🔄
API Calls
Direct enrichment-v2 API at /opt/enrichment-v2/server.py

The 10-Stage Enrichment Pipeline

Each stage transforms, validates, and passes data forward — with error handling at every step

Stage 01
📥 INGEST
Raw input arrives from any source: chat message, uploaded file, webhook payload, or cron trigger. The Librarian normalizes it into a standard document envelope.
chat file webhook cron
~5ms
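The envelope schema is not spelled out here; a minimal sketch of the Stage 1 normalization, where `make_envelope` and every field name are illustrative assumptions:

```python
import uuid
from datetime import datetime, timezone

def make_envelope(raw: str, source: str) -> dict:
    """Wrap raw input in a standard document envelope (field names assumed)."""
    return {
        "doc_id": str(uuid.uuid4()),
        "source": source,  # "chat" | "file" | "webhook" | "cron"
        "text": raw,
        "received_at": datetime.now(timezone.utc).isoformat(),
        "stage": "ingest",
    }

env = make_envelope("Meeting notes about [[Project X]]", "chat")
```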
Stage 02
🔍 PARSE
Extract plain text, structure metadata (author, date, language, mime-type). Strip formatting, normalize whitespace, detect charset.
text metadata parse_fail
~20ms
Stage 03
🏷 ENTITY EXTRACT
Scan for [[Entity Name]] wiki-style patterns and extract all named entities with their types into a structured list.
entities[] regex_err
~15ms
Stage 04
📝 SUMMARIZE
Generate a structured summary with key points. Optional: bullet extraction, TL;DR generation, key quotes identification.
summary bullets[] llm_timeout
~800ms
Stage 05
📂 CLASSIFY (PARA)
Categorize into Projects, Areas, Resources, or Archive. Uses content heuristics + prior routing history to pick the right bucket.
P A R Archive ambig
~50ms
Stage 06
🔗 RELATE
Map [[A]] VERB [[B]] relationship patterns. Builds the knowledge graph edges between entities.
edges[] triplets[] ambiguous_rel
~100ms
Stage 07
🧠 EMBED
Send text to Qwen3-4B embedder at /opt/qwen3-embedder/server.py. Produces a 2560-dimensional vector.
text vector[2560] embedder_timeout
~300ms
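The call to the embedder server might look like the sketch below; the port, path, and JSON shape are assumptions, only the server location (/opt/qwen3-embedder/server.py) and the 2560-dim output come from the source:

```python
import json
import urllib.request

EMBEDDER_URL = "http://localhost:8000/embed"  # port and path are assumptions

def embed_request(text: str) -> urllib.request.Request:
    """Build the POST request for the Qwen3-4B embedder (body shape assumed)."""
    body = json.dumps({"input": text}).encode()
    return urllib.request.Request(
        EMBEDDER_URL,
        data=body,
        headers={"Content-Type": "application/json"},
    )

# resp = urllib.request.urlopen(embed_request(doc_text))
# -> assumed to return {"vector": [...]} with 2560 floats
req = embed_request("hello embedder")
```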
Stage 08
🚦 ROUTE
Determine which Qdrant collection(s) to store in. Checks loom confidence scores (Facts 0.95, Decisions 0.90, Reports 0.85, Predictions 0.50).
collections[] no_match
~10ms
Stage 09
💾 STORE
Write to Qdrant at localhost:6333. Upserts point with vector + full payload (text, metadata, entities, summary, edges).
point_id collections[] qdrant_err
~25ms
Stage 10
✅ CONFIRM
Return the stored point ID, update document tracking, fire webhooks for downstream consumers. Pipeline complete.
doc_id point_ids[] darwin_score
~5ms
Stages 3 & 4 run in PARALLEL — Entity extraction + Summarization

Parallel Processing & Swim Lanes

Where the pipeline splits, processes concurrently, and rejoins

STAGE
PARALLEL TRACK A
PARALLEL TRACK B
PARSE (Stage 2)
TRACK A
Text Extraction
Strip markup, extract body text, normalize encoding
TRACK B
Metadata Extraction
Author, date, language, mime-type, file size
TRACK A
→ PASSES TO
Entity Extract (Stage 3) + Embed (Stage 7)
TRACK B
→ PASSES TO
Classify (Stage 5) + Store payload
ENTITY + SUMMARY
(Stages 3 & 4)
TRACK A — ENTITY
[[Entity]] Extraction
Regex scan for [[Name]] patterns → entity list with types
TRACK B — SUMMARY
LLM Summarization
Key points + bullets via enrichment-v2 LLM call
REJOIN
⟵ JOIN Entity list + Summary merge into payload
JOIN ⟶ Full payload → Stage 7 Embed
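The fork-and-join above can be sketched with `asyncio.gather`; the two stage functions here are simplified stubs standing in for the real Stage 3/4 implementations:

```python
import asyncio

async def extract_entities(text: str) -> list:
    # Stage 3 stub — the real version runs the [[Entity]] regex scan
    return [w.strip("[]") for w in text.split() if w.startswith("[[")]

async def summarize(text: str) -> str:
    # Stage 4 stub — the real version calls the enrichment-v2 LLM
    return text[:60]

async def enrich(text: str) -> dict:
    # Tracks A and B run concurrently, then rejoin into one payload
    entities, summary = await asyncio.gather(
        extract_entities(text), summarize(text)
    )
    return {"entities": entities, "summary": summary}

payload = asyncio.run(enrich("[[Alice]] owns [[Widget]]"))
```

The merged payload then flows to Stage 7 for embedding, exactly as the JOIN row describes.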

Core Pipeline Logic

Actual pseudocode from our enrichment system

entity_extractor.py python
# [[Entity Name]] pattern extraction — Stage 3
import re
from typing import List

ENTITY_PATTERN = r'\[\[([^\]]+)\]\]'

def extract_entities(text: str) -> List[dict]:
    # Find all [[Entity]] mentions
    raw_matches = re.findall(ENTITY_PATTERN, text)

    entities = []
    for name in raw_matches:
        entity = {
            "name": name.strip(),
            "type": infer_type(name),
            "confidence": 0.95,
            "aliases": []
        }
        entities.append(entity)

    return entities

# [[A]] VERB [[B]] relationship extraction — Stage 6
RELATION_PATTERN = r'\[\[([^\]]+)\]\]\s+(\w+)\s+\[\[([^\]]+)\]\]'

def extract_relations(text: str) -> List[dict]:
    matches = re.finditer(RELATION_PATTERN, text)
    triplets = []
    for m in matches:
        triplets.append({
            "subject": m.group(1),
            "verb": m.group(2),
            "object": m.group(3),
            "loom": classify_verb(m.group(2))
        })
    return triplets
para_classifier.py python
# PARA classification — Stage 5
from enum import Enum

class PARA(Enum):
    PROJECT = "P"
    AREA     = "A"
    RESOURCE = "R"
    ARCHIVE  = "Archive"

def classify_para(doc: Document) -> PARA:
    score = { PARA.PROJECT: 0, PARA.AREA: 0, PARA.RESOURCE: 0 }

    # Heuristic scoring
    if doc.has_verbs("plan,build,launch,create,ship"):
        score[PARA.PROJECT] += 3
    if doc.has_verbs("manage,maintain,support,oversee"):
        score[PARA.AREA] += 3
    if doc.has_keywords("guide,reference,tutorial,docs"):
        score[PARA.RESOURCE] += 3
    if doc.is_stale(): # >90 days no update
        return PARA.ARCHIVE

    best = max(score, key=score.get)
    return best
router.py python
# Collection routing — Stage 8
from dataclasses import dataclass
from typing import List

@dataclass
class Loom:
    name: str
    collection: str
    threshold: float

LOOMS = [
    Loom("facts",       "exocortex-facts",       0.95),
    Loom("decisions",   "exocortex-decisions",   0.90),
    Loom("reports",     "exocortex-reports",     0.85),
    Loom("predictions", "exocortex-predictions", 0.50),
]

def route_collections(doc: Document) -> List[str]:
    scores = score_looms(doc)   # LLM classifies loom scores
    hits = []
    for loom in LOOMS:
        if scores.get(loom.name, 0) >= loom.threshold:
            hits.append(loom.collection)

    # Always store in PARA collection too
    hits.append(para_collection(doc.para))
    return list(set(hits))   # deduplicate
darwinism.py python
# Darwinism scoring — quality gate at end of pipeline
def darwinism_score(doc: Document) -> float:
    score = 1.0

    # Penalize missing components
    if not doc.summary:     score -= 0.20
    if not doc.entities:    score -= 0.10
    if not doc.edges:       score -= 0.15

    # Penalize short content (penalties stack: under 50 chars loses 0.40)
    if len(doc.text) < 100:    score -= 0.25
    if len(doc.text) < 50:     score -= 0.15

    # Bonus for high loom confidence
    if doc.loom_score and doc.loom_score >= 0.90:
        score += 0.1

    return max(0.0, min(1.0, score))

The Storage Layer — Qdrant Collections

Each document is vectorized and stored in one or more collections based on its loom + PARA classification

exocortex-decisions
Decisions Loom
Stores resolved decisions with context, alternatives considered, and outcome. Confidence threshold: 0.90
2,847 points
2560 dims
HNSW index
🎯 Confidence: 0.90+
para-projects
PARA: Projects
Active projects with goals, milestones, owners, and status. Short-lived, high-action items.
1,204 points
2560 dims
PQ index
📁 PARA Category P
para-areas
PARA: Areas
Ongoing responsibilities and focus areas — work, health, finances, relationships. Medium-term horizon.
3,521 points
2560 dims
HNSW index
📁 PARA Category A
para-resources
PARA: Resources
Reference material, topics of interest, tutorials, documentation. Long-term, rarely changes.
8,943 points
2560 dims
HNSW index
📁 PARA Category R
exocortex-facts
Facts Loom
Verifiable facts, definitions, measurements, and objective data. Highest confidence threshold: 0.95
12,456 points
2560 dims
HNSW index
🎯 Confidence: 0.95+
para-archive
PARA: Archive
Completed, abandoned, or stale documents. Kept for reference but not actively surfaced.
5,678 points
2560 dims
plain index
📁 PARA Category Archive

Qdrant Point Structure — Vector + Payload

Every stored point contains the 2560-dim embedding plus the full enriched document payload

QDRANT POINT STRUCTURE
vector [-0.0231, 0.0847, -0.0412, 0.0193, ... x2556 more] 2560 dims
id uuid-v4 string string
text Full document text (raw + normalized) text
summary LLM-generated summary + key bullets text
entities [{"name":"...","type":"person","conf":0.95}, ...] array
edges [{"from":"A","verb":"owns","to":"B"}, ...] array
para "P" | "A" | "R" | "Archive" enum
loom "facts" | "decisions" | "reports" | "predictions" string
darwin_score 0.0 — 1.0 quality gate score float
created_at ISO-8601 timestamp datetime
source "telegram" | "file" | "webhook" | "cron" string
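The payload half of a point, written out as a literal Python dict matching the schema above (all values are illustrative):

```python
from datetime import datetime, timezone

# Example payload for one Qdrant point; field names follow the table above
point_payload = {
    "text": "Alice owns the Widget project.",
    "summary": "Alice is the Widget project owner.",
    "entities": [{"name": "Alice", "type": "person", "conf": 0.95}],
    "edges": [{"from": "Alice", "verb": "owns", "to": "Widget"}],
    "para": "P",
    "loom": "decisions",
    "darwin_score": 0.85,
    "created_at": datetime.now(timezone.utc).isoformat(),
    "source": "telegram",
}
```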
DARWINISM SCORING
0.85+
🌟 Excellent
Full enrichment, all components present
0.70–0.84
✅ Good
Minor gaps, still high quality
0.50–0.69
⚠️ Fair
Missing components, consider enrich
<0.50
❌ Poor
Quarantine — needs manual review
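The bands above reduce to a small threshold ladder; a sketch, with the function name and band labels as assumptions:

```python
def darwin_band(score: float) -> str:
    """Map a Darwinism score to its quality band (cutoffs from the table above)."""
    if score >= 0.85:
        return "excellent"
    if score >= 0.70:
        return "good"
    if score >= 0.50:
        return "fair"
    return "quarantine"  # below 0.50 — needs manual review
```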
THE 4 LOOMS
🔬
Facts threshold: 0.95
Verifiable, objective, measurable data
⚖️
Decisions threshold: 0.90
Resolved choices with context and outcome
📊
Reports threshold: 0.85
Analysis, summaries, structured findings
🔮
Predictions threshold: 0.50
Forecasts, hypotheses, speculative content

Error Handling & Resilience

Every stage has a fallback path — nothing silently fails

🔁
Retry with Exponential Backoff
Transient failures (network blip, Qdrant timeout, LLM rate limit) trigger a retry: 1s → 2s → 4s → 8s → 16s max. After 5 attempts, escalate.
RETRY
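The backoff schedule described above might look like this in code; `with_retry` is a sketch, not the pipeline's actual helper:

```python
import time

def with_retry(fn, max_attempts: int = 5, base_delay: float = 1.0, sleep=time.sleep):
    """Retry fn with exponential backoff (1s, 2s, 4s, 8s), then escalate."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of retries — escalate
            sleep(base_delay * 2 ** attempt)

# Demo: a function that fails twice, then succeeds on the third attempt
attempts = []
def flaky():
    attempts.append(1)
    if len(attempts) < 3:
        raise TimeoutError("transient network blip")
    return "ok"

result = with_retry(flaky, sleep=lambda s: None)  # no-op sleep keeps the demo instant
```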
🚫
Quarantine — Manual Review
Documents that fail parse (corrupt file, encoding error) or score below 0.50 Darwinism are moved to quarantine. An alert is fired to notify operators.
QUARANTINE
📮
Dead Letter Queue (DLQ)
Unrecoverable failures after max retries go to DLQ at /opt/enrichment-v2/dlq/. Stored as JSON with full error context for later reprocessing or investigation.
DEAD LETTER
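A sketch of what writing to the DLQ might look like; the function name and record fields are assumptions, and the demo writes to a temp directory rather than /opt/enrichment-v2/dlq/:

```python
import json
import os
import tempfile
import time

def to_dlq(doc_id: str, error: Exception, payload: dict, dlq_dir: str) -> str:
    """Persist an unrecoverable failure as JSON with full error context."""
    os.makedirs(dlq_dir, exist_ok=True)
    record = {
        "doc_id": doc_id,
        "error": repr(error),      # full error context for later investigation
        "payload": payload,
        "failed_at": time.time(),
    }
    path = os.path.join(dlq_dir, f"{doc_id}.json")
    with open(path, "w") as f:
        json.dump(record, f)
    return path

path = to_dlq("doc-42", ValueError("unrecoverable parse error"),
              {"text": "corrupt"}, tempfile.mkdtemp())
```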
🛡️
Circuit Breaker
If Qwen3 embedder or Qdrant exceed 50% error rate in a 30s window, the circuit trips. New requests are fast-failed immediately without hammering the service.
CIRCUIT BREAKER
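A minimal sliding-window breaker implementing the 50%-error-in-30s rule above; the `min_calls` floor is an added assumption so a single early failure can't trip the circuit:

```python
import time
from typing import List, Optional, Tuple

class CircuitBreaker:
    """Trips when the error rate inside a sliding window exceeds a threshold."""

    def __init__(self, error_rate: float = 0.5, window_s: float = 30.0,
                 min_calls: int = 10):
        self.error_rate = error_rate
        self.window_s = window_s
        self.min_calls = min_calls               # floor is an assumption
        self.events: List[Tuple[float, bool]] = []  # (timestamp, succeeded)

    def record(self, ok: bool, now: Optional[float] = None) -> None:
        now = time.monotonic() if now is None else now
        cutoff = now - self.window_s
        self.events = [(t, o) for t, o in self.events if t >= cutoff]
        self.events.append((now, ok))

    def is_open(self) -> bool:
        """True means fast-fail new requests instead of hammering the service."""
        if len(self.events) < self.min_calls:
            return False
        errors = sum(1 for _, ok in self.events if not ok)
        return errors / len(self.events) > self.error_rate

# Demo: 3 failures out of 4 calls inside the window trips the breaker
cb = CircuitBreaker(min_calls=4)
for i, ok in enumerate([True, False, False, False]):
    cb.record(ok, now=float(i))
```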

Pipeline Monitoring & Observability

Real-time health metrics from the running enrichment system

Success Rate (24h)
97.3%
▼ 0.2% vs yesterday
Avg Latency (p50)
1.3s
p95: 3.2s · p99: 6.8s
Queue Backlog
24
↑ 8 since 1h ago
Error Rate (24h)
2.7%
↑ 0.2% vs yesterday
Docs Processed Today
1,847
↑ 12% vs same time yesterday
Embedder Latency
~300ms
Qwen3-4B @ /opt/qwen3-embedder
STAGE LATENCY BREAKDOWN
INGEST
~5ms
PARSE
~20ms
ENTITY
~15ms
SUMMARIZE
~800ms
CLASSIFY
~50ms
RELATE
~100ms
EMBED
~300ms
ROUTE
~10ms
STORE
~25ms
CONFIRM
~5ms
STATUS ACTIVE
DOCS TODAY 1,847
AVG LATENCY 1.3s
QUEUE 24
SUCCESS 97.3%
QDRANT localhost:6333
EMBEDDER Qwen3-4B
TOTAL POINTS 34,649