Daniel Kliewer

SOVEREIGN: The Unified Architecture — A Magnum Opus for Local-First AI Systems That Think for Themselves

by Daniel Kliewer · 48 min read

Tags: sovereign AI, local-first, MoE RAG, knowledge graph, agentic orchestration, data sovereignty, Ollama, Neo4j, ChromaDB, FastAPI, Next.js, local LLM, Control Boundary, audit-ready AI, autonomous agents, persona engineering, SpecGen, architecture, capstone, Python, TypeScript
SOVEREIGN: The Unified Architecture

A Magnum Opus for Local-First AI Systems That Think for Themselves

"The mind that runs on borrowed infrastructure answers to its landlord. Build your own floor."


Preface: Why This Post Exists

Every system I have built over the last several years was an answer to a problem I could not ignore.

SynthInt answered the problem of opaque identity: why should the values baked into an AI's persona belong to someone else? Dynamic Persona MoE RAG answered the problem of context drift: why should yesterday's dead context contaminate today's reasoning? The Private Knowledge Graph answered the problem of relational amnesia: why should the connections between ideas collapse into similarity scores that lose their meaning? DeerFlow 2.0 answered the problem of isolated execution: why should agents be monoliths when they can be swarms? OpenClaw answered the problem of cloud dependency: why should inference require a network request? SpecGen answered the problem of the blank page: why should code generation be non-deterministic when the specification is precise? mcbot01 answered the problem of foundation: why should every project rebuild the local-first scaffold from scratch?

Each of these was a partial answer. A module. A proof-of-concept that one piece of the sovereignty puzzle could be built, deployed, and owned.

This post is the synthesis.

SOVEREIGN — Self-owned Orchestration of Versatile Expert Reasoning, Evaluation, Intelligence, Governance, and Network — is the unified architecture that collapses all of these systems into a single coherent project. It is not a rewrite. It is an integration. Every module you have read about on this site is a subsystem in the larger machine. This post is the blueprint for assembling that machine.

I am writing this for myself first. Then for you — the person who read the Sovereignty Manifesto, who runs Ollama on local hardware, who understands intuitively that the architecture you choose encodes your values. You already know why this matters. This post is about how to build it.

And specifically: this post is written so that a coding agent — given nothing but this document as context — can construct the entire SOVEREIGN system from scratch. The architecture is fully specified here. The scaffolding is complete. The philosophy is embedded in the structure itself, because in sovereign AI, the code is always the philosophy.


I. The Thesis: One Problem, Seven Partial Answers, One Synthesis

The core problem of AI in 2026 is not capability. It is ownership.

The most capable models in the world run on hardware you do not control, store context you did not authorize, evolve in directions you did not choose, and serve objectives that were never yours. You interact with them through an interface that was designed to maximize your dependency, not your agency. The extraction is architectural. It was designed in.

I have spent the better part of a decade building the counter-architecture. Not as a rejection of capability — the sovereign stack I describe here is extraordinarily capable — but as a rejection of the trade embedded in every cloud AI interaction: your context in exchange for their compute.

The seven systems that SOVEREIGN synthesizes each resolved one dimension of this problem:

| System | Problem Solved | Core Contribution |
| --- | --- | --- |
| SynthInt / Dynamic Persona MoE RAG | Opaque identity, static personas | Personas as versioned, auditable JSON; MoE routing to specialized reasoning agents |
| Private Knowledge Graph | Relational amnesia, flat vector retrieval | Explicit semantic relationships via NetworkX/Neo4j; provenance-tracked multi-hop reasoning |
| DeerFlow 2.0 | Monolithic agent execution | SuperAgent harness; AIO sandbox; persistent memory across agent invocations |
| OpenClaw | Cloud inference dependency | Fully local agent runtime via Ollama + llama.cpp; zero-telemetry execution paths |
| SpecGen | Non-deterministic code generation | Spec-driven, RAG-grounded code generation; deterministic output from structured input |
| mcbot01 | Fragmented local-first scaffolding | Reactive UI + async FastAPI backend as the reusable foundation layer |
| Control Boundary Engine | No governance in the execution path | Intent evaluation before execution; audit-ready pipelines; Colorado AI Act "Reasonable Care" compliance |

SOVEREIGN does not replace these systems. It is the environment in which they all run together, passing context to one another through a shared memory substrate, governed by a unified evaluation loop, and exposed through a single interface.

The result is not merely a better RAG system. It is a local-first AI operating system — a platform for thought that you own completely.


II. Architecture Overview: The Seven Layers

SOVEREIGN is organized as seven concentric layers. Each layer is independently deployable, testable, and replaceable. The boundaries between layers are explicit interfaces, not implementation assumptions. This is the sovereignty principle applied to architecture itself: no layer should be dependent on the internal implementation of another.

┌─────────────────────────────────────────────────────────────────────┐
│  LAYER 7: INTERFACE LAYER                                           │
│  Next.js 16 (App Router) + React + TypeScript                       │
│  Conversational UI · Session Management · Persona Selector          │
├─────────────────────────────────────────────────────────────────────┤
│  LAYER 6: API GATEWAY LAYER                                         │
│  FastAPI · REST/GraphQL · WebSocket streaming · Auth middleware      │
│  Request validation · Rate limiting · Audit log emission            │
├─────────────────────────────────────────────────────────────────────┤
│  LAYER 5: ORCHESTRATION LAYER                                       │
│  MoE Orchestrator · Agent Swarm Router · DeerFlow SuperAgent        │
│  Intent classification · Persona activation · Result aggregation    │
├─────────────────────────────────────────────────────────────────────┤
│  LAYER 4: GOVERNANCE LAYER                                          │
│  Control Boundary Engine · Evaluation Loop · Audit Trail            │
│  Intent evaluation · Output scoring · Hallucination detection       │
├─────────────────────────────────────────────────────────────────────┤
│  LAYER 3: REASONING LAYER                                           │
│  Dynamic Persona Engine · Specialist Agent Pool · SpecGen           │
│  Persona lifecycle · Bounded trait evolution · Code synthesis       │
├─────────────────────────────────────────────────────────────────────┤
│  LAYER 2: MEMORY LAYER                                              │
│  Knowledge Graph (Neo4j/NetworkX) · Vector Store (ChromaDB)         │
│  Episodic memory · Semantic graph · Embedding index · Pruning       │
├─────────────────────────────────────────────────────────────────────┤
│  LAYER 1: INFERENCE LAYER                                           │
│  Ollama · llama.cpp · Local model registry                          │
│  On-prem inference · Zero telemetry · Reproducible seeds            │
└─────────────────────────────────────────────────────────────────────┘

Every request in SOVEREIGN flows downward through these layers and returns upward. The path is never short-circuited. There is no "fast path" that skips governance. There is no "trusted caller" that bypasses the evaluation loop. The architecture enforces the principle that accountability is not optional — it is structural.
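The no-short-circuit rule can be sketched as an explicit pipeline. The handler names below are illustrative stand-ins, not part of the SOVEREIGN codebase; the point is that every request traverses every layer, in order, with no bypass.

```python
from typing import Any, Callable, Dict, List

Handler = Callable[[Dict[str, Any]], Dict[str, Any]]

def build_pipeline(layers: List[Handler]) -> Handler:
    """Compose layers so every request passes through all of them, in order."""
    def run(request: Dict[str, Any]) -> Dict[str, Any]:
        for layer in layers:  # no short-circuit: governance always runs
            request = layer(request)
        return request
    return run

# Each hypothetical layer annotates the request; a real layer would
# validate, audit, or route.
def gateway(req):      return {**req, "validated": True}
def governance(req):   return {**req, "audited": True}
def orchestrator(req): return {**req, "routed_to": "specialist"}

pipeline = build_pipeline([gateway, governance, orchestrator])
result = pipeline({"prompt": "hello"})
# result now carries the marks of every layer it passed through
```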


III. The Memory Substrate: Dual-Layer Sovereign Memory

The most important architectural decision in SOVEREIGN is the structure of memory. Memory determines what the system knows, what it can reason about, and what it forgets.

SOVEREIGN uses a dual-substrate memory architecture: a semantic knowledge graph for relational, provenance-tracked long-term memory, and a vector store for high-dimensional similarity retrieval. These are not interchangeable. They are complementary, and the architecture uses them for different reasoning tasks.

3.1 The Semantic Knowledge Graph

The knowledge graph in SOVEREIGN is a persistent, typed, directional graph built on Neo4j (for production persistence) with a NetworkX in-memory layer for query-scoped reasoning. The graph is not a flat document store. It is a living model of your knowledge domain.

Every node in the graph carries:

  • A unique identifier and type
  • A source document reference (provenance)
  • A creation timestamp and last-accessed timestamp
  • A relevance decay coefficient (used by the pruning engine)
  • A confidence weight (updated by the evaluation loop)

Every edge in the graph carries:

  • A typed relationship label (CAUSES, SUPPORTS, CONTRADICTS, PRECEDES, DERIVES_FROM, etc.)
  • A weight (0.0–1.0) representing relationship strength
  • A source (which agent or document established this relationship)
  • A timestamp

This structure makes multi-hop reasoning explicit and auditable. When the system traces a path from Concept A to Claim B through Relationship R, that path is a first-class data structure you can inspect, export, and challenge. It is not a black-box attention pattern.

Python
# sovereign/memory/knowledge_graph.py

from dataclasses import dataclass, field
from datetime import datetime
from typing import Dict, List, Optional, Any
import networkx as nx
import uuid


@dataclass
class KGNode:
    """A typed, provenance-tracked node in the sovereign knowledge graph."""
    id: str
    label: str                          # Entity type: CONCEPT, CLAIM, DOCUMENT, AGENT, EVENT
    content: str                        # Human-readable representation
    source_document_id: str             # Provenance anchor
    confidence: float = 1.0             # Updated by evaluation loop
    access_count: int = 0               # Used by LRU-style pruning
    decay_coefficient: float = 0.95     # Per-session relevance decay
    created_at: str = field(default_factory=lambda: datetime.utcnow().isoformat())
    last_accessed_at: Optional[str] = None
    metadata: Dict[str, Any] = field(default_factory=dict)


@dataclass
class KGEdge:
    """A typed, weighted, traceable relationship in the sovereign knowledge graph."""
    id: str
    source_id: str
    target_id: str
    relationship: str                   # CAUSES, SUPPORTS, CONTRADICTS, PRECEDES, DERIVES_FROM
    weight: float = 1.0
    established_by: str = "system"      # Agent ID or document ID that created this edge
    created_at: str = field(default_factory=lambda: datetime.utcnow().isoformat())
    metadata: Dict[str, Any] = field(default_factory=dict)


class SovereignKnowledgeGraph:
    """
    Dual-substrate knowledge graph: persistent Neo4j backend with
    NetworkX in-memory layer for query-scoped reasoning.
    
    Design principle: every reasoning path is traceable.
    Every node has provenance. Every edge has an author.
    Nothing is inferred without a trail.
    """

    def __init__(self, config: Dict[str, Any]):
        self.config = config
        self.in_memory_graph = nx.DiGraph()
        self.nodes: Dict[str, KGNode] = {}
        self.edges: List[KGEdge] = []
        self._neo4j_driver = None
        self._init_neo4j()

    def _init_neo4j(self):
        """Initialize Neo4j connection if configured; fall back to pure NetworkX."""
        try:
            from neo4j import GraphDatabase
            self._neo4j_driver = GraphDatabase.driver(
                self.config.get("neo4j_uri", "bolt://localhost:7687"),
                auth=(
                    self.config.get("neo4j_user", "neo4j"),
                    self.config.get("neo4j_password", "sovereign")
                )
            )
        except Exception:
            # Graceful degradation: operate as pure in-memory graph
            self._neo4j_driver = None

    def add_node(self, label: str, content: str, source_document_id: str,
                 confidence: float = 1.0, metadata: Optional[Dict] = None) -> KGNode:
        node = KGNode(
            id=str(uuid.uuid4()),
            label=label,
            content=content,
            source_document_id=source_document_id,
            confidence=confidence,
            metadata=metadata or {}
        )
        self.nodes[node.id] = node
        self.in_memory_graph.add_node(
            node.id,
            label=label,
            content=content,
            confidence=confidence
        )
        if self._neo4j_driver:
            self._persist_node_to_neo4j(node)
        return node

    def add_edge(self, source_id: str, target_id: str, relationship: str,
                 weight: float = 1.0, established_by: str = "system") -> Optional[KGEdge]:
        if source_id not in self.nodes or target_id not in self.nodes:
            return None
        edge = KGEdge(
            id=str(uuid.uuid4()),
            source_id=source_id,
            target_id=target_id,
            relationship=relationship,
            weight=weight,
            established_by=established_by
        )
        self.edges.append(edge)
        self.in_memory_graph.add_edge(
            source_id, target_id,
            relationship=relationship,
            weight=weight
        )
        if self._neo4j_driver:
            self._persist_edge_to_neo4j(edge)
        return edge

    def find_reasoning_path(self, source_id: str, target_id: str,
                             relationship_filter: Optional[List[str]] = None) -> List[KGNode]:
        """
        Find an explicit, auditable reasoning path between two nodes.
        
        This is not similarity search. This is structured inference.
        The path returned is a chain of evidence, not a probability distribution.
        """
        try:
            path_ids = nx.shortest_path(self.in_memory_graph, source_id, target_id)
            path_nodes = [self.nodes[nid] for nid in path_ids if nid in self.nodes]
            if relationship_filter:
                # Filter edges along the path to the specified relationship types
                path_nodes = self._filter_path_by_relationship(path_ids, relationship_filter)
            # Update access counts — the memory knows it has been used
            for node in path_nodes:
                node.access_count += 1
                node.last_accessed_at = datetime.utcnow().isoformat()
            return path_nodes
        except (nx.NetworkXNoPath, nx.NodeNotFound):
            return []

    def apply_temporal_decay(self, decay_factor: Optional[float] = None):
        """
        Apply temporal decay to the confidence of nodes that have never
        been accessed, using each node's own decay coefficient unless an
        explicit override is given.

        Design philosophy: memory that is never accessed should fade.
        The system forgets gracefully, not catastrophically.
        Forgetting is not failure. It is discernment.
        """
        for node in self.nodes.values():
            if node.last_accessed_at is None:
                factor = decay_factor if decay_factor is not None else node.decay_coefficient
                node.confidence = max(0.01, node.confidence * factor)

    def prune_low_confidence_nodes(self, threshold: float = 0.1) -> List[str]:
        """
        Remove nodes whose confidence has decayed below the threshold,
        along with any edges that reference them.
        Returns list of pruned node IDs for audit logging.

        What is pruned is not destroyed — it is archived.
        Sovereignty includes the right to forget deliberately.
        """
        pruned_ids = [
            nid for nid, node in self.nodes.items()
            if node.confidence < threshold
        ]
        for nid in pruned_ids:
            self.in_memory_graph.remove_node(nid)
            del self.nodes[nid]
        # Drop edges that now dangle, so the edge list mirrors the graph
        pruned_set = set(pruned_ids)
        self.edges = [
            e for e in self.edges
            if e.source_id not in pruned_set and e.target_id not in pruned_set
        ]
        return pruned_ids

    def export_subgraph(self, node_ids: List[str]) -> Dict[str, Any]:
        """Export a subgraph for inspection, audit, or external analysis."""
        subgraph_nodes = {nid: self.nodes[nid] for nid in node_ids if nid in self.nodes}
        subgraph_edges = [
            e for e in self.edges
            if e.source_id in node_ids and e.target_id in node_ids
        ]
        return {
            "nodes": [vars(n) for n in subgraph_nodes.values()],
            "edges": [vars(e) for e in subgraph_edges],
            "exported_at": datetime.utcnow().isoformat()
        }

    def _persist_node_to_neo4j(self, node: KGNode):
        with self._neo4j_driver.session() as session:
            session.run(
                "MERGE (n:Node {id: $id}) "
                "SET n.label = $label, n.content = $content, "
                "n.source_document_id = $source_document_id, "
                "n.confidence = $confidence, n.created_at = $created_at",
                id=node.id, label=node.label, content=node.content,
                source_document_id=node.source_document_id,
                confidence=node.confidence, created_at=node.created_at
            )

    def _persist_edge_to_neo4j(self, edge: KGEdge):
        # Cypher cannot parameterize relationship types, so the label is
        # interpolated into the query. Restrict it to a safe identifier first.
        if not edge.relationship.replace("_", "").isalnum():
            raise ValueError(f"Invalid relationship label: {edge.relationship}")
        with self._neo4j_driver.session() as session:
            session.run(
                "MATCH (a:Node {id: $source_id}), (b:Node {id: $target_id}) "
                f"MERGE (a)-[r:{edge.relationship} {{id: $edge_id}}]->(b) "
                "SET r.weight = $weight, r.established_by = $established_by",
                source_id=edge.source_id, target_id=edge.target_id,
                edge_id=edge.id, weight=edge.weight,
                established_by=edge.established_by
            )

    def _filter_path_by_relationship(self, path_ids: List[str],
                                      allowed_relationships: List[str]) -> List[KGNode]:
        filtered = []
        for i in range(len(path_ids) - 1):
            edge_data = self.in_memory_graph.get_edge_data(path_ids[i], path_ids[i + 1])
            if edge_data and edge_data.get("relationship") in allowed_relationships:
                if path_ids[i] in self.nodes:
                    filtered.append(self.nodes[path_ids[i]])
        return filtered
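As a self-contained illustration of the same idea — assuming nothing beyond NetworkX itself — the explicit, inspectable path between two typed nodes takes only a few lines. Node identifiers and attributes here are invented for the example:

```python
import networkx as nx

# Two typed, provenance-carrying nodes and one typed edge
g = nx.DiGraph()
g.add_node("concept:local-inference", label="CONCEPT", source="doc-1")
g.add_node("claim:sovereignty", label="CLAIM", source="doc-1")
g.add_edge("concept:local-inference", "claim:sovereignty",
           relationship="SUPPORTS", weight=0.9)

# The reasoning path is a concrete list of node IDs, not an attention pattern
path = nx.shortest_path(g, "concept:local-inference", "claim:sovereignty")
# path == ["concept:local-inference", "claim:sovereignty"]
```

Every hop in the returned path can be looked up, exported, and challenged — which is exactly the audit property the class above is built around.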

3.2 The Vector Store Integration

The vector store (ChromaDB in development, Qdrant in production) handles the similarity retrieval that the knowledge graph cannot: dense semantic search across large document corpora where the exact relational structure is not yet known.

The critical design decision here is that the vector store feeds the knowledge graph, not the other way around. Vector retrieval surfaces candidate documents. The knowledge graph determines how those documents relate to each other and to the current query context. The vector store is a search index. The knowledge graph is the mind.

Python
# sovereign/memory/vector_store.py

from typing import List, Dict, Any, Optional
from datetime import datetime
import uuid

import chromadb
from chromadb.config import Settings


class SovereignVectorStore:
    """
    Local-first vector store with zero cloud dependency.
    
    ChromaDB in development (file-backed, no server required).
    Qdrant in production (local server, same guarantee).
    
    The embeddings are yours. The index is yours.
    Nothing is sent to an external endpoint.
    """

    def __init__(self, config: Dict[str, Any]):
        self.persist_directory = config.get("persist_directory", "./data/chromadb")
        self.collection_name = config.get("collection_name", "sovereign_documents")
        self.embedding_model = config.get("embedding_model", "nomic-embed-text")
        
        # File-backed persistence: data survives restarts on your hardware
        self.client = chromadb.PersistentClient(
            path=self.persist_directory,
            settings=Settings(anonymized_telemetry=False)  # Explicit: no telemetry
        )
        self.collection = self.client.get_or_create_collection(
            name=self.collection_name,
            metadata={"hnsw:space": "cosine"}
        )

    def embed_and_store(self, documents: List[Dict[str, Any]]) -> List[str]:
        """
        Embed documents and persist to local vector store.
        Returns document IDs for graph node linkage.
        """
        doc_ids = []
        for doc in documents:
            doc_id = doc.get("id", str(uuid.uuid4()))
            self.collection.add(
                documents=[doc["content"]],
                metadatas=[{
                    "source": doc.get("source", "unknown"),
                    "doc_type": doc.get("doc_type", "text"),
                    "created_at": datetime.utcnow().isoformat(),
                    "provenance": doc.get("provenance", "")
                }],
                ids=[doc_id]
            )
            doc_ids.append(doc_id)
        return doc_ids

    def query(self, query_text: str, n_results: int = 10,
              where_filter: Optional[Dict] = None) -> List[Dict[str, Any]]:
        """
        Semantic search over local embeddings.
        Returns results with full provenance metadata.
        """
        results = self.collection.query(
            query_texts=[query_text],
            n_results=n_results,
            where=where_filter,
            include=["documents", "metadatas", "distances"]
        )
        return [
            {
                "id": results["ids"][0][i],
                "content": results["documents"][0][i],
                "metadata": results["metadatas"][0][i],
                "relevance_score": 1.0 - results["distances"][0][i]
            }
            for i in range(len(results["ids"][0]))
        ]
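A minimal sketch of that handoff — with the retrieval results mocked rather than produced by ChromaDB, and node naming invented for the example — looks like this:

```python
import networkx as nx

# Mocked results standing in for SovereignVectorStore.query() output
retrieved = [
    {"id": "doc-a", "content": "Ollama serves local models", "relevance_score": 0.91},
    {"id": "doc-b", "content": "Local inference avoids telemetry", "relevance_score": 0.84},
]

g = nx.DiGraph()
query_id = "query:local-inference"
g.add_node(query_id, label="EVENT")
for hit in retrieved:
    g.add_node(hit["id"], label="DOCUMENT", content=hit["content"])
    # The edge records *why* a document entered the graph:
    # retrieval, carrying its similarity score as the edge weight
    g.add_edge(query_id, hit["id"],
               relationship="RETRIEVED", weight=hit["relevance_score"])
```

The vector store's output becomes graph structure, never the reverse — the direction of this dependency is the design decision.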

IV. The Inference Layer: Local Execution, Zero Dependency

The inference layer is non-negotiable. It is the foundation of every sovereignty guarantee in the system. If inference is remote, the entire stack is a thin wrapper over someone else's infrastructure. Sovereignty is not a frontend feature. It begins at the model.

SOVEREIGN's inference layer supports three execution modes:

Mode 1: Ollama (Primary) — HTTP interface to locally served models. Fast, easy to configure, supports quantized variants of Llama, Qwen, Mistral, Phi, and Gemma families.

Mode 2: llama.cpp (Fallback/Air-Gap) — Direct binary execution. No server process. No HTTP overhead. Used when network interface is unacceptable (air-gapped environments, maximum-security deployments).

Mode 3: Hybrid — Different specialist agents use different models. The orchestrator routes to the fastest suitable model for the current task. Code tasks go to a code-optimized model. Long-context tasks go to a high-context-window model. All models are local.
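A hybrid routing table might look like the sketch below. The model names, file paths, and the `resolve_model` helper are illustrative assumptions, mirroring the lookup that `route_to_specialist` performs later in this section:

```python
# Hypothetical model_registry for hybrid mode; entries are examples, not defaults
model_registry = {
    "routing": {
        "code_generation": "qwen2.5-coder",   # code-optimized model
        "long_context": "llama3.1",           # high-context-window model
        "classification": "phi3",             # small, fast model
    },
    "paths": {
        # GGUF paths used by llama.cpp in air-gapped mode
        "llama3.1": "./models/llama3.1.gguf",
    },
}

def resolve_model(task_type: str, default: str = "llama3.2") -> str:
    """Explicit, auditable lookup: unknown task types fall back to the default."""
    return model_registry["routing"].get(task_type, default)
```

Because the table is plain data, changing which model handles which task is an edit, not a retraining run.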

Python
# sovereign/inference/local_engine.py

from typing import Dict, Any, Optional, Generator
import requests
import subprocess
import json


class LocalInferenceEngine:
    """
    Unified interface to local model execution.
    
    Design invariant: no request leaves this machine.
    The api_endpoint, even in Ollama mode, resolves to localhost.
    There is no fallback to a cloud endpoint.
    If local inference fails, the system fails loudly — not silently to the cloud.
    """

    EXECUTION_MODES = ["ollama", "llama_cpp", "hybrid"]

    def __init__(self, config: Dict[str, Any]):
        self.mode = config.get("execution_mode", "ollama")
        self.ollama_endpoint = config.get("ollama_endpoint", "http://localhost:11434")
        self.llama_cpp_binary = config.get("llama_cpp_binary", "./bin/llama-cli")
        self.model_registry = config.get("model_registry", {})
        self.default_model = config.get("default_model", "llama3.2")
        self.seed = config.get("seed", 42)             # Reproducibility by default
        self.default_temperature = config.get("temperature", 0.1)
        
        self._validate_local_availability()

    def _validate_local_availability(self):
        """
        Refuse to initialize if no local inference backend is reachable.
        
        This is a hard failure, not a warning.
        Failing loudly protects sovereignty — a silent fallback would not.
        """
        if self.mode in ("ollama", "hybrid"):
            try:
                response = requests.get(f"{self.ollama_endpoint}/api/tags", timeout=5)
                response.raise_for_status()
            except Exception as e:
                raise RuntimeError(
                    f"SOVEREIGN requires local inference. Ollama is not reachable at "
                    f"{self.ollama_endpoint}. Start Ollama with `ollama serve` and retry.\n"
                    f"Original error: {e}"
                )

    def generate(self, prompt: str, system_prompt: str = "",
                 model: Optional[str] = None, temperature: Optional[float] = None,
                 max_tokens: int = 2000, seed: Optional[int] = None) -> str:
        """
        Generate a response from the local model.
        Returns the complete response text.
        """
        effective_model = model or self.default_model
        effective_temperature = temperature if temperature is not None else self.default_temperature
        effective_seed = seed if seed is not None else self.seed

        if self.mode in ("ollama", "hybrid"):
            # Hybrid mode shares the Ollama transport; model selection
            # happens upstream in route_to_specialist
            return self._generate_ollama(
                prompt, system_prompt, effective_model,
                effective_temperature, max_tokens, effective_seed
            )
        elif self.mode == "llama_cpp":
            return self._generate_llama_cpp(
                prompt, system_prompt, effective_model,
                effective_temperature, max_tokens
            )
        else:
            raise ValueError(f"Unknown execution mode: {self.mode}")

    def generate_stream(self, prompt: str, system_prompt: str = "",
                        model: Optional[str] = None) -> Generator[str, None, None]:
        """
        Stream tokens from local inference for real-time UI updates.
        Every token comes from your hardware.
        """
        effective_model = model or self.default_model
        payload = {
            "model": effective_model,
            "messages": [
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": prompt}
            ],
            "options": {"temperature": self.default_temperature, "seed": self.seed},
            "stream": True
        }
        with requests.post(
            f"{self.ollama_endpoint}/api/chat",
            json=payload,
            stream=True,
            timeout=120
        ) as response:
            for line in response.iter_lines():
                if line:
                    chunk = json.loads(line)
                    if not chunk.get("done"):
                        yield chunk.get("message", {}).get("content", "")

    def route_to_specialist(self, task_type: str, prompt: str,
                             system_prompt: str = "") -> str:
        """
        Route to the best local model for the given task type.
        
        The routing table is yours. You decide which model handles what.
        The routing logic is explicit, auditable, and modifiable.
        """
        routing_table = self.model_registry.get("routing", {})
        specialist_model = routing_table.get(task_type, self.default_model)
        return self.generate(prompt, system_prompt, model=specialist_model)

    def _generate_ollama(self, prompt: str, system_prompt: str, model: str,
                          temperature: float, max_tokens: int, seed: int) -> str:
        payload = {
            "model": model,
            "messages": [
                {"role": "system", "content": system_prompt or "You are a helpful, precise assistant."},
                {"role": "user", "content": prompt}
            ],
            "options": {
                "temperature": temperature,
                "seed": seed,
                "num_predict": max_tokens
            },
            "stream": False
        }
        response = requests.post(
            f"{self.ollama_endpoint}/api/chat",
            json=payload,
            timeout=120
        )
        response.raise_for_status()
        return response.json()["message"]["content"]

    def _generate_llama_cpp(self, prompt: str, system_prompt: str, model: str,
                              temperature: float, max_tokens: int) -> str:
        model_path = self.model_registry.get("paths", {}).get(model, model)
        # Chat template tokens are model-specific; adjust per model family
        full_prompt = f"<|system|>{system_prompt}<|user|>{prompt}<|assistant|>"
        result = subprocess.run(
            [
                self.llama_cpp_binary,
                "-m", model_path,
                "-p", full_prompt,
                "--temp", str(temperature),
                "-n", str(max_tokens),
                "--silent-prompt",
                "--no-display-prompt"
            ],
            capture_output=True, text=True, timeout=300
        )
        if result.returncode != 0:
            raise RuntimeError(f"llama.cpp execution failed: {result.stderr}")
        return result.stdout.strip()

V. The Persona Engine: Identity as a First-Class Data Structure

Every prior system I have built has wrestled with the same question: what is an AI persona, exactly? In corporate systems, it is a system prompt — a string of text injected at the top of the context window, ephemeral, invisible, unversioned, unauditable. You accept it as a default and interact with a character whose values you did not choose.

In SOVEREIGN, a persona is a typed, versioned, evolvable data structure with a complete lifecycle. It has traits (numeric weights that shape how the reasoning engine processes queries), expertise domains (which determine routing priority), an activation cost (used by the MoE orchestrator to balance resource allocation), and a performance history (updated by the evaluation loop after every query).

The persona is not the model. The model is a reasoning engine. The persona is a constraint vector applied to that engine. You can have dozens of personas sharing a single model instance. You can swap personas without changing the model. You can evolve a persona's trait weights based on its performance without retraining anything. The separation is total.
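The trait-evolution rule behind that evolvability, Δw = feedback × rate × (1 − w), can be checked with a few lines of arithmetic. This is a standalone sketch of the formula, not the engine itself:

```python
# Standalone sketch of the bounded update: delta = feedback * rate * (1 - w)
def bounded_update(weight: float, feedback: float, rate: float = 0.05) -> float:
    delta = feedback * rate * (1.0 - weight)
    return max(0.0, min(1.0, weight + delta))

# The (1 - w) term makes high weights move less under identical feedback:
low = bounded_update(0.2, feedback=1.0)    # 0.2 + 0.05 * 0.8 = 0.24
high = bounded_update(0.9, feedback=1.0)   # 0.9 + 0.05 * 0.1 = 0.905
```

A trait near 1.0 barely moves, so a persona cannot be pushed into runaway specialization by a streak of positive feedback.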

Python
# sovereign/reasoning/persona_engine.py

from dataclasses import dataclass, field
from typing import Dict, List, Optional, Any
from datetime import datetime
import json
import os
import uuid


@dataclass
class PersonaTrait:
    name: str
    weight: float       # 0.0 to 1.0
    description: str
    evolution_rate: float = 0.05    # How quickly this trait responds to feedback


@dataclass  
class PersonaPerformance:
    total_queries: int = 0
    total_score: float = 0.0
    last_used: Optional[str] = None
    success_rate: float = 0.0
    domain_scores: Dict[str, float] = field(default_factory=dict)

    @property
    def average_score(self) -> float:
        if self.total_queries == 0:
            return 0.0
        return self.total_score / self.total_queries


@dataclass
class Persona:
    """
    A sovereign persona: fully owned, fully auditable, fully evolvable.
    
    This is not a system prompt. It is a data structure with history,
    with traits that evolve according to rules you define,
    with performance metrics that you evaluate,
    and with a lifecycle that you control.
    """
    id: str
    name: str
    description: str
    traits: Dict[str, PersonaTrait]
    expertise: List[str]
    activation_cost: float = 0.3
    status: str = "experimental"        # experimental → active → stable → pruned
    version: int = 1
    created_at: str = field(default_factory=lambda: datetime.utcnow().isoformat())
    updated_at: Optional[str] = None
    performance: PersonaPerformance = field(default_factory=PersonaPerformance)
    evolution_log: List[Dict[str, Any]] = field(default_factory=list)
    system_prompt_template: str = ""

    def get_system_prompt(self, context: str = "") -> str:
        """Generate the system prompt from trait weights and context."""
        trait_descriptions = []
        for trait_name, trait in self.traits.items():
            if trait.weight > 0.6:
                trait_descriptions.append(f"strong {trait_name.replace('_', ' ')}")
            elif trait.weight > 0.3:
                trait_descriptions.append(f"moderate {trait_name.replace('_', ' ')}")
        
        trait_string = ", ".join(trait_descriptions) if trait_descriptions else "balanced reasoning"
        return (
            f"You are {self.name}. {self.description} "
            f"Your reasoning is characterized by: {trait_string}. "
            f"Your areas of expertise are: {', '.join(self.expertise)}. "
            f"{self.system_prompt_template} "
            f"{f'Current context: {context}' if context else ''}"
        ).strip()

    def apply_bounded_update(self, feedback_vector: Dict[str, float]) -> Dict[str, Any]:
        """
        Apply the bounded update function: Δw = f(feedback) × (1 − w)
        
        The (1 − w) term ensures convergence — high-weight traits resist
        extreme changes. This prevents runaway specialization.
        Stability is a design feature, not a constraint.
        """
        evolution_entry = {
            "timestamp": datetime.utcnow().isoformat(),
            "version": self.version,
            "changes": []
        }
        
        for trait_name, trait in self.traits.items():
            feedback_value = feedback_vector.get(trait_name, 0.0)
            delta = feedback_value * trait.evolution_rate * (1.0 - trait.weight)
            new_weight = max(0.0, min(1.0, trait.weight + delta))
            
            evolution_entry["changes"].append({
                "trait": trait_name,
                "from": trait.weight,
                "to": new_weight,
                "delta": new_weight - trait.weight,
                "feedback": feedback_value
            })
            trait.weight = new_weight
        
        self.version += 1
        self.updated_at = datetime.utcnow().isoformat()
        self.evolution_log.append(evolution_entry)
        return evolution_entry


class PersonaEngine:
    """
    Manages the complete lifecycle of sovereign personas.
    
    Experimental → Active → Stable → Pruned → Cold Storage → Recalled.
    The lifecycle is yours to govern.
    Nothing is deleted without your explicit instruction.
    Cold storage preserves everything for potential recall.
    """

    LIFECYCLE_STATES = ["experimental", "active", "stable", "pruned"]
    PERSONAS_DIR = "./data/personas"

    def __init__(self, config: Dict[str, Any]):
        self.config = config
        self.active_personas: Dict[str, Persona] = {}
        self.cold_storage: Dict[str, Persona] = {}
        self.personas_dir = config.get("personas_dir", self.PERSONAS_DIR)
        self._ensure_directory_structure()
        self._load_active_personas()

    def _ensure_directory_structure(self):
        for state in self.LIFECYCLE_STATES:
            os.makedirs(os.path.join(self.personas_dir, state), exist_ok=True)
        os.makedirs(os.path.join(self.personas_dir, "cold_storage"), exist_ok=True)

    def _load_active_personas(self):
        for state in ["experimental", "active", "stable"]:
            state_dir = os.path.join(self.personas_dir, state)
            for fname in os.listdir(state_dir):
                if fname.endswith(".json"):
                    with open(os.path.join(state_dir, fname)) as f:
                        data = json.load(f)
                        persona = self._deserialize_persona(data)
                        self.active_personas[persona.id] = persona
        # Pruned personas are reloaded into cold storage so recall_persona
        # can still find them after a restart.
        cold_dir = os.path.join(self.personas_dir, "cold_storage")
        for fname in os.listdir(cold_dir):
            if fname.endswith(".json"):
                with open(os.path.join(cold_dir, fname)) as f:
                    persona = self._deserialize_persona(json.load(f))
                    self.cold_storage[persona.id] = persona

    def route_to_persona(self, query: str, query_domain: str) -> List[Persona]:
        """
        Select the best personas for the current query using multi-factor routing.
        
        Routing considers: domain expertise match, activation cost,
        historical performance in the query domain, and current lifecycle state.
        Only stable and active personas participate in production routing.
        """
        candidates = [
            p for p in self.active_personas.values()
            if p.status in ("active", "stable")
        ]
        
        scored_candidates = []
        for persona in candidates:
            domain_match = 1.0 if query_domain in persona.expertise else 0.3
            historical_score = persona.performance.domain_scores.get(query_domain, 0.5)
            cost_penalty = 1.0 - persona.activation_cost
            composite_score = (
                0.4 * domain_match +
                0.4 * historical_score +
                0.2 * cost_penalty
            )
            scored_candidates.append((persona, composite_score))
        
        scored_candidates.sort(key=lambda x: x[1], reverse=True)
        max_parallel = self.config.get("max_parallel_personas", 3)
        return [p for p, _ in scored_candidates[:max_parallel]]

    def prune_persona(self, persona_id: str, reason: str = "performance_threshold") -> bool:
        """
        Retire a persona to cold storage. Not deletion — archival.
        The persona's full history is preserved.
        The reason is logged.
        It can be recalled if context warrants.
        """
        if persona_id not in self.active_personas:
            return False
        
        persona = self.active_personas[persona_id]
        previous_state = persona.status
        persona.status = "pruned"
        persona.updated_at = datetime.utcnow().isoformat()
        persona.evolution_log.append({
            "timestamp": datetime.utcnow().isoformat(),
            "event": "pruned",
            "reason": reason
        })
        
        self.cold_storage[persona_id] = persona
        del self.active_personas[persona_id]
        self._save_persona_to_state(persona, "cold_storage")
        # Remove the stale file from the previous lifecycle directory so the
        # persona is not reloaded as active on the next startup.
        stale_path = os.path.join(self.personas_dir, previous_state, f"{persona_id}.json")
        if os.path.exists(stale_path):
            os.remove(stale_path)
        return True

    def recall_persona(self, persona_id: str, query_context: str) -> Optional[Persona]:
        """
        Attempt to recall a pruned persona based on current query context.
        
        The system asks: is this dormant knowledge relevant again?
        If yes, it is restored. If no, it remains dormant.
        The question is explicit. The answer is auditable.
        """
        if persona_id not in self.cold_storage:
            return None
        
        persona = self.cold_storage[persona_id]
        # Compute context relevance by checking domain overlap
        query_terms = set(query_context.lower().split())
        expertise_terms = set(" ".join(persona.expertise).lower().split())
        overlap = len(query_terms & expertise_terms) / max(len(expertise_terms), 1)
        
        recall_threshold = self.config.get("recall_threshold", 0.3)
        if overlap >= recall_threshold:
            persona.status = "active"
            persona.updated_at = datetime.utcnow().isoformat()
            persona.evolution_log.append({
                "timestamp": datetime.utcnow().isoformat(),
                "event": "recalled",
                "context_overlap": overlap
            })
            self.active_personas[persona_id] = persona
            del self.cold_storage[persona_id]
            # Persist the recall and clear the stale cold-storage file so the
            # transition survives a restart.
            self._save_persona_to_state(persona, "active")
            cold_path = os.path.join(self.personas_dir, "cold_storage", f"{persona_id}.json")
            if os.path.exists(cold_path):
                os.remove(cold_path)
            return persona
        return None

    def _deserialize_persona(self, data: Dict[str, Any]) -> Persona:
        traits = {
            k: PersonaTrait(**v) if isinstance(v, dict) else PersonaTrait(
                name=k, weight=float(v), description="", evolution_rate=0.05
            )
            for k, v in data.get("traits", {}).items()
        }
        performance_data = data.get("performance", {})
        performance = PersonaPerformance(
            total_queries=performance_data.get("total_queries", 0),
            total_score=performance_data.get("total_score", 0.0),
            last_used=performance_data.get("last_used"),
            success_rate=performance_data.get("success_rate", 0.0),
            domain_scores=performance_data.get("domain_scores", {})
        )
        return Persona(
            id=data.get("id", str(uuid.uuid4())),
            name=data["name"],
            description=data.get("description", ""),
            traits=traits,
            expertise=data.get("expertise", []),
            activation_cost=data.get("activation_cost", 0.3),
            status=data.get("status", "experimental"),
            version=data.get("version", 1),
            created_at=data.get("created_at", datetime.utcnow().isoformat()),
            performance=performance,
            evolution_log=data.get("evolution_log", []),
            system_prompt_template=data.get("system_prompt_template", "")
        )

    def _save_persona_to_state(self, persona: Persona, state: str):
        from dataclasses import asdict  # recursively converts nested dataclasses
        filepath = os.path.join(self.personas_dir, state, f"{persona.id}.json")
        with open(filepath, "w") as f:
            # vars() would leave PersonaTrait and PersonaPerformance values
            # unserializable; asdict() round-trips cleanly through
            # _deserialize_persona.
            json.dump(asdict(persona), f, indent=2, default=str)
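
The bounded update rule can be checked in isolation. Here is a minimal standalone sketch, not the class above, just the same Δw = feedback × rate × (1 − w) arithmetic, showing that repeated positive feedback produces monotonically shrinking steps:

```python
# Standalone sketch of the rule used by apply_bounded_update:
#   delta = feedback * evolution_rate * (1 - weight)
def bounded_update(weight: float, feedback: float, rate: float = 0.05) -> float:
    delta = feedback * rate * (1.0 - weight)
    return max(0.0, min(1.0, weight + delta))

# Repeated maximal positive feedback pushes a trait toward 1.0,
# but each step is smaller than the last.
w = 0.5
deltas = []
for _ in range(5):
    new_w = bounded_update(w, feedback=1.0)
    deltas.append(new_w - w)
    w = new_w
```

The shrinking deltas are the convergence guarantee the docstring describes: the closer a trait sits to 1.0, the less any single feedback event can move it, so no trait can run away to an extreme in one burst of feedback.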

VI. The Governance Layer: The Control Boundary Engine

The Control Boundary Engine is the system's conscience. It runs on every request. It cannot be bypassed. It evaluates intent before execution, scores outputs after generation, and emits a complete audit trail that satisfies enterprise governance requirements, including the Colorado AI Act's "Reasonable Care" standard.

In corporate AI, governance is a post-hoc appendage — a feedback button, a content moderation layer, a logging system bolted onto the side of the architecture after the fact. In SOVEREIGN, governance is embedded in the execution path. You cannot get a response without passing through the evaluation loop. You cannot update a persona without logging the change. You cannot prune a knowledge graph node without recording the decision.

This is not compliance theater. It is the architecture of a system that answers to you.

Python
# sovereign/governance/control_boundary.py

from dataclasses import dataclass, field
from typing import Dict, Any, Optional, List
from datetime import datetime
from enum import Enum
import uuid


class IntentCategory(Enum):
    INFORMATIONAL = "informational"
    GENERATIVE = "generative"
    ANALYTICAL = "analytical"
    EXECUTABLE = "executable"         # Triggers higher governance scrutiny
    ADMINISTRATIVE = "administrative" # System modification — maximum scrutiny


class GovernanceDecision(Enum):
    PROCEED = "proceed"
    PROCEED_WITH_LOGGING = "proceed_with_logging"
    REQUIRE_CONFIRMATION = "require_confirmation"
    BLOCK = "block"


@dataclass
class ControlBoundaryResult:
    request_id: str
    intent_category: IntentCategory
    governance_decision: GovernanceDecision
    risk_score: float                   # 0.0 (benign) to 1.0 (high risk)
    justification: str
    audit_record: Dict[str, Any]
    timestamp: str = field(default_factory=lambda: datetime.utcnow().isoformat())
    passed: bool = True


@dataclass
class OutputEvaluationResult:
    request_id: str
    grounding_score: float              # How well anchored to source documents
    coherence_score: float              # Internal logical consistency
    coverage_score: float               # Query completeness
    hallucination_penalty: float        # Detected confabulation
    composite_score: float              # Weighted aggregate
    flagged_claims: List[str]           # Claims requiring provenance verification
    audit_record: Dict[str, Any]
    timestamp: str = field(default_factory=lambda: datetime.utcnow().isoformat())


class ControlBoundaryEngine:
    """
    The governance conscience of SOVEREIGN.
    
    Every request passes through here before execution.
    Every output passes through here before delivery.
    The audit trail is complete, immutable, and yours.
    
    This is not a security layer. It is an accountability layer.
    The distinction matters: security prevents bad actors.
    Accountability ensures the system answers to you.
    """

    def __init__(self, config: Dict[str, Any]):
        self.config = config
        self.audit_log_path = config.get("audit_log_path", "./logs/audit.jsonl")
        self.risk_thresholds = config.get("risk_thresholds", {
            "block": 0.9,
            "require_confirmation": 0.7,
            "enhanced_logging": 0.4
        })
        self._init_audit_log()

    def _init_audit_log(self):
        import os
        os.makedirs(os.path.dirname(self.audit_log_path), exist_ok=True)

    def evaluate_request(self, query: str, session_id: str,
                         user_context: Dict[str, Any]) -> ControlBoundaryResult:
        """
        Phase 1: Evaluate intent before execution.
        
        The system asks itself: what is this request trying to do?
        Is the intent aligned with the configured governance policy?
        What level of scrutiny does this request warrant?
        """
        request_id = str(uuid.uuid4())
        intent_category = self._classify_intent(query)
        risk_score = self._compute_risk_score(query, intent_category, user_context)
        governance_decision = self._make_governance_decision(risk_score, intent_category)
        
        justification = self._generate_justification(
            intent_category, risk_score, governance_decision
        )
        
        import hashlib  # local import, mirroring this module's logging helpers
        audit_record = {
            "request_id": request_id,
            "session_id": session_id,
            # SHA-256 digest, not the raw query: privacy-preserving, and stable
            # across processes (the builtin hash() is salted per interpreter run,
            # which would make audit records uncomparable between sessions)
            "query_hash": hashlib.sha256(query.encode("utf-8")).hexdigest(),
            "intent_category": intent_category.value,
            "risk_score": risk_score,
            "governance_decision": governance_decision.value,
            "justification": justification,
            "timestamp": datetime.utcnow().isoformat()
        }
        
        self._append_to_audit_log(audit_record)
        
        return ControlBoundaryResult(
            request_id=request_id,
            intent_category=intent_category,
            governance_decision=governance_decision,
            risk_score=risk_score,
            justification=justification,
            audit_record=audit_record,
            passed=(governance_decision != GovernanceDecision.BLOCK)
        )

    def evaluate_output(self, output: str, source_nodes: List[Dict],
                         query: str, request_id: str) -> OutputEvaluationResult:
        """
        Phase 2: Evaluate output before delivery.
        
        The system asks: is this response grounded in evidence?
        Does it make claims that cannot be traced to source documents?
        Is it coherent? Is it complete relative to the query?
        
        This is the architectural answer to hallucination.
        Not a post-hoc filter — an embedded evaluation.
        """
        grounding_score = self._compute_grounding_score(output, source_nodes)
        coherence_score = self._compute_coherence_score(output)
        coverage_score = self._compute_coverage_score(output, query)
        hallucination_penalty = self._detect_hallucinations(output, source_nodes)
        flagged_claims = self._extract_flagged_claims(output, source_nodes)
        
        composite_score = (
            0.35 * grounding_score +
            0.30 * coherence_score +
            0.25 * coverage_score -
            0.10 * hallucination_penalty
        )
        composite_score = max(0.0, min(1.0, composite_score))
        
        audit_record = {
            "request_id": request_id,
            "grounding_score": grounding_score,
            "coherence_score": coherence_score,
            "coverage_score": coverage_score,
            "hallucination_penalty": hallucination_penalty,
            "composite_score": composite_score,
            "flagged_claims_count": len(flagged_claims),
            "timestamp": datetime.utcnow().isoformat()
        }
        self._append_to_audit_log(audit_record)
        
        return OutputEvaluationResult(
            request_id=request_id,
            grounding_score=grounding_score,
            coherence_score=coherence_score,
            coverage_score=coverage_score,
            hallucination_penalty=hallucination_penalty,
            composite_score=composite_score,
            flagged_claims=flagged_claims,
            audit_record=audit_record
        )

    def _classify_intent(self, query: str) -> IntentCategory:
        query_lower = query.lower()
        if any(k in query_lower for k in ["delete", "modify", "update", "configure", "install"]):
            return IntentCategory.ADMINISTRATIVE
        if any(k in query_lower for k in ["execute", "run", "deploy", "create file", "write to"]):
            return IntentCategory.EXECUTABLE
        if any(k in query_lower for k in ["analyze", "compare", "evaluate", "assess"]):
            return IntentCategory.ANALYTICAL
        if any(k in query_lower for k in ["write", "generate", "create", "draft", "produce"]):
            return IntentCategory.GENERATIVE
        return IntentCategory.INFORMATIONAL

    def _compute_risk_score(self, query: str, intent: IntentCategory,
                             context: Dict[str, Any]) -> float:
        # Baseline risk by intent class. `query` and `context` are accepted so
        # extensions can factor in content and session signals; this baseline
        # uses only the intent category.
        base_scores = {
            IntentCategory.INFORMATIONAL: 0.1,
            IntentCategory.GENERATIVE: 0.3,
            IntentCategory.ANALYTICAL: 0.2,
            IntentCategory.EXECUTABLE: 0.6,
            IntentCategory.ADMINISTRATIVE: 0.8
        }
        return base_scores.get(intent, 0.5)

    def _make_governance_decision(self, risk_score: float,
                                   intent: IntentCategory) -> GovernanceDecision:
        if risk_score >= self.risk_thresholds["block"]:
            return GovernanceDecision.BLOCK
        if risk_score >= self.risk_thresholds["require_confirmation"]:
            return GovernanceDecision.REQUIRE_CONFIRMATION
        if risk_score >= self.risk_thresholds["enhanced_logging"]:
            return GovernanceDecision.PROCEED_WITH_LOGGING
        return GovernanceDecision.PROCEED

    def _compute_grounding_score(self, output: str,
                                   source_nodes: List[Dict]) -> float:
        if not source_nodes:
            return 0.0
        source_terms = set()
        for node in source_nodes:
            content = node.get("content", "")
            source_terms.update(content.lower().split())
        output_terms = set(output.lower().split())
        overlap = len(output_terms & source_terms)
        return min(1.0, overlap / max(len(output_terms), 1) * 3.0)

    def _compute_coherence_score(self, output: str) -> float:
        sentences = [s.strip() for s in output.split(".") if s.strip()]
        if len(sentences) < 2:
            return 1.0
        return min(1.0, 0.5 + (len(sentences) / 20.0))

    def _compute_coverage_score(self, output: str, query: str) -> float:
        query_terms = set(query.lower().split())
        output_text = output.lower()
        covered = sum(1 for term in query_terms if term in output_text)
        return covered / max(len(query_terms), 1)

    def _detect_hallucinations(self, output: str,
                                source_nodes: List[Dict]) -> float:
        specific_claims = [
            word for word in output.split()
            if word.replace(",", "").replace(".", "").isdigit()
               or (len(word) > 2 and word[0].isupper())
        ]
        if not specific_claims or not source_nodes:
            return 0.0
        source_content = " ".join(n.get("content", "") for n in source_nodes).lower()
        ungrounded = sum(
            1 for claim in specific_claims
            if claim.lower() not in source_content
        )
        return min(1.0, ungrounded / max(len(specific_claims), 1))

    def _extract_flagged_claims(self, output: str,
                                 source_nodes: List[Dict]) -> List[str]:
        source_content = " ".join(n.get("content", "") for n in source_nodes).lower()
        sentences = [s.strip() for s in output.split(".") if s.strip()]
        flagged = []
        for sentence in sentences:
            key_terms = [w for w in sentence.split() if len(w) > 5]
            if key_terms and not any(t.lower() in source_content for t in key_terms):
                flagged.append(sentence)
        return flagged[:5]  # Return top 5 flagged sentences

    def _generate_justification(self, intent: IntentCategory,
                                  risk_score: float,
                                  decision: GovernanceDecision) -> str:
        return (
            f"Intent classified as {intent.value} with risk score {risk_score:.2f}. "
            f"Governance decision: {decision.value}. "
            f"Threshold configuration: block={self.risk_thresholds['block']}, "
            f"confirm={self.risk_thresholds['require_confirmation']}."
        )

    def _append_to_audit_log(self, record: Dict[str, Any]):
        import json
        with open(self.audit_log_path, "a") as f:
            f.write(json.dumps(record) + "\n")
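
With the default thresholds, the decision cascade is easy to trace by hand. The following is a standalone sketch that mirrors the logic of `_make_governance_decision` and maps each base intent score through it:

```python
# Default thresholds from the ControlBoundaryEngine configuration above.
THRESHOLDS = {"block": 0.9, "require_confirmation": 0.7, "enhanced_logging": 0.4}

def decide(risk_score: float) -> str:
    # Highest threshold wins; the cascade checks from most to least restrictive.
    if risk_score >= THRESHOLDS["block"]:
        return "block"
    if risk_score >= THRESHOLDS["require_confirmation"]:
        return "require_confirmation"
    if risk_score >= THRESHOLDS["enhanced_logging"]:
        return "proceed_with_logging"
    return "proceed"

# Base intent scores from _compute_risk_score, mapped through the cascade.
base_scores = {
    "informational": 0.1, "generative": 0.3, "analytical": 0.2,
    "executable": 0.6, "administrative": 0.8,
}
decisions = {intent: decide(score) for intent, score in base_scores.items()}
```

Notice that the base scores alone never reach the block threshold of 0.9: blocking only happens when the risk computation raises a score above its intent baseline, for example from session context.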

VII. The Orchestration Layer: MoE Routing and Agent Swarms

The MoE orchestrator is the brain of SOVEREIGN's execution path. It receives a query from the API gateway, consults the governance layer for clearance, routes to the persona engine for specialist selection, dispatches parallel persona commentary passes against the knowledge graph, aggregates results through a multi-dimensional evaluation function, and returns a synthesized response with a full execution trace.

This is not a chain. It is a graph. Execution can be parallel, recursive, or branching depending on query complexity and persona routing decisions.
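
The distinction shows up in how passes are dispatched. Here is a toy sketch (illustrative names only, not part of the SOVEREIGN codebase) in which independent persona passes run concurrently and a join step synthesizes their results:

```python
import asyncio

# Toy graph-style dispatch: persona passes with no dependency on each other
# fan out concurrently; a synthesis step joins them afterward.
async def persona_pass(name: str, query: str) -> str:
    await asyncio.sleep(0)  # stand-in for a local inference call
    return f"{name}: analysis of {query!r}"

async def run(query: str, personas: list[str]) -> str:
    # gather() preserves input order, so attribution stays deterministic.
    commentaries = await asyncio.gather(
        *(persona_pass(p, query) for p in personas)
    )
    return " | ".join(commentaries)

result = asyncio.run(run("audit trail design", ["Architect", "Skeptic"]))
```

A chain would force the Skeptic to wait on the Architect; the graph dispatch only imposes ordering where a real dependency exists, such as the synthesis join.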

Python
# sovereign/orchestration/moe_orchestrator.py

from typing import Dict, List, Any, Optional
from datetime import datetime
import asyncio
import uuid

from sovereign.reasoning.persona_engine import PersonaEngine, Persona
from sovereign.memory.knowledge_graph import SovereignKnowledgeGraph
from sovereign.memory.vector_store import SovereignVectorStore
from sovereign.inference.local_engine import LocalInferenceEngine
from sovereign.governance.control_boundary import ControlBoundaryEngine, GovernanceDecision


class MoEOrchestrator:
    """
    The Mixture-of-Experts orchestrator for SOVEREIGN.
    
    Routes queries to specialist personas, executes parallel
    commentary passes, aggregates results through multi-dimensional
    evaluation, and returns synthesized responses with full execution traces.
    
    Every execution is reproducible.
    Every routing decision is logged.
    Every persona contribution is attributed.
    """

    def __init__(self, config: Dict[str, Any]):
        self.config = config
        self.persona_engine = PersonaEngine(config.get("persona_config", {}))
        self.knowledge_graph = SovereignKnowledgeGraph(config.get("graph_config", {}))
        self.vector_store = SovereignVectorStore(config.get("vector_config", {}))
        self.inference_engine = LocalInferenceEngine(config.get("inference_config", {}))
        self.governance = ControlBoundaryEngine(config.get("governance_config", {}))

    def execute(self, query: str, session_id: str,
                user_context: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
        """
        Full orchestration pipeline.
        
        Phase 1: Governance pre-check
        Phase 2: Context retrieval (vector + graph)
        Phase 3: Persona routing
        Phase 4: Parallel persona commentary passes
        Phase 5: Aggregation and synthesis
        Phase 6: Governance post-check
        Phase 7: Persona evolution update
        Phase 8: Return with full execution trace
        """
        execution_trace = {
            "execution_id": str(uuid.uuid4()),
            "query": query,
            "session_id": session_id,
            "started_at": datetime.utcnow().isoformat(),
            "phases": []
        }

        # ── Phase 1: Governance Pre-Check ────────────────────────────────────────
        governance_result = self.governance.evaluate_request(
            query, session_id, user_context or {}
        )
        execution_trace["phases"].append({
            "phase": "governance_precheck",
            "result": governance_result.audit_record
        })
        
        if not governance_result.passed:
            return self._build_blocked_response(query, governance_result, execution_trace)

        # ── Phase 2: Context Retrieval ────────────────────────────────────────────
        vector_results = self.vector_store.query(query, n_results=10)
        query_domain = self._infer_domain(query, vector_results)
        
        # Build query-scoped graph from retrieved documents
        source_node_ids = self._build_query_graph(query, vector_results)
        execution_trace["phases"].append({
            "phase": "context_retrieval",
            "vector_results_count": len(vector_results),
            "graph_nodes_constructed": len(source_node_ids),
            "inferred_domain": query_domain
        })

        # ── Phase 3: Persona Routing ──────────────────────────────────────────────
        activated_personas = self.persona_engine.route_to_persona(query, query_domain)
        execution_trace["phases"].append({
            "phase": "persona_routing",
            "activated_personas": [p.id for p in activated_personas],
            "persona_count": len(activated_personas)
        })

        if not activated_personas:
            return self._build_no_persona_response(query, execution_trace)

        # ── Phase 4: Parallel Persona Commentary ─────────────────────────────────
        persona_results = self._execute_persona_passes(
            query, activated_personas, vector_results, source_node_ids
        )
        execution_trace["phases"].append({
            "phase": "persona_commentary",
            "results_count": len(persona_results)
        })

        # ── Phase 5: Aggregation and Synthesis ───────────────────────────────────
        aggregated_response = self._aggregate_and_synthesize(
            query, persona_results, vector_results
        )
        execution_trace["phases"].append({
            "phase": "aggregation",
            "composite_score": aggregated_response["evaluation_score"],
            "synthesis_length": len(aggregated_response["synthesis"])
        })

        # ── Phase 6: Governance Post-Check ───────────────────────────────────────
        output_evaluation = self.governance.evaluate_output(
            aggregated_response["synthesis"],
            vector_results,
            query,
            governance_result.request_id
        )
        execution_trace["phases"].append({
            "phase": "governance_postcheck",
            "grounding_score": output_evaluation.grounding_score,
            "hallucination_penalty": output_evaluation.hallucination_penalty,
            "flagged_claims_count": len(output_evaluation.flagged_claims)
        })

        # ── Phase 7: Persona Evolution ────────────────────────────────────────────
        self._update_persona_evolution(
            activated_personas, persona_results,
            aggregated_response["evaluation_score"], query_domain
        )

        # ── Phase 8: Prune underperformers ───────────────────────────────────────
        self._run_pruning_cycle()

        execution_trace["completed_at"] = datetime.utcnow().isoformat()
        
        return {
            "response": aggregated_response["synthesis"],
            "evaluation": {
                "composite_score": aggregated_response["evaluation_score"],
                "grounding_score": output_evaluation.grounding_score,
                "coherence_score": output_evaluation.coherence_score,
                "hallucination_penalty": output_evaluation.hallucination_penalty
            },
            "provenance": {
                "source_documents": [r["metadata"].get("source") for r in vector_results[:5]],
                "activated_personas": [p.name for p in activated_personas],
                "flagged_claims": output_evaluation.flagged_claims
            },
            "execution_trace": execution_trace
        }

    def _execute_persona_passes(self, query: str, personas: List[Persona],
                                  vector_results: List[Dict],
                                  source_node_ids: List[str]) -> List[Dict[str, Any]]:
        """Execute parallel persona commentary passes."""
        context = self._format_context_for_inference(vector_results)
        results = []
        
        for persona in personas:
            start_time = datetime.utcnow()
            system_prompt = persona.get_system_prompt(context=query)
            
            inference_prompt = (
                f"Based on the following context, provide your expert analysis:\n\n"
                f"CONTEXT:\n{context}\n\n"
                f"QUERY: {query}\n\n"
                f"Provide a detailed analysis from your perspective as {persona.name}. "
                f"Reference specific information from the context. "
                f"Identify key insights and any limitations in the available information."
            )
            
            try:
                commentary = self.inference_engine.generate(
                    inference_prompt, system_prompt, max_tokens=1500
                )
                latency_ms = (datetime.utcnow() - start_time).total_seconds() * 1000
                
                results.append({
                    "persona_id": persona.id,
                    "persona_name": persona.name,
                    "commentary": commentary,
                    "relevance_score": self._score_relevance(commentary, query),
                    "key_insights": self._extract_key_insights(commentary),
                    "latency_ms": latency_ms,
                    "success": True
                })
            except Exception as e:
                results.append({
                    "persona_id": persona.id,
                    "persona_name": persona.name,
                    "commentary": "",
                    "relevance_score": 0.0,
                    "key_insights": [],
                    "latency_ms": 0,
                    "success": False,
                    "error": str(e)
                })
        
        return results

    def _aggregate_and_synthesize(self, query: str, persona_results: List[Dict],
                                    vector_results: List[Dict]) -> Dict[str, Any]:
        """Synthesize persona commentaries into a unified response."""
        successful_results = [r for r in persona_results if r["success"]]
        
        if not successful_results:
            return {"synthesis": "No successful persona passes completed.", "evaluation_score": 0.0}
        
        synthesis_prompt = (
            "Synthesize the following expert analyses into a single, coherent response. "
            "Preserve the key insights from each perspective. "
            "Resolve contradictions explicitly. "
            "Be precise about what is known versus inferred.\n\n"
        )
        
        for result in successful_results:
            synthesis_prompt += (
                f"### {result['persona_name']} Analysis:\n"
                f"{result['commentary']}\n\n"
            )
        
        synthesis_prompt += f"\nQuery to address: {query}\n\nProvide a unified synthesis:"
        
        synthesis = self.inference_engine.generate(
            synthesis_prompt,
            system_prompt="You are a synthesis engine. Combine multiple expert perspectives into clear, grounded analysis.",
            max_tokens=2000
        )
        
        evaluation_score = self._evaluate_synthesis(
            [r["commentary"] for r in successful_results],
            [insight for r in successful_results for insight in r["key_insights"]],
            query
        )
        
        return {"synthesis": synthesis, "evaluation_score": evaluation_score}

    def _evaluate_synthesis(self, commentaries: List[str],
                              insights: List[str], query: str) -> float:
        if not commentaries:
            return 0.0
        
        coverage = min(1.0, len(insights) / max(len(query.split()), 1) * 2.0)
        
        if len(commentaries) < 2:
            coherence = 1.0
        else:
            all_terms = [set(c.lower().split()) for c in commentaries]
            pairwise_overlaps = []
            for i in range(len(all_terms)):
                for j in range(i + 1, len(all_terms)):
                    union = all_terms[i] | all_terms[j]
                    intersection = all_terms[i] & all_terms[j]
                    pairwise_overlaps.append(len(intersection) / max(len(union), 1))
            coherence = sum(pairwise_overlaps) / max(len(pairwise_overlaps), 1)
        
        query_terms = set(query.lower().split())
        all_output = " ".join(commentaries).lower()
        relevance = sum(1 for t in query_terms if t in all_output) / max(len(query_terms), 1)
        
        return 0.4 * coverage + 0.3 * coherence + 0.3 * relevance

    def _build_query_graph(self, query: str,
                            vector_results: List[Dict]) -> List[str]:
        """Construct a query-scoped knowledge graph from retrieved documents."""
        node_ids = []
        for result in vector_results:
            node = self.knowledge_graph.add_node(
                label="DOCUMENT",
                content=result["content"][:500],
                source_document_id=result["id"],
                confidence=result["relevance_score"]
            )
            node_ids.append(node.id)
        
        # Connect related documents
        for i in range(len(node_ids) - 1):
            self.knowledge_graph.add_edge(
                node_ids[i], node_ids[i + 1],
                relationship="RELATED_TO",
                weight=0.5,
                established_by="query_construction"
            )
        return node_ids

    def _update_persona_evolution(self, personas: List[Persona],
                                   results: List[Dict],
                                   aggregate_score: float, domain: str):
        for persona in personas:
            persona_result = next(
                (r for r in results if r["persona_id"] == persona.id), None
            )
            if not persona_result:
                continue
            
            individual_score = persona_result.get("relevance_score", aggregate_score)
            feedback_vector = {
                trait_name: individual_score
                for trait_name in persona.traits.keys()
            }
            persona.apply_bounded_update(feedback_vector)
            
            persona.performance.total_queries += 1
            persona.performance.total_score += individual_score
            persona.performance.last_used = datetime.utcnow().isoformat()
            persona.performance.domain_scores[domain] = (
                persona.performance.domain_scores.get(domain, 0.5) * 0.8 +
                individual_score * 0.2
            )
            if individual_score >= 0.6:
                persona.performance.success_rate = (
                    persona.performance.success_rate * 0.9 + 0.1
                )
            else:
                # Failures decay the rate; without this branch the EMA could only rise
                persona.performance.success_rate *= 0.9

    def _run_pruning_cycle(self):
        """Retire consistently underperforming personas."""
        prune_threshold = self.config.get("prune_threshold", 0.3)
        for persona_id, persona in list(self.persona_engine.active_personas.items()):
            if (persona.performance.total_queries >= 10 and
                    persona.performance.average_score < prune_threshold):
                self.persona_engine.prune_persona(
                    persona_id, reason=f"average_score {persona.performance.average_score:.2f} below threshold {prune_threshold}"
                )

    def _infer_domain(self, query: str, vector_results: List[Dict]) -> str:
        domain_keywords = {
            "code": ["function", "class", "algorithm", "implement", "debug", "code", "python", "typescript"],
            "research": ["analyze", "study", "evidence", "research", "paper", "data", "statistics"],
            "writing": ["write", "draft", "compose", "article", "blog", "narrative", "story"],
            "architecture": ["system", "design", "architecture", "infrastructure", "deploy", "scale"],
            "governance": ["compliance", "policy", "audit", "risk", "regulation", "governance"]
        }
        query_lower = query.lower()
        domain_scores = {}
        for domain, keywords in domain_keywords.items():
            domain_scores[domain] = sum(1 for kw in keywords if kw in query_lower)
        best_domain = max(domain_scores, key=domain_scores.get)
        # With zero keyword hits, max() returns an arbitrary key;
        # fall back to the broadest domain instead.
        return best_domain if domain_scores[best_domain] > 0 else "research"

    def _format_context_for_inference(self, vector_results: List[Dict]) -> str:
        context_parts = []
        for i, result in enumerate(vector_results[:5]):
            source = result["metadata"].get("source", "unknown")
            content = result["content"][:400]
            score = result["relevance_score"]
            context_parts.append(f"[Source {i+1}: {source} | Relevance: {score:.2f}]\n{content}")
        return "\n\n".join(context_parts)

    def _score_relevance(self, commentary: str, query: str) -> float:
        query_terms = set(query.lower().split())
        commentary_terms = set(commentary.lower().split())
        return len(query_terms & commentary_terms) / max(len(query_terms), 1)

    def _extract_key_insights(self, commentary: str) -> List[str]:
        sentences = [s.strip() for s in commentary.split(".") if len(s.strip()) > 40]
        return sentences[:3]

    def _build_blocked_response(self, query: str, governance_result: Any,
                                  trace: Dict) -> Dict[str, Any]:
        return {
            "response": f"Request blocked by governance layer. Reason: {governance_result.justification}",
            "blocked": True,
            "governance_result": governance_result.audit_record,
            "execution_trace": trace
        }

    def _build_no_persona_response(self, query: str, trace: Dict) -> Dict[str, Any]:
        return {
            "response": "No active personas available for this query domain. Review persona configuration.",
            "no_personas": True,
            "execution_trace": trace
        }
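The composite score in `_evaluate_synthesis` (0.4 coverage + 0.3 coherence + 0.3 relevance) is worth seeing in isolation. The sketch below re-implements the same heuristic outside the class so it can be run standalone; it is an illustration of the scoring math, not the orchestrator's actual entry point.

```python
def evaluate_synthesis(commentaries, insights, query):
    """Composite quality score: 0.4 coverage + 0.3 coherence + 0.3 relevance."""
    if not commentaries:
        return 0.0
    # Coverage: insight count relative to query length, capped at 1.0
    coverage = min(1.0, len(insights) / max(len(query.split()), 1) * 2.0)
    # Coherence: mean pairwise Jaccard overlap between commentary vocabularies
    term_sets = [set(c.lower().split()) for c in commentaries]
    if len(term_sets) < 2:
        coherence = 1.0
    else:
        overlaps = [
            len(term_sets[i] & term_sets[j]) / max(len(term_sets[i] | term_sets[j]), 1)
            for i in range(len(term_sets))
            for j in range(i + 1, len(term_sets))
        ]
        coherence = sum(overlaps) / len(overlaps)
    # Relevance: fraction of query terms that appear anywhere in the output
    query_terms = set(query.lower().split())
    all_output = " ".join(commentaries).lower()
    relevance = sum(1 for t in query_terms if t in all_output) / max(len(query_terms), 1)
    return 0.4 * coverage + 0.3 * coherence + 0.3 * relevance

score = evaluate_synthesis(
    ["local inference keeps data on your hardware",
     "local inference avoids cloud telemetry"],
    ["insight one", "insight two"],
    "why local inference",
)
```

Two commentaries that share only "local inference" score low on coherence (Jaccard 0.2) but high on coverage, which is exactly the tension the weights encode.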

VIII. The SpecGen Module: Deterministic Code from Specification

One of the most powerful — and underutilized — components in the system is SpecGen: the deterministic code generation engine that produces production-ready implementations from structured technical specifications.

SpecGen was born from a frustration I could not resolve with vanilla LLM code generation: non-determinism. Given the same specification twice, most code generation systems will produce meaningfully different implementations. The patterns, the naming conventions, the error handling strategies, the test coverage — all of it varies with temperature and token sampling. This is fine for exploration. It is unacceptable for production infrastructure.

SpecGen solves this through three mechanisms: (1) a structured specification format that eliminates ambiguity before generation, (2) RAG-grounded generation that anchors output to your existing codebase patterns, and (3) a fixed-seed inference call that produces deterministic output given the same specification and context.

Python
# sovereign/specgen/spec_generator.py

from dataclasses import dataclass, field
from typing import Dict, List, Optional, Any
import json
import hashlib


@dataclass
class ComponentSpec:
    """
    A fully specified component for deterministic code generation.
    
    Ambiguity in the spec means ambiguity in the output.
    Every field is required because every field shapes the generated code.
    Underspecified components produce underspecified implementations.
    """
    name: str
    component_type: str           # service, model, api_endpoint, utility, test, config
    language: str                 # python, typescript, sql, yaml, bash
    description: str
    inputs: List[Dict[str, str]]  # [{name, type, description, required}]
    outputs: List[Dict[str, str]] # [{name, type, description}]
    dependencies: List[str]       # Other component names this depends on
    constraints: List[str]        # Explicit behavioral constraints
    error_handling: List[str]     # Error cases and handling strategies
    test_scenarios: List[Dict]    # [{name, given, when, then}]
    existing_patterns: List[str]  # Code patterns from codebase to follow
    
    @property
    def spec_hash(self) -> str:
        """Deterministic hash of the specification — same spec = same hash = same code."""
        spec_string = json.dumps(
            {k: v for k, v in vars(self).items() if k != "spec_hash"},
            sort_keys=True
        )
        return hashlib.sha256(spec_string.encode()).hexdigest()[:12]


class SpecGenerator:
    """
    Deterministic code generation from structured specifications.
    
    The key insight: LLM code generation is non-deterministic by default
    because the prompt is underspecified and the sampling is random.
    Remove the underspecification. Fix the seed.
    Now the generation is deterministic.
    
    Your codebase is a corpus. New code should be grounded in existing patterns.
    SpecGen retrieves those patterns before generating.
    The result is code that looks like it was written by the same author
    as the rest of the codebase — because it is grounded in the same corpus.
    """

    def __init__(self, config: Dict[str, Any], vector_store, inference_engine):
        self.config = config
        self.vector_store = vector_store
        self.inference_engine = inference_engine
        self.generation_seed = config.get("generation_seed", 42)
        self.spec_cache: Dict[str, str] = {}

    def generate_component(self, spec: ComponentSpec) -> Dict[str, Any]:
        """Generate a complete, production-ready component from specification."""
        
        # Check spec cache — same spec always produces same code
        if spec.spec_hash in self.spec_cache:
            return {
                "code": self.spec_cache[spec.spec_hash],
                "spec_hash": spec.spec_hash,
                "cache_hit": True
            }
        
        # Retrieve existing patterns from the codebase
        pattern_context = self._retrieve_existing_patterns(spec)
        
        # Build deterministic generation prompt
        generation_prompt = self._build_generation_prompt(spec, pattern_context)
        system_prompt = self._build_system_prompt(spec)
        
        # Generate with fixed seed for determinism
        generated_code = self.inference_engine.generate(
            generation_prompt,
            system_prompt=system_prompt,
            temperature=0.0,      # Zero temperature: maximum determinism
            seed=self.generation_seed,
            max_tokens=3000
        )
        
        # Generate tests in a separate pass
        test_code = self._generate_tests(spec, generated_code, pattern_context)
        
        result = {
            "component_name": spec.name,
            "component_type": spec.component_type,
            "language": spec.language,
            "spec_hash": spec.spec_hash,
            "implementation": generated_code,
            "tests": test_code,
            "dependencies": spec.dependencies,
            "cache_hit": False
        }
        
        self.spec_cache[spec.spec_hash] = generated_code
        return result

    def _retrieve_existing_patterns(self, spec: ComponentSpec) -> str:
        """Retrieve relevant code patterns from the existing codebase."""
        search_query = f"{spec.component_type} {spec.language} {' '.join(spec.existing_patterns[:3])}"
        results = self.vector_store.query(
            search_query,
            n_results=5,
            where_filter={"doc_type": "code"}
        )
        if not results:
            return "No existing patterns found in codebase."
        return "\n\n".join([
            f"# Pattern from {r['metadata'].get('source', 'unknown')}:\n{r['content']}"
            for r in results
        ])

    def _build_generation_prompt(self, spec: ComponentSpec, pattern_context: str) -> str:
        return f"""Generate a production-ready {spec.language} {spec.component_type} named {spec.name}.

SPECIFICATION:
- Description: {spec.description}
- Inputs: {json.dumps(spec.inputs, indent=2)}
- Outputs: {json.dumps(spec.outputs, indent=2)}
- Dependencies: {', '.join(spec.dependencies)}
- Constraints: {chr(10).join(f'  - {c}' for c in spec.constraints)}
- Error handling: {chr(10).join(f'  - {e}' for e in spec.error_handling)}

EXISTING CODEBASE PATTERNS TO FOLLOW:
{pattern_context}

Generate ONLY the implementation code. No preamble. No explanation. No markdown fences.
The code must be complete, typed, and production-ready."""

    def _build_system_prompt(self, spec: ComponentSpec) -> str:
        language_instructions = {
            "python": "Use type hints, dataclasses, explicit error handling, and docstrings. Follow PEP 8.",
            "typescript": "Use strict TypeScript with explicit types. No `any`. Prefer interfaces over types for objects.",
            "sql": "Use explicit column names, proper indexes, and transactional safety.",
        }
        return (
            f"You are a senior software engineer generating production {spec.language} code. "
            f"{language_instructions.get(spec.language, '')} "
            f"Output ONLY valid {spec.language} code. No explanations."
        )

    def _generate_tests(self, spec: ComponentSpec, implementation: str,
                         pattern_context: str) -> str:
        test_prompt = f"""Generate comprehensive tests for this {spec.language} {spec.component_type}.

IMPLEMENTATION:
{implementation}

TEST SCENARIOS:
{json.dumps(spec.test_scenarios, indent=2)}

Generate complete test code following the patterns in the codebase.
Cover success cases, edge cases, and each error handling scenario.
Output ONLY test code."""
        return self.inference_engine.generate(
            test_prompt,
            system_prompt=f"Generate complete {spec.language} tests. Output ONLY code.",
            temperature=0.0,
            seed=self.generation_seed,
            max_tokens=2000
        )
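The determinism claim rests on the spec hash: identical specifications hash identically, so the cache can guarantee identical output. The property can be demonstrated standalone; the field names below mirror `ComponentSpec`, but this is a plain dict, not the dataclass above.

```python
import hashlib
import json

def spec_hash(spec: dict) -> str:
    """Deterministic 12-hex-char digest: same spec -> same hash -> same cached code."""
    # sort_keys guarantees that field order cannot change the hash
    canonical = json.dumps(spec, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()[:12]

spec_a = {"name": "VectorStore", "language": "python", "constraints": ["no cloud calls"]}
# Same fields, different insertion order: must hash identically
spec_b = {"language": "python", "name": "VectorStore", "constraints": ["no cloud calls"]}
# One constraint changed: must hash differently
spec_c = {"name": "VectorStore", "language": "python", "constraints": ["allow cloud calls"]}
```

Field order is irrelevant; content is everything. That is what lets the cache treat the hash as an identity for generated code.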

IX. The API Gateway: FastAPI Backend

Python
# sovereign/api/main.py

from fastapi import FastAPI, HTTPException, BackgroundTasks, WebSocket
from fastapi.middleware.cors import CORSMiddleware
from pydantic import BaseModel, Field
from typing import Dict, Any, Optional, List
import uuid
import yaml
from sovereign.orchestration.moe_orchestrator import MoEOrchestrator
from sovereign.governance.control_boundary import ControlBoundaryEngine


def load_config(path: str = "./config/sovereign.yaml") -> Dict[str, Any]:
    with open(path) as f:
        return yaml.safe_load(f)


config = load_config()
app = FastAPI(
    title="SOVEREIGN API",
    description="Self-owned local-first AI orchestration. No cloud. No telemetry. Your inference.",
    version="1.0.0"
)

app.add_middleware(
    CORSMiddleware,
    allow_origins=config.get("cors_origins", ["http://localhost:3000"]),
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)

orchestrator = MoEOrchestrator(config)


class QueryRequest(BaseModel):
    query: str = Field(..., min_length=1, max_length=10000)
    session_id: str = Field(default_factory=lambda: str(uuid.uuid4()))
    persona_override: Optional[List[str]] = None
    domain_hint: Optional[str] = None
    stream: bool = False


class DocumentIngestRequest(BaseModel):
    documents: List[Dict[str, Any]]
    collection: Optional[str] = "default"
    extract_entities: bool = True
    build_graph_edges: bool = True


@app.post("/query")
async def query(request: QueryRequest) -> Dict[str, Any]:
    """
    Primary query endpoint. Runs the full 8-phase orchestration pipeline.
    Returns response with evaluation scores, provenance, and execution trace.
    """
    try:
        result = orchestrator.execute(
            query=request.query,
            session_id=request.session_id,
            user_context={"domain_hint": request.domain_hint}
        )
        return result
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))


@app.websocket("/query/stream")
async def query_stream(websocket: WebSocket):
    """
    Streaming query endpoint for real-time token delivery.
    Every token comes from local inference.
    """
    await websocket.accept()
    try:
        data = await websocket.receive_json()
        query_text = data.get("query", "")
        session_id = data.get("session_id", str(uuid.uuid4()))
        
        # NOTE: generate_stream is a synchronous generator. Under load, wrap it
        # with run_in_threadpool so token generation does not block the event loop.
        for token in orchestrator.inference_engine.generate_stream(query_text):
            await websocket.send_json({"token": token, "done": False})
        
        await websocket.send_json({"token": "", "done": True})
    except Exception as e:
        await websocket.send_json({"error": str(e), "done": True})
    finally:
        await websocket.close()


@app.post("/documents/ingest")
async def ingest_documents(request: DocumentIngestRequest,
                            background_tasks: BackgroundTasks) -> Dict[str, Any]:
    """Ingest documents into the memory substrate (vector store + knowledge graph)."""
    doc_ids = orchestrator.vector_store.embed_and_store(request.documents)
    return {
        "ingested_count": len(doc_ids),
        "document_ids": doc_ids,
        "collection": request.collection
    }


@app.get("/personas")
async def list_personas() -> Dict[str, Any]:
    """List all personas with their current lifecycle state and performance metrics."""
    active = {
        pid: {
            "name": p.name,
            "status": p.status,
            "expertise": p.expertise,
            "average_score": p.performance.average_score,
            "total_queries": p.performance.total_queries,
            "version": p.version
        }
        for pid, p in orchestrator.persona_engine.active_personas.items()
    }
    cold = {
        pid: {"name": p.name, "status": p.status}
        for pid, p in orchestrator.persona_engine.cold_storage.items()
    }
    return {"active": active, "cold_storage": cold}


@app.post("/personas/{persona_id}/recall")
async def recall_persona(persona_id: str, query_context: str) -> Dict[str, Any]:
    """Attempt to recall a pruned persona based on query context."""
    recalled = orchestrator.persona_engine.recall_persona(persona_id, query_context)
    if recalled:
        return {"recalled": True, "persona_name": recalled.name, "persona_id": recalled.id}
    return {"recalled": False, "reason": "Context relevance below recall threshold"}


@app.get("/audit/log")
async def get_audit_log(limit: int = 50) -> Dict[str, Any]:
    """Return the most recent audit log entries."""
    import json
    entries = []
    try:
        with open(config.get("governance_config", {}).get("audit_log_path", "./logs/audit.jsonl")) as f:
            for line in f:
                if line.strip():
                    entries.append(json.loads(line))
    except FileNotFoundError:
        entries = []
    return {"entries": entries[-limit:], "total_count": len(entries)}


@app.get("/health")
async def health() -> Dict[str, Any]:
    return {
        "status": "sovereign",
        "inference_mode": config.get("inference_config", {}).get("execution_mode", "ollama"),
        "cloud_dependency": False,
        "telemetry": False
    }
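From the client side, the `/query` contract is just the `QueryRequest` shape. A minimal payload builder mirroring those fields is sketched below; `build_query_payload` is an illustrative helper, not part of the codebase, and the endpoint URL in the comment assumes the default `api_config` from the configuration section.

```python
import uuid
from typing import Optional

def build_query_payload(query: str, domain_hint: Optional[str] = None,
                        session_id: Optional[str] = None) -> dict:
    """Construct a /query request body matching the QueryRequest model."""
    return {
        "query": query,
        "session_id": session_id or str(uuid.uuid4()),
        "persona_override": None,
        "domain_hint": domain_hint,
        "stream": False,
    }

payload = build_query_payload("Summarize the audit trail design", domain_hint="governance")
# POST it with any HTTP client, e.g.:
# requests.post("http://localhost:8000/query", json=payload, timeout=120)
```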

X. Complete Project Scaffolding

This is the directory structure for a coding agent to construct from scratch. Every file listed is necessary. Every directory serves a specific architectural purpose.

sovereign/
├── README.md
├── pyproject.toml
├── docker-compose.yml
├── Makefile
│
├── config/
│   ├── sovereign.yaml          # Master configuration
│   ├── personas/               # Persona definition templates
│   │   ├── analytical.json
│   │   ├── creative.json
│   │   ├── technical.json
│   │   ├── critical.json
│   │   └── generalist.json
│   └── model_registry.yaml     # Local model routing table
│
├── sovereign/                  # Core Python package
│   ├── __init__.py
│   │
│   ├── inference/
│   │   ├── __init__.py
│   │   └── local_engine.py     # Ollama + llama.cpp unified interface
│   │
│   ├── memory/
│   │   ├── __init__.py
│   │   ├── knowledge_graph.py  # Dual-substrate KG (Neo4j + NetworkX)
│   │   ├── vector_store.py     # ChromaDB/Qdrant local vector store
│   │   └── document_loader.py  # PDF, Markdown, HTML, JSON loaders
│   │
│   ├── reasoning/
│   │   ├── __init__.py
│   │   ├── persona_engine.py   # Persona lifecycle + bounded evolution
│   │   └── domain_classifier.py
│   │
│   ├── orchestration/
│   │   ├── __init__.py
│   │   ├── moe_orchestrator.py # 8-phase query execution pipeline
│   │   └── agent_swarm.py      # Multi-agent parallel execution
│   │
│   ├── governance/
│   │   ├── __init__.py
│   │   ├── control_boundary.py # Intent evaluation + output scoring
│   │   └── audit_exporter.py   # Export audit trail to CSV/JSON
│   │
│   ├── specgen/
│   │   ├── __init__.py
│   │   ├── spec_generator.py   # Deterministic code generation
│   │   └── spec_validator.py   # Validate spec completeness before generation
│   │
│   └── api/
│       ├── __init__.py
│       ├── main.py             # FastAPI application
│       ├── middleware.py       # Request logging, auth
│       └── models.py           # Pydantic request/response models
│
├── frontend/                   # Next.js 14 interface
│   ├── package.json
│   ├── tsconfig.json
│   ├── next.config.ts
│   ├── tailwind.config.ts
│   │
│   ├── app/
│   │   ├── layout.tsx
│   │   ├── page.tsx            # Main chat interface
│   │   ├── globals.css
│   │   │
│   │   ├── chat/
│   │   │   └── page.tsx        # Conversational query UI
│   │   ├── personas/
│   │   │   └── page.tsx        # Persona management dashboard
│   │   ├── knowledge/
│   │   │   └── page.tsx        # Knowledge graph visualization
│   │   ├── audit/
│   │   │   └── page.tsx        # Audit log viewer
│   │   └── specgen/
│   │       └── page.tsx        # SpecGen UI: spec input → code output
│   │
│   └── components/
│       ├── ChatInterface.tsx
│       ├── PersonaCard.tsx
│       ├── GraphViewer.tsx     # D3.js or Cytoscape knowledge graph viz
│       ├── AuditLog.tsx
│       ├── EvaluationScore.tsx
│       ├── ProvenancePanel.tsx
│       └── SpecForm.tsx
│
├── data/
│   ├── personas/
│   │   ├── experimental/
│   │   ├── active/
│   │   ├── stable/
│   │   ├── pruned/
│   │   └── cold_storage/
│   ├── chromadb/               # Local vector store persistence
│   ├── graph_snapshots/        # Exported knowledge graph states
│   └── documents/              # Source document repository
│
├── logs/
│   ├── audit.jsonl             # Governance audit trail (append-only)
│   ├── execution_traces/       # Per-query execution traces
│   └── persona_evolution/      # Persona lifecycle change logs
│
├── scripts/
│   ├── setup.sh                # One-command environment setup
│   ├── ingest_documents.py     # Batch document ingestion
│   ├── create_persona.py       # Interactive persona creation wizard
│   ├── export_audit.py         # Audit trail export utility
│   ├── run_specgen.py          # SpecGen CLI
│   └── graph_snapshot.py       # Export knowledge graph state
│
└── tests/
    ├── unit/
    │   ├── test_knowledge_graph.py
    │   ├── test_persona_engine.py
    │   ├── test_control_boundary.py
    │   ├── test_local_engine.py
    │   └── test_spec_generator.py
    ├── integration/
    │   ├── test_orchestration_pipeline.py
    │   └── test_api_endpoints.py
    └── fixtures/
        ├── sample_personas.json
        ├── sample_documents/
        └── sample_specs.json

XI. Configuration: The Master Manifest

YAML
# config/sovereign.yaml
# Every value here is yours to set. Nothing is a default you cannot override.
# Read this file as a declaration of your own system's values.

sovereign:
  version: "1.0.0"
  environment: "development"   # development | production | air_gap

inference_config:
  execution_mode: "ollama"     # ollama | llama_cpp | hybrid
  ollama_endpoint: "http://localhost:11434"
  default_model: "llama3.2"
  seed: 42                     # Reproducibility: same seed = same output
  temperature: 0.1             # Low temperature: precision over creativity
  max_tokens: 2000
  model_registry:
    routing:
      code: "qwen2.5-coder:7b"
      research: "llama3.2"
      writing: "mistral:7b"
      architecture: "llama3.2"
      governance: "llama3.2"
    paths: {}                  # For llama_cpp mode: model file paths

graph_config:
  neo4j_uri: "bolt://localhost:7687"
  neo4j_user: "neo4j"
  neo4j_password: "sovereign"  # Change this before production
  decay_factor: 0.95           # Temporal decay per session
  prune_confidence_threshold: 0.1

vector_config:
  persist_directory: "./data/chromadb"
  collection_name: "sovereign_documents"
  embedding_model: "nomic-embed-text"

persona_config:
  personas_dir: "./data/personas"
  max_parallel_personas: 3
  prune_threshold: 0.3
  recall_threshold: 0.3
  evolution_rate: 0.05         # How quickly persona traits respond to feedback
  min_queries_before_prune: 10

governance_config:
  audit_log_path: "./logs/audit.jsonl"
  risk_thresholds:
    block: 0.9
    require_confirmation: 0.7
    enhanced_logging: 0.4
  reasonable_care_mode: true   # Colorado AI Act alignment

specgen_config:
  generation_seed: 42
  temperature: 0.0             # Zero temperature: maximum determinism
  cache_generated_specs: true

api_config:
  host: "0.0.0.0"
  port: 8000
  cors_origins:
    - "http://localhost:3000"

frontend_config:
  api_base_url: "http://localhost:8000"
  websocket_url: "ws://localhost:8000/query/stream"
  graph_visualization: "cytoscape"  # d3 | cytoscape

XII. Bootstrap: From Zero to Sovereign in Ten Commands

Bash
# 1. Clone and enter
git clone https://github.com/kliewerdaniel/sovereign.git
cd sovereign

# 2. Install Python dependencies (declared in pyproject.toml; see Appendix A)
pip install -e .

# 3. Install spaCy language model (for entity extraction in governance layer)
python -m spacy download en_core_web_sm

# 4. Start Ollama and pull your primary model
ollama serve &
ollama pull llama3.2
ollama pull nomic-embed-text   # For local embeddings

# 5. Start Neo4j (optional: skip for pure in-memory graph)
docker run -d \
  --name sovereign-neo4j \
  -p 7474:7474 -p 7687:7687 \
  -e NEO4J_AUTH=neo4j/sovereign \
  neo4j:latest

# 6. Create directory structure
bash scripts/setup.sh

# 7. Ingest your first documents
python scripts/ingest_documents.py --source ./data/documents/

# 8. Start the API backend
uvicorn sovereign.api.main:app --reload --port 8000

# 9. Start the frontend
cd frontend && npm install && npm run dev

# 10. Open your sovereign AI at http://localhost:3000
# No API keys. No cloud. No telemetry.
# Your hardware. Your inference. Your memory.
echo "SOVEREIGN is running. You own this."

XIII. The Knowledge Graph of the Blog — Why This Project Is the Synthesis

Every post I have written on this blog is a node in a knowledge graph. Every project I have built is an edge between concepts. SOVEREIGN is the traversal of that graph from end to end — the path that passes through every significant node and resolves the relationships between them.

[local inference] ──ENABLES──▶ [data sovereignty]
[data sovereignty] ──REQUIRES──▶ [audit trails]
[audit trails] ──REQUIRES──▶ [control boundary]
[control boundary] ──GOVERNS──▶ [MoE orchestration]
[MoE orchestration] ──ROUTES_TO──▶ [persona engine]
[persona engine] ──QUERIES──▶ [knowledge graph]
[knowledge graph] ──GROUNDS──▶ [RAG retrieval]
[RAG retrieval] ──FEEDS──▶ [SpecGen]
[SpecGen] ──GENERATES──▶ [new sovereign components]
[new sovereign components] ──EXPAND──▶ [knowledge graph]
                                              ▲
                                              └── (the loop closes)
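The loop above can be checked mechanically. A small depth-first search over the same edges confirms the graph closes into a directed cycle; this is a pure-Python adjacency list, with the edge labels dropped.

```python
EDGES = [
    ("local inference", "data sovereignty"),
    ("data sovereignty", "audit trails"),
    ("audit trails", "control boundary"),
    ("control boundary", "MoE orchestration"),
    ("MoE orchestration", "persona engine"),
    ("persona engine", "knowledge graph"),
    ("knowledge graph", "RAG retrieval"),
    ("RAG retrieval", "SpecGen"),
    ("SpecGen", "new sovereign components"),
    ("new sovereign components", "knowledge graph"),  # the loop closes
]

def has_cycle(edges) -> bool:
    """Three-color DFS: True if any directed cycle exists."""
    adj = {}
    for src, dst in edges:
        adj.setdefault(src, []).append(dst)
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {n: WHITE for e in edges for n in e}
    def visit(node):
        color[node] = GRAY  # on the current DFS path
        for nxt in adj.get(node, []):
            if color[nxt] == GRAY:       # back edge: cycle found
                return True
            if color[nxt] == WHITE and visit(nxt):
                return True
        color[node] = BLACK  # fully explored
        return False
    return any(visit(n) for n in color if color[n] == WHITE)
```

Remove the final edge and the cycle disappears: the system stops being self-expanding and becomes a one-way pipeline.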

This is not a coincidence of architecture. It is the point. A sovereign AI system should be able to reason about its own architecture. The knowledge graph should contain documentation of the system itself. SpecGen should be able to generate new components for the system from its own specifications. The orchestrator should be able to route queries about how to improve the orchestrator.

The system is self-referential by design. Not self-modifying — you remain the author of every change. But self-aware in the sense that every component can be queried, explained, and improved using the system itself.

That is what sovereignty means at full depth. Not just that your data stays local. Not just that your inference is on-prem. But that the system you use to think can be used to improve the way you think, and the improvement remains yours.


XIV. What This Is Not

SOVEREIGN is not:

  • A replacement for the best frontier models. GPT-5 and Claude and Gemini outperform every local model on raw capability benchmarks. If raw capability is all you care about, and you accept their hardware, their telemetry, and their custody of your data as the price, this architecture is not for you.

  • A finished product. It is an architecture. A blueprint. A starting point. The personas you define will shape it. The documents you ingest will train its memory. The governance thresholds you configure will determine its behavior. The code this post generates is scaffolding, not a ceiling.

  • A political statement against any particular company. It is a structural argument: systems designed to extract from you produce different architecture than systems designed to serve you. Both exist. The choice between them is yours to make.

What this is: the most complete expression of everything I understand about building AI systems that answer to the person running them. Every module in this codebase is the distillation of a problem I could not stop thinking about until I had an implementation that solved it.

Build it. Modify it. Extend it. Publish your modifications. The graph grows in every direction from here.


Closing: The Architecture Is the Argument

The code in this post is an argument.

The bounded update function Δw = f(feedback) × (1 − w) is an argument that stability matters — that a system should resist extremes, not optimize toward them.
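A few iterations make the stability argument concrete. Under the bounded rule, a trait weight approaches 1.0 asymptotically and can never overshoot it, no matter how strong the feedback. This is a numeric sketch; the `evolution_rate` default mirrors the value in `persona_config`.

```python
def bounded_update(w: float, feedback: float, evolution_rate: float = 0.05) -> float:
    """Delta-w = f(feedback) * (1 - w): the gap to 1.0 shrinks, never inverts."""
    delta = evolution_rate * feedback * (1.0 - w)
    return w + delta

w = 0.5
history = []
for _ in range(100):
    w = bounded_update(w, feedback=1.0)  # maximally positive feedback every step
    history.append(w)
```

One hundred consecutive perfect scores push the weight near 1.0 but never past it. The update shrinks as the weight grows, which is the resistance to extremes the argument names.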

The query-scoped knowledge graph is an argument that memory should be deliberate — that accumulation without discernment is not intelligence, it is noise.

The governance layer in the execution path is an argument that accountability cannot be post-hoc — that a system which can only be evaluated after the fact cannot be meaningfully controlled.

The local inference requirement is an argument that the execution path should belong to the person executing — that cognitive infrastructure has an owner, and that owner should be you.

Every design choice in SOVEREIGN is downstream of one question: who is this system for?

I built it for myself. And then I wrote it down so you could build it for yourself too.

That is what sovereignty means in practice: not the absence of dependency on everything, but the deliberate choice of which dependencies you accept and which you refuse. The cloud can keep the telemetry. You keep the mind.


Appendix A: Python Dependencies

TOML
# pyproject.toml
[project]
name = "sovereign"
version = "1.0.0"
description = "Self-owned local-first AI orchestration system"
requires-python = ">=3.11"

dependencies = [
    # Core
    "fastapi>=0.110.0",
    "uvicorn[standard]>=0.29.0",
    "pydantic>=2.6.0",
    "pyyaml>=6.0",
    
    # Inference
    "requests>=2.31.0",
    
    # Memory
    "chromadb>=0.4.24",
    "networkx>=3.2",
    "neo4j>=5.18.0",
    
    # Document processing
    "pypdf>=4.1.0",
    "python-docx>=1.1.0",
    "markdown>=3.6",
    
    # NLP / Entity extraction
    "spacy>=3.7.4",
    
    # Utilities
    "python-multipart>=0.0.9",
    "aiofiles>=23.2.1",
    "websockets>=12.0",
]

[project.optional-dependencies]
dev = [
    "pytest>=8.1.0",
    "pytest-asyncio>=0.23.0",
    "httpx>=0.27.0",
    "black>=24.3.0",
    "ruff>=0.3.0",
    "mypy>=1.9.0",
]

Appendix B: Docker Compose

YAML
# docker-compose.yml
# Complete local stack. No external services. No internet required after initial pull.

version: "3.9"

services:
  sovereign-api:
    build: .
    ports:
      - "8000:8000"
    volumes:
      - ./data:/app/data
      - ./logs:/app/logs
      - ./config:/app/config
    environment:
      - OLLAMA_ENDPOINT=http://ollama:11434
      - NEO4J_URI=bolt://neo4j:7687
    depends_on:
      - ollama
      - neo4j
    networks:
      - sovereign-network

  sovereign-frontend:
    build: ./frontend
    ports:
      - "3000:3000"
    environment:
      - NEXT_PUBLIC_API_URL=http://localhost:8000
    networks:
      - sovereign-network

  ollama:
    image: ollama/ollama:latest
    ports:
      - "11434:11434"
    volumes:
      - ollama-models:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    networks:
      - sovereign-network

  neo4j:
    image: neo4j:5
    ports:
      - "7474:7474"
      - "7687:7687"
    environment:
      - NEO4J_AUTH=neo4j/sovereign
    volumes:
      - neo4j-data:/data
    networks:
      - sovereign-network

volumes:
  ollama-models:
  neo4j-data:

networks:
  sovereign-network:
    driver: bridge

SOVEREIGN is the synthesis of every system documented on this blog. Every component described here has a prior post that goes deeper on its individual design. The knowledge graph of danielkliewer.com is the context this post assumes you already carry. If you arrived here without that context, the blog is the prerequisite.

Repository: github.com/kliewerdaniel/sovereign

Series: Sovereignty Manifesto · Architecture as Autonomy · Architecture of Autonomy · Private Knowledge Graph · DeerFlow 2.0 · OpenClaw Guide · SOVEREIGN — This Post
