Building a Dynamic Persona-Based Mixture-of-Experts RAG System

A comprehensive guide to building a dynamic, graph-based Mixture-of-Experts Retrieval-Augmented Generation system that leverages persona-driven AI agents for contextually rich responses.

Daniel Kliewer

Author, Sovereign AI

Tags: AI, Machine Learning, RAG, Mixture-of-Experts, Knowledge Graphs, Ollama, Python

From the Book

This is from Sovereign AI: Building Local-First Intelligent Systems.

Code for this guide can be found on my GitHub here

Introduction

This guide walks through building a dynamic, graph-based Mixture-of-Experts (MoE) Retrieval-Augmented Generation (RAG) system driven by persona-based AI agents. Multiple AI "personas" traverse a query-scoped knowledge graph, each from its own perspective, and their outputs are combined into contextually rich, diverse responses.

In this post, we'll walk through the construction of this system, from initial project setup to the final scaffolded architecture. Most of the code shown is scaffolding: class and function stubs that fix the interfaces, with the implementations left to the roadmap in Part 6. We'll explore each component, explain the design decisions, and show how the pieces fit together.

Part 1: Project Foundations and Architecture

1.1 The Vision: Dynamic Persona MoE RAG

At its core, this system implements a Mixture-of-Experts RAG where:

  • Personas are specialized AI agents with unique traits, expertise, and behavioral patterns
  • Dynamic Graphs represent knowledge in a flexible, query-scoped structure
  • Traversal Logic allows personas to navigate graphs based on their individual perspectives
  • Ollama Integration provides local LLM inference with synthesized persona context

The key innovation is the dynamic nature: graphs are built on-demand for each query, personas evolve through performance feedback, and the system adapts through pruning and promotion cycles.

1.2 Project Initialization

We begin by creating a robust Python project structure:

bash
mkdir dynamic_persona_moe_rag
cd dynamic_persona_moe_rag
python3 -m venv venv

The .gitignore file follows Python best practices, excluding virtual environments, cache files, and build artifacts:

gitignore
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# Environments
.env
.venv
env/
venv/
ENV/

1.3 Core Architecture Overview

The system follows a modular architecture with clear separation of concerns:

text
src/
├── core/        # Main orchestration and interfaces
├── graph/       # Dynamic knowledge graph implementation
├── personas/    # Persona lifecycle and storage
├── agents/      # Specialized AI agents
├── evaluation/  # Scoring and metrics
└── storage/     # Persistence and snapshots

configs/         # YAML configuration files
scripts/         # Pipeline execution
data/            # Input/output data

Part 2: Configuration and Data Structures

2.1 Configuration System

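The system uses YAML for configuration, which keeps settings human-readable. YAML itself does not enforce types, so values should be validated when they are loaded. In the scaffold below, each key is listed with a descriptive comment and its value is left blank: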

system.yaml - Global parameters:

yaml
# Global system parameters
max_iterations:  # Maximum number of iterations for the pipeline
batch_size:      # Batch size for processing
log_level:       # Logging level (DEBUG, INFO, etc.)
enable_caching:  # Whether to enable caching

thresholds.yaml - Pruning logic:

yaml
# Pruning and promotion thresholds
pruning_threshold:     # Threshold for pruning personas
promotion_threshold:   # Threshold for promoting personas
demotion_threshold:    # Threshold for demoting personas
activation_threshold:  # Threshold for activating personas

ollama.yaml - Model settings:

yaml
# Local model configuration
model_name:    # Name of the Ollama model to use
temperature:   # Temperature for generation
max_tokens:    # Maximum tokens to generate
api_endpoint:  # Ollama API endpoint (usually localhost)
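
Because the YAML files above leave their values blank, a loader has to reject missing entries rather than silently accept them. The following is a minimal sketch of how thresholds.yaml might be validated once it has been parsed into a dict (for example with a YAML library, which is not shown here). The field names come from the file above; the `Thresholds` class, the `[0, 1]` range, the ordering rule, and the example values are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    """Typed view of thresholds.yaml; range and ordering rules are assumptions."""
    pruning_threshold: float
    demotion_threshold: float
    activation_threshold: float
    promotion_threshold: float

    @classmethod
    def from_dict(cls, raw: dict) -> "Thresholds":
        # Blank YAML values parse as None, so reject them explicitly.
        missing = [name for name in cls.__dataclass_fields__ if raw.get(name) is None]
        if missing:
            raise ValueError(f"missing threshold values: {missing}")
        values = {name: float(raw[name]) for name in cls.__dataclass_fields__}
        if not all(0.0 <= v <= 1.0 for v in values.values()):
            raise ValueError("thresholds are assumed to lie in [0, 1]")
        if not (values["pruning_threshold"]
                <= values["demotion_threshold"]
                <= values["promotion_threshold"]):
            raise ValueError("expected pruning <= demotion <= promotion")
        return cls(**values)


# Hypothetical values, chosen only for illustration.
example = Thresholds.from_dict({
    "pruning_threshold": 0.2,
    "demotion_threshold": 0.4,
    "activation_threshold": 0.5,
    "promotion_threshold": 0.8,
})
```

Failing fast at load time keeps a half-filled configuration file from reaching the pruning logic.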

2.2 Persona Schema Definition

Personas are defined by a strict JSON schema ensuring consistency:

json
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "type": "object",
  "properties": {
    "persona_id": {
      "type": "string",
      "description": "Unique identifier for the persona"
    },
    "traits": {
      "type": "object",
      "patternProperties": {
        "^.*$": {
          "type": "integer",
          "minimum": 1,
          "maximum": 9,
          "description": "Trait value between 1 and 9"
        }
      },
      "description": "Object containing trait names as keys and numeric values 1-9 as values"
    },
    "expertise": {
      "type": "array",
      "items": {
        "type": "string"
      },
      "description": "Array of strings representing areas of expertise"
    },
    "activation_cost": {
      "type": "number",
      "description": "Float representing the cost to activate this persona"
    },
    "historical_performance": {
      "type": "object",
      "description": "Object containing historical performance metrics"
    },
    "metadata": {
      "type": "object",
      "description": "Object containing additional metadata"
    }
  },
  "required": ["persona_id", "traits", "expertise", "activation_cost", "historical_performance", "metadata"]
}
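
To make the schema concrete, here is a sketch of a persona that satisfies it, together with a small standard-library check that mirrors the schema's main constraints. The persona values (`analyst_001`, the trait names, and so on) are hypothetical, and `validate_persona` is a simplified stand-in written for this example, not a full JSON Schema validator.

```python
def validate_persona(persona: dict) -> list:
    """Return a list of schema violations; an empty list means the persona passes."""
    errors = []
    required = {
        "persona_id": str,
        "traits": dict,
        "expertise": list,
        "activation_cost": (int, float),
        "historical_performance": dict,
        "metadata": dict,
    }
    for field, expected in required.items():
        if field not in persona:
            errors.append(f"missing required field: {field}")
        # bool is a subclass of int in Python, so exclude it explicitly.
        elif isinstance(persona[field], bool) or not isinstance(persona[field], expected):
            errors.append(f"wrong type for field: {field}")

    traits = persona.get("traits")
    if isinstance(traits, dict):
        for name, value in traits.items():
            if isinstance(value, bool) or not isinstance(value, int) or not 1 <= value <= 9:
                errors.append(f"trait {name!r} must be an integer from 1 to 9")

    expertise = persona.get("expertise")
    if isinstance(expertise, list) and not all(isinstance(item, str) for item in expertise):
        errors.append("expertise must contain only strings")
    return errors


# Hypothetical persona used only to illustrate the schema.
example_persona = {
    "persona_id": "analyst_001",
    "traits": {"skepticism": 7, "creativity": 4},
    "expertise": ["statistics", "data analysis"],
    "activation_cost": 0.3,
    "historical_performance": {},
    "metadata": {},
}
```

Calling `validate_persona(example_persona)` returns an empty list, while a trait value outside 1 to 9 produces one error message per offending trait.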

Part 3: Core Components Deep Dive

3.1 Dynamic Knowledge Graph

The graph system is designed for query-scoped efficiency:

Graph Class:

python
class DynamicKnowledgeGraph:
    """
    A dynamic graph that constructs nodes and edges on-demand for a single query.
    """

    def __init__(self):
        self.nodes = {}
        self.edges = []

    def add_node(self, node_id, node_data):
        """Lazily construct a node when needed."""
        pass

    def add_edge(self, source_id, target_id, edge_data):
        """Create an edge on-demand between nodes."""
        pass

Node and Edge Classes:

python
class Node:
    """Represents a node in the dynamic knowledge graph."""
    def __init__(self, node_id, data=None):
        self.node_id = node_id
        self.data = data or {}
        self.edges = []


class Edge:
    """Represents an edge in the dynamic knowledge graph."""
    def __init__(self, source_node, target_node, data=None):
        self.source = source_node
        self.target = target_node
        self.data = data or {}
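
The stubs above leave `add_node` and `add_edge` empty. One way they might be filled in is sketched below. It assumes directed edges, makes the data arguments optional, and adds a `neighbors` helper that does not appear in the original scaffold.

```python
class Node:
    def __init__(self, node_id, data=None):
        self.node_id = node_id
        self.data = data or {}
        self.edges = []  # outgoing edges only (assumption: the graph is directed)


class Edge:
    def __init__(self, source_node, target_node, data=None):
        self.source = source_node
        self.target = target_node
        self.data = data or {}


class DynamicKnowledgeGraph:
    def __init__(self):
        self.nodes = {}
        self.edges = []

    def add_node(self, node_id, node_data=None):
        """Create the node on first use; later calls merge data into it."""
        node = self.nodes.get(node_id)
        if node is None:
            node = Node(node_id, node_data)
            self.nodes[node_id] = node
        elif node_data:
            node.data.update(node_data)
        return node

    def add_edge(self, source_id, target_id, edge_data=None):
        """Connect two nodes, lazily creating either endpoint if it is missing."""
        source = self.add_node(source_id)
        target = self.add_node(target_id)
        edge = Edge(source, target, edge_data)
        source.edges.append(edge)
        self.edges.append(edge)
        return edge

    def neighbors(self, node_id):
        """Hypothetical helper: nodes reachable over outgoing edges."""
        node = self.nodes.get(node_id)
        return [edge.target for edge in node.edges] if node else []
```

Because `add_edge` creates missing endpoints itself, a traversal can grow the graph one edge at a time without a separate node-registration step, which suits the query-scoped design.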

3.2 Persona Traversal Interface

The traversal system uses abstract interfaces for flexibility:

python
from abc import ABC, abstractmethod


class PersonaTraversalInterface(ABC):
    """
    Abstract base class defining the interface for persona traversal.
    """

    @abstractmethod
    def evaluate_node_relevance(self, persona, node):
        """
        Evaluate how relevant a graph node is to a given persona.
        Returns: float (relevance score between 0 and 1)
        """
        pass

    @abstractmethod
    def decide_traversal(self, current_node, available_nodes, persona):
        """
        Decide which nodes to traverse to next based on persona evaluation.
        Returns: list (nodes to traverse to next)
        """
        pass
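
As an illustration of how a concrete strategy could plug into this interface, the sketch below scores a node by the Jaccard overlap between a persona's `expertise` list and a node's `tags` list. The class name `ExpertiseOverlapTraversal`, the `tags` field, the dict-based persona and node shapes, and the two tuning parameters are assumptions made for this example; the interface is repeated so the block runs on its own.

```python
from abc import ABC, abstractmethod


class PersonaTraversalInterface(ABC):
    @abstractmethod
    def evaluate_node_relevance(self, persona, node): ...

    @abstractmethod
    def decide_traversal(self, current_node, available_nodes, persona): ...


class ExpertiseOverlapTraversal(PersonaTraversalInterface):
    """Hypothetical strategy: relevance is the Jaccard overlap of expertise and tags."""

    def __init__(self, min_relevance=0.2, max_branches=2):
        self.min_relevance = min_relevance
        self.max_branches = max_branches

    def evaluate_node_relevance(self, persona, node):
        expertise = {term.lower() for term in persona.get("expertise", [])}
        tags = {tag.lower() for tag in node.get("tags", [])}
        if not expertise or not tags:
            return 0.0
        return len(expertise & tags) / len(expertise | tags)

    def decide_traversal(self, current_node, available_nodes, persona):
        # current_node is unused by this simple strategy; it is kept to match the interface.
        scored = [(self.evaluate_node_relevance(persona, n), n) for n in available_nodes]
        relevant = [(s, n) for s, n in scored if s >= self.min_relevance]
        relevant.sort(key=lambda pair: pair[0], reverse=True)
        return [n for _, n in relevant[: self.max_branches]]
```

Two personas with different expertise lists would take different paths through the same graph under this strategy, which is the behavior the traversal interface is meant to allow.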

3.3 Mixture-of-Experts Orchestrator

The orchestrator manages the entire MoE cycle:

python
class MoeOrchestrator:
    """
    Orchestrates the mixture-of-experts RAG system.
    """

    def expansion_phase(self):
        """Expansion phase: Generate diverse outputs from active personas."""
        pass

    def evaluation_phase(self):
        """Evaluation phase: Score and rank the generated outputs."""
        pass

    def pruning_phase(self):
        """Pruning phase: Remove underperforming personas and promote high performers."""
        pass
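
The three phases above can be chained into a single cycle. The sketch below shows one possible wiring. It departs from the scaffold in that the phase methods take arguments, and the constructor parameters (`generate`, `score`, and the two thresholds) and the `run_cycle` method are assumptions introduced for illustration.

```python
class MoeOrchestrator:
    """Sketch of one expansion -> evaluation -> pruning cycle."""

    def __init__(self, personas, generate, score, pruning_threshold, promotion_threshold):
        self.personas = dict(personas)   # persona_id -> persona dict
        self.generate = generate         # callable(persona, query) -> str
        self.score = score               # callable(output, query) -> float
        self.pruning_threshold = pruning_threshold
        self.promotion_threshold = promotion_threshold

    def expansion_phase(self, query):
        """Collect one output per active persona."""
        return {pid: self.generate(p, query) for pid, p in self.personas.items()}

    def evaluation_phase(self, outputs, query):
        """Score every output and rank persona ids from best to worst."""
        scores = {pid: self.score(text, query) for pid, text in outputs.items()}
        return sorted(scores.items(), key=lambda item: item[1], reverse=True)

    def pruning_phase(self, ranked):
        """Drop personas below the pruning threshold and report promotions."""
        pruned, promoted = [], []
        for pid, value in ranked:
            if value < self.pruning_threshold:
                self.personas.pop(pid)
                pruned.append(pid)
            elif value >= self.promotion_threshold:
                promoted.append(pid)
        return pruned, promoted

    def run_cycle(self, query):
        outputs = self.expansion_phase(query)
        ranked = self.evaluation_phase(outputs, query)
        pruned, promoted = self.pruning_phase(ranked)
        return {"ranked": ranked, "pruned": pruned, "promoted": promoted}
```

Passing `generate` and `score` in as callables keeps the orchestrator testable without a running model: a test can supply simple functions, and the real system can supply the Ollama call and the scoring framework from Part 4.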

Part 4: Evaluation and Adaptation

4.1 Scoring Framework

Multiple scoring criteria ensure comprehensive evaluation:

python
def score_relevance(output, query):
    """Score the relevance of an output to the input query."""
    return 0.0


def score_consistency(output, reference_outputs):
    """Score the consistency of an output with reference outputs."""
    return 0.0


def score_novelty(output, existing_outputs):
    """Score the novelty of an output compared to existing outputs."""
    return 0.0


def score_entity_grounding(output, entities):
    """Score how well the output is grounded in the provided entities."""
    return 0.0
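
The scoring stubs all return `0.0`. As a placeholder until better metrics exist, each one could be approximated with simple token overlap, as in the sketch below. This is a deliberately crude baseline chosen for illustration, not the scoring method of the finished system; embedding-based similarity would be a natural replacement. The helpers `_tokens` and `_jaccard` are hypothetical.

```python
import re


def _tokens(text):
    """Lowercase word tokens; a crude stand-in for real text normalization."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def _jaccard(a, b):
    """Jaccard similarity of two token sets; 0.0 when both are empty."""
    return len(a & b) / len(a | b) if a | b else 0.0


def score_relevance(output, query):
    """Fraction of query tokens that also appear in the output."""
    query_tokens = _tokens(query)
    if not query_tokens:
        return 0.0
    return len(query_tokens & _tokens(output)) / len(query_tokens)


def score_consistency(output, reference_outputs):
    """Mean Jaccard similarity between the output and each reference."""
    if not reference_outputs:
        return 0.0
    out = _tokens(output)
    return sum(_jaccard(out, _tokens(ref)) for ref in reference_outputs) / len(reference_outputs)


def score_novelty(output, existing_outputs):
    """One minus the highest similarity to any existing output."""
    if not existing_outputs:
        return 1.0
    out = _tokens(output)
    return 1.0 - max(_jaccard(out, _tokens(other)) for other in existing_outputs)


def score_entity_grounding(output, entities):
    """Fraction of entity names mentioned verbatim in the output.

    Iterating a dict yields its keys, so a dict keyed by entity name also works.
    """
    if not entities:
        return 0.0
    text = output.lower()
    return sum(1 for name in entities if name.lower() in text) / len(entities)
```

Note that consistency and novelty pull in opposite directions by construction, so any combined score needs weights that reflect how much diversity the system should reward.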

4.2 Metrics and Aggregation

python
def calculate_average_score(scores):
    """Calculate the average of a list of scores."""
    return 0.0


def calculate_weighted_score(scores, weights):
    """Calculate a weighted average of scores."""
    return 0.0


def aggregate_persona_performance(persona_scores):
    """Aggregate performance metrics for a persona across multiple evaluations."""
    return {}
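
These aggregation stubs have straightforward implementations. The sketch below fills them in with the arithmetic mean and a normalized weighted mean. The keys of the returned summary dict (`count`, `mean`, `min`, `max`) are an assumption, since the scaffold does not say what the aggregate should contain.

```python
from statistics import mean


def calculate_average_score(scores):
    """Arithmetic mean, or 0.0 for an empty list."""
    return mean(scores) if scores else 0.0


def calculate_weighted_score(scores, weights):
    """Weighted mean: sum(score * weight) / sum(weight)."""
    if len(scores) != len(weights):
        raise ValueError("scores and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        return 0.0
    return sum(s * w for s, w in zip(scores, weights)) / total_weight


def aggregate_persona_performance(persona_scores):
    """Summarize a list of per-evaluation scores for one persona."""
    if not persona_scores:
        return {"count": 0, "mean": 0.0, "min": 0.0, "max": 0.0}
    return {
        "count": len(persona_scores),
        "mean": mean(persona_scores),
        "min": min(persona_scores),
        "max": max(persona_scores),
    }
```

Dividing by the sum of the weights means the weights do not need to add up to 1, so relevance, consistency, novelty, and grounding can be weighted on any convenient scale.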

4.3 Persona Lifecycle Management

Personas evolve through performance-based transitions:

  • Active: Currently participating in inference
  • Stable: Proven performers, quick to activate
  • Experimental: Newly created or modified, being tested
  • Pruned: Underperforming, archived in tiered folders

python
def evaluate_pruning_thresholds(persona_performance, thresholds):
    """
    Threshold-based demotion: Personas below certain performance metrics
    are demoted from active to stable, stable to experimental, experimental to pruned.
    """
    return 'keep'


def move_persona_to_folder(persona_id, current_folder, target_folder):
    """
    Folder-based archival: Move personas to appropriate archival folders
    instead of deleting them.
    """
    pass
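
One way the lifecycle logic might be implemented is sketched below. It assumes that `persona_performance` carries a `mean` score, that the decision values are the strings `'promote'`, `'demote'`, and `'keep'` (only `'keep'` appears in the scaffold), and that each persona lives in a file named `<persona_id>.json` inside a folder per tier. The `TIERS` list and the `next_tier` helper are hypothetical.

```python
import shutil
from pathlib import Path

# Tier order from highest to lowest, following the demotion chain described above.
TIERS = ["active", "stable", "experimental", "pruned"]


def evaluate_pruning_thresholds(persona_performance, thresholds):
    """Return 'promote', 'demote', or 'keep' based on a persona's mean score."""
    score = persona_performance.get("mean", 0.0)
    if score >= thresholds["promotion_threshold"]:
        return "promote"
    if score < thresholds["demotion_threshold"]:
        return "demote"
    return "keep"


def next_tier(current_tier, decision):
    """Move one step along TIERS; the two ends of the list are sticky."""
    index = TIERS.index(current_tier)
    if decision == "demote":
        index = min(index + 1, len(TIERS) - 1)
    elif decision == "promote":
        index = max(index - 1, 0)
    return TIERS[index]


def move_persona_to_folder(persona_id, current_folder, target_folder):
    """Move <persona_id>.json between tier folders instead of deleting it."""
    source = Path(current_folder) / f"{persona_id}.json"
    destination_dir = Path(target_folder)
    destination_dir.mkdir(parents=True, exist_ok=True)
    destination = destination_dir / source.name
    shutil.move(str(source), str(destination))
    return destination
```

Moving one tier at a time means a single bad evaluation cannot send an active persona straight to the archive, and archived files remain available if a pruned persona is later worth reviving.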

Part 5: Integration and Execution

5.1 Ollama Integration

Local LLM inference with persona context:

python
def synthesize_persona_context(persona_outputs, graph_context):
    """Synthesize context from multiple persona outputs and graph traversal."""
    return ""


def send_prompt_to_ollama(synthesized_context, query, ollama_client):
    """Send the final prompt to the local Ollama model."""
    return ""
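
For a sense of what the Ollama call could look like, the sketch below talks to Ollama's HTTP `/api/generate` endpoint using only the standard library. It differs from the scaffold in a few ways that should be read as assumptions: it uses `urllib` rather than an `ollama_client` object, it splits payload construction into a separate `build_generate_payload` helper so that part can be tested offline, and the prompt layout is invented for this example. The default address is Ollama's usual local port; adjust it to match `api_endpoint` in ollama.yaml.

```python
import json
import urllib.request

# Ollama's default local address (assumed; override to match ollama.yaml).
DEFAULT_ENDPOINT = "http://localhost:11434/api/generate"


def synthesize_persona_context(persona_outputs, graph_context):
    """Join graph facts and per-persona outputs into one context string."""
    lines = ["Graph context:", graph_context.strip(), "", "Persona perspectives:"]
    for persona_id, text in persona_outputs.items():
        lines.append(f"- [{persona_id}] {text.strip()}")
    return "\n".join(lines)


def build_generate_payload(synthesized_context, query, model_name, temperature, max_tokens):
    """Build the JSON body for the /api/generate endpoint."""
    prompt = f"{synthesized_context}\n\nQuestion: {query}\nAnswer:"
    return {
        "model": model_name,
        "prompt": prompt,
        "stream": False,  # ask for a single JSON object rather than a stream
        "options": {"temperature": temperature, "num_predict": max_tokens},
    }


def send_prompt_to_ollama(payload, endpoint=DEFAULT_ENDPOINT, timeout=120):
    """POST the payload and return the generated text from the 'response' field."""
    request = urllib.request.Request(
        endpoint,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as reply:
        return json.loads(reply.read().decode("utf-8"))["response"]
```

The `model_name`, `temperature`, and `max_tokens` arguments map directly onto the keys in ollama.yaml, with `max_tokens` passed as Ollama's `num_predict` option. `send_prompt_to_ollama` needs a running Ollama server with the named model already pulled.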

5.2 Storage and Persistence

Robust persistence for personas and graphs:

python
def load_persona_from_file(filepath):
    """Load a persona JSON file from disk."""
    return {}


def save_persona_to_file(persona_data, filepath):
    """Save persona data to a JSON file."""
    pass


def save_graph_snapshot(graph, query_id, timestamp):
    """Save a snapshot of the current graph state."""
    pass
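
A possible implementation of these persistence helpers is sketched below. It assumes the graph exposes `nodes` and `edges` in the shape defined in section 3.1, adds a `snapshot_dir` parameter whose default path is hypothetical, and writes through a temporary file so an interrupted save does not leave a truncated JSON file behind.

```python
import json
from pathlib import Path


def load_persona_from_file(filepath):
    """Load a persona JSON file from disk."""
    with open(filepath, "r", encoding="utf-8") as handle:
        return json.load(handle)


def save_persona_to_file(persona_data, filepath):
    """Write to a temporary file first, then rename it over the target."""
    path = Path(filepath)
    path.parent.mkdir(parents=True, exist_ok=True)
    temp_path = path.with_suffix(path.suffix + ".tmp")
    with open(temp_path, "w", encoding="utf-8") as handle:
        json.dump(persona_data, handle, indent=2, sort_keys=True)
    temp_path.replace(path)


def save_graph_snapshot(graph, query_id, timestamp, snapshot_dir="data/snapshots"):
    """Serialize nodes and edges to <snapshot_dir>/<query_id>_<timestamp>.json."""
    snapshot = {
        "query_id": query_id,
        "timestamp": timestamp,
        "nodes": {node_id: node.data for node_id, node in graph.nodes.items()},
        "edges": [
            {"source": e.source.node_id, "target": e.target.node_id, "data": e.data}
            for e in graph.edges
        ],
    }
    path = Path(snapshot_dir) / f"{query_id}_{timestamp}.json"
    save_persona_to_file(snapshot, path)  # reuse the JSON writer above
    return path
```

Since the timestamp becomes part of the file name, a compact form such as `20240101T000000` is safer than one containing colons, which are not valid in file names on some platforms.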

5.3 Pipeline Execution

The main pipeline orchestrates all components:

python
def main():
    # 1. Input ingestion
    input_query = ""

    # 2. Entity construction
    entities = {}

    # 3. Graph creation
    graph = None

    # 4. Persona traversal loop
    traversal_outputs = []

    # 5. Scoring and pruning
    scores = []

    # 6. Final Ollama inference
    final_response = ""
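
Steps 2 and 3 of the pipeline, entity construction and graph creation, are not covered by the earlier stubs. The sketch below shows a naive version of both: entities are the non-stopword tokens of the query, and the query-scoped graph links every pair of entities with a co-occurrence edge. Everything here (`extract_entities`, `build_query_graph`, the stopword list, the dict-based graph, and the `main` signature that takes the query as an argument) is a stand-in for illustration; a real system would more likely use a named-entity recognizer and retrieved documents.

```python
import re
from itertools import combinations

# Tiny illustrative stopword list; not exhaustive.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "is",
             "how", "do", "does", "what", "for", "with"}


def extract_entities(query):
    """Naive entity construction: keep non-stopword tokens in first-seen order."""
    entities = {}
    for token in re.findall(r"[A-Za-z][A-Za-z0-9\-]*", query):
        key = token.lower()
        if key not in STOPWORDS and key not in entities:
            entities[key] = {"surface_form": token}
    return entities


def build_query_graph(entities):
    """Query-scoped graph: one node per entity, co-occurrence edges between all pairs."""
    nodes = dict(entities)
    edges = [(a, b, {"relation": "co_occurs"}) for a, b in combinations(nodes, 2)]
    return {"nodes": nodes, "edges": edges}


def main(input_query):
    entities = extract_entities(input_query)  # 2. entity construction
    graph = build_query_graph(entities)       # 3. graph creation
    return entities, graph
```

Because the graph is rebuilt from the query each time, nothing from a previous query can leak into the next one, at the cost of repeating the extraction work on every call.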

Part 6: Design Philosophy and Future Directions

6.1 Key Design Decisions

  1. Query-Scoped Graphs: Graphs are built fresh for each query, ensuring relevance and preventing state pollution.

  2. Persona Evolution: Personas accumulate metadata over time, enabling performance-based adaptation.

  3. Threshold-Based Pruning: Fixed numeric thresholds make persona promotion and pruning deterministic and auditable.

  4. Local Inference: Ollama integration ensures privacy and reduces API dependencies.

  5. Modular Architecture: Clear separation of concerns enables independent development and testing.

6.2 Implementation Roadmap

Phase 1: Core Infrastructure

  • Complete basic graph operations
  • Implement persona loading/saving
  • Basic Ollama integration

Phase 2: Intelligence Layer

  • Develop relevance evaluation algorithms
  • Implement traversal heuristics
  • Add sophisticated scoring metrics

Phase 3: Learning and Adaptation

  • Performance-based persona evolution
  • Dynamic threshold adjustment
  • Graph optimization techniques

Phase 4: Production Readiness

  • Comprehensive error handling
  • Performance optimization
  • Monitoring and logging
  • API interfaces

6.3 Potential Extensions

  • Multi-Modal Personas: Support for different input/output modalities
  • Federated Learning: Distributed persona training across multiple systems
  • Hierarchical Graphs: Multi-level graph representations for complex domains
  • Real-Time Adaptation: Continuous learning during inference cycles

Conclusion

This dynamic persona MoE RAG system combines specialized agents, flexible knowledge representation, and adaptive learning. By scaffolding the architecture through systematic, incremental development, we've created a foundation that can evolve into a context-aware AI system.

The modular design ensures that each component can be developed, tested, and improved independently while maintaining clear interfaces for integration. The emphasis on performance tracking, threshold-based adaptation, and local inference provides a robust framework for building reliable, adaptive AI applications.

As we move forward, the challenge will be balancing the complexity of multiple interacting components with the need for reliable, interpretable behavior. The systematic approach demonstrated here provides a blueprint for tackling these challenges in complex AI system development.

