Daniel Kliewer

Building a Dynamic Persona-Based Mixture-of-Experts RAG System

by Daniel Kliewer8 min read
AIMachine LearningRAGMixture-of-ExpertsKnowledge GraphsOllamaPython
Building a Dynamic Persona-Based Mixture-of-Experts RAG System

Code for this guide can be found on my github here

Building a Dynamic Persona-Based Mixture-of-Experts RAG System

Introduction

Welcome to this comprehensive guide on building a dynamic, graph-based Mixture-of-Experts (MoE) Retrieval-Augmented Generation (RAG) system that leverages persona-driven AI agents. This project represents a cutting-edge approach to AI orchestration, combining multiple AI "personas" that dynamically traverse knowledge graphs to provide contextually rich, diverse responses.

In this post, we'll walk through the complete construction of this system, from initial project setup to the final scaffolded architecture. We'll explore each component, understand the design decisions, and learn how the pieces fit together to create an intelligent, adaptive AI system.

Part 1: Project Foundations and Architecture

1.1 The Vision: Dynamic Persona MoE RAG

At its core, this system implements a Mixture-of-Experts RAG where:

  • Personas are specialized AI agents with unique traits, expertise, and behavioral patterns
  • Dynamic Graphs represent knowledge in a flexible, query-scoped structure
  • Traversal Logic allows personas to navigate graphs based on their individual perspectives
  • Ollama Integration provides local LLM inference with synthesized persona context

The key innovation is the dynamic nature: graphs are built on-demand for each query, personas evolve through performance feedback, and the system adapts through pruning and promotion cycles.

1.2 Project Initialization

We begin by creating a robust Python project structure:

Bash
mkdir dynamic_persona_moe_rag
cd dynamic_persona_moe_rag
python3 -m venv venv

The .gitignore file follows Python best practices, excluding virtual environments, cache files, and build artifacts:

gitignore
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# Environments
.env
.venv
env/
venv/
ENV/

1.3 Core Architecture Overview

The system follows a modular architecture with clear separation of concerns:

src/
├── core/           # Main orchestration and interfaces
├── graph/          # Dynamic knowledge graph implementation
├── personas/       # Persona lifecycle and storage
├── agents/         # Specialized AI agents
├── evaluation/     # Scoring and metrics
└── storage/        # Persistence and snapshots

configs/            # YAML configuration files
scripts/            # Pipeline execution
data/               # Input/output data

Part 2: Configuration and Data Structures

2.1 Configuration System

The system uses YAML for configuration, providing human-readable, type-safe settings:

system.yaml - Global parameters:

YAML
# Global system parameters
max_iterations:  # Maximum number of iterations for the pipeline
batch_size:  # Batch size for processing
log_level:  # Logging level (DEBUG, INFO, etc.)
enable_caching:  # Whether to enable caching

thresholds.yaml - Pruning logic:

YAML
# Pruning and promotion thresholds
pruning_threshold:  # Threshold for pruning personas
promotion_threshold:  # Threshold for promoting personas
demotion_threshold:  # Threshold for demoting personas
activation_threshold:  # Threshold for activating personas

ollama.yaml - Model settings:

YAML
# Local model configuration
model_name:  # Name of the Ollama model to use
temperature:  # Temperature for generation
max_tokens:  # Maximum tokens to generate
api_endpoint:  # Ollama API endpoint (usually localhost)

2.2 Persona Schema Definition

Personas are defined by a strict JSON schema ensuring consistency:

JSON
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "type": "object",
  "properties": {
    "persona_id": {
      "type": "string",
      "description": "Unique identifier for the persona"
    },
    "traits": {
      "type": "object",
      "patternProperties": {
        "^.*$": {
          "type": "integer",
          "minimum": 1,
          "maximum": 9,
          "description": "Trait value between 1 and 9"
        }
      },
      "description": "Object containing trait names as keys and numeric values 1-9 as values"
    },
    "expertise": {
      "type": "array",
      "items": {
        "type": "string"
      },
      "description": "Array of strings representing areas of expertise"
    },
    "activation_cost": {
      "type": "number",
      "description": "Float representing the cost to activate this persona"
    },
    "historical_performance": {
      "type": "object",
      "description": "Object containing historical performance metrics"
    },
    "metadata": {
      "type": "object",
      "description": "Object containing additional metadata"
    }
  },
  "required": ["persona_id", "traits", "expertise", "activation_cost", "historical_performance", "metadata"]
}

Part 3: Core Components Deep Dive

3.1 Dynamic Knowledge Graph

The graph system is designed for query-scoped efficiency:

Graph Class:

Python
class DynamicKnowledgeGraph:
    """
    A dynamic graph that constructs nodes and edges on-demand for a single query.
    """

    def __init__(self):
        self.nodes = {}
        self.edges = []

    def add_node(self, node_id, node_data):
        """Lazily construct a node when needed."""
        pass

    def add_edge(self, source_id, target_id, edge_data):
        """Create an edge on-demand between nodes."""
        pass

Node and Edge Classes:

Python
class Node:
    """Represents a node in the dynamic knowledge graph."""
    def __init__(self, node_id, data=None):
        self.node_id = node_id
        self.data = data or {}
        self.edges = []

class Edge:
    """Represents an edge in the dynamic knowledge graph."""
    def __init__(self, source_node, target_node, data=None):
        self.source = source_node
        self.target = target_node
        self.data = data or {}

3.2 Persona Traversal Interface

The traversal system uses abstract interfaces for flexibility:

Python
from abc import ABC, abstractmethod

class PersonaTraversalInterface(ABC):
    """
    Abstract base class defining the interface for persona traversal.
    """

    @abstractmethod
    def evaluate_node_relevance(self, persona, node):
        """
        Evaluate how relevant a graph node is to a given persona.
        Returns: float (relevance score between 0 and 1)
        """
        pass

    @abstractmethod
    def decide_traversal(self, current_node, available_nodes, persona):
        """
        Decide which nodes to traverse to next based on persona evaluation.
        Returns: list (nodes to traverse to next)
        """
        pass

3.3 Mixture-of-Experts Orchestrator

The orchestrator manages the entire MoE cycle:

Python
class MoeOrchestrator:
    """
    Orchestrates the mixture-of-experts RAG system.
    """

    def expansion_phase(self):
        """Expansion phase: Generate diverse outputs from active personas."""
        pass

    def evaluation_phase(self):
        """Evaluation phase: Score and rank the generated outputs."""
        pass

    def pruning_phase(self):
        """Pruning phase: Remove underperforming personas and promote high performers."""
        pass

Part 4: Evaluation and Adaptation

4.1 Scoring Framework

Multiple scoring criteria ensure comprehensive evaluation:

Python
def score_relevance(output, query):
    """Score the relevance of an output to the input query."""
    return 0.0

def score_consistency(output, reference_outputs):
    """Score the consistency of an output with reference outputs."""
    return 0.0

def score_novelty(output, existing_outputs):
    """Score the novelty of an output compared to existing outputs."""
    return 0.0

def score_entity_grounding(output, entities):
    """Score how well the output is grounded in the provided entities."""
    return 0.0

4.2 Metrics and Aggregation

Python
def calculate_average_score(scores):
    """Calculate the average of a list of scores."""
    return 0.0

def calculate_weighted_score(scores, weights):
    """Calculate a weighted average of scores."""
    return 0.0

def aggregate_persona_performance(persona_scores):
    """Aggregate performance metrics for a persona across multiple evaluations."""
    return {}

4.3 Persona Lifecycle Management

Personas evolve through performance-based transitions:

  • Active: Currently participating in inference
  • Stable: Proven performers, quick to activate
  • Experimental: Newly created or modified, being tested
  • Pruned: Underperforming, archived in tiered folders
Python
def evaluate_pruning_thresholds(persona_performance, thresholds):
    """
    Threshold-based demotion: Personas below certain performance metrics
    are demoted from active to stable, stable to experimental, experimental to pruned.
    """
    return 'keep'

def move_persona_to_folder(persona_id, current_folder, target_folder):
    """
    Folder-based archival: Move personas to appropriate archival folders
    instead of deleting them.
    """
    pass

Part 5: Integration and Execution

5.1 Ollama Integration

Local LLM inference with persona context:

Python
def synthesize_persona_context(persona_outputs, graph_context):
    """Synthesize context from multiple persona outputs and graph traversal."""
    return ""

def send_prompt_to_ollama(synthesized_context, query, ollama_client):
    """Send the final prompt to the local Ollama model."""
    return ""

5.2 Storage and Persistence

Robust persistence for personas and graphs:

Python
def load_persona_from_file(filepath):
    """Load a persona JSON file from disk."""
    return {}

def save_persona_to_file(persona_data, filepath):
    """Save persona data to a JSON file."""
    pass

def save_graph_snapshot(graph, query_id, timestamp):
    """Save a snapshot of the current graph state."""
    pass

5.3 Pipeline Execution

The main pipeline orchestrates all components:

Python
def main():
    # 1. Input ingestion
    input_query = ""

    # 2. Entity construction
    entities = {}

    # 3. Graph creation
    graph = None

    # 4. Persona traversal loop
    traversal_outputs = []

    # 5. Scoring and pruning
    scores = []

    # 6. Final Ollama inference
    final_response = ""

Part 6: Design Philosophy and Future Directions

6.1 Key Design Decisions

  1. Query-Scoped Graphs: Graphs are built fresh for each query, ensuring relevance and preventing state pollution.

  2. Persona Evolution: Personas accumulate metadata over time, enabling performance-based adaptation.

  3. Threshold-Based Pruning: Mathematical thresholds provide deterministic, auditable persona management.

  4. Local Inference: Ollama integration ensures privacy and reduces API dependencies.

  5. Modular Architecture: Clear separation of concerns enables independent development and testing.

6.2 Implementation Roadmap

Phase 1: Core Infrastructure

  • Complete basic graph operations
  • Implement persona loading/saving
  • Basic Ollama integration

Phase 2: Intelligence Layer

  • Develop relevance evaluation algorithms
  • Implement traversal heuristics
  • Add sophisticated scoring metrics

Phase 3: Learning and Adaptation

  • Performance-based persona evolution
  • Dynamic threshold adjustment
  • Graph optimization techniques

Phase 4: Production Readiness

  • Comprehensive error handling
  • Performance optimization
  • Monitoring and logging
  • API interfaces

6.3 Potential Extensions

  • Multi-Modal Personas: Support for different input/output modalities
  • Federated Learning: Distributed persona training across multiple systems
  • Hierarchical Graphs: Multi-level graph representations for complex domains
  • Real-Time Adaptation: Continuous learning during inference cycles

Conclusion

This dynamic persona MoE RAG system represents a sophisticated approach to AI orchestration, combining the strengths of specialized agents, flexible knowledge representation, and adaptive learning. By scaffolding the architecture through systematic, incremental development, we've created a foundation that can evolve into a powerful, context-aware AI system.

The modular design ensures that each component can be developed, tested, and improved independently while maintaining clear interfaces for integration. The emphasis on performance tracking, threshold-based adaptation, and local inference provides a robust framework for building reliable, adaptive AI applications.

As we move forward, the challenge will be balancing the complexity of multiple interacting components with the need for reliable, interpretable behavior. The systematic approach demonstrated here provides a blueprint for tackling these challenges in complex AI system development.

Code for this guide can be found on my github here

Related Articles

Loading related articles...