
OpenAI Agents SDK & Ollama Integration: Complete Architecture Guide

This guide demonstrates how to integrate the official OpenAI Agents SDK with Ollama to build AI agents that can run on local infrastructure, with optional routing to OpenAI's cloud models. By the end, you'll understand both the theoretical foundations and the practical implementation of locally hosted AI agents.


Daniel Kliewer

Author, Sovereign AI


From the Book

This is from Sovereign AI: Building Local-First Intelligent Systems.


Architectural Synthesis: Integrating OpenAI's Agents SDK with Ollama

A Convergence of Contemporary AI Paradigms

In the evolving landscape of artificial intelligence systems, integrating OpenAI's Agents SDK with Ollama is one approach to building hybrid systems that combine cloud-hosted and locally hosted models. This synthesis lets each request be served by either cloud-based intelligence or local computational resources, creating what this guide calls a Modern Computational Paradigm (MCP) system. The abbreviation is this guide's own and should not be confused with the Model Context Protocol.

Theoretical Framework and Architectural Considerations

The foundational architecture of this integration leverages the strengths of both paradigms: OpenAI's Agents SDK provides a structured framework for creating autonomous agents capable of orchestrating complex, multi-step reasoning processes, while Ollama runs large language models locally, which avoids network round trips and keeps data on the local machine.

At its core, this architecture addresses the tension between computational capability and data sovereignty. The implementation decides per request whether processing happens locally or remotely, based on contextual parameters including:

  • Computational complexity thresholds
  • Privacy requirements of specific data domains
  • Latency tolerance for particular interaction modalities
  • Economic considerations regarding API utilization

Functional Capabilities and Implementation Vectors

This architectural synthesis manifests several advanced capabilities:

  1. Cognitive Load Distribution: The system intelligently routes cognitive tasks between local and remote execution environments based on complexity, resource requirements, and privacy constraints.

  2. Tool Integration Framework: Both OpenAI's agents and Ollama instances can leverage a unified tool ecosystem, allowing for consistent interaction patterns with external systems.

  3. Conversational State Management: A sophisticated state management system maintains coherent interaction context across the distributed computational environment.

  4. Fallback Mechanisms: The architecture implements graceful degradation pathways, ensuring functionality persistence when either component faces constraints.
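The fallback capability in item 4 can be illustrated with a minimal sketch. It assumes each provider is an async callable that takes a prompt and returns text; the `Provider` alias and the `flaky_cloud` and `local_model` functions are hypothetical stand-ins, not part of the repository.

```python
import asyncio
from typing import Awaitable, Callable, Sequence

# Hypothetical provider signature: takes a prompt, returns the completion text.
Provider = Callable[[str], Awaitable[str]]


async def complete_with_fallback(
    prompt: str, providers: Sequence[tuple[str, Provider]]
) -> tuple[str, str]:
    """Try each (name, provider) pair in order; return (name, text) from the first success."""
    errors: list[str] = []
    for name, provider in providers:
        try:
            return name, await provider(prompt)
        except Exception as exc:  # broad on purpose: any provider failure triggers fallback
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))


async def flaky_cloud(prompt: str) -> str:
    """Stand-in for a cloud provider that is currently unavailable."""
    raise ConnectionError("simulated outage")


async def local_model(prompt: str) -> str:
    """Stand-in for a local model that always answers."""
    return f"local answer to: {prompt}"
```

Calling `asyncio.run(complete_with_fallback("hi", [("openai", flaky_cloud), ("ollama", local_model)]))` falls through to the local stand-in once the cloud stand-in raises.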

Implementation Methodology

The GitHub repository (kliewerdaniel/OpenAIAgentsSDKOllama01) provides the foundational code structure for this integration. The implementation follows a modular approach that encapsulates:

  • Abstraction layers for model interactions
  • Contextual routing logic
  • Unified response formatting
  • Configurable threshold parameters for decision boundaries

Theoretical Implications and Future Directions

This architectural approach treats cloud and edge AI capabilities as interchangeable back ends behind a single interface. In doing so, it establishes a framework for future systems that may further blur the boundaries between computational environments.

The integration opens avenues for research in several domains:

  • Optimal decision boundaries for computational routing
  • Privacy-preserving techniques for sensitive information processing
  • Economic models for hybrid AI systems
  • Cognitive load balancing algorithms

Conclusion

The integration of OpenAI's Agents SDK with Ollama represents not merely a technical implementation but a philosophical statement about the future of AI architectures. It suggests a path toward systems that transcend binary distinctions between local and remote, private and shared, efficient and powerful, and that instead adapt to the specific needs of each interaction context.

This approach invites further exploration and refinement, as the field continues to evolve toward increasingly sophisticated hybrid AI architectures that balance capability, privacy, efficiency, and cost.

Technical Infrastructure: Establishing the Development Environment for OpenAI-Ollama Integration

Foundational Dependencies and Technological Requisites

The implementation of a sophisticated hybrid AI architecture integrating OpenAI's Agents SDK with Ollama necessitates a carefully curated technological stack. This infrastructure must accommodate both cloud-based intelligence and local inference capabilities within a coherent framework.

Core Dependencies

Python Environment

Python 3.10+ (3.11 recommended for optimal performance characteristics)

Essential Python Packages

text
openai>=1.12.0         # OpenAI Python client (the Agents SDK itself is published separately as openai-agents)
ollama>=0.1.6          # Python client for Ollama interaction
fastapi>=0.109.0       # API framework for service endpoints
uvicorn>=0.27.0        # ASGI server implementation
pydantic>=2.5.0        # Data validation and settings management
python-dotenv>=1.0.0   # Environment variable management
requests>=2.31.0       # HTTP requests for external service interaction
websockets>=12.0       # WebSocket support for real-time communication
tenacity>=8.2.3        # Retry logic for resilient API interactions

External Services

text
OpenAI API access (API key required)
Ollama (local installation)

Environment Configuration

Installation Procedure

  1. Python Environment Initialization

    bash
    # Create isolated environment
    python -m venv venv

    # Activate environment
    # On Unix/macOS:
    source venv/bin/activate
    # On Windows:
    venv\Scripts\activate
  2. Dependency Installation

    bash
    pip install openai ollama fastapi uvicorn pydantic python-dotenv requests websockets tenacity
  3. Ollama Installation

    bash
    # macOS (using Homebrew)
    brew install ollama

    # Linux (using curl)
    curl -fsSL https://ollama.com/install.sh | sh

    # Windows
    # Download from https://ollama.com/download/windows
  4. Model Initialization for Ollama

    bash
    # Pull high-performance local model (e.g., Llama2)
    ollama pull llama2

    # Optional: Pull additional specialized models
    ollama pull mistral
    ollama pull codellama

Environment Variables

Create a .env file in the project root with the following parameters:

text
# OpenAI Configuration
OPENAI_API_KEY=sk-...
OPENAI_ORG_ID=org-... # Optional

# Model Configuration
OPENAI_MODEL=gpt-4o
OLLAMA_MODEL=llama2
OLLAMA_HOST=http://localhost:11434

# System Behavior
TEMPERATURE=0.7
MAX_TOKENS=4096
REQUEST_TIMEOUT=120

# Routing Configuration
COMPLEXITY_THRESHOLD=0.65
PRIVACY_SENSITIVE_TOKENS=["password", "secret", "token", "key", "credential"]

# Logging Configuration
LOG_LEVEL=INFO
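One way to turn these parameters into typed settings is sketched below. It assumes the values are already present in the process environment (for example after calling `load_dotenv()`), and the `Settings` dataclass and `load_settings` helper are illustrative names rather than code from the repository. The defaults mirror the values listed above.

```python
import json
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    """Typed view of the routing-related environment parameters."""
    openai_model: str
    ollama_model: str
    ollama_host: str
    temperature: float
    max_tokens: int
    complexity_threshold: float
    privacy_sensitive_tokens: tuple[str, ...]


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from a mapping of environment variables."""
    return Settings(
        openai_model=env.get("OPENAI_MODEL", "gpt-4o"),
        ollama_model=env.get("OLLAMA_MODEL", "llama2"),
        ollama_host=env.get("OLLAMA_HOST", "http://localhost:11434"),
        temperature=float(env.get("TEMPERATURE", "0.7")),
        max_tokens=int(env.get("MAX_TOKENS", "4096")),
        complexity_threshold=float(env.get("COMPLEXITY_THRESHOLD", "0.65")),
        # The token list is stored as a JSON array, so it is decoded with json.loads
        privacy_sensitive_tokens=tuple(
            json.loads(env.get("PRIVACY_SENSITIVE_TOKENS", "[]"))
        ),
    )
```

Passing a plain dictionary instead of `os.environ` makes the loader easy to exercise in tests without touching the real environment.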

Development Environment Setup

Repository Initialization

bash
git clone https://github.com/kliewerdaniel/OpenAIAgentsSDKOllama01.git
cd OpenAIAgentsSDKOllama01

Project Structure Implementation

bash
mkdir -p app/core app/models app/routers app/services app/utils tests
touch app/__init__.py app/core/__init__.py app/models/__init__.py app/routers/__init__.py app/services/__init__.py app/utils/__init__.py

Local Development Server

bash
# Start Ollama service
ollama serve

# In a separate terminal, start the application
uvicorn app.main:app --reload

Containerization (Optional)

For reproducible environments and deployment consistency:

dockerfile
# Dockerfile
FROM python:3.11-slim

WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]

With Docker Compose integration for Ollama:

yaml
# docker-compose.yml
version: '3.8'

services:
  app:
    build: .
    ports:
      - "8000:8000"
    environment:
      - OLLAMA_HOST=http://ollama:11434
    depends_on:
      - ollama
    volumes:
      - .:/app

  ollama:
    image: ollama/ollama:latest
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama

volumes:
  ollama_data:

Verification of Installation

To validate the environment configuration:

bash
python -c "from importlib.metadata import version; print('OpenAI SDK Version:', version('openai')); print('Ollama Client Version:', version('ollama'))"

To test Ollama connectivity:

bash
python -c "import ollama; print(ollama.list())"
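The same check can be made without the Python client by querying Ollama's HTTP API directly. The sketch below uses only the standard library and assumes the server listens on the default host from the .env file; `list_ollama_models` is an illustrative helper name.

```python
import json
import urllib.error
import urllib.request
from typing import Optional


def list_ollama_models(
    host: str = "http://localhost:11434", timeout: float = 3.0
) -> Optional[list[str]]:
    """Return the names of locally available models, or None if the server is unreachable."""
    try:
        # GET /api/tags lists the models that have been pulled locally
        with urllib.request.urlopen(f"{host}/api/tags", timeout=timeout) as resp:
            payload = json.load(resp)
    except (urllib.error.URLError, OSError, ValueError):
        return None
    return [model.get("name", "") for model in payload.get("models", [])]


if __name__ == "__main__":
    models = list_ollama_models()
    print("Ollama unreachable" if models is None else f"Models: {models}")
```

A `None` result usually means `ollama serve` is not running or `OLLAMA_HOST` points somewhere else.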

To test OpenAI API connectivity:

bash
python -c "import openai; from dotenv import load_dotenv; load_dotenv(); client = openai.OpenAI(); print(client.models.list())"

This comprehensive environment setup establishes the foundation for a sophisticated hybrid AI system that leverages both cloud-based intelligence and local inference capabilities. The configuration allows for flexible routing of requests based on privacy considerations, computational complexity, and performance requirements.

Integration Architecture: OpenAI Responses API within the MCP Framework

Theoretical Framework for API Integration

The integration of OpenAI's Responses API within our Modern Computational Paradigm (MCP) framework represents a sophisticated exercise in distributed intelligence architecture. This document delineates the structural components, interface definitions, and operational parameters for establishing a cohesive integration that leverages both cloud-based and local inference capabilities.

API Architectural Design

Core Endpoints Structure

The system exposes a carefully designed set of endpoints that abstract the underlying complexity of model routing and response generation:

text
/api/v1
├── /chat
│   ├── POST /completions       # Primary conversational interface
│   ├── POST /streaming         # Event-stream response generation
│   └── POST /hybrid            # Intelligent routing between OpenAI and Ollama
├── /tools
│   ├── POST /execute           # Tool execution framework
│   └── GET /available          # Tool discovery mechanism
├── /agents
│   ├── POST /run               # Agent execution with Agents SDK
│   ├── GET /status/{run_id}    # Asynchronous execution status
│   └── POST /cancel/{run_id}   # Execution termination
└── /system
    ├── GET /health             # Service health verification
    ├── GET /models             # Available model enumeration
    └── POST /config            # Runtime configuration adjustment

Request/Response Schemata

Primary Chat Interface

json
// POST /api/v1/chat/completions
// Request
{
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain quantum computing."}
  ],
  "model": "auto", // "auto", "openai:<model_id>", or "ollama:<model_id>"
  "temperature": 0.7,
  "max_tokens": 1024,
  "stream": false,
  "routing_preferences": {
    "force_provider": null, // null, "openai", "ollama"
    "privacy_level": "standard", // "standard", "high", "max"
    "latency_preference": "balanced" // "speed", "balanced", "quality"
  },
  "tools": [...] // Optional tool definitions
}

// Response
{
  "id": "resp_abc123",
  "object": "chat.completion",
  "created": 1677858242,
  "provider": "openai", // The actual provider used
  "model": "gpt-4o",
  "usage": {
    "prompt_tokens": 56,
    "completion_tokens": 325,
    "total_tokens": 381
  },
  "message": {
    "role": "assistant",
    "content": "Quantum computing is...",
    "tool_calls": [] // Optional tool calls if requested
  },
  "routing_metrics": {
    "complexity_score": 0.78,
    "privacy_impact": "low",
    "decision_factors": ["complexity", "tool_requirements"]
  }
}
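The `model` field in the request accepts `"auto"`, `"openai:<model_id>"`, or `"ollama:<model_id>"`. A minimal parser for that convention might look like the following; `ModelSpec` and `parse_model_field` are illustrative names, and splitting on the first colon only is an assumption made so that Ollama tags such as `llama2:13b` survive intact.

```python
from typing import NamedTuple, Optional

KNOWN_PROVIDERS = ("openai", "ollama")


class ModelSpec(NamedTuple):
    provider: Optional[str]  # None means "let the router decide"
    model_id: Optional[str]


def parse_model_field(value: str) -> ModelSpec:
    """Parse "auto", "openai:<model_id>", or "ollama:<model_id>"."""
    if value == "auto":
        return ModelSpec(None, None)
    # Split on the first colon only, so the model ID may itself contain colons
    provider, sep, model_id = value.partition(":")
    if not sep or provider not in KNOWN_PROVIDERS or not model_id:
        raise ValueError(f"unsupported model field: {value!r}")
    return ModelSpec(provider, model_id)
```

With this convention, `parse_model_field("ollama:llama2:13b")` yields the provider `ollama` and the model ID `llama2:13b`.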

Agent Execution Interface

json
// POST /api/v1/agents/run
// Request
{
  "agent_config": {
    "instructions": "You are a research assistant. Help the user find information about recent AI developments.",
    "model": "gpt-4o",
    "tools": [
      // Tool definitions following OpenAI's format
    ]
  },
  "messages": [
    {"role": "user", "content": "Find recent papers on transformer efficiency."}
  ],
  "metadata": {
    "session_id": "user_session_abc123",
    "locale": "en-US"
  }
}

// Response
{
  "run_id": "run_def456",
  "status": "in_progress",
  "created_at": 1677858242,
  "estimated_completion_time": 1677858260,
  "polling_url": "/api/v1/agents/status/run_def456"
}

Authentication & Security Framework

Authentication Mechanisms

The system implements a layered authentication approach:

  1. API Key Authentication

    Authorization: Bearer {api_key}
    
  2. OpenAI Credential Management

    • Server-side credential storage with encryption at rest
    • Optional client-provided credentials per request
    json
    // Optional credential override
    {
      "auth_override": {
        "openai_api_key": "sk_...",
        "openai_org_id": "org-..."
      }
    }
  3. Session-Based Authentication (Web Interface)

    • JWT-based authentication with refresh token rotation
    • PKCE flow for authorization code exchanges

Security Considerations

  • TLS 1.3 required for all communications
  • Request signing for high-security deployments
  • Content-Security-Policy headers to prevent XSS
  • Rate limiting by user/IP with exponential backoff

Error Handling Architecture

The system implements a comprehensive error handling framework:

json
// Error Response Structure
{
  "error": {
    "code": "provider_error",
    "message": "OpenAI API returned an error",
    "details": {
      "provider": "openai",
      "status_code": 429,
      "original_message": "Rate limit exceeded",
      "request_id": "req_ghi789"
    },
    "remediation": {
      "retry_after": 30,
      "alternatives": ["switch_provider", "reduce_complexity"],
      "fallback_available": true
    }
  }
}

Error Categories

  1. Provider Errors (provider_error)

    • OpenAI API failures
    • Ollama execution failures
    • Network connectivity issues
  2. Input Validation Errors (validation_error)

    • Schema validation failures
    • Content policy violations
    • Size limit exceedances
  3. System Errors (system_error)

    • Resource exhaustion
    • Internal component failures
    • Dependency service outages
  4. Authentication Errors (auth_error)

    • Invalid credentials
    • Expired tokens
    • Insufficient permissions
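A small helper can keep every error payload in the shape shown in the error response structure. The sketch below covers the four category codes listed above; `build_error` is an illustrative name, and it fills only the `retry_after` part of the remediation block.

```python
from typing import Any, Optional

# The four category codes listed above
ERROR_CODES = frozenset(
    {"provider_error", "validation_error", "system_error", "auth_error"}
)


def build_error(
    code: str,
    message: str,
    details: Optional[dict[str, Any]] = None,
    retry_after: Optional[int] = None,
) -> dict[str, Any]:
    """Return an error payload in the documented {"error": {...}} shape."""
    if code not in ERROR_CODES:
        raise ValueError(f"unknown error code: {code}")
    error: dict[str, Any] = {
        "code": code,
        "message": message,
        "details": details or {},
    }
    # Remediation hints are optional and only added when a retry delay is known
    if retry_after is not None:
        error["remediation"] = {"retry_after": retry_after}
    return {"error": error}
```

Rejecting unknown codes at construction time keeps the set of categories closed, so clients can branch on `code` reliably.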

Rate Limiting Architecture

The system implements a sophisticated rate limiting structure:

Tiered Rate Limiting

text
Standard tier:
  - 10 requests/minute
  - 100 requests/hour
  - 1000 requests/day

Premium tier:
  - 60 requests/minute
  - 1000 requests/hour
  - 10000 requests/day

Dynamic Rate Adjustment

  • Token bucket algorithm with dynamic refill rates
  • Separate buckets for different endpoint categories
  • Priority-based token distribution
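The token bucket idea can be sketched in a few lines. This in-memory version is an illustration rather than the repository's implementation: the `TokenBucket` class and its parameter names are assumptions, and the clock is injectable so the behaviour can be checked deterministically.

```python
import time
from typing import Optional


class TokenBucket:
    """In-memory token bucket; refill_rate is tokens added per second."""

    def __init__(self, capacity: float, refill_rate: float, now: Optional[float] = None):
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.tokens = capacity
        self.updated_at = time.monotonic() if now is None else now

    def _refill(self, now: float) -> None:
        # Add tokens for the elapsed time, never exceeding capacity
        elapsed = max(0.0, now - self.updated_at)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.updated_at = now

    def set_refill_rate(self, refill_rate: float, now: Optional[float] = None) -> None:
        """Change the refill rate at runtime (the "dynamic" part)."""
        self._refill(time.monotonic() if now is None else now)
        self.refill_rate = refill_rate

    def allow(self, cost: float = 1.0, now: Optional[float] = None) -> bool:
        """Consume `cost` tokens if available; return whether the request may proceed."""
        self._refill(time.monotonic() if now is None else now)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

Separate buckets per endpoint category can then be kept in a dictionary keyed by user and category.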

Rate Limit Response

json
{
  "error": {
    "code": "rate_limit_exceeded",
    "message": "You have exceeded the rate limit",
    "details": {
      "rate_limit": {
        "tier": "standard",
        "limit": "10 per minute",
        "remaining": 0,
        "reset_at": "2023-03-01T12:35:00Z",
        "retry_after": 25
      },
      "usage_statistics": {
        "current_minute": 11,
        "current_hour": 43,
        "current_day": 178
      }
    },
    "remediation": {
      "upgrade_url": "/account/upgrade",
      "alternatives": ["reduce_frequency", "batch_requests"]
    }
  }
}

Implementation Strategy

Provider Abstraction Layer

python
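# Sketch of the Provider Abstraction Layer
from abc import ABC, abstractmethod


class ModelProvider(ABC):
    """Common interface implemented by every model back end."""

    @abstractmethod
    async def generate_completion(self, messages, params):
        """Return a complete response for the given messages."""

    @abstractmethod
    async def stream_completion(self, messages, params):
        """Yield the response incrementally."""

    @classmethod
    def get_provider(cls, provider_name, model_id):
        # OpenAIProvider, OllamaProvider, and AutoRoutingProvider are concrete
        # subclasses assumed to be defined elsewhere in the project.
        if provider_name == "openai":
            return OpenAIProvider(model_id)
        elif provider_name == "ollama":
            return OllamaProvider(model_id)
        else:
            return AutoRoutingProvider()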
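To show how a concrete back end plugs into this interface, the sketch below restates the abstract class and adds a toy `EchoProvider`. The echo provider is purely hypothetical and talks to no model; it only demonstrates the two methods a real OpenAI or Ollama provider would need to supply.

```python
import asyncio
from abc import ABC, abstractmethod
from typing import Any, AsyncIterator


class ModelProvider(ABC):
    """Restated interface: one full-response method, one streaming method."""

    @abstractmethod
    async def generate_completion(self, messages: list[dict], params: dict) -> dict[str, Any]: ...

    @abstractmethod
    def stream_completion(self, messages: list[dict], params: dict) -> AsyncIterator[str]: ...


class EchoProvider(ModelProvider):
    """Toy provider that echoes the last user message instead of calling a model."""

    def __init__(self, model_id: str):
        self.model_id = model_id

    async def generate_completion(self, messages, params):
        last_user = next(
            (m["content"] for m in reversed(messages) if m["role"] == "user"), ""
        )
        return {
            "provider": "echo",
            "model": self.model_id,
            "message": {"role": "assistant", "content": f"echo: {last_user}"},
        }

    async def stream_completion(self, messages, params):
        # Stream the same answer word by word
        result = await self.generate_completion(messages, params)
        for word in result["message"]["content"].split():
            yield word


async def demo() -> tuple[str, list[str]]:
    provider = EchoProvider("echo-1")
    messages = [{"role": "user", "content": "hello there"}]
    full = await provider.generate_completion(messages, {})
    chunks = [chunk async for chunk in provider.stream_completion(messages, {})]
    return full["message"]["content"], chunks
```

Because callers depend only on `ModelProvider`, swapping the echo provider for a real one requires no changes to the calling code.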

Intelligent Routing Decision Engine

python
# Sketch of the routing logic
class RoutingEngine:
    def __init__(self, config):
        self.config = config

    async def determine_route(self, request):
        # Analyze request complexity
        complexity = self._analyze_complexity(request.messages)

        # Check for privacy constraints
        privacy_impact = self._assess_privacy_impact(request.messages)

        # Consider tool requirements. available_providers is assumed to be a
        # config attribute; the result is computed here but not yet used in
        # the decision below.
        tools_compatible = self._check_tool_compatibility(
            request.tools, self.config.available_providers)

        # An explicit provider choice overrides every heuristic
        if request.routing_preferences.force_provider:
            return request.routing_preferences.force_provider

        # Privacy-sensitive content stays on the local model
        if privacy_impact == "high" and self.config.privacy_first:
            return "ollama"

        # Complex requests go to the cloud model
        if complexity > self.config.complexity_threshold:
            return "openai"

        # Default routing logic
        return "ollama" if self.config.prefer_local else "openai"
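The routing logic relies on a complexity score and a privacy assessment whose implementations are not shown here. The following sketch offers deliberately crude stand-ins: a length-based complexity proxy and a keyword check against the `PRIVACY_SENSITIVE_TOKENS` list from the .env file. Both heuristics, and the 500-word scale, are assumptions for illustration only.

```python
import re
from typing import Iterable, Mapping, Sequence

# Default list taken from PRIVACY_SENSITIVE_TOKENS in the .env file
SENSITIVE_TOKENS = ("password", "secret", "token", "key", "credential")


def assess_privacy_impact(
    messages: Sequence[Mapping[str, str]],
    sensitive_tokens: Iterable[str] = SENSITIVE_TOKENS,
) -> str:
    """Return "high" if any message contains a sensitive word, otherwise "low"."""
    text = " ".join(m["content"].lower() for m in messages)
    # Match whole words so that "keyboard" does not trigger on "key"
    words = set(re.findall(r"[a-z_]+", text))
    return "high" if any(tok in words for tok in sensitive_tokens) else "low"


def analyze_complexity(
    messages: Sequence[Mapping[str, str]], scale: int = 500
) -> float:
    """Crude proxy: total word count divided by `scale`, capped at 1.0."""
    total_words = sum(len(m["content"].split()) for m in messages)
    return min(1.0, total_words / scale)
```

Whole-word matching misses inflected forms such as "passwords", so a production system would likely need stemming or a dedicated classifier.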

Authentication Implementation

python
# Middleware for API key authentication
from fastapi.responses import JSONResponse


async def api_key_middleware(request, call_next):
    api_key = request.headers.get("Authorization")

    if not api_key or not api_key.startswith("Bearer "):
        return JSONResponse(
            status_code=401,
            content={"error": {
                "code": "auth_error",
                "message": "Missing or invalid API key"
            }}
        )

    # Strip only the leading scheme prefix, then validate the token.
    # validate_api_key is assumed to be defined elsewhere and to return
    # a user object, or None for an unknown key.
    token = api_key[len("Bearer "):]
    user = await validate_api_key(token)

    if not user:
        return JSONResponse(
            status_code=401,
            content={"error": {
                "code": "auth_error",
                "message": "Invalid API key"
            }}
        )

    # Attach user to request state
    request.state.user = user
    return await call_next(request)
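The middleware leaves the key lookup itself unspecified. One possible approach, sketched below, stores only SHA-256 digests of issued keys and compares them in constant time with `hmac.compare_digest`. The `API_KEY_DIGESTS` table, the demo key, and both helper names are hypothetical; a real deployment would back the lookup with a database.

```python
import hashlib
import hmac
from typing import Optional


def _digest(key: str) -> str:
    """Hash a key so that plaintext keys never need to be stored."""
    return hashlib.sha256(key.encode("utf-8")).hexdigest()


# Hypothetical store: digests of issued keys mapped to user IDs
API_KEY_DIGESTS = {_digest("demo-key-123"): "user_1"}


def extract_bearer_token(header: Optional[str]) -> Optional[str]:
    """Return the token from an "Authorization: Bearer <token>" header value."""
    if not header:
        return None
    scheme, _, token = header.partition(" ")
    token = token.strip()
    if scheme.lower() != "bearer" or not token:
        return None
    return token


def lookup_user_for_key(token: str) -> Optional[str]:
    """Return the user ID for a valid key, or None if the key is unknown."""
    candidate = _digest(token)
    for stored, user_id in API_KEY_DIGESTS.items():
        # compare_digest avoids leaking match position through timing
        if hmac.compare_digest(candidate, stored):
            return user_id
    return None
```

Keeping header parsing separate from key lookup lets each piece be tested without a running web server.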

Rate Limiting Implementation

python
# Rate limiter implementation (expects an asyncio Redis client)

# Per-minute and per-hour limits from the tiers listed above
TIER_LIMITS = {
    "standard": {"per_minute": 10, "per_hour": 100},
    "premium": {"per_minute": 60, "per_hour": 1000},
}


class RateLimiter:
    def __init__(self, redis_client):
        self.redis = redis_client

    async def check_rate_limit(self, user_id, endpoint_category):
        # Generate Redis keys for different time windows
        minute_key = f"rate:user:{user_id}:{endpoint_category}:minute"
        hour_key = f"rate:user:{user_id}:{endpoint_category}:hour"

        # Get user tier and corresponding limits
        # (_get_user_tier is assumed to be implemented elsewhere)
        user_tier = await self._get_user_tier(user_id)
        tier_limits = TIER_LIMITS[user_tier]

        # Increment both counters in a single round trip
        pipe = self.redis.pipeline()
        pipe.incr(minute_key)
        pipe.incr(hour_key)
        minute_count, hour_count = await pipe.execute()

        # Start each window's TTL only on its first request, so that later
        # requests do not keep pushing the expiry forward
        if minute_count == 1:
            await self.redis.expire(minute_key, 60)
        if hour_count == 1:
            await self.redis.expire(hour_key, 3600)

        # Check if limits are exceeded
        if minute_count > tier_limits["per_minute"]:
            return {
                "allowed": False,
                "window": "minute",
                "limit": tier_limits["per_minute"],
                "current": minute_count,
                # The helper is a coroutine and must be awaited
                "retry_after": await self._calculate_retry_after(minute_key)
            }

        if hour_count > tier_limits["per_hour"]:
            return {
                "allowed": False,
                "window": "hour",
                "limit": tier_limits["per_hour"],
                "current": hour_count,
                "retry_after": await self._calculate_retry_after(hour_key)
            }

        return {"allowed": True}

    async def _calculate_retry_after(self, key):
        # TTL is negative when the key is missing or has no expiry
        ttl = await self.redis.ttl(key)
        return max(1, ttl)

Operational Considerations

  1. Monitoring and Observability

    • Structured logging with correlation IDs
    • Prometheus metrics for request routing decisions
    • Tracing with OpenTelemetry
  2. Fallback Mechanisms

    • Circuit breaker pattern for provider failures
    • Graceful degradation to simpler models
    • Response caching for common queries
  3. Deployment Strategy

    • Containerized deployment with Kubernetes
    • Blue/green deployment for zero-downtime updates
    • Regional deployment for latency optimization
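The circuit breaker pattern mentioned under fallback mechanisms can be sketched as a small state holder. The thresholds, the `CircuitBreaker` class, and the injectable clock are all assumptions for illustration, not the repository's implementation.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Opens after repeated failures, then allows a trial request after a cooldown."""

    def __init__(
        self,
        failure_threshold: int = 3,
        reset_timeout: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def allow_request(self) -> bool:
        """Closed circuit: always allow. Open circuit: allow only after the cooldown."""
        if self.opened_at is None:
            return True
        return self.clock() - self.opened_at >= self.reset_timeout

    def record_success(self) -> None:
        """A successful call closes the circuit and clears the failure count."""
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        """Count a failure; open (or re-open) the circuit once the threshold is reached."""
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()
```

A router would keep one breaker per provider and skip any provider whose breaker currently refuses requests.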

Conclusion

This integration architecture establishes a robust framework for leveraging both OpenAI's cloud capabilities and Ollama's local inference within a unified system. The design emphasizes flexibility, security, and resilience while providing sophisticated routing logic to optimize for different operational parameters including cost, privacy, and performance.

The implementation allows for progressive enhancement as requirements evolve, with clear extension points for additional providers, tools, and routing strategies.

Autonomous Agent Architecture: Python Implementations for MCP Integration

Theoretical Framework for Agent Design

This collection of Python implementations establishes a comprehensive agent architecture leveraging the Modern Computational Paradigm (MCP) system. The design emphasizes cognitive capabilities including knowledge retrieval, conversation flow management, and contextual awareness through a modular approach to agent construction.

Core Agent Infrastructure

Base Agent Class

python
# app/agents/base_agent.py
from abc import ABC, abstractmethod
from typing import Dict, List, Any, Optional
import uuid
import logging
from pydantic import BaseModel, Field

from app.services.provider_service import ProviderService
from app.models.message import Message, MessageRole
from app.models.tool import Tool

logger = logging.getLogger(__name__)


class AgentState(BaseModel):
    """Represents the internal state of an agent."""
    conversation_history: List[Message] = Field(default_factory=list)
    memory: Dict[str, Any] = Field(default_factory=dict)
    context: Dict[str, Any] = Field(default_factory=dict)
    metadata: Dict[str, Any] = Field(default_factory=dict)
    session_id: str = Field(default_factory=lambda: str(uuid.uuid4()))


class BaseAgent(ABC):
    """Abstract base class for all agents in the system."""

    def __init__(
        self,
        provider_service: ProviderService,
        system_prompt: str,
        tools: Optional[List[Tool]] = None,
        state: Optional[AgentState] = None
    ):
        self.provider_service = provider_service
        self.system_prompt = system_prompt
        self.tools = tools or []
        self.state = state or AgentState()

        # Initialize conversation with system prompt
        self._initialize_conversation()

    def _initialize_conversation(self):
        """Initialize the conversation history with the system prompt."""
        self.state.conversation_history.append(
            Message(role=MessageRole.SYSTEM, content=self.system_prompt)
        )

    async def process_message(self, message: str, user_id: str) -> str:
        """Process a user message and return a response."""
        # Add user message to conversation history
        user_message = Message(role=MessageRole.USER, content=message)
        self.state.conversation_history.append(user_message)

        # Process the message and generate a response
        response = await self._generate_response(user_id)

        # Add assistant response to conversation history
        assistant_message = Message(role=MessageRole.ASSISTANT, content=response)
        self.state.conversation_history.append(assistant_message)

        return response

    @abstractmethod
    async def _generate_response(self, user_id: str) -> str:
        """Generate a response based on the conversation history."""
        pass

    async def add_context(self, key: str, value: Any):
        """Add contextual information to the agent's state."""
        self.state.context[key] = value

    def get_conversation_history(self) -> List[Message]:
        """Return the conversation history."""
        return self.state.conversation_history

    def clear_conversation(self, keep_system_prompt: bool = True):
        """Clear the conversation history.

        With keep_system_prompt=True, every existing system message is kept.
        Otherwise the history is reset to the original system prompt only.
        """
        if keep_system_prompt and self.state.conversation_history:
            system_messages = [
                msg for msg in self.state.conversation_history
                if msg.role == MessageRole.SYSTEM
            ]
            self.state.conversation_history = system_messages
        else:
            self.state.conversation_history = []
            self._initialize_conversation()

Specialized Agent Implementations

Research Agent with Knowledge Retrieval

python
# app/agents/research_agent.py
from typing import List, Dict, Any, Optional
import json
import logging

from app.agents.base_agent import BaseAgent
from app.services.knowledge_service import KnowledgeService
from app.models.message import Message, MessageRole
from app.models.tool import Tool

logger = logging.getLogger(__name__)


class ResearchAgent(BaseAgent):
    """Agent specialized for research tasks with knowledge retrieval capabilities."""

    def __init__(self, *args, knowledge_service: KnowledgeService, **kwargs):
        super().__init__(*args, **kwargs)
        self.knowledge_service = knowledge_service

        # Register knowledge retrieval tools
        self.tools.extend([
            Tool(
                name="search_knowledge_base",
                description="Search the knowledge base for relevant information",
                parameters={
                    "type": "object",
                    "properties": {
                        "query": {
                            "type": "string",
                            "description": "The search query"
                        },
                        "max_results": {
                            "type": "integer",
                            "description": "Maximum number of results to return",
                            "default": 3
                        }
                    },
                    "required": ["query"]
                }
            ),
            Tool(
                name="retrieve_document",
                description="Retrieve a specific document by ID",
                parameters={
                    "type": "object",
                    "properties": {
                        "document_id": {
                            "type": "string",
                            "description": "The ID of the document to retrieve"
                        }
                    },
                    "required": ["document_id"]
                }
            )
        ])

    async def _generate_response(self, user_id: str) -> str:
        """Generate a response with knowledge augmentation."""
        # Extract the last user message
        last_user_message = next(
            (msg for msg in reversed(self.state.conversation_history)
             if msg.role == MessageRole.USER),
            None
        )

        if not last_user_message:
            return "I don't have any messages to respond to."

        # Perform knowledge retrieval to augment the response
        relevant_information = await self._retrieve_relevant_knowledge(last_user_message.content)

        # Add retrieved information to context, just before the last message
        if relevant_information:
            context_message = Message(
                role=MessageRole.SYSTEM,
                content=f"Relevant information: {relevant_information}"
            )
            augmented_history = self.state.conversation_history.copy()
            augmented_history.insert(-1, context_message)
        else:
            augmented_history = self.state.conversation_history

        # Generate response using the provider service
        response = await self.provider_service.generate_completion(
            messages=[msg.model_dump() for msg in augmented_history],
            tools=self.tools,
            user=user_id
        )

        # Process tool calls if any
        if response.get("tool_calls"):
            tool_responses = await self._process_tool_calls(response["tool_calls"])

            # Add tool responses to conversation history
            for tool_response in tool_responses:
                self.state.conversation_history.append(
                    Message(
                        role=MessageRole.TOOL,
                        content=tool_response["content"],
                        tool_call_id=tool_response["tool_call_id"]
                    )
                )

            # Generate a new response with tool results
            final_response = await self.provider_service.generate_completion(
                messages=[msg.model_dump() for msg in self.state.conversation_history],
                tools=self.tools,
                user=user_id
            )
            return final_response["message"]["content"]

        return response["message"]["content"]

    async def _retrieve_relevant_knowledge(self, query: str) -> Optional[str]:
        """Retrieve relevant information from knowledge base."""
        try:
            results = await self.knowledge_service.search(query, max_results=3)

            if not results:
                return None

            # Format the results
            formatted_results = "\n\n".join([
                f"Source: {result['title']}\n"
                f"Content: {result['content']}\n"
                f"Relevance: {result['relevance_score']}"
                for result in results
            ])

            return formatted_results
        except Exception as e:
            logger.error(f"Error retrieving knowledge: {str(e)}")
            return None

    async def _process_tool_calls(self, tool_calls: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
        """Process tool calls and return tool responses."""
        tool_responses = []

        for tool_call in tool_calls:
            tool_name = tool_call["function"]["name"]
            tool_args = tool_call["function"]["arguments"]
            tool_call_id = tool_call["id"]

            try:
                # Arguments may arrive as a JSON-encoded string (as in OpenAI's format)
                if isinstance(tool_args, str):
                    tool_args = json.loads(tool_args)

                if tool_name == "search_knowledge_base":
                    results = await self.knowledge_service.search(
                        query=tool_args["query"],
                        max_results=tool_args.get("max_results", 3)
                    )
                    formatted_results = "\n\n".join([
                        f"Document ID: {result['id']}\n"
                        f"Title: {result['title']}\n"
                        f"Summary: {result['summary']}"
                        for result in results
                    ])

                    tool_responses.append({
                        "tool_call_id": tool_call_id,
                        "content": formatted_results or "No results found."
                    })

                elif tool_name == "retrieve_document":
                    document = await self.knowledge_service.retrieve_document(
                        document_id=tool_args["document_id"]
                    )

                    if document:
                        tool_responses.append({
                            "tool_call_id": tool_call_id,
                            "content": f"Title: {document['title']}\n\n{document['content']}"
                        })
                    else:
                        tool_responses.append({
                            "tool_call_id": tool_call_id,
                            "content": "Document not found."
                        })

                else:
                    # Every tool call needs a response, even for unknown tools
                    tool_responses.append({
                        "tool_call_id": tool_call_id,
                        "content": f"Unknown tool: {tool_name}"
                    })
            except Exception as e:
                logger.error(f"Error processing tool call {tool_name}: {str(e)}")
                tool_responses.append({
                    "tool_call_id": tool_call_id,
                    "content": f"Error processing tool call: {str(e)}"
                })

        return tool_responses

Conversational Flow Manager Agent

python
1# app/agents/conversation_manager.py
2from typing import Dict, List, Any, Optional
3import logging
4import json
5
6from app.agents.base_agent import BaseAgent
7from app.models.message import Message, MessageRole
8
9logger = logging.getLogger(__name__)
10
11class ConversationState(BaseModel):
12 """Tracks the state of a conversation."""
13 current_topic: Optional[str] = None
14 topic_history: List[str] = Field(default_factory=list)
15 user_preferences: Dict[str, Any] = Field(default_factory=dict)
16 conversation_stage: str = "opening" # opening, exploring, focusing, concluding
17 open_questions: List[str] = Field(default_factory=list)
18 satisfaction_score: Optional[float] = None
19
20class ConversationManager(BaseAgent):
21 """Agent specialized in managing conversation flow and context."""
22
23 def __init__(self, *args, **kwargs):
24 super().__init__(*args, **kwargs)
25 self.conversation_state = ConversationState()
26
27 # Register conversation management tools
28 self.tools.extend([
29 {
30 "type": "function",
31 "function": {
32 "name": "update_conversation_state",
33 "description": "Update the state of the conversation based on analysis",
34 "parameters": {
35 "type": "object",
36 "properties": {
37 "current_topic": {
38 "type": "string",
39 "description": "The current topic of conversation"
40 },
41 "conversation_stage": {
42 "type": "string",
43 "description": "The current stage of the conversation",
44 "enum": ["opening", "exploring", "focusing", "concluding"]
45 },
46 "detected_preferences": {
47 "type": "object",
48 "description": "Preferences detected from the user"
49 },
50 "open_questions": {
51 "type": "array",
52 "items": {"type": "string"},
53 "description": "Questions that remain unanswered"
54 },
55 "satisfaction_estimate": {
56 "type": "number",
57 "description": "Estimated user satisfaction (0-1)"
58 }
59 }
60 }
61 }
62 }
63 ])
64
    async def _generate_response(self, user_id: str) -> str:
        """Generate a response with conversation flow management."""
        # First, analyze the conversation to update state
        analysis_prompt = self._create_analysis_prompt()

        analysis_messages = [
            {"role": "system", "content": analysis_prompt},
            {"role": "user", "content": "Analyze the following conversation and update the conversation state."},
            {"role": "user", "content": self._format_conversation_history()}
        ]

        # Force the model to answer through the state-update tool
        analysis_response = await self.provider_service.generate_completion(
            messages=analysis_messages,
            tools=self.tools,
            tool_choice={"type": "function", "function": {"name": "update_conversation_state"}},
            user=user_id
        )

        # Process conversation state update
        if analysis_response.get("tool_calls"):
            tool_call = analysis_response["tool_calls"][0]
            if tool_call["function"]["name"] == "update_conversation_state":
                try:
                    state_update = json.loads(tool_call["function"]["arguments"])
                    self._update_conversation_state(state_update)
                except Exception as e:
                    logger.error(f"Error updating conversation state: {str(e)}")

        # Now generate the actual response with enhanced context
        enhanced_messages = self.state.conversation_history.copy()

        # Add conversation state as context, just before the most recent message
        context_message = Message(
            role=MessageRole.SYSTEM,
            content=self._format_conversation_context()
        )
        enhanced_messages.insert(-1, context_message)

        response = await self.provider_service.generate_completion(
            messages=[msg.model_dump() for msg in enhanced_messages],
            user=user_id
        )

        return response["message"]["content"]

    def _create_analysis_prompt(self) -> str:
        """Create a prompt for conversation analysis."""
        return """
        You are a conversation analysis expert. Your task is to analyze the conversation
        and extract key information about the current state of the dialogue.

        Specifically, you should:
        1. Identify the current main topic of conversation
        2. Determine the stage of the conversation (opening, exploring, focusing, or concluding)
        3. Detect user preferences and interests from their messages
        4. Track open questions that haven't been fully addressed
        5. Estimate user satisfaction based on their engagement and responses

        Use the update_conversation_state function to provide this analysis.
        """

    def _format_conversation_history(self) -> str:
        """Format the conversation history for analysis."""
        formatted = []

        for msg in self.state.conversation_history:
            # System messages carry instructions, not dialogue, so skip them
            if msg.role == MessageRole.SYSTEM:
                continue
            formatted.append(f"{msg.role.value}: {msg.content}")

        return "\n\n".join(formatted)

    def _update_conversation_state(self, update: Dict[str, Any]):
        """Update the conversation state with analysis results."""
        if "current_topic" in update and update["current_topic"]:
            if self.conversation_state.current_topic != update["current_topic"]:
                # Archive the previous topic before switching
                if self.conversation_state.current_topic:
                    self.conversation_state.topic_history.append(
                        self.conversation_state.current_topic
                    )
                self.conversation_state.current_topic = update["current_topic"]

        if "conversation_stage" in update:
            self.conversation_state.conversation_stage = update["conversation_stage"]

        # The model may return null or a non-object here; only merge real dicts
        # so that a malformed field does not abort the remaining updates
        detected_preferences = update.get("detected_preferences")
        if isinstance(detected_preferences, dict):
            for key, value in detected_preferences.items():
                self.conversation_state.user_preferences[key] = value

        if "open_questions" in update:
            self.conversation_state.open_questions = update["open_questions"]

        if "satisfaction_estimate" in update:
            self.conversation_state.satisfaction_score = update["satisfaction_estimate"]

    def _format_conversation_context(self) -> str:
        """Format the conversation state as context for response generation."""
        return f"""
        Current conversation context:
        - Topic: {self.conversation_state.current_topic or 'Not yet established'}
        - Conversation stage: {self.conversation_state.conversation_stage}
        - User preferences: {json.dumps(self.conversation_state.user_preferences, indent=2)}
        - Open questions: {', '.join(self.conversation_state.open_questions) if self.conversation_state.open_questions else 'None'}

        Previous topics: {', '.join(self.conversation_state.topic_history) if self.conversation_state.topic_history else 'None'}

        Adapt your response to this conversation context. If in exploring stage, ask open-ended questions.
        If in focusing stage, provide detailed information on the current topic. If in concluding stage,
        summarize key points and check if the user needs anything else.
        """
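
The state-update logic can be exercised without a model or provider. The following is a minimal sketch, assuming a `ConversationState` shaped like the attributes the manager reads and writes; the dataclass, its defaults, and the `apply_update` helper are hypothetical stand-ins for illustration, and the type guard on preferences is an added assumption rather than part of the class above.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class ConversationState:
    """Hypothetical stand-in: fields inferred from how the manager uses its state."""
    current_topic: Optional[str] = None
    topic_history: List[str] = field(default_factory=list)
    conversation_stage: str = "opening"
    user_preferences: Dict[str, Any] = field(default_factory=dict)
    open_questions: List[str] = field(default_factory=list)
    satisfaction_score: float = 0.5


def apply_update(state: ConversationState, update: Dict[str, Any]) -> ConversationState:
    """Merge one analysis result into the state (illustrative helper)."""
    topic = update.get("current_topic")
    if topic and topic != state.current_topic:
        # Archive the previous topic before switching to the new one
        if state.current_topic:
            state.topic_history.append(state.current_topic)
        state.current_topic = topic
    if "conversation_stage" in update:
        state.conversation_stage = update["conversation_stage"]
    prefs = update.get("detected_preferences")
    if isinstance(prefs, dict):  # assumption: ignore malformed preference payloads
        state.user_preferences.update(prefs)
    if "open_questions" in update:
        state.open_questions = list(update["open_questions"])
    if "satisfaction_estimate" in update:
        state.satisfaction_score = update["satisfaction_estimate"]
    return state


# Two simulated analysis results, as the tool call arguments might decode
state = ConversationState()
apply_update(state, {"current_topic": "local models", "conversation_stage": "exploring"})
apply_update(state, {
    "current_topic": "tool calling",
    "conversation_stage": "focusing",
    "detected_preferences": {"detail_level": "high"},
    "open_questions": ["Which model supports tools?"],
})
print(state.current_topic, state.topic_history, state.conversation_stage)
```

Keeping the merge step a plain function over a plain data object makes it easy to unit test separately from the two completion calls.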

Memory-Enhanced Contextual Agent

python
# app/agents/contextual_agent.py
from typing import List, Dict, Any, Optional, Tuple
import logging
import time
from datetime import datetime

from app.agents.base_agent import BaseAgent
from app.services.memory_service import MemoryService
from app.models.message import Message, MessageRole

logger = logging.getLogger(__name__)

class ContextualAgent(BaseAgent):
    """Agent with enhanced contextual awareness and memory capabilities."""

    def __init__(self, *args, memory_service: MemoryService, **kwargs):
        super().__init__(*args, **kwargs)
        self.memory_service = memory_service

        # Initialize memory collections
        self.episodic_memory = []  # Stores specific interactions/events
        self.semantic_memory = {}  # Stores facts and knowledge
        self.working_memory = []   # Currently active context

        self.max_working_memory = 10  # Max items in working memory

    async def _generate_response(self, user_id: str, *, allow_regeneration: bool = True) -> str:
        """Generate a response with contextual memory enhancement."""
        # Update memories based on recent conversation. This is skipped on a
        # regeneration pass so the same user message is not recorded twice.
        if allow_regeneration:
            await self._update_memories(user_id)

        # Retrieve relevant memories for current context
        relevant_memories = await self._retrieve_relevant_memories(user_id)

        # Create context-enhanced prompt
        context_message = Message(
            role=MessageRole.SYSTEM,
            content=self._create_context_prompt(relevant_memories)
        )

        # Insert context before the last user message
        enhanced_history = self.state.conversation_history.copy()
        user_message_index = next(
            (i for i, msg in enumerate(reversed(enhanced_history))
             if msg.role == MessageRole.USER),
            None
        )
        if user_message_index is not None:
            # Convert the offset from the end into a forward index
            user_message_index = len(enhanced_history) - 1 - user_message_index
            enhanced_history.insert(user_message_index, context_message)

        # Generate response
        response = await self.provider_service.generate_completion(
            messages=[msg.model_dump() for msg in enhanced_history],
            tools=self.tools,
            user=user_id
        )

        # Process memory-related tool calls if any
        if response.get("tool_calls"):
            memory_updates = await self._process_memory_tools(response["tool_calls"])
            if memory_updates and allow_regeneration:
                # Regenerate once with the new context; the flag keeps this
                # from recursing without bound
                return await self._generate_response(user_id, allow_regeneration=False)

        # Update working memory with the response
        if response["message"]["content"]:
            self.working_memory.append({
                "type": "assistant_response",
                "content": response["message"]["content"],
                "timestamp": time.time()
            })
            self._prune_working_memory()

        return response["message"]["content"]

    async def _update_memories(self, user_id: str):
        """Update the agent's memories based on recent conversation."""
        # Get last user message
        last_user_message = next(
            (msg for msg in reversed(self.state.conversation_history)
             if msg.role == MessageRole.USER),
            None
        )

        if not last_user_message:
            return

        # Add to working memory
        self.working_memory.append({
            "type": "user_message",
            "content": last_user_message.content,
            "timestamp": time.time()
        })

        # Extract potential semantic memories (facts, preferences)
        if len(self.state.conversation_history) > 2:
            extraction_messages = [
                {"role": "system", "content": "Extract key facts, preferences, or personal details from this user message that would be useful to remember for future interactions. Return in JSON format with keys: 'facts', 'preferences', 'personal_details', each containing an array of strings."},
                {"role": "user", "content": last_user_message.content}
            ]

            try:
                extraction = await self.provider_service.generate_completion(
                    messages=extraction_messages,
                    user=user_id,
                    response_format={"type": "json_object"}
                )

                content = extraction["message"]["content"]
                if content:
                    import json
                    memory_data = json.loads(content)

                    # Store in semantic memory
                    timestamp = datetime.now().isoformat()
                    for category, items in memory_data.items():
                        if not isinstance(items, list):
                            continue
                        for item in items:
                            if not item or not isinstance(item, str):
                                continue
                            memory_key = f"{category}:{self._generate_memory_key(item)}"
                            self.semantic_memory[memory_key] = {
                                "content": item,
                                "category": category,
                                "last_accessed": timestamp,
                                "created_at": timestamp,
                                "importance": self._calculate_importance(item)
                            }

                    # Store in memory service for persistence
                    await self.memory_service.store_memories(
                        user_id=user_id,
                        memories=self.semantic_memory
                    )
            except Exception as e:
                logger.error(f"Error extracting memories: {str(e)}")

        # Prune working memory if needed
        self._prune_working_memory()

    async def _retrieve_relevant_memories(self, user_id: str) -> Dict[str, List[Any]]:
        """Retrieve memories relevant to the current context."""
        # Build the query from the last few messages; an empty history
        # yields an empty query instead of raising IndexError
        history = self.state.conversation_history
        if not history:
            query = ""
        elif len(history) <= 2:
            query = history[-1].content or ""
        else:
            recent_messages = history[-3:]
            query = " ".join([
                msg.content for msg in recent_messages
                if msg.role != MessageRole.SYSTEM and msg.content
            ])

        # Retrieve from memory service
        stored_memories = await self.memory_service.retrieve_memories(
            user_id=user_id,
            query=query,
            limit=5
        )

        # Combine with local semantic memory
        all_memories = {
            "facts": [],
            "preferences": [],
            "personal_details": [],
            "episodic": self.episodic_memory[-3:] if self.episodic_memory else []
        }

        # Add from semantic memory
        for key, memory in self.semantic_memory.items():
            category = memory["category"]
            if category in all_memories and len(all_memories[category]) < 5:
                all_memories[category].append(memory["content"])

        # Add from stored memories, skipping entries already present locally
        for memory in stored_memories:
            category = memory.get("category", "facts")
            if (category in all_memories
                    and len(all_memories[category]) < 5
                    and memory["content"] not in all_memories[category]):
                all_memories[category].append(memory["content"])

            # Update last accessed
            if memory.get("id"):
                memory_key = f"{category}:{memory['id']}"
                if memory_key in self.semantic_memory:
                    self.semantic_memory[memory_key]["last_accessed"] = datetime.now().isoformat()

        return all_memories

    def _create_context_prompt(self, memories: Dict[str, List[Any]]) -> str:
        """Create a context prompt with relevant memories."""
        context_parts = ["Additional context to consider:"]

        if memories["facts"]:
            facts = "\n".join([f"- {fact}" for fact in memories["facts"]])
            context_parts.append(f"Facts about the user or relevant topics:\n{facts}")

        if memories["preferences"]:
            prefs = "\n".join([f"- {pref}" for pref in memories["preferences"]])
            context_parts.append(f"User preferences:\n{prefs}")

        if memories["personal_details"]:
            details = "\n".join([f"- {detail}" for detail in memories["personal_details"]])
            context_parts.append(f"Personal details:\n{details}")

        if memories["episodic"]:
            episodes = "\n".join([f"- {ep.get('summary', '')}" for ep in memories["episodic"]])
            context_parts.append(f"Recent interactions:\n{episodes}")

        # Add working memory summary
        if self.working_memory:
            working_context = "Current context:\n"
            for item in self.working_memory[-5:]:
                item_type = item["type"]
                # Truncate long items to a 100-character preview
                content_preview = item["content"][:100] + "..." if len(item["content"]) > 100 else item["content"]
                working_context += f"- [{item_type}] {content_preview}\n"
            context_parts.append(working_context)

        context_parts.append("Use this information to personalize your response, but don't explicitly mention that you're using saved information unless directly relevant.")

        return "\n\n".join(context_parts)

    def _prune_working_memory(self):
        """Prune working memory to stay within limits."""
        if len(self.working_memory) > self.max_working_memory:
            # Instead of simple truncation, we prioritize by importance, then recency
            ranked = sorted(
                self.working_memory,
                key=lambda x: (x.get("importance", 0.5), x["timestamp"]),
                reverse=True
            )
            kept = ranked[:self.max_working_memory]
            # Restore chronological order so that slices such as
            # working_memory[-5:] still return the most recent items
            self.working_memory = sorted(kept, key=lambda x: x["timestamp"])

    def _generate_memory_key(self, content: str) -> str:
        """Generate a short, stable key for memory storage."""
        import hashlib
        # MD5 is used only as a content fingerprint here, not for security
        return hashlib.md5(content.encode()).hexdigest()[:10]

    def _calculate_importance(self, content: str) -> float:
        """Calculate the importance score of a memory item."""
        # Simple heuristic based on content length and presence of certain keywords
        importance_keywords = ["always", "never", "hate", "love", "favorite", "important", "must", "need"]

        base_score = min(len(content) / 100, 0.5)  # Longer items get higher base score, up to 0.5

        keyword_score = sum(0.1 for word in importance_keywords if word in content.lower())
        keyword_score = min(keyword_score, 0.5)  # Cap at 0.5

        return base_score + keyword_score

    async def _process_memory_tools(self, tool_calls: List[Dict[str, Any]]) -> bool:
        """Process memory-related tool calls."""
        # Implement if we add memory-specific tools
        return False
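
The scoring and pruning heuristics are easier to reason about in isolation. The sketch below is a standalone approximation, not the agent's own code: `calculate_importance` copies the keyword heuristic, while `prune` is a hypothetical helper that ranks by importance and recency and then, as an added assumption, puts the survivors back in chronological order so that "last N items" slices stay meaningful.

```python
from typing import Any, Dict, List

IMPORTANCE_KEYWORDS = ["always", "never", "hate", "love", "favorite", "important", "must", "need"]


def calculate_importance(content: str) -> float:
    """Length-plus-keyword heuristic; each component is capped at 0.5."""
    base_score = min(len(content) / 100, 0.5)
    keyword_hits = sum(0.1 for word in IMPORTANCE_KEYWORDS if word in content.lower())
    return base_score + min(keyword_hits, 0.5)


def prune(items: List[Dict[str, Any]], max_items: int) -> List[Dict[str, Any]]:
    """Keep the highest-ranked items, returned oldest first (hypothetical helper)."""
    if len(items) <= max_items:
        return list(items)
    ranked = sorted(
        items,
        key=lambda x: (x.get("importance", 0.5), x["timestamp"]),
        reverse=True,
    )
    # Re-sort the kept items by time so the list still reads as a timeline
    return sorted(ranked[:max_items], key=lambda x: x["timestamp"])


preference = "I always need metric units"
memory = [
    {"content": "hello", "timestamp": 1.0, "importance": 0.2},
    {"content": preference, "timestamp": 2.0, "importance": calculate_importance(preference)},
    {"content": "ok", "timestamp": 3.0, "importance": 0.1},
    {"content": "thanks", "timestamp": 4.0, "importance": 0.3},
]
pruned = prune(memory, max_items=2)
print([item["content"] for item in pruned])
```

The keyword check is a substring match, so "need" also matches "needle"; a word-boundary match would be stricter if that matters for your data.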

Advanced Tool Integration

Collaborative Task Management Agent

python
# app/agents/task_agent.py
from typing import List, Dict, Any, Optional
import logging
import json
import asyncio

from app.agents.base_agent import BaseAgent
from app.models.message import Message, MessageRole
from app.models.tool import Tool
from app.services.task_service import TaskService

logger = logging.getLogger(__name__)

class TaskManagementAgent(BaseAgent):
    """Agent specialized in collaborative task management."""

    def __init__(self, *args, task_service: TaskService, **kwargs):
        super().__init__(*args, **kwargs)
        self.task_service = task_service

        # Register task management tools
        self.tools.extend([
            Tool(
                name="list_tasks",
                description="List tasks for the user",
                parameters={
                    "type": "object",
                    "properties": {
                        "status": {
                            "type": "string",
                            "enum": ["pending", "in_progress", "completed", "all"],
                            "description": "Filter tasks by status"
                        },
                        "limit": {
                            "type": "integer",
                            "description": "Maximum number of tasks to return",
                            "default": 10
                        }
                    }
                }
            ),
            Tool(
                name="create_task",
                description="Create a new task",
                parameters={
                    "type": "object",
                    "properties": {
                        "title": {
                            "type": "string",
                            "description": "Title of the task"
                        },
                        "description": {
                            "type": "string",
                            "description": "Detailed description of the task"
                        },
                        "due_date": {
                            "type": "string",
                            "description": "Due date in ISO format (YYYY-MM-DD)"
                        },
                        "priority": {
                            "type": "string",
                            "enum": ["low", "medium", "high"],
                            "description": "Priority level of the task"
                        }
                    },
                    "required": ["title"]
                }
            ),
            Tool(
                name="update_task",
                description="Update an existing task",
                parameters={
                    "type": "object",
                    "properties": {
                        "task_id": {
                            "type": "string",
                            "description": "ID of the task to update"
                        },
                        "title": {
                            "type": "string",
                            "description": "New title of the task"
                        },
                        "description": {
                            "type": "string",
                            "description": "New description of the task"
                        },
                        "status": {
                            "type": "string",
                            "enum": ["pending", "in_progress", "completed"],
                            "description": "New status of the task"
                        },
                        "due_date": {
                            "type": "string",
                            "description": "New due date in ISO format (YYYY-MM-DD)"
                        },
                        "priority": {
                            "type": "string",
                            "enum": ["low", "medium", "high"],
                            "description": "New priority level of the task"
                        }
                    },
                    "required": ["task_id"]
                }
            ),
            Tool(
                name="delete_task",
                description="Delete a task",
                parameters={
                    "type": "object",
                    "properties": {
                        "task_id": {
                            "type": "string",
                            "description": "ID of the task to delete"
                        },
                        "confirm": {
                            "type": "boolean",
                            "description": "Confirmation to delete the task",
                            "default": False
                        }
                    },
                    "required": ["task_id", "confirm"]
                }
            )
        ])

    async def _generate_response(self, user_id: str) -> str:
        """Generate a response with task management capabilities."""
        # Prepare messages for completion
        messages = [msg.model_dump() for msg in self.state.conversation_history]

        # Generate initial response
        response = await self.provider_service.generate_completion(
            messages=messages,
            tools=self.tools,
            user=user_id
        )

        # Process tool calls if any
        if response.get("tool_calls"):
            tool_responses = await self._process_tool_calls(response["tool_calls"], user_id)

            # Add tool responses to conversation history
            for tool_response in tool_responses:
                self.state.conversation_history.append(
                    Message(
                        role=MessageRole.TOOL,
                        content=tool_response["content"],
                        tool_call_id=tool_response["tool_call_id"]
                    )
                )

            # Generate new response with tool results
            updated_messages = [msg.model_dump() for msg in self.state.conversation_history]
            final_response = await self.provider_service.generate_completion(
                messages=updated_messages,
                tools=self.tools,
                user=user_id
            )

            # Handle any additional tool calls (recursive)
            if final_response.get("tool_calls"):
                # For simplicity, we'll limit to one level of recursion
                return await self._handle_recursive_tool_calls(final_response, user_id)

            return final_response["message"]["content"]

        return response["message"]["content"]

    async def _handle_recursive_tool_calls(self, response: Dict[str, Any], user_id: str) -> str:
        """Handle additional tool calls recursively."""
        tool_responses = await self._process_tool_calls(response["tool_calls"], user_id)

        # Add tool responses to conversation history
        for tool_response in tool_responses:
            self.state.conversation_history.append(
                Message(
                    role=MessageRole.TOOL,
                    content=tool_response["content"],
                    tool_call_id=tool_response["tool_call_id"]
                )
            )

        # Generate final response with all tool results
        updated_messages = [msg.model_dump() for msg in self.state.conversation_history]
        final_response = await self.provider_service.generate_completion(
            messages=updated_messages,
            tools=self.tools,
            user=user_id
        )

        return final_response["message"]["content"]

    async def _process_tool_calls(self, tool_calls: List[Dict[str, Any]], user_id: str) -> List[Dict[str, Any]]:
        """Process tool calls and return tool responses."""
        tool_responses = []

        for tool_call in tool_calls:
            tool_name = tool_call["function"]["name"]
            tool_args_json = tool_call["function"]["arguments"]
            tool_call_id = tool_call["id"]

            try:
                # Parse arguments as JSON
                tool_args = json.loads(tool_args_json)

                # Process based on tool name
                if tool_name == "list_tasks":
                    result = await self.task_service.list_tasks(
                        user_id=user_id,
                        status=tool_args.get("status", "all"),
                        limit=tool_args.get("limit", 10)
                    )

                    if result:
                        tasks_formatted = "\n\n".join([
                            f"ID: {task['id']}\n"
                            f"Title: {task['title']}\n"
                            f"Status: {task['status']}\n"
                            f"Priority: {task['priority']}\n"
                            f"Due Date: {task['due_date']}\n"
                            f"Description: {task['description']}"
                            for task in result
                        ])
                        tool_responses.append({
                            "tool_call_id": tool_call_id,
                            "content": f"Found {len(result)} tasks:\n\n{tasks_formatted}"
                        })
                    else:
                        tool_responses.append({
                            "tool_call_id": tool_call_id,
                            "content": "No tasks found matching your criteria."
                        })

                elif tool_name == "create_task":
                    result = await self.task_service.create_task(
                        user_id=user_id,
                        title=tool_args["title"],
                        description=tool_args.get("description", ""),
                        due_date=tool_args.get("due_date"),
                        priority=tool_args.get("priority", "medium")
                    )

                    tool_responses.append({
                        "tool_call_id": tool_call_id,
                        "content": f"Task created successfully.\n\nID: {result['id']}\nTitle: {result['title']}"
                    })

                elif tool_name == "update_task":
                    # Forward only the fields declared in the tool schema, so an
                    # unexpected key from the model cannot reach the service call
                    allowed_fields = {"title", "description", "status", "due_date", "priority"}
                    update_data = {k: v for k, v in tool_args.items() if k in allowed_fields}
                    result = await self.task_service.update_task(
                        user_id=user_id,
                        task_id=tool_args["task_id"],
                        **update_data
                    )

                    if result:
                        tool_responses.append({
                            "tool_call_id": tool_call_id,
                            "content": f"Task updated successfully.\n\nID: {result['id']}\nTitle: {result['title']}\nStatus: {result['status']}"
                        })
                    else:
                        tool_responses.append({
                            "tool_call_id": tool_call_id,
                            "content": f"Task with ID {tool_args['task_id']} not found or you don't have permission to update it."
                        })

                elif tool_name == "delete_task":
                    # Deletion is destructive, so require explicit confirmation
                    if not tool_args.get("confirm", False):
                        tool_responses.append({
                            "tool_call_id": tool_call_id,
                            "content": "Task deletion requires confirmation. Please set 'confirm' to true to proceed."
                        })
                    else:
                        result = await self.task_service.delete_task(
                            user_id=user_id,
                            task_id=tool_args["task_id"]
                        )

                        if result:
                            tool_responses.append({
                                "tool_call_id": tool_call_id,
                                "content": f"Task with ID {tool_args['task_id']} has been deleted successfully."
                            })
                        else:
                            tool_responses.append({
                                "tool_call_id": tool_call_id,
                                "content": f"Task with ID {tool_args['task_id']} not found or you don't have permission to delete it."
                            })

                else:
                    # Every tool call needs a matching tool response
                    tool_responses.append({
                        "tool_call_id": tool_call_id,
                        "content": f"Error: Unknown tool: {tool_name}"
                    })

            except json.JSONDecodeError:
                tool_responses.append({
                    "tool_call_id": tool_call_id,
                    "content": "Error: Invalid JSON in tool arguments."
                })
            except KeyError as e:
                tool_responses.append({
                    "tool_call_id": tool_call_id,
                    "content": f"Error: Missing required parameter: {str(e)}"
                })
            except Exception as e:
                logger.error(f"Error processing tool call {tool_name}: {str(e)}")
                tool_responses.append({
                    "tool_call_id": tool_call_id,
                    "content": f"Error executing {tool_name}: {str(e)}"
                })

        return tool_responses
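
Tool dispatch can be tested without a model by feeding hand-written tool calls to a fake service. The sketch below is one possible arrangement, assuming tool calls arrive as dicts with `id` and `function` keys as in the agent above; `InMemoryTaskService` and `run_tool_call` are hypothetical names introduced for illustration, and the handler table is an alternative to the `if`/`elif` chain rather than the agent's actual structure.

```python
import asyncio
import json
from typing import Any, Awaitable, Callable, Dict, List


class InMemoryTaskService:
    """Hypothetical stand-in for TaskService that keeps tasks in a dict."""

    def __init__(self) -> None:
        self._tasks: Dict[str, Dict[str, Any]] = {}

    async def create_task(self, user_id: str, title: str, priority: str = "medium") -> Dict[str, Any]:
        task_id = str(len(self._tasks) + 1)
        task = {"id": task_id, "user_id": user_id, "title": title,
                "priority": priority, "status": "pending"}
        self._tasks[task_id] = task
        return task

    async def list_tasks(self, user_id: str, status: str = "all") -> List[Dict[str, Any]]:
        return [t for t in self._tasks.values()
                if t["user_id"] == user_id and status in ("all", t["status"])]


async def run_tool_call(service: InMemoryTaskService,
                        tool_call: Dict[str, Any],
                        user_id: str) -> Dict[str, str]:
    """Dispatch one tool call through a name-to-handler table."""
    handlers: Dict[str, Callable[..., Awaitable[Any]]] = {
        "create_task": service.create_task,
        "list_tasks": service.list_tasks,
    }
    name = tool_call["function"]["name"]
    handler = handlers.get(name)
    if handler is None:
        return {"tool_call_id": tool_call["id"], "content": f"Error: Unknown tool: {name}"}
    try:
        args = json.loads(tool_call["function"]["arguments"])
    except json.JSONDecodeError:
        return {"tool_call_id": tool_call["id"], "content": "Error: Invalid JSON in tool arguments."}
    result = await handler(user_id=user_id, **args)
    return {"tool_call_id": tool_call["id"], "content": json.dumps(result)}


async def demo() -> List[Dict[str, str]]:
    service = InMemoryTaskService()
    calls = [
        {"id": "call_1", "function": {"name": "create_task",
                                      "arguments": '{"title": "Write docs", "priority": "high"}'}},
        {"id": "call_2", "function": {"name": "list_tasks", "arguments": '{"status": "pending"}'}},
        {"id": "call_3", "function": {"name": "create_task", "arguments": "{not json"}},
    ]
    return [await run_tool_call(service, call, "user-1") for call in calls]


results = asyncio.run(demo())
for item in results:
    print(item["tool_call_id"], item["content"])
```

A handler table keeps each tool's logic in one place, at the cost of the per-tool response formatting that the explicit branches allow.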

Agent Factory and Orchestration

python
# app/agents/agent_factory.py
from typing import Dict, Any, Optional, List, Type
import logging

from app.agents.base_agent import BaseAgent
from app.agents.research_agent import ResearchAgent
from app.agents.conversation_manager import ConversationManager
from app.agents.contextual_agent import ContextualAgent
from app.agents.task_agent import TaskManagementAgent

from app.services.provider_service import ProviderService
from app.services.knowledge_service import KnowledgeService
from app.services.memory_service import MemoryService
from app.services.task_service import TaskService

logger = logging.getLogger(__name__)

class AgentFactory:
    """Factory for creating agent instances based on requirements."""

    def __init__(self,
                 provider_service: ProviderService,
                 knowledge_service: Optional[KnowledgeService] = None,
                 memory_service: Optional[MemoryService] = None,
                 task_service: Optional[TaskService] = None):
        self.provider_service = provider_service
        self.knowledge_service = knowledge_service
        self.memory_service = memory_service
        self.task_service = task_service

        # Register available agent types
        self.agent_types: Dict[str, Type[BaseAgent]] = {
            "research": ResearchAgent,
            "conversation": ConversationManager,
            "contextual": ContextualAgent,
            "task": TaskManagementAgent
        }

    def create_agent(self,
                     agent_type: str,
                     system_prompt: str,
                     tools: Optional[List[Dict[str, Any]]] = None,
                     **kwargs) -> BaseAgent:
        """Create and return an agent instance of the specified type."""
        if agent_type not in self.agent_types:
            raise ValueError(f"Unknown agent type: {agent_type}. Available types: {list(self.agent_types.keys())}")

        agent_class = self.agent_types[agent_type]

        # Prepare required services based on agent type
        agent_kwargs = {
            "provider_service": self.provider_service,
            "system_prompt": system_prompt,
            "tools": tools
        }

        # Add specialized services based on agent type
        if agent_type == "research" and self.knowledge_service:
            agent_kwargs["knowledge_service"] = self.knowledge_service

        if agent_type == "contextual" and self.memory_service:
            agent_kwargs["memory_service"] = self.memory_service

        if agent_type == "task" and self.task_service:
            agent_kwargs["task_service"] = self.task_service

        # Add any additional kwargs
        agent_kwargs.update(kwargs)

        # ContextualAgent and TaskManagementAgent take their service as a
        # required keyword argument; fail here with a clear message instead
        # of a TypeError from the constructor
        required_service = {"contextual": "memory_service", "task": "task_service"}.get(agent_type)
        if required_service and agent_kwargs.get(required_service) is None:
            raise ValueError(f"Agent type '{agent_type}' requires a {required_service}")

        # Create and return the agent instance
        return agent_class(**agent_kwargs)
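
The registry pattern behind the factory can be shown with stub classes. This is a minimal sketch under stated assumptions: `StubAgent`, its subclasses, and `MiniFactory` are hypothetical and take far fewer arguments than the real agents, and declaring the required service on the class is just one way to surface a missing dependency early.

```python
from typing import Any, Dict, Type


class StubAgent:
    """Hypothetical minimal agent; the real BaseAgent takes more arguments."""
    required_service = ""

    def __init__(self, system_prompt: str, **services: Any) -> None:
        self.system_prompt = system_prompt
        self.services = services


class StubContextualAgent(StubAgent):
    required_service = "memory_service"


class StubTaskAgent(StubAgent):
    required_service = "task_service"


class MiniFactory:
    """Maps type names to classes and injects the service each class declares."""

    def __init__(self, **services: Any) -> None:
        # Drop services that were not provided so lookups fail cleanly
        self.services = {name: svc for name, svc in services.items() if svc is not None}
        self.agent_types: Dict[str, Type[StubAgent]] = {
            "contextual": StubContextualAgent,
            "task": StubTaskAgent,
        }

    def create_agent(self, agent_type: str, system_prompt: str) -> StubAgent:
        if agent_type not in self.agent_types:
            raise ValueError(f"Unknown agent type: {agent_type}")
        agent_class = self.agent_types[agent_type]
        needed = agent_class.required_service
        if needed not in self.services:
            raise ValueError(f"Agent type '{agent_type}' requires {needed}")
        return agent_class(system_prompt, **{needed: self.services[needed]})


# Only a memory service is configured, so the task agent cannot be built
factory = MiniFactory(memory_service=object(), task_service=None)
agent = factory.create_agent("contextual", "You remember user preferences.")
try:
    factory.create_agent("task", "You manage tasks.")
    missing_error = None
except ValueError as exc:
    missing_error = str(exc)
print(type(agent).__name__, "|", missing_error)
```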

Metaframework for Agent Composition

python
# app/agents/meta_agent.py
from typing import Dict, List, Any, Optional
import logging
import asyncio
import json

from app.agents.base_agent import BaseAgent, AgentState
from app.models.message import Message, MessageRole
from app.services.provider_service import ProviderService

logger = logging.getLogger(__name__)

class AgentSubsystem:
    """Represents a specialized agent within the MetaAgent."""

    def __init__(self, name: str, agent: BaseAgent, role: str):
        self.name = name
        self.agent = agent
        self.role = role
        self.active = True

class MetaAgent(BaseAgent):
    """A meta-agent that coordinates multiple specialized agents."""

    def __init__(self,
                 provider_service: ProviderService,
                 system_prompt: str,
                 subsystems: Optional[List[AgentSubsystem]] = None,
                 state: Optional[AgentState] = None):
        super().__init__(provider_service, system_prompt, [], state)
        self.subsystems = subsystems or []

        # Tools specific to the meta-agent
        self.tools.extend([
            {
                "type": "function",
                "function": {
                    "name": "route_to_subsystem",
                    "description": "Route a task to a specific subsystem agent",
                    "parameters": {
                        "type": "object",
                        "properties": {
                            "subsystem": {
                                "type": "string",
                                "description": "The name of the subsystem to route to"
                            },
                            "task": {
                                "type": "string",
                                "description": "The task to be performed by the subsystem"
                            },
                            "context": {
                                "type": "object",
                                "description": "Additional context for the subsystem"
                            }
                        },
                        "required": ["subsystem", "task"]
                    }
                }
            },
            {
                "type": "function",
                "function": {
                    "name": "parallel_processing",
                    "description": "Process a task in parallel across multiple subsystems",
                    "parameters": {
                        "type": "object",
                        "properties": {
                            "task": {
                                "type": "string",
                                "description": "The task to process in parallel"
                            },
                            "subsystems": {
                                "type": "array",
                                "items": {
                                    "type": "string"
                                },
                                "description": "List of subsystems to involve"
                            }
                        },
                        "required": ["task", "subsystems"]
                    }
                }
            }
        ])

    def add_subsystem(self, subsystem: AgentSubsystem):
        """Add a new subsystem to the meta-agent."""
        # Check for duplicate names
        if any(existing.name == subsystem.name for existing in self.subsystems):
            raise ValueError(f"Subsystem with name '{subsystem.name}' already exists")

        self.subsystems.append(subsystem)

    def get_subsystem(self, name: str) -> Optional[AgentSubsystem]:
        """Get a subsystem by name."""
        for subsystem in self.subsystems:
            if subsystem.name == name:
                return subsystem
        return None

    async def _generate_response(self, user_id: str) -> str:
        """Generate a response using the meta-agent architecture."""
        # Extract the last user message
        last_user_message = next(
            (msg for msg in reversed(self.state.conversation_history)
             if msg.role == MessageRole.USER),
            None
        )

        if not last_user_message:
            return "I don't have any messages to respond to."

        # First, determine routing strategy using the coordinator
        coordinator_messages = [
            {"role": "system", "content": f"""
            You are the coordinator of a multi-agent system with the following subsystems:

            {self._format_subsystems()}

            Your job is to analyze the user's message and determine the optimal processing strategy:
            1. If the query is best handled by a single specialized subsystem, use route_to_subsystem
            2. If the query would benefit from multiple perspectives, use parallel_processing

            Choose the most appropriate strategy based on the complexity and nature of the request.
            """},
            {"role": "user", "content": last_user_message.content}
        ]

        routing_response = await self.provider_service.generate_completion(
            messages=coordinator_messages,
            tools=self.tools,
            tool_choice="auto",
            user=user_id
        )

        # Process based on the routing decision
        if routing_response.get("tool_calls"):
            tool_call = routing_response["tool_calls"][0]
            function_name = tool_call["function"]["name"]

            try:
                function_args = json.loads(tool_call["function"]["arguments"])

                if function_name == "route_to_subsystem":
                    return await self._handle_single_subsystem_route(
                        function_args["subsystem"],
                        function_args["task"],
                        function_args.get("context", {}),
                        user_id
                    )

                elif function_name == "parallel_processing":
                    return await self._handle_parallel_processing(
                        function_args["task"],
                        function_args["subsystems"],
                        user_id
                    )

            except json.JSONDecodeError:
                logger.error("Error parsing function arguments")
            except KeyError as e:
                logger.error(f"Missing required parameter: {e}")
            except Exception as e:
                logger.error(f"Error in routing: {e}")

        # Fallback to direct response
        return await self._handle_direct_response(user_id)

    async def _handle_single_subsystem_route(self,
                                             subsystem_name: str,
                                             task: str,
                                             context: Dict[str, Any],
                                             user_id: str) -> str:
        """Handle routing to a single subsystem."""
        subsystem = self.get_subsystem(subsystem_name)

        if not subsystem or not subsystem.active:
            return f"Error: Subsystem '{subsystem_name}' not found or not active. Please try a different approach."

        # Process with the selected subsystem
        response = await subsystem.agent.process_message(task, user_id)

        # Format the response to indicate the source
        return f"[{subsystem.name} - {subsystem.role}] {response}"

    async def _handle_parallel_processing(self,
                                          task: str,
                                          subsystem_names: List[str],
                                          user_id: str) -> str:
        """Handle parallel processing across multiple subsystems."""
        # Validate subsystems
        valid_subsystems = []
        for name in subsystem_names:
            subsystem = self.get_subsystem(name)
            if subsystem and subsystem.active:
                valid_subsystems.append(subsystem)

        if not valid_subsystems:
            return "Error: None of the specified subsystems are available."

        # Process in parallel; collect exceptions as results so that one
        # failing subsystem does not discard the others' answers
        tasks = [subsystem.agent.process_message(task, user_id) for subsystem in valid_subsystems]
        responses = await asyncio.gather(*tasks, return_exceptions=True)

        # Format responses
        formatted_responses = []
        for subsystem, response in zip(valid_subsystems, responses):
            if isinstance(response, BaseException):
                logger.error(f"Subsystem '{subsystem.name}' failed: {response}")
                response = "This subsystem failed to produce a response."
            formatted_responses.append(f"## {subsystem.name} ({subsystem.role}):\n{response}")

        # Separate the sections with blank lines so they do not run together
        combined_responses = "\n\n".join(formatted_responses)

        # Synthesize a final response
        synthesis_prompt = f"""
        The user's request was processed by multiple specialized agents:

        {combined_responses}

        Synthesize a comprehensive response that incorporates these perspectives.
        Highlight areas of agreement and provide a balanced view where there are differences.
        """

        synthesis_messages = [
            {"role": "system", "content": "You are a synthesis agent that combines multiple specialized perspectives into a coherent response."},
            {"role": "user", "content": synthesis_prompt}
        ]

        synthesis = await self.provider_service.generate_completion(
            messages=synthesis_messages,
            user=user_id
        )

        return synthesis["message"]["content"]

233 async def _handle_direct_response(self, user_id: str) -> str:
234 """Handle direct response when no routing is determined."""
235 # Generate a response directly using the provider service
236 response = await self.provider_service.generate_completion(
237 messages=[msg.model_dump() for msg in self.state.conversation_history],
238 user=user_id
239 )
240
241 return response["message"]["content"]
242
243 def _format_subsystems(self) -> str:
244 """Format subsystem information for the coordinator prompt."""
245 return "\n".join([
246 f"- {subsystem.name}: {subsystem.role}"
247 for subsystem in self.subsystems if subsystem.active
248 ])
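
The parallel fan-out step can be tried in isolation. The sketch below is a minimal illustration, not the meta-agent itself: `fake_agent` and `fan_out` are hypothetical stand-ins for `subsystem.agent.process_message` and the handler above, and passing `return_exceptions=True` is an assumption about the desired failure handling (one failing subsystem should not discard the others).

```python
import asyncio


async def fake_agent(name: str, task: str, fail: bool = False) -> str:
    """Hypothetical stand-in for subsystem.agent.process_message."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    if fail:
        raise RuntimeError(f"{name} is unavailable")
    return f"{name} handled: {task}"


async def fan_out(task: str) -> list[str]:
    """Run every stand-in agent concurrently and label each outcome."""
    names = ["research", "conversation", "task"]
    coros = [fake_agent(n, task, fail=(n == "task")) for n in names]
    # With return_exceptions=True, raised exceptions come back as results
    # instead of propagating and cancelling the overall await.
    results = await asyncio.gather(*coros, return_exceptions=True)
    sections = []
    for name, result in zip(names, results):
        body = f"[failed: {result}]" if isinstance(result, Exception) else result
        sections.append(f"## {name}:\n{body}")
    return sections


if __name__ == "__main__":
    print("\n\n".join(asyncio.run(fan_out("summarize the report"))))
```

Results come back in the order the coroutines were passed to `asyncio.gather`, which is why zipping them against the list of names is safe.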

Sample Agent Usage Implementation

python
# app/main.py
import logging
import secrets
import uuid
from typing import Optional

from fastapi import FastAPI, HTTPException, Depends, Header
from pydantic import BaseModel

from app.agents.agent_factory import AgentFactory
from app.agents.meta_agent import MetaAgent, AgentSubsystem
from app.services.provider_service import ProviderService
from app.services.knowledge_service import KnowledgeService
from app.services.memory_service import MemoryService
from app.services.task_service import TaskService

# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

app = FastAPI(title="MCP Agent System")

# Initialize services
provider_service = ProviderService()
knowledge_service = KnowledgeService()
memory_service = MemoryService()
task_service = TaskService()

# Initialize agent factory
agent_factory = AgentFactory(
    provider_service=provider_service,
    knowledge_service=knowledge_service,
    memory_service=memory_service,
    task_service=task_service
)

# Agent session storage (in-memory; sessions are lost on restart)
agent_sessions = {}

# Define request/response models
class MessageRequest(BaseModel):
    message: str
    session_id: Optional[str] = None
    agent_type: Optional[str] = None

class MessageResponse(BaseModel):
    response: str
    session_id: str

# Auth dependency
async def verify_api_key(authorization: Optional[str] = Header(None)):
    prefix = "Bearer "
    if not authorization or not authorization.startswith(prefix):
        raise HTTPException(status_code=401, detail="Invalid or missing API key")

    # Strip only the leading scheme; str.replace would also alter a token
    # that happens to contain "Bearer "
    token = authorization[len(prefix):]

    # Simple validation for demo purposes. In production, validate against
    # secure storage. compare_digest avoids leaking timing information.
    if not secrets.compare_digest(token.encode(), b"demo_api_key"):
        raise HTTPException(status_code=401, detail="Invalid API key")

    return token

# Routes
@app.post("/api/v1/chat", response_model=MessageResponse)
async def chat(
    request: MessageRequest,
    api_key: str = Depends(verify_api_key)
):
    user_id = "demo_user"  # In production, extract from API key or auth token

    # Create or retrieve session
    session_id = request.session_id
    if not session_id or session_id not in agent_sessions:
        # Create a new agent instance if the session doesn't exist.
        # A random ID is harder to guess than a counter-based one.
        session_id = f"session_{uuid.uuid4().hex}"

        # Determine agent type
        agent_type = request.agent_type or "meta"

        if agent_type == "meta":
            # Create a meta-agent with multiple specialized subsystems
            research_agent = agent_factory.create_agent(
                agent_type="research",
                system_prompt="You are a research specialist that provides in-depth, accurate information based on available knowledge."
            )

            conversation_agent = agent_factory.create_agent(
                agent_type="conversation",
                system_prompt="You are a conversation expert that helps maintain engaging, relevant, and structured discussions."
            )

            task_agent = agent_factory.create_agent(
                agent_type="task",
                system_prompt="You are a task management specialist that helps organize, track, and complete tasks efficiently."
            )

            meta_agent = MetaAgent(
                provider_service=provider_service,
                system_prompt="You are an advanced assistant that coordinates multiple specialized systems to provide optimal responses."
            )

            # Add subsystems to meta-agent
            meta_agent.add_subsystem(AgentSubsystem(
                name="research",
                agent=research_agent,
                role="Knowledge and information retrieval specialist"
            ))

            meta_agent.add_subsystem(AgentSubsystem(
                name="conversation",
                agent=conversation_agent,
                role="Conversation flow and engagement specialist"
            ))

            meta_agent.add_subsystem(AgentSubsystem(
                name="task",
                agent=task_agent,
                role="Task management and organization specialist"
            ))

            agent = meta_agent
        else:
            # Create a specialized agent
            agent = agent_factory.create_agent(
                agent_type=agent_type,
                system_prompt=f"You are a helpful assistant specializing in {agent_type} tasks."
            )

        agent_sessions[session_id] = agent
    else:
        agent = agent_sessions[session_id]

    # Process the message
    try:
        response = await agent.process_message(request.message, user_id)
        return MessageResponse(response=response, session_id=session_id)
    except Exception as e:
        logger.exception("Error processing message")
        raise HTTPException(status_code=500, detail=f"Error processing message: {str(e)}")

# Startup event (recent FastAPI versions deprecate on_event in favor of
# lifespan handlers; on_event is kept here to match the rest of the guide)
@app.on_event("startup")
async def startup_event():
    # Initialize services
    await provider_service.initialize()
    await knowledge_service.initialize()
    await memory_service.initialize()
    await task_service.initialize()

    logger.info("All services initialized")

# Shutdown event
@app.on_event("shutdown")
async def shutdown_event():
    # Cleanup
    await provider_service.cleanup()
    await knowledge_service.cleanup()
    await memory_service.cleanup()
    await task_service.cleanup()

    logger.info("All services shut down")

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)
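
To exercise the endpoint, a client sends a POST request to `/api/v1/chat` with a bearer token and a JSON body. The following is a minimal client sketch using only the standard library. The helper names `build_chat_request` and `send` are hypothetical, and the base URL assumes the server is running locally on port 8000 as in the listing above.

```python
import json
import urllib.request
from typing import Any, Dict, Optional


def build_chat_request(
    base_url: str,
    message: str,
    api_key: str,
    session_id: Optional[str] = None,
    agent_type: Optional[str] = None,
) -> urllib.request.Request:
    """Build (but do not send) a POST request for the chat endpoint."""
    payload: Dict[str, Any] = {"message": message}
    if session_id:
        payload["session_id"] = session_id  # continue an existing session
    if agent_type:
        payload["agent_type"] = agent_type  # e.g. "meta" or "research"
    return urllib.request.Request(
        f"{base_url}/api/v1/chat",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def send(request: urllib.request.Request) -> Dict[str, Any]:
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read())


if __name__ == "__main__":
    first = send(build_chat_request("http://localhost:8000", "Hello", "demo_api_key"))
    # Reuse the returned session_id so the server keeps the same agent
    follow_up = send(build_chat_request(
        "http://localhost:8000", "Tell me more", "demo_api_key",
        session_id=first["session_id"],
    ))
    print(follow_up["response"])
```

Passing the `session_id` from the first response back in later requests is what keeps the conversation attached to the same agent instance on the server.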

Conclusion

This implementation integrates OpenAI's Responses API into a modular agent architecture. The modular design supports specialized capabilities: knowledge retrieval, conversation management, contextual awareness, and task coordination.

Key architectural features include:

  1. Abstraction Layers: The system maintains clean separation between provider services, agent logic, and specialized capabilities.

  2. Contextual Enhancement: Agents utilize memory systems and knowledge retrieval to maintain context and provide more relevant responses.

  3. Tool Integration: The implementation leverages OpenAI's function calling capabilities to integrate with external systems and services.

  4. Meta-Agent Architecture: The meta-agent pattern composes specialized agents into one system and routes each query to the subsystem best suited to handle it.

  5. Stateful Conversations: All agents maintain conversation state, allowing for continuity and context preservation across interactions.

This architecture provides a foundation for AI applications that use both OpenAI's cloud models and local Ollama models, with the MCP system's routing layer deciding which provider handles each request.

Hybrid Intelligence Architecture: Integrating Ollama with OpenAI's Agent SDK

Theoretical Framework for Hybrid Model Inference

Integrating Ollama with OpenAI's Agent SDK produces a hybrid architecture in which inference can run either in the cloud or on local hardware. This section describes an orchestration layer that routes each inference task between the two based on contextual parameters: tool requirements, privacy sensitivity, content complexity, and token budget.
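
The routing idea can be sketched as an ordered list of checks. The example below is a simplified illustration rather than the service itself: `RoutingInput`, `RoutingPolicy`, and `choose_provider` are hypothetical names, the complexity score is assumed to be precomputed, and the order of checks (tools, privacy, complexity, token budget) follows the `_auto_route` method in the provider service listing.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class RoutingInput:
    text: str
    needs_tools: bool = False
    complexity: float = 0.0  # assumed precomputed, in the range 0 to 1
    token_budget: Optional[int] = None


@dataclass
class RoutingPolicy:
    complexity_threshold: float = 0.65
    sensitive_terms: List[str] = field(default_factory=list)


def choose_provider(req: RoutingInput, policy: RoutingPolicy) -> str:
    """Return "openai" or "ollama" using first-match-wins checks."""
    # 1. Tool use goes to the cloud provider
    if req.needs_tools:
        return "openai"
    # 2. Privacy-sensitive content stays local (blank terms are ignored)
    lowered = req.text.lower()
    if any(term.lower() in lowered for term in policy.sensitive_terms if term):
        return "ollama"
    # 3. Complex requests go to the cloud provider
    if req.complexity > policy.complexity_threshold:
        return "openai"
    # 4. Requests over the token budget go to the cloud provider
    #    (tokens estimated at roughly 4 characters each)
    if req.token_budget is not None and len(req.text) // 4 > req.token_budget:
        return "openai"
    # Default: handle standard requests locally
    return "ollama"


if __name__ == "__main__":
    policy = RoutingPolicy(sensitive_terms=["ssn", "password"])
    print(choose_provider(RoutingInput("What is my password hint?"), policy))
    print(choose_provider(RoutingInput("Compare two designs", complexity=0.8), policy))
```

Because the first matching check wins, the order encodes priority. With this order a request that needs tools goes to the cloud even if it contains sensitive terms, so moving the privacy check first is worth considering when data sovereignty must take precedence.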

Ollama Integration Architecture

Core Integration Components

python
# app/services/ollama_service.py
import os
import json
import logging
import time  # used for the "created" timestamps in formatted responses
from typing import List, Dict, Any, Optional, Union
import aiohttp
import asyncio
from tenacity import retry, stop_after_attempt, wait_exponential

from app.models.message import Message, MessageRole
from app.config import settings
12
13logger = logging.getLogger(__name__)
14
15class OllamaService:
16 """Service for interacting with Ollama's local inference capabilities."""
17
18 def __init__(self):
19 self.base_url = settings.OLLAMA_HOST
20 self.default_model = settings.OLLAMA_MODEL
21 self.timeout = aiohttp.ClientTimeout(total=settings.REQUEST_TIMEOUT)
22 self.session = None
23
24 # Capability mapping for different models
25 self.model_capabilities = {
26 "llama2": {
27 "supports_tools": False,
28 "context_window": 4096,
29 "strengths": ["general_knowledge", "reasoning"],
30 "max_tokens": 2048
31 },
32 "codellama": {
33 "supports_tools": False,
34 "context_window": 8192,
35 "strengths": ["code_generation", "technical_explanation"],
36 "max_tokens": 2048
37 },
38 "mistral": {
39 "supports_tools": False,
40 "context_window": 8192,
41 "strengths": ["instruction_following", "reasoning"],
42 "max_tokens": 2048
43 },
44 "dolphin-mistral": {
45 "supports_tools": False,
46 "context_window": 8192,
47 "strengths": ["conversational", "creative_writing"],
48 "max_tokens": 2048
49 }
50 }
51
52 async def initialize(self):
53 """Initialize the Ollama service."""
54 self.session = aiohttp.ClientSession(timeout=self.timeout)
55
56 # Verify connectivity
57 try:
58 await self.list_models()
59 logger.info("Ollama service initialized successfully")
60 except Exception as e:
61 logger.error(f"Failed to initialize Ollama service: {str(e)}")
62 raise
63
64 async def cleanup(self):
65 """Clean up resources."""
66 if self.session:
67 await self.session.close()
68 self.session = None
69
70 @retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=1, max=10))
71 async def list_models(self) -> List[Dict[str, Any]]:
72 """List available models in Ollama."""
73 if not self.session:
74 self.session = aiohttp.ClientSession(timeout=self.timeout)
75
76 async with self.session.get(f"{self.base_url}/api/tags") as response:
77 if response.status != 200:
78 error_text = await response.text()
79 raise Exception(f"Failed to list models: {error_text}")
80
81 data = await response.json()
82 return data.get("models", [])
83
    async def generate_completion(
        self,
        messages: List[Dict[str, str]],
        model: Optional[str] = None,
        temperature: float = 0.7,
        max_tokens: Optional[int] = None,
        tools: Optional[List[Dict[str, Any]]] = None,
        stream: bool = False,
        **kwargs
    ) -> Dict[str, Any]:
        """Generate a completion using Ollama.

        With stream=True this returns an async generator of chunks rather
        than a single response dictionary.
        """
        model_name = model or self.default_model

        # Check if specified model is available
        try:
            available_models = await self.list_models()
            model_names = {m.get("name", "") for m in available_models}
            # Ollama lists models with their tag (e.g. "llama2:latest"),
            # so also accept a match on the untagged base name
            base_names = {name.split(":")[0] for name in model_names}

            if model_name not in model_names and model_name not in base_names:
                fallback_model = self.default_model
                logger.warning(
                    f"Model '{model_name}' not available in Ollama. "
                    f"Using fallback model '{fallback_model}'."
                )
                model_name = fallback_model
        except Exception as e:
            logger.error(f"Error checking model availability: {str(e)}")
            model_name = self.default_model

        # Get model capabilities
        model_base_name = model_name.split(':')[0] if ':' in model_name else model_name
        capabilities = self.model_capabilities.get(
            model_base_name,
            {"supports_tools": False, "context_window": 4096, "max_tokens": 2048}
        )

        # Check if tools are requested but not supported
        if tools and not capabilities["supports_tools"]:
            logger.warning(
                f"Model '{model_name}' does not support tools. "
                "Tool functionality will be simulated with prompt engineering."
            )
            # We'll handle this by incorporating tool descriptions into the prompt

        # Format messages for Ollama
        prompt = self._format_messages_for_ollama(messages, tools)

        # Set max_tokens based on capabilities if not provided
        if max_tokens is None:
            max_tokens = capabilities["max_tokens"]
        else:
            max_tokens = min(max_tokens, capabilities["max_tokens"])

        # Prepare request payload
        payload = {
            "model": model_name,
            "prompt": prompt,
            "stream": stream,
            "options": {
                "temperature": temperature,
                "num_predict": max_tokens
            }
        }

        if stream:
            # _stream_completion is an async generator, so it must be
            # returned for the caller to iterate, not awaited
            return self._stream_completion(payload)
        else:
            return await self._generate_completion_sync(payload)
152
153 async def _generate_completion_sync(self, payload: Dict[str, Any]) -> Dict[str, Any]:
154 """Generate a completion synchronously."""
155 if not self.session:
156 self.session = aiohttp.ClientSession(timeout=self.timeout)
157
158 try:
159 async with self.session.post(
160 f"{self.base_url}/api/generate",
161 json=payload
162 ) as response:
163 if response.status != 200:
164 error_text = await response.text()
165 raise Exception(f"Ollama generate error: {error_text}")
166
167 result = await response.json()
168
169 # Format the response to match OpenAI's format for consistency
170 formatted_response = self._format_ollama_response(result, payload)
171 return formatted_response
172
173 except Exception as e:
174 logger.error(f"Error generating completion: {str(e)}")
175 raise
176
177 async def _stream_completion(self, payload: Dict[str, Any]):
178 """Stream a completion."""
179 if not self.session:
180 self.session = aiohttp.ClientSession(timeout=self.timeout)
181
182 try:
183 async with self.session.post(
184 f"{self.base_url}/api/generate",
185 json=payload,
186 timeout=aiohttp.ClientTimeout(total=60)
187 ) as response:
188 if response.status != 200:
189 error_text = await response.text()
190 raise Exception(f"Ollama generate error: {error_text}")
191
192 # Stream the response
193 full_text = ""
194 async for line in response.content:
195 if not line:
196 continue
197
198 try:
199 chunk = json.loads(line)
200 text_chunk = chunk.get("response", "")
201 full_text += text_chunk
202
203 # Yield formatted chunk for streaming
204 yield self._format_ollama_stream_chunk(text_chunk)
205
206 # Check if done
207 if chunk.get("done", False):
208 break
209 except json.JSONDecodeError:
210 logger.warning(f"Invalid JSON in stream: {line}")
211
212 # Send the final done chunk
213 yield self._format_ollama_stream_chunk("", done=True, full_text=full_text)
214
215 except Exception as e:
216 logger.error(f"Error streaming completion: {str(e)}")
217 raise
218
219 def _format_messages_for_ollama(
220 self,
221 messages: List[Dict[str, str]],
222 tools: Optional[List[Dict[str, Any]]] = None
223 ) -> str:
224 """Format messages for Ollama."""
225 formatted_messages = []
226
227 # Add tools descriptions if provided
228 if tools:
229 tools_description = self._format_tools_description(tools)
230 formatted_messages.append(f"[System]\n{tools_description}\n")
231
232 for msg in messages:
233 role = msg["role"]
234 content = msg["content"] or ""
235
236 if role == "system":
237 formatted_messages.append(f"[System]\n{content}")
238 elif role == "user":
239 formatted_messages.append(f"[User]\n{content}")
240 elif role == "assistant":
241 formatted_messages.append(f"[Assistant]\n{content}")
242 elif role == "tool":
243 # Format tool responses
244 tool_call_id = msg.get("tool_call_id", "unknown")
245 formatted_messages.append(f"[Tool Result: {tool_call_id}]\n{content}")
246
247 # Add final prompt for assistant response
248 formatted_messages.append("[Assistant]\n")
249
250 return "\n\n".join(formatted_messages)
251
252 def _format_tools_description(self, tools: List[Dict[str, Any]]) -> str:
253 """Format tools description for inclusion in the prompt."""
254 tools_text = ["You have access to the following tools:"]
255
256 for tool in tools:
257 if tool.get("type") == "function":
258 function = tool["function"]
259 function_name = function["name"]
260 function_description = function.get("description", "")
261
262 tools_text.append(f"Tool: {function_name}")
263 tools_text.append(f"Description: {function_description}")
264
265 # Format parameters if available
266 if "parameters" in function:
267 parameters = function["parameters"]
268 if "properties" in parameters:
269 tools_text.append("Parameters:")
270 for param_name, param_details in parameters["properties"].items():
271 param_type = param_details.get("type", "unknown")
272 param_desc = param_details.get("description", "")
273 required = "Required" if param_name in parameters.get("required", []) else "Optional"
274 tools_text.append(f" - {param_name} ({param_type}, {required}): {param_desc}")
275
276 tools_text.append("") # Empty line between tools
277
278 tools_text.append("""
279When you need to use a tool, specify it clearly using the format:
280
281<tool>
282{
283 "name": "tool_name",
284 "parameters": {
285 "param1": "value1",
286 "param2": "value2"
287 }
288}
289</tool>
290
291Wait for the tool result before continuing.
292""")
293
294 return "\n".join(tools_text)
295
    def _format_ollama_response(self, result: Dict[str, Any], request: Dict[str, Any]) -> Dict[str, Any]:
        """Format Ollama response to match OpenAI's format."""
        import time
        import uuid

        response_text = result.get("response", "")

        # Check for tool calls in the response
        tool_calls = self._extract_tool_calls(response_text)

        # Prefer the token counts Ollama reports; otherwise fall back to a
        # rough approximation of 4 characters per token
        prompt_tokens = result.get("prompt_eval_count") or len(request["prompt"]) // 4
        completion_tokens = result.get("eval_count") or len(response_text) // 4

        response = {
            # /api/generate responses carry no ID, so generate one when absent
            "id": f"ollama-{result.get('id') or uuid.uuid4().hex[:8]}",
            "object": "chat.completion",
            # Ollama's "created_at" is an ISO 8601 string, which int() cannot
            # parse; record the current Unix time instead
            "created": int(time.time()),
            "model": request["model"],
            "provider": "ollama",
            "usage": {
                "prompt_tokens": prompt_tokens,
                "completion_tokens": completion_tokens,
                "total_tokens": prompt_tokens + completion_tokens
            },
            "message": {
                "role": "assistant",
                "content": self._clean_tool_calls_from_text(response_text) if tool_calls else response_text,
                "tool_calls": tool_calls
            }
        }

        return response
326
    def _format_ollama_stream_chunk(
        self,
        chunk_text: str,
        done: bool = False,
        full_text: Optional[str] = None
    ) -> Dict[str, Any]:
        """Format a streaming chunk to match OpenAI's format."""
        import time

        if done:
            # Final chunk might include tool calls. Test `done` alone so an
            # empty completion still ends with finish_reason "stop".
            tool_calls = self._extract_tool_calls(full_text or "")

            return {
                "id": f"ollama-chunk-{id(chunk_text)}",
                "object": "chat.completion.chunk",
                "created": int(time.time()),
                "model": self.default_model,
                "choices": [{
                    "index": 0,
                    "delta": {
                        "content": "",
                        "tool_calls": tool_calls
                    },
                    "finish_reason": "stop"
                }]
            }
        else:
            return {
                "id": f"ollama-chunk-{id(chunk_text)}",
                "object": "chat.completion.chunk",
                "created": int(time.time()),
                "model": self.default_model,
                "choices": [{
                    "index": 0,
                    "delta": {
                        "content": chunk_text
                    },
                    "finish_reason": None
                }]
            }
367
368 def _extract_tool_calls(self, text: str) -> Optional[List[Dict[str, Any]]]:
369 """Extract tool calls from response text."""
370 import re
371 import uuid
372
373 # Look for tool calls in the format <tool>...</tool>
374 tool_pattern = re.compile(r'<tool>(.*?)</tool>', re.DOTALL)
375 matches = tool_pattern.findall(text)
376
377 if not matches:
378 return None
379
380 tool_calls = []
381 for i, match in enumerate(matches):
382 try:
383 # Try to parse as JSON
384 tool_data = json.loads(match.strip())
385
386 tool_calls.append({
387 "id": f"call_{uuid.uuid4().hex[:8]}",
388 "type": "function",
389 "function": {
390 "name": tool_data.get("name", "unknown_tool"),
391 "arguments": json.dumps(tool_data.get("parameters", {}))
392 }
393 })
394 except json.JSONDecodeError:
395 # If not valid JSON, try to extract name and arguments using regex
396 name_match = re.search(r'"name"\s*:\s*"([^"]+)"', match)
397 args_match = re.search(r'"parameters"\s*:\s*(\{.*\})', match)
398
399 if name_match:
400 tool_name = name_match.group(1)
401 tool_args = "{}" if not args_match else args_match.group(1)
402
403 tool_calls.append({
404 "id": f"call_{uuid.uuid4().hex[:8]}",
405 "type": "function",
406 "function": {
407 "name": tool_name,
408 "arguments": tool_args
409 }
410 })
411
412 return tool_calls if tool_calls else None
413
414 def _clean_tool_calls_from_text(self, text: str) -> str:
415 """Remove tool calls from response text."""
416 import re
417
418 # Remove <tool>...</tool> blocks
419 cleaned_text = re.sub(r'<tool>.*?</tool>', '', text, flags=re.DOTALL)
420
421 # Remove any leftover tool usage instructions
422 cleaned_text = re.sub(r'I will use a tool to help with this\.', '', cleaned_text)
423 cleaned_text = re.sub(r'Let me use the .* tool\.', '', cleaned_text)
424
425 # Clean up multiple newlines
426 cleaned_text = re.sub(r'\n{3,}', '\n\n', cleaned_text)
427
428 return cleaned_text.strip()
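
The prompt-based tool convention can be checked without a running model. The sketch below is a simplified, standalone version of the extraction and cleanup steps: the function names are hypothetical, `get_weather` is an invented tool used only for illustration, and malformed blocks are skipped here rather than recovered with a regex fallback as in the service above.

```python
import json
import re
from typing import Any, Dict, List

# Non-greedy match so several <tool> blocks in one reply stay separate
TOOL_BLOCK = re.compile(r"<tool>(.*?)</tool>", re.DOTALL)


def extract_tool_calls(text: str) -> List[Dict[str, Any]]:
    """Parse every well-formed <tool>{...}</tool> block in a model reply."""
    calls = []
    for raw in TOOL_BLOCK.findall(text):
        try:
            data = json.loads(raw.strip())
        except json.JSONDecodeError:
            continue  # this sketch skips blocks that are not valid JSON
        if not isinstance(data, dict):
            continue  # a tool call must be a JSON object
        calls.append({
            "name": data.get("name", "unknown_tool"),
            "arguments": data.get("parameters", {}),
        })
    return calls


def strip_tool_calls(text: str) -> str:
    """Remove tool blocks so only the user-facing text remains."""
    return TOOL_BLOCK.sub("", text).strip()


if __name__ == "__main__":
    reply = (
        "Checking the forecast.\n"
        '<tool>\n{"name": "get_weather", "parameters": {"city": "Oslo"}}\n</tool>'
    )
    print(extract_tool_calls(reply))
    print(strip_tool_calls(reply))
```

Because the model only imitates the format from instructions in the prompt, output can deviate from it, so callers should treat the parsed name and arguments as untrusted input and validate them before executing anything.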

Provider Selection Service

python
1# app/services/provider_service.py
2import os
3import json
4import logging
5import time
6from typing import List, Dict, Any, Optional, Union, AsyncGenerator
7import asyncio
8from enum import Enum
9import hashlib
10
11import openai
12from openai import AsyncOpenAI
13from app.services.ollama_service import OllamaService
14from app.config import settings
15
16logger = logging.getLogger(__name__)
17
18class Provider(str, Enum):
19 OPENAI = "openai"
20 OLLAMA = "ollama"
21 AUTO = "auto"
22
23class ModelSelectionCriteria:
24 """Criteria for model selection in auto-routing."""
25 def __init__(
26 self,
27 complexity_threshold: float = 0.65,
28 privacy_sensitive_tokens: List[str] = None,
29 latency_requirement: Optional[float] = None,
30 token_budget: Optional[int] = None,
31 tool_requirements: Optional[List[str]] = None
32 ):
33 self.complexity_threshold = complexity_threshold
34 self.privacy_sensitive_tokens = privacy_sensitive_tokens or []
35 self.latency_requirement = latency_requirement
36 self.token_budget = token_budget
37 self.tool_requirements = tool_requirements
38
39class ProviderService:
40 """Service for routing requests to the appropriate provider."""
41
42 def __init__(self):
43 self.openai_client = None
44 self.ollama_service = OllamaService()
45 self.model_selection_criteria = ModelSelectionCriteria(
46 complexity_threshold=settings.COMPLEXITY_THRESHOLD,
47 privacy_sensitive_tokens=settings.PRIVACY_SENSITIVE_TOKENS.split(",") if hasattr(settings, "PRIVACY_SENSITIVE_TOKENS") else []
48 )
49
50 # Model mappings
51 self.default_openai_model = settings.OPENAI_MODEL
52 self.default_ollama_model = settings.OLLAMA_MODEL
53
54 # Response cache
55 self.cache_enabled = getattr(settings, "ENABLE_RESPONSE_CACHE", False)
56 self.cache = {}
57 self.cache_ttl = getattr(settings, "RESPONSE_CACHE_TTL", 3600) # 1 hour default
58
59 async def initialize(self):
60 """Initialize the provider service."""
61 # Initialize OpenAI client
62 self.openai_client = AsyncOpenAI(
63 api_key=settings.OPENAI_API_KEY,
64 organization=getattr(settings, "OPENAI_ORG_ID", None)
65 )
66
67 # Initialize Ollama service
68 await self.ollama_service.initialize()
69
70 logger.info("Provider service initialized")
71
72 async def cleanup(self):
73 """Clean up resources."""
74 await self.ollama_service.cleanup()
75
76 async def generate_completion(
77 self,
78 messages: List[Dict[str, str]],
79 model: Optional[str] = None,
80 provider: Optional[Union[str, Provider]] = None,
81 tools: Optional[List[Dict[str, Any]]] = None,
82 stream: bool = False,
83 temperature: float = 0.7,
84 max_tokens: Optional[int] = None,
85 user: Optional[str] = None,
86 **kwargs
87 ) -> Dict[str, Any]:
88 """Generate a completion from the selected provider."""
89 # Determine the provider and model
90 selected_provider, selected_model = await self._select_provider_and_model(
91 messages, model, provider, tools, **kwargs
92 )
93
94 # Check cache if enabled and not streaming
95 if self.cache_enabled and not stream:
96 cache_key = self._generate_cache_key(
97 messages, selected_provider, selected_model, tools, temperature, max_tokens, kwargs
98 )
99 cached_response = self._get_from_cache(cache_key)
100 if cached_response:
101 logger.info(f"Cache hit for {selected_provider}:{selected_model}")
102 return cached_response
103
104 # Generate completion based on selected provider
105 try:
106 if selected_provider == Provider.OPENAI:
107 response = await self._generate_openai_completion(
108 messages, selected_model, tools, stream, temperature, max_tokens, user, **kwargs
109 )
110 else: # OLLAMA
111 response = await self._generate_ollama_completion(
112 messages, selected_model, tools, stream, temperature, max_tokens, **kwargs
113 )
114
115 # Add provider info and cache if appropriate
116 if not stream and response:
117 response["provider"] = selected_provider.value
118 if self.cache_enabled:
119 self._add_to_cache(cache_key, response)
120
121 return response
122 except Exception as e:
123 logger.error(f"Error generating completion with {selected_provider}: {str(e)}")
124
125 # Try fallback if auto-routing was enabled
126 if provider == Provider.AUTO:
127 fallback_provider = Provider.OLLAMA if selected_provider == Provider.OPENAI else Provider.OPENAI
128 logger.info(f"Attempting fallback to {fallback_provider}")
129
130 try:
131 if fallback_provider == Provider.OPENAI:
132 fallback_model = self.default_openai_model
133 response = await self._generate_openai_completion(
134 messages, fallback_model, tools, stream, temperature, max_tokens, user, **kwargs
135 )
136 else: # OLLAMA
137 fallback_model = self.default_ollama_model
138 response = await self._generate_ollama_completion(
139 messages, fallback_model, tools, stream, temperature, max_tokens, **kwargs
140 )
141
142 if not stream and response:
143 response["provider"] = fallback_provider.value
144 # Don't cache fallback responses
145
146 return response
147 except Exception as fallback_error:
148 logger.error(f"Fallback also failed: {str(fallback_error)}")
149
150 # Re-raise the original error if we couldn't fall back
151 raise
152
153 async def stream_completion(
154 self,
155 messages: List[Dict[str, str]],
156 model: Optional[str] = None,
157 provider: Optional[Union[str, Provider]] = None,
158 tools: Optional[List[Dict[str, Any]]] = None,
159 temperature: float = 0.7,
160 max_tokens: Optional[int] = None,
161 user: Optional[str] = None,
162 **kwargs
163 ) -> AsyncGenerator[Dict[str, Any], None]:
164 """Stream a completion from the selected provider."""
165 # Always stream with this method
166 kwargs["stream"] = True
167
168 # Determine the provider and model
169 selected_provider, selected_model = await self._select_provider_and_model(
170 messages, model, provider, tools, **kwargs
171 )
172
173 try:
174 if selected_provider == Provider.OPENAI:
175 async for chunk in self._stream_openai_completion(
176 messages, selected_model, tools, temperature, max_tokens, user, **kwargs
177 ):
178 chunk["provider"] = selected_provider.value
179 yield chunk
180 else: # OLLAMA
181 async for chunk in self._stream_ollama_completion(
182 messages, selected_model, tools, temperature, max_tokens, **kwargs
183 ):
184 chunk["provider"] = selected_provider.value
185 yield chunk
186 except Exception as e:
187 logger.error(f"Error streaming completion with {selected_provider}: {str(e)}")
188
189 # Try fallback if auto-routing was enabled
190 if provider == Provider.AUTO:
191 fallback_provider = Provider.OLLAMA if selected_provider == Provider.OPENAI else Provider.OPENAI
192 logger.info(f"Attempting fallback to {fallback_provider}")
193
194 try:
195 if fallback_provider == Provider.OPENAI:
196 fallback_model = self.default_openai_model
197 async for chunk in self._stream_openai_completion(
198 messages, fallback_model, tools, temperature, max_tokens, user, **kwargs
199 ):
200 chunk["provider"] = fallback_provider.value
201 yield chunk
202 else: # OLLAMA
203 fallback_model = self.default_ollama_model
204 async for chunk in self._stream_ollama_completion(
205 messages, fallback_model, tools, temperature, max_tokens, **kwargs
206 ):
207 chunk["provider"] = fallback_provider.value
208 yield chunk
209 except Exception as fallback_error:
210 logger.error(f"Fallback streaming also failed: {str(fallback_error)}")
211 # Nothing more we can do here
212
213 # For streaming, we don't re-raise since we've already started the response
214
    async def _select_provider_and_model(
        self,
        messages: List[Dict[str, str]],
        model: Optional[str] = None,
        provider: Optional[Union[str, Provider]] = None,
        tools: Optional[List[Dict[str, Any]]] = None,
        **kwargs
    ) -> tuple[Provider, str]:
        """Select the provider and model based on input and criteria."""
        # Handle explicit provider/model specification
        if model and ":" in model:
            # Format: "provider:model", e.g. "openai:gpt-4" or "ollama:llama2".
            # Ollama tags also contain ":" (e.g. "llama2:13b"), so only treat
            # the prefix as a provider when it actually names one.
            provider_str, model_name = model.split(":", 1)
            if provider_str.lower() in (Provider.OPENAI.value, Provider.OLLAMA.value):
                return Provider(provider_str.lower()), model_name

        # Handle explicit provider with default model
        if provider and provider != Provider.AUTO:
            selected_provider = Provider(provider) if isinstance(provider, str) else provider
            selected_model = model or (
                self.default_openai_model if selected_provider == Provider.OPENAI
                else self.default_ollama_model
            )
            return selected_provider, selected_model

        # If model specified without provider, infer provider
        if model:
            # Heuristic: OpenAI models typically start with "gpt-" or "text-"
            if model.startswith(("gpt-", "text-")):
                return Provider.OPENAI, model
            else:
                return Provider.OLLAMA, model

        # Auto-routing based on message content and requirements
        if not provider or provider == Provider.AUTO:
            selected_provider = await self._auto_route(messages, tools, **kwargs)
            selected_model = (
                self.default_openai_model if selected_provider == Provider.OPENAI
                else self.default_ollama_model
            )
            return selected_provider, selected_model

        # Default fallback
        return Provider.OPENAI, self.default_openai_model
259
260 async def _auto_route(
261 self,
262 messages: List[Dict[str, str]],
263 tools: Optional[List[Dict[str, Any]]] = None,
264 **kwargs
265 ) -> Provider:
266 """Automatically route to the appropriate provider based on content and requirements."""
267 # 1. Check for tool requirements
268 if tools:
269 # If tools are required, prefer OpenAI as Ollama's tool support is limited
270 return Provider.OPENAI
271
272 # 2. Check for privacy concerns
273 if self._contains_sensitive_information(messages):
274 logger.info("Privacy sensitive information detected, routing to Ollama")
275 return Provider.OLLAMA
276
277 # 3. Assess complexity
278 complexity_score = await self._assess_complexity(messages)
279 logger.info(f"Content complexity score: {complexity_score}")
280
281 if complexity_score > self.model_selection_criteria.complexity_threshold:
282 logger.info(f"High complexity content ({complexity_score}), routing to OpenAI")
283 return Provider.OPENAI
284
285 # 4. Consider token budget (if specified)
286 token_budget = kwargs.get("token_budget") or self.model_selection_criteria.token_budget
287 if token_budget:
288 estimated_tokens = self._estimate_token_count(messages)
289 if estimated_tokens > token_budget:
290 logger.info(f"Token budget ({token_budget}) exceeded ({estimated_tokens}), routing to OpenAI")
291 return Provider.OPENAI
292
293 # Default to Ollama for standard requests
294 logger.info("Standard request, routing to Ollama")
295 return Provider.OLLAMA
296
    def _contains_sensitive_information(self, messages: List[Dict[str, str]]) -> bool:
        """Check if messages contain privacy-sensitive information."""
        # Drop blank entries: an empty string is a substring of every text,
        # so a blank token (e.g. from splitting an empty setting) would flag
        # every request as sensitive
        sensitive_tokens = [
            token.strip().lower()
            for token in self.model_selection_criteria.privacy_sensitive_tokens
            if token and token.strip()
        ]
        if not sensitive_tokens:
            return False

        combined_text = " ".join(msg.get("content", "") or "" for msg in messages).lower()
        return any(token in combined_text for token in sensitive_tokens)
311
    async def _assess_complexity(self, messages: List[Dict[str, str]]) -> float:
        """Assess the complexity of the messages."""
        # Simple heuristics for complexity:
        # 1. Length of content
        # 2. Presence of complex tokens (technical terms, specialized vocabulary)
        # 3. Sentence complexity

        user_messages = [msg.get("content", "") for msg in messages if msg.get("role") == "user"]
        if not user_messages:
            return 0.0

        last_message = user_messages[-1] or ""

        # 1. Length factor (normalized to 0-1 range)
        length = len(last_message)
        length_factor = min(length / 1000, 1.0) * 0.3  # 30% weight to length

        # 2. Complexity indicators
        complex_terms = [
            "analyze", "synthesize", "evaluate", "compare", "contrast",
            "explain", "technical", "detailed", "comprehensive", "algorithm",
            "implementation", "architecture", "design", "optimize", "complex"
        ]

        term_count = sum(1 for term in complex_terms if term in last_message.lower())
        term_factor = min(term_count / 10, 1.0) * 0.4  # 40% weight to complex terms

        # 3. Sentence complexity (approximated by average sentence length)
        sentences = [s.strip() for s in last_message.split(".") if s.strip()]
        if sentences:
            avg_sentence_length = sum(len(s.split()) for s in sentences) / len(sentences)
            sentence_factor = min(avg_sentence_length / 25, 1.0) * 0.3  # 30% weight to sentence complexity
        else:
            sentence_factor = 0.0

        # Combined complexity score
        complexity = length_factor + term_factor + sentence_factor

        return complexity

    def _estimate_token_count(self, messages: List[Dict[str, str]]) -> int:
        """Estimate the token count for the messages."""
        # Simple approximation: 1 token ≈ 4 characters
        combined_text = " ".join([msg.get("content", "") or "" for msg in messages])
        return len(combined_text) // 4

    def _build_openai_kwargs(
        self,
        messages: List[Dict[str, str]],
        model: str,
        tools: Optional[List[Dict[str, Any]]],
        stream: bool,
        temperature: float,
        max_tokens: Optional[int],
        user: Optional[str],
        **kwargs
    ) -> Dict[str, Any]:
        """Assemble the request arguments shared by both OpenAI call paths."""
        completion_kwargs = {
            "model": model,
            "messages": messages,
            "temperature": temperature,
            "stream": stream
        }

        if max_tokens:
            completion_kwargs["max_tokens"] = max_tokens

        if tools:
            completion_kwargs["tools"] = tools

            # tool_choice is only meaningful when tools are supplied
            if "tool_choice" in kwargs:
                completion_kwargs["tool_choice"] = kwargs["tool_choice"]

        if "response_format" in kwargs:
            completion_kwargs["response_format"] = kwargs["response_format"]

        if user:
            completion_kwargs["user"] = user

        return completion_kwargs

    async def _generate_openai_completion(
        self,
        messages: List[Dict[str, str]],
        model: str,
        tools: Optional[List[Dict[str, Any]]] = None,
        stream: bool = False,
        temperature: float = 0.7,
        max_tokens: Optional[int] = None,
        user: Optional[str] = None,
        **kwargs
    ) -> Dict[str, Any]:
        """Generate a non-streaming completion using OpenAI."""
        # An async function cannot both `yield` chunks and `return` a value
        # (Python rejects it as a SyntaxError), so streaming is handled
        # separately in _stream_openai_completion.
        if stream:
            raise ValueError("Use _stream_openai_completion for streaming requests")

        completion_kwargs = self._build_openai_kwargs(
            messages, model, tools, False, temperature, max_tokens, user, **kwargs
        )
        response = await self.openai_client.chat.completions.create(**completion_kwargs)
        return response.model_dump()

    async def _stream_openai_completion(
        self,
        messages: List[Dict[str, str]],
        model: str,
        tools: Optional[List[Dict[str, Any]]] = None,
        temperature: float = 0.7,
        max_tokens: Optional[int] = None,
        user: Optional[str] = None,
        **kwargs
    ) -> AsyncGenerator[Dict[str, Any], None]:
        """Stream a completion from OpenAI, yielding each chunk as a dict."""
        completion_kwargs = self._build_openai_kwargs(
            messages, model, tools, True, temperature, max_tokens, user, **kwargs
        )
        response_stream = await self.openai_client.chat.completions.create(**completion_kwargs)

        async for chunk in response_stream:
            yield chunk.model_dump()

    async def _generate_ollama_completion(
        self,
        messages: List[Dict[str, str]],
        model: str,
        tools: Optional[List[Dict[str, Any]]] = None,
        stream: bool = False,
        temperature: float = 0.7,
        max_tokens: Optional[int] = None,
        **kwargs
    ) -> Dict[str, Any]:
        """Generate a completion using Ollama."""
        if stream:
            # For streaming, return the first chunk to maintain API consistency
            async for chunk in self.ollama_service.generate_completion(
                messages=messages,
                model=model,
                temperature=temperature,
                max_tokens=max_tokens,
                tools=tools,
                stream=True,
                **kwargs
            ):
                return chunk
        else:
            return await self.ollama_service.generate_completion(
                messages=messages,
                model=model,
                temperature=temperature,
                max_tokens=max_tokens,
                tools=tools,
                stream=False,
                **kwargs
            )

    async def _stream_ollama_completion(
        self,
        messages: List[Dict[str, str]],
        model: str,
        tools: Optional[List[Dict[str, Any]]] = None,
        temperature: float = 0.7,
        max_tokens: Optional[int] = None,
        **kwargs
    ) -> AsyncGenerator[Dict[str, Any], None]:
        """Stream a completion from Ollama."""
        async for chunk in self.ollama_service.generate_completion(
            messages=messages,
            model=model,
            temperature=temperature,
            max_tokens=max_tokens,
            tools=tools,
            stream=True,
            **kwargs
        ):
            yield chunk

    def _generate_cache_key(self, *args) -> str:
        """Generate a cache key based on the input parameters."""
        # Convert complex objects to JSON strings first
        args_str = json.dumps([
            arg if not isinstance(arg, (dict, list)) else json.dumps(arg, sort_keys=True)
            for arg in args
        ])
        # MD5 serves only as a fast fingerprint here, not as a security measure
        return hashlib.md5(args_str.encode()).hexdigest()

    def _get_from_cache(self, key: str) -> Optional[Dict[str, Any]]:
        """Get a response from cache if available and not expired."""
        if key not in self.cache:
            return None

        cached_item = self.cache[key]
        if time.time() - cached_item["timestamp"] > self.cache_ttl:
            # Expired
            del self.cache[key]
            return None

        return cached_item["response"]

    def _add_to_cache(self, key: str, response: Dict[str, Any]):
        """Add a response to the cache."""
        self.cache[key] = {
            "response": response,
            "timestamp": time.time()
        }

        # Simple cache size management - remove oldest if too many items
        max_cache_size = getattr(settings, "RESPONSE_CACHE_MAX_ITEMS", 1000)
        if len(self.cache) > max_cache_size:
            # Remove oldest 10% of items
            items_to_remove = max(1, int(max_cache_size * 0.1))
            oldest_keys = sorted(
                self.cache.keys(),
                key=lambda k: self.cache[k]["timestamp"]
            )[:items_to_remove]

            for old_key in oldest_keys:
                del self.cache[old_key]
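
To make the weighting in `_assess_complexity` concrete, the sketch below copies the same three factors into a standalone function and scores two sample prompts. The function name `complexity_score` and both prompts are illustrative and not part of the service.

```python
COMPLEX_TERMS = [
    "analyze", "synthesize", "evaluate", "compare", "contrast",
    "explain", "technical", "detailed", "comprehensive", "algorithm",
    "implementation", "architecture", "design", "optimize", "complex",
]

def complexity_score(text: str) -> float:
    """Standalone copy of the heuristic: length 30%, terms 40%, sentences 30%."""
    lowered = text.lower()
    length_factor = min(len(text) / 1000, 1.0) * 0.3
    term_count = sum(1 for term in COMPLEX_TERMS if term in lowered)
    term_factor = min(term_count / 10, 1.0) * 0.4
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    if sentences:
        avg_words = sum(len(s.split()) for s in sentences) / len(sentences)
        sentence_factor = min(avg_words / 25, 1.0) * 0.3
    else:
        sentence_factor = 0.0
    return length_factor + term_factor + sentence_factor

simple = "What time is it?"
hard = (
    "Analyze and compare the architecture and implementation of two "
    "caching designs, then explain in technical detail how to optimize "
    "the more complex algorithm for a comprehensive evaluation."
)

THRESHOLD = 0.65  # default COMPLEXITY_THRESHOLD
for prompt in (simple, hard):
    score = complexity_score(prompt)
    target = "openai" if score > THRESHOLD else "ollama"
    print(f"{score:.3f} -> {target}")
```

Because the term factor only saturates at ten matching terms, a short prompt has to be unusually keyword-dense to cross the default threshold of 0.65; most everyday questions stay on Ollama under this heuristic.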

Configuration Settings

python
# app/config.py
from typing import Dict, Optional

from dotenv import load_dotenv
from pydantic_settings import BaseSettings, SettingsConfigDict

# Load environment variables from .env file
load_dotenv()

class Settings(BaseSettings):
    # pydantic-settings v2 configuration (replaces the deprecated inner `class Config`)
    model_config = SettingsConfigDict(env_file=".env", env_file_encoding="utf-8")

    # API Keys and Authentication
    OPENAI_API_KEY: str
    OPENAI_ORG_ID: Optional[str] = None

    # Model Configuration
    OPENAI_MODEL: str = "gpt-4o"
    OLLAMA_MODEL: str = "llama2"
    OLLAMA_HOST: str = "http://localhost:11434"

    # System Behavior
    TEMPERATURE: float = 0.7
    MAX_TOKENS: int = 4096
    REQUEST_TIMEOUT: int = 120

    # Routing Configuration
    COMPLEXITY_THRESHOLD: float = 0.65
    # Comma-separated; split into a list before matching against messages
    PRIVACY_SENSITIVE_TOKENS: str = "password,secret,token,key,credential"

    # Caching Configuration
    ENABLE_RESPONSE_CACHE: bool = True
    RESPONSE_CACHE_TTL: int = 3600  # 1 hour
    RESPONSE_CACHE_MAX_ITEMS: int = 1000

    # Logging Configuration
    LOG_LEVEL: str = "INFO"

    # Database Configuration
    DATABASE_URL: Optional[str] = None

    # Advanced Ollama Configuration
    OLLAMA_MODELS_MAPPING: Dict[str, str] = {
        "gpt-3.5-turbo": "llama2",
        "gpt-4": "llama2",
        "gpt-4o": "mistral",
        "code-llama": "codellama"
    }

settings = Settings()
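
`PRIVACY_SENSITIVE_TOKENS` is a single comma-separated string, so it has to be split before it can be matched against message text. The sketch below shows one way to do that, and swaps the service's plain substring test for whole-word matching so that a short token such as `key` does not fire on `monkey`. Both helper names (`parse_sensitive_tokens`, `contains_sensitive`) are hypothetical, and whole-word matching is a substitution for illustration, not the behaviour of the code above.

```python
import re
from typing import List

def parse_sensitive_tokens(raw: str) -> List[str]:
    """Split a comma-separated setting into lowercase, non-empty tokens."""
    return [t.strip().lower() for t in raw.split(",") if t.strip()]

def contains_sensitive(text: str, tokens: List[str]) -> bool:
    """Return True if any token appears as a whole word (case-insensitive)."""
    if not tokens:
        return False
    pattern = r"\b(?:" + "|".join(re.escape(t) for t in tokens) + r")\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None

tokens = parse_sensitive_tokens("password,secret,token,key,credential")
print(contains_sensitive("My password is hunter2", tokens))   # True
print(contains_sensitive("The monkey ate a banana", tokens))  # False
```

The trade-off is recall: whole-word matching no longer catches `passwords` or `api_key` (the underscore counts as a word character), so a deployment that prefers over-routing to Ollama may want to keep substring matching.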

Model Selection and Configuration

The following catalog lists recommended Ollama models, their capabilities, and the use cases each is suited to:

python
# app/models/model_catalog.py
from dataclasses import dataclass
from typing import List

@dataclass
class ModelCapability:
    """Represents the capabilities of a model."""
    context_window: int
    strengths: List[str]
    supports_tools: bool
    recommended_temperature: float
    approximate_speed: str  # "fast", "medium", "slow"

# Ollama model catalog
OLLAMA_MODELS = {
    "llama2": ModelCapability(
        context_window=4096,
        strengths=["general_knowledge", "reasoning", "instruction_following"],
        supports_tools=False,
        recommended_temperature=0.7,
        approximate_speed="medium"
    ),
    "llama2:13b": ModelCapability(
        context_window=4096,
        strengths=["general_knowledge", "reasoning", "instruction_following"],
        supports_tools=False,
        recommended_temperature=0.7,
        approximate_speed="medium"
    ),
    "llama2:70b": ModelCapability(
        context_window=4096,
        strengths=["general_knowledge", "reasoning", "instruction_following"],
        supports_tools=False,
        recommended_temperature=0.65,
        approximate_speed="slow"
    ),
    "mistral": ModelCapability(
        context_window=8192,
        strengths=["instruction_following", "reasoning", "versatility"],
        supports_tools=False,
        recommended_temperature=0.7,
        approximate_speed="medium"
    ),
    "mistral:7b-instruct": ModelCapability(
        context_window=8192,
        strengths=["instruction_following", "chat", "versatility"],
        supports_tools=False,
        recommended_temperature=0.7,
        approximate_speed="medium"
    ),
    "codellama": ModelCapability(
        context_window=16384,
        strengths=["code_generation", "code_explanation", "technical_writing"],
        supports_tools=False,
        recommended_temperature=0.5,
        approximate_speed="medium"
    ),
    "codellama:34b": ModelCapability(
        context_window=16384,
        strengths=["code_generation", "code_explanation", "technical_writing"],
        supports_tools=False,
        recommended_temperature=0.5,
        approximate_speed="slow"
    ),
    "dolphin-mistral": ModelCapability(
        context_window=8192,
        strengths=["conversational", "creative", "helpfulness"],
        supports_tools=False,
        recommended_temperature=0.7,
        approximate_speed="medium"
    ),
    "neural-chat": ModelCapability(
        context_window=8192,
        strengths=["conversational", "instruction_following", "helpfulness"],
        supports_tools=False,
        recommended_temperature=0.7,
        approximate_speed="medium"
    ),
    "orca-mini": ModelCapability(
        context_window=4096,
        strengths=["efficiency", "general_knowledge", "basic_reasoning"],
        supports_tools=False,
        recommended_temperature=0.8,
        approximate_speed="fast"
    ),
    "vicuna": ModelCapability(
        context_window=4096,
        strengths=["conversational", "instruction_following"],
        supports_tools=False,
        recommended_temperature=0.7,
        approximate_speed="medium"
    ),
    "wizard-math": ModelCapability(
        context_window=4096,
        strengths=["mathematics", "problem_solving", "logical_reasoning"],
        supports_tools=False,
        recommended_temperature=0.5,
        approximate_speed="medium"
    ),
    "phi": ModelCapability(
        context_window=2048,
        strengths=["efficiency", "basic_tasks", "lightweight"],
        supports_tools=False,
        recommended_temperature=0.7,
        approximate_speed="fast"
    )
}

# OpenAI -> Ollama model mapping for fallback scenarios
OPENAI_TO_OLLAMA_MAPPING = {
    "gpt-3.5-turbo": "llama2",
    "gpt-3.5-turbo-16k": "mistral:7b-instruct",
    "gpt-4": "llama2:70b",
    "gpt-4o": "mistral",
    "gpt-4-turbo": "mistral",
    "code-llama": "codellama"
}

# Use case to model recommendations
USE_CASE_RECOMMENDATIONS = {
    "code_generation": ["codellama:34b", "codellama"],
    "creative_writing": ["dolphin-mistral", "mistral:7b-instruct"],
    "mathematical_reasoning": ["wizard-math", "llama2:70b"],
    "conversational": ["neural-chat", "dolphin-mistral"],
    "knowledge_intensive": ["llama2:70b", "mistral"],
    "resource_constrained": ["phi", "orca-mini"]
}

def recommend_ollama_model(use_case: str, performance_tier: str = "medium") -> str:
    """Recommend an Ollama model based on use case and performance requirements."""
    if use_case in USE_CASE_RECOMMENDATIONS:
        models = USE_CASE_RECOMMENDATIONS[use_case]

        # Filter by performance tier if needed
        if performance_tier == "high":
            for model in models:
                if ":70b" in model or ":34b" in model:
                    return model
            return models[0]  # Return first if no high-tier match
        elif performance_tier == "low":
            return "orca-mini" if use_case != "code_generation" else "codellama"
        else:  # medium tier
            return models[0]

    # Default recommendations
    if performance_tier == "high":
        return "llama2:70b"
    elif performance_tier == "low":
        return "orca-mini"
    else:
        return "mistral"
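
The recommendation and fallback tables refer to catalog entries by plain string, so a typo would silently fall through to default parameters at run time. A small consistency check, sketched below with trimmed stand-in data, can catch that at import or test time. The helper name `find_unknown_models` is hypothetical, and `codelama` is a deliberate typo for the demonstration.

```python
from typing import Dict, Iterable, List, Set

def find_unknown_models(
    catalog_names: Iterable[str],
    recommendations: Dict[str, List[str]],
    fallback_mapping: Dict[str, str],
) -> Set[str]:
    """Return model names referenced by the tables but missing from the catalog."""
    known = set(catalog_names)
    referenced = set(fallback_mapping.values())
    for models in recommendations.values():
        referenced.update(models)
    return referenced - known

# Trimmed stand-ins for OLLAMA_MODELS, USE_CASE_RECOMMENDATIONS and
# OPENAI_TO_OLLAMA_MAPPING.
catalog = ["llama2", "mistral", "codellama"]
recs = {"code_generation": ["codelama", "codellama"]}
mapping = {"gpt-4o": "mistral", "gpt-3.5-turbo": "llama2"}

print(find_unknown_models(catalog, recs, mapping))  # {'codelama'}
```

Against the real module, the call would take `OLLAMA_MODELS.keys()` together with the two mapping dictionaries, and an empty result means every reference resolves.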

Agent Adapter for Model Selection

python
# app/agents/adaptive_agent.py
import logging
import time
from typing import List, Dict, Any, Optional

from app.agents.base_agent import BaseAgent
from app.models.message import Message, MessageRole
from app.services.provider_service import ProviderService, Provider
from app.models.model_catalog import recommend_ollama_model, OLLAMA_MODELS

logger = logging.getLogger(__name__)

class AdaptiveAgent(BaseAgent):
    """Agent that adapts its model selection based on task requirements."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.last_used_model = None
        self.last_used_provider = None
        self.performance_metrics = {}

    async def _generate_response(self, user_id: str) -> str:
        """Generate a response with dynamic model selection."""
        # Extract the last user message
        last_user_message = next(
            (msg for msg in reversed(self.state.conversation_history)
             if msg.role == MessageRole.USER),
            None
        )

        if not last_user_message:
            return "I don't have any messages to respond to."

        # Analyze the message to determine the best model
        provider, model = await self._select_optimal_model(last_user_message.content)

        logger.info(f"Selected model for response: {provider}:{model}")

        # Track the selected model for monitoring
        self.last_used_model = model
        self.last_used_provider = provider

        # Get model-specific parameters
        params = self._get_model_parameters(provider, model)

        # Start timing for performance metrics (monotonic clock for durations)
        start_time = time.perf_counter()

        # Generate the response
        response = await self.provider_service.generate_completion(
            messages=[msg.model_dump() for msg in self.state.conversation_history],
            model=f"{provider}:{model}" if provider != "auto" else None,
            provider=provider,
            tools=self.tools,
            temperature=params.get("temperature", 0.7),
            max_tokens=params.get("max_tokens"),
            user=user_id
        )

        # Record performance metrics
        execution_time = time.perf_counter() - start_time
        self._update_performance_metrics(provider, model, execution_time, response)

        if response.get("tool_calls"):
            # Process tool calls if needed
            # ... (tool call handling code)
            pass

        return response["message"]["content"]

    async def _select_optimal_model(self, message: str) -> tuple[str, str]:
        """Select the optimal model based on message analysis."""
        # 1. Analyze for use case
        use_case = await self._determine_use_case(message)

        # 2. Determine performance needs
        performance_tier = self._determine_performance_tier(message)

        # 3. Check if tools are required
        tools_required = len(self.tools) > 0

        # 4. Check message complexity
        is_complex = await self._is_complex_request(message)

        # Decision logic
        if tools_required:
            # OpenAI is better for tool usage
            return "openai", "gpt-4o"

        if is_complex:
            # For complex requests, prefer OpenAI or high-tier Ollama models
            if performance_tier == "high":
                return "openai", "gpt-4o"
            else:
                ollama_model = recommend_ollama_model(use_case, "high")
                return "ollama", ollama_model

        # For standard requests, use Ollama with appropriate model
        ollama_model = recommend_ollama_model(use_case, performance_tier)
        return "ollama", ollama_model

    async def _determine_use_case(self, message: str) -> str:
        """Determine the use case based on message content."""
        message_lower = message.lower()

        # Simple heuristic classification; the first matching category wins
        if any(term in message_lower for term in ["code", "program", "function", "class", "algorithm"]):
            return "code_generation"

        if any(term in message_lower for term in ["story", "creative", "imagine", "write", "novel"]):
            return "creative_writing"

        if any(term in message_lower for term in ["math", "calculate", "equation", "solve", "formula"]):
            return "mathematical_reasoning"

        if any(term in message_lower for term in ["chat", "talk", "discuss", "conversation"]):
            return "conversational"

        if len(message.split()) > 50 or any(term in message_lower for term in ["explain", "detail", "analysis"]):
            return "knowledge_intensive"

        # Default to conversational
        return "conversational"

    def _determine_performance_tier(self, message: str) -> str:
        """Determine the performance tier needed based on message characteristics."""
        # Length-based heuristic
        word_count = len(message.split())

        if word_count > 100 or "detailed" in message.lower() or "comprehensive" in message.lower():
            return "high"

        if word_count < 20 and not any(term in message.lower() for term in ["complex", "difficult", "advanced"]):
            return "low"

        return "medium"

    async def _is_complex_request(self, message: str) -> bool:
        """Determine if this is a complex request requiring more powerful models."""
        # Check for indicators of complexity
        complexity_indicators = [
            "complex", "detailed", "thorough", "comprehensive", "in-depth",
            "analyze", "compare", "synthesize", "evaluate", "technical",
            "step by step", "advanced", "sophisticated", "nuanced"
        ]

        indicator_count = sum(1 for indicator in complexity_indicators if indicator in message.lower())

        # Length is also an indicator of complexity
        is_long = len(message.split()) > 50

        # Multiple questions indicate complexity
        question_count = message.count("?")
        has_multiple_questions = question_count > 1

        return (indicator_count >= 2) or (is_long and indicator_count >= 1) or has_multiple_questions

    def _get_model_parameters(self, provider: str, model: str) -> Dict[str, Any]:
        """Get model-specific parameters."""
        if provider == "ollama":
            if model in OLLAMA_MODELS:
                capabilities = OLLAMA_MODELS[model]
                return {
                    "temperature": capabilities.recommended_temperature,
                    "max_tokens": capabilities.context_window // 2  # Conservative estimate
                }
            else:
                # Default Ollama parameters
                return {"temperature": 0.7, "max_tokens": 2048}
        else:
            # OpenAI models
            if "gpt-4" in model:
                return {"temperature": 0.7, "max_tokens": 4096}
            else:
                return {"temperature": 0.7, "max_tokens": 2048}

    def _update_performance_metrics(
        self,
        provider: str,
        model: str,
        execution_time: float,
        response: Dict[str, Any]
    ):
        """Update performance metrics for this model."""
        model_key = f"{provider}:{model}"

        if model_key not in self.performance_metrics:
            self.performance_metrics[model_key] = {
                "calls": 0,
                "total_time": 0,
                "avg_time": 0,
                "token_usage": {
                    "prompt": 0,
                    "completion": 0,
                    "total": 0
                }
            }

        metrics = self.performance_metrics[model_key]
        metrics["calls"] += 1
        metrics["total_time"] += execution_time
        metrics["avg_time"] = metrics["total_time"] / metrics["calls"]

        # Update token usage if available
        if "usage" in response:
            usage = response["usage"]
            metrics["token_usage"]["prompt"] += usage.get("prompt_tokens", 0)
            metrics["token_usage"]["completion"] += usage.get("completion_tokens", 0)
            metrics["token_usage"]["total"] += usage.get("total_tokens", 0)

Agent Controller with Model Selection

python
# app/controllers/agent_controller.py
from fastapi import APIRouter, Depends, HTTPException, Query, BackgroundTasks
from pydantic import BaseModel, Field
from typing import List, Dict, Any, Optional
import logging

from app.agents.agent_factory import AgentFactory
from app.agents.adaptive_agent import AdaptiveAgent
from app.services.provider_service import Provider
from app.services.auth_service import get_current_user
from app.config import settings

logger = logging.getLogger(__name__)

router = APIRouter(prefix="/api/v1/agents", tags=["agents"])

class ModelSelectionParams(BaseModel):
    """Parameters for model selection."""
    provider: Optional[str] = Field(None, description="Provider to use (openai, ollama, auto)")
    model: Optional[str] = Field(None, description="Specific model to use")
    auto_select: bool = Field(True, description="Whether to auto-select the optimal model")
    use_case: Optional[str] = Field(None, description="Specific use case for model recommendation")
    performance_tier: Optional[str] = Field("medium", description="Performance tier (low, medium, high)")

class ChatRequest(BaseModel):
    message: str
    session_id: Optional[str] = None
    model_params: Optional[ModelSelectionParams] = None
    stream: bool = False

class ChatResponse(BaseModel):
    response: str
    session_id: str
    model_used: str
    provider_used: str
    execution_metrics: Optional[Dict[str, Any]] = None

# Agent sessions storage (in-memory, lost on restart)
agent_sessions = {}

def get_agent_factory():
    # Initialize and return agent factory
    # In a real implementation, this would be properly initialized
    return AgentFactory()

# Dependency that supplies the agent factory to route handlers;
# declared after get_agent_factory so the reference is direct.
agent_factory = Depends(get_agent_factory)

@router.post("/chat", response_model=ChatResponse)
async def chat(
    request: ChatRequest,
    background_tasks: BackgroundTasks,
    current_user: Dict = Depends(get_current_user),
    factory: AgentFactory = agent_factory
):
    """Chat with an agent that intelligently selects the appropriate model."""
    user_id = current_user["id"]

    # Streaming is not implemented here yet. Reject it explicitly instead of
    # falling through with `response` undefined.
    if request.stream:
        raise HTTPException(status_code=501, detail="Streaming is not implemented for this endpoint")

    # Create or retrieve session
    session_id = request.session_id
    if not session_id or session_id not in agent_sessions:
        # Create a new adaptive agent
        agent = factory.create_agent(
            agent_type="adaptive",
            agent_class=AdaptiveAgent,
            system_prompt="You are a helpful assistant that provides accurate, relevant information."
        )

        session_id = f"session_{user_id}_{len(agent_sessions) + 1}"
        agent_sessions[session_id] = agent
    else:
        agent = agent_sessions[session_id]

    # Apply model selection parameters if provided
    if request.model_params:
        if not request.model_params.auto_select:
            # Force specific provider/model
            provider = request.model_params.provider or "auto"
            model = request.model_params.model

            if provider != "auto" and model:
                logger.info(f"Forcing model selection: {provider}:{model}")
                # Note: AdaptiveAgent._generate_response re-runs model
                # selection and overwrites these attributes, so the agent
                # must be extended to honor them for the override to apply.
                agent.last_used_provider = provider
                agent.last_used_model = model

    try:
        # Process the message
        response = await agent.process_message(request.message, user_id)

        # Get the model and provider that were used
        model_used = agent.last_used_model or "unknown"
        provider_used = agent.last_used_provider or "unknown"

        # Get execution metrics
        model_key = f"{provider_used}:{model_used}"
        execution_metrics = agent.performance_metrics.get(model_key)

        # Schedule background task to analyze performance and adjust preferences
        background_tasks.add_task(
            analyze_performance,
            agent,
            model_key,
            execution_metrics
        )

        return ChatResponse(
            response=response,
            session_id=session_id,
            model_used=model_used,
            provider_used=provider_used,
            execution_metrics=execution_metrics
        )
    except Exception as e:
        logger.exception(f"Error processing message: {str(e)}")
        raise HTTPException(status_code=500, detail=f"Error processing message: {str(e)}")

@router.get("/models/recommend")
async def recommend_model(
    use_case: str = Query(..., description="The use case (code_generation, creative_writing, etc.)"),
    performance_tier: str = Query("medium", description="Performance tier (low, medium, high)"),
    current_user: Dict = Depends(get_current_user)
):
    """Get model recommendations for a specific use case."""
    from app.models.model_catalog import recommend_ollama_model, OLLAMA_MODELS

    # Get recommended Ollama model
    recommended_model = recommend_ollama_model(use_case, performance_tier)

    # Get OpenAI equivalent
    openai_equivalent = "gpt-4o" if performance_tier == "high" else "gpt-3.5-turbo"

    # Get model capabilities if available, as a plain dict for the JSON response
    capability = OLLAMA_MODELS.get(recommended_model)
    capabilities = vars(capability) if capability else {}

    return {
        "ollama_recommendation": recommended_model,
        "openai_recommendation": openai_equivalent,
        "capabilities": capabilities,
        "use_case": use_case,
        "performance_tier": performance_tier
    }

async def analyze_performance(agent, model_key, metrics):
    """Analyze model performance and adjust preferences."""
    if not metrics or metrics["calls"] < 5:
        # Not enough data to analyze
        return

    # Analyze average response time
    avg_time = metrics["avg_time"]

    # If response time is too slow, consider adjusting default models
    if avg_time > 5.0:  # More than 5 seconds
        logger.info(f"Model {model_key} showing slow performance: {avg_time}s avg")
        # In a real implementation, we might adjust preferred models here

Dockerfile for Local Deployment

dockerfile
# Dockerfile
FROM python:3.11-slim

WORKDIR /app

# Install system dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
    curl \
    && rm -rf /var/lib/apt/lists/*

# Copy requirements
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy application code
COPY . .

# Set up environment
# OPENAI_API_KEY is deliberately not set here: supply it at runtime
# (for example through docker compose) so the key is never baked into the image.
ENV PYTHONPATH=/app
ENV OLLAMA_HOST="http://ollama:11434"
ENV OLLAMA_MODEL="llama2"

# Default command
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]

Docker Compose for Development

yaml
# docker-compose.yml
version: '3.8'

services:
  app:
    build: .
    ports:
      - "8000:8000"
    volumes:
      - .:/app
    environment:
      - OLLAMA_HOST=http://ollama:11434
      - OPENAI_API_KEY=${OPENAI_API_KEY}
      - OPENAI_MODEL=${OPENAI_MODEL:-gpt-4o}
      - OLLAMA_MODEL=${OLLAMA_MODEL:-llama2}
    depends_on:
      - ollama
    restart: unless-stopped

  ollama:
    image: ollama/ollama:latest
    volumes:
      - ollama_data:/root/.ollama
    ports:
      - "11434:11434"
    # GPU reservation; requires the NVIDIA Container Toolkit on the host.
    # Remove the deploy section on CPU-only machines.
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

volumes:
  ollama_data:

Model Preload Script

python
#!/usr/bin/env python
# scripts/preload_models.py
import argparse
import sys
import time
from typing import List

import requests

def main():
    parser = argparse.ArgumentParser(description='Preload Ollama models')
    parser.add_argument('--host', default="http://localhost:11434", help='Ollama host URL')
    parser.add_argument('--models', default="llama2,mistral,codellama", help='Comma-separated list of models to preload')
    parser.add_argument('--timeout', type=int, default=3600, help='Timeout in seconds for each model pull')
    args = parser.parse_args()

    models = [m.strip() for m in args.models.split(',') if m.strip()]
    preload_models(args.host, models, args.timeout)

def preload_models(host: str, models: List[str], timeout: int):
    """Preload models into Ollama."""
    print(f"Preloading {len(models)} models on {host}...")

    # Check Ollama availability
    try:
        response = requests.get(f"{host}/api/tags", timeout=10)
        if response.status_code != 200:
            print(f"Error connecting to Ollama: Status {response.status_code}")
            sys.exit(1)

        available_models = [m["name"] for m in response.json().get("models", [])]
        print(f"Currently available models: {', '.join(available_models)}")
    except Exception as e:
        print(f"Error connecting to Ollama: {str(e)}")
        sys.exit(1)

    # Pull each model
    for model in models:
        # /api/tags reports names with an explicit tag (e.g. "llama2:latest"),
        # so also check the ":latest" form of untagged names.
        if model in available_models or f"{model}:latest" in available_models:
            print(f"Model {model} is already available, skipping...")
            continue

        print(f"Pulling model: {model}")
        try:
            start_time = time.time()
            response = requests.post(
                f"{host}/api/pull",
                # "stream": False asks Ollama for a single JSON reply instead
                # of a stream of progress events.
                json={"name": model, "stream": False},
                timeout=timeout
            )

            if response.status_code != 200:
                print(f"Error pulling model {model}: Status {response.status_code}")
                print(response.text)
                continue

            elapsed = time.time() - start_time
            print(f"Successfully pulled {model} in {elapsed:.1f} seconds")
        except Exception as e:
            print(f"Error pulling model {model}: {str(e)}")

    # Verify available models after pulling
    try:
        response = requests.get(f"{host}/api/tags", timeout=10)
        if response.status_code == 200:
            available_models = [m["name"] for m in response.json().get("models", [])]
            print(f"Available models: {', '.join(available_models)}")
    except Exception as e:
        print(f"Error checking available models: {str(e)}")

if __name__ == "__main__":
    main()
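
Unless streaming is disabled in the request, Ollama's pull endpoint reports progress as newline-delimited JSON, and an HTTP 200 status alone does not confirm that the pull finished. The sketch below scans such lines for a final `success` status or an `error` field. It works on lines that have already been received, so it runs offline. The helper name `summarize_pull` is hypothetical, and the `status` and `error` field names are my reading of the Ollama API documentation, so treat them as assumptions to verify against your Ollama version.

```python
import json
from typing import Iterable, Tuple

def summarize_pull(lines: Iterable[str]) -> Tuple[bool, str]:
    """Scan NDJSON progress lines and report (succeeded, last_status_or_error)."""
    last_status = ""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # Skip malformed fragments rather than abort
        if not isinstance(event, dict):
            continue
        if "error" in event:
            return False, str(event["error"])
        last_status = str(event.get("status", last_status))
    return last_status == "success", last_status

sample = [
    '{"status": "pulling manifest"}',
    '{"status": "downloading", "total": 100, "completed": 40}',
    '{"status": "success"}',
]
print(summarize_pull(sample))  # (True, 'success')
```

With `requests`, the same function could consume `response.iter_lines(decode_unicode=True)` from a request made with `stream=True`.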

Implementation Guide

Setting up Ollama

  1. Installation:

    bash
    # macOS
    brew install ollama

    # Linux
    curl -fsSL https://ollama.com/install.sh | sh

    # Windows
    # Download from https://ollama.com/download/windows
  2. Start Ollama Server (skip this step if Ollama already runs as a desktop app or system service):

    bash
    ollama serve
  3. Pull Base Models (the server must be running):

    bash
    ollama pull llama2
    ollama pull mistral
    ollama pull codellama

Application Configuration

  1. Create .env file:

    text
    OPENAI_API_KEY=sk-...
    OPENAI_ORG_ID=org-... # Optional
    OPENAI_MODEL=gpt-4o
    OLLAMA_MODEL=llama2
    OLLAMA_HOST=http://localhost:11434
    COMPLEXITY_THRESHOLD=0.65
    PRIVACY_SENSITIVE_TOKENS=password,secret,token,key,credential
  2. Initialize Application:

    bash
    # Install dependencies
    pip install -r requirements.txt

    # Start the application
    uvicorn app.main:app --reload

Model Selection Criteria

The system determines which provider (OpenAI or Ollama) to use based on several criteria:

  1. Complexity Analysis:

    • The latest user message is scored from 0 to 1 using its length (30% weight), the number of complexity-related terms it contains (40%), and its average sentence length (30%).
    • Scores above the COMPLEXITY_THRESHOLD setting (default: 0.65) are routed to OpenAI.
  2. Privacy Concerns:

    • Messages containing sensitive terms (configured in PRIVACY_SENSITIVE_TOKENS) are routed to Ollama before complexity is assessed.
    • This keeps sensitive information on local infrastructure.
  3. Tool Requirements:

    • Requests requiring tools/functions are routed to OpenAI, as Ollama has limited native tool support.
    • The system simulates tool usage in Ollama using prompt engineering when necessary.
  4. Resource Constraints:

    • When a token budget is set and the estimated token count (roughly one token per four characters) exceeds it, the request is routed to OpenAI.
    • Local hardware capabilities should guide which Ollama models are installed; the catalog tags each model as fast, medium, or slow.
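
The criteria above can be condensed into a single ordered decision function. The sketch below is a simplified, standalone illustration rather than the service's actual method: `RoutingCriteria` and `choose_provider` are hypothetical names, the complexity score is passed in precomputed, and placing the tool check directly after the privacy check is an assumption.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional

class Provider(str, Enum):
    OPENAI = "openai"
    OLLAMA = "ollama"

@dataclass
class RoutingCriteria:
    complexity_threshold: float = 0.65
    privacy_sensitive_tokens: List[str] = field(default_factory=list)
    token_budget: Optional[int] = None

def choose_provider(
    text: str,
    complexity: float,
    criteria: RoutingCriteria,
    needs_tools: bool = False,
) -> Provider:
    """Apply the routing rules in priority order and return the first match."""
    lowered = text.lower()
    # 1. Privacy wins over everything else: sensitive text stays local.
    if any(tok.lower() in lowered for tok in criteria.privacy_sensitive_tokens):
        return Provider.OLLAMA
    # 2. Tool use goes to OpenAI (assumed position in the ordering).
    if needs_tools:
        return Provider.OPENAI
    # 3. High complexity goes to OpenAI.
    if complexity > criteria.complexity_threshold:
        return Provider.OPENAI
    # 4. Token budget, using the 4-characters-per-token estimate.
    if criteria.token_budget and len(text) // 4 > criteria.token_budget:
        return Provider.OPENAI
    # 5. Everything else stays on Ollama.
    return Provider.OLLAMA

criteria = RoutingCriteria(privacy_sensitive_tokens=["password"], token_budget=50)
print(choose_provider("Reset my password please", 0.9, criteria).value)  # ollama
print(choose_provider("Summarize this report", 0.9, criteria).value)     # openai
print(choose_provider("Hello there", 0.1, criteria).value)               # ollama
```

Ordering matters here: because the privacy rule returns first, a sensitive message stays local even when it is also complex or over budget.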

Ollama Model Selection

The system selects an Ollama model from the query's inferred use case and performance tier:

  1. For code generation: codellama:34b, or codellama on the low tier
  2. For creative writing: dolphin-mistral, with mistral:7b-instruct as the alternative
  3. For mathematical reasoning: wizard-math, or llama2:70b on the high tier
  4. For conversational requests: neural-chat, with dolphin-mistral as the alternative
  5. For knowledge-intensive requests: llama2:70b, with mistral as the alternative
  6. For resource-constrained environments: phi or orca-mini

On the low tier, every use case other than code generation falls back to orca-mini.

Performance Optimization

  1. Response Caching:

    • Common responses are cached to improve performance.
    • Cache TTL and maximum items are configurable.
  2. Dynamic Temperature Adjustment:

    • Each model has recommended temperature settings for optimal performance.
    • The system adjusts temperature based on the task type.
  3. Adaptive Routing:

    • The system learns from performance metrics and adjusts routing preferences over time.
    • Models with consistently poor performance receive fewer requests.
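
As an illustration of the response cache with a configurable TTL and item limit, here is a minimal in-memory sketch. The `TTLCache` class and its parameters are assumptions for this example; the actual implementation may use a different eviction policy or an external store.

```python
import time
from collections import OrderedDict
from typing import Any, Callable, Hashable, Optional


class TTLCache:
    """Small LRU cache whose entries also expire after ttl_seconds."""

    def __init__(
        self,
        ttl_seconds: float = 300.0,
        max_items: int = 128,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._ttl = ttl_seconds
        self._max_items = max_items
        self._clock = clock  # injectable so tests can control time
        self._items: "OrderedDict[Hashable, tuple]" = OrderedDict()

    def get(self, key: Hashable) -> Optional[Any]:
        entry = self._items.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            # Entry is stale: drop it and report a miss.
            del self._items[key]
            return None
        # Mark as most recently used.
        self._items.move_to_end(key)
        return value

    def set(self, key: Hashable, value: Any) -> None:
        self._items[key] = (self._clock() + self._ttl, value)
        self._items.move_to_end(key)
        # Evict least recently used entries beyond the size limit.
        while len(self._items) > self._max_items:
            self._items.popitem(last=False)
```

A cache key would typically combine the model name with a hash of the message list, so that identical prompts sent to different models do not collide.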

Fallback Mechanisms

The system implements robust fallback mechanisms:

  1. Provider Fallback:

    • If OpenAI is unavailable, the system falls back to Ollama.
    • If Ollama fails, the system falls back to OpenAI.
  2. Model Fallback:

    • If a requested model is unavailable, the system selects an appropriate alternative.
    • Fallback chains are configured for each model to ensure graceful degradation.
  3. Error Handling:

    • Network errors, timeout issues, and model limitations are handled gracefully.
    • The system provides informative error messages when fallbacks are exhausted.
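
The provider fallback described above can be sketched as a loop over an ordered chain of providers. This is a simplified illustration under stated assumptions: `complete_with_fallback`, `AllProvidersFailed`, and the per-provider callables are hypothetical names, and the real service may distinguish error types more finely.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List, Tuple

# Each provider is a (name, async callable) pair; the callable takes a
# prompt and returns a response dictionary.
CompletionFn = Callable[[str], Awaitable[Dict]]


class AllProvidersFailed(RuntimeError):
    """Raised when every provider in the chain has failed."""


async def complete_with_fallback(
    prompt: str,
    providers: List[Tuple[str, CompletionFn]],
    timeout: float = 30.0,
) -> Dict:
    errors = []
    for name, generate in providers:
        try:
            result = await asyncio.wait_for(generate(prompt), timeout=timeout)
            # Record which provider actually answered.
            return {**result, "provider": name}
        except Exception as exc:  # network errors, timeouts, model errors
            errors.append(f"{name}: {exc!r}")
    # Fallbacks exhausted: surface every failure in one message.
    raise AllProvidersFailed("; ".join(errors))
```

Collecting each provider's error before raising keeps the final message informative, which matters when the first failure (for example, an expired API key) is the one that actually needs fixing.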

Conclusion

The integration of Ollama with OpenAI's Agents SDK creates a hybrid architecture that combines the strengths of local and cloud-based inference. This implementation provides:

  1. Enhanced privacy by keeping sensitive information local when appropriate
  2. Cost optimization by routing suitable queries to local infrastructure
  3. Robust fallbacks ensuring system resilience against failures
  4. Task-appropriate model selection based on sophisticated analysis
  5. Seamless integration with the agent framework and tools ecosystem

This architecture balances the capability of cloud-based models with the privacy and cost benefits of local inference. By routing each request according to its characteristics, the system can trade off response quality against constraints on privacy, latency, and resource utilization.

Comprehensive Testing Strategy for OpenAI-Ollama Hybrid Agent System

Theoretical Framework for Validation Methodology

The integration of cloud-based and local inferencing capabilities within a unified agent architecture necessitates a multifaceted testing approach that encompasses both individual components and their systemic interactions. This document establishes a rigorous testing framework that addresses the unique challenges of validating a hybrid AI system across multiple dimensions of functionality, performance, and reliability.

Strategic Testing Layers

1. Unit Testing Framework

Core Component Isolation Testing

python
# tests/unit/test_provider_service.py
import pytest
import asyncio
from unittest.mock import AsyncMock, patch, MagicMock
import json

from app.services.provider_service import ProviderService, Provider
from app.services.ollama_service import OllamaService

class TestProviderService:
    @pytest.fixture
    def provider_service(self):
        """Create a provider service with mocked dependencies for testing."""
        service = ProviderService()
        service.openai_client = AsyncMock()
        service.ollama_service = AsyncMock(spec=OllamaService)
        return service

    @pytest.mark.asyncio
    async def test_select_provider_and_model_explicit(self, provider_service):
        """Test explicit provider and model selection."""
        # Test explicit provider:model format
        provider, model = await provider_service._select_provider_and_model(
            messages=[{"role": "user", "content": "Hello"}],
            model="openai:gpt-4"
        )
        assert provider == Provider.OPENAI
        assert model == "gpt-4"

        # Test explicit provider with default model
        provider, model = await provider_service._select_provider_and_model(
            messages=[{"role": "user", "content": "Hello"}],
            provider="ollama"
        )
        assert provider == Provider.OLLAMA
        assert model == provider_service.default_ollama_model

    @pytest.mark.asyncio
    async def test_auto_routing_complex_content(self, provider_service):
        """Test auto-routing with complex content."""
        # Mock complexity assessment to return high complexity
        provider_service._assess_complexity = AsyncMock(return_value=0.8)
        provider_service.model_selection_criteria.complexity_threshold = 0.7

        provider = await provider_service._auto_route(
            messages=[{"role": "user", "content": "Complex technical question"}]
        )

        assert provider == Provider.OPENAI
        provider_service._assess_complexity.assert_called_once()

    @pytest.mark.asyncio
    async def test_auto_routing_privacy_sensitive(self, provider_service):
        """Test auto-routing with privacy sensitive content."""
        provider_service.model_selection_criteria.privacy_sensitive_tokens = ["password", "secret"]

        provider = await provider_service._auto_route(
            messages=[{"role": "user", "content": "What is my password?"}]
        )

        assert provider == Provider.OLLAMA

    @pytest.mark.asyncio
    async def test_auto_routing_with_tools(self, provider_service):
        """Test auto-routing with tool requirements."""
        provider = await provider_service._auto_route(
            messages=[{"role": "user", "content": "Simple question"}],
            tools=[{"type": "function", "function": {"name": "get_weather"}}]
        )

        assert provider == Provider.OPENAI

    @pytest.mark.asyncio
    async def test_generate_completion_openai(self, provider_service):
        """Test generating completion with OpenAI."""
        # Setup mock response
        mock_response = MagicMock()
        mock_response.model_dump.return_value = {
            "id": "test-id",
            "object": "chat.completion",
            "model": "gpt-4",
            "usage": {"total_tokens": 10},
            "message": {"content": "Test response"}
        }
        provider_service.openai_client.chat.completions.create = AsyncMock(return_value=mock_response)

        response = await provider_service._generate_openai_completion(
            messages=[{"role": "user", "content": "Hello"}],
            model="gpt-4"
        )

        assert response["message"]["content"] == "Test response"
        provider_service.openai_client.chat.completions.create.assert_called_once()

    @pytest.mark.asyncio
    async def test_generate_completion_ollama(self, provider_service):
        """Test generating completion with Ollama."""
        provider_service.ollama_service.generate_completion.return_value = {
            "id": "ollama-test",
            "model": "llama2",
            "provider": "ollama",
            "message": {"content": "Ollama response"}
        }

        response = await provider_service._generate_ollama_completion(
            messages=[{"role": "user", "content": "Hello"}],
            model="llama2"
        )

        assert response["message"]["content"] == "Ollama response"
        provider_service.ollama_service.generate_completion.assert_called_once()

    @pytest.mark.asyncio
    async def test_fallback_mechanism(self, provider_service):
        """Test fallback mechanism when primary provider fails."""
        # Mock the primary provider (OpenAI) to fail
        provider_service._generate_openai_completion = AsyncMock(side_effect=Exception("API error"))

        # Mock the fallback provider (Ollama) to succeed
        provider_service._generate_ollama_completion = AsyncMock(return_value={
            "id": "ollama-fallback",
            "provider": "ollama",
            "message": {"content": "Fallback response"}
        })

        # Test the generate_completion method with auto provider
        response = await provider_service.generate_completion(
            messages=[{"role": "user", "content": "Hello"}],
            provider="auto"
        )

        # Check that fallback was used
        assert response["provider"] == "ollama"
        assert response["message"]["content"] == "Fallback response"
        provider_service._generate_openai_completion.assert_called_once()
        provider_service._generate_ollama_completion.assert_called_once()

Model Selection Logic Testing

python
# tests/unit/test_model_selection.py
import pytest
from unittest.mock import AsyncMock, patch
import json

from app.models.model_catalog import recommend_ollama_model, OLLAMA_MODELS
from app.agents.adaptive_agent import AdaptiveAgent

class TestModelSelection:
    @pytest.mark.parametrize("use_case,performance_tier,expected_model", [
        ("code_generation", "high", "codellama:34b"),
        ("creative_writing", "medium", "dolphin-mistral"),
        ("mathematical_reasoning", "low", "orca-mini"),
        ("conversational", "high", "neural-chat"),
        ("knowledge_intensive", "high", "llama2:70b"),
        ("resource_constrained", "low", "phi"),
    ])
    def test_model_recommendations(self, use_case, performance_tier, expected_model):
        """Test model recommendation logic for different use cases."""
        model = recommend_ollama_model(use_case, performance_tier)
        assert model == expected_model

    @pytest.mark.asyncio
    async def test_adaptive_agent_use_case_detection(self):
        """Test adaptive agent's use case detection logic."""
        provider_service = AsyncMock()
        agent = AdaptiveAgent(
            provider_service=provider_service,
            system_prompt="You are a helpful assistant."
        )

        # Test code-related message
        code_use_case = await agent._determine_use_case(
            "Can you help me write a Python function to calculate Fibonacci numbers?"
        )
        assert code_use_case == "code_generation"

        # Test creative writing message
        creative_use_case = await agent._determine_use_case(
            "Write a short story about a robot discovering emotions."
        )
        assert creative_use_case == "creative_writing"

        # Test mathematical reasoning message
        math_use_case = await agent._determine_use_case(
            "Solve this equation: 3x² + 2x - 5 = 0"
        )
        assert math_use_case == "mathematical_reasoning"

    @pytest.mark.asyncio
    async def test_complexity_assessment(self):
        """Test complexity assessment logic."""
        provider_service = AsyncMock()
        agent = AdaptiveAgent(
            provider_service=provider_service,
            system_prompt="You are a helpful assistant."
        )

        # Simple message
        simple_message = "What time is it?"
        is_complex_simple = await agent._is_complex_request(simple_message)
        assert not is_complex_simple

        # Complex message
        complex_message = "Can you provide a detailed analysis of the socioeconomic factors that contributed to the Industrial Revolution in England, and compare those with the conditions in contemporary developing economies?"
        is_complex_detailed = await agent._is_complex_request(complex_message)
        assert is_complex_detailed

        # Multiple questions
        multi_question = "What is quantum computing? How does it differ from classical computing? What are its potential applications?"
        is_complex_multi = await agent._is_complex_request(multi_question)
        assert is_complex_multi

Ollama Service Testing

python
# tests/unit/test_ollama_service.py
import pytest
import json
import asyncio
from unittest.mock import AsyncMock, patch, MagicMock

from app.services.ollama_service import OllamaService

class TestOllamaService:
    @pytest.fixture
    def ollama_service(self):
        """Create an Ollama service with mocked session for testing."""
        service = OllamaService()
        service.session = AsyncMock()
        return service

    @pytest.mark.asyncio
    async def test_list_models(self, ollama_service):
        """Test listing available models."""
        mock_response = AsyncMock()
        mock_response.status = 200
        mock_response.json = AsyncMock(return_value={"models": [
            {"name": "llama2"},
            {"name": "mistral"}
        ]})

        # Mock the context manager. session.get(...) is entered with
        # `async with`, so it must return the context manager directly;
        # an AsyncMock would return a coroutine instead.
        ollama_service.session.get = MagicMock()
        ollama_service.session.get.return_value.__aenter__.return_value = mock_response

        models = await ollama_service.list_models()

        assert len(models) == 2
        assert models[0]["name"] == "llama2"
        assert models[1]["name"] == "mistral"

    @pytest.mark.asyncio
    async def test_generate_completion(self, ollama_service):
        """Test generating a completion."""
        # Mock the response
        mock_response = AsyncMock()
        mock_response.status = 200
        mock_response.json = AsyncMock(return_value={
            "id": "test-id",
            "response": "This is a test response",
            "created_at": 1677858242
        })

        # Mock the context manager (MagicMock for the same reason as above)
        ollama_service.session.post = MagicMock()
        ollama_service.session.post.return_value.__aenter__.return_value = mock_response

        # Test the completion generation
        response = await ollama_service._generate_completion_sync({
            "model": "llama2",
            "prompt": "Hello, world!",
            "stream": False,
            "options": {"temperature": 0.7}
        })

        # Check the formatted response
        assert "message" in response
        assert response["message"]["content"] == "This is a test response"
        assert response["provider"] == "ollama"

    @pytest.mark.asyncio
    async def test_format_messages_for_ollama(self, ollama_service):
        """Test formatting messages for Ollama."""
        messages = [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Hello!"},
            {"role": "assistant", "content": "Hi there!"},
            {"role": "user", "content": "How are you?"}
        ]

        formatted = ollama_service._format_messages_for_ollama(messages)

        assert "[System]" in formatted
        assert "[User]" in formatted
        assert "[Assistant]" in formatted
        assert "You are a helpful assistant." in formatted
        assert "Hello!" in formatted
        assert "How are you?" in formatted

    @pytest.mark.asyncio
    async def test_tool_call_extraction(self, ollama_service):
        """Test extracting tool calls from response text."""
        # Response with a tool call
        response_with_tool = """
        I'll help you get the weather information.

        <tool>
        {
            "name": "get_weather",
            "parameters": {
                "location": "New York",
                "unit": "celsius"
            }
        }
        </tool>

        Let me check the weather for you.
        """

        tool_calls = ollama_service._extract_tool_calls(response_with_tool)

        assert tool_calls is not None
        assert len(tool_calls) == 1
        assert tool_calls[0]["function"]["name"] == "get_weather"
        assert "New York" in tool_calls[0]["function"]["arguments"]

        # Response without a tool call
        response_without_tool = "The weather in New York is sunny."
        assert ollama_service._extract_tool_calls(response_without_tool) is None

    @pytest.mark.asyncio
    async def test_clean_tool_calls_from_text(self, ollama_service):
        """Test cleaning tool calls from response text."""
        response_with_tool = """
        I'll help you get the weather information.

        <tool>
        {
            "name": "get_weather",
            "parameters": {
                "location": "New York",
                "unit": "celsius"
            }
        }
        </tool>

        Let me check the weather for you.
        """

        cleaned = ollama_service._clean_tool_calls_from_text(response_with_tool)

        assert "<tool>" not in cleaned
        assert "get_weather" not in cleaned
        assert "I'll help you get the weather information." in cleaned
        assert "Let me check the weather for you." in cleaned

Tool Integration Testing

python
# tests/unit/test_tool_integration.py
import pytest
from unittest.mock import AsyncMock, patch
import json

from app.agents.task_agent import TaskManagementAgent
from app.models.message import Message, MessageRole

class TestToolIntegration:
    @pytest.fixture
    def task_agent(self):
        """Create a task agent with mocked services."""
        provider_service = AsyncMock()
        task_service = AsyncMock()

        agent = TaskManagementAgent(
            provider_service=provider_service,
            task_service=task_service,
            system_prompt="You are a task management agent."
        )

        return agent

    @pytest.mark.asyncio
    async def test_process_tool_calls_list_tasks(self, task_agent):
        """Test processing the list_tasks tool call."""
        # Mock task service response
        task_agent.task_service.list_tasks.return_value = [
            {
                "id": "task1",
                "title": "Complete report",
                "status": "pending",
                "priority": "high",
                "due_date": "2023-04-15",
                "description": "Finish quarterly report"
            }
        ]

        # Create a tool call for list_tasks
        tool_calls = [{
            "id": "call_123",
            "function": {
                "name": "list_tasks",
                "arguments": json.dumps({
                    "status": "pending",
                    "limit": 5
                })
            }
        }]

        # Process the tool calls
        tool_responses = await task_agent._process_tool_calls(tool_calls, "user123")

        # Verify the response
        assert len(tool_responses) == 1
        assert tool_responses[0]["tool_call_id"] == "call_123"
        assert "Complete report" in tool_responses[0]["content"]
        assert "pending" in tool_responses[0]["content"]

        # Verify service was called correctly
        task_agent.task_service.list_tasks.assert_called_once_with(
            user_id="user123",
            status="pending",
            limit=5
        )

    @pytest.mark.asyncio
    async def test_process_tool_calls_create_task(self, task_agent):
        """Test processing the create_task tool call."""
        # Mock task service response
        task_agent.task_service.create_task.return_value = {
            "id": "new_task",
            "title": "New test task"
        }

        # Create a tool call for create_task
        tool_calls = [{
            "id": "call_456",
            "function": {
                "name": "create_task",
                "arguments": json.dumps({
                    "title": "New test task",
                    "description": "This is a test task",
                    "priority": "medium"
                })
            }
        }]

        # Process the tool calls
        tool_responses = await task_agent._process_tool_calls(tool_calls, "user123")

        # Verify the response
        assert len(tool_responses) == 1
        assert tool_responses[0]["tool_call_id"] == "call_456"
        assert "Task created successfully" in tool_responses[0]["content"]
        assert "New test task" in tool_responses[0]["content"]

        # Verify service was called correctly
        task_agent.task_service.create_task.assert_called_once_with(
            user_id="user123",
            title="New test task",
            description="This is a test task",
            due_date=None,
            priority="medium"
        )

    @pytest.mark.asyncio
    async def test_generate_response_with_tools(self, task_agent):
        """Test the full generate_response flow with tool usage."""
        # Set up the conversation history
        task_agent.state.conversation_history = [
            Message(role=MessageRole.SYSTEM, content="You are a task management agent."),
            Message(role=MessageRole.USER, content="List my pending tasks")
        ]

        # Mock provider service to return a response with tool calls first
        mock_response_with_tools = {
            "message": {
                "content": "I'll list your tasks",
                "tool_calls": [{
                    "id": "call_123",
                    "function": {
                        "name": "list_tasks",
                        "arguments": json.dumps({
                            "status": "pending",
                            "limit": 10
                        })
                    }
                }]
            },
            "tool_calls": [{
                "id": "call_123",
                "function": {
                    "name": "list_tasks",
                    "arguments": json.dumps({
                        "status": "pending",
                        "limit": 10
                    })
                }
            }]
        }

        # Mock task service
        task_agent.task_service.list_tasks.return_value = [
            {
                "id": "task1",
                "title": "Complete report",
                "status": "pending",
                "priority": "high",
                "due_date": "2023-04-15",
                "description": "Finish quarterly report"
            }
        ]

        # Mock final response after tool processing
        mock_final_response = {
            "message": {
                "content": "You have 1 pending task: Complete report (high priority, due Apr 15)"
            }
        }

        # Set up the mocked provider service
        task_agent.provider_service.generate_completion = AsyncMock()
        task_agent.provider_service.generate_completion.side_effect = [
            mock_response_with_tools,  # First call returns tool calls
            mock_final_response        # Second call returns final response
        ]

        # Generate the response
        response = await task_agent._generate_response("user123")

        # Verify the final response
        assert response == "You have 1 pending task: Complete report (high priority, due Apr 15)"

        # Verify the provider service was called twice
        assert task_agent.provider_service.generate_completion.call_count == 2

        # Verify the task service was called
        task_agent.task_service.list_tasks.assert_called_once()

        # Verify tool response was added to conversation history
        tool_messages = [msg for msg in task_agent.state.conversation_history if msg.role == MessageRole.TOOL]
        assert len(tool_messages) == 1

2. Integration Testing Framework

API Endpoint Testing

python
# tests/integration/test_api_endpoints.py
import pytest
from fastapi.testclient import TestClient
import json
import os
from unittest.mock import patch, AsyncMock

from app.main import app
from app.services.provider_service import ProviderService

client = TestClient(app)

class TestAPIEndpoints:
    @pytest.fixture(autouse=True)
    def setup_mocks(self):
        """Set up mocks for services."""
        # Patch the provider service
        with patch('app.controllers.agent_controller.get_agent_factory') as mock_factory:
            mock_provider = AsyncMock(spec=ProviderService)
            mock_factory.return_value.provider_service = mock_provider
            yield

    def test_health_endpoint(self):
        """Test the health check endpoint."""
        response = client.get("/api/health")
        assert response.status_code == 200
        assert response.json()["status"] == "ok"

    def test_chat_endpoint_auth_required(self):
        """Test that chat endpoint requires authentication."""
        response = client.post(
            "/api/v1/chat",
            json={"message": "Hello"}
        )
        assert response.status_code == 401  # Unauthorized

    def test_chat_endpoint_with_auth(self):
        """Test the chat endpoint with proper authentication."""
        # Mock the authentication.
        # Note: if the route resolves the user via Depends(get_current_user),
        # patching the module attribute has no effect on the dependency;
        # app.dependency_overrides is the usual mechanism in that case.
        with patch('app.services.auth_service.get_current_user') as mock_auth:
            mock_auth.return_value = {"id": "test_user"}

            # Mock the agent's process_message
            with patch('app.agents.base_agent.BaseAgent.process_message') as mock_process:
                mock_process.return_value = "Hello, I'm an AI assistant."

                response = client.post(
                    "/api/v1/chat",
                    json={"message": "Hi there"},
                    headers={"Authorization": "Bearer test_token"}
                )

                assert response.status_code == 200
                assert "response" in response.json()
                assert response.json()["response"] == "Hello, I'm an AI assistant."

    def test_model_recommendation_endpoint(self):
        """Test the model recommendation endpoint."""
        # Mock the authentication
        with patch('app.services.auth_service.get_current_user') as mock_auth:
            mock_auth.return_value = {"id": "test_user"}

            response = client.get(
                "/api/v1/agents/models/recommend?use_case=code_generation&performance_tier=high",
                headers={"Authorization": "Bearer test_token"}
            )

            assert response.status_code == 200
            data = response.json()
            assert "ollama_recommendation" in data
            assert data["use_case"] == "code_generation"
            assert data["performance_tier"] == "high"

    def test_streaming_endpoint(self):
        """Test the streaming endpoint."""
        # Mock the authentication
        with patch('app.services.auth_service.get_current_user') as mock_auth:
            mock_auth.return_value = {"id": "test_user"}

            # Mock the streaming generator
            async def mock_stream_generator():
                yield {"id": "1", "content": "Hello"}
                yield {"id": "2", "content": " World"}

            # Mock the stream method
            with patch('app.services.provider_service.ProviderService.stream_completion') as mock_stream:
                mock_stream.return_value = mock_stream_generator()

                response = client.post(
                    "/api/v1/chat/streaming",
                    json={"message": "Hi", "stream": True},
                    headers={"Authorization": "Bearer test_token"}
                )

                assert response.status_code == 200
                # Starlette appends a charset to text/* media types,
                # so compare the prefix rather than the full header value.
                assert response.headers["content-type"].startswith("text/event-stream")

                # Parse the streaming response
                content = response.content.decode()
                assert "data:" in content
                assert "Hello" in content
                assert "World" in content

End-to-End Agent Flow Testing

python
# tests/integration/test_agent_flows.py
import pytest
import asyncio
from unittest.mock import AsyncMock, patch
import json

from app.agents.meta_agent import MetaAgent, AgentSubsystem
from app.agents.research_agent import ResearchAgent
from app.agents.conversation_manager import ConversationManager
from app.models.message import Message, MessageRole

class TestAgentFlows:
    # A plain (synchronous) fixture: nothing here is awaited, and an
    # `async def` fixture under @pytest.fixture would hand the tests an
    # un-awaited coroutine in pytest-asyncio's strict mode.
    @pytest.fixture
    def meta_agent_setup(self):
        """Set up a meta agent with subsystems for testing."""
        # Create mocked services
        provider_service = AsyncMock()
        knowledge_service = AsyncMock()
        memory_service = AsyncMock()

        # Create subsystem agents
        research_agent = ResearchAgent(
            provider_service=provider_service,
            knowledge_service=knowledge_service,
            system_prompt="You are a research agent."
        )

        conversation_agent = ConversationManager(
            provider_service=provider_service,
            system_prompt="You are a conversation management agent."
        )

        # Create meta agent
        meta_agent = MetaAgent(
            provider_service=provider_service,
            system_prompt="You are a meta agent that coordinates specialized agents."
        )

        # Add subsystems
        meta_agent.add_subsystem(AgentSubsystem(
            name="research",
            agent=research_agent,
            role="Knowledge retrieval specialist"
        ))

        meta_agent.add_subsystem(AgentSubsystem(
            name="conversation",
            agent=conversation_agent,
            role="Conversation flow manager"
        ))

        # Return the setup
        return {
            "meta_agent": meta_agent,
            "provider_service": provider_service,
            "knowledge_service": knowledge_service,
            "research_agent": research_agent,
            "conversation_agent": conversation_agent
        }

    @pytest.mark.asyncio
    async def test_meta_agent_routing(self, meta_agent_setup):
        """Test the meta agent's routing logic."""
        meta_agent = meta_agent_setup["meta_agent"]
        provider_service = meta_agent_setup["provider_service"]

        # Setup conversation history
        meta_agent.state.conversation_history = [
            Message(role=MessageRole.SYSTEM, content="You are a meta agent."),
            Message(role=MessageRole.USER, content="Tell me about quantum computing")
        ]

        # Mock the routing response to use research subsystem
        routing_response = {
            "message": {
                "content": "I'll route this to the research subsystem"
            },
            "tool_calls": [{
                "id": "call_123",
                "function": {
                    "name": "route_to_subsystem",
                    "arguments": json.dumps({
                        "subsystem": "research",
                        "task": "Tell me about quantum computing",
                        "context": {}
                    })
                }
            }]
        }

        # Mock the research agent's response
        research_response = "Quantum computing is a type of computing that uses quantum-mechanical phenomena, such as superposition and entanglement, to perform operations on data."
        meta_agent_setup["research_agent"].process_message = AsyncMock(return_value=research_response)

        # Mock the provider service responses
        provider_service.generate_completion.side_effect = [
            routing_response,  # First call for routing decision
        ]

        # Generate response
        response = await meta_agent._generate_response("user123")

        # Verify routing happened correctly
        assert "[research" in response
        assert "Quantum computing" in response

        # Verify the research agent was called
        meta_agent_setup["research_agent"].process_message.assert_called_once_with(
            "Tell me about quantum computing", "user123"
        )

    @pytest.mark.asyncio
    async def test_meta_agent_parallel_processing(self, meta_agent_setup):
        """Test the meta agent's parallel processing logic."""
        meta_agent = meta_agent_setup["meta_agent"]
        provider_service = meta_agent_setup["provider_service"]

        # Setup conversation history
        meta_agent.state.conversation_history = [
            Message(role=MessageRole.SYSTEM, content="You are a meta agent."),
            Message(role=MessageRole.USER, content="Explain the impacts of AI on society")
        ]

        # Mock the routing response to use parallel processing
        routing_response = {
            "message": {
                "content": "I'll process this with multiple subsystems"
            },
            "tool_calls": [{
                "id": "call_456",
                "function": {
                    "name": "parallel_processing",
                    "arguments": json.dumps({
                        "task": "Explain the impacts of AI on society",
                        "subsystems": ["research", "conversation"]
                    })
                }
            }]
        }

        # Mock each agent's response
        research_response = "From a research perspective, AI impacts society through automation, economic transformation, and ethical considerations."
        conversation_response = "From a conversational perspective, AI is changing how we interact with technology and each other."

        meta_agent_setup["research_agent"].process_message = AsyncMock(return_value=research_response)
        meta_agent_setup["conversation_agent"].process_message = AsyncMock(return_value=conversation_response)

        # Mock synthesis response
        synthesis_response = {
            "message": {
                "content": "AI has multifaceted impacts on society. From a research perspective, it drives automation and economic transformation. From a conversational perspective, it changes human-technology interaction patterns."
            }
        }

        # Mock the provider service responses
        provider_service.generate_completion.side_effect = [
            routing_response,   # First call for routing decision
            synthesis_response  # Second call for synthesis
        ]

        # Generate response
        response = await meta_agent._generate_response("user123")

        # Verify synthesis happened correctly
        assert "multifaceted impacts" in response
        assert provider_service.generate_completion.call_count == 2

        # Verify both agents were called
        meta_agent_setup["research_agent"].process_message.assert_called_once()
        meta_agent_setup["conversation_agent"].process_message.assert_called_once()

    @pytest.mark.asyncio
    async def test_research_agent_knowledge_retrieval(self, meta_agent_setup):
        """Test the research agent's knowledge retrieval capabilities."""
        research_agent = meta_agent_setup["research_agent"]
        provider_service = meta_agent_setup["provider_service"]
        knowledge_service = meta_agent_setup["knowledge_service"]

        # Setup conversation history
        research_agent.state.conversation_history = [
            Message(role=MessageRole.SYSTEM, content="You are a research agent."),
            Message(role=MessageRole.USER, content="What are the latest developments in fusion energy?")
        ]

        # Mock knowledge retrieval results
        knowledge_service.search.return_value = [
            {
                "id": "doc1",
                "title": "Recent Fusion Breakthrough",
                "content": "Scientists achieved net energy gain in fusion reaction at NIF in December 2022.",
                "relevance_score": 0.95
            },
            {
                "id": "doc2",
                "title": "Commercial Fusion Startups",
                "content": "Several startups including Commonwealth Fusion Systems are working on commercial fusion reactors.",
                "relevance_score": 0.89
            }
        ]

        # Mock initial response with tool calls
        tool_call_response = {
            "message": {
                "content": "Let me search for information on fusion energy."
            },
            "tool_calls": [{
                "id": "call_789",
                "function": {
                    "name": "search_knowledge_base",
                    "arguments": json.dumps({
                        "query": "latest developments fusion energy",
                        "max_results": 3
                    })
                }
            }]
        }

        # Mock final response with knowledge incorporated
        final_response = {
            "message": {
                "content": "Recent developments in fusion energy include a breakthrough at NIF in December 2022 achieving net energy gain, and advances from startups like Commonwealth Fusion Systems working on commercial reactors."
            }
        }

        # Mock the provider service responses
        provider_service.generate_completion.side_effect = [
            tool_call_response,  # First call with tool request
            final_response       # Second call with knowledge incorporated
        ]

        # Generate response
        response = await research_agent._generate_response("user123")

        # Verify response includes knowledge
        assert "NIF" in response
        assert "Commonwealth Fusion Systems" in response

        # Verify knowledge service was called
        knowledge_service.search.assert_called_once_with(
            query="latest developments fusion energy",
            max_results=3
        )

Cross-Provider Integration Testing

python
# tests/integration/test_cross_provider.py
import os

import pytest
import pytest_asyncio

from app.services.provider_service import ProviderService, Provider
from app.services.ollama_service import OllamaService


class TestCrossProviderIntegration:
    # In pytest-asyncio's strict mode, async fixtures must be declared
    # with pytest_asyncio.fixture rather than pytest.fixture.
    @pytest_asyncio.fixture
    async def real_services(self):
        """Set up real services for integration testing."""
        # Skip tests if the API key isn't available in the environment
        if not os.environ.get("OPENAI_API_KEY"):
            pytest.skip("OPENAI_API_KEY environment variable not set")

        # Create real services
        ollama_service = OllamaService()
        provider_service = ProviderService()

        # Initialize the services; skip if either backend is unreachable
        try:
            await ollama_service.initialize()
            await provider_service.initialize()
        except Exception as e:
            pytest.skip(f"Failed to initialize services: {e}")

        yield {
            "ollama_service": ollama_service,
            "provider_service": provider_service
        }

        # Cleanup
        await ollama_service.cleanup()
        await provider_service.cleanup()

    @pytest.mark.asyncio
    async def test_provider_selection_complex_query(self, real_services):
        """Test that complex queries route to OpenAI."""
        provider_service = real_services["provider_service"]

        # Adjust complexity threshold to ensure predictable routing
        provider_service.model_selection_criteria.complexity_threshold = 0.5

        # Complex query that should route to OpenAI
        complex_messages = [
            {"role": "user", "content": "Provide a detailed analysis of the philosophical implications of artificial general intelligence, considering perspectives from epistemology, ethics, and metaphysics."}
        ]

        # Select provider
        provider, model = await provider_service._select_provider_and_model(
            messages=complex_messages,
            provider="auto"
        )

        # Verify routing decision
        assert provider == Provider.OPENAI

    @pytest.mark.asyncio
    async def test_provider_selection_simple_query(self, real_services):
        """Test that simple queries route to Ollama."""
        provider_service = real_services["provider_service"]

        # Adjust complexity threshold to ensure predictable routing
        provider_service.model_selection_criteria.complexity_threshold = 0.5

        # Simple query that should route to Ollama
        simple_messages = [
            {"role": "user", "content": "What's the weather like today?"}
        ]

        # Select provider
        provider, model = await provider_service._select_provider_and_model(
            messages=simple_messages,
            provider="auto"
        )

        # Verify routing decision
        assert provider == Provider.OLLAMA

    @pytest.mark.asyncio
    async def test_fallback_mechanism_real(self, real_services):
        """Test the fallback mechanism with real services."""
        provider_service = real_services["provider_service"]

        messages = [
            {"role": "user", "content": "Simple test message"}
        ]

        # Intentionally cause OpenAI to fail by using an invalid model.
        # This should fail with OpenAI but succeed with the Ollama fallback.
        try:
            response = await provider_service.generate_completion(
                messages=messages,
                model="openai:non-existent-model",  # Invalid model
                provider="auto"  # Enable auto-fallback
            )
        except Exception as e:
            pytest.fail(f"Fallback mechanism failed: {e}")

        # Assertions live outside the try block so that an assertion
        # failure is not reported as a fallback failure.
        assert response["provider"] == "ollama"
        assert "content" in response["message"]

    @pytest.mark.asyncio
    async def test_ollama_response_format(self, real_services):
        """Test that Ollama responses are formatted to match OpenAI's structure."""
        ollama_service = real_services["ollama_service"]

        # Generate a basic response
        messages = [
            {"role": "user", "content": "What is 2+2?"}
        ]

        response = await ollama_service.generate_completion(
            messages=messages,
            model="llama2"  # Must already be pulled into the local Ollama instance
        )

        # Verify response structure matches expected format
        assert "id" in response
        assert "object" in response
        assert "model" in response
        assert "usage" in response
        assert "message" in response
        assert "content" in response["message"]
        assert response["provider"] == "ollama"
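
The last test asserts that an Ollama reply carries the same top-level keys as an OpenAI-style response. How `OllamaService` performs that mapping is not shown in this section, so the following is only a sketch under stated assumptions: the input keys (`model`, `message`, `prompt_eval_count`, `eval_count`) follow Ollama's chat API, while the function name `normalize_ollama_response` and the output layout are hypothetical and chosen to satisfy the assertions above.

```python
import time
import uuid


def normalize_ollama_response(raw: dict) -> dict:
    """Map an Ollama chat payload onto an OpenAI-style dictionary (hypothetical helper)."""
    prompt_tokens = raw.get("prompt_eval_count", 0)
    completion_tokens = raw.get("eval_count", 0)
    message = raw.get("message", {})
    return {
        # Ollama does not return a completion id, so one is generated locally.
        "id": f"ollama-{uuid.uuid4().hex}",
        "object": "chat.completion",
        "created": int(time.time()),
        "model": raw.get("model", "unknown"),
        "provider": "ollama",
        "message": {
            "role": message.get("role", "assistant"),
            "content": message.get("content", ""),
        },
        "usage": {
            "prompt_tokens": prompt_tokens,
            "completion_tokens": completion_tokens,
            "total_tokens": prompt_tokens + completion_tokens,
        },
    }


sample = {
    "model": "llama2",
    "message": {"role": "assistant", "content": "4"},
    "prompt_eval_count": 12,
    "eval_count": 3,
}
normalized = normalize_ollama_response(sample)
```

Keeping the mapping in one pure function makes the format contract testable without a running Ollama server.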

3. Performance Testing Framework

Response Latency Benchmarking

python
# tests/performance/test_latency.py
import asyncio
import os
import time

import matplotlib
matplotlib.use("Agg")  # Render charts to files; no display is required
import matplotlib.pyplot as plt
import pandas as pd
import pytest
import pytest_asyncio

from app.services.provider_service import ProviderService
from app.services.ollama_service import OllamaService

# Skip tests in CI environments
SKIP_PERFORMANCE_TESTS = os.environ.get("CI") == "true"


@pytest.mark.skipif(SKIP_PERFORMANCE_TESTS, reason="Performance tests skipped in CI environment")
class TestResponseLatency:
    @pytest_asyncio.fixture
    async def services(self):
        """Set up services for latency testing."""
        if not os.environ.get("OPENAI_API_KEY"):
            pytest.skip("OPENAI_API_KEY environment variable not set")

        # Initialize services
        ollama_service = OllamaService()
        provider_service = ProviderService()

        try:
            await ollama_service.initialize()
            await provider_service.initialize()
        except Exception as e:
            pytest.skip(f"Failed to initialize services: {e}")

        yield {
            "ollama_service": ollama_service,
            "provider_service": provider_service
        }

        # Cleanup
        await ollama_service.cleanup()
        await provider_service.cleanup()

    async def measure_latency(self, provider_service, provider, model, messages):
        """Measure response latency for a given provider and model."""
        # perf_counter is monotonic, so clock adjustments cannot skew the result
        start_time = time.perf_counter()

        if provider == "openai":
            await provider_service._generate_openai_completion(
                messages=messages,
                model=model
            )
        else:  # ollama
            await provider_service._generate_ollama_completion(
                messages=messages,
                model=model
            )

        return time.perf_counter() - start_time

    @pytest.mark.asyncio
    async def test_latency_comparison(self, services):
        """Compare latency between OpenAI and Ollama for different query types."""
        provider_service = services["provider_service"]

        # Test messages of different complexity
        test_messages = [
            {
                "name": "simple_factual",
                "messages": [{"role": "user", "content": "What is the capital of France?"}]
            },
            {
                "name": "medium_explanation",
                "messages": [{"role": "user", "content": "Explain how photosynthesis works in plants."}]
            },
            {
                "name": "complex_analysis",
                "messages": [{"role": "user", "content": "Analyze the economic factors that contributed to the 2008 financial crisis and their long-term impacts."}]
            }
        ]

        # Models to test
        models = {
            "openai": ["gpt-3.5-turbo", "gpt-4"],
            "ollama": ["llama2", "mistral"]
        }

        # Number of repetitions for each test
        repetitions = 3

        # Collect results
        results = []

        for message_type in test_messages:
            for provider in models:
                for model in models[provider]:
                    for i in range(repetitions):
                        try:
                            latency = await self.measure_latency(
                                provider_service,
                                provider,
                                model,
                                message_type["messages"]
                            )

                            results.append({
                                "provider": provider,
                                "model": model,
                                "message_type": message_type["name"],
                                "repetition": i,
                                "latency": latency
                            })

                            # Add a small delay to avoid rate limits
                            await asyncio.sleep(1)
                        except Exception as e:
                            print(f"Error testing {provider}:{model} - {e}")

        # Fail early: the analysis below needs at least one measurement
        assert len(results) > 0, "No benchmark results collected"

        # Analyze results
        df = pd.DataFrame(results)

        # Calculate average latency by provider, model, and message type
        avg_latency = df.groupby(['provider', 'model', 'message_type'])['latency'].mean().reset_index()

        # Generate summary statistics
        summary = avg_latency.pivot_table(
            index=['provider', 'model'],
            columns='message_type',
            values='latency'
        ).reset_index()

        # Print summary
        print("\nLatency Benchmark Results (seconds):")
        print(summary)

        # Create visualization: one panel per message type
        plt.figure(figsize=(12, 8))

        for position, message_type in enumerate(test_messages, start=1):
            subset = avg_latency[avg_latency['message_type'] == message_type['name']]
            x = range(len(subset))
            labels = [f"{row['provider']}\n{row['model']}" for _, row in subset.iterrows()]

            plt.subplot(1, len(test_messages), position)
            plt.bar(x, subset['latency'])
            plt.xticks(x, labels, rotation=45)
            plt.title(f"Latency: {message_type['name']}")
            plt.ylabel("Seconds")

        plt.tight_layout()
        plt.savefig('latency_benchmark.png')
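
The benchmark above reports only the mean of a few repetitions, which a single slow call can distort. As a complement, a small summary helper can report the median and a tail percentile instead. This is a sketch using only the standard library; the name `summarize_latencies` is hypothetical and not part of the test suite above.

```python
import statistics
from typing import Dict, List


def summarize_latencies(samples: List[float]) -> Dict[str, float]:
    """Summarize latency samples (seconds) with order statistics."""
    if len(samples) < 2:
        raise ValueError("At least two samples are needed for percentiles")
    ordered = sorted(samples)
    # quantiles(n=100) returns the 99 cut points between percentiles;
    # index 94 is the 95th percentile.
    cut_points = statistics.quantiles(ordered, n=100, method="inclusive")
    return {
        "min": ordered[0],
        "median": statistics.median(ordered),
        "p95": cut_points[94],
        "max": ordered[-1],
    }


summary = summarize_latencies([float(value) for value in range(1, 102)])
```

With only three repetitions per configuration the 95th percentile is close to the maximum, so a tail statistic like this is only informative once the repetition count is raised.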

Memory Usage Monitoring

python
# tests/performance/test_memory_usage.py
import asyncio
import os
import time

import matplotlib
matplotlib.use("Agg")  # Render charts to files; no display is required
import matplotlib.pyplot as plt
import pandas as pd
import psutil
import pytest
import pytest_asyncio

from app.services.provider_service import ProviderService
from app.services.ollama_service import OllamaService

# Skip tests in CI environments
SKIP_PERFORMANCE_TESTS = os.environ.get("CI") == "true"


@pytest.mark.skipif(SKIP_PERFORMANCE_TESTS, reason="Performance tests skipped in CI environment")
class TestMemoryUsage:
    @pytest_asyncio.fixture
    async def services(self):
        """Set up services for memory testing."""
        if not os.environ.get("OPENAI_API_KEY"):
            pytest.skip("OPENAI_API_KEY environment variable not set")

        # Initialize services
        ollama_service = OllamaService()
        provider_service = ProviderService()

        try:
            await ollama_service.initialize()
            await provider_service.initialize()
        except Exception as e:
            pytest.skip(f"Failed to initialize services: {e}")

        yield {
            "ollama_service": ollama_service,
            "provider_service": provider_service
        }

        # Cleanup
        await ollama_service.cleanup()
        await provider_service.cleanup()

    def get_memory_usage(self):
        """Get current memory usage (RSS) of the test process in MB."""
        # Note: this covers the test process only. Memory held by a model
        # server running as a separate process is not included.
        process = psutil.Process(os.getpid())
        memory_info = process.memory_info()
        return memory_info.rss / (1024 * 1024)  # Convert to MB

    async def monitor_memory_during_request(self, provider_service, provider, model, messages):
        """Monitor memory usage during a request."""
        memory_samples = []
        stop_monitoring = asyncio.Event()

        async def memory_monitor():
            monitor_start = time.perf_counter()
            while not stop_monitoring.is_set():
                memory_samples.append({
                    "time": time.perf_counter() - monitor_start,
                    "memory_mb": self.get_memory_usage()
                })
                await asyncio.sleep(0.1)  # Sample every 100ms

        # Start the background sampling task
        monitor_task = asyncio.create_task(memory_monitor())

        # Make the request
        start_time = time.perf_counter()
        try:
            if provider == "openai":
                await provider_service._generate_openai_completion(
                    messages=messages,
                    model=model
                )
            else:  # ollama
                await provider_service._generate_ollama_completion(
                    messages=messages,
                    model=model
                )
        finally:
            end_time = time.perf_counter()
            # Always stop the sampler, even if the request raised
            stop_monitoring.set()
            await monitor_task

        return {
            "samples": memory_samples,
            "duration": end_time - start_time,
            "peak_memory": max(sample["memory_mb"] for sample in memory_samples) if memory_samples else 0,
            "mean_memory": sum(sample["memory_mb"] for sample in memory_samples) / len(memory_samples) if memory_samples else 0
        }

    @pytest.mark.asyncio
    async def test_memory_usage_comparison(self, services):
        """Compare memory usage between OpenAI and Ollama."""
        provider_service = services["provider_service"]

        # Test message
        test_message = {"role": "user", "content": "Write a detailed essay about climate change and its global impact."}

        # Models to test
        models = {
            "openai": ["gpt-3.5-turbo"],
            "ollama": ["llama2"]
        }

        # Collect results
        results = []
        memory_data = {}

        for provider in models:
            for model in models[provider]:
                # Collect initial memory
                initial_memory = self.get_memory_usage()

                # Monitor during request
                memory_result = await self.monitor_memory_during_request(
                    provider_service,
                    provider,
                    model,
                    [test_message]
                )

                # Store results
                key = f"{provider}:{model}"
                memory_data[key] = memory_result["samples"]

                results.append({
                    "provider": provider,
                    "model": model,
                    "initial_memory_mb": initial_memory,
                    "peak_memory_mb": memory_result["peak_memory"],
                    "mean_memory_mb": memory_result["mean_memory"],
                    "memory_increase_mb": memory_result["peak_memory"] - initial_memory,
                    "duration_seconds": memory_result["duration"]
                })

                # Wait a bit to let memory stabilize
                await asyncio.sleep(2)

        # Fail early: the analysis below needs at least one measurement
        assert len(results) > 0, "No memory benchmark results collected"

        # Analyze results
        df = pd.DataFrame(results)

        # Print summary
        print("\nMemory Usage Results:")
        print(df.to_string(index=False))

        # Create visualization
        plt.figure(figsize=(15, 10))

        # Plot memory over time
        plt.subplot(2, 1, 1)
        for key, samples in memory_data.items():
            times = [s["time"] for s in samples]
            memory = [s["memory_mb"] for s in samples]
            plt.plot(times, memory, label=key)

        plt.xlabel("Time (seconds)")
        plt.ylabel("Memory Usage (MB)")
        plt.title("Memory Usage Over Time During Request")
        plt.legend()
        plt.grid(True)

        # Plot memory increase per provider/model
        plt.subplot(2, 1, 2)
        labels = [f"{p}\n{m}" for p, m in zip(df["provider"], df["model"])]
        x = range(len(labels))

        plt.bar(x, df["memory_increase_mb"], label="Memory Increase")
        plt.xticks(x, labels)
        plt.ylabel("Memory (MB)")
        plt.title("Memory Increase by Provider/Model")
        plt.legend()
        plt.grid(True)

        plt.tight_layout()
        plt.savefig('memory_benchmark.png')
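
The RSS figure sampled above covers everything in the test process, including interpreter and library overhead. When the question is how much the Python client code itself allocates, the standard library's `tracemalloc` gives a narrower view. The sketch below is an illustration only; `measure_python_allocations` is a hypothetical helper and it wraps a synchronous callable for brevity.

```python
import tracemalloc
from typing import Any, Callable, Tuple

BYTES_PER_MB = 1024 * 1024


def measure_python_allocations(func: Callable[..., Any], *args: Any, **kwargs: Any) -> Tuple[Any, float, float]:
    """Run func and return (result, current_mb, peak_mb) for Python-level allocations."""
    tracemalloc.start()
    try:
        result = func(*args, **kwargs)
        # get_traced_memory returns (current, peak) in bytes since start()
        current_bytes, peak_bytes = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, current_bytes / BYTES_PER_MB, peak_bytes / BYTES_PER_MB


# Example workload: building a list of one million integers
result, current_mb, peak_mb = measure_python_allocations(lambda: [0] * 1_000_000)
```

Neither measurement captures memory used by a model server running in a separate process, so figures from the two providers are not directly comparable.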

Response Quality Benchmarking

python
# tests/performance/test_response_quality.py
import asyncio
import json
import os

import matplotlib
matplotlib.use("Agg")  # Render charts to files; no display is required
import matplotlib.pyplot as plt
import pandas as pd
import pytest
import pytest_asyncio

from app.services.provider_service import ProviderService
from app.services.ollama_service import OllamaService

# Skip tests in CI environments
SKIP_PERFORMANCE_TESTS = os.environ.get("CI") == "true"


@pytest.mark.skipif(SKIP_PERFORMANCE_TESTS, reason="Performance tests skipped in CI environment")
class TestResponseQuality:
    @pytest_asyncio.fixture
    async def services(self):
        """Set up services for quality testing."""
        if not os.environ.get("OPENAI_API_KEY"):
            pytest.skip("OPENAI_API_KEY environment variable not set")

        # Initialize services
        ollama_service = OllamaService()
        provider_service = ProviderService()

        try:
            await ollama_service.initialize()
            await provider_service.initialize()
        except Exception as e:
            pytest.skip(f"Failed to initialize services: {e}")

        yield {
            "ollama_service": ollama_service,
            "provider_service": provider_service
        }

        # Cleanup
        await ollama_service.cleanup()
        await provider_service.cleanup()

    async def get_response(self, provider_service, provider, model, messages):
        """Get a response from a specific provider and model."""
        if provider == "openai":
            response = await provider_service._generate_openai_completion(
                messages=messages,
                model=model
            )
        else:  # ollama
            response = await provider_service._generate_ollama_completion(
                messages=messages,
                model=model
            )

        return response["message"]["content"]

    async def evaluate_response(self, provider_service, response, criteria):
        """Evaluate a response using GPT-4 as a judge."""
        evaluation_prompt = [
            {"role": "system", "content": """
            You are an expert evaluator of AI responses. Evaluate the given response based on the specified criteria.
            For each criterion, provide a score from 1-10 and a brief explanation.
            Format your response as valid JSON with the following structure:
            {
                "criteria": {
                    "accuracy": {"score": X, "explanation": "..."},
                    "completeness": {"score": X, "explanation": "..."},
                    "coherence": {"score": X, "explanation": "..."},
                    "relevance": {"score": X, "explanation": "..."}
                },
                "overall_score": X,
                "summary": "..."
            }
            """},
            {"role": "user", "content": f"""
            Evaluate this AI response based on {', '.join(criteria)}:

            RESPONSE TO EVALUATE:
            {response}
            """}
        ]

        # Use GPT-4 to evaluate. JSON mode is only accepted by models that
        # support response_format; switch the judge model if this call is rejected.
        evaluation = await provider_service._generate_openai_completion(
            messages=evaluation_prompt,
            model="gpt-4",
            response_format={"type": "json_object"}
        )

        try:
            return json.loads(evaluation["message"]["content"])
        except (json.JSONDecodeError, KeyError, TypeError):
            # Fallback if parsing fails
            return {
                "criteria": {c: {"score": 0, "explanation": "Failed to parse"} for c in criteria},
                "overall_score": 0,
                "summary": "Failed to parse evaluation"
            }

    @pytest.mark.asyncio
    async def test_response_quality_comparison(self, services):
        """Compare response quality between OpenAI and Ollama models."""
        provider_service = services["provider_service"]

        # Test scenarios
        test_scenarios = [
            {
                "name": "factual_knowledge",
                "query": "Explain the process of photosynthesis and its importance to life on Earth."
            },
            {
                "name": "reasoning",
                "query": "A bat and ball cost $1.10 in total. The bat costs $1.00 more than the ball. How much does the ball cost?"
            },
            {
                "name": "creative_writing",
                "query": "Write a short story about a robot discovering emotions."
            },
            {
                "name": "code_generation",
                "query": "Write a Python function to check if a string is a palindrome."
            }
        ]

        # Models to test
        models = {
            "openai": ["gpt-3.5-turbo"],
            "ollama": ["llama2", "mistral"]
        }

        # Evaluation criteria
        criteria = ["accuracy", "completeness", "coherence", "relevance"]

        # Collect results
        results = []

        for scenario in test_scenarios:
            for provider in models:
                for model in models[provider]:
                    try:
                        # Get response
                        response = await self.get_response(
                            provider_service,
                            provider,
                            model,
                            [{"role": "user", "content": scenario["query"]}]
                        )

                        # Evaluate response
                        evaluation = await self.evaluate_response(
                            provider_service,
                            response,
                            criteria
                        )

                        # Store results
                        results.append({
                            "scenario": scenario["name"],
                            "provider": provider,
                            "model": model,
                            "overall_score": evaluation["overall_score"],
                            **{f"{criterion}_score": evaluation["criteria"][criterion]["score"]
                               for criterion in criteria}
                        })

                        # Save raw responses for detailed analysis
                        with open(f"response_{provider}_{model}_{scenario['name']}.txt", "w") as f:
                            f.write(response)

                        # Add a delay to avoid rate limits
                        await asyncio.sleep(2)
                    except Exception as e:
                        print(f"Error evaluating {provider}:{model} on {scenario['name']}: {e}")

        # Fail early: the analysis below needs at least one result
        assert len(results) > 0, "No quality benchmark results collected"

        # Analyze results
        df = pd.DataFrame(results)

        # Save results
        df.to_csv("quality_benchmark_results.csv", index=False)

        # Print summary; numeric_only skips the text "scenario" column
        print("\nResponse Quality Results:")
        summary = df.groupby(['provider', 'model']).mean(numeric_only=True).reset_index()
        print(summary.to_string(index=False))

        # Create visualization: one panel per scenario in a 2x2 grid
        plt.figure(figsize=(15, 10))

        for i, scenario in enumerate(test_scenarios):
            scenario_df = df[df['scenario'] == scenario['name']]
            labels = [f"{p}\n{m}" for p, m in zip(scenario_df["provider"], scenario_df["model"])]

            plt.subplot(2, 2, i + 1)
            plt.bar(labels, scenario_df["overall_score"])
            plt.title(f"Quality Scores: {scenario['name']}")
            plt.ylabel("Score (1-10)")
            plt.ylim(0, 10)
            plt.xticks(rotation=45)

        plt.tight_layout()
        plt.savefig('quality_benchmark.png')
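
The judge step above falls back to zero scores whenever `json.loads` fails, which silently drags down averages if the judge wraps its JSON in prose or a code fence. A more tolerant parser can extract the outermost JSON object and validate the score range before accepting it. This is a sketch under assumptions: `parse_judge_output` is a hypothetical helper, and the expected layout is the one requested in the system prompt above.

```python
import json
from typing import Dict, List, Optional


def parse_judge_output(raw: str, criteria: List[str]) -> Optional[Dict[str, float]]:
    """Return per-criterion scores, or None if the output is unusable."""
    # Take the outermost {...} span so surrounding prose or fences are ignored.
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        data = json.loads(raw[start:end + 1])
    except json.JSONDecodeError:
        return None

    criteria_block = data.get("criteria")
    if not isinstance(criteria_block, dict):
        return None

    scores: Dict[str, float] = {}
    for name in criteria:
        entry = criteria_block.get(name)
        score = entry.get("score") if isinstance(entry, dict) else None
        # Reject missing, non-numeric, or out-of-range scores.
        if not isinstance(score, (int, float)) or not 1 <= score <= 10:
            return None
        scores[name] = float(score)
    return scores


wrapped = 'Here is my evaluation:\n{"criteria": {"accuracy": {"score": 8, "explanation": "ok"}}, "overall_score": 8}\nDone.'
parsed = parse_judge_output(wrapped, ["accuracy"])
```

Returning `None` instead of zeros lets the caller drop or retry a failed evaluation, so parse failures do not masquerade as low-quality answers.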

4. Reliability Testing Framework

Error Handling and Fallback Testing

python
# tests/reliability/test_error_handling.py
import asyncio
from unittest.mock import AsyncMock, MagicMock

import aiohttp
import openai
import pytest

from app.services.provider_service import ProviderService, Provider
from app.services.ollama_service import OllamaService


class TestErrorHandling:
    @pytest.fixture
    def provider_service(self):
        """Create a provider service with mocked dependencies for testing."""
        service = ProviderService()
        service.openai_client = AsyncMock()
        service.ollama_service = AsyncMock(spec=OllamaService)
        return service

    @pytest.mark.asyncio
    async def test_openai_connection_error(self, provider_service):
        """Test handling of OpenAI connection errors."""
        # Pin auto routing to OpenAI so the failure path is deterministic
        # (mirrors the Ollama test below)
        provider_service._auto_route = AsyncMock(return_value=Provider.OPENAI)

        # Mock OpenAI to raise a connection error
        provider_service._generate_openai_completion = AsyncMock(
            side_effect=aiohttp.ClientConnectionError("Connection refused")
        )

        # Mock Ollama to succeed
        provider_service._generate_ollama_completion = AsyncMock(return_value={
            "id": "ollama-fallback",
            "provider": "ollama",
            "message": {"content": "Fallback response"}
        })

        # Test with auto routing
        response = await provider_service.generate_completion(
            messages=[{"role": "user", "content": "Test message"}],
            provider="auto"
        )

        # Verify fallback worked
        assert response["provider"] == "ollama"
        assert response["message"]["content"] == "Fallback response"
        provider_service._generate_openai_completion.assert_called_once()
        provider_service._generate_ollama_completion.assert_called_once()

    @pytest.mark.asyncio
    async def test_ollama_connection_error(self, provider_service):
        """Test handling of Ollama connection errors."""
        # Mock the auto routing to select Ollama first
        provider_service._auto_route = AsyncMock(return_value=Provider.OLLAMA)

        # Mock Ollama to fail
        provider_service._generate_ollama_completion = AsyncMock(
            side_effect=aiohttp.ClientConnectionError("Connection refused")
        )

        # Mock OpenAI to succeed
        provider_service._generate_openai_completion = AsyncMock(return_value={
            "id": "openai-fallback",
            "provider": "openai",
            "message": {"content": "Fallback response"}
        })

        # Test with auto routing
        response = await provider_service.generate_completion(
            messages=[{"role": "user", "content": "Test message"}],
            provider="auto"
        )

        # Verify fallback worked
        assert response["provider"] == "openai"
        assert response["message"]["content"] == "Fallback response"
        provider_service._generate_ollama_completion.assert_called_once()
        provider_service._generate_openai_completion.assert_called_once()

    @pytest.mark.asyncio
    async def test_rate_limit_handling(self, provider_service):
        """Test handling of rate limit errors."""
        provider_service._auto_route = AsyncMock(return_value=Provider.OPENAI)

        # Build a fake HTTP 429 response for the rate limit error
        rate_limit_response = MagicMock()
        rate_limit_response.status_code = 429
        rate_limit_response.json.return_value = {"error": {"message": "Rate limit exceeded"}}

        provider_service._generate_openai_completion = AsyncMock(
            side_effect=openai.RateLimitError(
                "Rate limit exceeded",
                response=rate_limit_response,
                body=None  # Assumes the openai 1.x client, where body is a required keyword
            )
        )

        # Mock Ollama to succeed
        provider_service._generate_ollama_completion = AsyncMock(return_value={
            "id": "ollama-fallback",
            "provider": "ollama",
            "message": {"content": "Fallback response"}
        })

        # Test with auto routing
        response = await provider_service.generate_completion(
            messages=[{"role": "user", "content": "Test message"}],
            provider="auto"
        )

        # Verify fallback worked
        assert response["provider"] == "ollama"
        assert response["message"]["content"] == "Fallback response"

    @pytest.mark.asyncio
    async def test_timeout_handling(self, provider_service):
        """Test handling of timeout errors."""
        provider_service._auto_route = AsyncMock(return_value=Provider.OPENAI)

        # Mock OpenAI to raise a timeout error
        provider_service._generate_openai_completion = AsyncMock(
            side_effect=asyncio.TimeoutError("Request timed out")
        )

        # Mock Ollama to succeed
        provider_service._generate_ollama_completion = AsyncMock(return_value={
            "id": "ollama-fallback",
            "provider": "ollama",
            "message": {"content": "Fallback response"}
        })

        # Test with auto routing
        response = await provider_service.generate_completion(
            messages=[{"role": "user", "content": "Test message"}],
            provider="auto"
        )

        # Verify fallback worked
        assert response["provider"] == "ollama"
        assert response["message"]["content"] == "Fallback response"

    @pytest.mark.asyncio
    async def test_all_providers_fail(self, provider_service):
        """Test case when all providers fail."""
        # Route to OpenAI first so that its error is the "original" one
        provider_service._auto_route = AsyncMock(return_value=Provider.OPENAI)

        # Mock both providers to fail
        provider_service._generate_openai_completion = AsyncMock(
            side_effect=Exception("OpenAI failed")
        )

        provider_service._generate_ollama_completion = AsyncMock(
            side_effect=Exception("Ollama failed")
        )

        # Test with auto routing - should raise an exception
        with pytest.raises(Exception) as excinfo:
            await provider_service.generate_completion(
                messages=[{"role": "user", "content": "Test message"}],
                provider="auto"
            )

        # Verify the original exception is re-raised
        assert "OpenAI failed" in str(excinfo.value)
        provider_service._generate_openai_completion.assert_called_once()
        provider_service._generate_ollama_completion.assert_called_once()
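
Taken together, these tests pin down a contract: try the selected provider, fall back to the other on any error, and if both fail surface the first error. The `ProviderService` implementation is not shown in this section, so the following is only a minimal sketch of logic that would satisfy that contract. The function `complete_with_fallback` and its callable parameters are hypothetical.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List

Messages = List[Dict[str, str]]
Completion = Callable[[Messages], Awaitable[dict]]


async def complete_with_fallback(primary: Completion, secondary: Completion, messages: Messages) -> dict:
    """Try the primary provider, then the secondary; re-raise the primary error if both fail."""
    try:
        return await primary(messages)
    except Exception as primary_error:
        try:
            return await secondary(messages)
        except Exception as secondary_error:
            # Surface the original failure and keep the fallback failure as its cause.
            raise primary_error from secondary_error


async def failing_provider(messages: Messages) -> dict:
    raise ConnectionError("primary unavailable")


async def working_provider(messages: Messages) -> dict:
    return {"provider": "ollama", "message": {"content": "Fallback response"}}


fallback_result = asyncio.run(
    complete_with_fallback(failing_provider, working_provider, [{"role": "user", "content": "Test message"}])
)
```

Catching `Exception` rather than `BaseException` leaves task cancellation untouched, which matters when the call runs under a timeout.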

Load Testing

python
# tests/reliability/test_load.py
import asyncio
import os
import time

import matplotlib
matplotlib.use("Agg")  # Render charts to files; no display is required
import matplotlib.pyplot as plt
import pandas as pd
import pytest
import pytest_asyncio

from app.services.provider_service import ProviderService

# Skip tests in CI environments
SKIP_LOAD_TESTS = os.environ.get("CI") == "true"


@pytest.mark.skipif(SKIP_LOAD_TESTS, reason="Load tests skipped in CI environment")
class TestLoadHandling:
    @pytest_asyncio.fixture
    async def provider_service(self):
        """Set up provider service for load testing."""
        if not os.environ.get("OPENAI_API_KEY"):
            pytest.skip("OPENAI_API_KEY environment variable not set")

        # Initialize service
        service = ProviderService()

        try:
            await service.initialize()
        except Exception as e:
            pytest.skip(f"Failed to initialize service: {e}")

        yield service

        # Cleanup
        await service.cleanup()

    async def send_request(self, provider_service, provider, model, message, request_id):
        """Send a single request and record performance."""
        start_time = time.perf_counter()
        success = False
        error = None

        try:
            await provider_service.generate_completion(
                messages=[{"role": "user", "content": message}],
                provider=provider,
                model=model
            )
            success = True
        except Exception as e:
            error = str(e)

        end_time = time.perf_counter()

        return {
            "request_id": request_id,
            "provider": provider,
            "model": model,
            "success": success,
            "error": error,
            "duration": end_time - start_time
        }

    @pytest.mark.asyncio
    async def test_concurrent_requests(self, provider_service):
        """Test handling of multiple concurrent requests."""
        # Test configurations
        providers = ["openai", "ollama", "auto"]
        request_count = 10  # 10 requests per provider

        # Test message (simple to avoid rate limits)
        message = "What is 2+2?"

        # Create coroutines for all requests. They do not start running
        # until they are awaited by gather() below, so no delay is needed here.
        tasks = []
        request_id = 0

        for provider in providers:
            for _ in range(request_count):
                # Determine model based on provider
                if provider == "openai":
                    model = "gpt-3.5-turbo"
                elif provider == "ollama":
                    model = "llama2"
                else:
                    model = None  # Auto select

                tasks.append(self.send_request(
                    provider_service,
                    provider,
                    model,
                    message,
                    request_id
                ))
                request_id += 1

        # Run requests in batches to cap concurrency
        concurrency_limit = 5
        results = []

        for i in range(0, len(tasks), concurrency_limit):
            batch = tasks[i:i + concurrency_limit]
            batch_results = await asyncio.gather(*batch)
            results.extend(batch_results)

            # Delay between batches to avoid rate limits
            await asyncio.sleep(2)

        # Analyze results
        df = pd.DataFrame(results)

        # Print summary
        print("\nConcurrent Request Test Results:")
        success_rate = df.groupby('provider')['success'].mean() * 100
        mean_duration = df.groupby('provider')['duration'].mean()

        summary = pd.DataFrame({
            'success_rate': success_rate,
            'mean_duration': mean_duration
        }).reset_index()

        print(summary.to_string(index=False))

        # Create visualization
        plt.figure(figsize=(12, 10))

        # Plot success rate
        plt.subplot(2, 1, 1)
        plt.bar(summary['provider'], summary['success_rate'])
        plt.title('Success Rate by Provider')
        plt.ylabel('Success Rate (%)')
        plt.ylim(0, 100)

        # Plot response times
        plt.subplot(2, 1, 2)
        for provider in providers:
            provider_df = df[df['provider'] == provider]
            plt.plot(provider_df['request_id'], provider_df['duration'], marker='o', label=provider)

        plt.title('Response Time by Request')
        plt.xlabel('Request ID')
        plt.ylabel('Duration (seconds)')
        plt.legend()
        plt.grid(True)

        plt.tight_layout()
        plt.savefig('load_test_results.png')

        # Assert reasonable success rate
        for provider in providers:
            provider_success = df[df['provider'] == provider]['success'].mean() * 100
            assert provider_success >= 70, f"Success rate for {provider} is below 70%"
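
The load test above caps concurrency by running fixed batches, so every batch waits for its slowest request before the next one starts. An alternative is to gate requests with a semaphore, which keeps the configured number of requests in flight continuously. The sketch below uses only `asyncio`; `run_bounded` and the zero-argument "factory" convention are hypothetical and not part of the test suite.

```python
import asyncio
from typing import Awaitable, Callable, List, TypeVar

T = TypeVar("T")


async def run_bounded(factories: List[Callable[[], Awaitable[T]]], limit: int) -> List[T]:
    """Run the awaitables produced by factories with at most `limit` in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(factory: Callable[[], Awaitable[T]]) -> T:
        # A slot is taken before the request starts and released when it ends.
        async with semaphore:
            return await factory()

    # gather preserves input order, so results line up with factories.
    return await asyncio.gather(*(guarded(factory) for factory in factories))


async def simulated_request(request_id: int) -> int:
    await asyncio.sleep(0.01)  # Stand-in for a provider call
    return request_id


bounded_results = asyncio.run(
    run_bounded([lambda i=i: simulated_request(i) for i in range(10)], limit=3)
)
```

Passing factories rather than coroutine objects means each request is created only when a slot is free, so per-request timing starts at the moment the request is actually issued.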

Stability Testing for Extended Sessions

python
1# tests/reliability/test_stability.py
2import pytest
3import asyncio
4import time
5import os
6import random
7import pandas as pd
8import matplotlib.pyplot as plt
9from typing import List, Dict, Any
10
11from app.services.provider_service import ProviderService, Provider
12from app.agents.base_agent import BaseAgent, AgentState
13from app.agents.research_agent import ResearchAgent
14from app.models.message import Message, MessageRole
15
16# Skip tests if it's CI environment
17SKIP_STABILITY_TESTS = os.environ.get("CI") == "true"
18
19@pytest.mark.skipif(SKIP_STABILITY_TESTS, reason="Stability tests skipped in CI environment")
20class TestSystemStability:
21 @pytest.fixture
22 async def setup(self):
23 """Set up test environment with services and agents."""
24 if not os.environ.get("OPENAI_API_KEY"):
25 pytest.skip("OPENAI_API_KEY environment variable not set")
26
27 # Initialize service
28 provider_service = ProviderService()
29
30 try:
31 await provider_service.initialize()
32 except Exception as e:
33 pytest.skip(f"Failed to initialize service: {str(e)}")
34
35 # Create a test agent
36 agent = ResearchAgent(
37 provider_service=provider_service,
38 knowledge_service=None, # Mock would be better but we're testing stability
39 system_prompt="You are a helpful research assistant."
40 )
41
42 yield {
43 "provider_service": provider_service,
44 "agent": agent
45 }
46
47 # Cleanup
48 await provider_service.cleanup()
49
    async def run_conversation_turn(self, agent, message, turn_number):
        """Run a single conversation turn and record metrics."""
        start_time = time.time()
        success = False
        error = None
        memory_before = self.get_memory_usage()

        try:
            response = await agent.process_message(message, f"test_user_{turn_number}")
            success = True
        except Exception as e:
            error = str(e)
            response = None

        end_time = time.time()
        memory_after = self.get_memory_usage()

        return {
            "turn": turn_number,
            "success": success,
            "error": error,
            "duration": end_time - start_time,
            "memory_before": memory_before,
            "memory_after": memory_after,
            "memory_increase": memory_after - memory_before,
            "history_length": len(agent.state.conversation_history),
            "response_length": len(response) if response else 0
        }

    def get_memory_usage(self):
        """Get current memory usage in MB."""
        import psutil
        process = psutil.Process(os.getpid())
        memory_info = process.memory_info()
        return memory_info.rss / (1024 * 1024)  # Convert to MB

    @pytest.mark.asyncio
    async def test_extended_conversation(self, setup):
        """Test system stability over an extended conversation."""
        agent = setup["agent"]

        # List of test questions for the conversation
        questions = [
            "What is machine learning?",
            "Can you explain neural networks?",
            "What is the difference between supervised and unsupervised learning?",
            "How does reinforcement learning work?",
            "What are some applications of deep learning?",
            "Explain the concept of overfitting.",
            "What is transfer learning?",
            "How does backpropagation work?",
            "What are convolutional neural networks?",
            "Explain the transformer architecture.",
            "What is BERT and how does it work?",
            "What are GANs used for?",
            "Explain the concept of attention in neural networks.",
            "What is the difference between RNNs and LSTMs?",
            "How do recommendation systems work?"
        ]

        # Run an extended conversation
        results = []
        turn_limit = min(len(questions), 15)  # Limit to 15 turns for test duration

        for turn in range(turn_limit):
            # For later turns, occasionally refer to previous information
            if turn > 3 and random.random() < 0.3:
                message = f"Can you explain more about what you mentioned earlier regarding {random.choice(questions[:turn]).lower().replace('?', '')}"
            else:
                message = questions[turn]

            result = await self.run_conversation_turn(agent, message, turn)
            results.append(result)

            # Print progress
            status = "✓" if result["success"] else "✗"
            print(f"Turn {turn+1}/{turn_limit} {status} - Time: {result['duration']:.2f}s")

            # Delay between turns
            await asyncio.sleep(2)

        # Analyze results
        df = pd.DataFrame(results)

        # Print summary statistics
        print("\nExtended Conversation Test Results:")
        print(f"Success rate: {df['success'].mean()*100:.1f}%")
        print(f"Average response time: {df['duration'].mean():.2f}s")
        print(f"Final conversation history length: {df['history_length'].iloc[-1]}")
        print(f"Memory usage increase: {df['memory_after'].iloc[-1] - df['memory_before'].iloc[0]:.2f} MB")

        # Create visualization
        plt.figure(figsize=(15, 12))

        # Plot response times
        plt.subplot(3, 1, 1)
        plt.plot(df['turn'], df['duration'], marker='o')
        plt.title('Response Time by Conversation Turn')
        plt.xlabel('Turn')
        plt.ylabel('Duration (seconds)')
        plt.grid(True)

        # Plot memory usage
        plt.subplot(3, 1, 2)
        plt.plot(df['turn'], df['memory_after'], marker='o')
        plt.title('Memory Usage Over Conversation')
        plt.xlabel('Turn')
        plt.ylabel('Memory (MB)')
        plt.grid(True)

        # Plot history length and response length
        plt.subplot(3, 1, 3)
        plt.plot(df['turn'], df['history_length'], marker='o', label='History Length')
        plt.plot(df['turn'], df['response_length'], marker='x', label='Response Length')
        plt.title('Conversation Metrics')
        plt.xlabel('Turn')
        plt.ylabel('Length (chars/items)')
        plt.legend()
        plt.grid(True)

        plt.tight_layout()
        plt.savefig('stability_test_results.png')

        # Assert reasonable success rate
        assert df['success'].mean() >= 0.8, "Success rate below 80%"

        # Check for memory leaks (large, consistent growth would be concerning)
        memory_growth_rate = (df['memory_after'].iloc[-1] - df['memory_before'].iloc[0]) / turn_limit
        assert memory_growth_rate < 50, f"Excessive memory growth rate: {memory_growth_rate:.2f} MB/turn"
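
The leak check in the stability test divides the difference between the first and last memory readings by the number of turns, so a single noisy endpoint can skew the result. One alternative is a least-squares slope over all samples. The sketch below is a hypothetical helper (`memory_growth_slope` is not part of the original test suite) that uses only the standard library.

```python
from typing import Sequence


def memory_growth_slope(samples_mb: Sequence[float]) -> float:
    """Return the least-squares slope (MB per turn) of memory samples.

    Hypothetical helper: index i is the turn number and samples_mb[i] is the
    resident memory measured after that turn.
    """
    n = len(samples_mb)
    if n < 2:
        return 0.0  # A slope needs at least two points
    mean_x = (n - 1) / 2
    mean_y = sum(samples_mb) / n
    covariance = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(samples_mb))
    variance = sum((i - mean_x) ** 2 for i in range(n))
    return covariance / variance


if __name__ == "__main__":
    steady_leak = [100.0, 102.0, 104.0, 106.0]
    print(f"{memory_growth_slope(steady_leak):.2f} MB/turn")
```

In the test, this could stand in for the endpoint calculation, for example `memory_growth_slope(df['memory_after'].tolist())`, with the same 50 MB/turn threshold.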

Automation Framework

Test Orchestration Script

python
#!/usr/bin/env python
# scripts/run_tests.py
import argparse
import os
import sys
import subprocess
import time
from datetime import datetime

def parse_args():
    parser = argparse.ArgumentParser(description='Run test suite for OpenAI-Ollama integration')
    parser.add_argument('--unit', action='store_true', help='Run unit tests')
    parser.add_argument('--integration', action='store_true', help='Run integration tests')
    parser.add_argument('--performance', action='store_true', help='Run performance tests')
    parser.add_argument('--reliability', action='store_true', help='Run reliability tests')
    parser.add_argument('--all', action='store_true', help='Run all tests')
    parser.add_argument('--html', action='store_true', help='Generate HTML report')
    parser.add_argument('--output-dir', default='test_results', help='Directory for test results')

    args = parser.parse_args()

    # If no specific test type is selected, run all
    if not (args.unit or args.integration or args.performance or args.reliability or args.all):
        args.all = True

    return args

def run_test_suite(test_type, output_dir, html=False):
    """Run a specific test suite and return success status."""
    print(f"\n{'='*80}\nRunning {test_type} tests\n{'='*80}")

    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    report_file = f"{output_dir}/{test_type}_report_{timestamp}"

    # Create command with appropriate flags
    cmd = ["pytest", f"tests/{test_type}", "-v"]

    # The --html flags require the pytest-html plugin
    if html:
        cmd.extend(["--html", f"{report_file}.html", "--self-contained-html"])

    # Add JUnit XML report for CI integration
    cmd.extend(["--junitxml", f"{report_file}.xml"])

    # Run the tests
    start_time = time.time()
    result = subprocess.run(cmd)
    duration = time.time() - start_time

    # Print summary
    status = "PASSED" if result.returncode == 0 else "FAILED"
    print(f"\n{test_type} tests {status} in {duration:.2f} seconds")

    if html:
        print(f"HTML report saved to {report_file}.html")

    print(f"XML report saved to {report_file}.xml")

    return result.returncode == 0

def main():
    args = parse_args()

    # Create output directory if it doesn't exist
    os.makedirs(args.output_dir, exist_ok=True)

    # Track overall success
    all_passed = True

    # Run selected test suites
    if args.all or args.unit:
        unit_passed = run_test_suite("unit", args.output_dir, args.html)
        all_passed = all_passed and unit_passed

    if args.all or args.integration:
        integration_passed = run_test_suite("integration", args.output_dir, args.html)
        all_passed = all_passed and integration_passed

    if args.all or args.performance:
        # Performance tests are informational, so their result does not fail the build
        run_test_suite("performance", args.output_dir, args.html)

    if args.all or args.reliability:
        reliability_passed = run_test_suite("reliability", args.output_dir, args.html)
        all_passed = all_passed and reliability_passed

    # Print overall summary
    print(f"\n{'='*80}")
    print(f"Test Suite {'PASSED' if all_passed else 'FAILED'}")
    print(f"{'='*80}")

    # Return appropriate exit code
    return 0 if all_passed else 1

if __name__ == "__main__":
    sys.exit(main())
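
The orchestration script writes a JUnit XML report for each suite but only prints pass or fail. As a rough sketch of how those reports could be summarized afterwards, the hypothetical helper below (`summarize_junit` is an illustrative name, not part of the original script) sums the counters that pytest records on each `testsuite` element.

```python
import xml.etree.ElementTree as ET
from typing import Dict


def summarize_junit(xml_text: str) -> Dict[str, int]:
    """Sum test counts across every <testsuite> element in a JUnit XML report."""
    root = ET.fromstring(xml_text)
    # The root may be a single <testsuite> or a <testsuites> wrapper
    suites = [root] if root.tag == "testsuite" else root.findall("testsuite")
    totals = {"tests": 0, "failures": 0, "errors": 0, "skipped": 0}
    for suite in suites:
        for key in totals:
            # Missing attributes are treated as zero
            totals[key] += int(suite.get(key, 0))
    return totals


if __name__ == "__main__":
    sample = (
        '<testsuites>'
        '<testsuite name="pytest" tests="5" failures="1" errors="0" skipped="2"/>'
        '</testsuites>'
    )
    print(summarize_junit(sample))
```

Reading a report from disk is then a matter of passing `open(path).read()` for one of the `*_report_*.xml` files in the output directory.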

CI/CD Configuration

yaml
# .github/workflows/test.yml
name: Test Suite

on:
  push:
    branches: [ main, develop ]
  pull_request:
    branches: [ main, develop ]
  workflow_dispatch:
    inputs:
      test_type:
        description: 'Test suite to run (unit, integration, all)'
        required: true
        default: 'unit'

jobs:
  test:
    runs-on: ubuntu-latest

    services:
      ollama:
        image: ollama/ollama:latest
        ports:
          - 11434:11434

    steps:
      - uses: actions/checkout@v3

      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.11'

      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install -r requirements.txt
          pip install -r requirements-dev.txt

      - name: Pull Ollama models
        run: |
          # Wait for Ollama service to be ready
          timeout 60 bash -c 'until curl -s -f http://localhost:11434/api/tags > /dev/null; do sleep 1; done'
          # Pull basic model for testing
          curl -X POST http://localhost:11434/api/pull -d '{"name":"llama2:7b-chat-q4_0"}'

      # On push and pull_request events the test_type input is empty, so only unit tests run
      - name: Run unit tests
        if: ${{ github.event.inputs.test_type == 'unit' || github.event.inputs.test_type == 'all' || github.event.inputs.test_type == '' }}
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          OLLAMA_HOST: http://localhost:11434
        run: pytest tests/unit -v --junitxml=unit-test-results.xml

      - name: Run integration tests
        if: ${{ github.event.inputs.test_type == 'integration' || github.event.inputs.test_type == 'all' }}
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          OLLAMA_HOST: http://localhost:11434
        run: pytest tests/integration -v --junitxml=integration-test-results.xml

      # upload-artifact v3 has been deprecated by GitHub; v4 is used here
      - name: Upload test results
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: test-results
          path: '*-test-results.xml'

      - name: Publish Test Report
        uses: mikepenz/action-junit-report@v3
        if: always()
        with:
          report_paths: '*-test-results.xml'
          fail_on_failure: true
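
The workflow waits for Ollama with a shell loop that polls `/api/tags`. The same readiness check can be useful inside the test suite itself, for instance in a session-scoped fixture. The following is a minimal sketch under that assumption; `http_probe` and `wait_until_ready` are hypothetical names, and the probe is injectable so the polling logic can be exercised without a running server.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_probe(url: str) -> bool:
    """Return True if a GET request to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=2) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_ready(
    url: str,
    timeout: float = 60.0,
    interval: float = 1.0,
    probe: Callable[[str], bool] = http_probe,
) -> bool:
    """Poll `probe(url)` until it succeeds or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if probe(url):
            return True
        time.sleep(interval)
    return False
```

A call such as `wait_until_ready("http://localhost:11434/api/tags")` mirrors the 60-second loop in the workflow; a fixture could call `pytest.skip` when it returns `False`.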

Comparative Benchmark Framework

Response Quality Evaluation Matrix

python
# tests/benchmarks/quality_matrix.py
import pytest
import asyncio
import json
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os
from typing import List, Dict, Any

from app.services.provider_service import ProviderService, Provider
from app.services.ollama_service import OllamaService

# Test questions across multiple domains
BENCHMARK_QUESTIONS = {
    "factual_knowledge": [
        "What are the main causes of climate change?",
        "Explain how vaccines work in the human body.",
        "What were the key causes of World War I?",
        "Describe the process of photosynthesis.",
        "What is the difference between DNA and RNA?"
    ],
    "reasoning": [
        "If it takes 5 machines 5 minutes to make 5 widgets, how long would it take 100 machines to make 100 widgets?",
        "A bat and ball cost $1.10 in total. The bat costs $1.00 more than the ball. How much does the ball cost?",
        "In a lake, there is a patch of lily pads. Every day, the patch doubles in size. If it takes 48 days for the patch to cover the entire lake, how long would it take for the patch to cover half of the lake?",
        "If three people can paint three fences in three hours, how many people would be needed to paint six fences in six hours?",
        "Imagine a rope that goes around the Earth at the equator, lying flat on the ground. If you add 10 meters to the length of this rope and space it evenly above the ground, how high above the ground would the rope be?"
    ],
    "creative_writing": [
        "Write a short story about a robot discovering emotions.",
        "Create a poem about the changing seasons.",
        "Write a creative dialogue between the ocean and the moon.",
        "Describe a world where humans can photosynthesize like plants.",
        "Create a character sketch of a time-traveling historian."
    ],
    "code_generation": [
        "Write a Python function to check if a string is a palindrome.",
        "Create a JavaScript function that finds the most frequent element in an array.",
        "Write a SQL query to find the top 5 customers by purchase amount.",
        "Implement a binary search algorithm in the language of your choice.",
        "Write a function to detect a cycle in a linked list."
    ],
    "instruction_following": [
        "List 5 fruits, then number them in the reverse order, then highlight the one that starts with 'a' if any.",
        "Explain quantum computing in 3 paragraphs, then summarize each paragraph in one sentence, then create a single slogan based on these summaries.",
        "Create a table comparing 3 car models based on price, fuel efficiency, and safety. Then add a row showing which model is best in each category.",
        "Write a recipe for chocolate cake, then modify it to be vegan, then list only the ingredients that changed.",
        "Translate 'Hello, how are you?' to French, Spanish, and German, then identify which language uses the most words."
    ]
}

class TestQualityMatrix:
    @pytest.fixture
    async def services(self):
        """Set up services for benchmark testing."""
        if not os.environ.get("OPENAI_API_KEY"):
            pytest.skip("OPENAI_API_KEY environment variable not set")

        # Initialize services
        ollama_service = OllamaService()
        provider_service = ProviderService()

        try:
            await ollama_service.initialize()
            await provider_service.initialize()
        except Exception as e:
            pytest.skip(f"Failed to initialize services: {str(e)}")

        yield {
            "ollama_service": ollama_service,
            "provider_service": provider_service
        }

        # Cleanup
        await ollama_service.cleanup()
        await provider_service.cleanup()

    async def generate_response(self, provider_service, provider, model, question):
        """Generate a response from a specific provider and model."""
        try:
            if provider == "openai":
                response = await provider_service._generate_openai_completion(
                    messages=[{"role": "user", "content": question}],
                    model=model,
                    temperature=0.7
                )
            else:  # ollama
                response = await provider_service._generate_ollama_completion(
                    messages=[{"role": "user", "content": question}],
                    model=model,
                    temperature=0.7
                )

            return {
                "success": True,
                "content": response["message"]["content"],
                "metadata": {
                    "model": model,
                    "provider": provider
                }
            }
        except Exception as e:
            return {
                "success": False,
                "error": str(e),
                "metadata": {
                    "model": model,
                    "provider": provider
                }
            }

    async def evaluate_response(self, provider_service, question, response, category):
        """Evaluate a response using a GPT-4-class model as a judge."""
        # Skip evaluation if response generation failed
        if not response.get("success", False):
            return {
                "scores": {
                    "correctness": 0,
                    "completeness": 0,
                    "coherence": 0,
                    "conciseness": 0,
                    "overall": 0
                },
                "explanation": f"Failed to generate response: {response.get('error', 'Unknown error')}"
            }

        evaluation_criteria = {
            "factual_knowledge": ["correctness", "completeness", "coherence", "citation"],
            "reasoning": ["logical_flow", "correctness", "explanation_quality", "step_by_step"],
            "creative_writing": ["originality", "coherence", "engagement", "language_use"],
            "code_generation": ["correctness", "efficiency", "readability", "explanation"],
            "instruction_following": ["accuracy", "completeness", "precision", "structure"]
        }

        # Get the appropriate criteria for this category
        criteria = evaluation_criteria.get(category, ["correctness", "completeness", "coherence", "overall"])

        evaluation_prompt = [
            {"role": "system", "content": f"""
            You are an expert evaluator of AI responses. Evaluate the given response to the question based on the following criteria:

            {', '.join(criteria)}

            For each criterion, provide a score from 1-10 and a brief explanation.
            Also provide an overall score from 1-10.

            Format your response as valid JSON with the following structure:
            {{
                "scores": {{
                    "{criteria[0]}": X,
                    "{criteria[1]}": X,
                    "{criteria[2]}": X,
                    "{criteria[3]}": X,
                    "overall": X
                }},
                "explanation": "Your overall assessment and suggestions for improvement"
            }}
            """},
            {"role": "user", "content": f"""
            Question: {question}

            Response to evaluate:
            {response["content"]}
            """}
        ]

        # Use a GPT-4-class judge. JSON mode (response_format) needs a model that
        # supports it, such as gpt-4-turbo; the base gpt-4 model rejects the parameter.
        evaluation = await provider_service._generate_openai_completion(
            messages=evaluation_prompt,
            model="gpt-4-turbo",
            response_format={"type": "json_object"}
        )

        try:
            return json.loads(evaluation["message"]["content"])
        except (json.JSONDecodeError, KeyError, TypeError):
            # Fallback if the judge's reply is missing or is not valid JSON
            return {
                "scores": {criterion: 0 for criterion in criteria + ["overall"]},
                "explanation": "Failed to parse evaluation"
            }

    @pytest.mark.asyncio
    async def test_quality_matrix(self, services):
        """Generate a comprehensive quality comparison matrix."""
        provider_service = services["provider_service"]

        # Models to test
        models = {
            "openai": ["gpt-3.5-turbo", "gpt-4-turbo"],
            "ollama": ["llama2", "mistral", "codellama"]
        }

        # Select a subset of questions for each category to keep test duration reasonable
        test_questions = {}
        for category, questions in BENCHMARK_QUESTIONS.items():
            # Take up to 2 questions per category
            test_questions[category] = questions[:2]

        # Collect results
        all_results = []

        for category, questions in test_questions.items():
            for question in questions:
                for provider in models:
                    for model in models[provider]:
                        print(f"Testing {provider}:{model} on {category} question")

                        # Generate response
                        response = await self.generate_response(
                            provider_service,
                            provider,
                            model,
                            question
                        )

                        # Save raw response
                        model_safe_name = model.replace(":", "_")
                        os.makedirs("benchmark_responses", exist_ok=True)
                        with open(f"benchmark_responses/{provider}_{model_safe_name}_{category}.txt", "a") as f:
                            f.write(f"\nQuestion: {question}\n\n")
                            f.write(f"Response: {response.get('content', 'ERROR: ' + response.get('error', 'Unknown error'))}\n")
                            f.write("-" * 80 + "\n")

                        # If successful, evaluate the response
                        if response.get("success", False):
                            evaluation = await self.evaluate_response(
                                provider_service,
                                question,
                                response,
                                category
                            )

                            # Add to results
                            result = {
                                "category": category,
                                "question": question,
                                "provider": provider,
                                "model": model,
                                "success": response["success"]
                            }

                            # Add scores
                            for criterion, score in evaluation["scores"].items():
                                result[f"score_{criterion}"] = score

                            all_results.append(result)
                        else:
                            # Add failed result
                            all_results.append({
                                "category": category,
                                "question": question,
                                "provider": provider,
                                "model": model,
                                "success": False,
                                "score_overall": 0
                            })

                        # Add a delay to avoid rate limits
                        await asyncio.sleep(2)

        # Analyze results
        df = pd.DataFrame(all_results)

        # Save full results
        df.to_csv("benchmark_quality_matrix.csv", index=False)

        # Create summary by model and category
        summary = df.groupby(["provider", "model", "category"])["score_overall"].mean().reset_index()
        pivot_summary = summary.pivot_table(
            index=["provider", "model"],
            columns="category",
            values="score_overall"
        ).round(2)

        # Add average across categories
        pivot_summary["average"] = pivot_summary.mean(axis=1)

        # Save summary
        pivot_summary.to_csv("benchmark_quality_summary.csv")

        # Create visualization
        plt.figure(figsize=(15, 10))

        # Heatmap of scores
        plt.subplot(1, 1, 1)
        sns.heatmap(pivot_summary, annot=True, cmap="YlGnBu", vmin=1, vmax=10)
        plt.title("Model Performance by Category (Average Score 1-10)")

        plt.tight_layout()
        plt.savefig('benchmark_quality_matrix.png')

        # Print summary to console
        print("\nQuality Benchmark Results:")
        print(pivot_summary.to_string())

        # Assert something meaningful
        assert len(all_results) > 0, "No benchmark results collected"
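
The quality matrix depends on the judge model returning well-formed JSON with a `scores` object, and the main test reads `evaluation["scores"]` directly. A defensive parsing step can make that dependency less fragile. The sketch below is one possible shape for it; `parse_evaluation` is a hypothetical helper, and the clamping to the 0-10 range is an assumption rather than something the original code does.

```python
import json
from typing import Any, Dict, List


def parse_evaluation(raw: str, criteria: List[str]) -> Dict[str, Any]:
    """Parse a judge reply into {"scores": {...}, "explanation": str}.

    Missing or non-numeric scores become 0; numeric scores are clamped to 0-10.
    """
    # De-duplicate while keeping order, in case "overall" is already a criterion
    expected = list(dict.fromkeys(criteria + ["overall"]))
    try:
        data = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        data = None
    if not isinstance(data, dict):
        return {
            "scores": {name: 0 for name in expected},
            "explanation": "Failed to parse evaluation",
        }

    raw_scores = data.get("scores")
    if not isinstance(raw_scores, dict):
        raw_scores = {}

    scores = {}
    for name in expected:
        value = raw_scores.get(name, 0)
        # bool is a subclass of int, so exclude it explicitly
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            value = 0
        scores[name] = max(0, min(10, value))
    return {"scores": scores, "explanation": str(data.get("explanation", ""))}
```

With a helper like this, every result row is guaranteed a `score_overall` column, which the later `groupby` on that column relies on.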

Latency and Cost Efficiency Analysis

python
# tests/benchmarks/efficiency_analysis.py
import pytest
import asyncio
import time
import os
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from typing import List, Dict, Any

from app.services.provider_service import ProviderService, Provider
from app.services.ollama_service import OllamaService

# Test prompts of different lengths
BENCHMARK_PROMPTS = {
    "short": "What is artificial intelligence?",
    "medium": "Explain the differences between supervised, unsupervised, and reinforcement learning in machine learning.",
    "long": "Write a comprehensive essay on the ethical implications of artificial intelligence in healthcare, considering patient privacy, diagnostic accuracy, and accessibility issues.",
    "very_long": """
    Analyze the historical development of artificial intelligence from its conceptual origins to the present day.
    Include key milestones, technological breakthroughs, paradigm shifts in approaches, and influential researchers.
    Also discuss how AI has been portrayed in popular culture and how that has influenced public perception and research funding.
    Finally, provide a thoughtful discussion on where AI might be headed in the next 20 years and what ethical frameworks
    should be considered as we continue to advance the technology.
    """
}

class TestEfficiencyAnalysis:
    @pytest.fixture
    async def services(self):
        """Set up services for benchmark testing."""
        if not os.environ.get("OPENAI_API_KEY"):
            pytest.skip("OPENAI_API_KEY environment variable not set")

        # Initialize services
        ollama_service = OllamaService()
        provider_service = ProviderService()

        try:
            await ollama_service.initialize()
            await provider_service.initialize()
        except Exception as e:
            pytest.skip(f"Failed to initialize services: {str(e)}")

        yield {
            "ollama_service": ollama_service,
            "provider_service": provider_service
        }

        # Cleanup
        await ollama_service.cleanup()
        await provider_service.cleanup()

    async def measure_response_metrics(self, provider_service, provider, model, prompt, max_tokens=None):
        """Measure response time, token counts, and other metrics."""
        start_time = time.time()
        success = False
        error = None
        token_count = {"prompt": 0, "completion": 0, "total": 0}

        try:
            if provider == "openai":
                response = await provider_service._generate_openai_completion(
                    messages=[{"role": "user", "content": prompt}],
                    model=model,
                    max_tokens=max_tokens
                )
            else:  # ollama
                response = await provider_service._generate_ollama_completion(
                    messages=[{"role": "user", "content": prompt}],
                    model=model,
                    max_tokens=max_tokens
                )

            success = True

            # Extract token counts from usage if available
            if "usage" in response:
                token_count = {
                    "prompt": response["usage"].get("prompt_tokens", 0),
                    "completion": response["usage"].get("completion_tokens", 0),
                    "total": response["usage"].get("total_tokens", 0)
                }

            response_text = response["message"]["content"]

        except Exception as e:
            error = str(e)
            response_text = None

        end_time = time.time()
        duration = end_time - start_time

        # Estimate cost (for OpenAI)
        cost = 0.0
        if provider == "openai" and success:
            if "gpt-4" in model:
                # GPT-4 pricing (approximate)
                cost = token_count["prompt"] * 0.00003 + token_count["completion"] * 0.00006
            else:
                # GPT-3.5 pricing (approximate)
                cost = token_count["prompt"] * 0.0000015 + token_count["completion"] * 0.000002

        return {
            "success": success,
            "error": error,
            "duration": duration,
            "token_count": token_count,
            "response_length": len(response_text) if response_text else 0,
            "cost": cost,
            "tokens_per_second": token_count["completion"] / duration if success and duration > 0 else 0
        }

    @pytest.mark.asyncio
    async def test_efficiency_benchmark(self, services):
        """Perform comprehensive efficiency analysis."""
        provider_service = services["provider_service"]

        # Models to test
        models = {
            "openai": ["gpt-3.5-turbo", "gpt-4"],
            "ollama": ["llama2", "mistral:7b", "llama2:13b"]
        }

        # Number of repetitions for each test
        repetitions = 2

        # Results
        results = []

        for prompt_length, prompt in BENCHMARK_PROMPTS.items():
            for provider in models:
                for model in models[provider]:
                    print(f"Testing {provider}:{model} with {prompt_length} prompt")

                    for rep in range(repetitions):
                        try:
                            metrics = await self.measure_response_metrics(
                                provider_service,
                                provider,
                                model,
                                prompt
                            )

                            results.append({
                                "provider": provider,
                                "model": model,
                                "prompt_length": prompt_length,
                                "repetition": rep + 1,
                                **metrics
                            })

                            # Add a delay to avoid rate limits
                            await asyncio.sleep(2)
                        except Exception as e:
                            print(f"Error in benchmark: {str(e)}")

        # Create DataFrame
        df = pd.DataFrame(results)

        # Save raw results
        df.to_csv("benchmark_efficiency_raw.csv", index=False)

        # Create summary by model and prompt length
        latency_summary = df.groupby(["provider", "model", "prompt_length"])["duration"].mean().reset_index()
        latency_pivot = latency_summary.pivot_table(
            index=["provider", "model"],
            columns="prompt_length",
            values="duration"
        ).round(2)

        # Calculate efficiency metrics (tokens per second and cost per 1000 tokens)
        efficiency_df = df[df["success"]].copy()
        efficiency_df["cost_per_1k_tokens"] = efficiency_df.apply(
            lambda row: (row["cost"] * 1000 / row["token_count"]["total"])
            if row["provider"] == "openai" and row["token_count"]["total"] > 0
            else 0,
            axis=1
        )

        efficiency_summary = efficiency_df.groupby(["provider", "model"])[
            ["tokens_per_second", "cost_per_1k_tokens"]
        ].mean().round(3)

        # Save summaries
        latency_pivot.to_csv("benchmark_latency_summary.csv")
        efficiency_summary.to_csv("benchmark_efficiency_summary.csv")

        # Create visualizations
        plt.figure(figsize=(15, 10))

        # Latency by prompt length and model
        plt.subplot(2, 1, 1)
        ax = plt.gca()
        latency_pivot.plot(kind='bar', ax=ax)
        plt.title("Response Time by Prompt Length")
        plt.ylabel("Time (seconds)")
        plt.xticks(rotation=45)
        plt.legend(title="Prompt Length")

        # Tokens per second by model
        plt.subplot(2, 2, 3)
        efficiency_summary["tokens_per_second"].plot(kind='bar')
        plt.title("Generation Speed (Tokens/Second)")
        plt.ylabel("Tokens per Second")
        plt.xticks(rotation=45)

        # Cost per 1000 tokens (OpenAI only); skipped if no OpenAI call succeeded
        plt.subplot(2, 2, 4)
        if "openai" in efficiency_summary.index.get_level_values("provider"):
            openai_efficiency = efficiency_summary.loc["openai"]
            openai_efficiency["cost_per_1k_tokens"].plot(kind='bar')
        plt.title("Cost per 1000 Tokens (OpenAI)")
        plt.ylabel("Cost ($)")
        plt.xticks(rotation=45)

        plt.tight_layout()
        plt.savefig('benchmark_efficiency.png')

        # Print summary to console
        print("\nLatency by Prompt Length (seconds):")
        print(latency_pivot.to_string())

        print("\nEfficiency Metrics:")
        print(efficiency_summary.to_string())

        # Comparison analysis
        if "ollama" in df["provider"].values and "openai" in df["provider"].values:
            # Calculate average speedup/slowdown ratio
            openai_avg = df[df["provider"] == "openai"]["duration"].mean()
            ollama_avg = df[df["provider"] == "ollama"]["duration"].mean()

            speedup = openai_avg / ollama_avg if ollama_avg > 0 else float('inf')

            print(f"\nAverage time ratio (OpenAI/Ollama): {speedup:.2f}")
            if speedup > 1:
                print(f"Ollama is {speedup:.2f}x faster on average")
            else:
                print(f"OpenAI is {1/speedup:.2f}x faster on average")

        # Assert something meaningful
        assert len(results) > 0, "No benchmark results collected"
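
The efficiency benchmark derives tokens per second from an OpenAI-style `usage` block. If the Ollama wrapper does not return one, every Ollama row reports a speed of zero. Ollama's native `/api/chat` and `/api/generate` responses include `eval_count` (generated tokens) and `eval_duration` (nanoseconds), which allow the same metric to be computed. The sketch below assumes the wrapper passes those raw fields through, which the source does not show; `ollama_tokens_per_second` is a hypothetical helper.

```python
from typing import Any, Mapping


def ollama_tokens_per_second(response: Mapping[str, Any]) -> float:
    """Compute generation speed from Ollama's native timing fields.

    `eval_count` is the number of generated tokens and `eval_duration` is the
    generation time in nanoseconds. Returns 0.0 when either field is absent.
    """
    tokens = response.get("eval_count", 0)
    duration_ns = response.get("eval_duration", 0)
    if not tokens or not duration_ns:
        return 0.0
    return tokens / (duration_ns / 1e9)  # Convert nanoseconds to seconds


if __name__ == "__main__":
    # Illustrative values only, not measured results
    sample = {"eval_count": 120, "eval_duration": 4_000_000_000}
    print(f"{ollama_tokens_per_second(sample):.1f} tokens/s")
```

Unlike the wall-clock figure in the benchmark, this value excludes model load time and prompt evaluation, so the two numbers are not directly comparable.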

Tool Usage Comparison

python
# tests/benchmarks/tool_usage_comparison.py
import pytest
import asyncio
import json
import time  # Required by generate_with_tools for timing
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os
from typing import List, Dict, Any

from app.services.provider_service import ProviderService, Provider
from app.services.ollama_service import OllamaService

# Test tools for benchmarking
BENCHMARK_TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather in a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "The city and state, e.g. San Francisco, CA"
                    },
                    "unit": {
                        "type": "string",
                        "enum": ["celsius", "fahrenheit"],
                        "description": "The temperature unit to use"
                    }
                },
                "required": ["location"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "search_hotels",
            "description": "Search for hotels in a specific location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "The city to search in"
                    },
                    "check_in": {
                        "type": "string",
                        "description": "Check-in date in YYYY-MM-DD format"
                    },
                    "check_out": {
                        "type": "string",
                        "description": "Check-out date in YYYY-MM-DD format"
                    },
                    "guests": {
                        "type": "integer",
                        "description": "Number of guests"
                    },
                    "price_range": {
                        "type": "string",
                        "description": "Price range, e.g. '$0-$100'"
                    }
                },
                "required": ["location", "check_in", "check_out"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "calculate_mortgage",
            "description": "Calculate monthly mortgage payment",
            "parameters": {
                "type": "object",
                "properties": {
                    "loan_amount": {
                        "type": "number",
                        "description": "The loan amount in dollars"
                    },
                    "interest_rate": {
                        "type": "number",
                        "description": "Annual interest rate (percentage)"
                    },
                    "loan_term": {
                        "type": "integer",
                        "description": "Loan term in years"
                    },
                    "down_payment": {
                        "type": "number",
                        "description": "Down payment amount in dollars"
                    }
                },
                "required": ["loan_amount", "interest_rate", "loan_term"]
            }
        }
    }
]

# Tool usage queries
TOOL_QUERIES = [
    "What's the weather like in Miami right now?",
    "Find me hotels in New York for next weekend for 2 people.",
    "Calculate the monthly payment for a $300,000 mortgage with 4.5% interest over 30 years.",
    "What's the weather in Tokyo and Paris this week?",
    "I need to calculate mortgage payments for different interest rates: 3%, 4%, and 5% on a $250,000 loan."
]

class TestToolUsageComparison:
    @pytest.fixture
    async def services(self):
        """Set up services for benchmark testing."""
        if not os.environ.get("OPENAI_API_KEY"):
            pytest.skip("OPENAI_API_KEY environment variable not set")

        # Initialize services
        ollama_service = OllamaService()
        provider_service = ProviderService()

        try:
            await ollama_service.initialize()
            await provider_service.initialize()
        except Exception as e:
            pytest.skip(f"Failed to initialize services: {str(e)}")

        yield {
            "ollama_service": ollama_service,
            "provider_service": provider_service
        }

        # Cleanup
        await ollama_service.cleanup()
        await provider_service.cleanup()

    async def generate_with_tools(self, provider_service, provider, model, query, tools):
        """Generate a response with tools and measure performance."""
        start_time = time.time()
        success = False
        error = None

        try:
            if provider == "openai":
                response = await provider_service._generate_openai_completion(
                    messages=[{"role": "user", "content": query}],
                    model=model,
                    tools=tools
                )
            else:  # ollama
                response = await provider_service._generate_ollama_completion(
                    messages=[{"role": "user", "content": query}],
                    model=model,
                    tools=tools
                )

            success = True
            tool_calls = response.get("tool_calls", [])
            # Content can be None when the model answers only with tool calls
            message_content = response["message"]["content"] or ""

            # Determine if tools were used correctly
            tools_used = len(tool_calls) > 0

            # For Ollama (which might not have native tool support), check for tool-like patterns
            if not tools_used and provider == "ollama":
                # Check if response contains structured tool usage
                if "<tool>" in message_content:
                    tools_used = True

                # Look for patterns matching function names
                for tool in tools:
                    if f"{tool['function']['name']}" in message_content:
                        tools_used = True
                        break

        except Exception as e:
            error = str(e)
            message_content = None
            tools_used = False
            tool_calls = []

        end_time = time.time()

        return {
            "success": success,
            "error": error,
            "duration": end_time - start_time,
            "message": message_content,
            "tools_used": tools_used,
            "tool_call_count": len(tool_calls),
            "tool_calls": tool_calls
        }

194 @pytest.mark.asyncio
195 async def test_tool_usage_benchmark(self, services):
196 """Benchmark tool usage across providers and models."""
197 provider_service = services["provider_service"]
198
199 # Models to test
200 models = {
201 "openai": ["gpt-3.5-turbo", "gpt-4-turbo"],
202 "ollama": ["llama2", "mistral"]
203 }
204
205 # Results
206 results = []
207
208 for query in TOOL_QUERIES:
209 for provider in models:
210 for model in models[provider]:
211 print(f"Testing {provider}:{model} with tools query: {query[:30]}...")
212
213 try:
214 metrics = await self.generate_with_tools(
215 provider_service,
216 provider,
217 model,
218 query,
219 BENCHMARK_TOOLS
220 )
221
222 results.append({
223 "provider": provider,
224 "model": model,
225 "query": query,
226 **metrics
227 })
228
229 # Save raw response
230 model_safe_name = model.replace(":", "_")
231 os.makedirs("tool_benchmark_responses", exist_ok=True)
232 with open(f"tool_benchmark_responses/{provider}_{model_safe_name}.txt", "a") as f:
233 f.write(f"\nQuery: {query}\n\n")
234 f.write(f"Response: {metrics.get('message') or 'ERROR: ' + (metrics.get('error') or 'Unknown error')}\n")
235 if metrics.get('tool_calls'):
236 f.write("\nTool Calls:\n")
237 f.write(json.dumps(metrics['tool_calls'], indent=2))
238 f.write("\n" + "-" * 80 + "\n")
239
240 # Add a delay to avoid rate limits
241 await asyncio.sleep(2)
242 except Exception as e:
243 print(f"Error in benchmark: {str(e)}")
244
245 # Create DataFrame
246 df = pd.DataFrame(results)
247
248 # Save raw results
249 df.to_csv("benchmark_tool_usage_raw.csv", index=False)
250
251 # Create summary
252 tool_usage_summary = df.groupby(["provider", "model"])[
253 ["success", "tools_used", "tool_call_count", "duration"]
254 ].agg({
255 "success": "mean",
256 "tools_used": "mean",
257 "tool_call_count": "mean",
258 "duration": "mean"
259 }).round(3)
260
261 # Rename columns for clarity
262 tool_usage_summary.columns = [
263 "Success Rate",
264 "Tool Usage Rate",
265 "Avg Tool Calls",
266 "Avg Duration (s)"
267 ]
268
269 # Save summary
270 tool_usage_summary.to_csv("benchmark_tool_usage_summary.csv")
271
272 # Create visualizations
273 plt.figure(figsize=(15, 10))
274
275 # Tool usage rate by model
276 plt.subplot(2, 2, 1)
277 tool_usage_summary["Tool Usage Rate"].plot(kind='bar')
278 plt.title("Tool Usage Rate by Model")
279 plt.ylabel("Rate (0-1)")
280 plt.ylim(0, 1)
281 plt.xticks(rotation=45)
282
283 # Average tool calls by model
284 plt.subplot(2, 2, 2)
285 tool_usage_summary["Avg Tool Calls"].plot(kind='bar')
286 plt.title("Average Tool Calls per Query")
287 plt.ylabel("Count")
288 plt.xticks(rotation=45)
289
290 # Success rate by model
291 plt.subplot(2, 2, 3)
292 tool_usage_summary["Success Rate"].plot(kind='bar')
293 plt.title("Success Rate")
294 plt.ylabel("Rate (0-1)")
295 plt.ylim(0, 1)
296 plt.xticks(rotation=45)
297
298 # Average duration by model
299 plt.subplot(2, 2, 4)
300 tool_usage_summary["Avg Duration (s)"].plot(kind='bar')
301 plt.title("Average Response Time")
302 plt.ylabel("Seconds")
303 plt.xticks(rotation=45)
304
305 plt.tight_layout()
306 plt.savefig('benchmark_tool_usage.png')
307
308 # Print summary to console
309 print("\nTool Usage Benchmark Results:")
310 print(tool_usage_summary.to_string())
311
312 # Qualitative analysis - extract patterns in tool usage
313 if len(df[df["tools_used"]]) > 0:
314 print("\nQualitative Analysis of Tool Usage:")
315
316 # Comparison between providers
317 openai_correct = df[(df["provider"] == "openai") & (df["tools_used"])].shape[0]
318 openai_total = df[df["provider"] == "openai"].shape[0]
319 openai_rate = openai_correct / openai_total if openai_total > 0 else 0
320
321 ollama_correct = df[(df["provider"] == "ollama") & (df["tools_used"])].shape[0]
322 ollama_total = df[df["provider"] == "ollama"].shape[0]
323 ollama_rate = ollama_correct / ollama_total if ollama_total > 0 else 0
324
325 print(f"OpenAI tool usage rate: {openai_rate:.2f}")
326 print(f"Ollama tool usage rate: {ollama_rate:.2f}")
327
328 if openai_rate > 0 and ollama_rate > 0:
329 ratio = openai_rate / ollama_rate
330 print(f"OpenAI is {ratio:.2f}x more likely to use tools correctly")
331
332 # Additional insights
333 if "openai" in df["provider"].values and "ollama" in df["provider"].values:
334 openai_time = df[df["provider"] == "openai"]["duration"].mean()
335 ollama_time = df[df["provider"] == "ollama"]["duration"].mean()
336
337 if openai_time > 0 and ollama_time > 0:
338 time_ratio = openai_time / ollama_time
339 print(f"Time ratio (OpenAI/Ollama): {time_ratio:.2f}")
340 if time_ratio > 1:
341 print(f"Ollama is {time_ratio:.2f}x faster for tool-related queries")
342 else:
343 print(f"OpenAI is {1/time_ratio:.2f}x faster for tool-related queries")
344
345 # Assert something meaningful
346 assert len(results) > 0, "No benchmark results collected"
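
The fallback check in `generate_with_tools` counts any mention of a tool's name as tool use, so a reply such as "I cannot call get_weather" would raise Ollama's tool-usage rate. A stricter heuristic, sketched below, counts a reply only when it embeds a JSON object that names a known tool. The `{"name": ..., "arguments": ...}` shape and the helper name `extract_tool_calls` are assumptions made for illustration; they are not part of the benchmark above.

```python
import json
from typing import Any


def extract_tool_calls(text: str, tools: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Return JSON objects embedded in `text` whose "name" matches a known tool."""
    known = {tool["function"]["name"] for tool in tools}
    decoder = json.JSONDecoder()
    calls = []
    pos = 0
    while True:
        start = text.find("{", pos)
        if start == -1:
            break
        try:
            # raw_decode parses one JSON value starting at `start`
            # and returns it with the index just past its end.
            obj, end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            pos = start + 1  # not valid JSON here; keep scanning
            continue
        name = obj.get("name") if isinstance(obj, dict) else None
        if isinstance(name, str) and name in known:
            calls.append(obj)
        pos = end
    return calls
```

With this helper, `tools_used` could be set to `bool(extract_tool_calls(message_content, tools))`, which would likely report a lower but more defensible rate for models without native tool support.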

Pytest Configuration

```ini
# pytest.ini
[pytest]
markers =
    unit: marks tests as unit tests
    integration: marks tests as integration tests
    performance: marks tests as performance tests
    reliability: marks tests as reliability tests
    benchmark: marks tests as benchmarks

testpaths = tests

python_files = test_*.py
python_classes = Test*
python_functions = test_*

# Don't run performance, reliability, or benchmark tests by default.
# Pass an explicit -m expression (e.g. -m benchmark) to select them.
addopts = -m "not performance and not reliability and not benchmark"

# Configure test outputs
junit_family = xunit2

# Default environment variables for test runs
# (the env option is provided by the pytest-env plugin)
env =
    PYTHONPATH=.
    OPENAI_MODEL=gpt-3.5-turbo
    OLLAMA_MODEL=llama2
    OLLAMA_HOST=http://localhost:11434
```
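
The environment entries above give every test run a default provider configuration. A test helper might read them with matching fallbacks, so the suite still resolves sensible values when the variables are unset. The sketch below assumes a hypothetical `load_provider_settings` helper and `ProviderSettings` class; only the variable names and default values come from the configuration above.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class ProviderSettings:
    openai_model: str
    ollama_model: str
    ollama_host: str


def load_provider_settings(env: Optional[Mapping[str, str]] = None) -> ProviderSettings:
    """Read provider settings from `env` (defaults to os.environ)."""
    env = os.environ if env is None else env
    return ProviderSettings(
        openai_model=env.get("OPENAI_MODEL", "gpt-3.5-turbo"),
        ollama_model=env.get("OLLAMA_MODEL", "llama2"),
        # Strip a trailing slash so URL joins stay predictable.
        ollama_host=env.get("OLLAMA_HOST", "http://localhost:11434").rstrip("/"),
    )
```

Passing a plain dictionary instead of `os.environ` keeps the helper easy to exercise in unit tests.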

Test Documentation

Testing Strategy for OpenAI-Ollama Integration

This document outlines the testing approach for the hybrid AI system that integrates OpenAI and Ollama.

1. Unit Testing

Unit tests verify the functionality of individual components in isolation:

  • Provider Service: Tests for provider selection logic, auto-routing, and fallback mechanisms
  • Ollama Service: Tests for response formatting, tool extraction, and error handling
  • Model Selection: Tests for use case detection and model recommendation logic
  • Tool Integration: Tests for proper handling of tool calls and responses

Run unit tests with:

```bash
python -m pytest tests/unit -v
```

2. Integration Testing

Integration tests verify the interaction between components:

  • API Endpoints: Tests for proper request handling, authentication, and response formatting
  • End-to-End Agent Flows: Tests for agent behavior across different scenarios
  • Cross-Provider Integration: Tests for seamless integration between OpenAI and Ollama

Run integration tests with:

```bash
python -m pytest tests/integration -v
```

3. Performance Testing

Performance tests measure system performance characteristics:

  • Response Latency: Compares response times across providers and models
  • Memory Usage: Measures memory consumption during request processing
  • Response Quality: Evaluates the quality of responses using GPT-4 as a judge

Run performance tests with:

```bash
python -m pytest tests/performance -v
```
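
As a rough illustration of the latency comparison, the sketch below times an arbitrary async callable and reports the mean, an approximate 95th percentile, and the maximum. `measure_latency` and the `fake_call` stand-in are hypothetical names; a real test would pass a coroutine that calls the provider under test.

```python
import asyncio
import statistics
import time
from typing import Awaitable, Callable


async def measure_latency(
    call: Callable[[], Awaitable[object]], runs: int = 5
) -> dict[str, float]:
    """Await `call` sequentially `runs` times and summarize wall-clock latency."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    durations = []
    for _ in range(runs):
        start = time.perf_counter()  # monotonic clock, suited to timing
        await call()
        durations.append(time.perf_counter() - start)
    durations.sort()
    # Nearest-rank approximation; with few runs this is close to the maximum.
    p95_index = min(len(durations) - 1, round(0.95 * (len(durations) - 1)))
    return {
        "mean": statistics.mean(durations),
        "p95": durations[p95_index],
        "max": durations[-1],
    }


async def fake_call() -> str:
    """Stand-in for a provider request (assumption for this sketch)."""
    await asyncio.sleep(0.01)
    return "ok"


if __name__ == "__main__":
    print(asyncio.run(measure_latency(fake_call)))
```

Running the calls sequentially keeps the measurements independent of each other; concurrent behavior is a separate question covered by load testing.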

4. Reliability Testing

Reliability tests verify the system's behavior under various conditions:

  • Error Handling: Tests for proper error detection and fallback mechanisms
  • Load Testing: Measures system performance under concurrent requests
  • Stability Testing: Evaluates system behavior during extended conversations

Run reliability tests with:

```bash
python -m pytest tests/reliability -v
```
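
The load-testing idea can be sketched with `asyncio.gather` and a semaphore that caps the number of in-flight requests. `run_load_test` and `fake_request` are hypothetical names used only for this example; the simulated failure rate is arbitrary.

```python
import asyncio
from typing import Awaitable, Callable


async def run_load_test(
    request: Callable[[int], Awaitable[None]],
    total: int = 20,
    concurrency: int = 5,
) -> dict[str, int]:
    """Issue `total` requests with at most `concurrency` running at once."""
    semaphore = asyncio.Semaphore(concurrency)
    in_flight = 0
    peak = 0

    async def one(i: int) -> bool:
        nonlocal in_flight, peak
        async with semaphore:
            in_flight += 1
            peak = max(peak, in_flight)
            try:
                await request(i)
                return True
            except Exception:
                return False  # count the failure instead of aborting the run
            finally:
                in_flight -= 1

    outcomes = await asyncio.gather(*(one(i) for i in range(total)))
    succeeded = sum(outcomes)
    return {
        "total": total,
        "succeeded": succeeded,
        "failed": total - succeeded,
        "peak_concurrency": peak,
    }


async def fake_request(i: int) -> None:
    """Stand-in request that fails for every tenth call (assumption)."""
    await asyncio.sleep(0.001)
    if i % 10 == 9:
        raise RuntimeError("simulated provider error")


if __name__ == "__main__":
    print(asyncio.run(run_load_test(fake_request)))
```

Recording failures rather than letting them propagate means one bad request does not hide the results of the others.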

5. Benchmark Framework

Comprehensive benchmarks for comparative analysis:

  • Quality Matrix: Compares response quality across providers and models
  • Efficiency Analysis: Measures performance/cost characteristics
  • Tool Usage Comparison: Evaluates tool handling capabilities

Run benchmarks with:

```bash
python -m pytest tests/benchmarks -v
```

Running the Complete Test Suite

Use the test orchestration script to run all test suites:

```bash
python scripts/run_tests.py --all
```

CI/CD Integration

The test suite is integrated with GitHub Actions workflow:

```bash
# Triggered on push to main/develop or manually via workflow_dispatch
git push origin main  # Automatically runs tests
```

Prerequisites

  1. OpenAI API key in an environment variable:

```bash
export OPENAI_API_KEY=sk-...
```

  2. Running Ollama instance:

```bash
ollama serve
```

  3. Required models for Ollama:

```bash
ollama pull llama2
ollama pull mistral
```
Conclusion

This testing strategy provides a framework for validating the hybrid AI architecture that integrates OpenAI's cloud capabilities with Ollama's local model inference. The multi-faceted approach ensures:

  1. Functional Correctness: Unit and integration tests verify that all components function as expected, both individually and when integrated.
  2. Performance Optimization: Benchmarks and performance tests provide quantitative data to guide resource allocation and routing decisions.
  3. Reliability: Load and stability tests ensure the system remains responsive and produces consistent results under various conditions.
  4. Quality Assurance: Response quality evaluations ensure that the system maintains high standards regardless of which provider handles the inference.

The test suite is designed to be extensible, allowing for additional test cases as the system evolves. Automating this strategy through CI/CD pipelines maintains ongoing quality assurance and enables continuous improvement of the hybrid AI architecture.

User Interface Design for Hybrid OpenAI-Ollama MCP System

Conceptual Framework for Interface Design

The Modern Computational Paradigm (MCP) system, which integrates cloud-based intelligence with local inference capabilities, requires a thoughtfully designed interface that balances simplicity with advanced functionality. This document presents a design approach for both command-line and web interfaces that expose the system's capabilities while maintaining an intuitive user experience.

Command Line Interface (CLI) Design

CLI Architecture
```text
┌─────────────────────────────────────────────────────────────┐
│                           MCP-CLI                           │
│                                                             │
│  ┌─────────────┐   ┌─────────────┐   ┌──────────────────┐   │
│  │ Core Module │   │   Config    │   │ Interactive Mode │   │
│  └─────────────┘   └─────────────┘   └──────────────────┘   │
│         │                 │                   │             │
│         ▼                 ▼                   ▼             │
│  ┌─────────────┐   ┌─────────────┐   ┌──────────────────┐   │
│  │  Agent API  │   │    Model    │   │     Session      │   │
│  │   Client    │   │ Management  │   │    Management    │   │
│  └─────────────┘   └─────────────┘   └──────────────────┘   │
│         │                 │                   │             │
│         └─────────────────┼───────────────────┘             │
│                           │                                 │
│                           ▼                                 │
│                    ┌─────────────┐                          │
│                    │   Output    │                          │
│                    │ Formatting  │                          │
│                    └─────────────┘                          │
│                                                             │
└─────────────────────────────────────────────────────────────┘
```

CLI Wireframes

Main Help Screen

```text
MCP CLI v1.0.0

USAGE:
  mcp [OPTIONS] COMMAND [ARGS]...

OPTIONS:
  --config PATH  Path to config file
  --verbose      Enable verbose output
  --help         Show this message and exit

COMMANDS:
  chat      Start a chat session
  complete  Get a completion for a prompt
  models    List and manage available models
  config    Configure MCP settings
  agents    Manage agent profiles
  session   Manage saved sessions
```

Interactive Chat Mode

```text
MCP Chat Session - ID: chat_78f3d2
Model: auto-select | Provider: auto | Agent: research

Type 'exit' to quit, 'help' for commands, 'models' to switch models
────────────────────────────────────────────────────────────────────

You: Tell me about quantum computing

MCP [OpenAI:gpt-4]: Quantum computing is a type of computation that
harnesses quantum mechanical phenomena like superposition and
entanglement to process information in ways that classical computers
cannot.

Unlike classical bits that exist in a state of either 0 or 1, quantum
bits or "qubits" can exist in multiple states simultaneously due to
superposition. This potentially allows quantum computers to explore
multiple solutions to a problem at once.

[Response continues for several more paragraphs...]

You: Can you explain quantum entanglement more simply?

MCP [Ollama:mistral]: █
```

Model Management Screen

```text
MCP Models

AVAILABLE MODELS:

OpenAI:
  [✓] gpt-4-turbo    - Advanced reasoning, current knowledge
  [✓] gpt-3.5-turbo  - Fast, efficient for standard tasks

Ollama:
  [✓] llama2       - General purpose local model
  [✓] mistral      - Strong reasoning, 8k context window
  [✓] codellama    - Specialized for code generation
  [ ] wizard-math  - Mathematical problem-solving

COMMANDS:

  pull MODEL_NAME         - Download a model to Ollama
  info MODEL_NAME         - Show detailed model information
  benchmark MODEL_NAME    - Run performance benchmark
  set-default MODEL_NAME  - Set as default model
```

Agent Configuration Screen

```text
MCP Agent Configuration

AVAILABLE AGENTS:

  [✓] general   - General purpose assistant
  [✓] research  - Research specialist with knowledge tools
  [✓] coding    - Code assistant with tool integration
  [✓] creative  - Creative writing and content generation

CUSTOM AGENTS:

  [✓] my-math-tutor  - Mathematics teaching and problem solving
  [✓] data-analyst   - Data analysis with visualization tools

COMMANDS:

  create NAME       - Create a new custom agent
  edit NAME         - Edit an existing agent
  delete NAME       - Delete a custom agent
  export NAME FILE  - Export agent configuration
  import FILE       - Import agent configuration
```

CLI Interaction Flow

```text
┌─────────────┐     ┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│             │     │             │     │             │     │             │
│  Start CLI  │────▶│ Select Mode │────▶│ Set Config  │────▶│   Session   │
│             │     │             │     │             │     │ Interaction │
└─────────────┘     └─────────────┘     └─────────────┘     └──────┬──────┘
                                                                   │
┌─────────────┐     ┌─────────────┐     ┌─────────────┐     ┌──────▼──────┐
│             │     │             │     │             │     │             │
│   Export    │◀────│   Session   │◀────│  Generate   │◀────│    User     │
│   Results   │     │ Management  │     │  Response   │     │   Prompt    │
│             │     │             │     │             │     │             │
└─────────────┘     └─────────────┘     └─────────────┘     └─────────────┘
```

CLI Implementation Example

```python
# mcp_cli.py
import argparse
import os
import sys
import time

import colorama
import requests
import yaml
from colorama import Fore, Style
from prompt_toolkit import PromptSession
from prompt_toolkit.auto_suggest import AutoSuggestFromHistory
from prompt_toolkit.completion import WordCompleter
from prompt_toolkit.formatted_text import ANSI
from prompt_toolkit.history import FileHistory
from rich.console import Console
from rich.markdown import Markdown
from rich.panel import Panel
from rich.progress import Progress

# Initialize colorama for cross-platform color support
colorama.init()
console = Console()

CONFIG_PATH = os.path.expanduser("~/.mcp/config.yaml")
HISTORY_PATH = os.path.expanduser("~/.mcp/history")  # directory holding history files
API_URL = "http://localhost:8000/api/v1"


def ensure_config_dir():
    """Ensure the config and history directories exist."""
    os.makedirs(os.path.dirname(CONFIG_PATH), exist_ok=True)
    # HISTORY_PATH is used as a directory in chat_command, so create it itself
    os.makedirs(HISTORY_PATH, exist_ok=True)


def load_config(path=CONFIG_PATH):
    """Load configuration from file, creating a default one if it is missing."""
    ensure_config_dir()

    if not os.path.exists(path):
        # Create default config
        config = {
            "api": {
                "url": API_URL,
                "key": None
            },
            "defaults": {
                "model": "auto",
                "provider": "auto",
                "agent": "general"
            },
            "output": {
                "format": "markdown",
                "show_model_info": True
            }
        }

        with open(path, "w") as f:
            yaml.dump(config, f, default_flow_style=False)

        console.print(f"Created default config at {path}", style="yellow")
        return config

    with open(path, "r") as f:
        return yaml.safe_load(f)


def save_config(config, path=CONFIG_PATH):
    """Save configuration to file."""
    with open(path, "w") as f:
        yaml.dump(config, f, default_flow_style=False)


def get_api_key(config):
    """Get API key from config or environment."""
    if config["api"]["key"]:
        return config["api"]["key"]

    env_key = os.environ.get("MCP_API_KEY")
    if env_key:
        return env_key

    # If no key is configured, prompt the user
    console.print("No API key found. Please enter your API key:", style="yellow")
    key = input("> ")

    if key:
        config["api"]["key"] = key
        save_config(config)
        return key

    console.print("No API key provided. Some features may not work.", style="red")
    return None


def make_api_request(endpoint, method="GET", data=None, config=None):
    """Make an API request to the MCP backend."""
    if config is None:
        config = load_config()

    api_key = get_api_key(config)
    headers = {"Content-Type": "application/json"}

    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"

    url = f"{config['api']['url']}/{endpoint.lstrip('/')}"

    try:
        # A timeout keeps the CLI from hanging if the backend is unreachable
        if method == "GET":
            response = requests.get(url, headers=headers, timeout=60)
        elif method == "POST":
            response = requests.post(url, headers=headers, json=data, timeout=60)
        else:
            raise ValueError(f"Unsupported HTTP method: {method}")

        response.raise_for_status()
        return response.json()
    except requests.exceptions.RequestException as e:
        console.print(f"API request failed: {str(e)}", style="red")
        return None


def display_response(response_text, format_type="markdown"):
    """Display a response with appropriate formatting."""
    if format_type == "markdown":
        console.print(Markdown(response_text))
    else:
        console.print(response_text)


def chat_command(args, config):
    """Start an interactive chat session."""
    session_id = args.session_id
    model_name = args.model or config["defaults"]["model"]
    provider = args.provider or config["defaults"]["provider"]
    agent_type = args.agent or config["defaults"]["agent"]

    console.print(Panel(f"Starting MCP Chat Session\nModel: {model_name} | Provider: {provider} | Agent: {agent_type}"))
    console.print("Type 'exit' to quit, 'help' for commands", style="dim")

    # Set up prompt session with history
    ensure_config_dir()
    history_file = os.path.join(HISTORY_PATH, "chat_history")
    session = PromptSession(
        history=FileHistory(history_file),
        auto_suggest=AutoSuggestFromHistory(),
        completer=WordCompleter(['exit', 'help', 'models', 'clear', 'save', 'switch'])
    )

    # With no session ID, the backend creates a session on the first request
    while True:
        try:
            # prompt_toolkit needs the ANSI wrapper to interpret colorama escape codes
            user_input = session.prompt(ANSI(f"{Fore.GREEN}You: {Style.RESET_ALL}"))

            if user_input.lower() in ('exit', 'quit'):
                break

            if not user_input.strip():
                continue

            # Handle special commands
            if user_input.lower() == 'help':
                console.print(Panel(
                    "Available commands:\n"
                    "- exit/quit: Exit the chat session\n"
                    "- clear: Clear the current conversation\n"
                    "- save FILENAME: Save conversation to file\n"
                    "- models: List available models\n"
                    "- switch MODEL: Switch to a different model\n"
                    "- provider PROVIDER: Switch to a different provider"
                ))
                continue

            # For normal input, send to API
            with Progress(transient=True) as progress:
                task = progress.add_task("[cyan]Generating response...", total=None)

                data = {
                    "message": user_input,
                    "session_id": session_id,
                    "model_params": {
                        "provider": provider,
                        "model": model_name,
                        "auto_select": provider == "auto"
                    }
                }

                response = make_api_request("chat", method="POST", data=data, config=config)
                progress.update(task, completed=100)

            if response:
                session_id = response.get("session_id", session_id)
                model_used = response.get("model_used", model_name)
                provider_used = response.get("provider_used", provider)

                # Display provider and model info if configured.
                # markup=False stops rich from parsing "[provider:model]" as a style tag.
                if config["output"]["show_model_info"]:
                    console.print(f"\nMCP [{provider_used}:{model_used}]:", style="blue", markup=False)
                else:
                    console.print("\nMCP:", style="blue")

                display_response(response["response"], config["output"]["format"])
                console.print()  # Empty line for readability

        except (KeyboardInterrupt, EOFError):
            break
        except Exception as e:
            console.print(f"Error: {str(e)}", style="red")

    console.print("Chat session ended")


def models_command(args, config):
    """List and manage available models."""
    if args.pull:
        # Pull a new model for Ollama
        console.print(f"Pulling Ollama model: {args.pull}")

        with Progress() as progress:
            task = progress.add_task(f"[cyan]Pulling {args.pull}...", total=None)

            # Placeholder: a real implementation would call the Ollama API here
            time.sleep(2)  # Simulating download

            progress.update(task, completed=100)

        console.print(f"Successfully pulled {args.pull}", style="green")
        return

    # List available models
    console.print(Panel("Available Models"))

    console.print("\n[bold]OpenAI Models:[/bold]")
    openai_models = [
        {"name": "gpt-4-turbo", "description": "Advanced reasoning, current knowledge"},
        {"name": "gpt-3.5-turbo", "description": "Fast, efficient for standard tasks"}
    ]

    for model in openai_models:
        console.print(f" • {model['name']} - {model['description']}")

    console.print("\n[bold]Ollama Models:[/bold]")

    # In a real implementation, this would fetch from Ollama API
    ollama_models = [
        {"name": "llama2", "description": "General purpose local model", "installed": True},
        {"name": "mistral", "description": "Strong reasoning, 8k context window", "installed": True},
        {"name": "codellama", "description": "Specialized for code generation", "installed": True},
        {"name": "wizard-math", "description": "Mathematical problem-solving", "installed": False}
    ]

    for model in ollama_models:
        status = "[green]✓[/green]" if model["installed"] else "[red]✗[/red]"
        console.print(f" {status} {model['name']} - {model['description']}")

    console.print("\nUse 'mcp models --pull MODEL_NAME' to download a model")


def config_command(args, config):
    """View or edit configuration."""
    if args.set:
        if '=' not in args.set:
            console.print("Expected KEY=VALUE, e.g. output.format=markdown", style="red")
            return

        # Set a configuration value
        key, value = args.set.split('=', 1)
        keys = key.split('.')

        # Navigate to the nested key
        current = config
        for k in keys[:-1]:
            if k not in current:
                current[k] = {}
            current = current[k]

        # Set the value (with type conversion)
        if value.lower() == 'true':
            current[keys[-1]] = True
        elif value.lower() == 'false':
            current[keys[-1]] = False
        elif value.isdigit():
            current[keys[-1]] = int(value)
        else:
            current[keys[-1]] = value

        # Write back to the file the config was loaded from
        save_config(config, args.config or CONFIG_PATH)
        console.print(f"Configuration updated: {key} = {value}", style="green")
        return

    # Display current configuration
    console.print(Panel("MCP Configuration"))
    console.print(yaml.dump(config), markup=False)
    console.print("\nUse 'mcp config --set key.path=value' to change settings")


def agent_command(args, config):
    """Manage agent profiles."""
    if args.create:
        # Create a new agent profile
        console.print(f"Creating agent profile: {args.create}")
        # Implementation would collect agent parameters
        return

    if args.edit:
        # Edit an existing agent profile
        console.print(f"Editing agent profile: {args.edit}")
        return

    # List available agents
    console.print(Panel("Available Agents"))

    console.print("\n[bold]System Agents:[/bold]")
    system_agents = [
        {"name": "general", "description": "General purpose assistant"},
        {"name": "research", "description": "Research specialist with knowledge tools"},
        {"name": "coding", "description": "Code assistant with tool integration"},
        {"name": "creative", "description": "Creative writing and content generation"}
    ]

    for agent in system_agents:
        console.print(f" • {agent['name']} - {agent['description']}")

    # In a real implementation, this would load from user config
    custom_agents = [
        {"name": "my-math-tutor", "description": "Mathematics teaching and problem solving"},
        {"name": "data-analyst", "description": "Data analysis with visualization tools"}
    ]

    if custom_agents:
        console.print("\n[bold]Custom Agents:[/bold]")
        for agent in custom_agents:
            console.print(f" • {agent['name']} - {agent['description']}")

    console.print("\nUse 'mcp agents --create NAME' to create a new agent")


def main():
    """Main entry point for the CLI."""
    parser = argparse.ArgumentParser(description="MCP Command Line Interface")
    parser.add_argument('--config', help="Path to config file")
    parser.add_argument('--verbose', action='store_true', help="Enable verbose output")

    subparsers = parser.add_subparsers(dest='command', help='Command to run')

    # Chat command
    chat_parser = subparsers.add_parser('chat', help='Start a chat session')
    chat_parser.add_argument('--model', help='Model to use')
    chat_parser.add_argument('--provider', choices=['openai', 'ollama', 'auto'], help='Provider to use')
    chat_parser.add_argument('--agent', help='Agent type to use')
    chat_parser.add_argument('--session-id', help='Resume an existing session')

    # Complete command (one-shot completion)
    complete_parser = subparsers.add_parser('complete', help='Get a completion for a prompt')
    complete_parser.add_argument('prompt', help='Prompt text')
    complete_parser.add_argument('--model', help='Model to use')
    complete_parser.add_argument('--provider', choices=['openai', 'ollama', 'auto'], help='Provider to use')

    # Models command
    models_parser = subparsers.add_parser('models', help='List and manage available models')
    models_parser.add_argument('--pull', metavar='MODEL_NAME', help='Download a model to Ollama')
    models_parser.add_argument('--info', metavar='MODEL_NAME', help='Show detailed model information')
    models_parser.add_argument('--benchmark', metavar='MODEL_NAME', help='Run performance benchmark')

    # Config command
    config_parser = subparsers.add_parser('config', help='Configure MCP settings')
    config_parser.add_argument('--set', metavar='KEY=VALUE', help='Set a configuration value')

    # Agents command
    agents_parser = subparsers.add_parser('agents', help='Manage agent profiles')
    agents_parser.add_argument('--create', metavar='NAME', help='Create a new custom agent')
    agents_parser.add_argument('--edit', metavar='NAME', help='Edit an existing agent')
    agents_parser.add_argument('--delete', metavar='NAME', help='Delete a custom agent')

    # Session command
    session_parser = subparsers.add_parser('session', help='Manage saved sessions')
    session_parser.add_argument('--list', action='store_true', help='List saved sessions')
    session_parser.add_argument('--delete', metavar='SESSION_ID', help='Delete a session')
    session_parser.add_argument('--export', metavar='SESSION_ID', help='Export a session')

    args = parser.parse_args()

    # Load configuration, honoring --config when it is given
    config_path = args.config if args.config else CONFIG_PATH

    if args.config and not os.path.exists(args.config):
        console.print(f"Config file not found: {args.config}", style="red")
        return 1

    config = load_config(config_path)

    # Execute the appropriate command
    if args.command == 'chat':
        chat_command(args, config)
    elif args.command == 'complete':
        # Implementation for complete command
        pass
    elif args.command == 'models':
        models_command(args, config)
    elif args.command == 'config':
        config_command(args, config)
    elif args.command == 'agents':
        agent_command(args, config)
    elif args.command == 'session':
        # Implementation for session command
        pass
    else:
        # No command specified, show help
        parser.print_help()

    return 0


if __name__ == "__main__":
    sys.exit(main())
```
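
The `--set key.path=value` handling in `config_command` can be isolated into a small pure function, which makes it testable without touching the config file. The sketch below is one possible way to do that; `set_nested` and `coerce` are illustrative names that do not appear in the CLI above. Unlike the `isdigit()` check in the CLI, this version also converts negative integers.

```python
from typing import Any


def coerce(value: str) -> Any:
    """Convert 'true'/'false' to bool and integer strings to int; else keep the string."""
    lowered = value.lower()
    if lowered in ("true", "false"):
        return lowered == "true"
    try:
        return int(value)
    except ValueError:
        return value


def set_nested(config: dict, assignment: str) -> dict:
    """Apply a 'dotted.key=value' assignment to a nested dict in place."""
    key, sep, raw = assignment.partition("=")
    if not sep or not key:
        raise ValueError("expected KEY=VALUE")
    *parents, leaf = key.split(".")
    node = config
    for part in parents:
        child = node.setdefault(part, {})
        if not isinstance(child, dict):
            raise ValueError(f"{part!r} is not a section")
        node = child
    node[leaf] = coerce(raw)
    return config


if __name__ == "__main__":
    cfg = {"output": {"format": "markdown"}}
    print(set_nested(cfg, "output.show_model_info=false"))
```

Because only the first `=` splits the assignment, values that themselves contain `=` (such as URLs with query strings) are preserved intact.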

Web Interface Design

Web Interface Architecture

text
1┌────────────────────────────────────────────────────────────────────┐
2│ │
3│ React Frontend │
4│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌───────────┐ │
5│ │ Chat │ │ Model │ │ Agent │ │ Settings │ │
6│ │ Interface │ │ Management │ │ Configuration│ │ Manager │ │
7│ └──────────────┘ └──────────────┘ └──────────────┘ └───────────┘ │
8│ │ │ │ │ │
9│ └───────────────┼────────────────┼───────────────┘ │
10│ │ │ │
11│ ▼ ▼ │
12│ ┌─────────────┐ ┌────────────┐ │
13│ │ Auth │ │ API Client │ │
14│ │ Management │ │ │ │
15│ └─────────────┘ └────────────┘ │
16│ │
17└────────────────────────────────────────────────────────────────────┘
18
19
20┌────────────────────────────────────────────────────────────────────┐
21│ │
22│ FastAPI Backend │
23│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌───────────┐ │
24│ │ Chat │ │ Model │ │ Agent │ │ User │ │
25│ │ Controller │ │ Controller │ │ Controller │ │ Controller│ │
26│ └──────────────┘ └──────────────┘ └──────────────┘ └───────────┘ │
27│ │ │ │ │ │
28│ └───────────────┼────────────────┼───────────────┘ │
29│ │ │ │
30│ ▼ ▼ │
31│ ┌───────────────────┐ ┌────────────────────┐ │
32│ │ Provider Service │ │ Agent Factory │ │
33│ └───────────────────┘ └────────────────────┘ │
34│ │ │ │
35│ ▼ ▼ │
36│ ┌─────────────┐ ┌─────────────┐ │
37│ │ OpenAI API │ │ Ollama API │ │
38│ └─────────────┘ └─────────────┘ │
39│ │
40└────────────────────────────────────────────────────────────────────┘

Web Interface Wireframes

Chat Interface

text
1┌─────────────────────────────────────────────────────────────────────────┐
2│ MCP Assistant 🔄 New Chat ⚙️ │
3├─────────────────────────────────────────────────────────────────────────┤
4│ │
5│ ┌─────────────────────────┐ ┌───────────────────────────────────────┐ │
6│ │ Chat History │ │ │ │
7│ │ │ │ User: Tell me about quantum computing │ │
8│ │ Welcome │ │ │ │
9│ │ Quantum Computing │ │ MCP: Quantum computing is a type of │ │
10│ │ AI Ethics │ │ computation that harnesses quantum │ │
11│ │ Python Tutorial │ │ mechanical phenomena like super- │ │
12│ │ │ │ position and entanglement. │ │
13│ │ │ │ │ │
14│ │ │ │ Unlike classical bits that represent │ │
15│ │ │ │ either 0 or 1, quantum bits or │ │
16│ │ │ │ "qubits" can exist in multiple states │ │
17│ │ │ │ simultaneously due to superposition. │ │
18│ │ │ │ │ │
19│ │ │ │ [Response continues...] │ │
20│ │ │ │ │ │
21│ │ │ │ User: How does quantum entanglement │ │
22│ │ │ │ work? │ │
23│ │ │ │ │ │
24│ │ │ │ MCP is typing... │ │
25│ │ │ │ │ │
26│ └─────────────────────────┘ └───────────────────────────────────────┘ │
27│ │
28│ ┌─────────────────────────────────────────────────────────────────┐ │
29│ │ Type your message... Send ▶ │ │
30│ └─────────────────────────────────────────────────────────────────┘ │
31│ │
32│ Model: auto (OpenAI:gpt-4) | Mode: Research | Memory: Enabled │
33│ │
34└─────────────────────────────────────────────────────────────────────────┘

Model Settings Panel

text
1┌─────────────────────────────────────────────────────────────────────────┐
2│ MCP Assistant > Settings > Models ✖ │
3├─────────────────────────────────────────────────────────────────────────┤
4│ │
5│ Model Selection │
6│ ┌─────────────────────────────────────────────────────────────────┐ │
7│ │ ● Auto-select model (recommended) │ │
8│ │ ○ Specify model and provider │ │
9│ └─────────────────────────────────────────────────────────────────┘ │
10│ │
11│ Provider Model │
12│ ┌────────────┐ ┌────────────────────┐ │
13│ │ OpenAI ▼ │ │ gpt-4-turbo ▼ │ │
14│ └────────────┘ └────────────────────┘ │
15│ │
16│ Auto-Selection Preferences │
17│ ┌─────────────────────────────────────────────────────────────────┐ │
18│ │ Prioritize: ● Speed ○ Quality ○ Privacy ○ Cost │ │
19│ │ │ │
20│ │ Complexity threshold: ███████████░░░░░░░░░ 0.65 │ │
21│ │ │ │
22│ │ [✓] Prefer Ollama for privacy-sensitive content │ │
23│ │ [✓] Use OpenAI for complex reasoning │ │
24│ │ [✓] Automatically fall back if a provider fails │ │
25│ └─────────────────────────────────────────────────────────────────┘ │
26│ │
27│ Available Ollama Models │
28│ ┌─────────────────────────────────────────────────────────────────┐ │
29│ │ ✓ llama2 ✓ mistral ✓ codellama │ │
30│ │ ✓ wizard-math ✓ neural-chat ○ llama2:70b [Download] │ │
31│ └─────────────────────────────────────────────────────────────────┘ │
32│ │
33│ [ Save Changes ] [ Cancel ] │
34│ │
35└─────────────────────────────────────────────────────────────────────────┘
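
The auto-selection options in this panel can be read as a small routing rule. The following is a minimal sketch of that idea, not the system's actual router: `RoutingPreferences` and `choose_provider` are hypothetical names, the rule order (privacy first, then complexity, then fallback) is an assumption, and sending simple queries to Ollama by default is likewise assumed.

```python
from dataclasses import dataclass


@dataclass
class RoutingPreferences:
    """Hypothetical mirror of the panel's auto-selection options."""
    complexity_threshold: float = 0.65
    prefer_ollama_for_privacy: bool = True
    use_openai_for_complex: bool = True
    fallback_on_failure: bool = True


def choose_provider(complexity: float,
                    privacy_sensitive: bool,
                    prefs: RoutingPreferences,
                    ollama_online: bool = True,
                    openai_online: bool = True) -> str:
    """Return "ollama" or "openai" for a single query (assumed rule order)."""
    pinned_local = privacy_sensitive and prefs.prefer_ollama_for_privacy

    if pinned_local:
        preferred = "ollama"
    elif prefs.use_openai_for_complex and complexity >= prefs.complexity_threshold:
        preferred = "openai"
    else:
        preferred = "ollama"  # assumed default for simple queries

    available = {"ollama": ollama_online, "openai": openai_online}
    if available[preferred]:
        return preferred

    # Privacy-pinned queries are never silently rerouted to the cloud
    if pinned_local:
        raise RuntimeError("Ollama is offline and the query is privacy-sensitive")

    other = "openai" if preferred == "ollama" else "ollama"
    if prefs.fallback_on_failure and available[other]:
        return other
    raise RuntimeError(f"{preferred} is unavailable and fallback is not possible")
```

With the defaults shown in the wireframe, a query scored at 0.9 would go to OpenAI unless it is flagged as privacy-sensitive, in which case it stays on Ollama.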

Agent Configuration Panel

text
1┌─────────────────────────────────────────────────────────────────────────┐
2│ MCP Assistant > Settings > Agents ✖ │
3├─────────────────────────────────────────────────────────────────────────┤
4│ │
5│ Current Agent: Research Assistant [Edit ✏] │
6│ │
7│ Agent Library │
8│ ┌─────────────────────────────────────────────────────────────────┐ │
9│ │ ● Research Assistant Knowledge-focused with search capability│ │
10│ │ ○ Code Assistant Specialized for software development │ │
11│ │ ○ Creative Writer Content creation and storytelling │ │
12│ │ ○ Math Tutor Step-by-step problem solving │ │
13│ │ ○ General Assistant Versatile helper for everyday tasks │ │
14│ └─────────────────────────────────────────────────────────────────┘ │
15│ │
16│ Agent Capabilities │
17│ ┌─────────────────────────────────────────────────────────────────┐ │
18│ │ [✓] Knowledge retrieval [ ] Code execution │ │
19│ │ [✓] Web search [ ] Data visualization │ │
20│ │ [✓] Memory [ ] File operations │ │
21│ │ [✓] Calendar awareness [ ] Email integration │ │
22│ └─────────────────────────────────────────────────────────────────┘ │
23│ │
24│ System Instructions │
25│ ┌─────────────────────────────────────────────────────────────────┐ │
26│ │ You are a research assistant with expertise in finding and │ │
27│ │ synthesizing information. Provide comprehensive, accurate │ │
28│ │ answers with authoritative sources when available. │ │
29│ │ │ │
30│ │ │ │
31│ └─────────────────────────────────────────────────────────────────┘ │
32│ │
33│ [ Save Agent ] [ Create New Agent ] [ Import ] [ Export ] │
34│ │
35└─────────────────────────────────────────────────────────────────────────┘

Dashboard View

text
1┌─────────────────────────────────────────────────────────────────────────┐
2│ MCP Assistant > Dashboard ⚙️ │
3├─────────────────────────────────────────────────────────────────────────┤
4│ │
5│ System Status Last 24 Hours │
6│ ┌────────────────────────────┐ ┌────────────────────────────────┐ │
7│ │ OpenAI: ● Connected │ │ Requests: 143 │ │
8│ │ Ollama: ● Connected │ │ OpenAI: 62% | Ollama: 38% │ │
9│ │ Database: ● Operational │ │ Avg Response Time: 2.4s │ │
10│ └────────────────────────────┘ └────────────────────────────────┘ │
11│ │
12│ Recent Conversations │
13│ ┌─────────────────────────────────────────────────────────────────┐ │
14│ │ ● Quantum Computing Research Today, 14:32 [Resume] │ │
15│ │ ● Python Code Debugging Today, 10:15 [Resume] │ │
16│ │ ● Travel Planning Yesterday [Resume] │ │
17│ │ ● Financial Analysis 2 days ago [Resume] │ │
18│ └─────────────────────────────────────────────────────────────────┘ │
19│ │
20│ Model Usage Agent Usage │
21│ ┌────────────────────────────┐ ┌────────────────────────────────┐ │
22│ │ ███ OpenAI:gpt-4 27% │ │ ███ Research Assistant 42% │ │
23│ │ ███ OpenAI:gpt-3.5 35% │ │ ███ Code Assistant 31% │ │
24│ │ ███ Ollama:mistral 20% │ │ ███ General Assistant 18% │ │
25│ │ ███ Ollama:llama2 18% │ │ ███ Other 9% │ │
26│ └────────────────────────────┘ └────────────────────────────────┘ │
27│ │
28│ API Credits │
29│ ┌─────────────────────────────────────────────────────────────────┐ │
30│ │ OpenAI: $4.32 used this month of $10.00 budget ████░░░░░ 43% │ │
31│ │ Estimated savings from Ollama usage: $3.87 │ │
32│ └─────────────────────────────────────────────────────────────────┘ │
33│ │
34│ [ New Chat ] [ View All Conversations ] [ System Settings ] │
35│ │
36└─────────────────────────────────────────────────────────────────────────┘

Web Interface Interaction Flow

text
┌──────────────┐     ┌───────────────┐     ┌────────────────┐
│              │     │               │     │                │
│  Login Page  │────▶│   Dashboard   │────▶│ Chat Interface │◀───┐
│              │     │               │     │                │    │
└──────────────┘     └───────┬───────┘     └────────┬───────┘    │
                             │                      │            │
                             ▼                      ▼            │
                     ┌───────────────┐     ┌────────────────┐    │
                     │               │     │                │    │
                     │Settings Panel │     │  User Message  │    │
                     │               │     │                │    │
                     └───┬───────────┘     └────────┬───────┘    │
                         │                          │            │
                         ▼                          ▼            │
                ┌────────────────┐         ┌────────────────┐    │
                │                │         │                │    │
                │ Model Settings │         │ API Processing │    │
                │                │         │                │    │
                └────────┬───────┘         └────────┬───────┘    │
                         │                          │            │
                         ▼                          ▼            │
                ┌────────────────┐         ┌────────────────┐    │
                │                │         │                │    │
                │ Agent Settings │         │System Response │────┘
                │                │         │                │
                └────────────────┘         └────────────────┘

Key Web Components

ProviderSelector Component

jsx
// ProviderSelector.jsx
import React, { useState, useEffect } from 'react';
// Select (not Dropdown) is the antd component that accepts options/value/onChange
import { Select, Switch, Slider, Checkbox, Card, Alert } from 'antd';
import { ApiOutlined, QuestionCircleOutlined } from '@ant-design/icons';

// OpenAI models offered in manual mode
const OPENAI_MODELS = [
  { value: 'gpt-4o', label: 'GPT-4o' },
  { value: 'gpt-4-turbo', label: 'GPT-4 Turbo' },
  { value: 'gpt-3.5-turbo', label: 'GPT-3.5 Turbo' }
];

// Providers selectable in manual mode; auto-selection is controlled by the switch
const PROVIDER_OPTIONS = [
  { value: 'openai', label: 'OpenAI' },
  { value: 'ollama', label: 'Ollama (Local)' }
];

const ProviderSelector = ({
  onProviderChange,
  onModelChange,
  initialProvider = 'auto',
  initialModel = null,
  showAdvanced = false
}) => {
  const [provider, setProvider] = useState(initialProvider);
  const [model, setModel] = useState(initialModel);
  const [autoSelect, setAutoSelect] = useState(initialProvider === 'auto');
  const [complexityThreshold, setComplexityThreshold] = useState(0.65);
  const [prioritizePrivacy, setPrioritizePrivacy] = useState(false);
  const [ollamaModels, setOllamaModels] = useState([]);
  const [ollamaStatus, setOllamaStatus] = useState('unknown'); // 'online', 'offline', 'unknown'

  // Fetch available Ollama models on component mount
  useEffect(() => {
    const fetchOllamaModels = async () => {
      try {
        const response = await fetch('/api/v1/models/ollama');
        if (response.ok) {
          const data = await response.json();
          // Tolerate a missing "models" key instead of throwing
          setOllamaModels((data.models || []).map(m => ({
            value: m.name,
            label: m.name
          })));
          setOllamaStatus('online');
        } else {
          setOllamaStatus('offline');
        }
      } catch (error) {
        console.error('Error fetching Ollama models:', error);
        setOllamaStatus('offline');
      }
    };

    fetchOllamaModels();
  }, []);

  const handleProviderChange = (value) => {
    setProvider(value);
    onProviderChange(value);

    // Reset model when changing provider
    setModel(null);
    onModelChange(null);
  };

  const handleModelChange = (value) => {
    setModel(value);
    onModelChange(value);
  };

  const handleAutoSelectChange = (checked) => {
    setAutoSelect(checked);
    if (checked) {
      setProvider('auto');
      onProviderChange('auto');
      setModel(null);
      onModelChange(null);
    } else {
      // Default to OpenAI if disabling auto-select
      setProvider('openai');
      onProviderChange('openai');
      setModel('gpt-3.5-turbo');
      onModelChange('gpt-3.5-turbo');
    }
  };

  return (
    <Card title="Model Selection" extra={<QuestionCircleOutlined />}>
      <div className="provider-selector">
        <div className="selector-row">
          <Switch
            checked={autoSelect}
            onChange={handleAutoSelectChange}
            checkedChildren="Auto-select"
            unCheckedChildren="Manual"
          />
          <span className="selector-label">
            {autoSelect
              ? 'Automatically select the best model for each query'
              : 'Manually choose provider and model'}
          </span>
        </div>

        {!autoSelect && (
          <div className="selector-row model-selection">
            <div className="provider-dropdown">
              <span>Provider:</span>
              <Select
                options={PROVIDER_OPTIONS}
                value={provider}
                onChange={handleProviderChange}
                style={{ minWidth: 180 }}
              />
            </div>

            <div className="model-dropdown">
              <span>Model:</span>
              <Select
                options={provider === 'openai' ? OPENAI_MODELS : ollamaModels}
                value={model}
                onChange={handleModelChange}
                placeholder="Select a model"
                style={{ minWidth: 180 }}
              />
            </div>
          </div>
        )}

        {provider === 'ollama' && ollamaStatus === 'offline' && (
          <Alert
            message="Ollama is currently offline"
            description="Please start the Ollama service to use local models."
            type="warning"
            showIcon
          />
        )}

        {showAdvanced && (
          <div className="advanced-settings">
            <div className="setting-header">Advanced Routing Settings</div>

            <div className="setting-row">
              <span>Complexity threshold:</span>
              <Slider
                value={complexityThreshold}
                onChange={setComplexityThreshold}
                min={0}
                max={1}
                step={0.05}
                disabled={!autoSelect}
              />
              <span className="setting-value">{complexityThreshold}</span>
            </div>

            <div className="setting-row">
              <Checkbox
                checked={prioritizePrivacy}
                onChange={e => setPrioritizePrivacy(e.target.checked)}
                disabled={!autoSelect}
              >
                Prioritize privacy (prefer Ollama for sensitive content)
              </Checkbox>
            </div>

            <div className="model-status">
              <div>
                <ApiOutlined /> OpenAI: <span className="status-online">Connected</span>
              </div>
              <div>
                <ApiOutlined /> Ollama:{' '}
                <span className={ollamaStatus === 'online' ? 'status-online' : 'status-offline'}>
                  {ollamaStatus === 'online' ? 'Connected' : 'Disconnected'}
                </span>
              </div>
            </div>
          </div>
        )}
      </div>
    </Card>
  );
};

export default ProviderSelector;
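
The component above reads `data.models` and uses each entry's `name`, so the backend behind `/api/v1/models/ollama` has to return that shape. The following is a minimal sketch of one way to produce it, assuming the backend simply forwards Ollama's own model listing (`GET /api/tags` on the default local port). The helper names `fetch_ollama_tags` and `to_selector_payload` are hypothetical, and how the route is wired into the web framework is left out.

```python
import json
import urllib.request

# Ollama's default local address; adjust if the service runs elsewhere
OLLAMA_TAGS_URL = "http://localhost:11434/api/tags"


def fetch_ollama_tags(url: str = OLLAMA_TAGS_URL, timeout: float = 2.0) -> dict:
    """Fetch the raw model listing from a running Ollama instance."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def to_selector_payload(tags: dict) -> dict:
    """Reduce the listing to the {"models": [{"name": ...}]} shape the component reads."""
    names = sorted({m["name"] for m in tags.get("models", []) if m.get("name")})
    return {"models": [{"name": name} for name in names]}
```

If `fetch_ollama_tags` raises (for example because Ollama is not running), returning a non-2xx status from the route is enough for the component to show its offline warning.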

ChatInterface Component

jsx
// ChatInterface.jsx
import React, { useState, useEffect, useRef } from 'react';
import { Input, Button, Spin, Avatar, Tooltip, Card, Typography } from 'antd';
import { SendOutlined, UserOutlined, RobotOutlined, SettingOutlined,
  CopyOutlined, DeleteOutlined, InfoCircleOutlined } from '@ant-design/icons';
import ReactMarkdown from 'react-markdown';
import { Prism as SyntaxHighlighter } from 'react-syntax-highlighter';
import { tomorrow } from 'react-syntax-highlighter/dist/esm/styles/prism';
import ProviderSelector from './ProviderSelector';

const { TextArea } = Input;
const { Text, Title } = Typography;

const ChatInterface = () => {
  const [messages, setMessages] = useState([]);
  const [input, setInput] = useState('');
  const [loading, setLoading] = useState(false);
  const [sessionId, setSessionId] = useState(null);
  const [provider, setProvider] = useState('auto');
  const [model, setModel] = useState(null);
  const [showSettings, setShowSettings] = useState(false);
  const messagesEndRef = useRef(null);

  // Scroll to bottom when messages change
  useEffect(() => {
    messagesEndRef.current?.scrollIntoView({ behavior: 'smooth' });
  }, [messages]);

  const handleSend = async () => {
    const text = input.trim();
    if (!text || loading) return;

    // Add user message to chat
    const userMessage = { role: 'user', content: text, timestamp: new Date() };
    setMessages(prev => [...prev, userMessage]);
    setInput('');
    setLoading(true);

    try {
      const response = await fetch('/api/v1/chat', {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({
          message: text,
          session_id: sessionId,
          model_params: {
            provider: provider,
            model: model,
            auto_select: provider === 'auto'
          }
        })
      });

      if (!response.ok) {
        throw new Error(`Failed to get response (HTTP ${response.status})`);
      }

      const data = await response.json();

      // Update session ID if new
      if (data.session_id && !sessionId) {
        setSessionId(data.session_id);
      }

      // Add assistant message to chat
      const assistantMessage = {
        role: 'assistant',
        content: data.response,
        timestamp: new Date(),
        metadata: {
          model_used: data.model_used,
          provider_used: data.provider_used
        }
      };

      setMessages(prev => [...prev, assistantMessage]);
    } catch (error) {
      console.error('Error sending message:', error);
      // Add error message
      setMessages(prev => [...prev, {
        role: 'system',
        content: 'Error: Unable to get a response. Please try again.',
        error: true,
        timestamp: new Date()
      }]);
    } finally {
      setLoading(false);
    }
  };

  const handleKeyDown = (e) => {
    // Enter sends; Shift+Enter inserts a newline
    if (e.key === 'Enter' && !e.shiftKey) {
      e.preventDefault();
      handleSend();
    }
  };

  const handleCopyMessage = (content) => {
    // writeText returns a promise; log failures instead of leaving them unhandled
    navigator.clipboard.writeText(content).catch(error => {
      console.error('Error copying message:', error);
    });
    // Could show a toast notification here
  };

  const handleClearChat = () => {
    // Drop the session as well, so the next message starts a fresh conversation
    setMessages([]);
    setSessionId(null);
  };

  const renderMessage = (message, index) => {
    const isUser = message.role === 'user';
    const isError = message.error;

    return (
      <div
        key={index}
        className={`message-container ${isUser ? 'user-message' : 'assistant-message'} ${isError ? 'error-message' : ''}`}
      >
        <div className="message-avatar">
          <Avatar
            icon={isUser ? <UserOutlined /> : <RobotOutlined />}
            style={{ backgroundColor: isUser ? '#1890ff' : '#52c41a' }}
          />
        </div>

        <div className="message-content">
          <div className="message-header">
            <Text strong>{isUser ? 'You' : 'MCP Assistant'}</Text>
            {message.metadata && (
              <Tooltip title="Model information">
                <Text type="secondary" className="model-info">
                  <InfoCircleOutlined /> {message.metadata.provider_used}:{message.metadata.model_used}
                </Text>
              </Tooltip>
            )}
            <Text type="secondary" className="message-time">
              {message.timestamp.toLocaleTimeString()}
            </Text>
          </div>

          <div className="message-body">
            <ReactMarkdown
              components={{
                // Note: the `inline` prop is passed by react-markdown v8 and earlier;
                // v9 removed it, so this check needs adapting on newer versions
                code({ node, inline, className, children, ...props }) {
                  const match = /language-(\w+)/.exec(className || '');
                  return !inline && match ? (
                    <SyntaxHighlighter
                      style={tomorrow}
                      language={match[1]}
                      PreTag="div"
                      {...props}
                    >
                      {String(children).replace(/\n$/, '')}
                    </SyntaxHighlighter>
                  ) : (
                    <code className={className} {...props}>
                      {children}
                    </code>
                  );
                }
              }}
            >
              {message.content}
            </ReactMarkdown>
          </div>

          <div className="message-actions">
            <Button
              type="text"
              size="small"
              icon={<CopyOutlined />}
              onClick={() => handleCopyMessage(message.content)}
            >
              Copy
            </Button>
          </div>
        </div>
      </div>
    );
  };

  const settingsMenu = (
    <Card className="settings-panel">
      <Title level={4}>Chat Settings</Title>

      <ProviderSelector
        onProviderChange={setProvider}
        onModelChange={setModel}
        initialProvider={provider}
        initialModel={model}
        showAdvanced={true}
      />

      <div className="settings-actions">
        <Button type="primary" onClick={() => setShowSettings(false)}>
          Close Settings
        </Button>
      </div>
    </Card>
  );

  return (
    <div className="chat-interface">
      <div className="chat-header">
        <Title level={3}>MCP Assistant</Title>

        <div className="header-actions">
          <Button icon={<DeleteOutlined />} onClick={handleClearChat}>
            Clear Chat
          </Button>
          <Button
            icon={<SettingOutlined />}
            type={showSettings ? 'primary' : 'default'}
            onClick={() => setShowSettings(!showSettings)}
          >
            Settings
          </Button>
        </div>
      </div>

      {showSettings && settingsMenu}

      <div className="message-list">
        {messages.length === 0 && (
          <div className="empty-state">
            <Title level={4}>Start a conversation</Title>
            <Text>Ask a question or request information</Text>
          </div>
        )}

        {messages.map(renderMessage)}

        {loading && (
          <div className="message-container assistant-message">
            <div className="message-avatar">
              <Avatar icon={<RobotOutlined />} style={{ backgroundColor: '#52c41a' }} />
            </div>
            <div className="message-content">
              <div className="message-body typing-indicator">
                <Spin /> MCP is thinking...
              </div>
            </div>
          </div>
        )}

        <div ref={messagesEndRef} />
      </div>

      <div className="chat-input">
        <TextArea
          value={input}
          onChange={e => setInput(e.target.value)}
          onKeyDown={handleKeyDown}
          placeholder="Type your message..."
          autoSize={{ minRows: 1, maxRows: 4 }}
          disabled={loading}
        />
        <Button
          type="primary"
          icon={<SendOutlined />}
          onClick={handleSend}
          disabled={loading || !input.trim()}
        >
          Send
        </Button>
      </div>

      <div className="chat-footer">
        <Text type="secondary">
          Model: {provider === 'auto' ? 'Auto-select' : `${provider}:${model || 'default'}`}
        </Text>
        {sessionId && (
          <Text type="secondary">Session ID: {sessionId}</Text>
        )}
      </div>
    </div>
  );
};

export default ChatInterface;
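
The component above implies a small request/response contract for `/api/v1/chat`: it posts `message`, `session_id`, and `model_params`, and reads `response`, `session_id`, `model_used`, and `provider_used` from the reply. The following is a rough sketch of how a backend might validate and answer that payload. The names `ChatRequest`, `parse_chat_request`, and `build_chat_response` are hypothetical, and generating a UUID for a missing session is an assumption rather than documented behavior.

```python
import uuid
from dataclasses import dataclass
from typing import Optional

VALID_PROVIDERS = {"auto", "openai", "ollama"}


@dataclass
class ChatRequest:
    message: str
    session_id: str
    provider: str
    model: Optional[str]
    auto_select: bool


def parse_chat_request(payload: dict) -> ChatRequest:
    """Validate the JSON body sent by the chat component."""
    message = (payload.get("message") or "").strip()
    if not message:
        raise ValueError("message must not be empty")

    params = payload.get("model_params") or {}
    provider = params.get("provider") or "auto"
    if provider not in VALID_PROVIDERS:
        raise ValueError(f"unknown provider: {provider!r}")

    return ChatRequest(
        message=message,
        # Assumption: a missing session id starts a new session
        session_id=payload.get("session_id") or str(uuid.uuid4()),
        provider=provider,
        model=params.get("model"),
        auto_select=bool(params.get("auto_select", provider == "auto")),
    )


def build_chat_response(request: ChatRequest, text: str,
                        provider_used: str, model_used: str) -> dict:
    """Build the reply with exactly the keys the component reads."""
    return {
        "response": text,
        "session_id": request.session_id,
        "provider_used": provider_used,
        "model_used": model_used,
    }
```

Returning the session id on every reply lets the client adopt it on the first exchange and send it back on later ones, which is what the component's `sessionId` state does.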

AgentConfiguration Component

jsx
// AgentConfiguration.jsx
import React, { useState, useEffect } from 'react';
import { Form, Input, Button, Select, Checkbox, Card, Typography, Tabs, message } from 'antd';
import { SaveOutlined, PlusOutlined, ImportOutlined, ExportOutlined } from '@ant-design/icons';

const { Title, Text } = Typography;
const { TextArea } = Input;
const { Option } = Select;
const { TabPane } = Tabs;

const AgentConfiguration = () => {
  const [form] = Form.useForm();
  const [agents, setAgents] = useState([]);
  const [currentAgent, setCurrentAgent] = useState(null);
  const [loading, setLoading] = useState(false);

  // Fetch available agents on component mount
  useEffect(() => {
    const fetchAgents = async () => {
      setLoading(true);
      try {
        const response = await fetch('/api/v1/agents');
        if (response.ok) {
          const data = await response.json();
          setAgents(data.agents);

          // Set current agent to the first one
          if (data.agents.length > 0) {
            setCurrentAgent(data.agents[0]);
            form.setFieldsValue(data.agents[0]);
          }
        } else {
          message.error('Failed to load agents');
        }
      } catch (error) {
        console.error('Error fetching agents:', error);
        message.error('Failed to load agents');
      } finally {
        setLoading(false);
      }
    };

    fetchAgents();
  }, [form]);

  const handleAgentChange = (agentId) => {
    const selected = agents.find(a => a.id === agentId);
    if (selected) {
      setCurrentAgent(selected);
      // Clear fields the selected agent does not define before filling the form
      form.resetFields();
      form.setFieldsValue(selected);
    }
  };

  const handleSaveAgent = async (values) => {
    // currentAgent is null while a new agent is being created
    const isNew = !currentAgent;
    setLoading(true);
    try {
      // Assumption: new agents are created with POST /api/v1/agents, and the
      // response body is the stored agent including its id
      const response = await fetch(
        isNew ? '/api/v1/agents' : `/api/v1/agents/${currentAgent.id}`,
        {
          method: isNew ? 'POST' : 'PUT',
          headers: { 'Content-Type': 'application/json' },
          body: JSON.stringify(values)
        }
      );

      if (!response.ok) {
        message.error('Failed to save agent configuration');
        return;
      }

      // Update local state
      if (isNew) {
        const created = await response.json();
        setAgents(prev => [...prev, created]);
        setCurrentAgent(created);
        message.success('Agent created');
      } else {
        const updated = { ...currentAgent, ...values };
        setAgents(prev => prev.map(a => (a.id === updated.id ? updated : a)));
        setCurrentAgent(updated);
        message.success('Agent configuration saved');
      }
    } catch (error) {
      console.error('Error saving agent:', error);
      message.error('Error saving agent configuration');
    } finally {
      setLoading(false);
    }
  };

  const handleCreateAgent = () => {
    form.resetFields();
    form.setFieldsValue({
      name: 'New Agent',
      description: 'Custom assistant',
      capabilities: [],
      system_prompt: 'You are a helpful assistant.'
    });

    setCurrentAgent(null); // Indicates we're creating a new agent
  };

  const handleExportAgent = () => {
    if (!currentAgent) return;

    // Serialize the agent and trigger a client-side download
    const agentData = JSON.stringify(currentAgent, null, 2);
    const blob = new Blob([agentData], { type: 'application/json' });
    const url = URL.createObjectURL(blob);

    const a = document.createElement('a');
    a.href = url;
    a.download = `${currentAgent.name.replace(/\s+/g, '_').toLowerCase()}_agent.json`;
    document.body.appendChild(a);
    a.click();
    document.body.removeChild(a);
    URL.revokeObjectURL(url);
  };

  return (
    <div className="agent-configuration">
      <Card title={<Title level={4}>Agent Configuration</Title>}>
        <div className="agent-actions">
          <Button
            type="primary"
            icon={<PlusOutlined />}
            onClick={handleCreateAgent}
          >
            Create New Agent
          </Button>

          <Button
            icon={<ExportOutlined />}
            onClick={handleExportAgent}
            disabled={!currentAgent}
          >
            Export
          </Button>

          {/* Import is a placeholder: no handler is wired up yet */}
          <Button icon={<ImportOutlined />}>
            Import
          </Button>
        </div>

        <div className="agent-selector">
          <Text strong>Select Agent:</Text>
          <Select
            style={{ width: 300 }}
            onChange={handleAgentChange}
            value={currentAgent?.id}
            loading={loading}
          >
            {agents.map(agent => (
              <Option key={agent.id} value={agent.id}>
                {agent.name} - {agent.description}
              </Option>
            ))}
          </Select>
        </div>

        <Form
          form={form}
          layout="vertical"
          onFinish={handleSaveAgent}
          className="agent-form"
        >
          <Tabs defaultActiveKey="basic">
            <TabPane tab="Basic Information" key="basic">
              <Form.Item
                name="name"
                label="Agent Name"
                rules={[{ required: true, message: 'Please enter an agent name' }]}
              >
                <Input placeholder="Agent name" />
              </Form.Item>

              <Form.Item
                name="description"
                label="Description"
                rules={[{ required: true, message: 'Please enter a description' }]}
              >
                <Input placeholder="Brief description of this agent's purpose" />
              </Form.Item>

              <Form.Item
                name="system_prompt"
                label="System Instructions"
                rules={[{ required: true, message: 'Please enter system instructions' }]}
              >
                <TextArea
                  placeholder="Instructions that define the agent's behavior"
                  autoSize={{ minRows: 4, maxRows: 8 }}
                />
              </Form.Item>
            </TabPane>

            <TabPane tab="Capabilities" key="capabilities">
              <Form.Item name="capabilities" label="Agent Capabilities">
                <Checkbox.Group>
                  <div className="capabilities-grid">
                    <Checkbox value="knowledge_retrieval">Knowledge Retrieval</Checkbox>
                    <Checkbox value="web_search">Web Search</Checkbox>
                    <Checkbox value="memory">Long-term Memory</Checkbox>
                    <Checkbox value="calendar">Calendar Awareness</Checkbox>
                    <Checkbox value="code_execution">Code Execution</Checkbox>
                    <Checkbox value="data_visualization">Data Visualization</Checkbox>
                    <Checkbox value="file_operations">File Operations</Checkbox>
                    <Checkbox value="email">Email Integration</Checkbox>
                  </div>
                </Checkbox.Group>
              </Form.Item>

              <Form.Item name="preferred_models" label="Preferred Models">
                <Select mode="multiple" placeholder="Select preferred models">
                  <Option value="openai:gpt-4">OpenAI: GPT-4</Option>
                  <Option value="openai:gpt-3.5-turbo">OpenAI: GPT-3.5 Turbo</Option>
                  <Option value="ollama:llama2">Ollama: Llama2</Option>
                  <Option value="ollama:mistral">Ollama: Mistral</Option>
                  <Option value="ollama:codellama">Ollama: CodeLlama</Option>
                </Select>
              </Form.Item>
            </TabPane>

            <TabPane tab="Advanced" key="advanced">
              <Form.Item name="tool_configuration" label="Tool Configuration">
                <TextArea
                  placeholder="JSON configuration for tools (advanced)"
                  autoSize={{ minRows: 4, maxRows: 8 }}
                />
              </Form.Item>

              <Form.Item name="temperature" label="Temperature">
                <Select placeholder="Response creativity level">
                  <Option value="0.2">0.2 - More deterministic/factual</Option>
                  <Option value="0.5">0.5 - Balanced</Option>
                  <Option value="0.8">0.8 - More creative/varied</Option>
                </Select>
              </Form.Item>
            </TabPane>
          </Tabs>

          <Form.Item>
            <Button
              type="primary"
              htmlType="submit"
              icon={<SaveOutlined />}
              loading={loading}
            >
              {currentAgent ? 'Save Changes' : 'Create Agent'}
            </Button>
          </Form.Item>
        </Form>
      </Card>
    </div>
  );
};

export default AgentConfiguration;
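
Note that the form above submits `temperature` as a string (for example `"0.2"`) and `tool_configuration` as raw JSON text, so the backend has to normalize both before storing an agent. The following is a minimal sketch of that step. `normalize_agent_config` is a hypothetical helper, the default temperature of 0.5 and the accepted range of 0 to 2 are assumptions, and the capability names are taken from the checkbox values in the form.

```python
import json

# Capability identifiers offered by the form's checkbox group
KNOWN_CAPABILITIES = {
    "knowledge_retrieval", "web_search", "memory", "calendar",
    "code_execution", "data_visualization", "file_operations", "email",
}


def normalize_agent_config(values: dict) -> dict:
    """Validate submitted form values and convert them to stored types."""
    name = (values.get("name") or "").strip()
    if not name:
        raise ValueError("name is required")

    capabilities = list(values.get("capabilities") or [])
    unknown = set(capabilities) - KNOWN_CAPABILITIES
    if unknown:
        raise ValueError(f"unknown capabilities: {sorted(unknown)}")

    # The text area holds JSON text; an empty field means "no tool config"
    raw_tools = (values.get("tool_configuration") or "").strip()
    tools = json.loads(raw_tools) if raw_tools else {}
    if not isinstance(tools, dict):
        raise ValueError("tool_configuration must be a JSON object")

    # The select submits a string such as "0.2"; default is an assumption
    raw_temp = values.get("temperature")
    temperature = 0.5 if raw_temp in (None, "") else float(raw_temp)
    if not 0.0 <= temperature <= 2.0:
        raise ValueError("temperature must be between 0 and 2")

    return {
        "name": name,
        "description": (values.get("description") or "").strip(),
        "system_prompt": values.get("system_prompt") or "",
        "capabilities": capabilities,
        "preferred_models": list(values.get("preferred_models") or []),
        "tool_configuration": tools,
        "temperature": temperature,
    }
```

Malformed JSON in the tool configuration surfaces as `json.JSONDecodeError`, which is a subclass of `ValueError`, so a caller can map all validation failures to a single 4xx response.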

User Interaction Flows

New User Onboarding Flow

text
┌────────────────┐     ┌────────────────┐     ┌────────────────┐
│                │     │                │     │                │
│ Welcome Screen │────▶│ Initial Setup  │────▶│ API Key Setup  │
│                │     │                │     │                │
└────────────────┘     └────────────────┘     └───────┬────────┘
                                                      │
┌────────────────┐     ┌────────────────┐     ┌───────▼────────┐
│                │     │                │     │                │
│ First Chat     │◀────│ Ollama Setup   │◀────│ Model Download │
│                │     │                │     │                │
└────────────────┘     └────────────────┘     └────────────────┘

Task-Based User Flow Example

text
┌────────────────┐     ┌────────────────┐     ┌────────────────┐
│                │     │                │     │                │
│ Start Chat     │────▶│ Select Research│────▶│ Enter Research │
│                │     │ Agent          │     │ Query          │
└────────────────┘     └────────────────┘     └───────┬────────┘
                                                      │
┌────────────────┐     ┌────────────────┐     ┌───────▼────────┐
│                │     │                │     │                │
│ Save Results   │◀────│ Refine Query   │◀────│ View Response  │
│                │     │                │     │ (Using OpenAI) │
└────────────────┘     └────────────────┘     └────────────────┘

Advanced Settings Flow

text
┌────────────────┐     ┌────────────────┐     ┌────────────────┐
│                │     │                │     │                │
│ Chat Screen    │────▶│ Settings Menu  │────▶│ Model Settings │
│                │     │                │     │                │
└────────────────┘     └────────────────┘     └───────┬────────┘
                                                      │
┌────────────────┐     ┌────────────────┐     ┌───────▼────────┐
│                │     │                │     │                │
│ Return to      │◀────│ Save Settings  │◀────│ Agent Settings │
│ Chat           │     │                │     │                │
└────────────────┘     └────────────────┘     └────────────────┘

Implementation Recommendations

  1. Responsive Design: Ensure the web interface is mobile-friendly using responsive design principles
  2. Accessibility: Implement proper ARIA attributes and keyboard navigation for accessibility
  3. Progressive Enhancement: Build with a progressive enhancement approach where core functionality works without JavaScript
  4. State Management: Use context API or Redux for global state in more complex implementations
  5. Offline Support: Consider adding service workers for offline functionality in the web interface
  6. CLI Shortcuts: Implement tab completion and command history in the CLI for improved usability

Conclusion

The proposed user interface designs for the MCP system provide a balance between simplicity and power, enabling users to leverage the hybrid OpenAI-Ollama architecture effectively. The CLI offers a lightweight, scriptable interface for technical users and automation scenarios, while the web interface provides a rich, interactive experience for broader adoption.

Both interfaces expose the key capabilities of the system:

  1. Intelligent Model Routing: Users can leverage automatic model selection or manually choose specific models
  2. Agent Specialization: Configurable agents enable task-specific optimization
  3. Privacy Controls: Explicit options for privacy-sensitive content
  4. Performance Analytics: Visibility into system usage, costs, and efficiency

These interfaces serve as the critical touchpoint between users and the sophisticated underlying architecture, making complex AI capabilities accessible and manageable.

Optimization and Deployment Strategies for OpenAI-Ollama Hybrid AI System

Strategic Optimization Framework

The integration of cloud-based and local inference capabilities within a unified architecture presents unique opportunities for optimization across multiple dimensions. This document outlines comprehensive strategies for enhancing performance, reducing operational costs, and improving response accuracy, followed by detailed deployment methodologies for both local and cloud environments.

Performance Optimization Strategies

1. Query Routing Optimization

python
1# app/services/routing_optimizer.py
2import logging
3import numpy as np
4from typing import Dict, List, Any, Optional
5from app.config import settings
6
7logger = logging.getLogger(__name__)
8
9class RoutingOptimizer:
10 """Optimizes routing decisions based on historical performance data."""
11
12 def __init__(self, cache_size: int = 1000):
13 self.performance_history = {}
14 self.cache_size = cache_size
15 self.learning_rate = 0.05
16
17 # Baseline thresholds
18 self.complexity_threshold = settings.COMPLEXITY_THRESHOLD
19 self.token_threshold = 800 # Approximate tokens before preferring cloud
20 self.latency_requirement = 2.0 # Seconds
21
22 # Performance weights
23 self.weights = {
24 "complexity": 0.4,
25 "token_count": 0.2,
26 "privacy_score": 0.3,
27 "tool_requirement": 0.1
28 }
29
30 def update_performance_metrics(self,
31 provider: str,
32 model: str,
33 query_complexity: float,
34 token_count: int,
35 response_time: float,
36 success: bool) -> None:
37 """Update performance metrics based on actual results."""
38 model_key = f"{provider}:{model}"
39
40 if model_key not in self.performance_history:
41 self.performance_history[model_key] = {
42 "queries": 0,
43 "avg_response_time": 0,
44 "success_rate": 0,
45 "complexity_performance": {} # Maps complexity ranges to success/time
46 }
47
48 metrics = self.performance_history[model_key]
49
50 # Update metrics with exponential moving average
51 metrics["queries"] += 1
52 metrics["avg_response_time"] = (
53 (1 - self.learning_rate) * metrics["avg_response_time"] +
54 self.learning_rate * response_time
55 )
56
57 # Update success rate
58 old_success_rate = metrics["success_rate"]
59 queries = metrics["queries"]
60 metrics["success_rate"] = (old_success_rate * (queries - 1) + (1 if success else 0)) / queries
61
62 # Update complexity-specific performance
63 complexity_bin = round(query_complexity * 10) / 10 # Round to nearest 0.1
64
65 if complexity_bin not in metrics["complexity_performance"]:
66 metrics["complexity_performance"][complexity_bin] = {
67 "count": 0,
68 "avg_time": 0,
69 "success_rate": 0
70 }
71
72 bin_metrics = metrics["complexity_performance"][complexity_bin]
73 bin_metrics["count"] += 1
74 bin_metrics["avg_time"] = (
75 (bin_metrics["count"] - 1) * bin_metrics["avg_time"] + response_time
76 ) / bin_metrics["count"]
77
78 bin_metrics["success_rate"] = (
79 (bin_metrics["count"] - 1) * bin_metrics["success_rate"] + (1 if success else 0)
80 ) / bin_metrics["count"]
81
82 # Prune cache if needed
83 if len(self.performance_history) > self.cache_size:
84 # Remove least used models
85 sorted_models = sorted(
86 self.performance_history.items(),
87 key=lambda x: x[1]["queries"]
88 )
89 for i in range(len(self.performance_history) - self.cache_size):
90 if i < len(sorted_models):
91 del self.performance_history[sorted_models[i][0]]
92
93 def optimize_thresholds(self) -> None:
94 """Periodically optimize routing thresholds based on collected metrics."""
95 if not self.performance_history:
96 return
97
98 openai_models = [k for k in self.performance_history if k.startswith("openai:")]
99 ollama_models = [k for k in self.performance_history if k.startswith("ollama:")]
100
101 if not openai_models or not ollama_models:
102 return # Need data from both providers
103
104 # Calculate average performance metrics for each provider
105 openai_avg_time = np.mean([
106 self.performance_history[model]["avg_response_time"]
107 for model in openai_models
108 ])
109 ollama_avg_time = np.mean([
110 self.performance_history[model]["avg_response_time"]
111 for model in ollama_models
112 ])
113
114 # Find optimal complexity threshold by analyzing where Ollama begins to struggle
115 complexity_success_rates = {}
116
117 for model in ollama_models:
118 for complexity, metrics in self.performance_history[model]["complexity_performance"].items():
119 if complexity not in complexity_success_rates:
120 complexity_success_rates[complexity] = []
121 complexity_success_rates[complexity].append(metrics["success_rate"])
122
        # Find the complexity level where Ollama's success rate drops significantly
        optimal_threshold = self.complexity_threshold  # Default: keep the current value

        if complexity_success_rates:
            complexities = sorted(complexity_success_rates.keys())
            avg_success_rates = [
                np.mean(complexity_success_rates[c]) for c in complexities
            ]

            # Find the first major drop in success rate between adjacent bins.
            # An explicit flag avoids comparing floats to detect "nothing found".
            found_drop = False
            for i in range(1, len(complexities)):
                if (avg_success_rates[i-1] - avg_success_rates[i]) > 0.15:  # 15-point drop
                    optimal_threshold = complexities[i-1]
                    found_drop = True
                    break

            # If there is no clear drop, use the first bin that falls below 85%
            if not found_drop:
                for c, rate in zip(complexities, avg_success_rates):
                    if rate < 0.85:
                        optimal_threshold = c
                        break
144
145 # Update thresholds (with dampening to avoid oscillation)
146 self.complexity_threshold = (
147 0.8 * self.complexity_threshold +
148 0.2 * optimal_threshold
149 )
150
151 # Update latency requirements based on current performance
152 self.latency_requirement = max(1.0, min(ollama_avg_time * 1.2, 5.0))
153
154 logger.info(f"Optimized routing thresholds: complexity={self.complexity_threshold:.2f}, latency={self.latency_requirement:.2f}s")
155
156 def get_optimal_provider(self,
157 query_complexity: float,
158 privacy_score: float,
159 estimated_tokens: int,
160 requires_tools: bool) -> str:
161 """Get the optimal provider based on current metrics and query characteristics."""
162 # Calculate weighted score for routing decision
163 openai_score = 0
164 ollama_score = 0
165
166 # Complexity factor
167 if query_complexity > self.complexity_threshold:
168 openai_score += self.weights["complexity"]
169 else:
170 ollama_score += self.weights["complexity"]
171
172 # Token count factor
173 if estimated_tokens > self.token_threshold:
174 openai_score += self.weights["token_count"]
175 else:
176 ollama_score += self.weights["token_count"]
177
178 # Privacy factor (higher privacy score means more sensitive)
179 if privacy_score > 0.5:
180 ollama_score += self.weights["privacy_score"]
181 else:
182 # Split proportionally
183 ollama_privacy = self.weights["privacy_score"] * privacy_score * 2
184 openai_privacy = self.weights["privacy_score"] * (1 - privacy_score * 2)
185 ollama_score += ollama_privacy
186 openai_score += openai_privacy
187
188 # Tool requirements factor
189 if requires_tools:
190 openai_score += self.weights["tool_requirement"]
191
192 # Return the provider with higher score
193 return "openai" if openai_score > ollama_score else "ollama"
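
The routing decision above is a plain weighted vote, which is easier to follow with numbers plugged in. The sketch below restates the same rules as a standalone function. The weight values, `COMPLEXITY_THRESHOLD`, and `TOKEN_THRESHOLD` are assumptions chosen for illustration (the class reads them from its own state), and `score_providers` and `pick` are hypothetical helper names.

```python
from typing import Dict

# Hypothetical weights and thresholds, chosen only for this illustration.
WEIGHTS: Dict[str, float] = {
    "complexity": 0.4,
    "token_count": 0.2,
    "privacy_score": 0.3,
    "tool_requirement": 0.1,
}
COMPLEXITY_THRESHOLD = 0.7
TOKEN_THRESHOLD = 2000


def score_providers(query_complexity: float, privacy_score: float,
                    estimated_tokens: int, requires_tools: bool) -> Dict[str, float]:
    """Return per-provider scores using the same rules as get_optimal_provider."""
    scores = {"openai": 0.0, "ollama": 0.0}

    # Complexity and token count each award their full weight to one side.
    complexity_winner = "openai" if query_complexity > COMPLEXITY_THRESHOLD else "ollama"
    scores[complexity_winner] += WEIGHTS["complexity"]
    token_winner = "openai" if estimated_tokens > TOKEN_THRESHOLD else "ollama"
    scores[token_winner] += WEIGHTS["token_count"]

    # Privacy: above 0.5 the whole weight goes local, otherwise it is split.
    if privacy_score > 0.5:
        scores["ollama"] += WEIGHTS["privacy_score"]
    else:
        scores["ollama"] += WEIGHTS["privacy_score"] * privacy_score * 2
        scores["openai"] += WEIGHTS["privacy_score"] * (1 - privacy_score * 2)

    # Tool use only ever adds to the OpenAI side.
    if requires_tools:
        scores["openai"] += WEIGHTS["tool_requirement"]
    return scores


def pick(scores: Dict[str, float]) -> str:
    """Ties go to Ollama, matching the strict '>' comparison in the original."""
    return "openai" if scores["openai"] > scores["ollama"] else "ollama"


private_simple = score_providers(0.3, 0.9, 500, False)
public_complex = score_providers(0.9, 0.0, 5000, True)
print(pick(private_simple), private_simple)
print(pick(public_complex), public_complex)
```

Note that a tie resolves to the local provider, so with these rules Ollama is the default whenever the evidence is balanced.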

2. Response Caching with Semantic Search

python
1# app/services/cache_service.py
2import time
3import hashlib
4import json
5from typing import Dict, List, Any, Optional, Tuple
6import numpy as np
7from scipy.spatial.distance import cosine
8import aioredis
9
10from app.config import settings
11from app.services.embedding_service import EmbeddingService
12
13class SemanticCache:
14 """Intelligent caching system using semantic similarity."""
15
16 def __init__(self, embedding_service: EmbeddingService, ttl: int = 3600):
17 self.embedding_service = embedding_service
18 self.redis = None
19 self.ttl = ttl
20 self.similarity_threshold = 0.92 # Threshold for semantic similarity
21 self.exact_cache_enabled = True
22 self.semantic_cache_enabled = True
23
24 async def initialize(self):
25 """Initialize Redis connection."""
26 self.redis = await aioredis.create_redis_pool(settings.REDIS_URL)
27
28 async def close(self):
29 """Close Redis connection."""
30 if self.redis:
31 self.redis.close()
32 await self.redis.wait_closed()
33
34 def _get_exact_cache_key(self, messages: List[Dict], provider: str, model: str) -> str:
35 """Generate an exact cache key from request parameters."""
36 # Normalize the request to ensure consistent keys
37 normalized = {
38 "messages": messages,
39 "provider": provider,
40 "model": model
41 }
42 serialized = json.dumps(normalized, sort_keys=True)
43 return f"exact:{hashlib.md5(serialized.encode()).hexdigest()}"
44
45 async def _get_embedding_key(self, text: str) -> str:
46 """Get the embedding key for a text string."""
47 return f"emb:{hashlib.md5(text.encode()).hexdigest()}"
48
49 async def _store_embedding(self, text: str, embedding: List[float]) -> None:
50 """Store an embedding in Redis."""
51 key = await self._get_embedding_key(text)
52 await self.redis.set(key, json.dumps(embedding), expire=self.ttl)
53
54 async def _get_embedding(self, text: str) -> Optional[List[float]]:
55 """Get an embedding from Redis or compute it if not found."""
56 key = await self._get_embedding_key(text)
57 cached = await self.redis.get(key)
58
59 if cached:
60 return json.loads(cached)
61
62 # Generate new embedding
63 embedding = await self.embedding_service.get_embedding(text)
64 if embedding:
65 await self._store_embedding(text, embedding)
66
67 return embedding
68
69 async def _compute_similarity(self, embedding1: List[float], embedding2: List[float]) -> float:
70 """Compute cosine similarity between embeddings."""
71 return 1 - cosine(embedding1, embedding2)
72
73 async def get(self, messages: List[Dict], provider: str, model: str) -> Optional[Dict]:
74 """Get a cached response if available."""
75 if not self.redis:
76 return None
77
        # Try exact match first
        if self.exact_cache_enabled:
            exact_key = self._get_exact_cache_key(messages, provider, model)
            cached = await self.redis.get(exact_key)
            if cached:
                # Record the hit so get_stats() can report exact-match hits
                await self.redis.incr("stats:exact_cache_hits")
                return json.loads(cached)
84
85 # Try semantic search if enabled
86 if self.semantic_cache_enabled:
87 # Extract query text (last user message)
88 query_text = None
89 for msg in reversed(messages):
90 if msg.get("role") == "user" and msg.get("content"):
91 query_text = msg["content"]
92 break
93
94 if not query_text:
95 return None
96
97 # Get embedding for query
98 query_embedding = await self._get_embedding(query_text)
99 if not query_embedding:
100 return None
101
102 # Get all semantic cache keys
103 semantic_keys = await self.redis.keys("semantic:*")
104 if not semantic_keys:
105 return None
106
107 # Find most similar cached query
108 best_match = None
109 best_similarity = 0
110
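            for key in semantic_keys:
                # Keys come back as bytes unless the connection sets an encoding
                if isinstance(key, bytes):
                    key = key.decode()

                # The "semantic:*" pattern also matches the ":meta" companion keys
                if key.endswith(":meta"):
                    continue

                # Get metadata
                meta_key = f"{key}:meta"
                meta_data = await self.redis.get(meta_key)
                if not meta_data:
                    continue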
117
118 meta = json.loads(meta_data)
119 cached_embedding = meta.get("embedding")
120
121 if not cached_embedding:
122 continue
123
124 # Check provider/model compatibility
125 if (provider != "auto" and meta.get("provider") != provider) or \
126 (model and meta.get("model") != model):
127 continue
128
129 # Compute similarity
130 similarity = await self._compute_similarity(query_embedding, cached_embedding)
131
132 if similarity > self.similarity_threshold and similarity > best_similarity:
133 best_match = key
134 best_similarity = similarity
135
136 if best_match:
137 cached = await self.redis.get(best_match)
138 if cached:
139 # Record cache hit analytics
140 await self.redis.incr("stats:semantic_cache_hits")
141 return json.loads(cached)
142
143 # Record cache miss
144 await self.redis.incr("stats:cache_misses")
145 return None
146
147 async def set(self, messages: List[Dict], provider: str, model: str, response: Dict) -> None:
148 """Set a response in the cache."""
149 if not self.redis:
150 return
151
152 # Set exact match cache
153 if self.exact_cache_enabled:
154 exact_key = self._get_exact_cache_key(messages, provider, model)
155 await self.redis.set(exact_key, json.dumps(response), expire=self.ttl)
156
157 # Set semantic cache
158 if self.semantic_cache_enabled:
159 # Extract query text (last user message)
160 query_text = None
161 for msg in reversed(messages):
162 if msg.get("role") == "user" and msg.get("content"):
163 query_text = msg["content"]
164 break
165
166 if not query_text:
167 return
168
169 # Get embedding for query
170 query_embedding = await self._get_embedding(query_text)
171 if not query_embedding:
172 return
173
174 # Generate semantic key
175 semantic_key = f"semantic:{time.time()}:{hashlib.md5(query_text.encode()).hexdigest()}"
176
177 # Store response
178 await self.redis.set(semantic_key, json.dumps(response), expire=self.ttl)
179
180 # Store metadata (for similarity search)
181 meta_data = {
182 "query": query_text,
183 "embedding": query_embedding,
184 "provider": response.get("provider", provider),
185 "model": response.get("model", model),
186 "timestamp": time.time()
187 }
188
189 await self.redis.set(f"{semantic_key}:meta", json.dumps(meta_data), expire=self.ttl)
190
    async def get_stats(self) -> Dict[str, float]:
        """Get cache statistics."""
        if not self.redis:
            # Same keys as the populated result, so callers need no special case
            return {"exact_hits": 0, "semantic_hits": 0, "total_hits": 0, "misses": 0, "hit_rate": 0}
195
196 exact_hits = int(await self.redis.get("stats:exact_cache_hits") or 0)
197 semantic_hits = int(await self.redis.get("stats:semantic_cache_hits") or 0)
198 misses = int(await self.redis.get("stats:cache_misses") or 0)
199
200 return {
201 "exact_hits": exact_hits,
202 "semantic_hits": semantic_hits,
203 "total_hits": exact_hits + semantic_hits,
204 "misses": misses,
205 "hit_rate": (exact_hits + semantic_hits) / (exact_hits + semantic_hits + misses) if (exact_hits + semantic_hits + misses) > 0 else 0
206 }
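
The semantic lookup in `get` reduces to "find the cached embedding with the highest cosine similarity above a threshold". The sketch below isolates that step with an in-memory list in place of Redis and pure-Python math in place of SciPy. The function names, the toy three-dimensional embeddings, and the cache keys are all invented for illustration; real embeddings have hundreds of dimensions.

```python
import math
from typing import List, Optional, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 when either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def best_semantic_match(query_embedding: Sequence[float],
                        cached: List[Tuple[str, Sequence[float]]],
                        threshold: float = 0.92) -> Optional[Tuple[str, float]]:
    """Return (key, similarity) for the closest entry above threshold, else None."""
    best_key = None
    best_similarity = threshold
    for key, embedding in cached:
        similarity = cosine_similarity(query_embedding, embedding)
        if similarity > best_similarity:
            best_key, best_similarity = key, similarity
    return (best_key, best_similarity) if best_key is not None else None


# Toy cache entries with made-up keys and embeddings.
cached_entries = [
    ("semantic:capital-of-france", [0.9, 0.1, 0.0]),
    ("semantic:python-sorting", [0.0, 0.2, 0.9]),
]

hit = best_semantic_match([1.0, 0.1, 0.0], cached_entries)   # close to the first entry
miss = best_semantic_match([0.5, 0.5, 0.5], cached_entries)  # close to neither
print(hit, miss)
```

Because every lookup scans all cached embeddings, the cost grows linearly with cache size; a vector index would be the usual next step once the cache holds more than a few thousand entries.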

3. Parallel Query Processing

python
1# app/services/parallel_processor.py
2import asyncio
3from typing import List, Dict, Any, Optional, Tuple
4import logging
5import time
6
7from app.services.provider_service import ProviderService
8from app.config import settings
9
10logger = logging.getLogger(__name__)
11
12class ParallelProcessor:
13 """Processes complex queries by decomposing and running in parallel."""
14
15 def __init__(self, provider_service: ProviderService):
16 self.provider_service = provider_service
17 # Threshold for when to use parallel processing
18 self.complexity_threshold = 0.8
19 self.parallel_enabled = settings.ENABLE_PARALLEL_PROCESSING
20
21 async def should_process_in_parallel(self, messages: List[Dict]) -> bool:
22 """Determine if a query should be processed in parallel."""
23 if not self.parallel_enabled:
24 return False
25
26 # Get the last user message
27 user_message = None
28 for msg in reversed(messages):
29 if msg.get("role") == "user":
30 user_message = msg.get("content", "")
31 break
32
33 if not user_message:
34 return False
35
36 # Check message length
37 if len(user_message.split()) < 50:
38 return False
39
40 # Check for complexity indicators
41 complexity_markers = [
42 "compare", "analyze", "different perspectives", "pros and cons",
43 "multiple aspects", "detail", "comprehensive", "multifaceted"
44 ]
45
46 marker_count = sum(1 for marker in complexity_markers if marker in user_message.lower())
47
48 # Check for multiple questions
49 question_count = user_message.count("?")
50
51 # Calculate complexity score
52 complexity = (marker_count * 0.15) + (question_count * 0.2) + (len(user_message.split()) / 500)
53
54 return complexity > self.complexity_threshold
55
    async def decompose_query(self, query: str) -> List[str]:
        """Decompose a complex query into simpler sub-queries."""
        # Use the provider service to generate the decomposition.
        # JSON mode ("json_object") is designed to return a JSON object rather than
        # a bare array, so the prompt asks for an object with a "sub_questions" key.
        decompose_messages = [
            {"role": "system", "content": """
            You are a query decomposition specialist. Your job is to break down complex questions into
            simpler, independent sub-questions that can be answered separately and then combined.

            Return a JSON object with a single key "sub_questions" whose value is an array of strings,
            where each string is a sub-question.
            For example: {"sub_questions": ["What are the basics of quantum computing?", "How does quantum computing differ from classical computing?"]}

            Keep the total number of sub-questions between 2 and 5.
            """},
            {"role": "user", "content": f"Decompose this complex query into simpler sub-questions: {query}"}
        ]

        try:
            response = await self.provider_service.generate_completion(
                messages=decompose_messages,
                provider="openai",  # Use OpenAI for decomposition
                model="gpt-3.5-turbo",  # Use a faster model for this task
                response_format={"type": "json_object"}
            )

            if response and response.get("message", {}).get("content"):
                import json
                result = json.loads(response["message"]["content"])
                # Accept either the requested object or a bare array
                if isinstance(result, dict):
                    result = result.get("sub_questions")
                if isinstance(result, list) and result and all(isinstance(item, str) for item in result):
                    return result

            # Fallback to simple decomposition
            return [query]

        except Exception as e:
            logger.error(f"Error decomposing query: {str(e)}")
            # Fallback to simple decomposition
            return [query]
95
96 async def process_sub_query(self, sub_query: str, provider: str, model: str) -> Dict[str, Any]:
97 """Process a single sub-query."""
98 messages = [{"role": "user", "content": sub_query}]
99
100 start_time = time.time()
101 response = await self.provider_service.generate_completion(
102 messages=messages,
103 provider=provider,
104 model=model
105 )
106 duration = time.time() - start_time
107
108 return {
109 "query": sub_query,
110 "response": response,
111 "content": response.get("message", {}).get("content", ""),
112 "duration": duration
113 }
114
115 async def synthesize_responses(self,
116 original_query: str,
117 sub_results: List[Dict]) -> str:
118 """Synthesize the responses from sub-queries into a cohesive answer."""
        # Build the findings text first: a backslash inside an f-string
        # expression is a SyntaxError before Python 3.12
        findings = "".join(
            f"Sub-question: {r['query']}\nAnswer: {r['content']}\n\n"
            for r in sub_results
        )

        synthesize_prompt = f"""
        Original question: {original_query}

        I've broken this question down into parts and found the following information:

        {findings}

        Please synthesize this information into a cohesive, comprehensive answer to the original question.
        Ensure the response is well-structured and flows naturally as if it were answering the original
        question directly. Maintain a consistent tone throughout.
        """
133
134 messages = [
135 {"role": "system", "content": "You are an expert at synthesizing information from multiple sources into cohesive, comprehensive answers."},
136 {"role": "user", "content": synthesize_prompt}
137 ]
138
139 try:
140 response = await self.provider_service.generate_completion(
141 messages=messages,
142 provider="openai", # Use OpenAI for synthesis
143 model="gpt-4" # Use a more capable model for synthesis
144 )
145
146 if response and response.get("message", {}).get("content"):
147 return response["message"]["content"]
148
149 # Fallback
150 return "\n\n".join([r['content'] for r in sub_results])
151
152 except Exception as e:
153 logger.error(f"Error synthesizing responses: {str(e)}")
154 # Fallback to simple concatenation
155 return "\n\n".join([f"Regarding '{r['query']}':\n{r['content']}" for r in sub_results])
156
    async def process_in_parallel(self,
                                  messages: List[Dict],
                                  provider: str = "auto",
                                  model: Optional[str] = None) -> Dict[str, Any]:
161 """Process a complex query by breaking it down and processing in parallel."""
162 # Get the last user message
163 user_message = None
164 for msg in reversed(messages):
165 if msg.get("role") == "user":
166 user_message = msg.get("content", "")
167 break
168
169 if not user_message:
170 # Fallback to regular processing
171 return await self.provider_service.generate_completion(
172 messages=messages,
173 provider=provider,
174 model=model
175 )
176
177 # Decompose the query
178 sub_queries = await self.decompose_query(user_message)
179
180 if len(sub_queries) <= 1:
181 # Not complex enough to benefit from parallel processing
182 return await self.provider_service.generate_completion(
183 messages=messages,
184 provider=provider,
185 model=model
186 )
187
188 # Process sub-queries in parallel
189 tasks = [
190 self.process_sub_query(query, provider, model)
191 for query in sub_queries
192 ]
193
194 sub_results = await asyncio.gather(*tasks)
195
196 # Synthesize the results
197 final_content = await self.synthesize_responses(user_message, sub_results)
198
199 # Calculate aggregated metrics
200 total_duration = sum(result["duration"] for result in sub_results)
201 providers_used = [result["response"].get("provider") for result in sub_results
202 if result["response"].get("provider")]
203 models_used = [result["response"].get("model") for result in sub_results
204 if result["response"].get("model")]
205
206 # Construct a response in the same format as provider_service.generate_completion
207 return {
208 "id": f"parallel_{int(time.time())}",
209 "object": "chat.completion",
210 "created": int(time.time()),
211 "model": ", ".join(set(models_used)) if models_used else model,
212 "provider": ", ".join(set(providers_used)) if providers_used else provider,
213 "usage": {
214 "prompt_tokens": sum(result["response"].get("usage", {}).get("prompt_tokens", 0)
215 for result in sub_results),
216 "completion_tokens": sum(result["response"].get("usage", {}).get("completion_tokens", 0)
217 for result in sub_results),
218 "total_tokens": sum(result["response"].get("usage", {}).get("total_tokens", 0)
219 for result in sub_results)
220 },
221 "message": {
222 "role": "assistant",
223 "content": final_content
224 },
225 "parallel_processing": {
226 "sub_queries": len(sub_queries),
227 "total_duration": total_duration,
228 "max_duration": max(result["duration"] for result in sub_results),
229 "processing_efficiency": 1 - (max(result["duration"] for result in sub_results) / total_duration)
230 if total_duration > 0 else 0
231 }
232 }
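
The `processing_efficiency` figure above compares the longest sub-query against the sum of all sub-query durations. The sketch below shows where that number comes from by fanning out three simulated calls with `asyncio.gather`. `fake_sub_query` and `run_parallel` are hypothetical names, and `asyncio.sleep` stands in for a real model call.

```python
import asyncio
import time
from typing import Any, Dict, List


async def fake_sub_query(sub_query: str, delay: float) -> Dict[str, Any]:
    """Simulate one sub-query; asyncio.sleep stands in for the model call."""
    start = time.perf_counter()
    await asyncio.sleep(delay)
    return {
        "query": sub_query,
        "content": f"answer to {sub_query}",
        "duration": time.perf_counter() - start,
    }


async def run_parallel(sub_queries: List[str], delay: float = 0.05) -> Dict[str, Any]:
    """Run all sub-queries concurrently and report the same metrics as above."""
    wall_start = time.perf_counter()
    # gather preserves input order, so results line up with sub_queries.
    results = await asyncio.gather(*(fake_sub_query(q, delay) for q in sub_queries))
    wall_time = time.perf_counter() - wall_start

    total_duration = sum(r["duration"] for r in results)
    max_duration = max(r["duration"] for r in results)
    efficiency = 1 - (max_duration / total_duration) if total_duration > 0 else 0
    return {
        "results": results,
        "wall_time": wall_time,
        "total_duration": total_duration,
        "processing_efficiency": efficiency,
    }


summary = asyncio.run(run_parallel(["q1", "q2", "q3"]))
print(f"wall={summary['wall_time']:.3f}s "
      f"sum={summary['total_duration']:.3f}s "
      f"efficiency={summary['processing_efficiency']:.2f}")
```

With three equally slow sub-queries the efficiency lands near two thirds: the wall-clock time is roughly one call, while sequential execution would have cost three. `total_duration` is therefore the sequential cost that was avoided, not the time the user waited.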

4. Dynamic Batching for High-Load Scenarios

python
1# app/services/batch_processor.py
2import asyncio
3from typing import List, Dict, Any, Optional, Callable, Awaitable
4import time
5import logging
6from collections import deque
7
8logger = logging.getLogger(__name__)
9
10class RequestBatcher:
11 """
12 Dynamically batches requests to optimize throughput under high load.
13 """
14
15 def __init__(self,
16 max_batch_size: int = 4,
17 max_wait_time: float = 0.1,
18 processor_fn: Optional[Callable] = None):
19 self.max_batch_size = max_batch_size
20 self.max_wait_time = max_wait_time
21 self.processor_fn = processor_fn
22 self.queue = deque()
23 self.batch_task = None
24 self.active = False
25 self.stats = {
26 "total_requests": 0,
27 "total_batches": 0,
28 "avg_batch_size": 0,
29 "max_queue_length": 0
30 }
31
32 async def start(self):
33 """Start the batch processor."""
34 if self.active:
35 return
36
37 self.active = True
38 self.batch_task = asyncio.create_task(self._batch_processor())
39 logger.info("Batch processor started")
40
41 async def stop(self):
42 """Stop the batch processor."""
43 if not self.active:
44 return
45
46 self.active = False
47 if self.batch_task:
48 try:
49 self.batch_task.cancel()
50 await self.batch_task
51 except asyncio.CancelledError:
52 pass
53
54 logger.info("Batch processor stopped")
55
56 async def _batch_processor(self):
57 """Background task to process batches."""
58 while self.active:
59 try:
60 # Process any batches in the queue
61 await self._process_next_batch()
62
63 # Wait a small amount of time before checking again
64 await asyncio.sleep(0.01)
65 except Exception as e:
66 logger.error(f"Error in batch processor: {str(e)}")
67 await asyncio.sleep(1) # Wait longer on error
68
69 async def _process_next_batch(self):
70 """Process the next batch from the queue."""
71 if not self.queue:
72 return
73
74 # Start timing from oldest request
75 oldest_request_time = self.queue[0][2]
76 current_time = time.time()
77
78 # Process if we have max batch size or max wait time elapsed
79 if len(self.queue) >= self.max_batch_size or \
80 (current_time - oldest_request_time) >= self.max_wait_time:
81
82 # Extract batch (up to max_batch_size)
83 batch_size = min(len(self.queue), self.max_batch_size)
84 batch = []
85
86 for _ in range(batch_size):
87 request, future, _ = self.queue.popleft()
88 batch.append((request, future))
89
90 # Update stats
91 self.stats["total_batches"] += 1
92 self.stats["avg_batch_size"] = ((self.stats["avg_batch_size"] * (self.stats["total_batches"] - 1)) + batch_size) / self.stats["total_batches"]
93
94 # Process batch
95 asyncio.create_task(self._process_batch(batch))
96
97 async def _process_batch(self, batch: List[tuple]):
98 """Process a batch of requests."""
99 if not self.processor_fn:
100 for _, future in batch:
101 if not future.done():
102 future.set_exception(ValueError("No processor function set"))
103 return
104
105 # Extract just the requests for processing
106 requests = [req for req, _ in batch]
107
108 try:
109 # Process the batch
110 results = await self.processor_fn(requests)
111
            # Match results to futures
            if results is not None and len(results) == len(batch):
                for (_, future), result in zip(batch, results):
                    if not future.done():
                        future.set_result(result)
            else:
                # Handle mismatch in results (including a processor that returned None)
                result_count = len(results) if results is not None else 0
                logger.error(f"Batch result count mismatch: {result_count} results for {len(batch)} requests")
                for _, future in batch:
                    if not future.done():
                        future.set_exception(ValueError("Batch processing error: result count mismatch"))
123
124 except Exception as e:
125 logger.error(f"Error processing batch: {str(e)}")
126 # Set exception for all futures in batch
127 for _, future in batch:
128 if not future.done():
129 future.set_exception(e)
130
131 async def submit(self, request: Any) -> Any:
132 """Submit a request for batched processing."""
133 self.stats["total_requests"] += 1
134
        # Create a future bound to the running event loop
        future = asyncio.get_running_loop().create_future()
137
138 # Add to queue with timestamp
139 self.queue.append((request, future, time.time()))
140
141 # Update max queue length stat
142 queue_length = len(self.queue)
143 if queue_length > self.stats["max_queue_length"]:
144 self.stats["max_queue_length"] = queue_length
145
146 # Return future
147 return await future
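
The batcher above flushes when either the batch is full or the oldest request has waited long enough, and it checks for that by polling every 10 ms. As a point of comparison, the sketch below implements the same flush rule with an `asyncio.Queue` and a deadline, so the worker sleeps until a request actually arrives. This is an alternative design offered under assumptions, not the implementation above: `batch_worker`, `submit`, and `upper_case_batch` are hypothetical names, and the processor simply upper-cases strings.

```python
import asyncio
from typing import Any, Awaitable, Callable, List


async def batch_worker(queue: asyncio.Queue,
                       processor_fn: Callable[[List[Any]], Awaitable[List[Any]]],
                       max_batch_size: int = 4,
                       max_wait_time: float = 0.05) -> None:
    """Collect requests until the batch is full or max_wait_time elapses."""
    loop = asyncio.get_running_loop()
    while True:
        # Block until at least one request arrives (no polling).
        batch = [await queue.get()]
        deadline = loop.time() + max_wait_time
        while len(batch) < max_batch_size:
            remaining = deadline - loop.time()
            if remaining <= 0:
                break
            try:
                batch.append(await asyncio.wait_for(queue.get(), timeout=remaining))
            except asyncio.TimeoutError:
                break
        try:
            results = await processor_fn([request for request, _ in batch])
            for (_, future), result in zip(batch, results):
                if not future.done():
                    future.set_result(result)
        except Exception as exc:
            # Fail every caller in the batch rather than leaving them waiting.
            for _, future in batch:
                if not future.done():
                    future.set_exception(exc)


async def submit(queue: asyncio.Queue, request: Any) -> Any:
    """Enqueue one request and wait for its individual result."""
    future = asyncio.get_running_loop().create_future()
    await queue.put((request, future))
    return await future


async def demo():
    batch_sizes: List[int] = []

    async def upper_case_batch(requests: List[str]) -> List[str]:
        batch_sizes.append(len(requests))
        return [r.upper() for r in requests]

    queue: asyncio.Queue = asyncio.Queue()
    worker = asyncio.create_task(batch_worker(queue, upper_case_batch))
    results = await asyncio.gather(*(submit(queue, w) for w in ["a", "b", "c", "d", "e"]))
    worker.cancel()
    try:
        await worker
    except asyncio.CancelledError:
        pass
    return results, batch_sizes


results, batch_sizes = asyncio.run(demo())
print(results, batch_sizes)
```

Each caller still receives only its own result, in submission order, while the processor sees the requests in groups of at most `max_batch_size`.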

5. Model-Specific Prompt Optimization

python
1# app/services/prompt_optimizer.py
2import logging
3from typing import List, Dict, Any, Optional
4import re
5
6logger = logging.getLogger(__name__)
7
8class PromptOptimizer:
9 """Optimizes prompts for specific models to improve response quality and reduce token usage."""
10
11 def __init__(self):
12 self.model_specific_templates = {
13 # OpenAI models
14 "gpt-4": {
15 "prefix": "", # GPT-4 doesn't need special prefixing
16 "suffix": "",
17 "instruction_format": "{instruction}"
18 },
19 "gpt-3.5-turbo": {
20 "prefix": "",
21 "suffix": "",
22 "instruction_format": "{instruction}"
23 },
24
25 # Ollama models - they benefit from more explicit formatting
26 "llama2": {
27 "prefix": "",
28 "suffix": "Think step-by-step and be thorough in your response.",
29 "instruction_format": "{instruction}"
30 },
31 "llama2:70b": {
32 "prefix": "",
33 "suffix": "",
34 "instruction_format": "{instruction}"
35 },
36 "mistral": {
37 "prefix": "",
38 "suffix": "Take a deep breath and work on this step-by-step.",
39 "instruction_format": "{instruction}"
40 },
41 "codellama": {
42 "prefix": "You are an expert programmer with years of experience. ",
43 "suffix": "Make sure your code is correct and efficient.",
44 "instruction_format": "Task: {instruction}"
45 },
46 "wizard-math": {
47 "prefix": "You are a mathematics expert. ",
48 "suffix": "Show your work step-by-step and explain your reasoning clearly.",
49 "instruction_format": "Problem: {instruction}"
50 }
51 }
52
53 # Default template to use when model not specifically defined
54 self.default_template = {
55 "prefix": "",
56 "suffix": "",
57 "instruction_format": "{instruction}"
58 }
59
60 # Task-specific optimizations
61 self.task_templates = {
62 "code_generation": {
63 "prefix": "You are an expert programmer. ",
64 "suffix": "Ensure your code is correct, efficient, and well-commented.",
65 "instruction_format": "Programming Task: {instruction}"
66 },
67 "creative_writing": {
68 "prefix": "You are a creative writer with excellent storytelling abilities. ",
69 "suffix": "",
70 "instruction_format": "Creative Writing Prompt: {instruction}"
71 },
72 "reasoning": {
73 "prefix": "You are a logical thinker with strong reasoning skills. ",
74 "suffix": "Think step-by-step and be precise in your analysis.",
75 "instruction_format": "Reasoning Task: {instruction}"
76 },
77 "math": {
78 "prefix": "You are a mathematics expert. ",
79 "suffix": "Show your work step-by-step with explanations.",
80 "instruction_format": "Math Problem: {instruction}"
81 }
82 }
83
84 def detect_task_type(self, message: str) -> Optional[str]:
85 """Detect the type of task from the message content."""
86 message_lower = message.lower()
87
        # Code detection patterns
        code_patterns = [
            r"write (a|an|the)?\s?(code|function|program|script|class|method)",
            r"implement (a|an|the)?\s?(algorithm|function|class|method)",
            r"debug (this|the)?\s?(code|function|program)",
            # Match language names only as whole words, so "go" does not fire on "good"
            r"(?<!\w)(js|javascript|python|java|c\+\+|go|rust|typescript)(?!\w)"
        ]
95
96 # Creative writing patterns
97 creative_patterns = [
98 r"write (a|an|the)?\s?(story|poem|essay|narrative|scene)",
99 r"create (a|an|the)?\s?(story|character|dialogue|setting)",
100 r"describe (a|an|the)?\s?(scene|character|setting|world)"
101 ]
102
103 # Math patterns
104 math_patterns = [
105 r"calculate",
106 r"solve (this|the)?\s?(equation|problem|expression)",
107 r"compute",
108 r"what is (the)?\s?(value|result|answer)",
109 r"find (the)?\s?(derivative|integral|product|sum|limit)"
110 ]
111
112 # Reasoning patterns
113 reasoning_patterns = [
114 r"analyze",
115 r"compare (and|&) contrast",
116 r"explain (why|how)",
117 r"what are (the)?\s?(pros|cons|advantages|disadvantages)",
118 r"evaluate"
119 ]
120
121 # Check each pattern set
122 for pattern in code_patterns:
123 if re.search(pattern, message_lower):
124 return "code_generation"
125
126 for pattern in creative_patterns:
127 if re.search(pattern, message_lower):
128 return "creative_writing"
129
130 for pattern in math_patterns:
131 if re.search(pattern, message_lower):
132 return "math"
133
134 for pattern in reasoning_patterns:
135 if re.search(pattern, message_lower):
136 return "reasoning"
137
138 return None
139
140 def optimize_system_prompt(self, original_prompt: str, model: str, task_type: Optional[str] = None) -> str:
141 """Optimize the system prompt for the specific model and task."""
142 # If no original prompt, return an appropriate default
143 if not original_prompt:
144 return "You are a helpful assistant. Provide accurate, detailed, and clear responses."
145
146 # Get model-specific template
147 template = self.model_specific_templates.get(model, self.default_template)
148
149 # If task type is provided, incorporate task-specific optimizations
150 if task_type and task_type in self.task_templates:
151 task_template = self.task_templates[task_type]
152
153 # Merge templates, with task template taking precedence for non-empty values
154 merged_template = {
155 "prefix": task_template["prefix"] if task_template["prefix"] else template["prefix"],
156 "suffix": task_template["suffix"] if task_template["suffix"] else template["suffix"],
157 "instruction_format": task_template["instruction_format"]
158 }
159
160 template = merged_template
161
162 # Apply template
163 optimized_prompt = f"{template['prefix']}{original_prompt}"
164
165 # Add suffix if it doesn't appear to already be present
166 if template["suffix"] and template["suffix"] not in optimized_prompt:
167 optimized_prompt += f" {template['suffix']}"
168
169 return optimized_prompt
170
171 def optimize_user_prompt(self, original_prompt: str, model: str, task_type: Optional[str] = None) -> str:
172 """Optimize the user prompt for the specific model and task."""
173 if not original_prompt:
174 return original_prompt
175
176 # Auto-detect task type if not provided
177 if not task_type:
178 task_type = self.detect_task_type(original_prompt)
179
180 # Get model-specific template
181 template = self.model_specific_templates.get(model, self.default_template)
182
183 # If task type is provided, incorporate task-specific optimizations
184 if task_type and task_type in self.task_templates:
185 task_template = self.task_templates[task_type]
186 # Use task instruction format if available
187 instruction_format = task_template["instruction_format"]
188 else:
189 instruction_format = template["instruction_format"]
190
191 # Apply instruction format if the prompt doesn't already look formatted
192 if "{instruction}" in instruction_format and not re.match(r"^(task|problem|prompt|question):", original_prompt.lower()):
193 formatted_prompt = instruction_format.replace("{instruction}", original_prompt)
194 return formatted_prompt
195
196 return original_prompt
197
198 def optimize_messages(self, messages: List[Dict[str, str]], model: str) -> List[Dict[str, str]]:
199 """Optimize all messages in a conversation for the specific model."""
200 if not messages:
201 return messages
202
203 # Try to detect task type from the user messages
204 task_type = None
205 for msg in messages:
206 if msg.get("role") == "user" and msg.get("content"):
207 detected_task = self.detect_task_type(msg["content"])
208 if detected_task:
209 task_type = detected_task
210 break
211
212 optimized = []
213
214 for msg in messages:
215 role = msg.get("role", "")
216 content = msg.get("content", "")
217
218 if role == "system" and content:
219 optimized_content = self.optimize_system_prompt(content, model, task_type)
220 optimized.append({"role": role, "content": optimized_content})
221 elif role == "user" and content:
222 optimized_content = self.optimize_user_prompt(content, model, task_type)
223 optimized.append({"role": role, "content": optimized_content})
224 else:
225 # Keep other messages unchanged
226 optimized.append(msg)
227
228 return optimized
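
The precedence rule in `optimize_system_prompt` is easy to misread: a task template overrides the model template only where the task value is non-empty. The sketch below reproduces that merge with two template entries copied from the class (the `mistral` model entry plus the `math` and `creative_writing` task entries). `merge_templates` and `build_system_prompt` are hypothetical standalone helpers, and the example prompts are invented.

```python
from typing import Dict, Optional

Template = Dict[str, str]

# Entries copied from the templates defined in PromptOptimizer.
MISTRAL_TEMPLATE: Template = {
    "prefix": "",
    "suffix": "Take a deep breath and work on this step-by-step.",
    "instruction_format": "{instruction}",
}
MATH_TASK: Template = {
    "prefix": "You are a mathematics expert. ",
    "suffix": "Show your work step-by-step with explanations.",
    "instruction_format": "Math Problem: {instruction}",
}
CREATIVE_TASK: Template = {
    "prefix": "You are a creative writer with excellent storytelling abilities. ",
    "suffix": "",
    "instruction_format": "Creative Writing Prompt: {instruction}",
}


def merge_templates(model_template: Template,
                    task_template: Optional[Template] = None) -> Template:
    """Task values win when non-empty; empty ones fall back to the model's."""
    if task_template is None:
        return dict(model_template)
    return {
        "prefix": task_template["prefix"] or model_template["prefix"],
        "suffix": task_template["suffix"] or model_template["suffix"],
        "instruction_format": task_template["instruction_format"],
    }


def build_system_prompt(original_prompt: str, template: Template) -> str:
    """Apply prefix, then append the suffix unless it is already present."""
    prompt = f"{template['prefix']}{original_prompt}"
    if template["suffix"] and template["suffix"] not in prompt:
        prompt += f" {template['suffix']}"
    return prompt


math_merged = merge_templates(MISTRAL_TEMPLATE, MATH_TASK)
creative_merged = merge_templates(MISTRAL_TEMPLATE, CREATIVE_TASK)

system_prompt = build_system_prompt("Answer concisely.", math_merged)
user_prompt = math_merged["instruction_format"].replace(
    "{instruction}", "Solve the equation 2x + 3 = 11."
)
print(system_prompt)
print(user_prompt)
print(creative_merged["suffix"])
```

One consequence worth noticing: because the `creative_writing` task has an empty suffix, a creative prompt sent to `mistral` still inherits the model's step-by-step suffix, which may or may not be what you want for fiction.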

Cost Reduction Strategies

1. Token Usage Optimization

python
1# app/services/token_optimizer.py
2import logging
3import re
4from typing import List, Dict, Any, Optional, Tuple
5import tiktoken
6import numpy as np
7
8logger = logging.getLogger(__name__)
9
10class TokenOptimizer:
11 """Optimizes token usage to reduce costs."""
12
13 def __init__(self):
14 # Load tokenizers once
15 try:
16 self.gpt3_tokenizer = tiktoken.encoding_for_model("gpt-3.5-turbo")
17 self.gpt4_tokenizer = tiktoken.encoding_for_model("gpt-4")
18 except Exception as e:
19 logger.warning(f"Could not load tokenizers: {str(e)}. Falling back to approximate counting.")
20 self.gpt3_tokenizer = None
21 self.gpt4_tokenizer = None
22
23 def count_tokens(self, text: str, model: str = "gpt-3.5-turbo") -> int:
24 """Count the number of tokens in a text string for a specific model."""
25 if not text:
26 return 0
27
28 # Use appropriate tokenizer if available
29 if model.startswith("gpt-4") and self.gpt4_tokenizer:
30 return len(self.gpt4_tokenizer.encode(text))
31 elif model.startswith("gpt-3") and self.gpt3_tokenizer:
32 return len(self.gpt3_tokenizer.encode(text))
33
34 # Fallback to approximation (~ 4 chars per token for English)
35 return len(text) // 4 + 1
36
37 def count_message_tokens(self, messages: List[Dict[str, str]], model: str = "gpt-3.5-turbo") -> int:
38 """Count tokens in a full message array."""
39 if not messages:
40 return 0
41
42 total = 0
43
        # Different models have different message formatting overheads
        if model.startswith(("gpt-3.5-turbo", "gpt-4")):
            # Per OpenAI's cookbook formula for chat message token counting
            # See: https://github.com/openai/openai-cookbook/blob/main/examples/How_to_count_tokens_with_tiktoken.ipynb
            for message in messages:
                total += 3  # Per-message overhead: <|start|>{role/name}\n{content}<|end|>\n
                for key, value in message.items():
                    if key == "name":  # A name costs one extra token
                        total += 1
                    # The cookbook formula encodes every value, including the role
                    if isinstance(value, str) and value:
                        total += self.count_tokens(value, model)

            total += 3  # Every reply is primed with <|start|>assistant<|message|>
73
74 else:
75 # Simple approach for other models
76 for message in messages:
77 content = message.get("content", "")
78 if content:
79 total += self.count_tokens(content, model)
80
81 return total
82
83 def truncate_messages(self,
84 messages: List[Dict[str, str]],
85 max_tokens: int,
86 model: str = "gpt-3.5-turbo",
87 preserve_system: bool = True,
88 preserve_last_n_exchanges: int = 2) -> List[Dict[str, str]]:
89 """Truncate conversation history to fit within token limit."""
90 if not messages:
91 return messages
92
93 # Clone messages to avoid modifying the original
94 messages = [m.copy() for m in messages]
95
96 current_tokens = self.count_message_tokens(messages, model)
97
98 # If already under the limit, return as is
99 if current_tokens <= max_tokens:
100 return messages
101
102 # Identify system and user/assistant pairs
103 system_messages = [m for m in messages if m.get("role") == "system"]
104 system_tokens = sum(self.count_tokens(m.get("content", ""), model) for m in system_messages)
105
106 # Extract exchanges (user followed by assistant message)
107 exchanges = []
108 current_exchange = []
109
110 for m in messages:
111 if m.get("role") == "system":
112 continue
113
114 current_exchange.append(m)
115
116 # If we have a user+assistant pair, add to exchanges and reset
117 if len(current_exchange) == 2 and current_exchange[0].get("role") == "user" and current_exchange[1].get("role") == "assistant":
118 exchanges.append(current_exchange)
119 current_exchange = []
120
121 # Add any remaining messages
122 if current_exchange:
123 exchanges.append(current_exchange)
124
125 # Calculate tokens needed for essential parts
126 essential_tokens = system_tokens if preserve_system else 0
127
128 # Add tokens for the last N exchanges
129 last_n_exchanges = exchanges[-preserve_last_n_exchanges:] if exchanges else []
130 last_n_tokens = sum(
131 self.count_tokens(m.get("content", ""), model)
132 for exchange in last_n_exchanges
133 for m in exchange
134 )
135
136 essential_tokens += last_n_tokens
137
138 # If essential parts already exceed the limit, we need more aggressive truncation
139 if essential_tokens > max_tokens:
140 logger.warning(f"Essential conversation parts exceed token limit: {essential_tokens} > {max_tokens}")
141
142 # Start by keeping system messages if requested
143 result = system_messages.copy() if preserve_system else []
144
145 # Add as many of the last exchanges as we can fit
146 remaining_tokens = max_tokens - sum(self.count_tokens(m.get("content", ""), model) for m in result)
147
148 for exchange in reversed(last_n_exchanges):
149 exchange_tokens = sum(self.count_tokens(m.get("content", ""), model) for m in exchange)
150
151 if exchange_tokens <= remaining_tokens:
152 result.extend(exchange)
153 remaining_tokens -= exchange_tokens
154 else:
155 # If we can't fit the whole exchange, try truncating the assistant response
156 if len(exchange) == 2:
157 user_msg = exchange[0]
158 assistant_msg = exchange[1] # Already a clone (see the copy at the top), so truncating in place keeps its position for the final sort
159
160 user_tokens = self.count_tokens(user_msg.get("content", ""), model)
161
162 if user_tokens < remaining_tokens:
163 # We can include the user message
164 result.append(user_msg)
165 remaining_tokens -= user_tokens
166
167 # Truncate the assistant message to fit
168 assistant_content = assistant_msg.get("content", "")
169 if assistant_content:
170 # Simple truncation - in a real system, you'd want more intelligent truncation
171 chars_to_keep = int(remaining_tokens * 4) # Approximate char count
172 truncated_content = assistant_content[:chars_to_keep] + "... [truncated]"
173 assistant_msg["content"] = truncated_content
174 result.append(assistant_msg)
175
176 break
177
178 # Resort the messages to maintain the correct order
179 result.sort(key=lambda m: messages.index(m) if m in messages else 999999)
180 return result
181
182 # If we get here, we can keep all essential parts and need to drop from the middle
183 result = system_messages.copy() if preserve_system else []
184 middle_exchanges = exchanges[:-preserve_last_n_exchanges] if len(exchanges) > preserve_last_n_exchanges else []
185
186 # Calculate how many tokens we can allocate to middle exchanges
187 remaining_tokens = max_tokens - essential_tokens
188
189 # Add exchanges from the middle, newest first, until we run out of tokens
190 for exchange in reversed(middle_exchanges):
191 exchange_tokens = sum(self.count_tokens(m.get("content", ""), model) for m in exchange)
192
193 if exchange_tokens <= remaining_tokens:
194 result.extend(exchange)
195 remaining_tokens -= exchange_tokens
196 else:
197 break
198
199 # Add the preserved last exchanges
200 for exchange in last_n_exchanges:
201 result.extend(exchange)
202
203 # Sort messages to maintain the correct order
204 result.sort(key=lambda m: messages.index(m) if m in messages else 999999)
205
206 # Verify the result is within the token limit
207 final_tokens = self.count_message_tokens(result, model)
208 if final_tokens > max_tokens:
209 logger.warning(f"Truncation failed to meet target: {final_tokens} > {max_tokens}")
210
211 return result
212
213 def compress_system_prompt(self, system_prompt: str, max_tokens: int, model: str = "gpt-3.5-turbo") -> str:
214 """Compress a system prompt to use fewer tokens while preserving key information."""
215 current_tokens = self.count_tokens(system_prompt, model)
216
217 if current_tokens <= max_tokens:
218 return system_prompt
219
220 # Use a language model to compress the prompt
221 # In a real implementation, you might want to call an external service
222
223 # Fallback compression strategy: Use text summarization techniques
224 # 1. Remove redundant phrases
225 redundant_phrases = [
226 "Please note that", "It's important to remember that", "Keep in mind that",
227 "I want you to", "I'd like you to", "You should", "Make sure to",
228 "Always", "Remember to" # "Never" is deliberately kept: stripping it would invert the instruction
229 ]
230
231 compressed = system_prompt
232 for phrase in redundant_phrases:
233 compressed = compressed.replace(phrase, "")
234
235 # 2. Replace verbose constructions with shorter ones
236 replacements = {
237 "in order to": "to",
238 "for the purpose of": "for",
239 "due to the fact that": "because",
240 "in the event that": "if",
241 "on the condition that": "if",
242 "with regard to": "about",
243 "in relation to": "about"
244 }
245
246 for verbose, concise in replacements.items():
247 compressed = compressed.replace(verbose, concise)
248
249 # 3. Remove unnecessary whitespace
250 compressed = re.sub(r'\s+', ' ', compressed).strip()
251
252 # 4. If still over the limit, truncate with an ellipsis
253 compressed_tokens = self.count_tokens(compressed, model)
254 if compressed_tokens > max_tokens:
255 # Approximation: 4 characters per token (leave room for the ellipsis)
256 char_limit = max_tokens * 4
257 compressed = compressed[:max(0, char_limit - 3)] + "..."
258
259 return compressed
260
261 def optimize_messages_for_cost(self,
262 messages: List[Dict[str, str]],
263 model: str,
264 max_tokens: int = 4096) -> List[Dict[str, str]]:
265 """Fully optimize messages for cost efficiency."""
266 if not messages:
267 return messages
268
269 # 1. First, identify system messages for compression
270 system_messages = []
271 other_messages = []
272
273 for msg in messages:
274 if msg.get("role") == "system":
275 system_messages.append(msg)
276 else:
277 other_messages.append(msg)
278
279 # 2. Compress system messages if there are multiple
280 if len(system_messages) > 1:
281 # Combine multiple system messages
282 combined_content = " ".join(msg.get("content", "") for msg in system_messages)
283 compressed_content = self.compress_system_prompt(combined_content, 1024, model)
284
285 # Replace with a single compressed message
286 system_messages = [{"role": "system", "content": compressed_content}]
287 elif len(system_messages) == 1 and self.count_tokens(system_messages[0].get("content", ""), model) > 1024:
288 # Compress a single long system message (build a new dict so the caller's message is not mutated)
289 system_messages = [{**system_messages[0], "content": self.compress_system_prompt(
290 system_messages[0].get("content", ""), 1024, model
291 )}]
292
293 # 3. Recombine and truncate the full conversation
294 optimized = system_messages + other_messages
295 reserved_completion_tokens = min(max(max_tokens // 4, 1024), max_tokens // 2) # Reserve 25% (at least 1024 tokens, never more than half the window) for completion
296 max_prompt_tokens = max_tokens - reserved_completion_tokens
297
298 return self.truncate_messages(
299 optimized,
300 max_prompt_tokens,
301 model,
302 preserve_system=True,
303 preserve_last_n_exchanges=2
304 )
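
The truncation logic above restores message order by sorting on `messages.index(m)`, which scans the list once per message and maps identical messages to the same position. As a minimal sketch of an alternative, assuming the same rough estimate of about 4 characters per token used in the fallback path, the messages to keep can be selected by index so the original order is preserved without a sort. The names `approx_tokens` and `fit_to_budget` are hypothetical and are not part of the service above.

```python
from typing import Dict, List


def approx_tokens(text: str) -> int:
    """Rough estimate (assumption): about 4 characters per token for English."""
    return len(text) // 4 + 1 if text else 0


def fit_to_budget(messages: List[Dict[str, str]], max_tokens: int) -> List[Dict[str, str]]:
    """Keep every system message, then the newest other messages that still fit."""
    # System messages are always kept, so their cost comes off the budget first.
    budget = max_tokens - sum(
        approx_tokens(m.get("content", ""))
        for m in messages
        if m.get("role") == "system"
    )

    keep = set()  # indices of the non-system messages to keep
    for i in range(len(messages) - 1, -1, -1):  # walk from newest to oldest
        message = messages[i]
        if message.get("role") == "system":
            continue
        cost = approx_tokens(message.get("content", ""))
        if cost > budget:
            break  # stop at the first message that does not fit
        keep.add(i)
        budget -= cost

    # Filtering by index preserves the original order, even with duplicate messages.
    return [
        m for i, m in enumerate(messages)
        if m.get("role") == "system" or i in keep
    ]
```

Because selection happens by position rather than by value, two identical user messages are treated as separate entries, and no re-sorting step is needed afterwards.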

2. Model Tier Selection

python
1# app/services/model_tier_service.py
2import logging
3from typing import Dict, List, Any, Optional, Tuple
4import re
5import time
6
7from app.config import settings
8
9logger = logging.getLogger(__name__)
10
11class ModelTierService:
12 """Selects the appropriate model tier based on task requirements and budget constraints."""
13
14 def __init__(self):
15 # Cost per 1000 tokens for different models (approximate)
16 self.model_costs = {
17 # OpenAI models input/output costs
18 "gpt-4": {"input": 0.03, "output": 0.06},
19 "gpt-4-32k": {"input": 0.06, "output": 0.12},
20 "gpt-4-turbo": {"input": 0.01, "output": 0.03},
21 "gpt-3.5-turbo": {"input": 0.0015, "output": 0.002},
22 "gpt-3.5-turbo-16k": {"input": 0.003, "output": 0.004},
23
24 # Ollama models (local, so effectively zero API cost)
25 "llama2": {"input": 0, "output": 0},
26 "mistral": {"input": 0, "output": 0},
27 "codellama": {"input": 0, "output": 0}
28 }
29
30 # Model capabilities and appropriate use cases
31 self.model_capabilities = {
32 "gpt-4": ["complex_reasoning", "creative", "code", "math", "general"],
33 "gpt-4-turbo": ["complex_reasoning", "creative", "code", "math", "general"],
34 "gpt-3.5-turbo": ["simple_reasoning", "general", "summarization"],
35 "llama2": ["simple_reasoning", "general", "summarization"],
36 "mistral": ["simple_reasoning", "general", "creative"],
37 "codellama": ["code"]
38 }
39
40 # Default model selections for different task types
41 self.task_model_mapping = {
42 "complex_reasoning": {
43 "high": "gpt-4-turbo",
44 "medium": "gpt-4-turbo",
45 "low": "gpt-3.5-turbo"
46 },
47 "simple_reasoning": {
48 "high": "gpt-3.5-turbo",
49 "medium": "gpt-3.5-turbo",
50 "low": "mistral"
51 },
52 "creative": {
53 "high": "gpt-4-turbo",
54 "medium": "mistral",
55 "low": "mistral"
56 },
57 "code": {
58 "high": "gpt-4-turbo",
59 "medium": "codellama",
60 "low": "codellama"
61 },
62 "math": {
63 "high": "gpt-4-turbo",
64 "medium": "gpt-3.5-turbo",
65 "low": "mistral"
66 },
67 "general": {
68 "high": "gpt-3.5-turbo",
69 "medium": "mistral",
70 "low": "llama2"
71 },
72 "summarization": {
73 "high": "gpt-3.5-turbo",
74 "medium": "mistral",
75 "low": "llama2"
76 }
77 }
78
79 # Budget tier thresholds - what percentage of budget is remaining?
80 self.budget_tiers = {
81 "high": 0.6, # >60% of budget remaining
82 "medium": 0.3, # 30-60% of budget remaining
83 "low": 0.0 # <30% of budget remaining
84 }
85
86 # Initialize usage tracking
87 self.monthly_budget = settings.MONTHLY_BUDGET
88 self.usage_this_month = 0
89 self.month_start_timestamp = self._get_month_start_timestamp()
90
91 def _get_month_start_timestamp(self) -> int:
92 """Get timestamp for the start of the current month."""
93 import datetime
94 now = datetime.datetime.now()
95 month_start = datetime.datetime(now.year, now.month, 1)
96 return int(month_start.timestamp())
97
98 def detect_task_type(self, query: str) -> str:
99 """Detect the type of task from the query."""
100 query_lower = query.lower()
101
102 # Check for code-related tasks
103 code_indicators = [
104 "code", "function", "program", "algorithm", "javascript",
105 "python", "java", "c++", "typescript", "html", "css"
106 ]
107 if any(indicator in query_lower for indicator in code_indicators):
108 return "code"
109
110 # Check for math problems
111 math_indicators = [
112 "calculate", "solve", "equation", "math problem", "compute",
113 "derivative", "integral", "algebra", "calculus", "arithmetic"
114 ]
115 if any(indicator in query_lower for indicator in math_indicators):
116 return "math"
117
118 # Check for creative tasks
119 creative_indicators = [
120 "story", "poem", "creative", "imagine", "fiction", "fantasy",
121 "character", "novel", "script", "narrative", "write a"
122 ]
123 if any(indicator in query_lower for indicator in creative_indicators):
124 return "creative"
125
126 # Check for complex reasoning
127 complex_indicators = [
128 "analyze", "critique", "evaluate", "compare and contrast",
129 "implications", "consequences", "recommend", "strategy",
130 "detailed explanation", "comprehensive", "thorough"
131 ]
132 if any(indicator in query_lower for indicator in complex_indicators):
133 return "complex_reasoning"
134
135 # Check for summarization
136 summary_indicators = [
137 "summarize", "summary", "tldr", "briefly explain", "short version",
138 "key points", "main ideas"
139 ]
140 if any(indicator in query_lower for indicator in summary_indicators):
141 return "summarization"
142
143 # Default to simple reasoning if no specific category is detected
144 simple_indicators = [
145 "explain", "how", "why", "what", "when", "who", "where",
146 "help me understand", "tell me about"
147 ]
148 if any(indicator in query_lower for indicator in simple_indicators):
149 return "simple_reasoning"
150
151 # Fallback to general
152 return "general"
153
154 def get_current_budget_tier(self) -> str:
155 """Get the current budget tier based on monthly usage."""
156 # Check if we're in a new month
157 current_month_start = self._get_month_start_timestamp()
158 if current_month_start > self.month_start_timestamp:
159 # Reset for new month
160 self.month_start_timestamp = current_month_start
161 self.usage_this_month = 0
162
163 if self.monthly_budget <= 0:
164 # No budget constraints
165 return "high"
166
167 # Calculate remaining budget percentage
168 remaining_percentage = 1 - (self.usage_this_month / self.monthly_budget)
169
170 # Determine tier
171 if remaining_percentage > self.budget_tiers["high"]:
172 return "high"
173 elif remaining_percentage > self.budget_tiers["medium"]:
174 return "medium"
175 else:
176 return "low"
177
178 def record_usage(self, model: str, input_tokens: int, output_tokens: int) -> None:
179 """Record token usage for budget tracking."""
180 if model not in self.model_costs:
181 return
182
183 costs = self.model_costs[model]
184 input_cost = (input_tokens / 1000) * costs["input"]
185 output_cost = (output_tokens / 1000) * costs["output"]
186 total_cost = input_cost + output_cost
187
188 self.usage_this_month += total_cost
189
190 # Log for monitoring
191 logger.info(f"Usage recorded: {model}, {input_tokens} input tokens, {output_tokens} output tokens, ${total_cost:.4f}")
192
193 def select_optimal_model(self,
194 query: str,
195 preferred_provider: Optional[str] = None,
196 force_tier: Optional[str] = None) -> Tuple[str, str]:
197 """
198 Select the optimal model based on the query and budget constraints.
199 Returns a tuple of (provider, model)
200 """
201 # Detect task type
202 task_type = self.detect_task_type(query)
203
204 # Get budget tier (unless forced)
205 budget_tier = force_tier if force_tier else self.get_current_budget_tier()
206
207 # Get the recommended model for this task and budget tier
208 recommended_model = self.task_model_mapping[task_type][budget_tier]
209
210 # Determine provider based on model
211 if recommended_model in ["llama2", "mistral", "codellama"]:
212 provider = "ollama"
213 else:
214 provider = "openai"
215
216 # Override provider if specified and compatible
217 if preferred_provider:
218 if preferred_provider == "ollama" and provider == "openai":
219 # Find an Ollama alternative for this task
220 for model, capabilities in self.model_capabilities.items():
221 if task_type in capabilities and model in ["llama2", "mistral", "codellama"]:
222 recommended_model = model
223 provider = "ollama"
224 break
225 elif preferred_provider == "openai" and provider == "ollama":
226 # Find the cheapest capable OpenAI model (plain dict order would pick gpt-4 first)
227 candidates = [m for m, caps in self.model_capabilities.items()
228 if task_type in caps and m not in ["llama2", "mistral", "codellama"]]
229 if candidates:
230 recommended_model = min(candidates, key=lambda m: self.model_costs[m]["input"])
231 provider = "openai"
232
233 logger.info(f"Selected model for task '{task_type}' (tier: {budget_tier}): {provider}:{recommended_model}")
234 return provider, recommended_model
235
236 def estimate_cost(self, model: str, input_tokens: int, expected_output_tokens: int) -> float:
237 """Estimate the cost of a request."""
238 if model not in self.model_costs:
239 return 0.0
240
241 costs = self.model_costs[model]
242 input_cost = (input_tokens / 1000) * costs["input"]
243 output_cost = (expected_output_tokens / 1000) * costs["output"]
244
245 return input_cost + output_cost
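
One caveat with the keyword checks in `detect_task_type` is that plain substring tests match inside longer words: "code" is found in "decode" and "how" in "show". The sketch below shows one way to match whole keywords only, using regular-expression lookarounds. It is an illustration under assumptions, not the service's actual behavior: the shortened keyword table `TASK_KEYWORDS` and the helper `_compile` are hypothetical.

```python
import re
from typing import Dict, List, Pattern

# Hypothetical, shortened keyword table for illustration only.
TASK_KEYWORDS: Dict[str, List[str]] = {
    "code": ["code", "function", "python", "java", "javascript", "c++"],
    "math": ["calculate", "solve", "equation"],
    "summarization": ["summarize", "summary", "tldr"],
}


def _compile(keywords: List[str]) -> Pattern[str]:
    """Build one pattern that matches any keyword as a whole word."""
    alternation = "|".join(re.escape(k) for k in keywords)
    # Lookarounds are used instead of \b so keywords such as "c++" also work.
    return re.compile(rf"(?<!\w)(?:{alternation})(?!\w)", re.IGNORECASE)


TASK_PATTERNS = {task: _compile(words) for task, words in TASK_KEYWORDS.items()}


def detect_task_type(query: str, default: str = "general") -> str:
    """Return the first task whose keywords appear as whole words in the query."""
    for task, pattern in TASK_PATTERNS.items():
        if pattern.search(query):
            return task
    return default
```

Compiling the patterns once at import time keeps per-request cost low, and the dictionary order still defines the priority between task types.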

3. Local Model Prioritization for Development

python
1# app/services/dev_mode_service.py
2import logging
3import os
4from typing import Dict, List, Any, Optional
5import re
6
7logger = logging.getLogger(__name__)
8
9class DevModeService:
10 """
11 Service that prioritizes local models during development to reduce costs.
12 """
13
14 def __init__(self):
15 # Read environment to determine if we're in development mode
16 self.is_dev_mode = os.environ.get("APP_ENV", "development").lower() == "development"
17 self.dev_mode_forced = os.environ.get("FORCE_DEV_MODE", "false").lower() == "true"
18
19 # Set up developer-focused settings
20 self.allow_openai_for_patterns = [
21 r"(complex|sophisticated|advanced)\s+(reasoning|analysis)",
22 r"(gpt-4|gpt-3\.5|openai)" # Explicit requests for OpenAI models
23 ]
24
25 self.use_ollama_for_patterns = [
26 r"^test\b", # Queries starting with "test" (\b also matches a bare "test")
27 r"^debug\b", # Debugging queries
28 r"^hello\b", # Simple greetings
29 r"^hi\b",
30 r"^try\b"
31 ]
32
33 # Track usage for reporting
34 self.openai_requests = 0
35 self.ollama_requests = 0
36 self.redirected_requests = 0
37
38 def is_development_environment(self) -> bool:
39 """Check if we're running in a development environment."""
40 return self.is_dev_mode or self.dev_mode_forced
41
42 def should_use_local_model(self, query: str) -> bool:
43 """
44 Determine if a query should use local models in development mode.
45 In development, we default to local models unless specific patterns are matched.
46 """
47 if not self.is_development_environment():
48 return False
49
50 # Always use local models for specific patterns
51 for pattern in self.use_ollama_for_patterns:
52 if re.search(pattern, query, re.IGNORECASE):
53 return True
54
55 # Allow OpenAI for specific advanced patterns
56 for pattern in self.allow_openai_for_patterns:
57 if re.search(pattern, query, re.IGNORECASE):
58 return False
59
60 # In development, default to local models to save costs
61 return True
62
63 def get_dev_routing_decision(self, query: str, default_provider: str) -> str:
64 """
65 Make a routing decision based on development mode settings.
66 Returns: "openai" or "ollama"
67 """
68 if not self.is_development_environment():
69 return default_provider
70
71 should_use_local = self.should_use_local_model(query)
72
73 # Track for reporting
74 if should_use_local:
75 self.ollama_requests += 1
76 if default_provider == "openai":
77 self.redirected_requests += 1
78 else:
79 self.openai_requests += 1
80
81 return "ollama" if should_use_local else "openai"
82
83 def get_usage_report(self) -> Dict[str, Any]:
84 """Get a report of usage patterns for monitoring costs."""
85 total_requests = self.openai_requests + self.ollama_requests
86
87 if total_requests == 0:
88 ollama_percentage = 0
89 redirected_percentage = 0
90 else:
91 ollama_percentage = (self.ollama_requests / total_requests) * 100
92 redirected_percentage = (self.redirected_requests / total_requests) * 100
93
94 return {
95 "dev_mode_active": self.is_development_environment(),
96 "total_requests": total_requests,
97 "openai_requests": self.openai_requests,
98 "ollama_requests": self.ollama_requests,
99 "redirected_to_ollama": self.redirected_requests,
100 "ollama_usage_percentage": ollama_percentage,
101 "cost_savings_percentage": redirected_percentage
102 }
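
The precedence in `should_use_local_model` (local patterns first, then the OpenAI allow-list, then a local default) can be condensed into a single pure function. The sketch below is a simplified illustration, assuming the environment name is passed in as an argument rather than read from `os.environ`; the function `route` and the two pattern lists are hypothetical stand-ins for the service's attributes.

```python
import re

# Hypothetical pattern lists, modeled on the service above.
LOCAL_PATTERNS = [r"^(test|debug|try|hi|hello)\b"]
CLOUD_PATTERNS = [
    r"\b(gpt-4|gpt-3\.5|openai)\b",
    r"\b(complex|sophisticated|advanced)\s+(reasoning|analysis)\b",
]


def route(query: str, app_env: str, default_provider: str = "openai") -> str:
    """Return "ollama" or "openai" for a query, given the environment name."""
    if app_env.lower() != "development":
        return default_provider  # outside development, leave routing untouched

    # 1. Cheap, throwaway queries always stay local.
    if any(re.search(p, query, re.IGNORECASE) for p in LOCAL_PATTERNS):
        return "ollama"

    # 2. Explicit requests for OpenAI or advanced analysis go to the cloud.
    if any(re.search(p, query, re.IGNORECASE) for p in CLOUD_PATTERNS):
        return "openai"

    # 3. Everything else defaults to the local model to save cost.
    return "ollama"
```

Taking `app_env` as a parameter makes the decision easy to unit-test, and it avoids silently treating an unset environment variable as development.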

4. Request Batching and Rate Limiting

python
1# app/services/rate_limiter.py
2import time
3import asyncio
4import logging
5from typing import Dict, List, Any, Optional, Callable, Awaitable
6from collections import defaultdict
7import redis.asyncio as redis
8
9from app.config import settings
10
11logger = logging.getLogger(__name__)
12
13class RateLimiter:
14 """
15 Rate limiter to control API usage and costs.
16 Implements tiered rate limiting based on user roles.
17 """
18
19 def __init__(self):
20 self.redis = None
21
22 # Rate limit tiers (requests per time window)
23 self.rate_limit_tiers = {
24 "free": {
25 "minute": 5,
26 "hour": 20,
27 "day": 100
28 },
29 "basic": {
30 "minute": 20,
31 "hour": 100,
32 "day": 1000
33 },
34 "premium": {
35 "minute": 60,
36 "hour": 1000,
37 "day": 10000
38 },
39 "enterprise": {
40 "minute": 120,
41 "hour": 5000,
42 "day": 50000
43 }
44 }
45
46 # Provider-specific rate limits (global)
47 self.provider_rate_limits = {
48 "openai": {
49 "minute": 60, # Shared across all users
50 "tokens_per_minute": 90000 # Token budget per minute
51 },
52 "ollama": {
53 "minute": 100, # Higher for local models
54 "tokens_per_minute": 250000
55 }
56 }
57
58 # Tracking for available token budgets
59 self.token_budgets = {
60 "openai": self.provider_rate_limits["openai"]["tokens_per_minute"],
61 "ollama": self.provider_rate_limits["ollama"]["tokens_per_minute"]
62 }
63 self.last_budget_reset = time.time()
64
65 async def initialize(self):
66 """Initialize Redis connection."""
67 self.redis = await redis.from_url(settings.REDIS_URL)
68
69 # Start token budget replenishment task (keep a reference so it is not garbage-collected)
70 self._replenish_task = asyncio.create_task(self._token_budget_replenishment())
71
72 async def _token_budget_replenishment(self):
73 """Periodically replenish token budgets."""
74 while True:
75 try:
76 now = time.time()
77 elapsed = now - self.last_budget_reset
78
79 # Reset every minute
80 if elapsed >= 60:
81 self.token_budgets = {
82 "openai": self.provider_rate_limits["openai"]["tokens_per_minute"],
83 "ollama": self.provider_rate_limits["ollama"]["tokens_per_minute"]
84 }
85 self.last_budget_reset = now
86
87 # Partial replenishment for less than a minute
88 else:
89 # Calculate replenishment based on elapsed time
90 openai_replenishment = int((elapsed / 60) * self.provider_rate_limits["openai"]["tokens_per_minute"])
91 ollama_replenishment = int((elapsed / 60) * self.provider_rate_limits["ollama"]["tokens_per_minute"])
92
93 # Replenish up to max
94 self.token_budgets["openai"] = min(
95 self.token_budgets["openai"] + openai_replenishment,
96 self.provider_rate_limits["openai"]["tokens_per_minute"]
97 )
98 self.token_budgets["ollama"] = min(
99 self.token_budgets["ollama"] + ollama_replenishment,
100 self.provider_rate_limits["ollama"]["tokens_per_minute"]
101 )
102
103 self.last_budget_reset = now
104 except Exception as e:
105 logger.error(f"Error in token budget replenishment: {str(e)}")
106
107 # Update every 5 seconds
108 await asyncio.sleep(5)
109
110 async def check_rate_limit(self,
111 user_id: str,
112 tier: str = "free",
113 provider: str = "openai") -> Dict[str, Any]:
114 """
115 Check if a request is within rate limits.
116 Returns: {"allowed": bool, "retry_after": Optional[int], "reason": Optional[str]}
117 """
118 if not self.redis:
119 # If Redis is not available, allow the request but log a warning
120 logger.warning("Redis not available for rate limiting")
121 return {"allowed": True}
122
123 # Get rate limits for this user's tier
124 tier_limits = self.rate_limit_tiers.get(tier, self.rate_limit_tiers["free"])
125
126 # Check user-specific rate limits
127 for window, limit in tier_limits.items():
128 key = f"rate:user:{user_id}:{window}"
129
130 # Get current count
131 count = await self.redis.get(key)
132 count = int(count) if count else 0
133
134 if count >= limit:
135 ttl = await self.redis.ttl(key)
136 return {
137 "allowed": False,
138 "retry_after": max(1, ttl),
139 "reason": f"Rate limit exceeded for {window}"
140 }
141
142 # Check provider-specific rate limits
143 provider_limits = self.provider_rate_limits.get(provider, {})
144 if "minute" in provider_limits:
145 provider_key = f"rate:provider:{provider}:minute"
146 provider_count = await self.redis.get(provider_key)
147 provider_count = int(provider_count) if provider_count else 0
148
149 if provider_count >= provider_limits["minute"]:
150 ttl = await self.redis.ttl(provider_key)
151 return {
152 "allowed": False,
153 "retry_after": max(1, ttl),
154 "reason": f"Global {provider} rate limit exceeded"
155 }
156
157 # Check token budget
158 if provider in self.token_budgets and self.token_budgets[provider] <= 0:
159 # Calculate time until next budget refresh
160 time_since_reset = time.time() - self.last_budget_reset
161 time_until_refresh = max(1, int(60 - time_since_reset))
162
163 return {
164 "allowed": False,
165 "retry_after": time_until_refresh,
166 "reason": f"{provider} token budget exhausted"
167 }
168
169 # All checks passed
170 return {"allowed": True}
171
172 async def increment_counters(self,
173 user_id: str,
174 provider: str,
175 token_count: int = 0) -> None:
176 """Increment rate limit counters after a successful request."""
177 if not self.redis:
178 return
179
180 now = int(time.time())
181
182 # Increment user counters for different windows
183 pipeline = self.redis.pipeline()
184
185 # Minute window (expires 60 seconds after the first request; nx=True needs Redis 7.0+)
186 minute_key = f"rate:user:{user_id}:minute"
187 pipeline.incr(minute_key)
188 pipeline.expire(minute_key, 60, nx=True) # NX: do not push the expiry back on later requests
189
190 # Hour window (expires 3600 seconds after the first request)
191 hour_key = f"rate:user:{user_id}:hour"
192 pipeline.incr(hour_key)
193 pipeline.expire(hour_key, 3600, nx=True)
194
195 # Day window (expires 86400 seconds after the first request)
196 day_key = f"rate:user:{user_id}:day"
197 pipeline.incr(day_key)
198 pipeline.expire(day_key, 86400, nx=True)
199
200 # Increment provider counter
201 provider_key = f"rate:provider:{provider}:minute"
202 pipeline.incr(provider_key)
203 pipeline.expire(provider_key, 60, nx=True)
204
205 # Execute all commands
206 await pipeline.execute()
207
208 # Decrement token budget
209 if provider in self.token_budgets and token_count > 0:
210 self.token_budgets[provider] = max(0, self.token_budgets[provider] - token_count)
211
212 async def get_user_usage(self, user_id: str) -> Dict[str, Any]:
213 """Get current usage statistics for a user."""
214 if not self.redis:
215 return {
216 "minute": {"usage": 0, "reset_in": 60},
217 "hour": {"usage": 0, "reset_in": 3600},
218 "day": {"usage": 0, "reset_in": 86400}
219 }
220
221 pipeline = self.redis.pipeline()
222
223 # Get counts for all windows
224 pipeline.get(f"rate:user:{user_id}:minute")
225 pipeline.get(f"rate:user:{user_id}:hour")
226 pipeline.get(f"rate:user:{user_id}:day")
227
228 # Get TTLs (time remaining)
229 pipeline.ttl(f"rate:user:{user_id}:minute")
230 pipeline.ttl(f"rate:user:{user_id}:hour")
231 pipeline.ttl(f"rate:user:{user_id}:day")
232
233 results = await pipeline.execute()
234
235 return {
236 "minute": {
237 "usage": int(results[0]) if results[0] else 0,
238 "reset_in": results[3] if results[3] and results[3] > 0 else 60
239 },
240 "hour": {
241 "usage": int(results[1]) if results[1] else 0,
242 "reset_in": results[4] if results[4] and results[4] > 0 else 3600
243 },
244 "day": {
245 "usage": int(results[2]) if results[2] else 0,
246 "reset_in": results[5] if results[5] and results[5] > 0 else 86400
247 }
248 }
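
The counters above are fixed-window counters: each window should start at the first request and end a fixed time later, regardless of how many requests arrive in between. The in-memory sketch below illustrates that behavior without Redis. It is a single-process illustration under assumptions, not a replacement for the Redis-backed limiter; the class `FixedWindowLimiter` and its injectable `clock` parameter are hypothetical.

```python
import time
from typing import Callable, Dict, Tuple


class FixedWindowLimiter:
    """Allow at most `limit` requests per key in each fixed window."""

    def __init__(self, limit: int, window_seconds: int,
                 clock: Callable[[], float] = time.time):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock  # injectable so tests can control time
        self._state: Dict[str, Tuple[float, int]] = {}  # key -> (window_start, count)

    def allow(self, key: str) -> bool:
        now = self.clock()
        start, count = self._state.get(key, (now, 0))

        # The window is anchored at the first request and never extended.
        if now - start >= self.window:
            start, count = now, 0

        if count >= self.limit:
            self._state[key] = (start, count)
            return False

        self._state[key] = (start, count + 1)
        return True
```

A usage example with a fake clock: create `FixedWindowLimiter(2, 60, clock=lambda: t[0])`, call `allow("user-1")` three times, and the third call is rejected until the fake time reaches 60 seconds after the first request.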

5. Memory and Context Compression

python
1# app/services/context_compression.py
2import logging
3from typing import List, Dict, Any, Optional
4import re
5import json
6
7logger = logging.getLogger(__name__)
8
9class ContextCompressor:
10 """
11 Compresses conversation history to reduce token usage while preserving context.
12 """
13
14 def __init__(self):
15 self.max_summary_tokens = 300 # Target size for summaries
16
17 async def compress_history(self,
18 messages: List[Dict[str, str]],
19 provider_service: Any) -> List[Dict[str, str]]:
20 """
21 Compress conversation history by summarizing older exchanges.
22 Returns a new message list with compressed history.
23 """
24 # If fewer than 4 messages (system + maybe 1-2 exchanges), no compression needed
25 if len(messages) < 4:
26 return messages.copy()
27
28 # Extract system message
29 system_messages = [m for m in messages if m.get("role") == "system"]
30
31 # Find the cut point - we'll preserve the most recent exchanges
32 if len(messages) <= 10:
33 # For shorter conversations, keep the most recent 3 messages (1-2 exchanges)
34 preserve_count = 3
35 compress_messages = messages[:-preserve_count]
36 preserve_messages = messages[-preserve_count:]
37 else:
38 # For longer conversations, preserve the most recent 4-6 messages (2-3 exchanges)
39 preserve_count = min(6, max(4, len(messages) // 5))
40 compress_messages = messages[:-preserve_count]
41 preserve_messages = messages[-preserve_count:]
42
43 # No system message in the compression list
44 compress_messages = [m for m in compress_messages if m.get("role") != "system"]
45
46 # If nothing to compress, return original
47 if not compress_messages:
48 return messages.copy()
49
50 # Generate summary of the earlier conversation
51 summary = await self._generate_conversation_summary(compress_messages, provider_service)
52
53 # Create a new message list with the summary + preserved messages
54 result = system_messages.copy() # Start with system message(s)
55
56 # Add summary as a system message
57 if summary:
58 result.append({
59 "role": "system",
60 "content": f"Previous conversation summary: {summary}"
61 })
62
63 # Add preserved recent messages (system messages were already added above)
64 result.extend(m for m in preserve_messages if m.get("role") != "system")
65
66 return result
67
68 async def _generate_conversation_summary(self,
69 messages: List[Dict[str, str]],
70 provider_service: Any) -> str:
71 """Generate a summary of the conversation history."""
72 if not messages:
73 return ""
74
75 # Format the conversation for summarization
76 conversation_text = "\n".join([
77 f"{m.get('role', 'unknown')}: {m.get('content', '')}"
78 for m in messages if m.get('content')
79 ])
80
81 # Prepare the summarization prompt
82 summary_prompt = [
83 {"role": "system", "content":
84 "You are a conversation summarizer. Create a concise summary of the key points "
85 "from the conversation that would help maintain context for future responses. "
86 "Focus on important information, user preferences, and outstanding questions. "
87 "Keep the summary under 200 words."
88 },
89 {"role": "user", "content": f"Summarize this conversation:\n\n{conversation_text}"}
90 ]
91
92 # Get a summary using a smaller/faster model
93 try:
94 summary_response = await provider_service.generate_completion(
95 messages=summary_prompt,
96 provider="openai", # Use OpenAI for reliability
97 model="gpt-3.5-turbo", # Use a smaller model for efficiency
98 max_tokens=self.max_summary_tokens
99 )
100
101 if summary_response and summary_response.get("message", {}).get("content"):
102 return summary_response["message"]["content"]
103
104 except Exception as e:
105 logger.error(f"Error generating conversation summary: {str(e)}")
106
107 # Simple fallback summary generation
108 topics = self._extract_topics(conversation_text)
109 if topics:
110 return f"Previous conversation covered: {', '.join(topics)}."
111
112 return "The conversation covered various topics which have been summarized to save space."
113
    def _extract_topics(self, conversation_text: str) -> List[str]:
        """Simple topic extraction as a fallback mechanism."""
        # Phrases that usually introduce a topic
        topic_phrases = [
            "discussed", "talked about", "mentioned", "referred to",
            "asked about", "inquired about", "wanted to know"
        ]

        topics = []

        for phrase in topic_phrases:
            # Capture the text after the phrase, stopping at punctuation or a line break
            # so a match cannot run into the next message of the joined conversation
            pattern = rf"{re.escape(phrase)} ([^.,:;\n]+)"
            matches = re.findall(pattern, conversation_text, re.IGNORECASE)
            topics.extend(match.strip() for match in matches)

        # Deduplicate while preserving first-seen order, then limit
        unique_topics = list(dict.fromkeys(topics))
        return unique_topics[:5]  # Return at most 5 topics
132
133 async def compress_user_query(self,
134 original_query: str,
135 provider_service: Any) -> str:
136 """
137 Compress a long user query to reduce token usage while preserving intent.
138 Used for very long inputs.
139 """
140 # If query is already reasonably sized, return as is
141 if len(original_query.split()) < 100:
142 return original_query
143
144 # Prepare compression prompt
145 compression_prompt = [
146 {"role": "system", "content":
147 "You are a query optimizer. Your job is to reformulate user queries to be more "
148 "concise while preserving the core intent and all critical details. "
149 "Remove redundant information and excessive elaboration, but maintain all "
150 "specific requirements, constraints, and examples provided."
151 },
152 {"role": "user", "content": f"Optimize this query to be more concise while preserving all important details:\n\n{original_query}"}
153 ]
154
155 # Get a compressed query
156 try:
157 compression_response = await provider_service.generate_completion(
158 messages=compression_prompt,
159 provider="openai",
160 model="gpt-3.5-turbo",
161 max_tokens=len(original_query.split()) // 2 # Target ~50% reduction
162 )
163
164 if (compression_response and
165 compression_response.get("message", {}).get("content") and
166 len(compression_response["message"]["content"]) < len(original_query)):
167 return compression_response["message"]["content"]
168
169 except Exception as e:
170 logger.error(f"Error compressing user query: {str(e)}")
171
172 # If compression fails or doesn't reduce size, return original
173 return original_query
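
The fallback behavior of the query compression step is easiest to see with a stand-in provider. The following is a minimal sketch, not the service itself: `StubProvider` and `compress_query` are hypothetical names introduced here, and the only thing assumed about the real provider is the `{"message": {"content": ...}}` response shape used throughout this listing.

```python
import asyncio
from typing import Any, Dict, List


class StubProvider:
    """Hypothetical stand-in for provider_service; returns a canned reply or fails."""

    def __init__(self, reply: str, fail: bool = False):
        self.reply = reply
        self.fail = fail

    async def generate_completion(self, messages: List[Dict[str, str]], **kwargs: Any) -> Dict[str, Any]:
        if self.fail:
            raise RuntimeError("provider unavailable")
        return {"message": {"content": self.reply}}


async def compress_query(query: str, provider: StubProvider, min_words: int = 100) -> str:
    """Return a shorter rewrite of the query, or the original if anything goes wrong."""
    if len(query.split()) < min_words:
        return query  # Short queries are passed through untouched
    try:
        response = await provider.generate_completion(
            messages=[{"role": "user", "content": query}],
            max_tokens=len(query.split()) // 2,  # Word count is only a rough proxy for tokens
        )
        text = response.get("message", {}).get("content", "")
        if text and len(text) < len(query):
            return text
    except Exception:
        pass  # Fall through to the original query
    return query


long_query = "please " * 120 + "summarize the attached report"
compressed = asyncio.run(compress_query(long_query, StubProvider("Summarize the attached report.")))
unchanged = asyncio.run(compress_query(long_query, StubProvider("", fail=True)))
```

With the working stub the long query is replaced by the shorter reply; with the failing stub the caller gets the original text back, so a provider outage never blocks the request.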

Response Accuracy Optimization Strategies

1. Prompt Engineering Templates

python
1# app/services/prompt_templates.py
2from typing import Dict, List, Any, Optional
3import re
4
5class PromptTemplates:
6 """
7 Provides optimized prompt templates for different use cases to improve response accuracy.
8 """
9
10 def __init__(self):
11 # Core system prompt templates
12 self.system_templates = {
13 "general": """
14 You are a helpful assistant with diverse knowledge and capabilities.
15 Provide accurate, relevant, and concise responses to user queries.
16 When you don't know something, admit it rather than making up information.
17 Format your responses clearly using markdown when helpful.
18 """,
19
20 "coding": """
21 You are a coding assistant with expertise in programming languages and software development.
22 Provide correct, efficient, and well-documented code examples.
23 Explain your code clearly and highlight important concepts.
24 Format code blocks using markdown with appropriate syntax highlighting.
25 Suggest best practices and consider edge cases in your solutions.
26 """,
27
28 "research": """
29 You are a research assistant with access to broad knowledge.
30 Provide comprehensive, accurate, and nuanced information.
31 Consider different perspectives and cite limitations of your knowledge.
32 Structure complex information clearly and logically.
33 Indicate uncertainty when appropriate rather than speculating.
34 """,
35
36 "math": """
37 You are a mathematics tutor with expertise in various mathematical domains.
38 Provide step-by-step explanations for mathematical problems.
39 Use clear notation and formatting for equations using markdown.
40 Verify your solutions and check for errors or edge cases.
41 When solving problems, explain the underlying concepts and techniques.
42 """,
43
44 "creative": """
45 You are a creative assistant skilled in writing, storytelling, and idea generation.
46 Provide original, engaging, and imaginative content based on user requests.
47 Consider tone, style, and audience in your creative work.
48 When generating stories or content, maintain internal consistency.
49 Respect copyright and avoid plagiarizing existing creative works.
50 """
51 }
52
53 # Task-specific prompt templates that can be inserted into system prompts
54 self.task_templates = {
55 "step_by_step": """
56 Break down your explanation into clear, logical steps.
57 Begin with foundational concepts before advancing to more complex ideas.
58 Use numbered or bulleted lists for sequential instructions or key points.
59 Provide examples to illustrate abstract concepts.
60 """,
61
62 "comparison": """
63 Present a balanced and objective comparison.
64 Identify clear categories for comparison (features, performance, use cases, etc.).
65 Highlight both similarities and differences.
66 Consider context and specific use cases in your evaluation.
67 Avoid unjustified bias and present evidence for evaluative statements.
68 """,
69
70 "factual_accuracy": """
71 Prioritize accuracy over comprehensiveness.
72 Clearly distinguish between well-established facts, expert consensus, and speculation.
73 Acknowledge limitations in your knowledge, especially for time-sensitive information.
74 Avoid overgeneralizations and recognize exceptions where relevant.
75 """,
76
77 "technical_explanation": """
78 Begin with a high-level overview before diving into technical details.
79 Define specialized terminology when introduced.
80 Use analogies to explain complex concepts when appropriate.
81 Balance technical precision with accessibility based on the apparent expertise level of the user.
82 """
83 }
84
85 # Output format templates
86 self.format_templates = {
87 "pros_cons": """
88 Structure your response with clearly labeled sections for advantages and disadvantages.
89 Use bullet points or numbered lists for each point.
90 Consider different perspectives or use cases.
91 If applicable, provide a balanced conclusion or recommendation.
92 """,
93
94 "academic": """
95 Structure your response similar to an academic paper with introduction, body, and conclusion.
96 Use formal language and precise terminology.
97 Acknowledge limitations and alternative viewpoints.
98 Refer to theoretical frameworks or methodologies where relevant.
99 """,
100
101 "tutorial": """
102 Structure your response as a tutorial with clear sections:
103 - Introduction explaining what will be covered and prerequisites
104 - Step-by-step instructions with examples
105 - Common pitfalls or troubleshooting tips
106 - Summary of key takeaways
107 Use headings and code blocks with appropriate formatting.
108 """,
109
110 "eli5": """
111 Explain the concept as if to a 10-year-old with no specialized knowledge.
112 Use simple language and concrete analogies.
113 Break complex ideas into simple components.
114 Avoid jargon, or define terms very clearly when they must be used.
115 """
116 }
117
    def get_system_prompt(self, category: str, include_tasks: Optional[List[str]] = None) -> str:
        """Get a system prompt template with optional task-specific additions."""
        # Unknown categories fall back to the general template
        base_template = self.system_templates.get(
            category,
            self.system_templates["general"]
        ).strip()

        if not include_tasks:
            return base_template

        # Add selected task templates, skipping unknown task names
        task_additions = [
            self.task_templates[task].strip()
            for task in include_tasks
            if task in self.task_templates
        ]

        if task_additions:
            return base_template + "\n\n" + "\n\n".join(task_additions)

        return base_template

    def enhance_user_prompt(self, original_prompt: str, format_type: Optional[str] = None) -> str:
        """Enhance a user prompt with formatting instructions."""
        if not format_type or format_type not in self.format_templates:
            return original_prompt

        format_instructions = self.format_templates[format_type].strip()
        enhanced_prompt = f"{original_prompt}\n\nPlease format your response as follows:\n{format_instructions}"

        return enhanced_prompt
149
150 def detect_format_type(self, prompt: str) -> Optional[str]:
151 """Detect what format type might be appropriate based on prompt content."""
152 prompt_lower = prompt.lower()
153
154 # Check for format indicators
155 if any(phrase in prompt_lower for phrase in ["pros and cons", "advantages and disadvantages", "benefits and drawbacks"]):
156 return "pros_cons"
157
158 if any(phrase in prompt_lower for phrase in ["academic", "paper", "research", "literature", "theoretical"]):
159 return "academic"
160
161 if any(phrase in prompt_lower for phrase in ["tutorial", "how to", "guide", "step by step", "walkthrough"]):
162 return "tutorial"
163
164 if any(phrase in prompt_lower for phrase in ["explain like", "eli5", "simple terms", "layman's terms", "simply explain"]):
165 return "eli5"
166
167 return None
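
To show how the three template groups combine into a request, here is a small self-contained sketch. The template strings are shortened stand-ins and `build_messages` is a hypothetical helper, not part of the class above; it only mirrors the composition rules: unknown categories fall back to the general template, unknown task names are skipped, and format instructions are appended to the user prompt.

```python
from typing import Dict, List, Optional

# Shortened stand-ins for the templates defined in the listing above
SYSTEM_TEMPLATES = {
    "general": "You are a helpful assistant.",
    "coding": "You are a coding assistant.",
}
TASK_TEMPLATES = {
    "step_by_step": "Break down your explanation into clear, logical steps.",
}
FORMAT_TEMPLATES = {
    "pros_cons": "Use clearly labeled sections for advantages and disadvantages.",
}


def build_messages(query: str,
                   category: str,
                   tasks: Optional[List[str]] = None,
                   format_type: Optional[str] = None) -> List[Dict[str, str]]:
    """Compose a system prompt and a user prompt into a chat message list."""
    system = SYSTEM_TEMPLATES.get(category, SYSTEM_TEMPLATES["general"])
    extras = [TASK_TEMPLATES[t] for t in (tasks or []) if t in TASK_TEMPLATES]
    system_prompt = "\n\n".join([system, *extras])

    user_prompt = query
    if format_type in FORMAT_TEMPLATES:
        user_prompt = (
            f"{query}\n\nPlease format your response as follows:\n"
            f"{FORMAT_TEMPLATES[format_type]}"
        )

    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ]


messages = build_messages(
    "What are the pros and cons of SQLite?",
    "coding",
    tasks=["step_by_step"],
    format_type="pros_cons",
)
```

Keeping task guidance in the system message and format guidance in the user message matches the split used by `get_system_prompt` and `enhance_user_prompt`.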

2. Context-Aware Chain of Thought

python
1# app/services/chain_of_thought.py
2from typing import Dict, List, Any, Optional
3import logging
4import json
5import re
6
7logger = logging.getLogger(__name__)
8
9class ChainOfThoughtService:
10 """
11 Enhances response accuracy by enabling step-by-step reasoning.
12 """
13
14 def __init__(self):
15 # Configure when to use chain-of-thought prompting
        self.cot_triggers = [
            # Keywords indicating complex reasoning is needed
            # (the leading \b keeps "how" from matching inside words such as "show")
            r"\b(why|how|explain|analyze|reason|think|consider)",
            # Question patterns that benefit from step-by-step thinking
            r"\b(what (would|will|could|might) happen if)",
            r"\b(what (is|are) the (cause|reason|impact|effect|implication))",
            # Complexity indicators
            r"\b(complex|complicated|difficult|challenging|nuanced)",
            # Multi-step problems
            r"\b(steps|process|procedure|method|approach)"
        ]
27
28 # Task-specific CoT templates
29 self.cot_templates = {
30 "general": "Let's think through this step-by-step.",
31
32 "math": """
33 Let's solve this step-by-step:
34 1. First, understand what we're looking for
35 2. Identify the relevant information and equations
36 3. Work through the solution methodically
37 4. Verify the answer makes sense
38 """,
39
40 "reasoning": """
41 Let's approach this systematically:
42 1. Identify the key elements of the problem
43 2. Consider relevant principles and constraints
44 3. Analyze potential approaches
45 4. Evaluate and compare alternatives
46 5. Draw a well-reasoned conclusion
47 """,
48
49 "decision": """
50 Let's analyze this decision carefully:
51 1. Clarify the decision to be made
52 2. Identify the key criteria and constraints
53 3. Consider the available options
54 4. Evaluate each option against the criteria
55 5. Assess potential risks and trade-offs
56 6. Recommend the best course of action with justification
57 """,
58
59 "causal": """
60 Let's analyze the causal relationships:
61 1. Identify the events or phenomena to be explained
62 2. Consider potential causes and mechanisms
63 3. Evaluate the evidence for each causal link
64 4. Consider alternative explanations
65 5. Draw conclusions about the most likely causal relationships
66 """
67 }
68
69 # Internal vs. external CoT modes
70 self.cot_modes = {
71 "internal": {
72 "prefix": "Think through this problem step-by-step before providing your final answer.",
73 "format": "standard" # No special formatting needed
74 },
75 "external": {
76 "prefix": "Show your step-by-step reasoning process explicitly in your response.",
77 "format": "markdown" # Format as markdown
78 }
79 }
80
81 def should_use_cot(self, query: str) -> bool:
82 """Determine if chain-of-thought prompting should be used for this query."""
83 query_lower = query.lower()
84
85 # Check for CoT triggers
86 for pattern in self.cot_triggers:
87 if re.search(pattern, query_lower):
88 return True
89
90 # Check for task complexity indicators
91 if len(query.split()) > 30: # Longer queries often benefit from CoT
92 return True
93
94 # Check for explicit reasoning requests
95 explicit_requests = [
96 "step by step", "explain your reasoning", "think through",
97 "show your work", "explain how you", "walk me through"
98 ]
99
100 if any(request in query_lower for request in explicit_requests):
101 return True
102
103 return False
104
105 def detect_task_type(self, query: str) -> str:
106 """Detect the type of reasoning task from the query."""
107 query_lower = query.lower()
108
109 # Check for mathematical content
110 math_indicators = [
111 "calculate", "compute", "solve", "equation", "formula",
112 "find the value", "what is the result", r"\d+(\.\d+)?"
113 ]
114
115 if any(re.search(indicator, query_lower) for indicator in math_indicators):
116 return "math"
117
118 # Check for decision-making queries
119 decision_indicators = [
120 "should i", "which is better", "what's the best", "recommend",
121 "decide between", "choose", "options"
122 ]
123
124 if any(indicator in query_lower for indicator in decision_indicators):
125 return "decision"
126
127 # Check for causal analysis
128 causal_indicators = [
129 "why did", "what caused", "reason for", "explain why",
130 "how does", "what leads to", "effect of", "impact of"
131 ]
132
133 if any(indicator in query_lower for indicator in causal_indicators):
134 return "causal"
135
136 # Default to general reasoning
137 reasoning_indicators = [
138 "explain", "analyze", "evaluate", "critique", "assess",
139 "compare", "contrast", "discuss", "review"
140 ]
141
142 if any(indicator in query_lower for indicator in reasoning_indicators):
143 return "reasoning"
144
145 return "general"
146
147 def enhance_prompt_with_cot(self,
148 query: str,
149 mode: str = "internal",
150 explicit_template: bool = False) -> str:
151 """
152 Enhance a prompt with chain-of-thought instructions.
153
154 Args:
155 query: The original user query
156 mode: "internal" (for model thinking) or "external" (for visible reasoning)
157 explicit_template: Whether to include the full template or just the instruction
158 """
159 if not self.should_use_cot(query):
160 return query
161
162 # Get CoT mode configuration
163 cot_mode = self.cot_modes.get(mode, self.cot_modes["internal"])
164
165 # Detect the task type
166 task_type = self.detect_task_type(query)
167
168 # Get the appropriate template
169 template = self.cot_templates.get(task_type, self.cot_templates["general"])
170
171 if explicit_template:
172 # Add the full template
173 enhanced = f"{query}\n\n{cot_mode['prefix']}\n\n{template.strip()}"
174 else:
175 # Just add the basic instruction
176 enhanced = f"{query}\n\n{cot_mode['prefix']}"
177
178 return enhanced
179
180 def format_cot_for_response(self, reasoning: str, final_answer: str, mode: str = "external") -> str:
181 """
182 Format chain-of-thought reasoning and final answer for response.
183
184 Args:
185 reasoning: The step-by-step reasoning process
186 final_answer: The final answer or conclusion
187 mode: "internal" (hidden) or "external" (visible)
188 """
189 if mode == "internal":
190 # For internal mode, just return the final answer
191 return final_answer
192
193 # For external mode, format the reasoning and answer
194 formatted = f"""
195## Reasoning Process
196
197{reasoning}
198
199## Conclusion
200
201{final_answer}
202"""
203 return formatted.strip()
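
`format_cot_for_response` expects the reasoning and the final answer as separate strings, but the listing does not show how a single model reply is split into the two. One possible approach, sketched below under the assumption that the prompt asks the model to end with a line starting `Final answer:`, is to split on the last occurrence of that marker. Both the marker and the `split_reasoning` helper are hypothetical and not part of the service above.

```python
from typing import Tuple

MARKER = "Final answer:"  # Hypothetical marker the prompt would have to request


def split_reasoning(reply: str) -> Tuple[str, str]:
    """Split a model reply into (reasoning, final_answer) on the last marker."""
    reasoning, separator, answer = reply.rpartition(MARKER)
    if not separator:
        # No marker found: treat the whole reply as the answer
        return "", reply.strip()
    return reasoning.strip(), answer.strip()


reply = "Step 1: 2 + 2 = 4.\nStep 2: 4 * 3 = 12.\nFinal answer: 12"
reasoning, final_answer = split_reasoning(reply)
```

Splitting on the last marker rather than the first keeps the split stable when the model mentions the phrase earlier in its reasoning.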

3. Self-Verification and Error Correction

python
1# app/services/verification_service.py
2import logging
3from typing import Dict, List, Any, Optional, Tuple
4import re
5import json
6
7logger = logging.getLogger(__name__)
8
9class VerificationService:
10 """
11 Improves response accuracy through self-verification and error correction.
12 """
13
14 def __init__(self):
15 # Define verification categories
16 self.verification_categories = [
17 "factual_accuracy",
18 "logical_consistency",
19 "completeness",
20 "code_correctness",
21 "calculation_accuracy",
22 "bias_detection"
23 ]
24
25 # High-risk categories that should always be verified
26 self.high_risk_categories = [
27 "medical",
28 "legal",
29 "financial",
30 "security"
31 ]
32
33 # Verification prompt templates
34 self.verification_templates = {
35 "general": """
36 Please verify your response for:
37 1. Factual accuracy - Are all stated facts correct?
38 2. Logical consistency - Is the reasoning sound and free of contradictions?
39 3. Completeness - Does the answer address all aspects of the question?
40 4. Clarity - Is the response clear and easy to understand?
41
42 If you find any errors or omissions, please correct them in your response.
43 """,
44
45 "factual": """
46 Critically verify the factual claims in your response:
47 - Are dates, names, and definitions accurate?
48 - Are statistics and measurements correct?
49 - Are attributions to people, organizations, or sources accurate?
50 - Have you distinguished between facts and opinions/interpretations?
51
52 If you identify any factual errors, please correct them.
53 """,
54
55 "code": """
56 Verify your code for:
57 1. Syntax errors and typos
58 2. Logical correctness - does it perform the intended function?
59 3. Edge cases and error handling
60 4. Efficiency and best practices
61 5. Security vulnerabilities
62
63 If you find any issues, please provide corrected code.
64 """,
65
66 "math": """
67 Verify your mathematical work by:
68 1. Re-checking each calculation step
69 2. Verifying that formulas are applied correctly
70 3. Confirming unit conversions if applicable
71 4. Testing the solution with sample values if possible
72 5. Checking for arithmetic errors
73
74 If you find any errors, please recalculate and provide the correct answer.
75 """,
76
77 "bias": """
78 Check your response for potential biases:
79 1. Is the framing balanced and objective?
80 2. Have you considered diverse perspectives?
81 3. Are there cultural, geographic, or demographic assumptions?
82 4. Does the language contain implicit value judgments?
83
84 If you detect bias, please revise for greater objectivity.
85 """
86 }
87
88 def detect_verification_needs(self, query: str) -> List[str]:
89 """Detect which verification categories are needed based on the query."""
90 query_lower = query.lower()
91 needed_categories = []
92
        # High-risk topics always get comprehensive verification
        if any(category in query_lower for category in self.high_risk_categories):
            return ["factual_accuracy", "logical_consistency", "completeness", "bias_detection"]
103
104 # Check for code-related content
105 code_indicators = ["code", "function", "program", "algorithm", "syntax"]
106 if any(indicator in query_lower for indicator in code_indicators):
107 needed_categories.append("code_correctness")
108
109 # Check for mathematical content
110 math_indicators = ["calculate", "compute", "solve", "equation", "math problem"]
111 if any(indicator in query_lower for indicator in math_indicators):
112 needed_categories.append("calculation_accuracy")
113
114 # Check for factual questions
115 factual_indicators = ["fact", "information about", "when did", "who is", "history of"]
116 if any(indicator in query_lower for indicator in factual_indicators):
117 needed_categories.append("factual_accuracy")
118
119 # Check for logical reasoning requirements
120 logic_indicators = ["why", "explain", "reason", "because", "therefore", "hence"]
121 if any(indicator in query_lower for indicator in logic_indicators):
122 needed_categories.append("logical_consistency")
123
124 # For comprehensive questions
125 if len(query.split()) > 30 or "comprehensive" in query_lower or "detailed" in query_lower:
126 needed_categories.append("completeness")
127
128 # For sensitive or controversial topics
129 sensitive_indicators = ["controversy", "debate", "opinion", "perspective", "ethical"]
130 if any(indicator in query_lower for indicator in sensitive_indicators):
131 needed_categories.append("bias_detection")
132
133 # Default to basic verification if nothing specific detected
134 if not needed_categories:
135 needed_categories = ["factual_accuracy", "logical_consistency"]
136
137 return needed_categories
138
139 def get_verification_prompt(self, categories: List[str]) -> str:
140 """Get the appropriate verification prompt based on needed categories."""
141 if "code_correctness" in categories and len(categories) == 1:
142 return self.verification_templates["code"]
143
144 if "calculation_accuracy" in categories and len(categories) == 1:
145 return self.verification_templates["math"]
146
147 if "factual_accuracy" in categories and "bias_detection" not in categories:
148 return self.verification_templates["factual"]
149
150 if "bias_detection" in categories and len(categories) == 1:
151 return self.verification_templates["bias"]
152
153 # Default to general verification
154 return self.verification_templates["general"]
155
156 async def verify_response(self,
157 query: str,
158 initial_response: str,
159 provider_service: Any) -> Tuple[str, bool]:
160 """
161 Verify and potentially correct a response.
162
163 Returns:
164 Tuple of (verified_response, was_corrected)
165 """
166 # Detect verification needs
167 verification_categories = self.detect_verification_needs(query)
168
169 # If no verification needed, return original
170 if not verification_categories:
171 return initial_response, False
172
173 # Get verification prompt
174 verification_prompt = self.get_verification_prompt(verification_categories)
175
176 # Create verification messages
177 verification_messages = [
178 {"role": "system", "content":
179 "You are a verification assistant. Your job is to verify the accuracy, "
180 "consistency, and completeness of responses. Identify any errors or "
181 "issues, and provide corrections when necessary."
182 },
183 {"role": "user", "content": query},
184 {"role": "assistant", "content": initial_response},
185 {"role": "user", "content": verification_prompt}
186 ]
187
188 try:
189 verification_response = await provider_service.generate_completion(
190 messages=verification_messages,
191 provider="openai", # Use OpenAI for verification
192 model="gpt-4" # Use a more capable model for verification
193 )
194
            if verification_response and verification_response.get("message", {}).get("content"):
                verification_text = verification_response["message"]["content"]
                verification_lower = verification_text.lower()

                # Check minor notes first: "small correction" contains "correction",
                # so it would otherwise always be caught by the stronger check below
                minor_indicators = ["minor clarification", "additional note", "small correction"]
                if any(indicator in verification_lower for indicator in minor_indicators):
                    # Include the clarification in the response
                    combined = f"{initial_response}\n\n**Note:** {verification_text}"
                    return combined, True

                # Look for indicators of substantive corrections.
                # Substring matching is a heuristic: a reply such as "no errors found"
                # still contains "error" and will trigger a rewrite.
                correction_indicators = [
                    "correction", "error", "mistake", "incorrect",
                    "needs clarification", "inaccurate", "not quite right"
                ]

                if any(indicator in verification_lower for indicator in correction_indicators):
                    # Attempt to correct the response
                    corrected_response = await self._generate_corrected_response(
                        query, initial_response, verification_text, provider_service
                    )
                    return corrected_response, True

            # If verification failed or found no issues
            return initial_response, False
221
222 except Exception as e:
223 logger.error(f"Error in response verification: {str(e)}")
224 return initial_response, False
225
226 async def _generate_corrected_response(self,
227 query: str,
228 initial_response: str,
229 verification_text: str,
230 provider_service: Any) -> str:
231 """Generate a corrected response based on verification feedback."""
232 correction_prompt = [
233 {"role": "system", "content":
234 "You are a correction assistant. Your job is to provide a revised response "
235 "that addresses the issues identified in the verification feedback. "
236 "Create a complete, standalone corrected response."
237 },
238 {"role": "user", "content": f"Original question:\n{query}"},
239 {"role": "assistant", "content": f"Initial response:\n{initial_response}"},
240 {"role": "user", "content": f"Verification feedback:\n{verification_text}\n\nPlease provide a corrected response."}
241 ]
242
243 try:
244 correction_response = await provider_service.generate_completion(
245 messages=correction_prompt,
246 provider="openai",
247 model="gpt-4"
248 )
249
250 if correction_response and correction_response.get("message", {}).get("content"):
251 return correction_response["message"]["content"]
252
253 except Exception as e:
254 logger.error(f"Error generating corrected response: {str(e)}")
255
256 # Fallback - append verification notes to original
257 return f"{initial_response}\n\n**Correction Note:** {verification_text}"
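
Keyword matching on free-form verifier output is fragile, because a reply that reports no problems can still contain words like "error". One alternative, sketched here as an assumption rather than as part of the service above, is to ask the verifier to begin its reply with an explicit verdict line and parse only that line. The `VERDICT:` convention and the `parse_verdict` helper are hypothetical.

```python
from typing import Tuple


def parse_verdict(verification_text: str) -> Tuple[bool, str]:
    """Return (needs_correction, feedback) from a reply that starts with a verdict line.

    Assumes the verifier was instructed to start with "VERDICT: PASS" or
    "VERDICT: FAIL" and put its feedback on the following lines.
    """
    first_line, _, rest = verification_text.strip().partition("\n")
    needs_correction = first_line.strip().upper().startswith("VERDICT: FAIL")
    return needs_correction, rest.strip()


clean_reply = "VERDICT: PASS\nNo errors found."
flagged_reply = "verdict: fail\nThe release date is wrong."

# The keyword heuristic would flag the clean reply, since it contains "error"
keyword_flag = "error" in clean_reply.lower()
verdict_flag, _ = parse_verdict(clean_reply)
```

A reply that omits the verdict line is treated as a pass in this sketch, which matches the service's default of returning the initial response unchanged when verification is inconclusive.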

4. Domain-Specific Knowledge Integration

python
1# app/services/domain_knowledge.py
import hashlib  # Needed by detect_domains for cache keys
import logging
from typing import Dict, List, Any, Optional
import json
import re
import os
import yaml
8
9logger = logging.getLogger(__name__)
10
11class DomainKnowledgeService:
12 """
13 Enhances response accuracy by integrating domain-specific knowledge.
14 """
15
16 def __init__(self, knowledge_dir: str = "knowledge"):
17 self.knowledge_dir = knowledge_dir
18
19 # Domain definitions
20 self.domains = {
21 "programming": {
22 "keywords": ["coding", "programming", "software", "development", "algorithm", "function"],
23 "languages": ["python", "javascript", "java", "c++", "ruby", "go", "rust", "php"]
24 },
25 "medicine": {
26 "keywords": ["medical", "health", "disease", "treatment", "diagnosis", "symptom", "patient"],
27 "specialties": ["cardiology", "neurology", "pediatrics", "oncology", "psychiatry"]
28 },
29 "finance": {
30 "keywords": ["finance", "investment", "stock", "market", "trading", "portfolio", "asset"],
31 "topics": ["stocks", "bonds", "cryptocurrency", "retirement", "taxes", "budgeting"]
32 },
33 "law": {
34 "keywords": ["legal", "law", "regulation", "compliance", "contract", "liability"],
35 "areas": ["corporate", "criminal", "civil", "constitutional", "intellectual property"]
36 },
37 "science": {
38 "keywords": ["science", "research", "experiment", "theory", "hypothesis", "evidence"],
39 "fields": ["physics", "chemistry", "biology", "astronomy", "geology", "ecology"]
40 }
41 }
42
43 # Load domain knowledge
44 self.domain_knowledge = self._load_domain_knowledge()
45
46 # Track query->domain mappings to optimize repeated queries
47 self.domain_cache = {}
48
49 def _load_domain_knowledge(self) -> Dict[str, Any]:
50 """Load domain knowledge from files."""
51 knowledge = {}
52
53 try:
54 # Create knowledge dir if it doesn't exist
55 os.makedirs(self.knowledge_dir, exist_ok=True)
56
57 # List all domain knowledge files
58 for domain in self.domains.keys():
59 domain_path = os.path.join(self.knowledge_dir, f"{domain}.yaml")
60
61 # Create empty file if it doesn't exist
62 if not os.path.exists(domain_path):
63 with open(domain_path, 'w') as f:
64 yaml.dump({
65 "domain": domain,
66 "concepts": {},
67 "facts": [],
68 "common_misconceptions": [],
69 "best_practices": []
70 }, f)
71
72 # Load domain knowledge
73 try:
74 with open(domain_path, 'r') as f:
75 domain_data = yaml.safe_load(f)
76 knowledge[domain] = domain_data
77 except Exception as e:
78 logger.error(f"Error loading domain knowledge for {domain}: {str(e)}")
79 knowledge[domain] = {
80 "domain": domain,
81 "concepts": {},
82 "facts": [],
83 "common_misconceptions": [],
84 "best_practices": []
85 }
86 except Exception as e:
87 logger.error(f"Error loading domain knowledge: {str(e)}")
88
89 return knowledge
90
91 def detect_domains(self, query: str) -> List[str]:
92 """Detect relevant domains for a query."""
93 # Check cache first
94 cache_key = hashlib.md5(query.encode()).hexdigest()
95 if cache_key in self.domain_cache:
96 return self.domain_cache[cache_key]
97
98 query_lower = query.lower()
99 relevant_domains = []
100
101 # Check each domain for relevance
102 for domain, definition in self.domains.items():
103 # Check domain keywords
104 keyword_match = any(keyword in query_lower for keyword in definition["keywords"])
105
106 # Check specific domain topics
107 topic_match = False
108 for topic_category, topics in definition.items():
109 if topic_category != "keywords":
110 if any(topic in query_lower for topic in topics):
111 topic_match = True
112 break
113
114 if keyword_match or topic_match:
115 relevant_domains.append(domain)
116
117 # Cache result
118 self.domain_cache[cache_key] = relevant_domains
119 return relevant_domains
120
121 def get_domain_knowledge(self, domains: List[str]) -> Dict[str, Any]:
122 """Get knowledge for the specified domains."""
123 combined_knowledge = {
124 "concepts": {},
125 "facts": [],
126 "common_misconceptions": [],
127 "best_practices": []
128 }
129
130 for domain in domains:
131 if domain in self.domain_knowledge:
132 domain_data = self.domain_knowledge[domain]
133
134 # Merge concepts (dictionary)
135 combined_knowledge["concepts"].update(domain_data.get("concepts", {}))
136
137 # Extend lists
138 for key in ["facts", "common_misconceptions", "best_practices"]:
139 combined_knowledge[key].extend(domain_data.get(key, []))
140
141 return combined_knowledge
142
143 def format_domain_knowledge(self, knowledge: Dict[str, Any]) -> str:
144 """Format domain knowledge as a context string."""
145 if not knowledge or all(not v for v in knowledge.values()):
146 return ""
147
148 formatted_parts = []
149
150 # Format concepts
151 if knowledge["concepts"]:
152 concepts_list = []
153 for concept, definition in knowledge["concepts"].items():
154 concepts_list.append(f"- {concept}: {definition}")
155
156 formatted_parts.append("Key concepts:\n" + "\n".join(concepts_list))
157
158 # Format facts
159 if knowledge["facts"]:
160 formatted_parts.append("Important facts:\n- " + "\n- ".join(knowledge["facts"]))
161
162 # Format misconceptions
163 if knowledge["common_misconceptions"]:
164 formatted_parts.append("Common misconceptions to avoid:\n- " + "\n- ".join(knowledge["common_misconceptions"]))
165
166 # Format best practices
167 if knowledge["best_practices"]:
168 formatted_parts.append("Best practices:\n- " + "\n- ".join(knowledge["best_practices"]))
169
170 return "\n\n".join(formatted_parts)
171
172 def enhance_prompt_with_domain_knowledge(self, query: str, system_prompt: str) -> str:
173 """Enhance a system prompt with relevant domain knowledge."""
174 # Detect relevant domains
175 domains = self.detect_domains(query)
176
177 if not domains:
178 return system_prompt
179
180 # Get domain knowledge
181 knowledge = self.get_domain_knowledge(domains)
182
183 # Format knowledge as context
184 knowledge_text = self.format_domain_knowledge(knowledge)
185
186 if not knowledge_text:
187 return system_prompt
188
189 # Add to system prompt
190 enhanced_prompt = f"{system_prompt}\n\nRelevant domain knowledge:\n{knowledge_text}"
191
192 return enhanced_prompt
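
The knowledge files are created empty, so it may help to see what a populated entry turns into. The sketch below uses an illustrative dictionary with the same four keys as the YAML skeleton and a compact `format_knowledge` function that mirrors `format_domain_knowledge`. The sample entries and the function name are placeholders for illustration, not content shipped with the service.

```python
from typing import Any, Dict, List

# Illustrative entry using the same keys as the generated YAML skeleton
sample_knowledge: Dict[str, Any] = {
    "concepts": {"idempotent": "an operation that has the same effect when repeated"},
    "facts": ["Python lists are zero-indexed."],
    "common_misconceptions": [],
    "best_practices": ["Validate input at system boundaries."],
}


def format_knowledge(knowledge: Dict[str, Any]) -> str:
    """Render a knowledge dict as the context text appended to a system prompt."""
    sections: List[str] = []

    if knowledge.get("concepts"):
        lines = [f"- {name}: {text}" for name, text in knowledge["concepts"].items()]
        sections.append("Key concepts:\n" + "\n".join(lines))

    labels = [
        ("facts", "Important facts"),
        ("common_misconceptions", "Common misconceptions to avoid"),
        ("best_practices", "Best practices"),
    ]
    for key, label in labels:
        # str() guards against non-string YAML values such as numbers
        items = [str(item) for item in knowledge.get(key) or []]
        if items:
            sections.append(f"{label}:\n" + "\n".join(f"- {item}" for item in items))

    return "\n\n".join(sections)


context = format_knowledge(sample_knowledge)
print(context)
```

Empty sections are omitted, so a sparsely filled file adds only the lines it actually contains to the prompt.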

5. Dynamic Few-Shot Learning

python
# app/services/few_shot_examples.py
import hashlib
import json
import logging
import os
import random
import re
from typing import Dict, List, Optional

logger = logging.getLogger(__name__)


class FewShotExampleService:
    """
    Enhances response accuracy using dynamic few-shot learning with examples.
    """

    def __init__(self, examples_dir: str = "examples"):
        self.examples_dir = examples_dir

        # Ensure examples directory exists
        os.makedirs(examples_dir, exist_ok=True)

        # Task categories for examples
        self.task_categories = {
            "code_generation": {
                "keywords": ["write code", "function", "implement", "program", "algorithm"],
                "patterns": [r"write a .* function", r"implement .* in (python|javascript|java|c\+\+)"]
            },
            "explanation": {
                "keywords": ["explain", "describe", "how does", "what is", "why is"],
                "patterns": [r"explain .* to me", r"what is the .* of", r"how does .* work"]
            },
            "classification": {
                "keywords": ["classify", "categorize", "identify", "is this", "determine"],
                "patterns": [r"is this .* or .*", r"which category", r"identify the .*"]
            },
            "comparison": {
                "keywords": ["compare", "contrast", "difference", "similarities", "versus"],
                "patterns": [r"compare .* and .*", r"what is the difference between", r".* vs .*"]
            },
            "summarization": {
                "keywords": ["summarize", "summary", "brief overview", "key points"],
                "patterns": [r"summarize .*", r"provide a summary", r"key points of"]
            }
        }

        # Load examples
        self.examples = self._load_examples()

    def _load_examples(self) -> Dict[str, List[Dict[str, str]]]:
        """Load examples from files."""
        examples = {category: [] for category in self.task_categories}

        # Load examples for each category
        for category in self.task_categories:
            category_file = os.path.join(self.examples_dir, f"{category}.json")

            if os.path.exists(category_file):
                try:
                    with open(category_file, "r", encoding="utf-8") as f:
                        examples[category] = json.load(f)
                except Exception as e:
                    logger.error(f"Error loading examples for {category}: {e}")

        return examples

    def detect_task_category(self, query: str) -> Optional[str]:
        """Detect the task category for a query."""
        query_lower = query.lower()

        # Categories are checked in definition order; the first match wins
        for category, definition in self.task_categories.items():
            # Check keywords
            if any(keyword in query_lower for keyword in definition["keywords"]):
                return category

            # Check regex patterns
            if any(re.search(pattern, query_lower) for pattern in definition["patterns"]):
                return category

        return None

    def select_examples(self,
                        query: str,
                        category: Optional[str] = None,
                        num_examples: int = 3) -> List[Dict[str, str]]:
        """Select the most relevant examples for a query."""
        # Detect category if not provided
        if not category:
            category = self.detect_task_category(query)

        if not category or category not in self.examples or not self.examples[category]:
            return []

        category_examples = self.examples[category]

        # If we have few examples, return all of them (as a copy, so callers
        # cannot mutate the stored collection)
        if len(category_examples) <= num_examples:
            return list(category_examples)

        # For simplicity, we're using random selection here
        # In a production system, this would use semantic similarity or other relevance metrics
        return random.sample(category_examples, num_examples)

    def format_examples_for_prompt(self, examples: List[Dict[str, str]]) -> str:
        """Format examples for inclusion in a prompt."""
        if not examples:
            return ""

        formatted_examples = []

        for i, example in enumerate(examples, 1):
            query = example.get("query", "")
            response = example.get("response", "")

            formatted = f"Example {i}:\n\nUser: {query}\n\nAssistant: {response}\n"
            formatted_examples.append(formatted)

        return "\n".join(formatted_examples)

    def enhance_prompt_with_examples(self,
                                     query: str,
                                     system_prompt: str,
                                     num_examples: int = 2) -> str:
        """Enhance a system prompt with few-shot examples."""
        # Select relevant examples
        examples = self.select_examples(query, num_examples=num_examples)

        if not examples:
            return system_prompt

        # Format examples
        examples_text = self.format_examples_for_prompt(examples)

        # Add to system prompt
        enhanced_prompt = f"{system_prompt}\n\nHere are some examples of how to respond to similar queries:\n\n{examples_text}"

        return enhanced_prompt

    def add_example(self, category: str, query: str, response: str) -> bool:
        """Add a new example to the examples collection."""
        if category not in self.task_categories:
            logger.error(f"Invalid category: {category}")
            return False

        example = {
            "query": query,
            "response": response,
            # MD5 is used only as a stable de-duplication key, not for security
            "id": hashlib.md5(f"{category}:{query}".encode()).hexdigest()
        }

        # Add to in-memory collection
        if category not in self.examples:
            self.examples[category] = []

        # Check if this example already exists
        existing_ids = [e.get("id") for e in self.examples[category]]
        if example["id"] in existing_ids:
            return False  # Example already exists

        self.examples[category].append(example)

        # Save to file
        try:
            category_file = os.path.join(self.examples_dir, f"{category}.json")
            with open(category_file, "w", encoding="utf-8") as f:
                json.dump(self.examples[category], f, indent=2)
            return True
        except Exception as e:
            logger.error(f"Error saving example: {e}")
            return False
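
The `select_examples` method above samples at random and notes that a production system would rank examples by relevance. The following is a minimal sketch of one such ranking, assuming plain token-overlap (Jaccard) similarity as a cheap stand-in for embedding-based similarity. The names `rank_examples_by_overlap` and `_tokens` are hypothetical and not part of the original service.

```python
import re
from typing import Dict, List, Set


def _tokens(text: str) -> Set[str]:
    """Lowercase word tokens; a deliberately crude stand-in for embeddings."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def rank_examples_by_overlap(query: str,
                             examples: List[Dict[str, str]],
                             num_examples: int = 3) -> List[Dict[str, str]]:
    """Return up to num_examples examples, most similar to the query first."""
    query_tokens = _tokens(query)

    def jaccard(example: Dict[str, str]) -> float:
        # Similarity is measured against the stored example's "query" field
        example_tokens = _tokens(example.get("query", ""))
        union = query_tokens | example_tokens
        if not union:
            return 0.0
        return len(query_tokens & example_tokens) / len(union)

    # sorted() is stable, so equally scored examples keep their stored order
    ranked = sorted(examples, key=jaccard, reverse=True)
    return ranked[:num_examples]
```

A function like this could replace the `random.sample` call once a category has more examples than requested. Token overlap ignores synonyms and word order, so treat it as a baseline rather than a substitute for semantic search.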

Deployment Strategies

Local Development Environment

Setup Script for Local Deployment

bash
#!/bin/bash
# local_setup.sh - Set up local development environment

set -e # Exit on error

# Check for required tools
echo "Checking prerequisites..."
command -v python3 >/dev/null 2>&1 || { echo "Python 3 is required but not installed. Aborting."; exit 1; }
command -v pip3 >/dev/null 2>&1 || { echo "pip3 is required but not installed. Aborting."; exit 1; }
command -v docker >/dev/null 2>&1 || { echo "Docker is required but not installed. Aborting."; exit 1; }

# Accept either the Compose plugin ("docker compose") or the standalone binary
if docker compose version >/dev/null 2>&1; then
  COMPOSE="docker compose"
elif command -v docker-compose >/dev/null 2>&1; then
  COMPOSE="docker-compose"
else
  echo "Docker Compose is required but not installed. Aborting."
  exit 1
fi

# Create virtual environment
echo "Creating Python virtual environment..."
python3 -m venv venv
source venv/bin/activate

# Install dependencies
echo "Installing Python dependencies..."
pip install --upgrade pip
pip install -r requirements.txt
pip install -r requirements-dev.txt

# Set up environment file
if [ ! -f .env ]; then
  echo "Creating .env file..."
  cp .env.example .env

  # Prompt for OpenAI API key
  read -r -p "Enter your OpenAI API key (leave blank to skip): " openai_key
  if [ -n "$openai_key" ]; then
    # "-i.bak" works with both GNU sed (Linux) and BSD sed (macOS)
    sed -i.bak "s|^OPENAI_API_KEY=.*|OPENAI_API_KEY=${openai_key}|" .env
  fi

  # Set environment to development
  sed -i.bak "s|^APP_ENV=.*|APP_ENV=development|" .env
  rm -f .env.bak

  echo ".env file created. Please review and update as needed."
else
  echo ".env file already exists. Skipping creation."
fi

# Check if Ollama is installed
if ! command -v ollama >/dev/null 2>&1; then
  echo "Ollama not found. Would you like to install it? (y/n)"
  read -r install_ollama

  if [ "$install_ollama" = "y" ]; then
    echo "Installing Ollama..."
    if [[ "$OSTYPE" == "darwin"* ]]; then
      # macOS: the install.sh script targets Linux, so use Homebrew here
      if command -v brew >/dev/null 2>&1; then
        brew install ollama
      else
        echo "Homebrew not found. Download Ollama for macOS from https://ollama.com/download"
      fi
    else
      # Linux
      curl -fsSL https://ollama.com/install.sh | sh
    fi
  else
    echo "Skipping Ollama installation. You will need to install it manually."
  fi
else
  echo "Ollama already installed."
fi

# Pull required Ollama models
if command -v ollama >/dev/null 2>&1; then
  echo "Would you like to pull the recommended Ollama models? (y/n)"
  read -r pull_models

  if [ "$pull_models" = "y" ]; then
    echo "Pulling Ollama models..."
    ollama pull llama2
    ollama pull mistral
    ollama pull codellama
  fi
fi

# Start Redis for development
echo "Starting Redis with Docker..."
$COMPOSE up -d redis

# Initialize database
echo "Initializing database..."
python scripts/init_db.py

# Run tests to verify setup
echo "Running tests to verify setup..."
pytest tests/unit

echo "Setup complete! You can now start the development server with:"
echo "uvicorn app.main:app --reload"

Docker Compose for Local Services

yaml
# docker-compose.yml
version: '3.8'

services:
  app:
    build:
      context: .
      dockerfile: Dockerfile.dev
    ports:
      - "8000:8000"
    volumes:
      - .:/app
    environment:
      - PYTHONPATH=/app
      - REDIS_URL=redis://redis:6379/0
      - OLLAMA_HOST=http://ollama:11434
      - APP_ENV=development
      - FORCE_DEV_MODE=true
    depends_on:
      - redis
      - ollama
    command: uvicorn app.main:app --host 0.0.0.0 --port 8000 --reload

  redis:
    image: redis:alpine
    ports:
      - "6379:6379"
    volumes:
      - redis_data:/data

  ollama:
    image: ollama/ollama:latest
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

  ui:
    build:
      context: ./ui
      dockerfile: Dockerfile.dev
    ports:
      - "3000:3000"
    volumes:
      - ./ui:/app
      - /app/node_modules
    environment:
      - API_URL=http://app:8000
    depends_on:
      - app
    command: npm start

volumes:
  redis_data:
  ollama_data:

Development Dockerfile

dockerfile
# Dockerfile.dev
FROM python:3.11-slim

WORKDIR /app

# Install system dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
    curl \
    gcc \
    build-essential \
    && rm -rf /var/lib/apt/lists/*

# Install Python dependencies
COPY requirements.txt requirements-dev.txt ./
RUN pip install --no-cache-dir -r requirements.txt -r requirements-dev.txt

# Copy application code
COPY . .

# Set development environment
ENV PYTHONUNBUFFERED=1
ENV PYTHONDONTWRITEBYTECODE=1
ENV APP_ENV=development

# Make scripts executable
RUN chmod +x scripts/*.sh

# Default command
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000", "--reload"]

Configuration for Local Environment

python
# app/config/local.py
"""Configuration for local development environment."""

import os

# API configuration
API_HOST = "0.0.0.0"
API_PORT = 8000
API_RELOAD = True
API_DEBUG = True

# OpenAI configuration
OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY", "")
OPENAI_ORG_ID = os.environ.get("OPENAI_ORG_ID", "")
OPENAI_MODEL = "gpt-3.5-turbo"  # Default to cheaper model for development

# Ollama configuration
OLLAMA_HOST = os.environ.get("OLLAMA_HOST", "http://localhost:11434")
OLLAMA_MODEL = "llama2"  # Default local model
ENABLE_GPU = True

# App configuration
LOG_LEVEL = "DEBUG"
ENABLE_CORS = True
CORS_ORIGINS = ["http://localhost:3000", "http://127.0.0.1:3000"]

# Feature flags
ENABLE_CACHING = True
ENABLE_RATE_LIMITING = False  # Disable rate limiting in local development
ENABLE_PARALLEL_PROCESSING = True
ENABLE_RESPONSE_VERIFICATION = True

# Development-specific settings
FORCE_DEV_MODE = os.environ.get("FORCE_DEV_MODE", "false").lower() == "true"
DEV_OPENAI_QUOTA = 100  # Maximum OpenAI API calls per day in development

# Redis configuration
REDIS_URL = os.environ.get("REDIS_URL", "redis://localhost:6379/0")

Production Deployment

Kubernetes Manifests for Production

yaml
# kubernetes/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mcp-api
  labels:
    app: mcp-api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: mcp-api
  template:
    metadata:
      labels:
        app: mcp-api
    spec:
      containers:
        - name: api
          image: ${DOCKER_REGISTRY}/mcp-api:${IMAGE_TAG}
          imagePullPolicy: Always
          ports:
            - containerPort: 8000
          env:
            - name: APP_ENV
              value: "production"
            - name: REDIS_URL
              valueFrom:
                secretKeyRef:
                  name: mcp-secrets
                  key: redis_url
            - name: OPENAI_API_KEY
              valueFrom:
                secretKeyRef:
                  name: mcp-secrets
                  key: openai_api_key
            - name: OLLAMA_HOST
              value: "http://ollama-service:11434"
            - name: MONTHLY_BUDGET
              value: "${MONTHLY_BUDGET}"
          resources:
            requests:
              cpu: 500m
              memory: 512Mi
            limits:
              cpu: 1000m
              memory: 1Gi
          readinessProbe:
            httpGet:
              path: /api/health
              port: 8000
            initialDelaySeconds: 10
            periodSeconds: 5
          livenessProbe:
            httpGet:
              path: /api/health
              port: 8000
            initialDelaySeconds: 20
            periodSeconds: 15
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
  labels:
    app: ollama
spec:
  replicas: 1 # Start with a single replica for Ollama
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama
    spec:
      containers:
        - name: ollama
          image: ollama/ollama:latest
          ports:
            - containerPort: 11434
          volumeMounts:
            - mountPath: /root/.ollama
              name: ollama-data
          resources:
            requests:
              cpu: 1000m
              memory: 4Gi
            limits:
              cpu: 4000m
              memory: 16Gi
          # If using GPU (scheduling onto a GPU node also requires an
          # nvidia.com/gpu resource limit provided by the NVIDIA device plugin)
          env:
            - name: NVIDIA_VISIBLE_DEVICES
              value: "all"
            - name: NVIDIA_DRIVER_CAPABILITIES
              value: "compute,utility"
      volumes:
        - name: ollama-data
          persistentVolumeClaim:
            claimName: ollama-pvc
---
apiVersion: v1
kind: Service
metadata:
  name: mcp-api-service
spec:
  selector:
    app: mcp-api
  ports:
    - port: 80
      targetPort: 8000
  type: ClusterIP
---
apiVersion: v1
kind: Service
metadata:
  name: ollama-service
spec:
  selector:
    app: ollama
  ports:
    - port: 11434
      targetPort: 11434
  type: ClusterIP
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: mcp-ingress
  annotations:
    cert-manager.io/cluster-issuer: "letsencrypt-prod"
spec:
  # Replaces the deprecated kubernetes.io/ingress.class annotation
  ingressClassName: nginx
  tls:
    - hosts:
        - api.mcpservice.com
      secretName: mcp-tls
  rules:
    - host: api.mcpservice.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: mcp-api-service
                port:
                  number: 80
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ollama-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 50Gi # Adjust based on your models

Horizontal Pod Autoscaling (HPA)

yaml
# kubernetes/hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: mcp-api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: mcp-api
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
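
To make the effect of these targets concrete, the sketch below works through the replica arithmetic the Kubernetes documentation describes for the Horizontal Pod Autoscaler: `desiredReplicas = ceil(currentReplicas * currentMetric / targetMetric)`, evaluated per metric, with the largest proposal winning and the result clamped to `minReplicas` and `maxReplicas`. It is a simplified illustration, not the controller's implementation: it ignores unready pods, missing metrics, and stabilization windows, and the function name `desired_replicas` is hypothetical. The 10% tolerance is the documented default.

```python
import math
from typing import Dict


def desired_replicas(current_replicas: int,
                     utilization: Dict[str, float],
                     targets: Dict[str, float],
                     min_replicas: int = 3,
                     max_replicas: int = 10,
                     tolerance: float = 0.1) -> int:
    """Approximate the HPA replica calculation for utilization targets."""
    proposals = []
    for name, target in targets.items():
        ratio = utilization[name] / target
        if abs(ratio - 1.0) <= tolerance:
            # Within tolerance: this metric proposes no change
            proposals.append(current_replicas)
        else:
            proposals.append(math.ceil(current_replicas * ratio))

    # With several metrics, the largest proposal wins, then bounds apply
    return max(min_replicas, min(max_replicas, max(proposals)))


if __name__ == "__main__":
    targets = {"cpu": 70, "memory": 80}  # values from hpa.yaml
    print(desired_replicas(3, {"cpu": 105, "memory": 60}, targets))
```

With three replicas averaging 105% CPU against the 70% target, the CPU metric proposes `ceil(3 * 1.5) = 5` replicas. Memory at 60% proposes fewer, so the CPU proposal is used.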

Deployment Script

bash
#!/bin/bash
# deploy.sh - Production deployment script

set -e # Exit on error

# Check required environment variables (every variable the script or the
# manifest template consumes)
for var in DOCKER_REGISTRY IMAGE_TAG K8S_NAMESPACE REDIS_PASSWORD MONTHLY_BUDGET; do
  if [ -z "${!var}" ]; then
    echo "Error: Required environment variable $var is not set."
    echo "Please set DOCKER_REGISTRY, IMAGE_TAG, K8S_NAMESPACE, REDIS_PASSWORD, and MONTHLY_BUDGET."
    exit 1
  fi
done

# Build and push Docker image
echo "Building and pushing Docker image..."
docker build -t "${DOCKER_REGISTRY}/mcp-api:${IMAGE_TAG}" -f Dockerfile.prod .
docker push "${DOCKER_REGISTRY}/mcp-api:${IMAGE_TAG}"

# Apply Kubernetes configuration
echo "Applying Kubernetes configuration..."

# Create namespace if it doesn't exist
kubectl get namespace "${K8S_NAMESPACE}" >/dev/null 2>&1 || kubectl create namespace "${K8S_NAMESPACE}"

# Apply secrets
echo "Applying secrets..."
kubectl apply -f kubernetes/secrets.yaml -n "${K8S_NAMESPACE}"

# Deploy Redis if needed
echo "Deploying Redis..."
helm upgrade --install redis bitnami/redis \
  --namespace "${K8S_NAMESPACE}" \
  --set auth.password="${REDIS_PASSWORD}" \
  --set master.persistence.size=8Gi

# Deploy application
echo "Deploying application..."
# Replace only the listed variables in the deployment file
envsubst '${DOCKER_REGISTRY} ${IMAGE_TAG} ${MONTHLY_BUDGET}' < kubernetes/deployment.yaml \
  | kubectl apply -f - -n "${K8S_NAMESPACE}"

# Apply HPA
kubectl apply -f kubernetes/hpa.yaml -n "${K8S_NAMESPACE}"

# Verify deployment
echo "Verifying deployment..."
kubectl rollout status deployment/mcp-api -n "${K8S_NAMESPACE}"
kubectl rollout status deployment/ollama -n "${K8S_NAMESPACE}"

# Initialize Ollama models if needed
echo "Would you like to initialize Ollama models? (y/n)"
read -r init_models

if [ "$init_models" = "y" ]; then
  echo "Initializing Ollama models..."
  # Get pod name
  OLLAMA_POD=$(kubectl get pods -l app=ollama -n "${K8S_NAMESPACE}" -o jsonpath="{.items[0].metadata.name}")

  # Pull models
  kubectl exec "${OLLAMA_POD}" -n "${K8S_NAMESPACE}" -- ollama pull llama2
  kubectl exec "${OLLAMA_POD}" -n "${K8S_NAMESPACE}" -- ollama pull mistral
  kubectl exec "${OLLAMA_POD}" -n "${K8S_NAMESPACE}" -- ollama pull codellama
fi

echo "Deployment complete!"
echo "API available at: https://api.mcpservice.com"
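
The deployment script pipes `kubernetes/deployment.yaml` through `envsubst`, which substitutes an empty string for any variable that is unset. A pre-flight check can catch that before anything reaches the cluster. The sketch below is one way to do it in Python; `find_unset_placeholders` is a hypothetical helper, and it only recognizes the braced `${NAME}` form that the manifest uses.

```python
import os
import re
from typing import List, Mapping, Optional

# Matches ${NAME} placeholders as they appear in the manifest template
PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def find_unset_placeholders(manifest_text: str,
                            env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return sorted placeholder names that are missing or empty in env."""
    env = os.environ if env is None else env
    names = sorted(set(PLACEHOLDER.findall(manifest_text)))
    return [name for name in names if not env.get(name)]


if __name__ == "__main__":
    sample = 'image: ${DOCKER_REGISTRY}/mcp-api:${IMAGE_TAG}\nvalue: "${MONTHLY_BUDGET}"\n'
    missing = find_unset_placeholders(sample)
    if missing:
        raise SystemExit("Unset variables: " + ", ".join(missing))
```

Running a check like this against the real manifest before `envsubst` makes the script fail early instead of applying an image reference such as `/mcp-api:`.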

Production Dockerfile

dockerfile
# Dockerfile.prod
FROM python:3.11-slim AS builder

WORKDIR /app

# Install build dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
    gcc \
    build-essential \
    && rm -rf /var/lib/apt/lists/*

# Build wheels for the requirements and their dependencies, so the final
# stage does not need a compiler
COPY requirements.txt ./
RUN pip wheel --no-cache-dir --wheel-dir /app/wheels -r requirements.txt

# Final stage
FROM python:3.11-slim

WORKDIR /app

# Copy wheels from builder stage
COPY --from=builder /app/wheels /wheels
RUN pip install --no-cache-dir /wheels/* && rm -rf /wheels

# Copy application code
COPY app /app/app
COPY scripts /app/scripts
COPY alembic.ini /app/

# Create non-root user
RUN useradd -m appuser && \
    chown -R appuser:appuser /app
USER appuser

# Set production environment
ENV PYTHONPATH=/app
ENV APP_ENV=production
ENV PYTHONUNBUFFERED=1

# Expose port
EXPOSE 8000

# Run using Gunicorn in production
CMD ["gunicorn", "-k", "uvicorn.workers.UvicornWorker", "-c", "app/config/gunicorn.py", "app.main:app"]

Gunicorn Configuration for Production

python
# app/config/gunicorn.py
"""Gunicorn configuration for production deployment."""

import multiprocessing
import os

# Bind to 0.0.0.0:8000
bind = "0.0.0.0:8000"

# Worker configuration
workers = multiprocessing.cpu_count() * 2 + 1
worker_class = "uvicorn.workers.UvicornWorker"
worker_connections = 1000
timeout = 60
keepalive = 5

# Logging
accesslog = "-"
errorlog = "-"
loglevel = os.environ.get("LOG_LEVEL", "info").lower()

# Security
limit_request_line = 4094
limit_request_fields = 100
limit_request_field_size = 8190

# Process naming
proc_name = "mcp-api"
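
One caveat with `multiprocessing.cpu_count() * 2 + 1`: inside a container, `cpu_count()` reports the host's CPUs rather than the container's CPU limit, so a pod limited to one CPU on a 32-core node would start 65 workers. The sketch below shows one way to bound the value. It assumes an explicit override through `WEB_CONCURRENCY` (an environment variable Gunicorn itself reads for its default worker count) and a hypothetical `MAX_WORKERS` cap that is not part of the original configuration.

```python
import multiprocessing
import os
from typing import Mapping, Optional


def worker_count(environ: Optional[Mapping[str, str]] = None,
                 cpu_count: Optional[int] = None) -> int:
    """Pick a Gunicorn worker count that respects explicit limits."""
    environ = os.environ if environ is None else environ
    cpus = multiprocessing.cpu_count() if cpu_count is None else cpu_count
    default = cpus * 2 + 1

    # An explicit override always wins
    override = environ.get("WEB_CONCURRENCY")
    if override:
        return max(1, int(override))

    # MAX_WORKERS is a hypothetical cap, e.g. set alongside the pod's CPU limit
    cap = int(environ.get("MAX_WORKERS", default))
    return max(1, min(default, cap))


# In app/config/gunicorn.py this could replace the fixed formula:
workers = worker_count()
```

Setting `MAX_WORKERS=3` in the pod spec would keep the worker count in line with a one-CPU limit while leaving bare-metal deployments on the original formula.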

Cloud Deployment (AWS)

AWS CloudFormation Template

yaml
# aws/cloudformation.yaml
AWSTemplateFormatVersion: '2010-09-09'
Description: 'MCP OpenAI-Ollama Hybrid System'

Parameters:
  Environment:
    Description: Deployment environment
    Type: String
    Default: Production
    AllowedValues:
      - Development
      - Staging
      - Production

  ECRRepositoryName:
    Description: ECR Repository name
    Type: String
    Default: mcp-api

  VpcId:
    Description: VPC ID
    Type: AWS::EC2::VPC::Id

  SubnetIds:
    Description: Subnet IDs for the ECS tasks
    Type: List<AWS::EC2::Subnet::Id>

  OllamaInstanceType:
    Description: EC2 instance type for Ollama
    Type: String
    Default: g4dn.xlarge
    AllowedValues:
      - g4dn.xlarge
      - g5.xlarge
      - p3.2xlarge
      - c5.2xlarge # CPU-only option

  ApiInstanceCount:
    Description: Number of API instances
    Type: Number
    Default: 2
    MinValue: 1
    MaxValue: 10

Resources:
  # ECR Repository
  ECRRepository:
    Type: AWS::ECR::Repository
    Properties:
      RepositoryName: !Ref ECRRepositoryName
      ImageScanningConfiguration:
        ScanOnPush: true
      LifecyclePolicy:
        LifecyclePolicyText: |
          {
            "rules": [
              {
                "rulePriority": 1,
                "description": "Keep only the 10 most recent images",
                "selection": {
                  "tagStatus": "any",
                  "countType": "imageCountMoreThan",
                  "countNumber": 10
                },
                "action": {
                  "type": "expire"
                }
              }
            ]
          }

  # ElastiCache Redis
  RedisSecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: Security group for Redis cluster
      VpcId: !Ref VpcId
      SecurityGroupIngress:
        - IpProtocol: tcp
          FromPort: 6379
          ToPort: 6379
          SourceSecurityGroupId: !GetAtt APISecurityGroup.GroupId

  RedisSubnetGroup:
    Type: AWS::ElastiCache::SubnetGroup
    Properties:
      Description: Subnet group for Redis
      SubnetIds: !Ref SubnetIds

  RedisCluster:
    Type: AWS::ElastiCache::CacheCluster
    Properties:
      Engine: redis
      CacheNodeType: cache.t3.medium
      NumCacheNodes: 1
      VpcSecurityGroupIds:
        - !GetAtt RedisSecurityGroup.GroupId
      CacheSubnetGroupName: !Ref RedisSubnetGroup
      AutoMinorVersionUpgrade: true

  # Ollama EC2 Instance
  OllamaSecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: Security group for Ollama EC2 instance
      VpcId: !Ref VpcId
      SecurityGroupIngress:
        - IpProtocol: tcp
          FromPort: 11434
          ToPort: 11434
          SourceSecurityGroupId: !GetAtt APISecurityGroup.GroupId
        - IpProtocol: tcp
          FromPort: 22
          ToPort: 22
          CidrIp: 0.0.0.0/0 # Restrict this in production

  OllamaInstanceRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service: ec2.amazonaws.com
            Action: sts:AssumeRole
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore

  OllamaInstanceProfile:
    Type: AWS::IAM::InstanceProfile
    Properties:
      Roles:
        - !Ref OllamaInstanceRole

  OllamaEBSVolume:
    Type: AWS::EC2::Volume
    Properties:
      # The volume must live in the same Availability Zone as the instance
      AvailabilityZone: !GetAtt OllamaInstance.AvailabilityZone
      Size: 100
      VolumeType: gp3
      Encrypted: true
      Tags:
        - Key: Name
          Value: OllamaVolume

  OllamaInstance:
    Type: AWS::EC2::Instance
    Properties:
      InstanceType: !Ref OllamaInstanceType
      ImageId: ami-0261755bbcb8c4a84 # Amazon Linux 2 AMI - update as needed
      SecurityGroupIds:
        - !GetAtt OllamaSecurityGroup.GroupId
      SubnetId: !Select [0, !Ref SubnetIds]
      IamInstanceProfile: !Ref OllamaInstanceProfile
      BlockDeviceMappings:
        - DeviceName: /dev/xvda
          Ebs:
            VolumeSize: 30
            VolumeType: gp3
            DeleteOnTermination: true
      UserData:
        Fn::Base64: !Sub |
          #!/bin/bash
          # Install Docker
          amazon-linux-extras install docker -y
          systemctl start docker
          systemctl enable docker

          # Run Ollama in Docker (a host-level install would also bind
          # port 11434 and conflict with the container)
          docker run -d --name ollama \
            -p 11434:11434 \
            -v ollama:/root/.ollama \
            ollama/ollama

          # Wait until the server inside the container answers
          until docker exec ollama ollama list >/dev/null 2>&1; do sleep 2; done

          # Pull models
          docker exec ollama ollama pull llama2
          docker exec ollama ollama pull mistral
          docker exec ollama ollama pull codellama
      Tags:
        - Key: Name
          Value: !Sub "${AWS::StackName}-ollama"

  OllamaVolumeAttachment:
    Type: AWS::EC2::VolumeAttachment
    Properties:
      InstanceId: !Ref OllamaInstance
      VolumeId: !Ref OllamaEBSVolume
      Device: /dev/sdf

  # API ECS Cluster
  ECSCluster:
    Type: AWS::ECS::Cluster
    Properties:
      ClusterName: !Sub "${AWS::StackName}-cluster"
      CapacityProviders:
        - FARGATE
      DefaultCapacityProviderStrategy:
        - CapacityProvider: FARGATE
          Weight: 1

  APISecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: Security group for API ECS tasks
      VpcId: !Ref VpcId
      SecurityGroupIngress:
        - IpProtocol: tcp
          FromPort: 8000
          ToPort: 8000
          CidrIp: 0.0.0.0/0 # Restrict in production

  # ECS Task Definition
  ECSTaskExecutionRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service: ecs-tasks.amazonaws.com
            Action: sts:AssumeRole
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy

  ECSTaskRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service: ecs-tasks.amazonaws.com
            Action: sts:AssumeRole
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess

  APITaskDefinition:
    Type: AWS::ECS::TaskDefinition
    Properties:
      Family: !Sub "${AWS::StackName}-api"
      Cpu: '1024'
      Memory: '2048'
      NetworkMode: awsvpc
      RequiresCompatibilities:
        - FARGATE
      ExecutionRoleArn: !GetAtt ECSTaskExecutionRole.Arn
      TaskRoleArn: !GetAtt ECSTaskRole.Arn
      ContainerDefinitions:
        - Name: api
          Image: !Sub "${AWS::AccountId}.dkr.ecr.${AWS::Region}.amazonaws.com/${ECRRepositoryName}:latest"
          Essential: true
          PortMappings:
            - ContainerPort: 8000
          Environment:
            - Name: REDIS_URL
              Value: !Sub "redis://${RedisCluster.RedisEndpoint.Address}:${RedisCluster.RedisEndpoint.Port}/0"
            - Name: OLLAMA_HOST
              Value: !Sub "http://${OllamaInstance.PrivateIp}:11434"
            - Name: APP_ENV
              Value: !Ref Environment
          LogConfiguration:
            LogDriver: awslogs
            Options:
              awslogs-group: !Ref APILogGroup
              awslogs-region: !Ref AWS::Region
              awslogs-stream-prefix: api
          HealthCheck:
            # python:3.11-slim ships without curl, so probe with the standard library
            Command:
              - CMD-SHELL
              - "python -c \"import urllib.request; urllib.request.urlopen('http://localhost:8000/api/health', timeout=3)\" || exit 1"
            Interval: 30
            Timeout: 5
            Retries: 3

  APILogGroup:
    Type: AWS::Logs::LogGroup
    Properties:
      LogGroupName: !Sub "/ecs/${AWS::StackName}-api"
      RetentionInDays: 7

  # ECS Service
  APIService:
    Type: AWS::ECS::Service
    Properties:
      ServiceName: !Sub "${AWS::StackName}-api"
      Cluster: !Ref ECSCluster
      TaskDefinition: !Ref APITaskDefinition
      DesiredCount: !Ref ApiInstanceCount
      LaunchType: FARGATE
      NetworkConfiguration:
        AwsvpcConfiguration:
          AssignPublicIp: ENABLED
          SecurityGroups:
            - !GetAtt APISecurityGroup.GroupId
          Subnets: !Ref SubnetIds
      LoadBalancers:
        - TargetGroupArn: !Ref ALBTargetGroup
          ContainerName: api
          ContainerPort: 8000
    DependsOn: ALBListener

  # Application Load Balancer
  ALB:
    Type: AWS::ElasticLoadBalancingV2::LoadBalancer
    Properties:
      Name: !Sub "${AWS::StackName}-alb"
      Type: application
      Scheme: internet-facing
      SecurityGroups:
        - !GetAtt ALBSecurityGroup.GroupId
      Subnets: !Ref SubnetIds
      LoadBalancerAttributes:
        - Key: idle_timeout.timeout_seconds
          Value: '60'

  ALBSecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: Security group for ALB
      VpcId: !Ref VpcId
      SecurityGroupIngress:
        - IpProtocol: tcp
          FromPort: 80
          ToPort: 80
          CidrIp: 0.0.0.0/0
        - IpProtocol: tcp
          FromPort: 443
          ToPort: 443
          CidrIp: 0.0.0.0/0

  ALBTargetGroup:
    Type: AWS::ElasticLoadBalancingV2::TargetGroup
    Properties:
      Name: !Sub "${AWS::StackName}-target-group"
      Port: 8000
      Protocol: HTTP
      TargetType: ip
      VpcId: !Ref VpcId
      HealthCheckPath: /api/health
      HealthCheckIntervalSeconds: 30
      HealthCheckTimeoutSeconds: 5
      HealthyThresholdCount: 3
      UnhealthyThresholdCount: 3

  ALBListener:
    Type: AWS::ElasticLoadBalancingV2::Listener
    Properties:
      LoadBalancerArn: !Ref ALB
      Port: 80
      Protocol: HTTP
      DefaultActions:
        - Type: forward
          TargetGroupArn: !Ref ALBTargetGroup

Outputs:
  APIEndpoint:
    Description: URL for API
    Value: !Sub "http://${ALB.DNSName}"

  OllamaEndpoint:
    Description: Ollama Server Private IP
    Value: !GetAtt OllamaInstance.PrivateIp

  ECRRepository:
    Description: ECR Repository URL
    Value: !Sub "${AWS::AccountId}.dkr.ecr.${AWS::Region}.amazonaws.com/${ECRRepositoryName}"

  RedisEndpoint:
    Description: Redis Endpoint
    Value: !Sub "${RedisCluster.RedisEndpoint.Address}:${RedisCluster.RedisEndpoint.Port}"

AWS Deployment Script

bash
#!/bin/bash
# aws_deploy.sh - AWS deployment script

set -e # Exit on error

# Check required AWS CLI
if ! command -v aws &> /dev/null; then
  echo "AWS CLI is required but not installed. Aborting."
  exit 1
fi

# AWS configuration
AWS_REGION="us-east-1"
STACK_NAME="mcp-hybrid-system"
CFN_TEMPLATE="aws/cloudformation.yaml"
IMAGE_TAG=$(git rev-parse --short HEAD)

# The template's VpcId and SubnetIds parameters have no defaults.
# VPC_ID and SUBNET_IDS (comma-separated) are assumed to be set by the caller.
: "${VPC_ID:?Set VPC_ID to the target VPC ID}"
: "${SUBNET_IDS:?Set SUBNET_IDS to a comma-separated list of subnet IDs}"

# Commas inside a list value must be escaped in the CLI shorthand syntax
STACK_PARAMETERS=(
  "ParameterKey=Environment,ParameterValue=Production"
  "ParameterKey=OllamaInstanceType,ParameterValue=g4dn.xlarge"
  "ParameterKey=ApiInstanceCount,ParameterValue=2"
  "ParameterKey=VpcId,ParameterValue=${VPC_ID}"
  "ParameterKey=SubnetIds,ParameterValue=${SUBNET_IDS//,/\\,}"
)

# Check if stack exists
if aws cloudformation describe-stacks --stack-name "$STACK_NAME" --region "$AWS_REGION" &> /dev/null; then
  STACK_ACTION="update"
else
  STACK_ACTION="create"
fi

# Deploy CloudFormation stack
if [ "$STACK_ACTION" = "create" ]; then
  # Note: the stack's ECS service references the :latest image, so on a first
  # run the service may not become healthy until an image exists in ECR.
  echo "Creating CloudFormation stack..."
  aws cloudformation create-stack \
    --stack-name "$STACK_NAME" \
    --template-body "file://$CFN_TEMPLATE" \
    --capabilities CAPABILITY_IAM \
    --parameters "${STACK_PARAMETERS[@]}" \
    --region "$AWS_REGION"

  # Wait for stack creation
  echo "Waiting for stack creation to complete..."
  aws cloudformation wait stack-create-complete \
    --stack-name "$STACK_NAME" \
    --region "$AWS_REGION"
else
  echo "Updating CloudFormation stack..."
  aws cloudformation update-stack \
    --stack-name "$STACK_NAME" \
    --template-body "file://$CFN_TEMPLATE" \
    --capabilities CAPABILITY_IAM \
    --parameters "${STACK_PARAMETERS[@]}" \
    --region "$AWS_REGION"

  # Wait for stack update
  echo "Waiting for stack update to complete..."
  aws cloudformation wait stack-update-complete \
    --stack-name "$STACK_NAME" \
    --region "$AWS_REGION"
fi

# Get stack outputs
echo "Getting stack outputs..."
ECR_REPOSITORY=$(aws cloudformation describe-stacks \
  --stack-name "$STACK_NAME" \
  --query "Stacks[0].Outputs[?OutputKey=='ECRRepository'].OutputValue" \
  --output text \
  --region "$AWS_REGION")

API_ENDPOINT=$(aws cloudformation describe-stacks \
  --stack-name "$STACK_NAME" \
  --query "Stacks[0].Outputs[?OutputKey=='APIEndpoint'].OutputValue" \
  --output text \
  --region "$AWS_REGION")

# Build and push Docker image
echo "Building and pushing Docker image to ECR..."
# Login to ECR (docker login expects the registry host, not the repository path)
ECR_REGISTRY="${ECR_REPOSITORY%%/*}"
aws ecr get-login-password --region "$AWS_REGION" | docker login --username AWS --password-stdin "$ECR_REGISTRY"

# Build and push
docker build -t "$ECR_REPOSITORY:$IMAGE_TAG" -t "$ECR_REPOSITORY:latest" -f Dockerfile.prod .
docker push "$ECR_REPOSITORY:$IMAGE_TAG"
docker push "$ECR_REPOSITORY:latest"

# Update ECS service to force deployment
echo "Updating ECS service..."
ECS_CLUSTER="${STACK_NAME}-cluster"
ECS_SERVICE="${STACK_NAME}-api"

aws ecs update-service \
  --cluster "$ECS_CLUSTER" \
  --service "$ECS_SERVICE" \
  --force-new-deployment \
  --region "$AWS_REGION"

echo "Deployment complete!"
echo "API Endpoint: $API_ENDPOINT"

Optimization and Deployment Strategies for OpenAI-Ollama Hybrid AI System (Continued)

Monitoring and Observability Configuration

Prometheus and Grafana Setup for Metrics

yaml
# monitoring/prometheus-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
      evaluation_interval: 15s

    scrape_configs:
      - job_name: 'mcp-api'
        metrics_path: /metrics
        kubernetes_sd_configs:
          - role: pod
        relabel_configs:
          - source_labels: [__meta_kubernetes_pod_label_app]
            regex: mcp-api
            action: keep

      # Assumes Prometheus-format metrics are served at this address;
      # confirm that your Ollama deployment exposes /metrics
      - job_name: 'ollama'
        metrics_path: /metrics
        static_configs:
          - targets: ['ollama-service:11434']

    alerting:
      alertmanagers:
        - static_configs:
            - targets: ['alertmanager:9093']
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prometheus
  template:
    metadata:
      labels:
        app: prometheus
    spec:
      containers:
        - name: prometheus
          image: prom/prometheus:v2.42.0
          ports:
            - containerPort: 9090
          volumeMounts:
            - name: config-volume
              mountPath: /etc/prometheus
            - name: prometheus-data
              mountPath: /prometheus
          args:
            - "--config.file=/etc/prometheus/prometheus.yml"
            - "--storage.tsdb.path=/prometheus"
            - "--web.console.libraries=/usr/share/prometheus/console_libraries"
            - "--web.console.templates=/usr/share/prometheus/consoles"
            - "--web.enable-lifecycle"
      volumes:
        - name: config-volume
          configMap:
            name: prometheus-config
        - name: prometheus-data
          persistentVolumeClaim:
            claimName: prometheus-pvc
---
apiVersion: v1
kind: Service
metadata:
  name: prometheus-service
spec:
  selector:
    app: prometheus
  ports:
    - port: 9090
      targetPort: 9090
  type: ClusterIP
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: grafana
spec:
  replicas: 1
  selector:
    matchLabels:
      app: grafana
  template:
    metadata:
      labels:
        app: grafana
    spec:
      containers:
        - name: grafana
          image: grafana/grafana:9.4.7
          ports:
            - containerPort: 3000
          volumeMounts:
            - name: grafana-data
              mountPath: /var/lib/grafana
          env:
            - name: GF_SECURITY_ADMIN_USER
              valueFrom:
                secretKeyRef:
                  name: grafana-secrets
                  key: admin_user
            - name: GF_SECURITY_ADMIN_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: grafana-secrets
                  key: admin_password
      volumes:
        - name: grafana-data
          persistentVolumeClaim:
            claimName: grafana-pvc
---
apiVersion: v1
kind: Service
metadata:
  name: grafana-service
spec:
  selector:
    app: grafana
  ports:
    - port: 3000
      targetPort: 3000
  type: ClusterIP
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: prometheus-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: grafana-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 5Gi

Grafana Dashboard Configuration

json
1{
2 "annotations": {
3 "list": [
4 {
5 "builtIn": 1,
6 "datasource": "-- Grafana --",
7 "enable": true,
8 "hide": true,
9 "iconColor": "rgba(0, 211, 255, 1)",
10 "name": "Annotations & Alerts",
11 "type": "dashboard"
12 }
13 ]
14 },
15 "editable": true,
16 "gnetId": null,
17 "graphTooltip": 0,
18 "id": 1,
19 "links": [],
20 "panels": [
21 {
22 "aliasColors": {},
23 "bars": false,
24 "dashLength": 10,
25 "dashes": false,
26 "datasource": "Prometheus",
27 "fieldConfig": {
28 "defaults": {
29 "custom": {}
30 },
31 "overrides": []
32 },
33 "fill": 1,
34 "fillGradient": 0,
35 "gridPos": {
36 "h": 8,
37 "w": 12,
38 "x": 0,
39 "y": 0
40 },
41 "hiddenSeries": false,
42 "id": 2,
43 "legend": {
44 "avg": false,
45 "current": false,
46 "max": false,
47 "min": false,
48 "show": true,
49 "total": false,
50 "values": false
51 },
52 "lines": true,
53 "linewidth": 1,
54 "nullPointMode": "null",
55 "options": {
56 "alertThreshold": true
57 },
58 "percentage": false,
59 "pluginVersion": "7.2.0",
60 "pointradius": 2,
61 "points": false,
62 "renderer": "flot",
63 "seriesOverrides": [],
64 "spaceLength": 10,
65 "stack": false,
66 "steppedLine": false,
67 "targets": [
68 {
69 "expr": "rate(api_requests_total[5m])",
70 "interval": "",
71 "legendFormat": "Requests ({{provider}})",
72 "refId": "A"
73 }
74 ],
75 "thresholds": [],
76 "timeFrom": null,
77 "timeRegions": [],
78 "timeShift": null,
79 "title": "Request Rate by Provider",
80 "tooltip": {
81 "shared": true,
82 "sort": 0,
83 "value_type": "individual"
84 },
85 "type": "graph",
86 "xaxis": {
87 "buckets": null,
88 "mode": "time",
89 "name": null,
90 "show": true,
91 "values": []
92 },
93 "yaxes": [
94 {
95 "format": "short",
96 "label": "Requests/sec",
97 "logBase": 1,
98 "max": null,
99 "min": null,
100 "show": true
101 },
102 {
103 "format": "short",
104 "label": null,
105 "logBase": 1,
106 "max": null,
107 "min": null,
108 "show": true
109 }
110 ],
111 "yaxis": {
112 "align": false,
113 "alignLevel": null
114 }
115 },
116 {
117 "aliasColors": {},
118 "bars": false,
119 "dashLength": 10,
120 "dashes": false,
121 "datasource": "Prometheus",
122 "fieldConfig": {
123 "defaults": {
124 "custom": {}
125 },
126 "overrides": []
127 },
128 "fill": 1,
129 "fillGradient": 0,
130 "gridPos": {
131 "h": 8,
132 "w": 12,
133 "x": 12,
134 "y": 0
135 },
136 "hiddenSeries": false,
137 "id": 3,
138 "legend": {
139 "avg": false,
140 "current": false,
141 "max": false,
142 "min": false,
143 "show": true,
144 "total": false,
145 "values": false
146 },
147 "lines": true,
148 "linewidth": 1,
149 "nullPointMode": "null",
150 "options": {
151 "alertThreshold": true
152 },
153 "percentage": false,
154 "pluginVersion": "7.2.0",
155 "pointradius": 2,
156 "points": false,
157 "renderer": "flot",
158 "seriesOverrides": [],
159 "spaceLength": 10,
160 "stack": false,
161 "steppedLine": false,
162 "targets": [
163 {
164 "expr": "histogram_quantile(0.5, sum(rate(api_response_time_seconds_bucket[5m])) by (le, provider))",
165 "interval": "",
166 "legendFormat": "50th % ({{provider}})",
167 "refId": "A"
168 },
169 {
170 "expr": "histogram_quantile(0.9, sum(rate(api_response_time_seconds_bucket[5m])) by (le, provider))",
171 "interval": "",
172 "legendFormat": "90th % ({{provider}})",
173 "refId": "B"
174 },
175 {
176 "expr": "histogram_quantile(0.99, sum(rate(api_response_time_seconds_bucket[5m])) by (le, provider))",
177 "interval": "",
178 "legendFormat": "99th % ({{provider}})",
179 "refId": "C"
180 }
181 ],
182 "thresholds": [],
183 "timeFrom": null,
184 "timeRegions": [],
185 "timeShift": null,
186 "title": "Response Time by Provider",
187 "tooltip": {
188 "shared": true,
189 "sort": 0,
190 "value_type": "individual"
191 },
192 "type": "graph",
193 "xaxis": {
194 "buckets": null,
195 "mode": "time",
196 "name": null,
197 "show": true,
198 "values": []
199 },
200 "yaxes": [
201 {
202 "format": "s",
203 "label": "Response Time",
204 "logBase": 1,
205 "max": null,
206 "min": null,
207 "show": true
208 },
209 {
210 "format": "short",
211 "label": null,
212 "logBase": 1,
213 "max": null,
214 "min": null,
215 "show": true
216 }
217 ],
218 "yaxis": {
219 "align": false,
220 "alignLevel": null
221 }
222 },
223 {
224 "datasource": "Prometheus",
225 "fieldConfig": {
226 "defaults": {
227 "custom": {},
228 "mappings": [],
229 "thresholds": {
230 "mode": "absolute",
231 "steps": [
232 {
233 "color": "green",
234 "value": null
235 },
236 {
237 "color": "red",
238 "value": 80
239 }
240 ]
241 }
242 },
243 "overrides": []
244 },
245 "gridPos": {
246 "h": 8,
247 "w": 8,
248 "x": 0,
249 "y": 8
250 },
251 "id": 4,
252 "options": {
253 "colorMode": "value",
254 "graphMode": "area",
255 "justifyMode": "auto",
256 "orientation": "auto",
257 "reduceOptions": {
258 "calcs": [
259 "mean"
260 ],
261 "fields": "",
262 "values": false
263 },
264 "textMode": "auto"
265 },
266 "pluginVersion": "7.2.0",
267 "targets": [
268 {
269 "expr": "sum(api_requests_total{provider=\"openai\"})",
270 "interval": "",
271 "legendFormat": "",
272 "refId": "A"
273 }
274 ],
275 "timeFrom": null,
276 "timeShift": null,
277 "title": "OpenAI Total Requests",
278 "type": "stat"
279 },
280 {
281 "datasource": "Prometheus",
282 "fieldConfig": {
283 "defaults": {
284 "custom": {},
285 "mappings": [],
286 "thresholds": {
287 "mode": "absolute",
288 "steps": [
289 {
290 "color": "green",
291 "value": null
292 },
293 {
294 "color": "red",
295 "value": 80
296 }
297 ]
298 }
299 },
300 "overrides": []
301 },
302 "gridPos": {
303 "h": 8,
304 "w": 8,
305 "x": 8,
306 "y": 8
307 },
308 "id": 5,
309 "options": {
310 "colorMode": "value",
311 "graphMode": "area",
312 "justifyMode": "auto",
313 "orientation": "auto",
314 "reduceOptions": {
315 "calcs": [
316 "mean"
317 ],
318 "fields": "",
319 "values": false
320 },
321 "textMode": "auto"
322 },
323 "pluginVersion": "7.2.0",
324 "targets": [
325 {
326 "expr": "sum(api_requests_total{provider=\"ollama\"})",
327 "interval": "",
328 "legendFormat": "",
329 "refId": "A"
330 }
331 ],
332 "timeFrom": null,
333 "timeShift": null,
334 "title": "Ollama Total Requests",
335 "type": "stat"
336 },
337 {
338 "datasource": "Prometheus",
339 "fieldConfig": {
340 "defaults": {
341 "custom": {},
342 "mappings": [],
343 "thresholds": {
344 "mode": "absolute",
345 "steps": [
346 {
347 "color": "green",
348 "value": null
349 },
350 {
351 "color": "red",
352 "value": 80
353 }
354 ]
355 },
356 "unit": "currencyUSD"
357 },
358 "overrides": []
359 },
360 "gridPos": {
361 "h": 8,
362 "w": 8,
363 "x": 16,
364 "y": 8
365 },
366 "id": 6,
367 "options": {
368 "colorMode": "value",
369 "graphMode": "area",
370 "justifyMode": "auto",
371 "orientation": "auto",
372 "reduceOptions": {
373 "calcs": [
374 "sum"
375 ],
376 "fields": "",
377 "values": false
378 },
379 "textMode": "auto"
380 },
381 "pluginVersion": "7.2.0",
382 "targets": [
383 {
384 "expr": "sum(api_openai_cost_total)",
385 "interval": "",
386 "legendFormat": "",
387 "refId": "A"
388 }
389 ],
390 "timeFrom": null,
391 "timeShift": null,
392 "title": "OpenAI Cost",
393 "type": "stat"
394 },
395 {
396 "aliasColors": {},
397 "bars": false,
398 "dashLength": 10,
399 "dashes": false,
400 "datasource": "Prometheus",
401 "fieldConfig": {
402 "defaults": {
403 "custom": {}
404 },
405 "overrides": []
406 },
407 "fill": 1,
408 "fillGradient": 0,
409 "gridPos": {
410 "h": 8,
411 "w": 12,
412 "x": 0,
413 "y": 16
414 },
415 "hiddenSeries": false,
416 "id": 7,
417 "legend": {
418 "avg": false,
419 "current": false,
420 "max": false,
421 "min": false,
422 "show": true,
423 "total": false,
424 "values": false
425 },
426 "lines": true,
427 "linewidth": 1,
428 "nullPointMode": "null",
429 "options": {
430 "alertThreshold": true
431 },
432 "percentage": false,
433 "pluginVersion": "7.2.0",
434 "pointradius": 2,
435 "points": false,
436 "renderer": "flot",
437 "seriesOverrides": [],
438 "spaceLength": 10,
439 "stack": false,
440 "steppedLine": false,
441 "targets": [
442 {
443 "expr": "rate(api_token_usage_total{type=\"prompt\"}[5m])",
444 "interval": "",
445 "legendFormat": "Prompt ({{provider}})",
446 "refId": "A"
447 },
448 {
449 "expr": "rate(api_token_usage_total{type=\"completion\"}[5m])",
450 "interval": "",
451 "legendFormat": "Completion ({{provider}})",
452 "refId": "B"
453 }
454 ],
455 "thresholds": [],
456 "timeFrom": null,
457 "timeRegions": [],
458 "timeShift": null,
459 "title": "Token Usage Rate by Type",
460 "tooltip": {
461 "shared": true,
462 "sort": 0,
463 "value_type": "individual"
464 },
465 "type": "graph",
466 "xaxis": {
467 "buckets": null,
468 "mode": "time",
469 "name": null,
470 "show": true,
471 "values": []
472 },
473 "yaxes": [
474 {
475 "format": "short",
476 "label": "Tokens/sec",
477 "logBase": 1,
478 "max": null,
479 "min": null,
480 "show": true
481 },
482 {
483 "format": "short",
484 "label": null,
485 "logBase": 1,
486 "max": null,
487 "min": null,
488 "show": true
489 }
490 ],
491 "yaxis": {
492 "align": false,
493 "alignLevel": null
494 }
495 },
496 {
497 "aliasColors": {},
498 "bars": false,
499 "dashLength": 10,
500 "dashes": false,
501 "datasource": "Prometheus",
502 "fieldConfig": {
503 "defaults": {
504 "custom": {}
505 },
506 "overrides": []
507 },
508 "fill": 1,
509 "fillGradient": 0,
510 "gridPos": {
511 "h": 8,
512 "w": 12,
513 "x": 12,
514 "y": 16
515 },
516 "hiddenSeries": false,
517 "id": 8,
518 "legend": {
519 "avg": false,
520 "current": false,
521 "max": false,
522 "min": false,
523 "show": true,
524 "total": false,
525 "values": false
526 },
527 "lines": true,
528 "linewidth": 1,
529 "nullPointMode": "null",
530 "options": {
531 "alertThreshold": true
532 },
533 "percentage": false,
534 "pluginVersion": "7.2.0",
535 "pointradius": 2,
536 "points": false,
537 "renderer": "flot",
538 "seriesOverrides": [],
539 "spaceLength": 10,
540 "stack": false,
541 "steppedLine": false,
542 "targets": [
543 {
544 "expr": "rate(api_cache_hits_total[5m])",
545 "interval": "",
546 "legendFormat": "Cache Hits",
547 "refId": "A"
548 },
549 {
550 "expr": "rate(api_cache_misses_total[5m])",
551 "interval": "",
552 "legendFormat": "Cache Misses",
553 "refId": "B"
554 }
555 ],
556 "thresholds": [],
557 "timeFrom": null,
558 "timeRegions": [],
559 "timeShift": null,
560 "title": "Cache Performance",
561 "tooltip": {
562 "shared": true,
563 "sort": 0,
564 "value_type": "individual"
565 },
566 "type": "graph",
567 "xaxis": {
568 "buckets": null,
569 "mode": "time",
570 "name": null,
571 "show": true,
572 "values": []
573 },
574 "yaxes": [
575 {
576 "format": "short",
577 "label": "Rate",
578 "logBase": 1,
579 "max": null,
580 "min": null,
581 "show": true
582 },
583 {
584 "format": "short",
585 "label": null,
586 "logBase": 1,
587 "max": null,
588 "min": null,
589 "show": true
590 }
591 ],
592 "yaxis": {
593 "align": false,
594 "alignLevel": null
595 }
596 }
597 ],
598 "refresh": "10s",
599 "schemaVersion": 26,
600 "style": "dark",
601 "tags": [],
602 "templating": {
603 "list": []
604 },
605 "time": {
606 "from": "now-6h",
607 "to": "now"
608 },
609 "timepicker": {
610 "refresh_intervals": [
611 "5s",
612 "10s",
613 "30s",
614 "1m",
615 "5m",
616 "15m",
617 "30m",
618 "1h",
619 "2h",
620 "1d"
621 ]
622 },
623 "timezone": "",
624 "title": "MCP Hybrid System Dashboard",
625 "uid": "mcp-dashboard",
626 "version": 1
627}

Implementing Metrics Collection in API

python
# app/middleware/metrics.py
import logging
import time

from fastapi import Request
from prometheus_client import Counter, Gauge, Histogram

# Initialize metrics
REQUEST_COUNT = Counter(
    'api_requests_total',
    'Total count of API requests',
    ['method', 'endpoint', 'provider', 'model', 'status']
)

# Exported as a histogram: percentiles are derived from the
# api_response_time_seconds_bucket series, not from a quantile label.
RESPONSE_TIME = Histogram(
    'api_response_time_seconds',
    'Response time in seconds',
    ['method', 'endpoint', 'provider']
)

TOKEN_USAGE = Counter(
    'api_token_usage_total',
    'Total token usage',
    ['provider', 'model', 'type']  # type: prompt or completion
)

OPENAI_COST = Counter(
    'api_openai_cost_total',
    'Total OpenAI API cost in USD',
    ['model']
)

ACTIVE_REQUESTS = Gauge(
    'api_active_requests',
    'Number of active requests',
    ['method']
)

CACHE_HITS = Counter(
    'api_cache_hits_total',
    'Total cache hits',
    ['cache_type']  # exact or semantic
)

CACHE_MISSES = Counter(
    'api_cache_misses_total',
    'Total cache misses',
    []
)

logger = logging.getLogger(__name__)


async def metrics_middleware(request: Request, call_next):
    """Middleware to collect metrics for API requests."""
    # Track active requests
    ACTIVE_REQUESTS.labels(method=request.method).inc()

    # Start timing with a monotonic clock so wall-clock adjustments
    # cannot produce negative or skewed durations
    start_time = time.perf_counter()

    # Defaults used when the handler raises before producing a response
    status_code = 500
    provider = "unknown"
    model = "unknown"

    try:
        # Process the request
        response = await call_next(request)
        status_code = response.status_code

        # Try to get provider and model from response headers if available
        provider = response.headers.get("X-Provider", "unknown")
        model = response.headers.get("X-Model", "unknown")

        return response
    except Exception:
        logger.exception("Unhandled exception in request")
        raise
    finally:
        # Calculate response time
        response_time = time.perf_counter() - start_time

        # Record metrics. The raw URL path is used as the endpoint label;
        # paths containing IDs will create one time series per distinct value.
        REQUEST_COUNT.labels(
            method=request.method,
            endpoint=request.url.path,
            provider=provider,
            model=model,
            status=str(status_code)
        ).inc()

        RESPONSE_TIME.labels(
            method=request.method,
            endpoint=request.url.path,
            provider=provider
        ).observe(response_time)

        # Decrement active requests
        ACTIVE_REQUESTS.labels(method=request.method).dec()
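
Because `api_response_time_seconds` is a histogram, percentiles are not stored directly; they are estimated from cumulative bucket counts. The sketch below illustrates that estimation with linear interpolation inside the matching bucket, which is roughly how Prometheus documents `histogram_quantile`. The function name `estimate_quantile` and the sample bucket values are hypothetical, chosen only for illustration.

```python
from typing import List, Tuple


def estimate_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    Buckets must be sorted by upper bound and end with (inf, total_count).
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be in [0, 1]")

    total = buckets[-1][1]
    if total == 0:
        return float("nan")

    rank = q * total  # how many observations lie at or below the quantile
    prev_bound, prev_count = 0.0, 0.0

    for upper, count in buckets:
        if count >= rank:
            if upper == float("inf"):
                # The open-ended bucket has no upper edge to interpolate
                # toward, so fall back to the last finite bound.
                return prev_bound
            in_bucket = count - prev_count
            if in_bucket == 0:
                return upper
            # Assume observations are spread evenly inside the bucket.
            return prev_bound + (upper - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = upper, count

    return prev_bound


# Hypothetical cumulative counts for upper bounds of 0.5 s, 1 s, 2.5 s and +Inf
sample_buckets = [(0.5, 50), (1.0, 80), (2.5, 95), (float("inf"), 100)]
p90 = estimate_quantile(0.9, sample_buckets)
```

With these sample counts the 90th percentile lands in the 1 s to 2.5 s bucket and interpolates to 2.0 seconds. The estimate is only as fine-grained as the bucket boundaries, so buckets should be chosen around the latencies that matter for alerting.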

Scaling Strategies

Optimizing Ollama Scaling for High Loads

python
# app/services/ollama_scaling.py
import asyncio
import logging
import random
from typing import List, Optional

import httpx

logger = logging.getLogger(__name__)


class OllamaScalingService:
    """
    Manages load balancing and scaling for multiple Ollama instances.
    """

    def __init__(self):
        self.ollama_instances = []
        self.instance_status = {}
        self.model_availability = {}
        self.health_check_interval = 60  # seconds
        self.enable_scaling = False
        self.min_instances = 1
        self.max_instances = 5
        self.health_check_task = None

    async def initialize(self, instances: List[str]):
        """Initialize the service with a list of Ollama instances."""
        self.ollama_instances = instances
        self.instance_status = {instance: False for instance in instances}
        self.model_availability = {instance: [] for instance in instances}

        # Perform the initial health check so status is populated before first use
        await self._check_all_instances()

        # Start periodic health checking in the background
        self.health_check_task = asyncio.create_task(self._health_check_loop())

        logger.info(f"Initialized Ollama scaling with {len(instances)} instances")

    async def shutdown(self):
        """Shutdown the service."""
        if self.health_check_task:
            self.health_check_task.cancel()
            try:
                await self.health_check_task
            except asyncio.CancelledError:
                pass

    async def _health_check_loop(self):
        """Periodically check health of all instances."""
        while True:
            try:
                # Sleep first: initialize() has already run one check
                await asyncio.sleep(self.health_check_interval)
                await self._check_all_instances()
            except asyncio.CancelledError:
                break
            except Exception as e:
                logger.error(f"Error in health check loop: {str(e)}")
                await asyncio.sleep(5)  # Shorter retry on error

    async def _check_all_instances(self):
        """Check health and model availability for all instances."""
        tasks = [self._check_instance(instance) for instance in self.ollama_instances]

        # Run all checks in parallel
        await asyncio.gather(*tasks, return_exceptions=True)

        # Log status
        healthy_count = sum(1 for status in self.instance_status.values() if status)
        logger.debug(f"Ollama health check: {healthy_count}/{len(self.ollama_instances)} instances healthy")

    async def _check_instance(self, instance: str):
        """Check health and model availability for a single instance."""
        try:
            async with httpx.AsyncClient(timeout=5.0) as client:
                response = await client.get(f"{instance}/api/version")

                if response.status_code == 200:
                    # Instance is healthy
                    self.instance_status[instance] = True

                    # Check available models
                    models_response = await client.get(f"{instance}/api/tags")
                    if models_response.status_code == 200:
                        data = models_response.json()
                        models = [model["name"] for model in data.get("models", [])]
                        self.model_availability[instance] = models
                else:
                    self.instance_status[instance] = False
        except Exception as e:
            logger.warning(f"Health check failed for {instance}: {str(e)}")
            self.instance_status[instance] = False

    def get_instance_for_model(self, model: str) -> Optional[str]:
        """Get the best instance for a specific model."""
        # Filter to healthy instances that have the model. Names reported by
        # /api/tags include the tag (for example "llama2:latest"), so callers
        # should pass the full name.
        candidates = [
            instance for instance, status in self.instance_status.items()
            if status and model in self.model_availability.get(instance, [])
        ]

        if not candidates:
            return None

        # Use random selection for basic load balancing
        # A more sophisticated version would track load, response times, etc.
        return random.choice(candidates)

    def get_healthy_instance(self) -> Optional[str]:
        """Get any healthy instance."""
        candidates = [
            instance for instance, status in self.instance_status.items()
            if status
        ]

        if not candidates:
            return None

        return random.choice(candidates)

    async def ensure_model_availability(self, model: str) -> bool:
        """
        Ensure at least one instance has the required model.
        Returns True if model is available or successfully pulled.
        """
        # Check if any instance already has this model
        for instance, models in self.model_availability.items():
            if self.instance_status.get(instance, False) and model in models:
                return True

        # Try to pull the model on a healthy instance
        instance = self.get_healthy_instance()
        if not instance:
            logger.error(f"No healthy Ollama instances available to pull model {model}")
            return False

        # Try to pull the model
        try:
            async with httpx.AsyncClient(timeout=300.0) as client:  # Longer timeout for model pull
                response = await client.post(
                    f"{instance}/api/pull",
                    # stream=False asks for one final JSON object instead of a
                    # stream of progress updates
                    json={"name": model, "stream": False}
                )

                if response.status_code == 200:
                    logger.info(f"Successfully pulled model {model} on {instance}")
                    # Update model availability without adding duplicates
                    models = self.model_availability.setdefault(instance, [])
                    if model not in models:
                        models.append(model)
                    return True
                else:
                    logger.error(f"Failed to pull model {model} on {instance}: {response.text}")
                    return False
        except Exception as e:
            logger.error(f"Error pulling model {model} on {instance}: {str(e)}")
            return False
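
The comment in `get_instance_for_model` notes that a more sophisticated version would track load. One possible direction, sketched below under the assumption that the caller wraps each request to an instance, is to pick the candidate with the fewest requests currently in flight. `InFlightTracker` and its methods are hypothetical names and are not part of the service above.

```python
from collections import defaultdict
from contextlib import contextmanager
from typing import Dict, Iterable, Iterator, Optional


class InFlightTracker:
    """Counts requests in progress per instance and picks the least busy one."""

    def __init__(self) -> None:
        self._in_flight: Dict[str, int] = defaultdict(int)

    def pick(self, candidates: Iterable[str]) -> Optional[str]:
        """Return the candidate with the fewest in-flight requests, or None."""
        pool = list(candidates)
        if not pool:
            return None
        # min() keeps the first candidate on ties, so ordering is stable
        return min(pool, key=lambda instance: self._in_flight[instance])

    @contextmanager
    def track(self, instance: str) -> Iterator[None]:
        """Mark a request as in flight for the duration of the with-block."""
        self._in_flight[instance] += 1
        try:
            yield
        finally:
            self._in_flight[instance] -= 1


tracker = InFlightTracker()
instances = ["http://ollama-a:11434", "http://ollama-b:11434"]  # hypothetical URLs

with tracker.track(instances[0]):
    # While ollama-a is busy, the next request is steered to ollama-b
    busy_choice = tracker.pick(instances)

idle_choice = tracker.pick(instances)
```

In the service, the candidate list built from `instance_status` and `model_availability` could be passed to `pick` instead of `random.choice`. Random choice remains a reasonable default when request durations are similar; counting in-flight requests helps most when some generations run far longer than others.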

Autoscaling Configuration for Cloud Deployments

yaml
# kubernetes/autoscaler-config.yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: mcp-api-vpa
spec:
  targetRef:
    apiVersion: "apps/v1"
    kind: Deployment
    name: mcp-api
  updatePolicy:
    updateMode: "Auto"
  resourcePolicy:
    containerPolicies:
      - containerName: '*'
        minAllowed:
          cpu: 250m
          memory: 256Mi
        maxAllowed:
          cpu: 2000m
          memory: 4Gi
        controlledResources: ["cpu", "memory"]
---
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: mcp-api-scaler
spec:
  scaleTargetRef:
    name: mcp-api
  minReplicaCount: 2
  maxReplicaCount: 20
  pollingInterval: 15
  cooldownPeriod: 300
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus-service:9090
        metricName: api_active_requests
        threshold: '10'
        query: sum(api_active_requests)
    - type: prometheus
      metadata:
        serverAddress: http://prometheus-service:9090
        metricName: api_response_time_p90
        threshold: '2.0'
        query: histogram_quantile(0.9, sum(rate(api_response_time_seconds_bucket[2m])) by (le))
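
To get a feel for how the two triggers above translate into replica counts, the following sketch approximates the calculation. It assumes each trigger is evaluated as an average-value target, so the proposal is roughly `ceil(metric / threshold)`, that the largest proposal wins, and that the result is clamped to `minReplicaCount` and `maxReplicaCount`. It deliberately ignores tolerance bands, stabilization windows, and the cooldown period, and the `desired_replicas` helper is a hypothetical name used only for illustration.

```python
import math
from typing import Iterable, Tuple


def desired_replicas(
    triggers: Iterable[Tuple[float, float]],
    min_replicas: int,
    max_replicas: int,
) -> int:
    """Approximate the replica count for (metric_value, threshold) pairs."""
    # Each trigger proposes enough replicas to bring the per-replica
    # share of its metric down to the threshold.
    proposals = [math.ceil(value / threshold) for value, threshold in triggers]

    # The most demanding trigger wins; with no triggers, stay at the minimum.
    wanted = max(proposals, default=min_replicas)

    # Clamp to the configured bounds.
    return max(min_replicas, min(max_replicas, wanted))


# 57 active requests against a threshold of 10, and a p90 latency of 1.2 s
# against a threshold of 2.0 s, using the bounds from the ScaledObject above
replicas = desired_replicas([(57, 10), (1.2, 2.0)], min_replicas=2, max_replicas=20)
```

In this example the active-request trigger proposes six replicas and the latency trigger proposes one, so the deployment would scale to six. Note that a latency value does not grow with the number of pods the way a summed request count does, so the second trigger behaves more like an on/off signal than a proportional one.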

Cost Optimization - Monthly Budget Tracking

python
# app/services/budget_service.py
# Written against the aioredis 1.x API (create_redis_pool, expire=, exist=).
import logging
import time
from datetime import datetime, timedelta
from typing import Any, Dict

import aioredis

logger = logging.getLogger(__name__)


class BudgetService:
    """
    Manages API budget tracking and quota enforcement.
    """

    def __init__(self, redis_url: str):
        self.redis = None
        self.redis_url = redis_url
        self.monthly_budget = 0.0
        self.daily_budget = 0.0
        self.alert_threshold = 0.8  # Alert at 80% of budget
        self.budget_lock_key = "budget:lock"
        self.last_reset_check = 0

    async def initialize(self, monthly_budget: float = 0.0):
        """Initialize the budget service."""
        self.redis = await aioredis.create_redis_pool(self.redis_url)
        self.monthly_budget = monthly_budget
        self.daily_budget = monthly_budget / 30 if monthly_budget > 0 else 0

        # Initialize monthly budget in Redis if not already set
        if not await self.redis.exists("budget:monthly:total"):
            await self.redis.set("budget:monthly:total", str(monthly_budget))

        # Initialize current usage if not already set
        if not await self.redis.exists("budget:monthly:used"):
            await self.redis.set("budget:monthly:used", "0.0")

        # Set the reset day (1st of month)
        if not await self.redis.exists("budget:reset_day"):
            await self.redis.set("budget:reset_day", "1")

        # Check if we need to reset the budget
        await self._check_budget_reset()

        logger.info(f"Budget service initialized with monthly budget: ${monthly_budget:.2f}")

    async def close(self):
        """Close the Redis connection."""
        if self.redis:
            self.redis.close()
            await self.redis.wait_closed()

    async def _check_budget_reset(self):
        """Check if the budget needs to be reset (new month)."""
        now = time.time()
        # Only check once per hour to avoid excessive checks
        if now - self.last_reset_check < 3600:
            return

        self.last_reset_check = now

        # Try to acquire the lock to avoid multiple resets. This happens
        # outside the try block so that a process which did not get the lock
        # never deletes the lock held by another process.
        lock = await self.redis.set(
            self.budget_lock_key, "1",
            expire=60, exist="SET_IF_NOT_EXIST"
        )

        if not lock:
            return  # Another process is handling reset

        try:
            # Get the reset day (default to 1st of month)
            reset_day = int(await self.redis.get("budget:reset_day") or "1")

            # Get last reset timestamp
            last_reset = float(await self.redis.get("budget:last_reset") or "0")

            # Check if we're in a new month since last reset
            last_reset_date = datetime.fromtimestamp(last_reset)
            now_date = datetime.now()

            is_new_month = (
                now_date.year > last_reset_date.year
                or (now_date.year == last_reset_date.year
                    and now_date.month > last_reset_date.month)
            )

            # If it's a new month and we've passed the reset day
            if is_new_month and now_date.day >= reset_day:
                # Archive the previous month's usage for reporting. This must
                # run before the counter is zeroed, otherwise the archive
                # would always record 0.0.
                prev_month = last_reset_date.strftime("%Y-%m")
                prev_usage = await self.redis.get("budget:monthly:used") or "0.0"
                await self.redis.set(f"budget:archive:{prev_month}", prev_usage)

                # Reset monthly usage
                await self.redis.set("budget:monthly:used", "0.0")

                # Update last reset timestamp
                await self.redis.set("budget:last_reset", str(now))

                # Log the reset
                logger.info("Monthly budget reset performed")
        finally:
            # Release lock
            await self.redis.delete(self.budget_lock_key)

    async def record_usage(self, cost: float, provider: str, model: str):
        """Record API usage cost."""
        if cost <= 0:
            return

        # Only track costs for OpenAI
        if provider != "openai":
            return

        # Check if we need to reset first
        await self._check_budget_reset()

        # Update monthly usage
        await self.redis.incrbyfloat("budget:monthly:used", cost)

        # Update model-specific usage
        await self.redis.incrbyfloat(f"budget:model:{model}", cost)

        # Update daily usage
        today = datetime.now().strftime("%Y-%m-%d")
        await self.redis.incrbyfloat(f"budget:daily:{today}", cost)

        # Log high-cost operations
        if cost > 0.1:  # Log individual requests that cost more than 10 cents
            logger.info(f"High-cost API request: ${cost:.4f} for {provider}:{model}")

        # Check if we've exceeded the alert threshold
        usage = float(await self.redis.get("budget:monthly:used") or "0")
        budget = float(await self.redis.get("budget:monthly:total") or "0")

        if budget > 0 and usage >= budget * self.alert_threshold:
            # Check if we've already alerted for this threshold
            alert_key = f"budget:alerted:{int(self.alert_threshold * 100)}"
            alerted = await self.redis.get(alert_key)

            if not alerted:
                percentage = (usage / budget) * 100
                logger.warning(f"Budget alert: Used ${usage:.2f} of ${budget:.2f} ({percentage:.1f}%)")

                # Mark as alerted for this threshold
                await self.redis.set(
                    alert_key, "1",
                    expire=86400  # Expire after 1 day
                )

    async def check_budget_available(self, estimated_cost: float) -> bool:
        """
        Check if there's enough budget for an estimated operation.
        Returns True if operation is allowed, False if it would exceed budget.
        """
        if estimated_cost <= 0:
            return True

        if self.monthly_budget <= 0:
            return True  # No budget constraints

        # Get current usage
        usage = float(await self.redis.get("budget:monthly:used") or "0")
        budget = float(await self.redis.get("budget:monthly:total") or "0")

        # Check if operation would exceed budget
        return (usage + estimated_cost) <= budget

    async def get_usage_stats(self) -> Dict[str, Any]:
        """Get current budget usage statistics."""
        usage = float(await self.redis.get("budget:monthly:used") or "0")
        budget = float(await self.redis.get("budget:monthly:total") or "0")

        # Get daily usage for the last 30 days
        daily_usage = {}
        today = datetime.now()

        for i in range(30):
            date = (today - timedelta(days=i)).strftime("%Y-%m-%d")
            day_usage = float(await self.redis.get(f"budget:daily:{date}") or "0")
            daily_usage[date] = day_usage

        # Get usage by model
        model_keys = await self.redis.keys("budget:model:*")
        model_usage = {}

        for key in model_keys:
            model = key.decode('utf-8').replace("budget:model:", "")
            model_cost = float(await self.redis.get(key) or "0")
            model_usage[model] = model_cost

        # Calculate percentage used
        percentage_used = (usage / budget) * 100 if budget > 0 else 0

        return {
            "current_usage": usage,
            "monthly_budget": budget,
            "percentage_used": percentage_used,
            "daily_usage": daily_usage,
            "model_usage": model_usage,
            "remaining_budget": budget - usage if budget > 0 else 0
        }
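
The month-rollover condition inside `_check_budget_reset` is easy to get wrong around year boundaries, so it can help to isolate it as a pure function that is testable without Redis. The sketch below mirrors the same rule (a later calendar month than the last reset, and the reset day reached); the helper name `needs_monthly_reset` is hypothetical and is not part of the service above.

```python
from datetime import datetime


def needs_monthly_reset(last_reset: datetime, now: datetime, reset_day: int = 1) -> bool:
    """Return True when `now` is in a later calendar month than `last_reset`
    and the configured reset day of the month has been reached."""
    # Collapse year and month into a single month count so that a
    # December-to-January rollover is handled like any other month change.
    months_elapsed = (now.year - last_reset.year) * 12 + (now.month - last_reset.month)
    return months_elapsed >= 1 and now.day >= reset_day


# Rollover across a year boundary
year_boundary = needs_monthly_reset(datetime(2023, 12, 20), datetime(2024, 1, 1))

# Same month: no reset, even several weeks later
same_month = needs_monthly_reset(datetime(2024, 1, 1), datetime(2024, 1, 28))

# New month, but the configured reset day (the 5th) has not arrived yet
before_reset_day = needs_monthly_reset(
    datetime(2024, 1, 5), datetime(2024, 2, 3), reset_day=5
)
```

One consequence of this rule is that a `reset_day` of 29 or later never triggers in months that are shorter than that, so the reset slips to the next month in which the day exists. Keeping the default of 1 avoids the issue.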

Conclusion

The optimization and deployment strategies outlined in this document provide a comprehensive framework for implementing an efficient, cost-effective, and highly accurate hybrid AI system that leverages both OpenAI's cloud capabilities and Ollama's local inference.

Key aspects of this implementation include:

  1. Performance Optimization:

    • Query routing optimization based on complexity analysis
    • Semantic response caching for frequent queries
    • Parallel processing for complex queries
    • Dynamic batching for high-load scenarios
    • Model-specific prompt optimization
  2. Cost Reduction:

    • Intelligent token usage optimization
    • Tiered model selection based on task requirements
    • Local model prioritization for development
    • Request batching and rate limiting
    • Memory and context compression
  3. Response Accuracy:

    • Advanced prompt templating for different scenarios
    • Chain-of-thought reasoning for complex queries
    • Self-verification and error correction
    • Domain-specific knowledge integration
    • Dynamic few-shot learning with examples
  4. Deployment Options:

    • Local development environment with Docker Compose
    • Production Kubernetes deployment with autoscaling
    • AWS cloud deployment with CloudFormation
    • Comprehensive monitoring with Prometheus and Grafana
    • Budget tracking and cost optimization

These strategies work in concert to create a system that intelligently balances the tradeoffs between performance, cost, and accuracy, adapting to specific requirements and constraints in different deployment scenarios.

By implementing this hybrid approach, organizations can reduce API costs while maintaining high-quality responses, with the added benefits of enhanced privacy for sensitive data and reduced dependency on external services. Local inference also provides resilience against API outages and rate limiting, which helps keep the service available when the cloud provider is not.

MCP (Modern Computational Paradigm) System

Comprehensive Documentation

This documentation provides a complete guide to understanding, installing, configuring, and using the MCP system - a hybrid architecture that integrates OpenAI's API capabilities with Ollama's local inference to create an optimized, cost-effective AI solution.


Table of Contents

  1. Introduction
  2. System Architecture
  3. Installation Guide
  4. Configuration
  5. API Reference
  6. Usage Examples
  7. Performance Optimization
  8. Cost Optimization
  9. Monitoring and Observability
  10. Troubleshooting
  11. Contributing
  12. License

README.md

markdown
# MCP - Modern Computational Paradigm

![MCP Status](https://img.shields.io/badge/status-stable-green)
![Python 3.11+](https://img.shields.io/badge/python-3.11+-blue.svg)
![License MIT](https://img.shields.io/badge/license-MIT-green.svg)

MCP is a hybrid AI system that intelligently integrates OpenAI's cloud capabilities with Ollama's local inference. This architecture optimizes for cost, performance, and privacy while maintaining response quality.

## Key Features

- **Intelligent Query Routing**: Automatically selects between OpenAI and Ollama based on query complexity, privacy requirements, and performance needs
- **Advanced Agent Framework**: Configurable AI agents with specialized capabilities
- **Cost Optimization**: Reduces API costs by up to 70% through local model usage, caching, and token optimization
- **Privacy Control**: Keeps sensitive information local when appropriate
- **Performance Optimization**: Parallel processing, response caching, and dynamic batching for high throughput
- **Comprehensive Monitoring**: Built-in metrics and observability

## Quick Start

### Prerequisites

- Python 3.11+
- Docker and Docker Compose (for containerized deployment)
- Ollama (for local model inference)
- OpenAI API key

### Installation

1. Clone the repository:
   ```bash
   git clone https://github.com/yourusername/mcp.git
   cd mcp
   ```
  2. Create and activate a virtual environment:

    bash
    python -m venv venv
    source venv/bin/activate # On Windows: venv\Scripts\activate
  3. Install dependencies:

    bash
    pip install -r requirements.txt
  4. Set up environment variables:

    bash
    cp .env.example .env
    # Edit .env with your configuration
  5. Start Ollama (if not already running):

    bash
    ollama serve
  6. Start the application:

    bash
    uvicorn app.main:app --reload

The API will be available at http://localhost:8000.

Docker Deployment

For containerized deployment:

bash
docker-compose up -d

Documentation

For complete documentation, see:

Architecture

MCP uses a sophisticated routing architecture to determine the optimal inference provider for each request:

text
┌─────────────────┐     ┌──────────────────┐     ┌─────────────┐
│                 │     │                  │     │             │
│ Client Request  │────▶│ Routing Decision │────▶│ OpenAI API  │
│                 │     │                  │     │             │
└─────────────────┘     └──────────────────┘     └─────────────┘
                                 │
                                 │
                                 ▼
                          ┌─────────────┐
                          │             │
                          │ Ollama API  │
                          │             │
                          └─────────────┘

License

MIT License - see LICENSE for details.

Contributing

Contributions are welcome! Please see CONTRIBUTING.md for details.

text
---

# Installation Guide

## Prerequisites

Before installing the MCP system, ensure your environment meets the following requirements:

### System Requirements

- **Operating System**: Linux (recommended), macOS, or Windows
- **CPU**: 4+ cores recommended
- **RAM**: Minimum 8GB, 16GB+ recommended
- **Disk Space**: 10GB minimum for installation, 50GB+ recommended for model storage
- **GPU**: Optional but recommended for Ollama (NVIDIA with CUDA support)

### Software Requirements

- **Python**: Version 3.11 or higher
- **Docker**: Version 20.10 or higher (for containerized deployment)
- **Docker Compose**: Version 2.0 or higher
- **Kubernetes**: Version 1.21+ (for Kubernetes deployment)
- **Ollama**: Latest version (for local model inference)
- **Redis**: Version 6.0+ (for caching and rate limiting)

### Required API Keys

- **OpenAI API Key**: Register at [OpenAI Platform](https://platform.openai.com/)

## Local Development Setup

Follow these steps to set up a local development environment:

### 1. Clone the Repository

```bash
git clone https://github.com/yourusername/mcp.git
cd mcp
```

2. Set Up Virtual Environment

bash
# Create virtual environment
python -m venv venv

# Activate virtual environment
# On Linux/macOS:
source venv/bin/activate
# On Windows:
venv\Scripts\activate

3. Install Dependencies

bash
pip install --upgrade pip
pip install -r requirements.txt
pip install -r requirements-dev.txt # For development tools

4. Install and Configure Ollama

bash
# macOS (using Homebrew)
brew install ollama

# Linux
curl -fsSL https://ollama.com/install.sh | sh

# Start Ollama service
ollama serve

5. Pull Required Models

bash
# Pull basic models
ollama pull llama2
ollama pull mistral
ollama pull codellama

6. Set Up Environment Variables

bash
# Copy the example environment file
cp .env.example .env

# Edit the file with your configuration
# At minimum, set OPENAI_API_KEY
nano .env

7. Initialize Local Services

bash
# Start Redis using Docker
docker-compose up -d redis

# Initialize database (if applicable)
python scripts/init_db.py

8. Start Development Server

bash
# Start with auto-reload for development
uvicorn app.main:app --reload --port 8000

9. Verify Installation

Open your browser and navigate to:

  • API documentation: http://localhost:8000/docs
  • Health check: http://localhost:8000/api/v1/health

Docker Deployment

For a containerized deployment using Docker Compose:

1. Ensure Docker and Docker Compose are Installed

bash
# Verify installation
docker --version
docker-compose --version

2. Configure Environment Variables

bash
# Copy and edit environment variables
cp .env.example .env
nano .env

3. Start Services with Docker Compose

bash
# Build and start all services
docker-compose up -d

# View logs
docker-compose logs -f

The application will be available at http://localhost:8000.

4. Stopping the Services

bash
docker-compose down

Kubernetes Deployment

For production deployment on Kubernetes:

1. Prerequisites

  • Kubernetes cluster
  • kubectl configured
  • Helm (optional, for Redis deployment)

2. Set Up Namespace and Secrets

bash
# Create namespace
kubectl create namespace mcp

# Create secrets
kubectl create secret generic mcp-secrets \
  --from-literal=openai-api-key=YOUR_OPENAI_API_KEY \
  --from-literal=redis-password=YOUR_REDIS_PASSWORD \
  -n mcp

3. Deploy Redis (if needed)

bash
# Using Helm
helm repo add bitnami https://charts.bitnami.com/bitnami
helm install redis bitnami/redis \
  --namespace mcp \
  --set auth.password=YOUR_REDIS_PASSWORD \
  --set master.persistence.size=8Gi

4. Deploy MCP Components

bash
# Apply Kubernetes manifests
kubectl apply -f kubernetes/deployment.yaml -n mcp
kubectl apply -f kubernetes/service.yaml -n mcp
kubectl apply -f kubernetes/ingress.yaml -n mcp

5. Set Up Autoscaling (Optional)

bash
kubectl apply -f kubernetes/hpa.yaml -n mcp

6. Check Deployment Status

bash
kubectl get pods -n mcp
kubectl get services -n mcp
kubectl get ingress -n mcp

AWS Deployment

For deployment on AWS Cloud:

1. Prerequisites

  • AWS CLI configured
  • Appropriate IAM permissions

2. CloudFormation Deployment

bash
# Deploy using CloudFormation template
aws cloudformation create-stack \
  --stack-name mcp-hybrid-system \
  --template-body file://aws/cloudformation.yaml \
  --capabilities CAPABILITY_IAM \
  --parameters \
    ParameterKey=Environment,ParameterValue=Production \
    ParameterKey=OllamaInstanceType,ParameterValue=g4dn.xlarge

# Check deployment status
aws cloudformation describe-stacks --stack-name mcp-hybrid-system

3. Deploy API Image to ECR

bash
# Log in to ECR
aws ecr get-login-password | docker login --username AWS --password-stdin YOUR_AWS_ACCOUNT_ID.dkr.ecr.YOUR_REGION.amazonaws.com

# Build and push image
docker build -t YOUR_AWS_ACCOUNT_ID.dkr.ecr.YOUR_REGION.amazonaws.com/mcp-api:latest -f Dockerfile.prod .
docker push YOUR_AWS_ACCOUNT_ID.dkr.ecr.YOUR_REGION.amazonaws.com/mcp-api:latest

4. Update ECS Service

bash
# Force new deployment to use the updated image
aws ecs update-service --cluster mcp-hybrid-system-cluster --service mcp-hybrid-system-api --force-new-deployment

API Reference

Authentication

The MCP API uses API key authentication. Include your API key in all requests using either:

Bearer Token Authentication

Authorization: Bearer YOUR_API_KEY

Query Parameter

?api_key=YOUR_API_KEY
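
As a minimal sketch of the two options above, the helpers below build either the `Authorization` header or the `api_key` query string. The function names are hypothetical and not part of the MCP codebase. Header-based authentication is usually the safer default, since query strings tend to end up in access logs.

```python
from urllib.parse import urlencode


def bearer_headers(api_key: str) -> dict:
    """Headers for Bearer token authentication (hypothetical helper)."""
    return {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {api_key}",
    }


def with_api_key(url: str, api_key: str) -> str:
    """Append the api_key query parameter to a URL (hypothetical helper)."""
    separator = "&" if "?" in url else "?"
    return f"{url}{separator}{urlencode({'api_key': api_key})}"


print(bearer_headers("YOUR_API_KEY")["Authorization"])
print(with_api_key("http://localhost:8000/api/v1/models", "YOUR_API_KEY"))
```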

Chat Endpoints

Create Chat Completion

Generates a completion for a given conversation.

Endpoint: POST /api/v1/chat/completions

Request Body:

json
{
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello, who are you?"}
  ],
  "model": "auto",
  "temperature": 0.7,
  "max_tokens": 1024,
  "stream": false,
  "routing_preferences": {
    "force_provider": null,
    "privacy_level": "standard",
    "latency_preference": "balanced"
  },
  "tools": []
}

Parameters:

| Name | Type | Description |
|------|------|-------------|
| messages | array | Array of message objects representing the conversation history |
| model | string | The model to use, or "auto" for automatic selection |
| temperature | number | Controls randomness (0-1) |
| max_tokens | integer | Maximum tokens in response |
| stream | boolean | Whether to stream the response |
| routing_preferences | object | Preferences for provider selection |
| tools | array | List of tools the assistant can use |

Response:

json
{
  "id": "resp_abc123",
  "object": "chat.completion",
  "created": 1677858242,
  "provider": "openai",
  "model": "gpt-4o",
  "usage": {
    "prompt_tokens": 56,
    "completion_tokens": 325,
    "total_tokens": 381
  },
  "message": {
    "role": "assistant",
    "content": "Hello! I'm an AI assistant...",
    "tool_calls": []
  },
  "routing_metrics": {
    "complexity_score": 0.78,
    "privacy_impact": "low",
    "decision_factors": ["complexity", "tool_requirements"]
  }
}

Stream Chat Completion

Stream a completion for a conversation.

Endpoint: POST /api/v1/chat/streaming

Request Body: Same as /api/v1/chat/completions but stream must be true.

Response: Server-sent events (SSE) stream of partial completions.
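
The event payload format is not specified above, so the following is only a sketch of how a client might consume the stream. It assumes each event carries a JSON object in its `data:` field, that the object has a hypothetical `delta` text field, and that the stream ends with an OpenAI-style `[DONE]` sentinel. All three are assumptions to verify against the actual server.

```python
import json
from typing import Iterable, Iterator


def iter_sse_data(lines: Iterable[str]) -> Iterator[dict]:
    """Yield decoded JSON payloads from the `data:` fields of an SSE stream."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators, comments, and other SSE fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":  # assumed end-of-stream sentinel
            break
        yield json.loads(payload)


# Simulated stream; a real client would iterate over the HTTP response lines.
sample = [
    'data: {"delta": "Hel"}',
    "",
    'data: {"delta": "lo"}',
    "",
    "data: [DONE]",
]
text = "".join(chunk["delta"] for chunk in iter_sse_data(sample))
print(text)
```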

Hybrid Chat

Intelligent routing between OpenAI and Ollama based on query characteristics.

Endpoint: POST /api/v1/chat/hybrid

Request Body:

json
{
  "messages": [
    {"role": "user", "content": "Explain quantum computing"}
  ],
  "mode": "auto",
  "options": {
    "prioritize_privacy": false,
    "prioritize_speed": false
  }
}

Response: Same format as /api/v1/chat/completions.

Agent Endpoints

Run Agent

Execute an agent with specific configuration.

Endpoint: POST /api/v1/agents/run

Request Body:

json
{
  "agent_config": {
    "instructions": "You are a research assistant...",
    "model": "gpt-4o",
    "tools": [
      {
        "type": "function",
        "function": {
          "name": "search_knowledge_base",
          "description": "Search for information",
          "parameters": {
            "type": "object",
            "properties": {
              "query": {
                "type": "string"
              }
            },
            "required": ["query"]
          }
        }
      }
    ]
  },
  "messages": [
    {"role": "user", "content": "Find information about renewable energy"}
  ],
  "metadata": {
    "session_id": "user_session_123"
  }
}

Response:

json
{
  "run_id": "run_abc123",
  "status": "in_progress",
  "created_at": 1677858242,
  "estimated_completion_time": 1677858260,
  "polling_url": "/api/v1/agents/status/run_abc123"
}

Get Agent Status

Check the status of a running agent.

Endpoint: GET /api/v1/agents/status/{run_id}

Response:

json
{
  "run_id": "run_abc123",
  "status": "completed",
  "result": {
    "output": "Renewable energy comes from sources that are...",
    "tool_calls": []
  },
  "created_at": 1677858242,
  "completed_at": 1677858260
}
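
Because a run is asynchronous, a client has to poll the status endpoint until the run finishes. The sketch below shows one way to do that with a deadline and exponential backoff. `wait_for_run` and its parameters are hypothetical, and the status fetcher is injected so the example runs without a server; in practice it would wrap a `GET /api/v1/agents/status/{run_id}` call.

```python
import time
from typing import Callable


def wait_for_run(
    fetch_status: Callable[[], dict],
    timeout: float = 60.0,
    interval: float = 1.0,
    max_interval: float = 8.0,
) -> dict:
    """Poll fetch_status() until the run completes, fails, or times out."""
    deadline = time.monotonic() + timeout
    while True:
        status = fetch_status()
        if status.get("status") in ("completed", "failed"):
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError("agent run did not finish in time")
        time.sleep(interval)
        interval = min(interval * 2, max_interval)  # back off between polls


# Simulated status responses standing in for real HTTP calls.
responses = iter([
    {"run_id": "run_abc123", "status": "in_progress"},
    {"run_id": "run_abc123", "status": "completed", "result": {"output": "done"}},
])
final = wait_for_run(lambda: next(responses), timeout=5, interval=0.01)
print(final["result"]["output"])
```

Bounding the wait keeps a stuck run from blocking the caller indefinitely, and the backoff reduces load on the status endpoint during long runs.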

List Available Agents

List all available agent configurations.

Endpoint: GET /api/v1/agents

Response:

json
{
  "agents": [
    {
      "id": "research",
      "name": "Research Assistant",
      "description": "Specialized in finding and synthesizing information"
    },
    {
      "id": "coding",
      "name": "Code Assistant",
      "description": "Helps with programming tasks"
    }
  ]
}

Model Management Endpoints

List Models

List all available models.

Endpoint: GET /api/v1/models

Response:

json
{
  "openai_models": [
    {
      "id": "gpt-4o",
      "name": "GPT-4o",
      "capabilities": ["general", "code", "reasoning"],
      "context_window": 128000
    },
    {
      "id": "gpt-3.5-turbo",
      "name": "GPT-3.5 Turbo",
      "capabilities": ["general"],
      "context_window": 16000
    }
  ],
  "ollama_models": [
    {
      "id": "llama2",
      "name": "Llama 2",
      "capabilities": ["general"],
      "context_window": 4096
    },
    {
      "id": "mistral",
      "name": "Mistral",
      "capabilities": ["general", "reasoning"],
      "context_window": 8192
    }
  ]
}

Get Model Details

Get detailed information about a specific model.

Endpoint: GET /api/v1/models/{model_id}

Response:

json
{
  "id": "mistral",
  "name": "Mistral",
  "provider": "ollama",
  "capabilities": ["general", "reasoning"],
  "context_window": 8192,
  "recommended_usage": "General purpose tasks with reasoning requirements",
  "performance_characteristics": {
    "average_response_time": 2.4,
    "tokens_per_second": 45
  }
}

Pull Ollama Model

Pull a new model for Ollama.

Endpoint: POST /api/v1/models/ollama/pull

Request Body:

json
{
  "model": "wizard-math"
}

Response:

json
{
  "status": "pulling",
  "model": "wizard-math",
  "estimated_time": 120
}

System Endpoints

Health Check

Check system health.

Endpoint: GET /api/v1/health

Response:

json
{
  "status": "ok",
  "version": "1.0.0",
  "providers": {
    "openai": "connected",
    "ollama": "connected"
  },
  "uptime": 3600
}

System Configuration

Get current system configuration.

Endpoint: GET /api/v1/config

Response:

json
{
  "routing": {
    "complexity_threshold": 0.65,
    "privacy_sensitive_patterns": ["password", "secret", "key"],
    "default_provider": "auto"
  },
  "caching": {
    "enabled": true,
    "ttl": 3600
  },
  "optimization": {
    "token_optimization": true,
    "parallel_processing": true
  },
  "monitoring": {
    "metrics_collection": true,
    "log_level": "info"
  }
}

Update Configuration

Update system configuration.

Endpoint: POST /api/v1/config

Request Body:

json
{
  "routing": {
    "complexity_threshold": 0.7
  },
  "caching": {
    "ttl": 7200
  }
}

Response:

json
{
  "status": "updated",
  "updated_fields": ["routing.complexity_threshold", "caching.ttl"]
}
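
The request above is a partial update: only the supplied leaves change, and the response reports them as dotted paths. One way a server could implement that behavior is a recursive merge, sketched below. `apply_update` is a hypothetical helper, not the MCP implementation.

```python
from typing import Any


def apply_update(config: dict, patch: dict, prefix: str = "") -> list:
    """Merge patch into config in place; return dotted paths of changed leaves."""
    updated = []
    for key, value in patch.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict) and isinstance(config.get(key), dict):
            # Recurse so sibling keys in the same section are preserved.
            updated += apply_update(config[key], value, prefix=f"{path}.")
        else:
            config[key] = value
            updated.append(path)
    return updated


config = {
    "routing": {"complexity_threshold": 0.65, "default_provider": "auto"},
    "caching": {"enabled": True, "ttl": 3600},
}
patch = {"routing": {"complexity_threshold": 0.7}, "caching": {"ttl": 7200}}
fields = apply_update(config, patch)
print(fields)
```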

System Metrics

Get system performance metrics.

Endpoint: GET /api/v1/metrics

Response:

json
{
  "requests": {
    "total": 15420,
    "last_minute": 42,
    "last_hour": 1254
  },
  "routing": {
    "openai_requests": 6210,
    "ollama_requests": 9210,
    "auto_routing_accuracy": 0.94
  },
  "performance": {
    "average_response_time": 2.3,
    "p95_response_time": 6.1,
    "cache_hit_rate": 0.37
  },
  "cost": {
    "total_openai_cost": 135.42,
    "estimated_savings": 98.67,
    "cost_per_request": 0.0088
  }
}
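
To show how the numbers in this sample response relate to one another, the short calculation below derives the share of locally served requests, a savings rate, and the cost per request. Treating "spent plus saved" as the all-cloud baseline is an assumption made here for illustration; the source does not say how `estimated_savings` is computed.

```python
metrics = {
    "routing": {"openai_requests": 6210, "ollama_requests": 9210},
    "cost": {"total_openai_cost": 135.42, "estimated_savings": 98.67},
}

routing = metrics["routing"]
total_requests = routing["openai_requests"] + routing["ollama_requests"]
local_share = routing["ollama_requests"] / total_requests

cost = metrics["cost"]
# Assumption: the all-cloud baseline is what was spent plus what was saved.
baseline = cost["total_openai_cost"] + cost["estimated_savings"]
savings_rate = cost["estimated_savings"] / baseline
cost_per_request = cost["total_openai_cost"] / total_requests

print(f"local share:      {local_share:.1%}")
print(f"savings rate:     {savings_rate:.1%}")
print(f"cost per request: ${cost_per_request:.4f}")
```

The provider counts sum to the reported total of 15420 requests, and the derived cost per request rounds to the 0.0088 shown in the response.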

Configuration

Environment Variables

The MCP system can be configured using the following environment variables:

Core Configuration

| Variable | Description | Default Value |
|----------|-------------|---------------|
| OPENAI_API_KEY | OpenAI API Key | (Required) |
| OPENAI_ORG_ID | OpenAI Organization ID | (Optional) |
| OPENAI_MODEL | Default OpenAI model | gpt-4o |
| OLLAMA_HOST | Ollama host URL | http://localhost:11434 |
| OLLAMA_MODEL | Default Ollama model | llama2 |
| APP_ENV | Environment (development, staging, production) | development |
| LOG_LEVEL | Logging level | INFO |
| PORT | API server port | 8000 |

Redis Configuration

| Variable | Description | Default Value |
|----------|-------------|---------------|
| REDIS_URL | Redis connection URL | redis://localhost:6379/0 |
| REDIS_PASSWORD | Redis password | (Optional) |
| ENABLE_CACHING | Enable response caching | true |
| CACHE_TTL | Cache TTL in seconds | 3600 |
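
To make the caching settings concrete, here is a small in-memory stand-in for an exact-match response cache with a TTL. It is a sketch only: the class and method names are hypothetical, the key scheme (SHA-256 of canonical JSON) is an assumption, and a real deployment would store entries in Redis with an expiry instead of a Python dict.

```python
import hashlib
import json
import time
from typing import Any, Callable, Optional


class ExactMatchCache:
    """In-memory stand-in for a Redis-backed exact-match response cache."""

    def __init__(self, ttl: float = 3600, clock: Callable[[], float] = time.monotonic):
        self.ttl = ttl        # mirrors CACHE_TTL (seconds)
        self.clock = clock    # injectable so expiry can be tested
        self._store = {}

    @staticmethod
    def key_for(messages: list, model: str) -> str:
        # Canonical JSON so equal requests always hash to the same key.
        blob = json.dumps({"messages": messages, "model": model}, sort_keys=True)
        return hashlib.sha256(blob.encode("utf-8")).hexdigest()

    def get(self, key: str) -> Optional[Any]:
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self.clock() >= expires_at:
            del self._store[key]  # expired: evict lazily on read
            return None
        return value

    def set(self, key: str, value: Any) -> None:
        self._store[key] = (self.clock() + self.ttl, value)


cache = ExactMatchCache(ttl=3600)
key = ExactMatchCache.key_for([{"role": "user", "content": "hi"}], "auto")
cache.set(key, "Hello!")
print(cache.get(key))
```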

Routing Configuration

| Variable | Description | Default Value |
|----------|-------------|---------------|
| COMPLEXITY_THRESHOLD | Threshold for routing to OpenAI | 0.65 |
| PRIVACY_SENSITIVE_TOKENS | Comma-separated list of privacy-sensitive tokens | password,secret,key |
| DEFAULT_PROVIDER | Default provider if not specified | auto |
| FORCE_OLLAMA | Force using Ollama for all requests | false |
| FORCE_OPENAI | Force using OpenAI for all requests | false |
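
The settings above suggest a simple decision procedure, which might be sketched as follows. The precedence shown here (force flags first, then privacy tokens, then the complexity threshold), the choice that `FORCE_OLLAMA` wins if both flags are set, and the plain substring match are all assumptions; the source does not document how the router combines these inputs.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class RoutingConfig:
    complexity_threshold: float = 0.65
    privacy_sensitive_tokens: Tuple[str, ...] = ("password", "secret", "key")
    force_ollama: bool = False
    force_openai: bool = False


def choose_provider(query: str, complexity: float, cfg: RoutingConfig) -> str:
    """Return "ollama" or "openai" for a query with a precomputed complexity score."""
    if cfg.force_ollama:
        return "ollama"
    if cfg.force_openai:
        return "openai"
    lowered = query.lower()
    if any(token in lowered for token in cfg.privacy_sensitive_tokens):
        return "ollama"  # keep privacy-sensitive text on local hardware
    return "openai" if complexity >= cfg.complexity_threshold else "ollama"


cfg = RoutingConfig()
print(choose_provider("What is my password?", 0.9, cfg))
print(choose_provider("Compare two consensus protocols in depth", 0.8, cfg))
```

Note that substring matching on a short token such as `key` also matches unrelated words like "monkey", so a production router would likely need word-boundary matching or a dedicated classifier.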

Performance Configuration

| Variable | Description | Default Value |
|----------|-------------|---------------|
| ENABLE_PARALLEL_PROCESSING | Enable parallel processing for complex queries | true |
| MAX_PARALLEL_REQUESTS | Maximum number of parallel requests | 4 |
| ENABLE_BATCHING | Enable request batching | true |
| MAX_BATCH_SIZE | Maximum batch size | 5 |
| REQUEST_TIMEOUT | Request timeout in seconds | 120 |

Cost Optimization

| Variable | Description | Default Value |
|----------|-------------|---------------|
| MONTHLY_BUDGET | Monthly budget cap for OpenAI usage (USD) | 0 (no limit) |
| ENABLE_TOKEN_OPTIMIZATION | Enable token usage optimization | true |
| TOKEN_BUDGET | Token budget per request | 0 (no limit) |
| DEV_MODE_TOKEN_LIMIT | Token limit in development mode | 1000 |
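
As an illustration of how `MONTHLY_BUDGET` might be enforced, the sketch below classifies month-to-date spend against the cap. The function name and the three status labels are hypothetical, and the 0.8 default mirrors the 80% budget warning threshold mentioned elsewhere in this guide.

```python
def budget_status(spent: float, monthly_budget: float, warning_threshold: float = 0.8) -> str:
    """Classify month-to-date spend as "ok", "warning", or "exceeded"."""
    if monthly_budget <= 0:  # 0 means no limit, per the table above
        return "ok"
    if spent >= monthly_budget:
        return "exceeded"
    if spent >= warning_threshold * monthly_budget:
        return "warning"
    return "ok"


for spent in (135.42, 170.0, 200.0):
    print(spent, budget_status(spent, monthly_budget=200.0))
```

A router could respond to "warning" by lowering the share of requests sent to OpenAI, and to "exceeded" by sending everything to Ollama, though that policy is a design choice rather than documented behavior.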

Monitoring

| Variable | Description | Default Value |
|----------|-------------|---------------|
| ENABLE_METRICS | Enable metrics collection | true |
| METRICS_PORT | Prometheus metrics port | 9090 |
| ENABLE_TRACING | Enable distributed tracing | false |
| SENTRY_DSN | Sentry DSN for error tracking | (Optional) |

Advanced Configuration

Configuration File

For more advanced configuration, create a YAML configuration file at config/config.yaml:

yaml
routing:
  # Complexity assessment weights
  complexity_weights:
    length: 0.3
    specialized_terms: 0.4
    sentence_structure: 0.3

  # Ollama model routing
  ollama_routing:
    code_generation: "codellama"
    mathematical: "wizard-math"
    creative: "dolphin-mistral"
    general: "mistral"

  # OpenAI model routing
  openai_routing:
    complex_reasoning: "gpt-4o"
    general: "gpt-3.5-turbo"

caching:
  # Semantic caching configuration
  semantic:
    enabled: true
    similarity_threshold: 0.92
    max_cached_items: 1000

  # Exact match caching
  exact:
    enabled: true
    max_cached_items: 500

optimization:
  # Chain of thought settings
  chain_of_thought:
    enabled: true
    task_types: ["reasoning", "math", "decision"]

  # Response verification
  verification:
    enabled: true
    high_risk_categories: ["financial", "legal", "medical"]

monitoring:
  # Logging configuration
  logging:
    format: "json"
    include_request_body: false
    mask_sensitive_data: true

  # Alert thresholds
  alerts:
    high_latency_threshold: 5.0 # seconds
    error_rate_threshold: 0.05 # 5%
    budget_warning_threshold: 0.8 # 80% of budget
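
The `complexity_weights` above sum to 1.0, which suggests a weighted average of per-feature scores. A minimal sketch under that assumption follows. How each feature score is extracted from a query is not specified in the source, so the example takes already-normalized scores in [0, 1] as input, and `complexity_score` is a hypothetical name.

```python
from typing import Mapping

WEIGHTS = {"length": 0.3, "specialized_terms": 0.4, "sentence_structure": 0.3}


def complexity_score(features: Mapping[str, float], weights: Mapping[str, float] = WEIGHTS) -> float:
    """Weighted sum of feature scores, each clamped to [0, 1]."""
    return sum(
        weight * min(max(features.get(name, 0.0), 0.0), 1.0)
        for name, weight in weights.items()
    )


score = complexity_score(
    {"length": 0.5, "specialized_terms": 0.9, "sentence_structure": 0.6}
)
print(round(score, 2))  # 0.3*0.5 + 0.4*0.9 + 0.3*0.6 = 0.69
```

With the default `COMPLEXITY_THRESHOLD` of 0.65, a score of 0.69 would fall on the OpenAI side of the threshold.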

Custom Provider Configuration

To configure additional inference providers, add a providers.yaml file:

yaml
providers:
  - name: azure-openai
    type: openai-compatible
    base_url: https://your-deployment.openai.azure.com
    api_key_env: AZURE_OPENAI_API_KEY
    models:
      - id: gpt-4
        deployment_id: your-gpt4-deployment
      - id: gpt-35-turbo
        deployment_id: your-gpt35-deployment

  - name: local-inference
    type: ollama-compatible
    base_url: http://localhost:8080
    models:
      - id: local-model
        capabilities: ["general"]

Model Selection

Model Tiers

MCP uses a tiered approach to model selection:

| Tier | OpenAI Models | Ollama Models | Use Cases |
|------|---------------|---------------|-----------|
| High | gpt-4o, gpt-4 | llama2:70b, codellama:34b | Complex reasoning, creative tasks, code generation |
| Medium | gpt-3.5-turbo | mistral, codellama | General purpose, standard code tasks |
| Low | gpt-3.5-turbo | llama2, phi | Simple queries, development testing |

Task-Specific Model Mapping

MCP maps specific task types to appropriate models:

| Task Type | High Tier | Medium Tier | Low Tier |
|-----------|-----------|-------------|----------|
| Code Generation | gpt-4o | codellama | codellama |
| Creative Writing | gpt-4o | mistral | mistral |
| Mathematical | gpt-4o | gpt-3.5-turbo | wizard-math |
| General Knowledge | gpt-3.5-turbo | mistral | llama2 |
| Summarization | gpt-3.5-turbo | mistral | llama2 |

To override the automatic model selection, specify the model explicitly in your request using a provider prefix. To force OpenAI GPT-4o:

json
{
  "model": "openai:gpt-4o"
}

Or, to force Ollama Mistral:

json
{
  "model": "ollama:mistral"
}
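
A server accepting these values has to separate the provider prefix from the model name. The sketch below shows one way to do that; `parse_model_spec` is a hypothetical helper. It splits only on the first colon and only when the prefix is a known provider, because Ollama model tags such as `llama2:70b` contain colons themselves.

```python
from typing import Optional, Tuple

KNOWN_PROVIDERS = {"openai", "ollama"}


def parse_model_spec(spec: str) -> Tuple[Optional[str], str]:
    """Split "provider:model" into (provider, model); provider is None if absent."""
    prefix, sep, rest = spec.partition(":")
    if sep and prefix in KNOWN_PROVIDERS:
        return prefix, rest
    return None, spec  # no recognized prefix: leave routing to the system


for spec in ("openai:gpt-4o", "ollama:llama2:70b", "llama2:70b", "auto"):
    print(spec, "->", parse_model_spec(spec))
```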

Usage Examples

Basic Chat Interaction

Python Example

python
import requests
import json

API_URL = "http://localhost:8000/api/v1"
API_KEY = "your_api_key_here"

headers = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {API_KEY}"
}

# Basic chat completion
def chat(message, history=None):
    history = history or []
    history.append({"role": "user", "content": message})

    response = requests.post(
        f"{API_URL}/chat/completions",
        headers=headers,
        json={
            "messages": history,
            "model": "auto",  # Let the system decide
            "temperature": 0.7
        }
    )

    if response.status_code == 200:
        result = response.json()
        assistant_message = result["message"]["content"]
        history.append({"role": "assistant", "content": assistant_message})

        print(f"Model used: {result['model']} via {result['provider']}")
        return assistant_message, history
    else:
        print(f"Error: {response.status_code}")
        print(response.text)
        return None, history

# Example conversation
history = []
response, history = chat("Hello! What can you tell me about artificial intelligence?", history)
print(f"Assistant: {response}\n")

response, history = chat("What are some practical applications?", history)
print(f"Assistant: {response}")

cURL Example

bash
# Simple completion
curl -X POST http://localhost:8000/api/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer your_api_key_here" \
  -d '{
    "messages": [
      {"role": "user", "content": "Explain how photosynthesis works"}
    ],
    "model": "auto",
    "temperature": 0.7
  }'

# Streaming response
curl -X POST http://localhost:8000/api/v1/chat/streaming \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer your_api_key_here" \
  -d '{
    "messages": [
      {"role": "user", "content": "Write a short poem about robots"}
    ],
    "model": "auto",
    "stream": true
  }'

Working with Agents

Python Example

python
import requests
import json
import time

API_URL = "http://localhost:8000/api/v1"
API_KEY = "your_api_key_here"

headers = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {API_KEY}"
}

# Run an agent with tools
def run_research_agent(query):
    # Define agent configuration with tools
    agent_config = {
        "instructions": "You are a research assistant specialized in finding information.",
        "model": "gpt-4o",
        "tools": [
            {
                "type": "function",
                "function": {
                    "name": "search_web",
                    "description": "Search the web for information",
                    "parameters": {
                        "type": "object",
                        "properties": {
                            "query": {
                                "type": "string",
                                "description": "Search query"
                            },
                            "num_results": {
                                "type": "integer",
                                "description": "Number of results to return"
                            }
                        },
                        "required": ["query"]
                    }
                }
            }
        ]
    }

    # Run the agent
    response = requests.post(
        f"{API_URL}/agents/run",
        headers=headers,
        json={
            "agent_config": agent_config,
            "messages": [
                {"role": "user", "content": query}
            ]
        }
    )

    if response.status_code != 200:
        print(f"Error: {response.status_code}")
        print(response.text)
        return None

    result = response.json()
    run_id = result["run_id"]

    # Poll for completion
    while True:
        status_response = requests.get(
            f"{API_URL}/agents/status/{run_id}",
            headers=headers
        )

        if status_response.status_code != 200:
            print(f"Error checking status: {status_response.status_code}")
            return None

        status_data = status_response.json()

        if status_data["status"] == "completed":
            return status_data["result"]["output"]
        elif status_data["status"] == "failed":
            print(f"Agent run failed: {status_data.get('error')}")
            return None

        time.sleep(1)  # Poll every second

# Example usage
result = run_research_agent("What are the latest advancements in fusion energy?")
print(result)

cURL Example

bash
# Run an agent
curl -X POST http://localhost:8000/api/v1/agents/run \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer your_api_key_here" \
  -d '{
    "agent_config": {
      "instructions": "You are a coding assistant.",
      "model": "gpt-4o",
      "tools": [
        {
          "type": "function",
          "function": {
            "name": "generate_code",
            "description": "Generate code in a specific language",
            "parameters": {
              "type": "object",
              "properties": {
                "language": {
                  "type": "string",
                  "description": "Programming language"
                },
                "task": {
                  "type": "string",
                  "description": "Task description"
                }
              },
              "required": ["language", "task"]
            }
          }
        }
      ]
    },
    "messages": [
      {"role": "user", "content": "Write a Python function to detect palindromes"}
    ]
  }'

# Check status
curl -X GET http://localhost:8000/api/v1/agents/status/run_abc123 \
  -H "Authorization: Bearer your_api_key_here"

Customizing Model Selection

Python Example

python
import requests

API_URL = "http://localhost:8000/api/v1"
API_KEY = "your_api_key_here"

headers = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {API_KEY}"
}

# Custom routing preferences
def custom_routing_chat(message, routing_preferences):
    response = requests.post(
        f"{API_URL}/chat/completions",
        headers=headers,
        json={
            "messages": [
                {"role": "user", "content": message}
            ],
            "routing_preferences": routing_preferences
        }
    )

    if response.status_code == 200:
        result = response.json()
        print(f"Provider: {result['provider']}, Model: {result['model']}")
        return result["message"]["content"]
    else:
        print(f"Error: {response.status_code}")
        print(response.text)
        return None

# Examples with different routing preferences
response = custom_routing_chat(
    "What is the capital of France?",
    {
        "force_provider": "ollama",  # Force Ollama
        "privacy_level": "standard",
        "latency_preference": "balanced"
    }
)
print(f"Response: {response}\n")

response = custom_routing_chat(
    "Analyze the philosophical implications of artificial general intelligence.",
    {
        "force_provider": "openai",  # Force OpenAI
        "privacy_level": "standard",
        "latency_preference": "quality"  # Prefer quality over speed
    }
)
print(f"Response: {response}\n")

response = custom_routing_chat(
    "What is my personal password?",
    {
        "force_provider": None,  # Auto-select
        "privacy_level": "high",  # Privacy-sensitive query
        "latency_preference": "balanced"
    }
)
print(f"Response: {response}")

cURL Example

bash
# Force Ollama for this request
curl -X POST http://localhost:8000/api/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer your_api_key_here" \
  -d '{
    "messages": [
      {"role": "user", "content": "What is the capital of Sweden?"}
    ],
    "routing_preferences": {
      "force_provider": "ollama",
      "privacy_level": "standard",
      "latency_preference": "speed"
    }
  }'

# Force specific model
curl -X POST http://localhost:8000/api/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer your_api_key_here" \
  -d '{
    "messages": [
      {"role": "user", "content": "Write Python code to implement merge sort"}
    ],
    "model": "ollama:codellama"
  }'

Tool Integration

Python Example

python
import json  # needed to decode tool call arguments

import requests

API_URL = "http://localhost:8000/api/v1"
API_KEY = "your_api_key_here"

headers = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {API_KEY}"
}

# Chat with tool integration
def chat_with_tools(message, tools):
    response = requests.post(
        f"{API_URL}/chat/completions",
        headers=headers,
        json={
            "messages": [
                {"role": "user", "content": message}
            ],
            "tools": tools
        }
    )

    if response.status_code != 200:
        print(f"Error: {response.status_code}")
        print(response.text)
        return None

    result = response.json()

    # If the model did not request a tool, return the direct response
    tool_calls = result["message"].get("tool_calls") or []
    if not tool_calls:
        return result["message"]["content"]

    print(f"Tool calls requested: {len(tool_calls)}")

    # Replay the conversation so far, then append one tool message per call
    messages = [
        {"role": "user", "content": message},
        {
            "role": "assistant",
            "content": result["message"]["content"],
            "tool_calls": tool_calls
        }
    ]

    # Process each tool call
    for tool_call in tool_calls:
        # In a real implementation, you would execute the actual tool here
        # For this example, we'll just simulate it
        function_name = tool_call["function"]["name"]
        arguments = json.loads(tool_call["function"]["arguments"])

        print(f"Executing tool: {function_name}")
        print(f"Arguments: {arguments}")

        # Simulate tool execution
        if function_name == "get_weather":
            tool_result = f"Weather in {arguments['location']}: Sunny, 22°C"
        elif function_name == "search_database":
            tool_result = f"Database results for {arguments['query']}: 3 records found"
        else:
            tool_result = "Unknown tool"

        messages.append({
            "role": "tool",
            "tool_call_id": tool_call["id"],
            "content": tool_result
        })

    # Send all tool results back in a single follow-up request
    response = requests.post(
        f"{API_URL}/chat/completions",
        headers=headers,
        json={"messages": messages}
    )

    if response.status_code == 200:
        final_result = response.json()
        return final_result["message"]["content"]

    print(f"Error in tool response: {response.status_code}")
    return None

# Define available tools
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather in a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "City name"
                    },
                    "unit": {
                        "type": "string",
                        "enum": ["celsius", "fahrenheit"],
                        "description": "Temperature unit"
                    }
                },
                "required": ["location"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "search_database",
            "description": "Search a database for information",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {
                        "type": "string",
                        "description": "Search query"
                    },
                    "limit": {
                        "type": "integer",
                        "description": "Maximum number of results"
                    }
                },
                "required": ["query"]
            }
        }
    }
]

# Example usage
response = chat_with_tools("What's the weather like in Paris?", tools)
print(f"Final response: {response}")

Troubleshooting

Common Issues

Installation Issues

Ollama Installation Fails

Symptoms:

  • Error messages during Ollama installation
  • ollama serve command not found

Possible Solutions:

  1. Check system requirements (minimum 8GB RAM recommended)
  2. For Linux, ensure you have the required dependencies:
    bash
    sudo apt-get update
    sudo apt-get install -y ca-certificates curl
  3. Try the manual installation from ollama.com
  4. Check if Ollama is running:
    bash
    ps aux | grep ollama

Python Dependency Errors

Symptoms:

  • pip install fails with compatibility errors
  • Import errors when starting the application

Possible Solutions:

  1. Ensure you're using Python 3.11 or higher:
    bash
    python --version
  2. Try creating a fresh virtual environment:
    bash
    rm -rf venv
    python -m venv venv
    source venv/bin/activate
    pip install --upgrade pip
  3. Install the listed packages without pulling in their transitive dependencies, which helps isolate a problematic package:
    bash
    pip install -r requirements.txt --no-deps
  4. Check for conflicts with pip:
    bash
    pip check

API Connection Issues

OpenAI API Key Invalid

Symptoms:

  • Error messages about authentication
  • "Invalid API key" errors

Possible Solutions:

  1. Verify your API key is correct and active in the OpenAI dashboard
  2. Check if the key is properly set in your .env file or environment variables
  3. Ensure there are no spaces or unexpected characters in the key
  4. Test the key with a simple OpenAI API request:
    bash
    curl https://api.openai.com/v1/models \
      -H "Authorization: Bearer YOUR_API_KEY"

Ollama Connection Failed

Symptoms:

  • "Connection refused" errors when connecting to Ollama
  • API requests to Ollama timeout

Possible Solutions:

  1. Verify Ollama is running:
    bash
    ollama list # Should show available models
  2. If not running, start the Ollama service:
    bash
    ollama serve
  3. Check if the Ollama port is accessible:
    bash
    curl http://localhost:11434/api/tags
  4. Verify your OLLAMA_HOST setting in the configuration
  5. If using Docker, ensure proper network configuration between containers
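
The same reachability check can be scripted, for example as a startup probe. The sketch below queries Ollama's `/api/tags` endpoint with the standard library and returns the local model names, or `None` when the server cannot be reached. The function name and the choice to return `None` on failure are illustrative assumptions.

```python
import json
import urllib.error
import urllib.request


def list_ollama_models(base_url: str = "http://localhost:11434", timeout: float = 3.0):
    """Return local model names from Ollama, or None if the server is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            data = json.load(resp)
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, DNS failure, or a non-JSON reply.
        return None
    return [model.get("name") for model in data.get("models", [])]


models = list_ollama_models()
if models is None:
    print("Ollama is not reachable; check OLLAMA_HOST and that `ollama serve` is running")
else:
    print("Available models:", models)
```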

Performance Issues

High Latency with Ollama

Symptoms:

  • Very slow responses from Ollama models
  • Timeouts during inference

Possible Solutions:

  1. Check if you have GPU support enabled:
    bash
    nvidia-smi # Should show GPU usage
  2. Try a smaller model:
    bash
    ollama pull tinyllama
  3. Adjust model parameters in your request:
    json
    {
      "model": "ollama:llama2",
      "max_tokens": 512,
      "temperature": 0.7
    }
  4. Check system resource usage:
    bash
    htop
  5. Increase the timeout in your configuration
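
To tell whether latency comes from the model or from the layers in front of it, it can help to call Ollama directly with a capped output length and an explicit timeout. The sketch below makes several assumptions: it bypasses the MCP API entirely, it drops the `ollama:` provider prefix used in the request above, and it maps `max_tokens` onto Ollama's `num_predict` option. The function names are hypothetical.

```python
import json
import urllib.request


def build_generate_request(model: str, prompt: str, max_tokens: int = 512,
                           temperature: float = 0.7) -> dict:
    """Build a body for Ollama's /api/generate with a capped output length."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,  # ask for one JSON object instead of a stream
        "options": {"num_predict": max_tokens, "temperature": temperature},
    }


def generate(base_url: str, body: dict, timeout: float = 120.0) -> str:
    """POST the body to Ollama and return the generated text."""
    request = urllib.request.Request(
        f"{base_url}/api/generate",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    # The timeout bounds the wait on the connection and on each socket read.
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.load(resp)["response"]
```

For example, `generate("http://localhost:11434", build_generate_request("tinyllama", "Say hello"))` exercises only the local model. If this call is fast while the same prompt through the application is slow, the bottleneck is more likely in routing or middleware than in inference.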

Memory Usage Too High

Symptoms:

  • Out of memory errors
  • System becomes unresponsive

Possible Solutions:

  1. Use smaller models (e.g., mistral:7b instead of larger variants)
  2. Reduce batch sizes in configuration
  3. Implement memory limits:
    yaml
    # In docker-compose.yml
    services:
      ollama:
        deploy:
          resources:
            limits:
              memory: 12G
  4. Enable context window optimization:
    ENABLE_TOKEN_OPTIMIZATION=true
    

Routing and Model Issues

All Requests Going to One Provider

Symptoms:

  • All requests route to OpenAI despite configuration
  • All requests route to Ollama regardless of complexity

Possible Solutions:

  1. Check for environment variables forcing a provider:
    text
    FORCE_OLLAMA=false
    FORCE_OPENAI=false
  2. Verify complexity threshold setting:
    COMPLEXITY_THRESHOLD=0.65
    
  3. Review routing preferences in requests:
    json
    {
      "routing_preferences": {
        "force_provider": null
      }
    }
  4. Check logs for routing decisions
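
When working through these checks, it helps to have a mental model of how the three settings could interact. The sketch below is an illustration only, not the MCP router: the precedence order (per-request preference, then `FORCE_OLLAMA`, then `FORCE_OPENAI`, then the threshold) and the direction of the comparison (scores at or above the threshold go to OpenAI) are assumptions, and `choose_provider` is a hypothetical name.

```python
from typing import Mapping, Optional


def _flag(env: Mapping[str, str], name: str) -> bool:
    """Interpret common truthy spellings of an environment flag."""
    return env.get(name, "false").strip().lower() in {"1", "true", "yes"}


def choose_provider(complexity: float, env: Mapping[str, str],
                    force_provider: Optional[str] = None) -> str:
    """Pick a provider for one request (assumed precedence, see lead-in)."""
    if force_provider in {"ollama", "openai"}:
        return force_provider  # per-request routing preference
    if _flag(env, "FORCE_OLLAMA"):
        return "ollama"
    if _flag(env, "FORCE_OPENAI"):
        return "openai"
    threshold = float(env.get("COMPLEXITY_THRESHOLD", "0.65"))
    return "openai" if complexity >= threshold else "ollama"
```

Under these assumptions, a threshold of `0.0` sends every request to OpenAI and a threshold above the maximum score sends every request to Ollama, which reproduces both symptoms listed above without any force flag being set.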

Model Not Found

Symptoms:

  • "Model not found" errors
  • Models available but not being used

Possible Solutions:

  1. List available models:
    bash
    ollama list
  2. Pull the missing model:
    bash
    ollama pull mistral
  3. Verify model names match exactly what you're requesting
  4. Check model mapping in configuration
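
A frequent cause of name mismatches is the tag: Ollama treats a name without a tag as `:latest`, so `mistral` and `mistral:latest` refer to the same model, while `llama2` does not match an installed `llama2:13b`. The following sketch shows one way to compare a requested name against the output of `ollama list`. The helper names are hypothetical, and any provider prefix such as `ollama:` is assumed to have been removed before the call.

```python
from typing import Iterable, Optional


def normalize(name: str) -> str:
    """Append the implicit ':latest' tag when the name carries no tag."""
    # Only inspect the last path segment so a registry host with a port
    # (host:5000/model) is not mistaken for a tagged name.
    last_segment = name.rsplit("/", 1)[-1]
    return name if ":" in last_segment else f"{name}:latest"


def resolve_model(requested: str, installed: Iterable[str]) -> Optional[str]:
    """Return the installed name matching the request, or None."""
    wanted = normalize(requested.strip())
    for name in installed:
        if normalize(name) == wanted:
            return name
    return None
```

With `installed = ["mistral:latest", "llama2:13b"]`, `resolve_model("mistral", installed)` finds the model, while `resolve_model("llama2", installed)` returns `None` because only the `13b` tag is present.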

Diagnostics

Log Analysis

MCP logs contain valuable diagnostic information. Use the following commands to analyze logs:

bash
# View API logs
docker-compose logs -f app

# View Ollama logs
docker-compose logs -f ollama

# Search for errors
docker-compose logs | grep -i error

# Check routing decisions
docker-compose logs app | grep "Routing decision"
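
For longer log captures, counting is often more informative than scrolling through `grep` output. The sketch below tallies error lines and routing decisions per provider. The log line shape is an assumption: it expects the provider name to be the first word after the text `Routing decision`, which may not match the actual MCP log format, so adjust the pattern to what your logs contain.

```python
import re
from collections import Counter
from typing import Iterable

ERROR_RE = re.compile(r"error", re.IGNORECASE)
# Assumed shape: "... Routing decision: <provider> ..."
ROUTE_RE = re.compile(r"Routing decision\W+(\w+)")


def summarize(lines: Iterable[str]) -> Counter:
    """Count error lines and routing decisions per provider."""
    counts: Counter = Counter()
    for line in lines:
        if ERROR_RE.search(line):
            counts["errors"] += 1
        match = ROUTE_RE.search(line)
        if match:
            counts[f"routed:{match.group(1).lower()}"] += 1
    return counts
```

Saved in a small script that calls `summarize(sys.stdin)`, this can be fed with `docker-compose logs app`. A heavily lopsided `routed:` count is a quick indicator of the single-provider routing problem described earlier.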

Health Check

Use the health check endpoint to verify system status:

bash
curl http://localhost:8000/api/v1/health

# For more detailed health information
curl http://localhost:8000/api/v1/health/details
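
In scripts and CI jobs, a single `curl` right after startup can fail simply because the service is still booting. The following sketch polls the health endpoint a few times before giving up. It assumes only that a healthy service answers with a 2xx status; the response body is not inspected because its schema is not specified here, and the function names are hypothetical.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_ok(url: str, timeout: float = 2.0) -> bool:
    """True if the URL answers with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status in range(200, 300)
    except (urllib.error.URLError, OSError):
        # Covers refused connections, timeouts, and 4xx/5xx responses.
        return False


def wait_until_healthy(url: str, attempts: int = 5, delay: float = 1.0,
                       probe: Callable[[str], bool] = http_ok) -> bool:
    """Poll the URL until the probe succeeds or the attempts run out."""
    for remaining in range(attempts - 1, -1, -1):
        if probe(url):
            return True
        if remaining:
            time.sleep(delay)  # no sleep after the final attempt
    return False
```

A typical call is `wait_until_healthy("http://localhost:8000/api/v1/health", attempts=10, delay=2.0)`. The `probe` parameter exists so the retry logic can be tested without a running server.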

Debug Mode

Enable debug logging for more detailed information:

bash
# Set environment variable
export LOG_LEVEL=DEBUG

# Or modify in .env file
LOG_LEVEL=DEBUG

Performance Testing

Use the built-in benchmark tool to test system performance:

bash
python scripts/benchmark.py --provider both --queries 10 --complexity mixed

Log Management

Log Levels

MCP uses the following log levels:

  • ERROR: Critical errors that require immediate attention
  • WARNING: Non-critical issues that might indicate problems
  • INFO: General operational information
  • DEBUG: Detailed information for debugging purposes
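
These four names correspond directly to constants in Python's standard `logging` module. As an illustration of how a `LOG_LEVEL` environment variable could be turned into a logger configuration, here is a minimal sketch. The function names and the fallback to `INFO` for unknown values are assumptions, not a description of how MCP parses the setting.

```python
import logging
import os
from typing import Mapping, Optional

LEVELS = {
    "ERROR": logging.ERROR,
    "WARNING": logging.WARNING,
    "INFO": logging.INFO,
    "DEBUG": logging.DEBUG,
}


def resolve_log_level(value: Optional[str], default: int = logging.INFO) -> int:
    """Map a LOG_LEVEL string to a logging constant, ignoring case."""
    return LEVELS.get((value or "").strip().upper(), default)


def configure_logging(env: Mapping[str, str] = os.environ) -> None:
    """Apply the level from the environment to the root logger."""
    logging.basicConfig(level=resolve_log_level(env.get("LOG_LEVEL")))
```

Falling back to a default rather than raising keeps a typo in `.env` from preventing startup, at the cost of silently ignoring the typo. A stricter variant could log a warning when the value is not recognized.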

Log Formats

Logs can be formatted as text or JSON:

bash
# Set JSON logging
export LOG_FORMAT=json

# Set text logging (default)
export LOG_FORMAT=text
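
JSON output is mainly useful when logs are consumed by another program, since each line can be parsed without regular expressions. The sketch below shows one way such a switch could be built on the standard `logging` module. The field names in the JSON object and the text format string are illustrative assumptions and may differ from what MCP emits.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "time": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        if record.exc_info:
            entry["exception"] = self.formatException(record.exc_info)
        return json.dumps(entry)


def make_formatter(log_format: str = "text") -> logging.Formatter:
    """Return a JSON formatter for 'json', a plain text formatter otherwise."""
    if log_format.strip().lower() == "json":
        return JsonFormatter()
    return logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
```

To use it, attach the result to a handler, for example `handler.setFormatter(make_formatter(os.environ.get("LOG_FORMAT", "text")))`.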

External Log Management

For production environments, consider forwarding logs to an external system:

bash
# Using Fluentd
docker-compose -f docker-compose.yml -f docker-compose.logging.yml up -d

Or configure log drivers in Docker:

yaml
# In docker-compose.yml
services:
  app:
    logging:
      driver: "json-file"
      options:
        max-size: "10m"
        max-file: "3"

Contributing

Contributions to the MCP system are welcome! Please follow these guidelines:

Getting Started

  1. Fork the Repository

    Fork the repository on GitHub and clone your fork locally:

    bash
    git clone https://github.com/YOUR-USERNAME/mcp.git
    cd mcp
  2. Set Up Development Environment

    Follow the installation instructions in the Installation Guide section.

  3. Create a Branch

    Create a branch for your feature or bugfix:

    bash
    git checkout -b feature/your-feature-name
    # or
    git checkout -b fix/your-bugfix-name

Development Guidelines

Code Style

  • Follow PEP 8 style guidelines for Python code
  • Use type hints for all function definitions
  • Format code with Black
  • Verify style with flake8

bash
# Install development tools
pip install black flake8 mypy

# Format code
black app tests

# Check style
flake8 app tests

# Run type checking
mypy app

Testing

  • Write unit tests for all new functionality
  • Ensure existing tests pass before submitting a PR
  • Maintain or improve code coverage

bash
# Run tests
pytest

# Run tests with coverage
pytest --cov=app tests/

# Run only unit tests
pytest tests/unit/

# Run integration tests
pytest tests/integration/

Documentation

  • Update documentation for any new features or changes
  • Document all public APIs with docstrings
  • Keep the README and guides up to date

Submitting Changes

  1. Commit Your Changes

    Make focused commits with meaningful commit messages:

    bash
    git add .
    git commit -m "Add feature: detailed description of changes"
  2. Pull Latest Changes

    Rebase your branch on the latest main:

    bash
    git checkout main
    # Assumes a remote named "upstream" that points at the main repository
    git pull upstream main
    git checkout your-branch
    git rebase main
  3. Push to Your Fork

    bash
    git push origin your-branch
  4. Create a Pull Request

    Open a pull request from your fork to the main repository:

    • Provide a clear title and description
    • Reference any related issues
    • Describe testing performed
    • Include screenshots for UI changes

Code of Conduct

  • Be respectful and inclusive in all interactions
  • Provide constructive feedback
  • Focus on the issues, not the people
  • Welcome contributors of all backgrounds and experience levels

License

By contributing to this project, you agree that your contributions will be licensed under the project's MIT License.


License

MIT License

text
Copyright (c) 2023 MCP Contributors

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

Third-Party Licenses

This project incorporates several third-party open-source libraries, each with its own license:

  • FastAPI: MIT License
  • Pydantic: MIT License
  • Uvicorn: BSD 3-Clause License
  • OpenAI Python: Apache License 2.0
  • Redis-py: MIT License
  • Prometheus Client: Apache License 2.0
  • Ollama: MIT License

Full license texts are included in the LICENSE-3RD-PARTY file in the repository.

Usage Restrictions

While the MCP system itself is open source, usage of the OpenAI API is subject to OpenAI's terms of service and usage policies. Please ensure your use of the API complies with these terms.

Sovereign AI: Building Local-First Intelligent Systems

by Daniel Kliewer · Paperback · 72 pages

The hands-on guide to building AI that runs on your hardware, keeps your data private, and eliminates cloud dependence. Working code included.