OpenAI Agents SDK & Ollama Integration: Complete Architecture Guide
This comprehensive guide demonstrates how to integrate the official OpenAI Agents SDK with Ollama to create AI agents that run entirely on local infrastructure. By the end, you'll understand both the theoretical foundations and the practical implementation of locally hosted AI agents.
Daniel Kliewer
Author, Sovereign AI


Architectural Synthesis: Integrating OpenAI's Agents SDK with Ollama
A Convergence of Contemporary AI Paradigms
In the evolving landscape of artificial intelligence systems, the architectural integration of OpenAI's Agents SDK with Ollama represents a sophisticated approach to creating hybrid, responsive computational entities. This synthesis enables a dialectical interaction between cloud-based intelligence and local computational resources, creating what might be conceptualized as a Modern Computational Paradigm (MCP) system.
Theoretical Framework and Architectural Considerations
The foundational architecture of this integration leverages the strengths of both paradigms: OpenAI's Agents SDK provides a structured framework for creating autonomous agents capable of orchestrating complex, multi-step reasoning processes, while Ollama runs large language models locally, which avoids network round-trips and keeps prompts and responses on the machine that issued them.
At its epistemological core, this architecture addresses the fundamental tension between computational capability and data sovereignty. The implementation creates a fluid boundary between local and remote processing, determined by contextual parameters including:
- Computational complexity thresholds
- Privacy requirements of specific data domains
- Latency tolerance for particular interaction modalities
- Economic considerations regarding API utilization
Functional Capabilities and Implementation Vectors
This architectural synthesis manifests several advanced capabilities:
- Cognitive Load Distribution: The system intelligently routes cognitive tasks between local and remote execution environments based on complexity, resource requirements, and privacy constraints.
- Tool Integration Framework: Both OpenAI's agents and Ollama instances can leverage a unified tool ecosystem, allowing for consistent interaction patterns with external systems.
- Conversational State Management: A sophisticated state management system maintains coherent interaction context across the distributed computational environment.
- Fallback Mechanisms: The architecture implements graceful degradation pathways, ensuring functionality persistence when either component faces constraints.
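The fallback behaviour in the last item can be illustrated with a small helper that tries providers in order and degrades to the next one on failure. This is only a sketch: `complete_with_fallback`, `flaky_remote`, and `local_echo` are hypothetical names introduced here, and the two provider functions are stand-ins for real OpenAI and Ollama calls rather than code from the repository.

```python
import asyncio
from typing import Awaitable, Callable, Sequence, Tuple

# A provider is any async callable that takes a prompt and returns text.
Provider = Callable[[str], Awaitable[str]]


async def complete_with_fallback(
    prompt: str, providers: Sequence[Tuple[str, Provider]]
) -> Tuple[str, str]:
    """Try each (name, provider) pair in order; return (name, text) from the first success."""
    errors = []
    for name, provider in providers:
        try:
            return name, await provider(prompt)
        except Exception as exc:  # any failure degrades to the next provider
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))


# Hypothetical stand-ins for real OpenAI / Ollama calls.
async def flaky_remote(prompt: str) -> str:
    raise ConnectionError("remote unavailable")


async def local_echo(prompt: str) -> str:
    return f"local answer to: {prompt}"


result = asyncio.run(
    complete_with_fallback("ping", [("openai", flaky_remote), ("ollama", local_echo)])
)
print(result)
```

Ordering the list is the whole policy here: putting the local provider first gives a privacy-first setup, while putting the remote provider first gives a capability-first setup with local degradation.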
Implementation Methodology
The GitHub repository (kliewerdaniel/OpenAIAgentsSDKOllama01) provides the foundational code structure for this integration. The implementation follows a modular approach that encapsulates:
- Abstraction layers for model interactions
- Contextual routing logic
- Unified response formatting
- Configurable threshold parameters for decision boundaries
Theoretical Implications and Future Directions
This architectural approach treats cloud and edge inference as two ends of a single, configurable system rather than as separate deployments. By integrating the two, it establishes a framework for future systems that may further blur the boundaries between computational environments.
The integration opens avenues for research in several domains:
- Optimal decision boundaries for computational routing
- Privacy-preserving techniques for sensitive information processing
- Economic models for hybrid AI systems
- Cognitive load balancing algorithms
Conclusion
The integration of OpenAI's Agents SDK with Ollama represents not merely a technical implementation but a philosophical statement about the future of AI architectures. It suggests a path toward systems that transcend binary distinctions between local and remote, private and shared, efficient and powerful—instead creating a nuanced computational environment that adapts to the specific needs of each interaction context.
This approach invites further exploration and refinement, as the field continues to evolve toward increasingly sophisticated hybrid AI architectures that balance capability, privacy, efficiency, and cost.
Technical Infrastructure: Establishing the Development Environment for OpenAI-Ollama Integration
Foundational Dependencies and Technological Requisites
The implementation of a sophisticated hybrid AI architecture integrating OpenAI's Agents SDK with Ollama necessitates a carefully curated technological stack. This infrastructure must accommodate both cloud-based intelligence and local inference capabilities within a coherent framework.
Core Dependencies
Python Environment
Python 3.10+ (3.11 recommended for optimal performance characteristics)
Essential Python Packages
```text
openai>=1.12.0         # OpenAI Python client (the Agents SDK itself is published separately as openai-agents)
ollama>=0.1.6          # Python client for Ollama interaction
fastapi>=0.109.0       # API framework for service endpoints
uvicorn>=0.27.0        # ASGI server implementation
pydantic>=2.5.0        # Data validation and settings management
python-dotenv>=1.0.0   # Environment variable management
requests>=2.31.0       # HTTP requests for external service interaction
websockets>=12.0       # WebSocket support for real-time communication
tenacity>=8.2.3        # Retry logic for resilient API interactions
```
External Services
- OpenAI API access (API key required)
- Ollama (local installation)
Environment Configuration
Installation Procedure
- Python Environment Initialization
  ```bash
  # Create isolated environment
  python -m venv venv

  # Activate environment
  # On Unix/macOS:
  source venv/bin/activate
  # On Windows:
  venv\Scripts\activate
  ```
- Dependency Installation
  ```bash
  pip install openai ollama fastapi uvicorn pydantic python-dotenv requests websockets tenacity
  ```
- Ollama Installation
  ```bash
  # macOS (using Homebrew)
  brew install ollama

  # Linux (using curl)
  curl -fsSL https://ollama.com/install.sh | sh

  # Windows
  # Download from https://ollama.com/download/windows
  ```
- Model Initialization for Ollama
  ```bash
  # Pull high-performance local model (e.g., Llama2)
  ollama pull llama2

  # Optional: Pull additional specialized models
  ollama pull mistral
  ollama pull codellama
  ```
Environment Configuration
Create a .env file in the project root with the following parameters:
```text
# OpenAI Configuration
OPENAI_API_KEY=sk-...
OPENAI_ORG_ID=org-... # Optional

# Model Configuration
OPENAI_MODEL=gpt-4o
OLLAMA_MODEL=llama2
OLLAMA_HOST=http://localhost:11434

# System Behavior
TEMPERATURE=0.7
MAX_TOKENS=4096
REQUEST_TIMEOUT=120

# Routing Configuration
COMPLEXITY_THRESHOLD=0.65
PRIVACY_SENSITIVE_TOKENS=["password", "secret", "token", "key", "credential"]

# Logging Configuration
LOG_LEVEL=INFO
```
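Every value in this file reaches the application as a string, so the numeric thresholds and the JSON-encoded token list need explicit conversion. The following is a minimal sketch of one way to do that with the standard library only; the `Settings` dataclass and `load_settings` function are hypothetical names, not part of the repository.

```python
import json
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    """Typed view of the model and routing values from .env."""
    openai_model: str
    ollama_model: str
    ollama_host: str
    temperature: float
    max_tokens: int
    complexity_threshold: float
    privacy_sensitive_tokens: tuple


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Build Settings from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    return Settings(
        openai_model=env.get("OPENAI_MODEL", "gpt-4o"),
        ollama_model=env.get("OLLAMA_MODEL", "llama2"),
        ollama_host=env.get("OLLAMA_HOST", "http://localhost:11434"),
        temperature=float(env.get("TEMPERATURE", "0.7")),
        max_tokens=int(env.get("MAX_TOKENS", "4096")),
        complexity_threshold=float(env.get("COMPLEXITY_THRESHOLD", "0.65")),
        # The token list is stored as a JSON array, so decode it explicitly.
        privacy_sensitive_tokens=tuple(
            json.loads(env.get("PRIVACY_SENSITIVE_TOKENS", "[]"))
        ),
    )


# Example with an explicit mapping instead of the real environment.
settings = load_settings(
    {"COMPLEXITY_THRESHOLD": "0.8", "PRIVACY_SENSITIVE_TOKENS": '["password", "key"]'}
)
print(settings.complexity_threshold, settings.privacy_sensitive_tokens)
```

In the application itself, calling `load_dotenv()` first and then `load_settings()` with no argument would read the same keys from the process environment.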
Development Environment Setup
Repository Initialization
```bash
git clone https://github.com/kliewerdaniel/OpenAIAgentsSDKOllama01.git
cd OpenAIAgentsSDKOllama01
```
Project Structure Implementation
```bash
mkdir -p app/core app/models app/routers app/services app/utils tests
touch app/__init__.py app/core/__init__.py app/models/__init__.py app/routers/__init__.py app/services/__init__.py app/utils/__init__.py
```
Local Development Server
```bash
# Start Ollama service
ollama serve

# In a separate terminal, start the application
uvicorn app.main:app --reload
```
Containerization (Optional)
For reproducible environments and deployment consistency:
```dockerfile
# Dockerfile
FROM python:3.11-slim

WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]
```
With Docker Compose integration for Ollama:
```yaml
# docker-compose.yml
version: '3.8'

services:
  app:
    build: .
    ports:
      - "8000:8000"
    environment:
      - OLLAMA_HOST=http://ollama:11434
    depends_on:
      - ollama
    volumes:
      - .:/app

  ollama:
    image: ollama/ollama:latest
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama

volumes:
  ollama_data:
```
Verification of Installation
To validate the environment configuration:
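```bash
# importlib.metadata reads the installed package versions, so this works
# even if a package does not expose a __version__ attribute
python -c "from importlib.metadata import version; print('OpenAI SDK Version:', version('openai')); print('Ollama Client Version:', version('ollama'))"
```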
To test Ollama connectivity:
```bash
python -c "import ollama; print(ollama.list())"
```
To test OpenAI API connectivity:
```bash
python -c "import openai; import os; from dotenv import load_dotenv; load_dotenv(); client = openai.OpenAI(); print(client.models.list())"
```
This comprehensive environment setup establishes the foundation for a sophisticated hybrid AI system that leverages both cloud-based intelligence and local inference capabilities. The configuration allows for flexible routing of requests based on privacy considerations, computational complexity, and performance requirements.
Integration Architecture: OpenAI Responses API within the MCP Framework
Theoretical Framework for API Integration
The integration of OpenAI's Responses API within our Modern Computational Paradigm (MCP) framework represents a sophisticated exercise in distributed intelligence architecture. This document delineates the structural components, interface definitions, and operational parameters for establishing a cohesive integration that leverages both cloud-based and local inference capabilities.
API Architectural Design
Core Endpoints Structure
The system exposes a carefully designed set of endpoints that abstract the underlying complexity of model routing and response generation:
```text
/api/v1
├── /chat
│   ├── POST /completions      # Primary conversational interface
│   ├── POST /streaming        # Event-stream response generation
│   └── POST /hybrid           # Intelligent routing between OpenAI and Ollama
├── /tools
│   ├── POST /execute          # Tool execution framework
│   └── GET /available         # Tool discovery mechanism
├── /agents
│   ├── POST /run              # Agent execution with Agents SDK
│   ├── GET /status/{run_id}   # Asynchronous execution status
│   └── POST /cancel/{run_id}  # Execution termination
└── /system
    ├── GET /health            # Service health verification
    ├── GET /models            # Available model enumeration
    └── POST /config           # Runtime configuration adjustment
```
Request/Response Schemata
Primary Chat Interface
```json
// POST /api/v1/chat/completions
// Request
{
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain quantum computing."}
  ],
  "model": "auto", // "auto", "openai:<model_id>", or "ollama:<model_id>"
  "temperature": 0.7,
  "max_tokens": 1024,
  "stream": false,
  "routing_preferences": {
    "force_provider": null, // null, "openai", "ollama"
    "privacy_level": "standard", // "standard", "high", "max"
    "latency_preference": "balanced" // "speed", "balanced", "quality"
  },
  "tools": [...] // Optional tool definitions
}

// Response
{
  "id": "resp_abc123",
  "object": "chat.completion",
  "created": 1677858242,
  "provider": "openai", // The actual provider used
  "model": "gpt-4o",
  "usage": {
    "prompt_tokens": 56,
    "completion_tokens": 325,
    "total_tokens": 381
  },
  "message": {
    "role": "assistant",
    "content": "Quantum computing is...",
    "tool_calls": [] // Optional tool calls if requested
  },
  "routing_metrics": {
    "complexity_score": 0.78,
    "privacy_impact": "low",
    "decision_factors": ["complexity", "tool_requirements"]
  }
}
```
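The `model` field accepts three shapes: `"auto"`, `"openai:<model_id>"`, and `"ollama:<model_id>"`. A small parser makes that contract explicit before any routing happens. This is a sketch under the assumption that anything else should be rejected; `ModelSpec` and `parse_model_spec` are hypothetical names.

```python
from typing import NamedTuple, Optional


class ModelSpec(NamedTuple):
    provider: Optional[str]  # None means "let the router decide"
    model_id: Optional[str]


def parse_model_spec(value: str) -> ModelSpec:
    """Split a request's model string into provider and model id."""
    if value == "auto":
        return ModelSpec(None, None)
    # partition splits on the first colon only, so Ollama tags such as
    # "llama2:7b" stay intact in the model id.
    provider, sep, model_id = value.partition(":")
    if not sep or provider not in {"openai", "ollama"} or not model_id:
        raise ValueError(f"unsupported model spec: {value!r}")
    return ModelSpec(provider, model_id)


print(parse_model_spec("ollama:llama2:7b"))
```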
Agent Execution Interface
```json
// POST /api/v1/agents/run
// Request
{
  "agent_config": {
    "instructions": "You are a research assistant. Help the user find information about recent AI developments.",
    "model": "gpt-4o",
    "tools": [
      // Tool definitions following OpenAI's format
    ]
  },
  "messages": [
    {"role": "user", "content": "Find recent papers on transformer efficiency."}
  ],
  "metadata": {
    "session_id": "user_session_abc123",
    "locale": "en-US"
  }
}

// Response
{
  "run_id": "run_def456",
  "status": "in_progress",
  "created_at": 1677858242,
  "estimated_completion_time": 1677858260,
  "polling_url": "/api/v1/agents/status/run_def456"
}
```
Authentication & Security Framework
Authentication Mechanisms
The system implements a layered authentication approach:
- API Key Authentication
  `Authorization: Bearer {api_key}`
- OpenAI Credential Management
  - Server-side credential storage with encryption at rest
  - Optional client-provided credentials per request
  ```json
  // Optional credential override
  {
    "auth_override": {
      "openai_api_key": "sk-...",
      "openai_org_id": "org-..."
    }
  }
  ```
- Session-Based Authentication (Web Interface)
  - JWT-based authentication with refresh token rotation
  - PKCE flow for authorization code exchanges
Security Considerations
- TLS 1.3 required for all communications
- Request signing for high-security deployments
- Content-Security-Policy headers to prevent XSS
- Rate limiting by user/IP with exponential backoff
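The source does not specify a signing scheme for the request-signing item, so the following is one common approach shown purely as an illustration: an HMAC-SHA256 over the method, path, timestamp, and body. The function names and the canonical message layout are assumptions.

```python
import hashlib
import hmac


def sign_request(secret: bytes, method: str, path: str, body: bytes, timestamp: str) -> str:
    """Return a hex HMAC-SHA256 signature over the request's key parts."""
    # Assumed canonical form: method, path, timestamp, body joined by newlines.
    message = b"\n".join(
        [method.upper().encode(), path.encode(), timestamp.encode(), body]
    )
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def verify_signature(
    secret: bytes, method: str, path: str, body: bytes, timestamp: str, signature: str
) -> bool:
    """Recompute the signature and compare in constant time."""
    expected = sign_request(secret, method, path, body, timestamp)
    return hmac.compare_digest(expected, signature)


sig = sign_request(b"shared-secret", "POST", "/api/v1/chat/completions", b"{}", "1700000000")
print(verify_signature(b"shared-secret", "POST", "/api/v1/chat/completions", b"{}", "1700000000", sig))
```

Including the timestamp in the signed message lets the server reject stale requests, which limits replay; the acceptable clock skew would be a deployment decision.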
Error Handling Architecture
The system implements a comprehensive error handling framework:
```json
// Error Response Structure
{
  "error": {
    "code": "provider_error",
    "message": "OpenAI API returned an error",
    "details": {
      "provider": "openai",
      "status_code": 429,
      "original_message": "Rate limit exceeded",
      "request_id": "req_ghi789"
    },
    "remediation": {
      "retry_after": 30,
      "alternatives": ["switch_provider", "reduce_complexity"],
      "fallback_available": true
    }
  }
}
```
Error Categories
- Provider Errors (`provider_error`)
  - OpenAI API failures
  - Ollama execution failures
  - Network connectivity issues
- Input Validation Errors (`validation_error`)
  - Schema validation failures
  - Content policy violations
  - Size limit exceedances
- System Errors (`system_error`)
  - Resource exhaustion
  - Internal component failures
  - Dependency service outages
- Authentication Errors (`auth_error`)
  - Invalid credentials
  - Expired tokens
  - Insufficient permissions
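A single helper that builds the error envelope keeps every endpoint's failures in the same shape. The sketch below assumes the codes listed above plus the `rate_limit_exceeded` code used by the rate-limit response; `build_error` is a hypothetical helper name.

```python
from typing import Any, Dict, Optional

# Codes taken from the categories above, plus the rate-limit code.
ERROR_CODES = {
    "provider_error",
    "validation_error",
    "system_error",
    "auth_error",
    "rate_limit_exceeded",
}


def build_error(
    code: str,
    message: str,
    details: Optional[Dict[str, Any]] = None,
    remediation: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Return the standard error envelope, omitting empty optional sections."""
    if code not in ERROR_CODES:
        raise ValueError(f"unknown error code: {code}")
    error: Dict[str, Any] = {"code": code, "message": message}
    if details:
        error["details"] = details
    if remediation:
        error["remediation"] = remediation
    return {"error": error}


payload = build_error(
    "provider_error",
    "OpenAI API returned an error",
    details={"provider": "openai", "status_code": 429},
    remediation={"retry_after": 30, "fallback_available": True},
)
print(payload)
```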
Rate Limiting Architecture
The system implements a sophisticated rate limiting structure:
Tiered Rate Limiting
```text
Standard tier:
  - 10 requests/minute
  - 100 requests/hour
  - 1000 requests/day

Premium tier:
  - 60 requests/minute
  - 1000 requests/hour
  - 10000 requests/day
```
Dynamic Rate Adjustment
- Token bucket algorithm with dynamic refill rates
- Separate buckets for different endpoint categories
- Priority-based token distribution
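The token bucket idea in the first item can be sketched in a few lines: a bucket holds up to `capacity` tokens, refills continuously at `refill_rate` tokens per second, and each request spends tokens. This is an in-memory, single-process illustration only; the class name and the injectable `clock` parameter are assumptions, and a production deployment would need shared state.

```python
import time
from typing import Callable


class TokenBucket:
    """In-memory token bucket with an adjustable refill rate."""

    def __init__(
        self,
        capacity: float,
        refill_rate: float,
        clock: Callable[[], float] = time.monotonic,
    ):
        self.capacity = capacity
        self.refill_rate = refill_rate  # tokens added per second
        self._clock = clock
        self._tokens = capacity  # start full
        self._updated = clock()

    def _refill(self) -> None:
        # Credit tokens for the time elapsed since the last update.
        now = self._clock()
        elapsed = now - self._updated
        self._tokens = min(self.capacity, self._tokens + elapsed * self.refill_rate)
        self._updated = now

    def set_refill_rate(self, refill_rate: float) -> None:
        # Settle tokens at the old rate before switching to the new one.
        self._refill()
        self.refill_rate = refill_rate

    def try_consume(self, cost: float = 1.0) -> bool:
        """Spend `cost` tokens if available; return whether the request is allowed."""
        self._refill()
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False


# Demonstration with a fake clock so the behaviour is deterministic.
now = [0.0]
bucket = TokenBucket(capacity=2, refill_rate=1.0, clock=lambda: now[0])
first_three = [bucket.try_consume() for _ in range(3)]
now[0] = 1.0  # one second later, one token has been refilled
after_wait = bucket.try_consume()
print(first_three, after_wait)
```

Separate buckets per endpoint category would simply be separate `TokenBucket` instances keyed by user and category.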
Rate Limit Response
```json
{
  "error": {
    "code": "rate_limit_exceeded",
    "message": "You have exceeded the rate limit",
    "details": {
      "rate_limit": {
        "tier": "standard",
        "limit": "10 per minute",
        "remaining": 0,
        "reset_at": "2023-03-01T12:35:00Z",
        "retry_after": 25
      },
      "usage_statistics": {
        "current_minute": 11,
        "current_hour": 43,
        "current_day": 178
      }
    },
    "remediation": {
      "upgrade_url": "/account/upgrade",
      "alternatives": ["reduce_frequency", "batch_requests"]
    }
  }
}
```
Implementation Strategy
Provider Abstraction Layer
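```python
# Pseudocode for the Provider Abstraction Layer
from abc import ABC, abstractmethod


class ModelProvider(ABC):
    @abstractmethod
    async def generate_completion(self, messages, params):
        """Return a complete response for the given messages."""

    @abstractmethod
    async def stream_completion(self, messages, params):
        """Yield response chunks for the given messages."""

    @classmethod
    def get_provider(cls, provider_name, model_id):
        # OpenAIProvider, OllamaProvider, and AutoRoutingProvider are the
        # concrete subclasses, defined elsewhere in the application.
        if provider_name == "openai":
            return OpenAIProvider(model_id)
        elif provider_name == "ollama":
            return OllamaProvider(model_id)
        else:
            return AutoRoutingProvider()
```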
Intelligent Routing Decision Engine
```python
# Pseudocode for Routing Logic
# The _analyze_complexity, _assess_privacy_impact, and
# _check_tool_compatibility helpers are left to the implementation.
class RoutingEngine:
    def __init__(self, config):
        self.config = config

    async def determine_route(self, request):
        # Analyze request complexity
        complexity = self._analyze_complexity(request.messages)

        # Check for privacy constraints
        privacy_impact = self._assess_privacy_impact(request.messages)

        # Consider tool requirements (available_providers is assumed to
        # live on the config; computed here but not yet used below)
        tools_compatible = self._check_tool_compatibility(
            request.tools, self.config.available_providers)

        # Make routing decision: an explicit choice always wins
        if request.routing_preferences.force_provider:
            return request.routing_preferences.force_provider

        if privacy_impact == "high" and self.config.privacy_first:
            return "ollama"

        if complexity > self.config.complexity_threshold:
            return "openai"

        # Default routing logic
        return "ollama" if self.config.prefer_local else "openai"
```
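The routing pseudocode leaves its complexity and privacy helpers undefined. As a rough illustration of what they might compute, the sketch below uses deliberately naive heuristics: a whole-word match against the sensitive-token list from the `.env` file, and prompt length as a stand-in for complexity. Both functions and the `max_words` parameter are assumptions, not the repository's logic.

```python
import re
from typing import Dict, Iterable, List

# Mirrors PRIVACY_SENSITIVE_TOKENS from the .env example.
SENSITIVE_TOKENS = ("password", "secret", "token", "key", "credential")


def assess_privacy_impact(
    messages: List[Dict[str, str]],
    sensitive_tokens: Iterable[str] = SENSITIVE_TOKENS,
) -> str:
    """Return "high" if any message contains a sensitive word, else "low"."""
    text = " ".join(m["content"].lower() for m in messages)
    # Whole-word matching, so "keyboard" does not trigger on "key";
    # "api_key" splits into "api" and "key" and does match.
    words = set(re.findall(r"[a-z]+", text))
    return "high" if words & set(sensitive_tokens) else "low"


def analyze_complexity(messages: List[Dict[str, str]], max_words: int = 400) -> float:
    """Crude proxy in [0, 1]: longer conversations score closer to 1.0."""
    words = sum(len(m["content"].split()) for m in messages)
    return min(1.0, words / max_words)


private = assess_privacy_impact([{"role": "user", "content": "My password is hunter2"}])
benign = assess_privacy_impact([{"role": "user", "content": "Explain keyboard layouts"}])
score = analyze_complexity([{"role": "user", "content": "one two three four"}], max_words=8)
print(private, benign, score)
```

Word matching of this kind misses plurals and paraphrases, so a real deployment would likely want a stronger classifier; the point here is only the shape of the inputs and outputs the router expects.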
Authentication Implementation
```python
# Middleware for API Key Authentication
from fastapi import Request
from fastapi.responses import JSONResponse


async def api_key_middleware(request: Request, call_next):
    auth_header = request.headers.get("Authorization")

    if not auth_header or not auth_header.startswith("Bearer "):
        return JSONResponse(
            status_code=401,
            content={"error": {
                "code": "auth_error",
                "message": "Missing or invalid API key"
            }}
        )

    # Extract and validate token. removeprefix strips only the leading
    # scheme; validate_api_key is an application-defined coroutine that
    # returns the user for a valid key, or None.
    token = auth_header.removeprefix("Bearer ").strip()
    user = await validate_api_key(token)

    if not user:
        return JSONResponse(
            status_code=401,
            content={"error": {
                "code": "auth_error",
                "message": "Invalid API key"
            }}
        )

    # Attach user to request state
    request.state.user = user
    return await call_next(request)
```
Rate Limiting Implementation
```python
# Rate Limiter Implementation
# Per-tier limits, taken from the tier table above
TIER_LIMITS = {
    "standard": {"per_minute": 10, "per_hour": 100},
    "premium": {"per_minute": 60, "per_hour": 1000},
}


class RateLimiter:
    def __init__(self, redis_client):
        # Expects an asyncio Redis client (e.g. redis.asyncio.Redis)
        self.redis = redis_client

    async def check_rate_limit(self, user_id, endpoint_category):
        # Generate Redis keys for different time windows
        minute_key = f"rate:user:{user_id}:{endpoint_category}:minute"
        hour_key = f"rate:user:{user_id}:{endpoint_category}:hour"

        # Get user tier and corresponding limits
        user_tier = await self._get_user_tier(user_id)
        tier_limits = TIER_LIMITS[user_tier]

        # Increment both counters in one round trip. Note that expire is
        # refreshed on every request, so a window only resets after the
        # client has been idle for its full duration.
        pipe = self.redis.pipeline()
        pipe.incr(minute_key)
        pipe.expire(minute_key, 60)
        pipe.incr(hour_key)
        pipe.expire(hour_key, 3600)
        results = await pipe.execute()

        minute_count, _, hour_count, _ = results

        # Check if limits are exceeded
        if minute_count > tier_limits["per_minute"]:
            return {
                "allowed": False,
                "window": "minute",
                "limit": tier_limits["per_minute"],
                "current": minute_count,
                # _calculate_retry_after is a coroutine and must be awaited
                "retry_after": await self._calculate_retry_after(minute_key)
            }

        if hour_count > tier_limits["per_hour"]:
            return {
                "allowed": False,
                "window": "hour",
                "limit": tier_limits["per_hour"],
                "current": hour_count,
                "retry_after": await self._calculate_retry_after(hour_key)
            }

        return {"allowed": True}

    async def _get_user_tier(self, user_id):
        # Placeholder: look the tier up in the application's user store
        return "standard"

    async def _calculate_retry_after(self, key):
        ttl = await self.redis.ttl(key)
        return max(1, ttl)
```
Operational Considerations
- Monitoring and Observability
  - Structured logging with correlation IDs
  - Prometheus metrics for request routing decisions
  - Tracing with OpenTelemetry
- Fallback Mechanisms
  - Circuit breaker pattern for provider failures
  - Graceful degradation to simpler models
  - Response caching for common queries
- Deployment Strategy
  - Containerized deployment with Kubernetes
  - Blue/green deployment for zero-downtime updates
  - Regional deployment for latency optimization
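The circuit breaker mentioned under Fallback Mechanisms can be sketched as a small state holder: after a run of consecutive failures the circuit opens and calls to that provider are skipped until a cool-down has passed. The class below is a minimal, single-process illustration; the names, thresholds, and the injectable `clock` are assumptions.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Open after `failure_threshold` consecutive failures; retry after `reset_timeout` seconds."""

    def __init__(
        self,
        failure_threshold: int = 3,
        reset_timeout: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def allow_request(self) -> bool:
        if self._opened_at is None:
            return True  # closed: traffic flows normally
        # After the cool-down, requests are allowed again (half-open);
        # a further failure re-opens the circuit immediately.
        return self._clock() - self._opened_at >= self.reset_timeout

    def record_success(self) -> None:
        self._failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._opened_at = self._clock()


# Demonstration with a fake clock.
now = [0.0]
breaker = CircuitBreaker(failure_threshold=2, reset_timeout=10.0, clock=lambda: now[0])
breaker.record_failure()
breaker.record_failure()
blocked = breaker.allow_request()   # circuit is open
now[0] = 10.0
retried = breaker.allow_request()   # cool-down elapsed
print(blocked, retried)
```

A router would check `allow_request()` for the preferred provider and fall through to the other provider while the circuit is open.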
Conclusion
This integration architecture establishes a robust framework for leveraging both OpenAI's cloud capabilities and Ollama's local inference within a unified system. The design emphasizes flexibility, security, and resilience while providing sophisticated routing logic to optimize for different operational parameters including cost, privacy, and performance.
The implementation allows for progressive enhancement as requirements evolve, with clear extension points for additional providers, tools, and routing strategies.
Autonomous Agent Architecture: Python Implementations for MCP Integration
Theoretical Framework for Agent Design
This collection of Python implementations establishes a comprehensive agent architecture leveraging the Modern Computational Paradigm (MCP) system. The design emphasizes cognitive capabilities including knowledge retrieval, conversation flow management, and contextual awareness through a modular approach to agent construction.
Core Agent Infrastructure
Base Agent Class
```python
# app/agents/base_agent.py
from abc import ABC, abstractmethod
from typing import Dict, List, Any, Optional
import uuid
import logging
from pydantic import BaseModel, Field

from app.services.provider_service import ProviderService
from app.models.message import Message, MessageRole
from app.models.tool import Tool

logger = logging.getLogger(__name__)

class AgentState(BaseModel):
    """Represents the internal state of an agent."""
    conversation_history: List[Message] = Field(default_factory=list)
    memory: Dict[str, Any] = Field(default_factory=dict)
    context: Dict[str, Any] = Field(default_factory=dict)
    metadata: Dict[str, Any] = Field(default_factory=dict)
    session_id: str = Field(default_factory=lambda: str(uuid.uuid4()))

class BaseAgent(ABC):
    """Abstract base class for all agents in the system."""

    def __init__(
        self,
        provider_service: ProviderService,
        system_prompt: str,
        tools: Optional[List[Tool]] = None,
        state: Optional[AgentState] = None
    ):
        self.provider_service = provider_service
        self.system_prompt = system_prompt
        self.tools = tools or []
        self.state = state or AgentState()

        # Initialize conversation with system prompt
        self._initialize_conversation()

    def _initialize_conversation(self):
        """Initialize the conversation history with the system prompt."""
        self.state.conversation_history.append(
            Message(role=MessageRole.SYSTEM, content=self.system_prompt)
        )

    async def process_message(self, message: str, user_id: str) -> str:
        """Process a user message and return a response."""
        # Add user message to conversation history
        user_message = Message(role=MessageRole.USER, content=message)
        self.state.conversation_history.append(user_message)

        # Process the message and generate a response
        response = await self._generate_response(user_id)

        # Add assistant response to conversation history
        assistant_message = Message(role=MessageRole.ASSISTANT, content=response)
        self.state.conversation_history.append(assistant_message)

        return response

    @abstractmethod
    async def _generate_response(self, user_id: str) -> str:
        """Generate a response based on the conversation history."""
        pass

    async def add_context(self, key: str, value: Any):
        """Add contextual information to the agent's state."""
        self.state.context[key] = value

    def get_conversation_history(self) -> List[Message]:
        """Return the conversation history."""
        return self.state.conversation_history

    def clear_conversation(self, keep_system_prompt: bool = True):
        """Clear the conversation history."""
        if keep_system_prompt and self.state.conversation_history:
            system_messages = [
                msg for msg in self.state.conversation_history
                if msg.role == MessageRole.SYSTEM
            ]
            self.state.conversation_history = system_messages
        else:
            self.state.conversation_history = []
            # Re-seed the system prompt only when the caller asked to keep it
            if keep_system_prompt:
                self._initialize_conversation()
```
Specialized Agent Implementations
Research Agent with Knowledge Retrieval
```python
# app/agents/research_agent.py
from typing import List, Dict, Any, Optional
import json
import logging

from app.agents.base_agent import BaseAgent
from app.services.knowledge_service import KnowledgeService
from app.models.message import Message, MessageRole
from app.models.tool import Tool

logger = logging.getLogger(__name__)

class ResearchAgent(BaseAgent):
    """Agent specialized for research tasks with knowledge retrieval capabilities."""

    def __init__(self, *args, knowledge_service: KnowledgeService, **kwargs):
        super().__init__(*args, **kwargs)
        self.knowledge_service = knowledge_service

        # Register knowledge retrieval tools
        self.tools.extend([
            Tool(
                name="search_knowledge_base",
                description="Search the knowledge base for relevant information",
                parameters={
                    "type": "object",
                    "properties": {
                        "query": {
                            "type": "string",
                            "description": "The search query"
                        },
                        "max_results": {
                            "type": "integer",
                            "description": "Maximum number of results to return",
                            "default": 3
                        }
                    },
                    "required": ["query"]
                }
            ),
            Tool(
                name="retrieve_document",
                description="Retrieve a specific document by ID",
                parameters={
                    "type": "object",
                    "properties": {
                        "document_id": {
                            "type": "string",
                            "description": "The ID of the document to retrieve"
                        }
                    },
                    "required": ["document_id"]
                }
            )
        ])

    async def _generate_response(self, user_id: str) -> str:
        """Generate a response with knowledge augmentation."""
        # Extract the last user message
        last_user_message = next(
            (msg for msg in reversed(self.state.conversation_history)
             if msg.role == MessageRole.USER),
            None
        )

        if not last_user_message:
            return "I don't have any messages to respond to."

        # Perform knowledge retrieval to augment the response
        relevant_information = await self._retrieve_relevant_knowledge(last_user_message.content)

        # Add retrieved information to context
        if relevant_information:
            context_message = Message(
                role=MessageRole.SYSTEM,
                content=f"Relevant information: {relevant_information}"
            )
            augmented_history = self.state.conversation_history.copy()
            # Insert just before the final (user) message
            augmented_history.insert(-1, context_message)
        else:
            augmented_history = self.state.conversation_history

        # Generate response using the provider service
        response = await self.provider_service.generate_completion(
            messages=[msg.model_dump() for msg in augmented_history],
            tools=self.tools,
            user=user_id
        )

        # Process tool calls if any
        if response.get("tool_calls"):
            tool_responses = await self._process_tool_calls(response["tool_calls"])

            # Add tool responses to conversation history
            for tool_response in tool_responses:
                self.state.conversation_history.append(
                    Message(
                        role=MessageRole.TOOL,
                        content=tool_response["content"],
                        tool_call_id=tool_response["tool_call_id"]
                    )
                )

            # Generate a new response with tool results
            final_response = await self.provider_service.generate_completion(
                messages=[msg.model_dump() for msg in self.state.conversation_history],
                tools=self.tools,
                user=user_id
            )
            return final_response["message"]["content"]

        return response["message"]["content"]

    async def _retrieve_relevant_knowledge(self, query: str) -> Optional[str]:
        """Retrieve relevant information from knowledge base."""
        try:
            results = await self.knowledge_service.search(query, max_results=3)

            if not results:
                return None

            # Format the results
            formatted_results = "\n\n".join([
                f"Source: {result['title']}\n"
                f"Content: {result['content']}\n"
                f"Relevance: {result['relevance_score']}"
                for result in results
            ])

            return formatted_results
        except Exception as e:
            logger.error(f"Error retrieving knowledge: {str(e)}")
            return None

    async def _process_tool_calls(self, tool_calls: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
        """Process tool calls and return tool responses."""
        tool_responses = []

        for tool_call in tool_calls:
            tool_name = tool_call["function"]["name"]
            tool_args = tool_call["function"]["arguments"]
            tool_call_id = tool_call["id"]

            try:
                # OpenAI-style tool calls carry arguments as a JSON string
                if isinstance(tool_args, str):
                    tool_args = json.loads(tool_args)

                if tool_name == "search_knowledge_base":
                    results = await self.knowledge_service.search(
                        query=tool_args["query"],
                        max_results=tool_args.get("max_results", 3)
                    )
                    formatted_results = "\n\n".join([
                        f"Document ID: {result['id']}\n"
                        f"Title: {result['title']}\n"
                        f"Summary: {result['summary']}"
                        for result in results
                    ])

                    tool_responses.append({
                        "tool_call_id": tool_call_id,
                        "content": formatted_results or "No results found."
                    })

                elif tool_name == "retrieve_document":
                    document = await self.knowledge_service.retrieve_document(
                        document_id=tool_args["document_id"]
                    )

                    if document:
                        tool_responses.append({
                            "tool_call_id": tool_call_id,
                            "content": f"Title: {document['title']}\n\n{document['content']}"
                        })
                    else:
                        tool_responses.append({
                            "tool_call_id": tool_call_id,
                            "content": "Document not found."
                        })
            except Exception as e:
                logger.error(f"Error processing tool call {tool_name}: {str(e)}")
                tool_responses.append({
                    "tool_call_id": tool_call_id,
                    "content": f"Error processing tool call: {str(e)}"
                })

        return tool_responses
```
Conversational Flow Manager Agent
```python
# app/agents/conversation_manager.py
from typing import Dict, List, Any, Optional
import logging
import json

from pydantic import BaseModel, Field  # needed by ConversationState below

from app.agents.base_agent import BaseAgent
from app.models.message import Message, MessageRole

logger = logging.getLogger(__name__)

class ConversationState(BaseModel):
    """Tracks the state of a conversation."""
    current_topic: Optional[str] = None
    topic_history: List[str] = Field(default_factory=list)
    user_preferences: Dict[str, Any] = Field(default_factory=dict)
    conversation_stage: str = "opening"  # opening, exploring, focusing, concluding
    open_questions: List[str] = Field(default_factory=list)
    satisfaction_score: Optional[float] = None

class ConversationManager(BaseAgent):
    """Agent specialized in managing conversation flow and context."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.conversation_state = ConversationState()

        # Register conversation management tools
        self.tools.extend([
            {
                "type": "function",
                "function": {
                    "name": "update_conversation_state",
                    "description": "Update the state of the conversation based on analysis",
                    "parameters": {
                        "type": "object",
                        "properties": {
                            "current_topic": {
                                "type": "string",
                                "description": "The current topic of conversation"
                            },
                            "conversation_stage": {
                                "type": "string",
                                "description": "The current stage of the conversation",
                                "enum": ["opening", "exploring", "focusing", "concluding"]
                            },
                            "detected_preferences": {
                                "type": "object",
                                "description": "Preferences detected from the user"
                            },
                            "open_questions": {
                                "type": "array",
                                "items": {"type": "string"},
                                "description": "Questions that remain unanswered"
                            },
                            "satisfaction_estimate": {
                                "type": "number",
                                "description": "Estimated user satisfaction (0-1)"
                            }
                        }
                    }
                }
            }
        ])

    async def _generate_response(self, user_id: str) -> str:
        """Generate a response with conversation flow management."""
        # First, analyze the conversation to update state
        analysis_prompt = self._create_analysis_prompt()

        analysis_messages = [
            {"role": "system", "content": analysis_prompt},
            {"role": "user", "content": "Analyze the following conversation and update the conversation state."},
            {"role": "user", "content": self._format_conversation_history()}
        ]

        analysis_response = await self.provider_service.generate_completion(
            messages=analysis_messages,
            tools=self.tools,
            tool_choice={"type": "function", "function": {"name": "update_conversation_state"}},
            user=user_id
        )

        # Process conversation state update
        if analysis_response.get("tool_calls"):
            tool_call = analysis_response["tool_calls"][0]
            if tool_call["function"]["name"] == "update_conversation_state":
                try:
                    state_update = json.loads(tool_call["function"]["arguments"])
                    self._update_conversation_state(state_update)
                except Exception as e:
                    logger.error(f"Error updating conversation state: {str(e)}")

        # Now generate the actual response with enhanced context
        enhanced_messages = self.state.conversation_history.copy()

        # Add conversation state as context, just before the last message
        context_message = Message(
            role=MessageRole.SYSTEM,
            content=self._format_conversation_context()
        )
        enhanced_messages.insert(-1, context_message)

        response = await self.provider_service.generate_completion(
            messages=[msg.model_dump() for msg in enhanced_messages],
            user=user_id
        )

        return response["message"]["content"]

    def _create_analysis_prompt(self) -> str:
        """Create a prompt for conversation analysis."""
        return """
        You are a conversation analysis expert. Your task is to analyze the conversation
        and extract key information about the current state of the dialogue.

        Specifically, you should:
        1. Identify the current main topic of conversation
        2. Determine the stage of the conversation (opening, exploring, focusing, or concluding)
        3. Detect user preferences and interests from their messages
        4. Track open questions that haven't been fully addressed
        5. Estimate user satisfaction based on their engagement and responses

        Use the update_conversation_state function to provide this analysis.
        """

    def _format_conversation_history(self) -> str:
        """Format the conversation history for analysis."""
        formatted = []

        for msg in self.state.conversation_history:
            if msg.role == MessageRole.SYSTEM:
                continue
            formatted.append(f"{msg.role.value}: {msg.content}")

        return "\n\n".join(formatted)

    def _update_conversation_state(self, update: Dict[str, Any]):
        """Update the conversation state with analysis results."""
        if "current_topic" in update and update["current_topic"]:
            if self.conversation_state.current_topic != update["current_topic"]:
                if self.conversation_state.current_topic:
                    self.conversation_state.topic_history.append(
                        self.conversation_state.current_topic
                    )
                self.conversation_state.current_topic = update["current_topic"]

        if "conversation_stage" in update:
            self.conversation_state.conversation_stage = update["conversation_stage"]

        if "detected_preferences" in update:
            for key, value in update["detected_preferences"].items():
                self.conversation_state.user_preferences[key] = value

        if "open_questions" in update:
            self.conversation_state.open_questions = update["open_questions"]

        if "satisfaction_estimate" in update:
            self.conversation_state.satisfaction_score = update["satisfaction_estimate"]

    def _format_conversation_context(self) -> str:
        """Format the conversation state as context for response generation."""
        return f"""
        Current conversation context:
        - Topic: {self.conversation_state.current_topic or 'Not yet established'}
        - Conversation stage: {self.conversation_state.conversation_stage}
        - User preferences: {json.dumps(self.conversation_state.user_preferences, indent=2)}
        - Open questions: {', '.join(self.conversation_state.open_questions) if self.conversation_state.open_questions else 'None'}

        Previous topics: {', '.join(self.conversation_state.topic_history) if self.conversation_state.topic_history else 'None'}

        Adapt your response to this conversation context. If in exploring stage, ask open-ended questions.
        If in focusing stage, provide detailed information on the current topic. If in concluding stage,
        summarize key points and check if the user needs anything else.
        """
```
Memory-Enhanced Contextual Agent
```python
# app/agents/contextual_agent.py
from typing import List, Dict, Any, Optional, Tuple
import hashlib
import json
import logging
import time
from datetime import datetime

from app.agents.base_agent import BaseAgent
from app.services.memory_service import MemoryService
from app.models.message import Message, MessageRole

logger = logging.getLogger(__name__)

class ContextualAgent(BaseAgent):
    """Agent with enhanced contextual awareness and memory capabilities."""

    def __init__(self, *args, memory_service: MemoryService, **kwargs):
        super().__init__(*args, **kwargs)
        self.memory_service = memory_service

        # Initialize memory collections
        self.episodic_memory = []  # Stores specific interactions/events
        self.semantic_memory = {}  # Stores facts and knowledge
        self.working_memory = []   # Currently active context

        self.max_working_memory = 10  # Max items in working memory

    async def _generate_response(self, user_id: str) -> str:
        """Generate a response with contextual memory enhancement."""
        # Update memories based on recent conversation
        await self._update_memories(user_id)

        # Retrieve relevant memories for current context
        relevant_memories = await self._retrieve_relevant_memories(user_id)

        # Create context-enhanced prompt
        context_message = Message(
            role=MessageRole.SYSTEM,
            content=self._create_context_prompt(relevant_memories)
        )

        # Insert context before the last user message
        enhanced_history = self.state.conversation_history.copy()
        user_message_index = next(
            (i for i, msg in enumerate(reversed(enhanced_history))
             if msg.role == MessageRole.USER),
            None
        )
        if user_message_index is not None:
            # Convert the index in the reversed list back to a forward index
            user_message_index = len(enhanced_history) - 1 - user_message_index
            enhanced_history.insert(user_message_index, context_message)

        # Generate response
        response = await self.provider_service.generate_completion(
            messages=[msg.model_dump() for msg in enhanced_history],
            tools=self.tools,
            user=user_id
        )

        # Process memory-related tool calls if any
        if response.get("tool_calls"):
            memory_updates = await self._process_memory_tools(response["tool_calls"])
            if memory_updates:
                # If memory was updated, we might want to regenerate with new context
                return await self._generate_response(user_id)

        # Update working memory with the response
        if response["message"]["content"]:
            self.working_memory.append({
                "type": "assistant_response",
                "content": response["message"]["content"],
                "timestamp": time.time()
            })
            self._prune_working_memory()

        return response["message"]["content"]

    async def _update_memories(self, user_id: str):
        """Update the agent's memories based on recent conversation."""
        # Get last user message
        last_user_message = next(
            (msg for msg in reversed(self.state.conversation_history)
             if msg.role == MessageRole.USER),
            None
        )

        if not last_user_message:
            return

        # Add to working memory
        self.working_memory.append({
            "type": "user_message",
            "content": last_user_message.content,
            "timestamp": time.time()
        })

        # Extract potential semantic memories (facts, preferences)
        if len(self.state.conversation_history) > 2:
            extraction_messages = [
                {"role": "system", "content": "Extract key facts, preferences, or personal details from this user message that would be useful to remember for future interactions. Return in JSON format with keys: 'facts', 'preferences', 'personal_details', each containing an array of strings."},
                {"role": "user", "content": last_user_message.content}
            ]

            try:
                extraction = await self.provider_service.generate_completion(
                    messages=extraction_messages,
                    user=user_id,
                    response_format={"type": "json_object"}
                )

                content = extraction["message"]["content"]
                if content:
                    memory_data = json.loads(content)

                    # Store in semantic memory
                    timestamp = datetime.now().isoformat()
                    for category, items in memory_data.items():
                        if not isinstance(items, list):
                            continue
                        for item in items:
                            if not item or not isinstance(item, str):
                                continue
                            memory_key = f"{category}:{self._generate_memory_key(item)}"
                            self.semantic_memory[memory_key] = {
                                "content": item,
                                "category": category,
                                "last_accessed": timestamp,
                                "created_at": timestamp,
                                "importance": self._calculate_importance(item)
                            }

                    # Store in memory service for persistence
                    await self.memory_service.store_memories(
                        user_id=user_id,
                        memories=self.semantic_memory
                    )
            except Exception as e:
                logger.error(f"Error extracting memories: {str(e)}")

        # Prune working memory if needed
        self._prune_working_memory()

    async def _retrieve_relevant_memories(self, user_id: str) -> Dict[str, List[Any]]:
        """Retrieve memories relevant to the current context."""
        # Get conversation summary or last few messages
        if len(self.state.conversation_history) <= 2:
            query = self.state.conversation_history[-1].content
        else:
            recent_messages = self.state.conversation_history[-3:]
            query = " ".join([msg.content for msg in recent_messages if msg.role != MessageRole.SYSTEM])

        # Retrieve from memory service
        stored_memories = await self.memory_service.retrieve_memories(
            user_id=user_id,
            query=query,
            limit=5
        )

        # Combine with local semantic memory
        all_memories = {
            "facts": [],
            "preferences": [],
            "personal_details": [],
            "episodic": self.episodic_memory[-3:] if self.episodic_memory else []
        }

        # Add from semantic memory
        for key, memory in self.semantic_memory.items():
            category = memory["category"]
            if category in all_memories and len(all_memories[category]) < 5:
                all_memories[category].append(memory["content"])

        # Add from stored memories
        for memory in stored_memories:
            category = memory.get("category", "facts")
            if category in all_memories and len(all_memories[category]) < 5:
                all_memories[category].append(memory["content"])

            # Update last accessed
            if memory.get("id"):
                memory_key = f"{category}:{memory['id']}"
                if memory_key in self.semantic_memory:
                    self.semantic_memory[memory_key]["last_accessed"] = datetime.now().isoformat()

        return all_memories

    def _create_context_prompt(self, memories: Dict[str, List[Any]]) -> str:
        """Create a context prompt with relevant memories."""
        context_parts = ["Additional context to consider:"]

        if memories["facts"]:
            facts = "\n".join([f"- {fact}" for fact in memories["facts"]])
            context_parts.append(f"Facts about the user or relevant topics:\n{facts}")

        if memories["preferences"]:
            prefs = "\n".join([f"- {pref}" for pref in memories["preferences"]])
            context_parts.append(f"User preferences:\n{prefs}")

        if memories["personal_details"]:
            details = "\n".join([f"- {detail}" for detail in memories["personal_details"]])
            context_parts.append(f"Personal details:\n{details}")

        if memories["episodic"]:
            episodes = "\n".join([f"- {ep.get('summary', '')}" for ep in memories["episodic"]])
            context_parts.append(f"Recent interactions:\n{episodes}")

        # Add working memory summary
        if self.working_memory:
            working_context = "Current context:\n"
            for item in self.working_memory[-5:]:
                item_type = item["type"]
                content_preview = item["content"][:100] + "..." if len(item["content"]) > 100 else item["content"]
                working_context += f"- [{item_type}] {content_preview}\n"
            context_parts.append(working_context)

        context_parts.append("Use this information to personalize your response, but don't explicitly mention that you're using saved information unless directly relevant.")

        return "\n\n".join(context_parts)

    def _prune_working_memory(self):
        """Prune working memory to stay within limits."""
        if len(self.working_memory) > self.max_working_memory:
            # Instead of simple truncation, we prioritize by importance and recency
            ranked = sorted(
                self.working_memory,
                key=lambda x: (x.get("importance", 0.5), x["timestamp"]),
                reverse=True
            )
            kept = ranked[:self.max_working_memory]
            # Restore chronological order so that appends and [-5:] slices
            # keep treating the end of the list as the most recent items
            self.working_memory = sorted(kept, key=lambda x: x["timestamp"])

    def _generate_memory_key(self, content: str) -> str:
        """Generate a short key for memory storage (content hash, not a security measure)."""
        return hashlib.md5(content.encode()).hexdigest()[:10]

    def _calculate_importance(self, content: str) -> float:
        """Calculate the importance score of a memory item."""
        # Simple heuristic based on content length and presence of certain keywords
        importance_keywords = ["always", "never", "hate", "love", "favorite", "important", "must", "need"]

        base_score = min(len(content) / 100, 0.5)  # Longer items get higher base score, up to 0.5

        keyword_score = sum(0.1 for word in importance_keywords if word in content.lower())
        keyword_score = min(keyword_score, 0.5)  # Cap at 0.5

        return base_score + keyword_score

    async def _process_memory_tools(self, tool_calls: List[Dict[str, Any]]) -> bool:
        """Process memory-related tool calls."""
        # Implement if we add memory-specific tools
        return False
```
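The importance heuristic and the working-memory pruning step can be exercised in isolation. The sketch below is a standalone approximation of those two methods, not the agent's actual code path: the function names `calculate_importance` and `prune_working_memory` are hypothetical, and the choice to return the kept items in chronological order is an assumption made so that the most recent items stay at the end of the list.

```python
from typing import Any, Dict, List

IMPORTANCE_KEYWORDS = ["always", "never", "hate", "love", "favorite", "important", "must", "need"]


def calculate_importance(content: str) -> float:
    """Score a memory item between 0.0 and 1.0 from its length and keywords."""
    base_score = min(len(content) / 100, 0.5)  # length contributes up to 0.5
    keyword_score = sum(0.1 for word in IMPORTANCE_KEYWORDS if word in content.lower())
    return base_score + min(keyword_score, 0.5)  # keywords contribute up to 0.5


def prune_working_memory(items: List[Dict[str, Any]], max_items: int) -> List[Dict[str, Any]]:
    """Keep the max_items most important (then most recent) items, oldest first."""
    if len(items) <= max_items:
        return list(items)
    ranked = sorted(
        items,
        key=lambda x: (x.get("importance", 0.5), x["timestamp"]),
        reverse=True,
    )
    # Re-sort the survivors by time so the newest item is last
    return sorted(ranked[:max_items], key=lambda x: x["timestamp"])
```

Items without an explicit `importance` fall back to 0.5, so among unscored items the ranking reduces to plain recency.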
Advanced Tool Integration
Collaborative Task Management Agent
```python
# app/agents/task_agent.py
from typing import List, Dict, Any, Optional
import logging
import json
import asyncio

from app.agents.base_agent import BaseAgent
from app.models.message import Message, MessageRole
from app.models.tool import Tool
from app.services.task_service import TaskService

logger = logging.getLogger(__name__)

class TaskManagementAgent(BaseAgent):
    """Agent specialized in collaborative task management."""

    def __init__(self, *args, task_service: TaskService, **kwargs):
        super().__init__(*args, **kwargs)
        self.task_service = task_service

        # Register task management tools
        self.tools.extend([
            Tool(
                name="list_tasks",
                description="List tasks for the user",
                parameters={
                    "type": "object",
                    "properties": {
                        "status": {
                            "type": "string",
                            "enum": ["pending", "in_progress", "completed", "all"],
                            "description": "Filter tasks by status"
                        },
                        "limit": {
                            "type": "integer",
                            "description": "Maximum number of tasks to return",
                            "default": 10
                        }
                    }
                }
            ),
            Tool(
                name="create_task",
                description="Create a new task",
                parameters={
                    "type": "object",
                    "properties": {
                        "title": {
                            "type": "string",
                            "description": "Title of the task"
                        },
                        "description": {
                            "type": "string",
                            "description": "Detailed description of the task"
                        },
                        "due_date": {
                            "type": "string",
                            "description": "Due date in ISO format (YYYY-MM-DD)"
                        },
                        "priority": {
                            "type": "string",
                            "enum": ["low", "medium", "high"],
                            "description": "Priority level of the task"
                        }
                    },
                    "required": ["title"]
                }
            ),
            Tool(
                name="update_task",
                description="Update an existing task",
                parameters={
                    "type": "object",
                    "properties": {
                        "task_id": {
                            "type": "string",
                            "description": "ID of the task to update"
                        },
                        "title": {
                            "type": "string",
                            "description": "New title of the task"
                        },
                        "description": {
                            "type": "string",
                            "description": "New description of the task"
                        },
                        "status": {
                            "type": "string",
                            "enum": ["pending", "in_progress", "completed"],
                            "description": "New status of the task"
                        },
                        "due_date": {
                            "type": "string",
                            "description": "New due date in ISO format (YYYY-MM-DD)"
                        },
                        "priority": {
                            "type": "string",
                            "enum": ["low", "medium", "high"],
                            "description": "New priority level of the task"
                        }
                    },
                    "required": ["task_id"]
                }
            ),
            Tool(
                name="delete_task",
                description="Delete a task",
                parameters={
                    "type": "object",
                    "properties": {
                        "task_id": {
                            "type": "string",
                            "description": "ID of the task to delete"
                        },
                        "confirm": {
                            "type": "boolean",
                            "description": "Confirmation to delete the task",
                            "default": False
                        }
                    },
                    "required": ["task_id", "confirm"]
                }
            )
        ])

    async def _generate_response(self, user_id: str) -> str:
        """Generate a response with task management capabilities."""
        # Prepare messages for completion
        messages = [msg.model_dump() for msg in self.state.conversation_history]

        # Generate initial response
        response = await self.provider_service.generate_completion(
            messages=messages,
            tools=self.tools,
            user=user_id
        )

        # Process tool calls if any
        if response.get("tool_calls"):
            tool_responses = await self._process_tool_calls(response["tool_calls"], user_id)

            # Add tool responses to conversation history
            for tool_response in tool_responses:
                self.state.conversation_history.append(
                    Message(
                        role=MessageRole.TOOL,
                        content=tool_response["content"],
                        tool_call_id=tool_response["tool_call_id"]
                    )
                )

            # Generate new response with tool results
            updated_messages = [msg.model_dump() for msg in self.state.conversation_history]
            final_response = await self.provider_service.generate_completion(
                messages=updated_messages,
                tools=self.tools,
                user=user_id
            )

            # Handle any additional tool calls (recursive)
            if final_response.get("tool_calls"):
                # For simplicity, we'll limit to one level of recursion
                return await self._handle_recursive_tool_calls(final_response, user_id)

            return final_response["message"]["content"]

        return response["message"]["content"]

    async def _handle_recursive_tool_calls(self, response: Dict[str, Any], user_id: str) -> str:
        """Handle additional tool calls recursively."""
        tool_responses = await self._process_tool_calls(response["tool_calls"], user_id)

        # Add tool responses to conversation history
        for tool_response in tool_responses:
            self.state.conversation_history.append(
                Message(
                    role=MessageRole.TOOL,
                    content=tool_response["content"],
                    tool_call_id=tool_response["tool_call_id"]
                )
            )

        # Generate final response with all tool results
        updated_messages = [msg.model_dump() for msg in self.state.conversation_history]
        final_response = await self.provider_service.generate_completion(
            messages=updated_messages,
            tools=self.tools,
            user=user_id
        )

        return final_response["message"]["content"]

    async def _process_tool_calls(self, tool_calls: List[Dict[str, Any]], user_id: str) -> List[Dict[str, Any]]:
        """Process tool calls and return tool responses."""
        tool_responses = []

        for tool_call in tool_calls:
            tool_name = tool_call["function"]["name"]
            tool_args_json = tool_call["function"]["arguments"]
            tool_call_id = tool_call["id"]

            try:
                # Parse arguments as JSON
                tool_args = json.loads(tool_args_json)

                # Process based on tool name
                if tool_name == "list_tasks":
                    result = await self.task_service.list_tasks(
                        user_id=user_id,
                        status=tool_args.get("status", "all"),
                        limit=tool_args.get("limit", 10)
                    )

                    if result:
                        tasks_formatted = "\n\n".join([
                            f"ID: {task['id']}\n"
                            f"Title: {task['title']}\n"
                            f"Status: {task['status']}\n"
                            f"Priority: {task['priority']}\n"
                            f"Due Date: {task['due_date']}\n"
                            f"Description: {task['description']}"
                            for task in result
                        ])
                        tool_responses.append({
                            "tool_call_id": tool_call_id,
                            "content": f"Found {len(result)} tasks:\n\n{tasks_formatted}"
                        })
                    else:
                        tool_responses.append({
                            "tool_call_id": tool_call_id,
                            "content": "No tasks found matching your criteria."
                        })

                elif tool_name == "create_task":
                    result = await self.task_service.create_task(
                        user_id=user_id,
                        title=tool_args["title"],
                        description=tool_args.get("description", ""),
                        due_date=tool_args.get("due_date"),
                        priority=tool_args.get("priority", "medium")
                    )

                    tool_responses.append({
                        "tool_call_id": tool_call_id,
                        "content": f"Task created successfully.\n\nID: {result['id']}\nTitle: {result['title']}"
                    })

                elif tool_name == "update_task":
                    update_data = {k: v for k, v in tool_args.items() if k != "task_id"}
                    result = await self.task_service.update_task(
                        user_id=user_id,
                        task_id=tool_args["task_id"],
                        **update_data
                    )

                    if result:
                        tool_responses.append({
                            "tool_call_id": tool_call_id,
                            "content": f"Task updated successfully.\n\nID: {result['id']}\nTitle: {result['title']}\nStatus: {result['status']}"
                        })
                    else:
                        tool_responses.append({
                            "tool_call_id": tool_call_id,
                            "content": f"Task with ID {tool_args['task_id']} not found or you don't have permission to update it."
                        })

                elif tool_name == "delete_task":
                    if not tool_args.get("confirm", False):
                        tool_responses.append({
                            "tool_call_id": tool_call_id,
                            "content": "Task deletion requires confirmation. Please set 'confirm' to true to proceed."
                        })
                    else:
                        result = await self.task_service.delete_task(
                            user_id=user_id,
                            task_id=tool_args["task_id"]
                        )

                        if result:
                            tool_responses.append({
                                "tool_call_id": tool_call_id,
                                "content": f"Task with ID {tool_args['task_id']} has been deleted successfully."
                            })
                        else:
                            tool_responses.append({
                                "tool_call_id": tool_call_id,
                                "content": f"Task with ID {tool_args['task_id']} not found or you don't have permission to delete it."
                            })

            except json.JSONDecodeError:
                tool_responses.append({
                    "tool_call_id": tool_call_id,
                    "content": "Error: Invalid JSON in tool arguments."
                })
            except KeyError as e:
                tool_responses.append({
                    "tool_call_id": tool_call_id,
                    "content": f"Error: Missing required parameter: {str(e)}"
                })
            except Exception as e:
                logger.error(f"Error processing tool call {tool_name}: {str(e)}")
                tool_responses.append({
                    "tool_call_id": tool_call_id,
                    "content": f"Error executing {tool_name}: {str(e)}"
                })

        return tool_responses
```
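The task agent handles a second round of tool calls through a dedicated helper. The same idea can be written as a single bounded loop, sketched below under stated assumptions: `run_tool_loop`, `fake_complete`, and the handler mapping are hypothetical names, the response shape (`message`, `tool_calls`) mirrors what the agent code reads from the provider, and messages are plain dicts rather than the article's `Message` model. The sketch also appends the assistant turn that requested the tools before the tool results, which OpenAI-style chat APIs generally expect.

```python
import asyncio
import json
from typing import Any, Awaitable, Callable, Dict, List

Completion = Dict[str, Any]
CompleteFn = Callable[[List[Dict[str, Any]]], Awaitable[Completion]]
ToolHandler = Callable[[Dict[str, Any]], Awaitable[str]]


async def run_tool_loop(
    complete: CompleteFn,
    handlers: Dict[str, ToolHandler],
    messages: List[Dict[str, Any]],
    max_rounds: int = 3,
) -> str:
    """Call the model, run any requested tools, and repeat up to max_rounds times."""
    for _ in range(max_rounds):
        response = await complete(messages)
        tool_calls = response.get("tool_calls")
        if not tool_calls:
            return response["message"]["content"]

        # Record the assistant turn that requested the tools before their results
        messages.append({
            "role": "assistant",
            "content": response["message"].get("content"),
            "tool_calls": tool_calls,
        })

        for call in tool_calls:
            name = call["function"]["name"]
            handler = handlers.get(name)
            if handler is None:
                content = f"Error: unknown tool {name}"
            else:
                try:
                    content = await handler(json.loads(call["function"]["arguments"]))
                except json.JSONDecodeError:
                    content = "Error: Invalid JSON in tool arguments."
            messages.append({"role": "tool", "tool_call_id": call["id"], "content": content})

    return "Stopped: tool-call round limit reached."


async def fake_complete(messages: List[Dict[str, Any]]) -> Completion:
    """Stand-in for the provider: requests list_tasks once, then answers in text."""
    if not any(m["role"] == "tool" for m in messages):
        return {
            "message": {"content": None},
            "tool_calls": [{
                "id": "call_1",
                "function": {"name": "list_tasks", "arguments": json.dumps({"status": "pending"})},
            }],
        }
    return {"message": {"content": "You have 1 pending task."}}


async def list_tasks(args: Dict[str, Any]) -> str:
    """Hypothetical handler standing in for TaskService.list_tasks."""
    return f"Found 1 task with status {args.get('status', 'all')}"
```

Calling `asyncio.run(run_tool_loop(fake_complete, {"list_tasks": list_tasks}, history))` drives one tool round and then returns the model's text answer. The `max_rounds` cap plays the role of the "one level of recursion" limit in the agent above.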
Agent Factory and Orchestration
```python
# app/agents/agent_factory.py
from typing import Dict, Any, Optional, List, Type
import logging

from app.agents.base_agent import BaseAgent
from app.agents.research_agent import ResearchAgent
from app.agents.conversation_manager import ConversationManager
from app.agents.contextual_agent import ContextualAgent
from app.agents.task_agent import TaskManagementAgent

from app.services.provider_service import ProviderService
from app.services.knowledge_service import KnowledgeService
from app.services.memory_service import MemoryService
from app.services.task_service import TaskService

logger = logging.getLogger(__name__)

class AgentFactory:
    """Factory for creating agent instances based on requirements."""

    def __init__(self,
                 provider_service: ProviderService,
                 knowledge_service: Optional[KnowledgeService] = None,
                 memory_service: Optional[MemoryService] = None,
                 task_service: Optional[TaskService] = None):
        self.provider_service = provider_service
        self.knowledge_service = knowledge_service
        self.memory_service = memory_service
        self.task_service = task_service

        # Register available agent types
        self.agent_types: Dict[str, Type[BaseAgent]] = {
            "research": ResearchAgent,
            "conversation": ConversationManager,
            "contextual": ContextualAgent,
            "task": TaskManagementAgent
        }

    def create_agent(self,
                     agent_type: str,
                     system_prompt: str,
                     tools: Optional[List[Dict[str, Any]]] = None,
                     **kwargs) -> BaseAgent:
        """Create and return an agent instance of the specified type."""
        if agent_type not in self.agent_types:
            raise ValueError(f"Unknown agent type: {agent_type}. Available types: {list(self.agent_types.keys())}")

        agent_class = self.agent_types[agent_type]

        # Prepare required services based on agent type
        agent_kwargs = {
            "provider_service": self.provider_service,
            "system_prompt": system_prompt,
            "tools": tools
        }

        # Add specialized services based on agent type
        if agent_type == "research" and self.knowledge_service:
            agent_kwargs["knowledge_service"] = self.knowledge_service

        if agent_type == "contextual" and self.memory_service:
            agent_kwargs["memory_service"] = self.memory_service

        if agent_type == "task" and self.task_service:
            agent_kwargs["task_service"] = self.task_service

        # Add any additional kwargs
        agent_kwargs.update(kwargs)

        # Create and return the agent instance
        return agent_class(**agent_kwargs)
```
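One consequence of the factory's conditional wiring is worth noting: `ContextualAgent` and `TaskManagementAgent` declare their service as a required keyword-only argument, so if the factory was built without that service the constructor fails with a `TypeError`. A possible way to surface the problem earlier is sketched below. The helper `build_agent_kwargs` and the `REQUIRED_SERVICE` table are hypothetical, and only the two agent types whose constructors appear in this guide are listed.

```python
from typing import Any, Dict, Optional

# Service each agent type cannot be constructed without.
# Only the two constructors shown in this guide are listed (assumption:
# other agent types would be added here once their signatures are known).
REQUIRED_SERVICE = {
    "contextual": "memory_service",
    "task": "task_service",
}


def build_agent_kwargs(
    agent_type: str,
    system_prompt: str,
    provider_service: Any,
    services: Dict[str, Optional[Any]],
) -> Dict[str, Any]:
    """Assemble constructor kwargs, failing early if a required service is missing."""
    kwargs: Dict[str, Any] = {
        "provider_service": provider_service,
        "system_prompt": system_prompt,
    }
    required = REQUIRED_SERVICE.get(agent_type)
    if required is not None:
        service = services.get(required)
        if service is None:
            raise ValueError(
                f"Agent type '{agent_type}' requires {required}, which was not configured"
            )
        kwargs[required] = service
    return kwargs
```

With this check, a misconfigured deployment reports which service is missing instead of raising a generic constructor error.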
Metaframework for Agent Composition
```python
# app/agents/meta_agent.py
from typing import Dict, List, Any, Optional
import logging
import asyncio
import json

from app.agents.base_agent import BaseAgent, AgentState
from app.models.message import Message, MessageRole
from app.services.provider_service import ProviderService

logger = logging.getLogger(__name__)

class AgentSubsystem:
    """Represents a specialized agent within the MetaAgent."""

    def __init__(self, name: str, agent: BaseAgent, role: str):
        self.name = name
        self.agent = agent
        self.role = role
        self.active = True

class MetaAgent(BaseAgent):
    """A meta-agent that coordinates multiple specialized agents."""

    def __init__(self,
                 provider_service: ProviderService,
                 system_prompt: str,
                 subsystems: Optional[List[AgentSubsystem]] = None,
                 state: Optional[AgentState] = None):
        super().__init__(provider_service, system_prompt, [], state)
        self.subsystems = subsystems or []

        # Tools specific to the meta-agent
        self.tools.extend([
            {
                "type": "function",
                "function": {
                    "name": "route_to_subsystem",
                    "description": "Route a task to a specific subsystem agent",
                    "parameters": {
                        "type": "object",
                        "properties": {
                            "subsystem": {
                                "type": "string",
                                "description": "The name of the subsystem to route to"
                            },
                            "task": {
                                "type": "string",
                                "description": "The task to be performed by the subsystem"
                            },
                            "context": {
                                "type": "object",
                                "description": "Additional context for the subsystem"
                            }
                        },
                        "required": ["subsystem", "task"]
                    }
                }
            },
            {
                "type": "function",
                "function": {
                    "name": "parallel_processing",
                    "description": "Process a task in parallel across multiple subsystems",
                    "parameters": {
                        "type": "object",
                        "properties": {
                            "task": {
                                "type": "string",
                                "description": "The task to process in parallel"
                            },
                            "subsystems": {
                                "type": "array",
                                "items": {
                                    "type": "string"
                                },
                                "description": "List of subsystems to involve"
                            }
                        },
                        "required": ["task", "subsystems"]
                    }
                }
            }
        ])

    def add_subsystem(self, subsystem: AgentSubsystem):
        """Add a new subsystem to the meta-agent."""
        # Check for duplicate names
        if any(existing.name == subsystem.name for existing in self.subsystems):
            raise ValueError(f"Subsystem with name '{subsystem.name}' already exists")

        self.subsystems.append(subsystem)

    def get_subsystem(self, name: str) -> Optional[AgentSubsystem]:
        """Get a subsystem by name."""
        for subsystem in self.subsystems:
            if subsystem.name == name:
                return subsystem
        return None

    async def _generate_response(self, user_id: str) -> str:
        """Generate a response using the meta-agent architecture."""
        # Extract the last user message
        last_user_message = next(
            (msg for msg in reversed(self.state.conversation_history)
             if msg.role == MessageRole.USER),
            None
        )

        if not last_user_message:
            return "I don't have any messages to respond to."

        # First, determine routing strategy using the coordinator
        coordinator_messages = [
            {"role": "system", "content": f"""
            You are the coordinator of a multi-agent system with the following subsystems:

            {self._format_subsystems()}

            Your job is to analyze the user's message and determine the optimal processing strategy:
            1. If the query is best handled by a single specialized subsystem, use route_to_subsystem
            2. If the query would benefit from multiple perspectives, use parallel_processing

            Choose the most appropriate strategy based on the complexity and nature of the request.
            """},
            {"role": "user", "content": last_user_message.content}
        ]

        routing_response = await self.provider_service.generate_completion(
            messages=coordinator_messages,
            tools=self.tools,
            tool_choice="auto",
            user=user_id
        )

        # Process based on the routing decision
        if routing_response.get("tool_calls"):
            tool_call = routing_response["tool_calls"][0]
            function_name = tool_call["function"]["name"]

            try:
                function_args = json.loads(tool_call["function"]["arguments"])

                if function_name == "route_to_subsystem":
                    return await self._handle_single_subsystem_route(
                        function_args["subsystem"],
                        function_args["task"],
                        function_args.get("context", {}),
                        user_id
                    )

                elif function_name == "parallel_processing":
                    return await self._handle_parallel_processing(
                        function_args["task"],
                        function_args["subsystems"],
                        user_id
                    )

            except json.JSONDecodeError:
                logger.error("Error parsing function arguments")
            except KeyError as e:
                logger.error(f"Missing required parameter: {e}")
            except Exception as e:
                logger.error(f"Error in routing: {e}")

        # Fallback to direct response
        return await self._handle_direct_response(user_id)

    async def _handle_single_subsystem_route(self,
                                             subsystem_name: str,
                                             task: str,
                                             context: Dict[str, Any],
                                             user_id: str) -> str:
        """Handle routing to a single subsystem."""
        subsystem = self.get_subsystem(subsystem_name)

        if not subsystem or not subsystem.active:
            return f"Error: Subsystem '{subsystem_name}' not found or not active. Please try a different approach."

        # Process with the selected subsystem
        response = await subsystem.agent.process_message(task, user_id)

        # Format the response to indicate the source
        return f"[{subsystem.name} - {subsystem.role}] {response}"

    async def _handle_parallel_processing(self,
                                          task: str,
                                          subsystem_names: List[str],
                                          user_id: str) -> str:
        """Handle parallel processing across multiple subsystems."""
        # Validate subsystems
        valid_subsystems = []
        for name in subsystem_names:
            subsystem = self.get_subsystem(name)
            if subsystem and subsystem.active:
                valid_subsystems.append(subsystem)

        if not valid_subsystems:
            return "Error: None of the specified subsystems are available."

        # Process in parallel
        tasks = [subsystem.agent.process_message(task, user_id) for subsystem in valid_subsystems]
        responses = await asyncio.gather(*tasks)

        # Format responses
        formatted_responses = [
            f"## {subsystem.name} ({subsystem.role}):\n{response}"
            for subsystem, response in zip(valid_subsystems, responses)
        ]

        # Join with blank lines so each agent's section stays separate in the prompt
        combined_responses = "\n\n".join(formatted_responses)

        # Synthesize a final response
        synthesis_prompt = f"""
        The user's request was processed by multiple specialized agents:

        {combined_responses}

        Synthesize a comprehensive response that incorporates these perspectives.
        Highlight areas of agreement and provide a balanced view where there are differences.
        """

        synthesis_messages = [
            {"role": "system", "content": "You are a synthesis agent that combines multiple specialized perspectives into a coherent response."},
            {"role": "user", "content": synthesis_prompt}
        ]

        synthesis = await self.provider_service.generate_completion(
            messages=synthesis_messages,
            user=user_id
        )

        return synthesis["message"]["content"]

    async def _handle_direct_response(self, user_id: str) -> str:
        """Handle direct response when no routing is determined."""
        # Generate a response directly using the provider service
        response = await self.provider_service.generate_completion(
            messages=[msg.model_dump() for msg in self.state.conversation_history],
            user=user_id
        )

        return response["message"]["content"]

    def _format_subsystems(self) -> str:
        """Format subsystem information for the coordinator prompt."""
        return "\n".join([
            f"- {subsystem.name}: {subsystem.role}"
            for subsystem in self.subsystems if subsystem.active
        ])
```
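In the parallel path, a plain `asyncio.gather` call propagates the first exception, so one failing subsystem discards the results of all the others. A minimal sketch of a more tolerant fan-out follows, using the standard `return_exceptions=True` option. The function name `gather_subsystem_responses` and the worker signature are hypothetical; each worker stands in for a call such as `subsystem.agent.process_message(task, user_id)`.

```python
import asyncio
from typing import Awaitable, Callable, Dict, Tuple

Worker = Callable[[str], Awaitable[str]]


async def gather_subsystem_responses(
    workers: Dict[str, Worker],
    task: str,
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Run all workers concurrently and split the outcomes into successes and failures."""
    names = list(workers)
    results = await asyncio.gather(
        *(workers[name](task) for name in names),
        return_exceptions=True,  # exceptions are returned as values instead of raised
    )

    succeeded: Dict[str, str] = {}
    failed: Dict[str, str] = {}
    for name, result in zip(names, results):
        if isinstance(result, BaseException):
            failed[name] = str(result)
        else:
            succeeded[name] = result
    return succeeded, failed
```

The synthesis step could then work from `succeeded` alone and mention any entries in `failed`, rather than falling back to a direct response whenever a single subsystem errors.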
Sample Agent Usage Implementation
```python
# app/main.py
import asyncio
import logging
from fastapi import FastAPI, HTTPException, Depends, Header
from pydantic import BaseModel
from typing import List, Optional, Dict, Any

from app.agents.agent_factory import AgentFactory
from app.agents.meta_agent import MetaAgent, AgentSubsystem
from app.services.provider_service import ProviderService
from app.services.knowledge_service import KnowledgeService
from app.services.memory_service import MemoryService
from app.services.task_service import TaskService

# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

app = FastAPI(title="MCP Agent System")

# Initialize services
provider_service = ProviderService()
knowledge_service = KnowledgeService()
memory_service = MemoryService()
task_service = TaskService()

# Initialize agent factory
agent_factory = AgentFactory(
    provider_service=provider_service,
    knowledge_service=knowledge_service,
    memory_service=memory_service,
    task_service=task_service
)

# Agent session storage
agent_sessions = {}

# Define request/response models
class MessageRequest(BaseModel):
    message: str
    session_id: Optional[str] = None
    agent_type: Optional[str] = None

class MessageResponse(BaseModel):
    response: str
    session_id: str

# Auth dependency
async def verify_api_key(authorization: Optional[str] = Header(None)):
    if not authorization or not authorization.startswith("Bearer "):
        raise HTTPException(status_code=401, detail="Invalid or missing API key")

    # Simple validation for demo purposes
    token = authorization.replace("Bearer ", "")
    if token != "demo_api_key":  # In production, validate against secure storage
        raise HTTPException(status_code=401, detail="Invalid API key")

    return token

# Routes
@app.post("/api/v1/chat", response_model=MessageResponse)
async def chat(
    request: MessageRequest,
    api_key: str = Depends(verify_api_key)
):
    user_id = "demo_user"  # In production, extract from API key or auth token

    # Create or retrieve session
    session_id = request.session_id
    if not session_id or session_id not in agent_sessions:
        # Create a new agent instance if session doesn't exist
        session_id = f"session_{len(agent_sessions) + 1}"

        # Determine agent type
        agent_type = request.agent_type or "meta"

        if agent_type == "meta":
            # Create a meta-agent with multiple specialized subsystems
            research_agent = agent_factory.create_agent(
                agent_type="research",
                system_prompt="You are a research specialist that provides in-depth, accurate information based on available knowledge."
            )

            conversation_agent = agent_factory.create_agent(
                agent_type="conversation",
                system_prompt="You are a conversation expert that helps maintain engaging, relevant, and structured discussions."
            )

            task_agent = agent_factory.create_agent(
                agent_type="task",
                system_prompt="You are a task management specialist that helps organize, track, and complete tasks efficiently."
            )

            meta_agent = MetaAgent(
                provider_service=provider_service,
                system_prompt="You are an advanced assistant that coordinates multiple specialized systems to provide optimal responses."
            )

            # Add subsystems to meta-agent
            meta_agent.add_subsystem(AgentSubsystem(
                name="research",
                agent=research_agent,
                role="Knowledge and information retrieval specialist"
            ))

            meta_agent.add_subsystem(AgentSubsystem(
                name="conversation",
                agent=conversation_agent,
                role="Conversation flow and engagement specialist"
            ))

            meta_agent.add_subsystem(AgentSubsystem(
                name="task",
                agent=task_agent,
                role="Task management and organization specialist"
            ))

            agent = meta_agent
        else:
            # Create a specialized agent
            agent = agent_factory.create_agent(
                agent_type=agent_type,
                system_prompt=f"You are a helpful assistant specializing in {agent_type} tasks."
            )

        agent_sessions[session_id] = agent
    else:
        agent = agent_sessions[session_id]

    # Process the message
    try:
        response = await agent.process_message(request.message, user_id)
        return MessageResponse(response=response, session_id=session_id)
    except Exception as e:
        logger.exception("Error processing message")
        raise HTTPException(status_code=500, detail=f"Error processing message: {str(e)}")

# Startup event
@app.on_event("startup")
async def startup_event():
    # Initialize services
    await provider_service.initialize()
    await knowledge_service.initialize()
    await memory_service.initialize()
    await task_service.initialize()

    logger.info("All services initialized")

# Shutdown event
@app.on_event("shutdown")
async def shutdown_event():
    # Cleanup
    await provider_service.cleanup()
    await knowledge_service.cleanup()
    await memory_service.cleanup()
    await task_service.cleanup()

    logger.info("All services shut down")

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)
```
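The `verify_api_key` dependency strips the scheme with `str.replace` and compares the token with `!=`. The following is a minimal sketch of a slightly stricter check, assuming Python 3.9+ for `str.removeprefix`. The names `EXPECTED_KEY`, `extract_bearer_token`, and `is_valid_key` are hypothetical and not part of the application above; the hard-coded key stands in for whatever secure storage a real deployment would use.

```python
import hmac
from typing import Optional

# Hypothetical placeholder; a real deployment would load this from secure storage.
EXPECTED_KEY = "demo_api_key"


def extract_bearer_token(authorization: Optional[str]) -> Optional[str]:
    """Return the token from an 'Authorization: Bearer <token>' value, or None."""
    prefix = "Bearer "
    if not authorization or not authorization.startswith(prefix):
        return None
    # removeprefix strips only the leading "Bearer ", unlike str.replace,
    # which would also alter a token that happens to contain that text.
    token = authorization.removeprefix(prefix).strip()
    return token or None


def is_valid_key(token: Optional[str], expected: str = EXPECTED_KEY) -> bool:
    """Compare in constant time so response timing reveals less about the key."""
    if token is None:
        return False
    return hmac.compare_digest(token.encode("utf-8"), expected.encode("utf-8"))
```

Inside the FastAPI dependency, a `None` token or a failed `is_valid_key` check would map to the same 401 response the original code raises.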
Conclusion
This comprehensive implementation demonstrates the integration of OpenAI's Responses API within a sophisticated agent architecture. The modular design allows for specialized cognitive capabilities including knowledge retrieval, conversation management, contextual awareness, and task coordination.
Key architectural features include:
- Abstraction Layers: The system maintains clean separation between provider services, agent logic, and specialized capabilities.
- Contextual Enhancement: Agents utilize memory systems and knowledge retrieval to maintain context and provide more relevant responses.
- Tool Integration: The implementation leverages OpenAI's function calling capabilities to integrate with external systems and services.
- Meta-Agent Architecture: The meta-agent pattern enables composition of specialized agents into a coherent system that routes queries optimally.
- Stateful Conversations: All agents maintain conversation state, allowing for continuity and context preservation across interactions.
This architecture provides a foundation for building sophisticated AI applications that leverage both OpenAI's cloud capabilities and local Ollama models through the MCP system's intelligent routing.
Hybrid Intelligence Architecture: Integrating Ollama with OpenAI's Agent SDK
Theoretical Framework for Hybrid Model Inference
The integration of Ollama with OpenAI's Agent SDK represents a significant advancement in hybrid AI architectures. This document articulates the methodological approach for implementing a sophisticated orchestration layer that intelligently routes inference tasks between cloud-based and local computational resources based on contextual parameters.
Ollama Integration Architecture
Core Integration Components
```python
# app/services/ollama_service.py
import os
import json
import logging
import re
import time
import uuid
from typing import List, Dict, Any, Optional, Union, AsyncGenerator
import aiohttp
import asyncio
from tenacity import retry, stop_after_attempt, wait_exponential

from app.models.message import Message, MessageRole
from app.config import settings

logger = logging.getLogger(__name__)

class OllamaService:
    """Service for interacting with Ollama's local inference capabilities."""

    def __init__(self):
        self.base_url = settings.OLLAMA_HOST
        self.default_model = settings.OLLAMA_MODEL
        self.timeout = aiohttp.ClientTimeout(total=settings.REQUEST_TIMEOUT)
        self.session = None

        # Capability mapping for different models
        self.model_capabilities = {
            "llama2": {
                "supports_tools": False,
                "context_window": 4096,
                "strengths": ["general_knowledge", "reasoning"],
                "max_tokens": 2048
            },
            "codellama": {
                "supports_tools": False,
                "context_window": 8192,
                "strengths": ["code_generation", "technical_explanation"],
                "max_tokens": 2048
            },
            "mistral": {
                "supports_tools": False,
                "context_window": 8192,
                "strengths": ["instruction_following", "reasoning"],
                "max_tokens": 2048
            },
            "dolphin-mistral": {
                "supports_tools": False,
                "context_window": 8192,
                "strengths": ["conversational", "creative_writing"],
                "max_tokens": 2048
            }
        }

    async def initialize(self):
        """Initialize the Ollama service."""
        self.session = aiohttp.ClientSession(timeout=self.timeout)

        # Verify connectivity
        try:
            await self.list_models()
            logger.info("Ollama service initialized successfully")
        except Exception as e:
            logger.error(f"Failed to initialize Ollama service: {str(e)}")
            raise

    async def cleanup(self):
        """Clean up resources."""
        if self.session:
            await self.session.close()
            self.session = None

    @retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=1, max=10))
    async def list_models(self) -> List[Dict[str, Any]]:
        """List available models in Ollama."""
        if not self.session:
            self.session = aiohttp.ClientSession(timeout=self.timeout)

        async with self.session.get(f"{self.base_url}/api/tags") as response:
            if response.status != 200:
                error_text = await response.text()
                raise Exception(f"Failed to list models: {error_text}")

            data = await response.json()
            return data.get("models", [])

    async def generate_completion(
        self,
        messages: List[Dict[str, str]],
        model: Optional[str] = None,
        temperature: float = 0.7,
        max_tokens: Optional[int] = None,
        tools: Optional[List[Dict[str, Any]]] = None,
        stream: bool = False,
        **kwargs
    ) -> Union[Dict[str, Any], AsyncGenerator[Dict[str, Any], None]]:
        """Generate a completion using Ollama.

        Returns a response dict, or an async generator of chunks when stream=True.
        """
        model_name = model or self.default_model

        # Check if specified model is available
        try:
            available_models = await self.list_models()
            model_names = [m.get("name") for m in available_models]

            if model_name not in model_names:
                fallback_model = self.default_model
                logger.warning(
                    f"Model '{model_name}' not available in Ollama. "
                    f"Using fallback model '{fallback_model}'."
                )
                model_name = fallback_model
        except Exception as e:
            logger.error(f"Error checking model availability: {str(e)}")
            model_name = self.default_model

        # Get model capabilities
        model_base_name = model_name.split(':')[0] if ':' in model_name else model_name
        capabilities = self.model_capabilities.get(
            model_base_name,
            {"supports_tools": False, "context_window": 4096, "max_tokens": 2048}
        )

        # Check if tools are requested but not supported
        if tools and not capabilities["supports_tools"]:
            logger.warning(
                f"Model '{model_name}' does not support tools. "
                "Tool functionality will be simulated with prompt engineering."
            )
            # We'll handle this by incorporating tool descriptions into the prompt

        # Format messages for Ollama
        prompt = self._format_messages_for_ollama(messages, tools)

        # Set max_tokens based on capabilities if not provided
        if max_tokens is None:
            max_tokens = capabilities["max_tokens"]
        else:
            max_tokens = min(max_tokens, capabilities["max_tokens"])

        # Prepare request payload
        payload = {
            "model": model_name,
            "prompt": prompt,
            "stream": stream,
            "options": {
                "temperature": temperature,
                "num_predict": max_tokens
            }
        }

        if stream:
            # _stream_completion is an async generator, so it is returned
            # un-awaited; callers iterate it with `async for`.
            return self._stream_completion(payload)
        else:
            return await self._generate_completion_sync(payload)

    async def _generate_completion_sync(self, payload: Dict[str, Any]) -> Dict[str, Any]:
        """Generate a complete (non-streamed) response."""
        if not self.session:
            self.session = aiohttp.ClientSession(timeout=self.timeout)

        try:
            async with self.session.post(
                f"{self.base_url}/api/generate",
                json=payload
            ) as response:
                if response.status != 200:
                    error_text = await response.text()
                    raise Exception(f"Ollama generate error: {error_text}")

                result = await response.json()

                # Format the response to match OpenAI's format for consistency
                formatted_response = self._format_ollama_response(result, payload)
                return formatted_response

        except Exception as e:
            logger.error(f"Error generating completion: {str(e)}")
            raise

    async def _stream_completion(self, payload: Dict[str, Any]):
        """Stream a completion."""
        if not self.session:
            self.session = aiohttp.ClientSession(timeout=self.timeout)

        try:
            async with self.session.post(
                f"{self.base_url}/api/generate",
                json=payload,
                timeout=aiohttp.ClientTimeout(total=60)
            ) as response:
                if response.status != 200:
                    error_text = await response.text()
                    raise Exception(f"Ollama generate error: {error_text}")

                # Stream the response
                full_text = ""
                async for line in response.content:
                    if not line:
                        continue

                    try:
                        chunk = json.loads(line)
                        text_chunk = chunk.get("response", "")
                        full_text += text_chunk

                        # Yield formatted chunk for streaming
                        yield self._format_ollama_stream_chunk(text_chunk)

                        # Check if done
                        if chunk.get("done", False):
                            break
                    except json.JSONDecodeError:
                        logger.warning(f"Invalid JSON in stream: {line}")

                # Send the final done chunk
                yield self._format_ollama_stream_chunk("", done=True, full_text=full_text)

        except Exception as e:
            logger.error(f"Error streaming completion: {str(e)}")
            raise

    def _format_messages_for_ollama(
        self,
        messages: List[Dict[str, str]],
        tools: Optional[List[Dict[str, Any]]] = None
    ) -> str:
        """Format messages for Ollama."""
        formatted_messages = []

        # Add tools descriptions if provided
        if tools:
            tools_description = self._format_tools_description(tools)
            formatted_messages.append(f"[System]\n{tools_description}\n")

        for msg in messages:
            role = msg["role"]
            content = msg["content"] or ""

            if role == "system":
                formatted_messages.append(f"[System]\n{content}")
            elif role == "user":
                formatted_messages.append(f"[User]\n{content}")
            elif role == "assistant":
                formatted_messages.append(f"[Assistant]\n{content}")
            elif role == "tool":
                # Format tool responses
                tool_call_id = msg.get("tool_call_id", "unknown")
                formatted_messages.append(f"[Tool Result: {tool_call_id}]\n{content}")

        # Add final prompt for assistant response
        formatted_messages.append("[Assistant]\n")

        return "\n\n".join(formatted_messages)

    def _format_tools_description(self, tools: List[Dict[str, Any]]) -> str:
        """Format tools description for inclusion in the prompt."""
        tools_text = ["You have access to the following tools:"]

        for tool in tools:
            if tool.get("type") == "function":
                function = tool["function"]
                function_name = function["name"]
                function_description = function.get("description", "")

                tools_text.append(f"Tool: {function_name}")
                tools_text.append(f"Description: {function_description}")

                # Format parameters if available
                if "parameters" in function:
                    parameters = function["parameters"]
                    if "properties" in parameters:
                        tools_text.append("Parameters:")
                        for param_name, param_details in parameters["properties"].items():
                            param_type = param_details.get("type", "unknown")
                            param_desc = param_details.get("description", "")
                            required = "Required" if param_name in parameters.get("required", []) else "Optional"
                            tools_text.append(f"  - {param_name} ({param_type}, {required}): {param_desc}")

                tools_text.append("")  # Empty line between tools

        tools_text.append("""
When you need to use a tool, specify it clearly using the format:

<tool>
{
  "name": "tool_name",
  "parameters": {
    "param1": "value1",
    "param2": "value2"
  }
}
</tool>

Wait for the tool result before continuing.
""")

        return "\n".join(tools_text)

    def _format_ollama_response(self, result: Dict[str, Any], request: Dict[str, Any]) -> Dict[str, Any]:
        """Format Ollama response to match OpenAI's format."""
        response_text = result.get("response", "")

        # Check for tool calls in the response
        tool_calls = self._extract_tool_calls(response_text)

        # Calculate token counts (approximate)
        prompt_tokens = len(request["prompt"]) // 4  # Rough approximation
        completion_tokens = len(response_text) // 4  # Rough approximation

        response = {
            "id": f"ollama-{result.get('id', 'unknown')}",
            "object": "chat.completion",
            # Ollama reports created_at as an ISO 8601 string, which int()
            # cannot parse, so the local receive time is used instead.
            "created": int(time.time()),
            "model": request["model"],
            "provider": "ollama",
            "usage": {
                "prompt_tokens": prompt_tokens,
                "completion_tokens": completion_tokens,
                "total_tokens": prompt_tokens + completion_tokens
            },
            "message": {
                "role": "assistant",
                "content": self._clean_tool_calls_from_text(response_text) if tool_calls else response_text,
                "tool_calls": tool_calls
            }
        }

        return response

    def _format_ollama_stream_chunk(
        self,
        chunk_text: str,
        done: bool = False,
        full_text: Optional[str] = None
    ) -> Dict[str, Any]:
        """Format a streaming chunk to match OpenAI's format."""
        if done and full_text:
            # Final chunk might include tool calls
            tool_calls = self._extract_tool_calls(full_text)
            cleaned_text = self._clean_tool_calls_from_text(full_text) if tool_calls else full_text

            return {
                "id": f"ollama-chunk-{id(chunk_text)}",
                "object": "chat.completion.chunk",
                "created": int(time.time()),
                "model": self.default_model,
                "choices": [{
                    "index": 0,
                    "delta": {
                        "content": "",
                        "tool_calls": tool_calls if tool_calls else None
                    },
                    "finish_reason": "stop"
                }]
            }
        else:
            return {
                "id": f"ollama-chunk-{id(chunk_text)}",
                "object": "chat.completion.chunk",
                "created": int(time.time()),
                "model": self.default_model,
                "choices": [{
                    "index": 0,
                    "delta": {
                        "content": chunk_text
                    },
                    "finish_reason": None
                }]
            }

    def _extract_tool_calls(self, text: str) -> Optional[List[Dict[str, Any]]]:
        """Extract tool calls from response text."""
        # Look for tool calls in the format <tool>...</tool>
        tool_pattern = re.compile(r'<tool>(.*?)</tool>', re.DOTALL)
        matches = tool_pattern.findall(text)

        if not matches:
            return None

        tool_calls = []
        for i, match in enumerate(matches):
            try:
                # Try to parse as JSON
                tool_data = json.loads(match.strip())

                tool_calls.append({
                    "id": f"call_{uuid.uuid4().hex[:8]}",
                    "type": "function",
                    "function": {
                        "name": tool_data.get("name", "unknown_tool"),
                        "arguments": json.dumps(tool_data.get("parameters", {}))
                    }
                })
            except json.JSONDecodeError:
                # If not valid JSON, try to extract name and arguments using regex
                name_match = re.search(r'"name"\s*:\s*"([^"]+)"', match)
                args_match = re.search(r'"parameters"\s*:\s*(\{.*\})', match)

                if name_match:
                    tool_name = name_match.group(1)
                    tool_args = "{}" if not args_match else args_match.group(1)

                    tool_calls.append({
                        "id": f"call_{uuid.uuid4().hex[:8]}",
                        "type": "function",
                        "function": {
                            "name": tool_name,
                            "arguments": tool_args
                        }
                    })

        return tool_calls if tool_calls else None

    def _clean_tool_calls_from_text(self, text: str) -> str:
        """Remove tool calls from response text."""
        # Remove <tool>...</tool> blocks
        cleaned_text = re.sub(r'<tool>.*?</tool>', '', text, flags=re.DOTALL)

        # Remove any leftover tool usage instructions
        cleaned_text = re.sub(r'I will use a tool to help with this\.', '', cleaned_text)
        cleaned_text = re.sub(r'Let me use the .* tool\.', '', cleaned_text)

        # Clean up multiple newlines
        cleaned_text = re.sub(r'\n{3,}', '\n\n', cleaned_text)

        return cleaned_text.strip()
```
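Because every model in the capability map is flagged `supports_tools: False`, tool use is emulated by asking the model to emit `<tool>` JSON blocks and parsing them out of the reply. The standalone sketch below runs that round trip on a hand-written sample reply. `parse_tool_blocks` and `strip_tool_blocks` are illustrative names, and the logic is a simplification of `_extract_tool_calls` (no regex fallback, no generated call IDs), not a replacement for it.

```python
import json
import re
from typing import Any, Dict, List

# Same delimiter convention the service asks the model to use.
TOOL_BLOCK = re.compile(r"<tool>(.*?)</tool>", re.DOTALL)


def parse_tool_blocks(text: str) -> List[Dict[str, Any]]:
    """Return the well-formed tool requests found in a model reply."""
    calls = []
    for raw in TOOL_BLOCK.findall(text):
        try:
            data = json.loads(raw.strip())
        except json.JSONDecodeError:
            continue  # this sketch simply skips malformed blocks
        if not isinstance(data, dict):
            continue
        calls.append({
            "name": data.get("name", "unknown_tool"),
            "arguments": data.get("parameters", {}),
        })
    return calls


def strip_tool_blocks(text: str) -> str:
    """Return the reply text with the tool blocks removed."""
    return TOOL_BLOCK.sub("", text).strip()


# Hand-written sample reply, not real model output.
sample = (
    "I will look that up.\n"
    "<tool>\n"
    '{"name": "search", "parameters": {"query": "ollama"}}\n'
    "</tool>"
)
calls = parse_tool_blocks(sample)
visible = strip_tool_blocks(sample)
```

Prompt-level emulation depends on the model following the format, so a caller would still need to handle replies that contain no parseable block at all.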
Provider Selection Service
```python
# app/services/provider_service.py
import os
import json
import logging
import time
from typing import List, Dict, Any, Optional, Union, AsyncGenerator
import asyncio
from enum import Enum
import hashlib

import openai
from openai import AsyncOpenAI
from app.services.ollama_service import OllamaService
from app.config import settings

logger = logging.getLogger(__name__)

class Provider(str, Enum):
    OPENAI = "openai"
    OLLAMA = "ollama"
    AUTO = "auto"

class ModelSelectionCriteria:
    """Criteria for model selection in auto-routing."""
    def __init__(
        self,
        complexity_threshold: float = 0.65,
        privacy_sensitive_tokens: Optional[List[str]] = None,
        latency_requirement: Optional[float] = None,
        token_budget: Optional[int] = None,
        tool_requirements: Optional[List[str]] = None
    ):
        self.complexity_threshold = complexity_threshold
        self.privacy_sensitive_tokens = privacy_sensitive_tokens or []
        self.latency_requirement = latency_requirement
        self.token_budget = token_budget
        self.tool_requirements = tool_requirements

class ProviderService:
    """Service for routing requests to the appropriate provider."""

    def __init__(self):
        self.openai_client = None
        self.ollama_service = OllamaService()
        self.model_selection_criteria = ModelSelectionCriteria(
            complexity_threshold=settings.COMPLEXITY_THRESHOLD,
            privacy_sensitive_tokens=settings.PRIVACY_SENSITIVE_TOKENS.split(",") if hasattr(settings, "PRIVACY_SENSITIVE_TOKENS") else []
        )

        # Model mappings
        self.default_openai_model = settings.OPENAI_MODEL
        self.default_ollama_model = settings.OLLAMA_MODEL

        # Response cache
        self.cache_enabled = getattr(settings, "ENABLE_RESPONSE_CACHE", False)
        self.cache = {}
        self.cache_ttl = getattr(settings, "RESPONSE_CACHE_TTL", 3600)  # 1 hour default

    async def initialize(self):
        """Initialize the provider service."""
        # Initialize OpenAI client
        self.openai_client = AsyncOpenAI(
            api_key=settings.OPENAI_API_KEY,
            organization=getattr(settings, "OPENAI_ORG_ID", None)
        )

        # Initialize Ollama service
        await self.ollama_service.initialize()

        logger.info("Provider service initialized")

    async def cleanup(self):
        """Clean up resources."""
        await self.ollama_service.cleanup()

    async def generate_completion(
        self,
        messages: List[Dict[str, str]],
        model: Optional[str] = None,
        provider: Optional[Union[str, Provider]] = None,
        tools: Optional[List[Dict[str, Any]]] = None,
        stream: bool = False,
        temperature: float = 0.7,
        max_tokens: Optional[int] = None,
        user: Optional[str] = None,
        **kwargs
    ) -> Dict[str, Any]:
        """Generate a complete (non-streamed) response from the selected provider."""
        if stream:
            # A coroutine cannot yield chunks; streaming goes through stream_completion().
            logger.warning("generate_completion() returns a full response; use stream_completion() to stream")
            stream = False

        # Determine the provider and model
        selected_provider, selected_model = await self._select_provider_and_model(
            messages, model, provider, tools, **kwargs
        )

        # Check cache if enabled
        cache_key = None
        if self.cache_enabled:
            cache_key = self._generate_cache_key(
                messages, selected_provider, selected_model, tools, temperature, max_tokens, kwargs
            )
            cached_response = self._get_from_cache(cache_key)
            if cached_response:
                logger.info(f"Cache hit for {selected_provider}:{selected_model}")
                return cached_response

        # Generate completion based on selected provider
        try:
            if selected_provider == Provider.OPENAI:
                response = await self._generate_openai_completion(
                    messages, selected_model, tools, stream, temperature, max_tokens, user, **kwargs
                )
            else:  # OLLAMA
                response = await self._generate_ollama_completion(
                    messages, selected_model, tools, stream, temperature, max_tokens, **kwargs
                )

            # Add provider info and cache if appropriate
            if response:
                response["provider"] = selected_provider.value
                if self.cache_enabled:
                    self._add_to_cache(cache_key, response)

            return response
        except Exception as e:
            logger.error(f"Error generating completion with {selected_provider}: {str(e)}")

            # Try fallback if auto-routing was enabled
            if provider == Provider.AUTO:
                fallback_provider = Provider.OLLAMA if selected_provider == Provider.OPENAI else Provider.OPENAI
                logger.info(f"Attempting fallback to {fallback_provider}")

                try:
                    if fallback_provider == Provider.OPENAI:
                        fallback_model = self.default_openai_model
                        response = await self._generate_openai_completion(
                            messages, fallback_model, tools, stream, temperature, max_tokens, user, **kwargs
                        )
                    else:  # OLLAMA
                        fallback_model = self.default_ollama_model
                        response = await self._generate_ollama_completion(
                            messages, fallback_model, tools, stream, temperature, max_tokens, **kwargs
                        )

                    if response:
                        response["provider"] = fallback_provider.value
                        # Don't cache fallback responses

                    return response
                except Exception as fallback_error:
                    logger.error(f"Fallback also failed: {str(fallback_error)}")

            # Re-raise the original error if we couldn't fall back
            raise

    async def stream_completion(
        self,
        messages: List[Dict[str, str]],
        model: Optional[str] = None,
        provider: Optional[Union[str, Provider]] = None,
        tools: Optional[List[Dict[str, Any]]] = None,
        temperature: float = 0.7,
        max_tokens: Optional[int] = None,
        user: Optional[str] = None,
        **kwargs
    ) -> AsyncGenerator[Dict[str, Any], None]:
        """Stream a completion from the selected provider."""
        # This method always streams; drop any caller-supplied flag so it is
        # not passed twice to the helpers below.
        kwargs.pop("stream", None)

        # Determine the provider and model
        selected_provider, selected_model = await self._select_provider_and_model(
            messages, model, provider, tools, **kwargs
        )

        try:
            if selected_provider == Provider.OPENAI:
                async for chunk in self._stream_openai_completion(
                    messages, selected_model, tools, temperature, max_tokens, user, **kwargs
                ):
                    chunk["provider"] = selected_provider.value
                    yield chunk
            else:  # OLLAMA
                async for chunk in self._stream_ollama_completion(
                    messages, selected_model, tools, temperature, max_tokens, **kwargs
                ):
                    chunk["provider"] = selected_provider.value
                    yield chunk
        except Exception as e:
            logger.error(f"Error streaming completion with {selected_provider}: {str(e)}")

            # Try fallback if auto-routing was enabled
            if provider == Provider.AUTO:
                fallback_provider = Provider.OLLAMA if selected_provider == Provider.OPENAI else Provider.OPENAI
                logger.info(f"Attempting fallback to {fallback_provider}")

                try:
                    if fallback_provider == Provider.OPENAI:
                        fallback_model = self.default_openai_model
                        async for chunk in self._stream_openai_completion(
                            messages, fallback_model, tools, temperature, max_tokens, user, **kwargs
                        ):
                            chunk["provider"] = fallback_provider.value
                            yield chunk
                    else:  # OLLAMA
                        fallback_model = self.default_ollama_model
                        async for chunk in self._stream_ollama_completion(
                            messages, fallback_model, tools, temperature, max_tokens, **kwargs
                        ):
                            chunk["provider"] = fallback_provider.value
                            yield chunk
                except Exception as fallback_error:
                    logger.error(f"Fallback streaming also failed: {str(fallback_error)}")
                    # Nothing more we can do here

            # For streaming, we don't re-raise since we've already started the response

    async def _select_provider_and_model(
        self,
        messages: List[Dict[str, str]],
        model: Optional[str] = None,
        provider: Optional[Union[str, Provider]] = None,
        tools: Optional[List[Dict[str, Any]]] = None,
        **kwargs
    ) -> tuple[Provider, str]:
        """Select the provider and model based on input and criteria."""
        # Handle explicit provider/model specification
        if model and ":" in model:
            # Format: "provider:model", e.g. "openai:gpt-4" or "ollama:llama2"
            provider_str, model_name = model.split(":", 1)
            if provider_str.lower() in (Provider.OPENAI.value, Provider.OLLAMA.value):
                return Provider(provider_str.lower()), model_name
            # Otherwise the colon belongs to an Ollama tag such as "llama2:13b"

        # Handle explicit provider with default model
        if provider and provider != Provider.AUTO:
            selected_provider = Provider(provider) if isinstance(provider, str) else provider
            selected_model = model or (
                self.default_openai_model if selected_provider == Provider.OPENAI
                else self.default_ollama_model
            )
            return selected_provider, selected_model

        # If model specified without provider, infer provider
        if model:
            # Heuristic: OpenAI models typically start with "gpt-" or "text-"
            if model.startswith(("gpt-", "text-")):
                return Provider.OPENAI, model
            else:
                return Provider.OLLAMA, model

        # Auto-routing based on message content and requirements
        if not provider or provider == Provider.AUTO:
            selected_provider = await self._auto_route(messages, tools, **kwargs)
            selected_model = (
                self.default_openai_model if selected_provider == Provider.OPENAI
                else self.default_ollama_model
            )
            return selected_provider, selected_model

        # Default fallback
        return Provider.OPENAI, self.default_openai_model

    async def _auto_route(
        self,
        messages: List[Dict[str, str]],
        tools: Optional[List[Dict[str, Any]]] = None,
        **kwargs
    ) -> Provider:
        """Automatically route to the appropriate provider based on content and requirements."""
        # 1. Check for tool requirements
        if tools:
            # If tools are required, prefer OpenAI as Ollama's tool support is limited
            return Provider.OPENAI

        # 2. Check for privacy concerns
        if self._contains_sensitive_information(messages):
            logger.info("Privacy sensitive information detected, routing to Ollama")
            return Provider.OLLAMA

        # 3. Assess complexity
        complexity_score = await self._assess_complexity(messages)
        logger.info(f"Content complexity score: {complexity_score}")

        if complexity_score > self.model_selection_criteria.complexity_threshold:
            logger.info(f"High complexity content ({complexity_score}), routing to OpenAI")
            return Provider.OPENAI

        # 4. Consider token budget (if specified)
        token_budget = kwargs.get("token_budget") or self.model_selection_criteria.token_budget
        if token_budget:
            estimated_tokens = self._estimate_token_count(messages)
            if estimated_tokens > token_budget:
                logger.info(f"Token budget ({token_budget}) exceeded ({estimated_tokens}), routing to OpenAI")
                return Provider.OPENAI

        # Default to Ollama for standard requests
        logger.info("Standard request, routing to Ollama")
        return Provider.OLLAMA

    def _contains_sensitive_information(self, messages: List[Dict[str, str]]) -> bool:
        """Check if messages contain privacy-sensitive information."""
        sensitive_tokens = self.model_selection_criteria.privacy_sensitive_tokens
        if not sensitive_tokens:
            return False

        combined_text = " ".join([msg.get("content", "") or "" for msg in messages])
        combined_text = combined_text.lower()

        for token in sensitive_tokens:
            if token.lower() in combined_text:
                return True

        return False

    async def _assess_complexity(self, messages: List[Dict[str, str]]) -> float:
        """Assess the complexity of the messages."""
        # Simple heuristics for complexity:
        # 1. Length of content
        # 2. Presence of complex tokens (technical terms, specialized vocabulary)
        # 3. Sentence complexity

        user_messages = [msg.get("content", "") for msg in messages if msg.get("role") == "user"]
        if not user_messages:
            return 0.0

        last_message = user_messages[-1] or ""

        # 1. Length factor (normalized to 0-1 range)
        length = len(last_message)
        length_factor = min(length / 1000, 1.0) * 0.3  # 30% weight to length

        # 2. Complexity indicators
        complex_terms = [
            "analyze", "synthesize", "evaluate", "compare", "contrast",
            "explain", "technical", "detailed", "comprehensive", "algorithm",
            "implementation", "architecture", "design", "optimize", "complex"
        ]

        term_count = sum(1 for term in complex_terms if term in last_message.lower())
        term_factor = min(term_count / 10, 1.0) * 0.4  # 40% weight to complex terms

        # 3. Sentence complexity (approximated by average sentence length)
        sentences = [s.strip() for s in last_message.split(".") if s.strip()]
        if sentences:
            avg_sentence_length = sum(len(s.split()) for s in sentences) / len(sentences)
            sentence_factor = min(avg_sentence_length / 25, 1.0) * 0.3  # 30% weight to sentence complexity
        else:
            sentence_factor = 0.0

        # Combined complexity score
        complexity = length_factor + term_factor + sentence_factor

        return complexity

    def _estimate_token_count(self, messages: List[Dict[str, str]]) -> int:
        """Estimate the token count for the messages."""
        # Simple approximation: 1 token ≈ 4 characters
        combined_text = " ".join([msg.get("content", "") or "" for msg in messages])
        return len(combined_text) // 4

    def _build_openai_kwargs(
        self,
        messages: List[Dict[str, str]],
        model: str,
        tools: Optional[List[Dict[str, Any]]],
        stream: bool,
        temperature: float,
        max_tokens: Optional[int],
        user: Optional[str],
        **kwargs
    ) -> Dict[str, Any]:
        """Build the keyword arguments for chat.completions.create()."""
        completion_kwargs = {
            "model": model,
            "messages": messages,
            "temperature": temperature,
            "stream": stream
        }

        if max_tokens:
            completion_kwargs["max_tokens"] = max_tokens

        if tools:
            completion_kwargs["tools"] = tools

        if "tool_choice" in kwargs:
            completion_kwargs["tool_choice"] = kwargs["tool_choice"]

        if "response_format" in kwargs:
            completion_kwargs["response_format"] = kwargs["response_format"]

        if user:
            completion_kwargs["user"] = user

        return completion_kwargs

    async def _generate_openai_completion(
        self,
        messages: List[Dict[str, str]],
        model: str,
        tools: Optional[List[Dict[str, Any]]] = None,
        stream: bool = False,
        temperature: float = 0.7,
        max_tokens: Optional[int] = None,
        user: Optional[str] = None,
        **kwargs
    ) -> Dict[str, Any]:
        """Generate a complete response using OpenAI.

        `stream` is kept for signature compatibility and ignored here: a function
        that both yields chunks and returns a value is a syntax error in Python,
        so streaming lives in _stream_openai_completion.
        """
        completion_kwargs = self._build_openai_kwargs(
            messages, model, tools, False, temperature, max_tokens, user, **kwargs
        )
        response = await self.openai_client.chat.completions.create(**completion_kwargs)
        return response.model_dump()

    async def _stream_openai_completion(
        self,
        messages: List[Dict[str, str]],
        model: str,
        tools: Optional[List[Dict[str, Any]]] = None,
        temperature: float = 0.7,
        max_tokens: Optional[int] = None,
        user: Optional[str] = None,
        **kwargs
    ) -> AsyncGenerator[Dict[str, Any], None]:
        """Stream a completion from OpenAI."""
        completion_kwargs = self._build_openai_kwargs(
            messages, model, tools, True, temperature, max_tokens, user, **kwargs
        )
        response_stream = await self.openai_client.chat.completions.create(**completion_kwargs)
        async for chunk in response_stream:
            yield chunk.model_dump()

    async def _generate_ollama_completion(
        self,
        messages: List[Dict[str, str]],
        model: str,
        tools: Optional[List[Dict[str, Any]]] = None,
        stream: bool = False,
        temperature: float = 0.7,
        max_tokens: Optional[int] = None,
        **kwargs
    ) -> Dict[str, Any]:
        """Generate a complete response using Ollama (`stream` is ignored here)."""
        return await self.ollama_service.generate_completion(
            messages=messages,
            model=model,
            temperature=temperature,
            max_tokens=max_tokens,
            tools=tools,
            stream=False,
            **kwargs
        )

    async def _stream_ollama_completion(
        self,
        messages: List[Dict[str, str]],
        model: str,
        tools: Optional[List[Dict[str, Any]]] = None,
        temperature: float = 0.7,
        max_tokens: Optional[int] = None,
        **kwargs
    ) -> AsyncGenerator[Dict[str, Any], None]:
        """Stream a completion from Ollama."""
        # generate_completion is a coroutine that hands back an async generator
        # when stream=True, so it is awaited first and then iterated.
        chunk_stream = await self.ollama_service.generate_completion(
            messages=messages,
            model=model,
            temperature=temperature,
            max_tokens=max_tokens,
            tools=tools,
            stream=True,
            **kwargs
        )
        async for chunk in chunk_stream:
            yield chunk

    def _generate_cache_key(self, *args) -> str:
        """Generate a cache key based on the input parameters."""
        # Convert complex objects to JSON strings first
        args_str = json.dumps([arg if not isinstance(arg, (dict, list)) else json.dumps(arg, sort_keys=True) for arg in args])
        return hashlib.md5(args_str.encode()).hexdigest()

    def _get_from_cache(self, key: str) -> Optional[Dict[str, Any]]:
        """Get a response from cache if available and not expired."""
        if key not in self.cache:
            return None

        cached_item = self.cache[key]
        if time.time() - cached_item["timestamp"] > self.cache_ttl:
            # Expired
            del self.cache[key]
            return None

        return cached_item["response"]

    def _add_to_cache(self, key: str, response: Dict[str, Any]):
        """Add a response to the cache."""
        self.cache[key] = {
            "response": response,
            "timestamp": time.time()
        }

        # Simple cache size management - remove oldest if too many items
        max_cache_size = getattr(settings, "RESPONSE_CACHE_MAX_ITEMS", 1000)
        if len(self.cache) > max_cache_size:
            # Remove oldest 10% of items
            items_to_remove = max(1, int(max_cache_size * 0.1))
            oldest_keys = sorted(
                self.cache.keys(),
                key=lambda k: self.cache[k]["timestamp"]
            )[:items_to_remove]

            for old_key in oldest_keys:
                del self.cache[old_key]
```
Configuration Settings
```python
# app/config.py
import os
from pydantic_settings import BaseSettings
from typing import List, Optional, Dict, Any
from dotenv import load_dotenv

# Load environment variables from .env file
load_dotenv()

class Settings(BaseSettings):
    # API Keys and Authentication
    OPENAI_API_KEY: str
    OPENAI_ORG_ID: Optional[str] = None

    # Model Configuration
    OPENAI_MODEL: str = "gpt-4o"
    OLLAMA_MODEL: str = "llama2"
    OLLAMA_HOST: str = "http://localhost:11434"

    # System Behavior
    TEMPERATURE: float = 0.7
    MAX_TOKENS: int = 4096
    REQUEST_TIMEOUT: int = 120

    # Routing Configuration
    COMPLEXITY_THRESHOLD: float = 0.65
    PRIVACY_SENSITIVE_TOKENS: str = "password,secret,token,key,credential"

    # Caching Configuration
    ENABLE_RESPONSE_CACHE: bool = True
    RESPONSE_CACHE_TTL: int = 3600  # 1 hour
    RESPONSE_CACHE_MAX_ITEMS: int = 1000

    # Logging Configuration
    LOG_LEVEL: str = "INFO"

    # Database Configuration
    DATABASE_URL: Optional[str] = None

    # Advanced Ollama Configuration
    OLLAMA_MODELS_MAPPING: Dict[str, str] = {
        "gpt-3.5-turbo": "llama2",
        "gpt-4": "llama2",
        "gpt-4o": "mistral",
        "code-llama": "codellama"
    }

    class Config:
        env_file = ".env"
        env_file_encoding = "utf-8"

settings = Settings()
```
Model Selection and Configuration
The catalog below lists recommended Ollama models, their capabilities, and the use cases each one suits best:
```python
# app/models/model_catalog.py
from typing import Dict, List, Any, Optional

class ModelCapability:
    """Represents the capabilities of a model."""
    def __init__(
        self,
        context_window: int,
        strengths: List[str],
        supports_tools: bool,
        recommended_temperature: float,
        approximate_speed: str  # "fast", "medium", "slow"
    ):
        self.context_window = context_window
        self.strengths = strengths
        self.supports_tools = supports_tools
        self.recommended_temperature = recommended_temperature
        self.approximate_speed = approximate_speed

# Ollama model catalog
OLLAMA_MODELS = {
    "llama2": ModelCapability(
        context_window=4096,
        strengths=["general_knowledge", "reasoning", "instruction_following"],
        supports_tools=False,
        recommended_temperature=0.7,
        approximate_speed="medium"
    ),
    "llama2:13b": ModelCapability(
        context_window=4096,
        strengths=["general_knowledge", "reasoning", "instruction_following"],
        supports_tools=False,
        recommended_temperature=0.7,
        approximate_speed="medium"
    ),
    "llama2:70b": ModelCapability(
        context_window=4096,
        strengths=["general_knowledge", "reasoning", "instruction_following"],
        supports_tools=False,
        recommended_temperature=0.65,
        approximate_speed="slow"
    ),
    "mistral": ModelCapability(
        context_window=8192,
        strengths=["instruction_following", "reasoning", "versatility"],
        supports_tools=False,
        recommended_temperature=0.7,
        approximate_speed="medium"
    ),
    "mistral:7b-instruct": ModelCapability(
        context_window=8192,
        strengths=["instruction_following", "chat", "versatility"],
        supports_tools=False,
        recommended_temperature=0.7,
        approximate_speed="medium"
    ),
    "codellama": ModelCapability(
        context_window=16384,
        strengths=["code_generation", "code_explanation", "technical_writing"],
        supports_tools=False,
        recommended_temperature=0.5,
        approximate_speed="medium"
    ),
    "codellama:34b": ModelCapability(
        context_window=16384,
        strengths=["code_generation", "code_explanation", "technical_writing"],
        supports_tools=False,
        recommended_temperature=0.5,
        approximate_speed="slow"
    ),
    "dolphin-mistral": ModelCapability(
        context_window=8192,
        strengths=["conversational", "creative", "helpfulness"],
        supports_tools=False,
        recommended_temperature=0.7,
        approximate_speed="medium"
    ),
    "neural-chat": ModelCapability(
        context_window=8192,
        strengths=["conversational", "instruction_following", "helpfulness"],
        supports_tools=False,
        recommended_temperature=0.7,
        approximate_speed="medium"
    ),
    "orca-mini": ModelCapability(
        context_window=4096,
        strengths=["efficiency", "general_knowledge", "basic_reasoning"],
        supports_tools=False,
        recommended_temperature=0.8,
        approximate_speed="fast"
    ),
    "vicuna": ModelCapability(
        context_window=4096,
        strengths=["conversational", "instruction_following"],
        supports_tools=False,
        recommended_temperature=0.7,
        approximate_speed="medium"
    ),
    "wizard-math": ModelCapability(
        context_window=4096,
        strengths=["mathematics", "problem_solving", "logical_reasoning"],
        supports_tools=False,
        recommended_temperature=0.5,
        approximate_speed="medium"
    ),
    "phi": ModelCapability(
        context_window=2048,
        strengths=["efficiency", "basic_tasks", "lightweight"],
        supports_tools=False,
        recommended_temperature=0.7,
        approximate_speed="fast"
    )
}

# OpenAI -> Ollama model mapping for fallback scenarios
OPENAI_TO_OLLAMA_MAPPING = {
    "gpt-3.5-turbo": "llama2",
    "gpt-3.5-turbo-16k": "mistral:7b-instruct",
    "gpt-4": "llama2:70b",
    "gpt-4o": "mistral",
    "gpt-4-turbo": "mistral",
    "code-llama": "codellama"
}

# Use case to model recommendations
USE_CASE_RECOMMENDATIONS = {
    "code_generation": ["codellama:34b", "codellama"],
    "creative_writing": ["dolphin-mistral", "mistral:7b-instruct"],
    "mathematical_reasoning": ["wizard-math", "llama2:70b"],
    "conversational": ["neural-chat", "dolphin-mistral"],
    "knowledge_intensive": ["llama2:70b", "mistral"],
    "resource_constrained": ["phi", "orca-mini"]
}

def recommend_ollama_model(use_case: str, performance_tier: str = "medium") -> str:
    """Recommend an Ollama model based on use case and performance requirements."""
    if use_case in USE_CASE_RECOMMENDATIONS:
        models = USE_CASE_RECOMMENDATIONS[use_case]

        # Filter by performance tier if needed
        if performance_tier == "high":
            for model in models:
                if ":70b" in model or ":34b" in model:
                    return model
            return models[0]  # Return first if no high-tier match
        elif performance_tier == "low":
            return "orca-mini" if use_case != "code_generation" else "codellama"
        else:  # medium tier
            return models[0]

    # Default recommendations
    if performance_tier == "high":
        return "llama2:70b"
    elif performance_tier == "low":
        return "orca-mini"
    else:
        return "mistral"
```
Agent Adapter for Model Selection
```python
# app/agents/adaptive_agent.py
from typing import List, Dict, Any, Optional
import logging
import time

from app.agents.base_agent import BaseAgent
from app.models.message import Message, MessageRole
from app.services.provider_service import ProviderService, Provider
from app.models.model_catalog import recommend_ollama_model, OLLAMA_MODELS

logger = logging.getLogger(__name__)

class AdaptiveAgent(BaseAgent):
    """Agent that adapts its model selection based on task requirements."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.last_used_model = None
        self.last_used_provider = None
        self.performance_metrics = {}

    async def _generate_response(self, user_id: str) -> str:
        """Generate a response with dynamic model selection."""
        # Extract the last user message
        last_user_message = next(
            (msg for msg in reversed(self.state.conversation_history)
             if msg.role == MessageRole.USER),
            None
        )

        if not last_user_message:
            return "I don't have any messages to respond to."

        # Analyze the message to determine the best model
        provider, model = await self._select_optimal_model(last_user_message.content)

        logger.info(f"Selected model for response: {provider}:{model}")

        # Track the selected model for monitoring
        self.last_used_model = model
        self.last_used_provider = provider

        # Get model-specific parameters
        params = self._get_model_parameters(provider, model)

        # Start timing for performance metrics
        start_time = time.time()

        # Generate the response
        response = await self.provider_service.generate_completion(
            messages=[msg.model_dump() for msg in self.state.conversation_history],
            model=f"{provider}:{model}" if provider != "auto" else None,
            provider=provider,
            tools=self.tools,
            temperature=params.get("temperature", 0.7),
            max_tokens=params.get("max_tokens"),
            user=user_id
        )

        # Record performance metrics
        execution_time = time.time() - start_time
        self._update_performance_metrics(provider, model, execution_time, response)

        if response.get("tool_calls"):
            # Process tool calls if needed
            # ... (tool call handling code)
            pass

        return response["message"]["content"]

    async def _select_optimal_model(self, message: str) -> tuple[str, str]:
        """Select the optimal model based on message analysis."""
        # 1. Analyze for use case
        use_case = await self._determine_use_case(message)

        # 2. Determine performance needs
        performance_tier = self._determine_performance_tier(message)

        # 3. Check if tools are required
        tools_required = len(self.tools) > 0

        # 4. Check message complexity
        is_complex = await self._is_complex_request(message)

        # Decision logic
        if tools_required:
            # OpenAI is better for tool usage
            return "openai", "gpt-4o"

        if is_complex:
            # For complex requests, prefer OpenAI or high-tier Ollama models
            if performance_tier == "high":
                return "openai", "gpt-4o"
            else:
                ollama_model = recommend_ollama_model(use_case, "high")
                return "ollama", ollama_model

        # For standard requests, use Ollama with appropriate model
        ollama_model = recommend_ollama_model(use_case, performance_tier)
        return "ollama", ollama_model

    async def _determine_use_case(self, message: str) -> str:
        """Determine the use case based on message content."""
        message_lower = message.lower()

        # Simple heuristic classification; the first matching category wins
        if any(term in message_lower for term in ["code", "program", "function", "class", "algorithm"]):
            return "code_generation"

        if any(term in message_lower for term in ["story", "creative", "imagine", "write", "novel"]):
            return "creative_writing"

        if any(term in message_lower for term in ["math", "calculate", "equation", "solve", "formula"]):
            return "mathematical_reasoning"

        if any(term in message_lower for term in ["chat", "talk", "discuss", "conversation"]):
            return "conversational"

        if len(message.split()) > 50 or any(term in message_lower for term in ["explain", "detail", "analysis"]):
            return "knowledge_intensive"

        # Default to conversational
        return "conversational"

    def _determine_performance_tier(self, message: str) -> str:
        """Determine the performance tier needed based on message characteristics."""
        # Length-based heuristic
        word_count = len(message.split())

        if word_count > 100 or "detailed" in message.lower() or "comprehensive" in message.lower():
            return "high"

        if word_count < 20 and not any(term in message.lower() for term in ["complex", "difficult", "advanced"]):
            return "low"

        return "medium"

    async def _is_complex_request(self, message: str) -> bool:
        """Determine if this is a complex request requiring more powerful models."""
        # Check for indicators of complexity
        complexity_indicators = [
            "complex", "detailed", "thorough", "comprehensive", "in-depth",
            "analyze", "compare", "synthesize", "evaluate", "technical",
            "step by step", "advanced", "sophisticated", "nuanced"
        ]

        indicator_count = sum(1 for indicator in complexity_indicators if indicator in message.lower())

        # Length is also an indicator of complexity
        is_long = len(message.split()) > 50

        # Multiple questions indicate complexity
        question_count = message.count("?")
        has_multiple_questions = question_count > 1

        return (indicator_count >= 2) or (is_long and indicator_count >= 1) or has_multiple_questions

    def _get_model_parameters(self, provider: str, model: str) -> Dict[str, Any]:
        """Get model-specific parameters."""
        if provider == "ollama":
            if model in OLLAMA_MODELS:
                capabilities = OLLAMA_MODELS[model]
                return {
                    "temperature": capabilities.recommended_temperature,
                    "max_tokens": capabilities.context_window // 2  # Conservative estimate
                }
            else:
                # Default Ollama parameters
                return {"temperature": 0.7, "max_tokens": 2048}
        else:
            # OpenAI models
            if "gpt-4" in model:
                return {"temperature": 0.7, "max_tokens": 4096}
            else:
                return {"temperature": 0.7, "max_tokens": 2048}

    def _update_performance_metrics(
        self,
        provider: str,
        model: str,
        execution_time: float,
        response: Dict[str, Any]
    ):
        """Update performance metrics for this model."""
        model_key = f"{provider}:{model}"

        if model_key not in self.performance_metrics:
            self.performance_metrics[model_key] = {
                "calls": 0,
                "total_time": 0,
                "avg_time": 0,
                "token_usage": {
                    "prompt": 0,
                    "completion": 0,
                    "total": 0
                }
            }

        metrics = self.performance_metrics[model_key]
        metrics["calls"] += 1
        metrics["total_time"] += execution_time
        metrics["avg_time"] = metrics["total_time"] / metrics["calls"]

        # Update token usage if available
        if "usage" in response:
            usage = response["usage"]
            metrics["token_usage"]["prompt"] += usage.get("prompt_tokens", 0)
            metrics["token_usage"]["completion"] += usage.get("completion_tokens", 0)
            metrics["token_usage"]["total"] += usage.get("total_tokens", 0)
```
Agent Controller with Model Selection
```python
# app/controllers/agent_controller.py
from fastapi import APIRouter, Depends, HTTPException, Query, BackgroundTasks
from pydantic import BaseModel, Field
from typing import List, Dict, Any, Optional
import logging

from app.agents.agent_factory import AgentFactory
from app.agents.adaptive_agent import AdaptiveAgent
from app.services.provider_service import Provider
from app.services.auth_service import get_current_user
from app.config import settings
from app.models.model_catalog import recommend_ollama_model, OLLAMA_MODELS

logger = logging.getLogger(__name__)

router = APIRouter(prefix="/api/v1/agents", tags=["agents"])

class ModelSelectionParams(BaseModel):
    """Parameters for model selection."""
    provider: Optional[str] = Field(None, description="Provider to use (openai, ollama, auto)")
    model: Optional[str] = Field(None, description="Specific model to use")
    auto_select: bool = Field(True, description="Whether to auto-select the optimal model")
    use_case: Optional[str] = Field(None, description="Specific use case for model recommendation")
    performance_tier: Optional[str] = Field("medium", description="Performance tier (low, medium, high)")

class ChatRequest(BaseModel):
    message: str
    session_id: Optional[str] = None
    model_params: Optional[ModelSelectionParams] = None
    stream: bool = False

class ChatResponse(BaseModel):
    response: str
    session_id: str
    model_used: str
    provider_used: str
    execution_metrics: Optional[Dict[str, Any]] = None

# Agent sessions storage
agent_sessions = {}

def get_agent_factory():
    # Initialize and return agent factory
    # In a real implementation, this would be properly initialized
    return AgentFactory()

# Get agent factory instance
agent_factory = Depends(get_agent_factory)

@router.post("/chat", response_model=ChatResponse)
async def chat(
    request: ChatRequest,
    background_tasks: BackgroundTasks,
    current_user: Dict = Depends(get_current_user),
    factory: AgentFactory = agent_factory
):
    """Chat with an agent that intelligently selects the appropriate model."""
    user_id = current_user["id"]

    if request.stream:
        # Streaming is not implemented in this example; fail explicitly rather
        # than continuing with `response` undefined
        raise HTTPException(status_code=501, detail="Streaming is not implemented")

    # Create or retrieve session
    session_id = request.session_id
    if not session_id or session_id not in agent_sessions:
        # Create a new adaptive agent
        agent = factory.create_agent(
            agent_type="adaptive",
            agent_class=AdaptiveAgent,
            system_prompt="You are a helpful assistant that provides accurate, relevant information."
        )

        session_id = f"session_{user_id}_{len(agent_sessions) + 1}"
        agent_sessions[session_id] = agent
    else:
        agent = agent_sessions[session_id]

    # Apply model selection parameters if provided
    if request.model_params:
        if not request.model_params.auto_select:
            # Force specific provider/model
            provider = request.model_params.provider or "auto"
            model = request.model_params.model

            if provider != "auto" and model:
                logger.info(f"Forcing model selection: {provider}:{model}")
                # Set for next generation. Note: AdaptiveAgent._generate_response
                # overwrites both attributes with its own selection, so a real
                # override also needs a check inside the agent.
                agent.last_used_provider = provider
                agent.last_used_model = model

    try:
        # Process the message
        response = await agent.process_message(request.message, user_id)

        # Get the model and provider that were used
        model_used = agent.last_used_model or "unknown"
        provider_used = agent.last_used_provider or "unknown"

        # Get execution metrics
        model_key = f"{provider_used}:{model_used}"
        execution_metrics = agent.performance_metrics.get(model_key)

        # Schedule background task to analyze performance and adjust preferences
        background_tasks.add_task(
            analyze_performance,
            agent,
            model_key,
            execution_metrics
        )

        return ChatResponse(
            response=response,
            session_id=session_id,
            model_used=model_used,
            provider_used=provider_used,
            execution_metrics=execution_metrics
        )
    except Exception as e:
        logger.exception(f"Error processing message: {str(e)}")
        raise HTTPException(status_code=500, detail=f"Error processing message: {str(e)}")

@router.get("/models/recommend")
async def recommend_model(
    use_case: str = Query(..., description="The use case (code_generation, creative_writing, etc.)"),
    performance_tier: str = Query("medium", description="Performance tier (low, medium, high)"),
    current_user: Dict = Depends(get_current_user)
):
    """Get model recommendations for a specific use case."""
    # Get recommended Ollama model
    recommended_model = recommend_ollama_model(use_case, performance_tier)

    # Get OpenAI equivalent
    openai_equivalent = "gpt-4o" if performance_tier == "high" else "gpt-3.5-turbo"

    # Get model capabilities if available, as a plain dict for the JSON response
    capability = OLLAMA_MODELS.get(recommended_model)
    capabilities = vars(capability) if capability else {}

    return {
        "ollama_recommendation": recommended_model,
        "openai_recommendation": openai_equivalent,
        "capabilities": capabilities,
        "use_case": use_case,
        "performance_tier": performance_tier
    }

async def analyze_performance(agent, model_key, metrics):
    """Analyze model performance and adjust preferences."""
    if not metrics or metrics["calls"] < 5:
        # Not enough data to analyze
        return

    # Analyze average response time
    avg_time = metrics["avg_time"]

    # If response time is too slow, consider adjusting default models
    if avg_time > 5.0:  # More than 5 seconds
        logger.info(f"Model {model_key} showing slow performance: {avg_time}s avg")

        # In a real implementation, we might adjust preferred models here
        pass
```
Dockerfile for Local Deployment
```dockerfile
# Dockerfile
FROM python:3.11-slim

WORKDIR /app

# Install system dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
    curl \
    && rm -rf /var/lib/apt/lists/*

# Copy requirements
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy application code
COPY . .

# Set up environment
ENV PYTHONPATH=/app
# OPENAI_API_KEY is deliberately not set here: pass it at runtime (as
# docker-compose.yml does) so the key is never baked into the image
ENV OLLAMA_HOST="http://ollama:11434"
ENV OLLAMA_MODEL="llama2"

# Default command
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]
```
Docker Compose for Development
```yaml
# docker-compose.yml
version: '3.8'

services:
  app:
    build: .
    ports:
      - "8000:8000"
    volumes:
      - .:/app
    environment:
      - OLLAMA_HOST=http://ollama:11434
      - OPENAI_API_KEY=${OPENAI_API_KEY}
      - OPENAI_MODEL=${OPENAI_MODEL:-gpt-4o}
      - OLLAMA_MODEL=${OLLAMA_MODEL:-llama2}
    depends_on:
      - ollama
    restart: unless-stopped

  ollama:
    image: ollama/ollama:latest
    volumes:
      - ollama_data:/root/.ollama
    ports:
      - "11434:11434"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

volumes:
  ollama_data:
```
Model Preload Script
```python
#!/usr/bin/env python3
# scripts/preload_models.py
import argparse
import requests
import time
import sys
import os
from typing import List, Dict

def main():
    parser = argparse.ArgumentParser(description='Preload Ollama models')
    parser.add_argument('--host', default="http://localhost:11434", help='Ollama host URL')
    parser.add_argument('--models', default="llama2,mistral,codellama", help='Comma-separated list of models to preload')
    parser.add_argument('--timeout', type=int, default=3600, help='Timeout in seconds for each model pull')
    args = parser.parse_args()

    models = [m.strip() for m in args.models.split(',')]
    preload_models(args.host, models, args.timeout)

def is_available(model: str, available_models: List[str]) -> bool:
    """Ollama lists untagged models as "<name>:latest", so match both forms."""
    return model in available_models or f"{model}:latest" in available_models

def preload_models(host: str, models: List[str], timeout: int):
    """Preload models into Ollama."""
    print(f"Preloading {len(models)} models on {host}...")

    # Check Ollama availability
    try:
        response = requests.get(f"{host}/api/tags", timeout=10)
        if response.status_code != 200:
            print(f"Error connecting to Ollama: Status {response.status_code}")
            sys.exit(1)

        available_models = [m["name"] for m in response.json().get("models", [])]
        print(f"Currently available models: {', '.join(available_models)}")
    except Exception as e:
        print(f"Error connecting to Ollama: {str(e)}")
        sys.exit(1)

    # Pull each model
    for model in models:
        if is_available(model, available_models):
            print(f"Model {model} is already available, skipping...")
            continue

        print(f"Pulling model: {model}")
        try:
            start_time = time.time()
            # stream=False asks Ollama for a single reply once the pull has
            # finished instead of a stream of progress lines
            response = requests.post(
                f"{host}/api/pull",
                json={"name": model, "stream": False},
                timeout=timeout
            )

            if response.status_code != 200:
                print(f"Error pulling model {model}: Status {response.status_code}")
                print(response.text)
                continue

            elapsed = time.time() - start_time
            print(f"Successfully pulled {model} in {elapsed:.1f} seconds")
        except Exception as e:
            print(f"Error pulling model {model}: {str(e)}")

    # Verify available models after pulling
    try:
        response = requests.get(f"{host}/api/tags", timeout=10)
        if response.status_code == 200:
            available_models = [m["name"] for m in response.json().get("models", [])]
            print(f"Available models: {', '.join(available_models)}")
    except Exception as e:
        print(f"Error checking available models: {str(e)}")

if __name__ == "__main__":
    main()
```
Implementation Guide
Setting up Ollama
- Installation:

```bash
# macOS
brew install ollama

# Linux
curl -fsSL https://ollama.com/install.sh | sh

# Windows
# Download from https://ollama.com/download/windows
```

- Pull Base Models:

```bash
ollama pull llama2
ollama pull mistral
ollama pull codellama
```

- Start Ollama Server:

```bash
ollama serve
```
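Once the server is running, it can be worth confirming from Python that it is reachable and that the models pulled above are installed. The snippet below is a minimal sketch rather than part of the application: `list_local_models` and `missing_models` are hypothetical helper names, and the injectable `fetch` parameter is an assumption added so the parsing can be exercised without a live server. It relies on the same `/api/tags` endpoint the preload script uses, and on Ollama reporting untagged models as `<name>:latest`.

```python
import json
import urllib.request
from typing import Callable, List


def _http_get(url: str) -> bytes:
    """Fetch a URL with a short timeout (raises URLError if Ollama is down)."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return resp.read()


def list_local_models(
    host: str = "http://localhost:11434",
    fetch: Callable[[str], bytes] = _http_get,
) -> List[str]:
    """Return the model names reported by Ollama's /api/tags endpoint."""
    payload = json.loads(fetch(f"{host}/api/tags"))
    return [m["name"] for m in payload.get("models", [])]


def missing_models(wanted: List[str], available: List[str]) -> List[str]:
    """Return the wanted models that are not installed.

    An untagged name such as "llama2" is treated as matching "llama2:latest".
    """
    have = set(available) | {
        name.split(":")[0] for name in available if name.endswith(":latest")
    }
    return [w for w in wanted if w not in have]
```

With a server running, `missing_models(["llama2", "mistral", "codellama"], list_local_models())` returns the models that still need an `ollama pull`.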
Application Configuration
- Create .env file:

```text
OPENAI_API_KEY=sk-...
OPENAI_ORG_ID=org-... # Optional
OPENAI_MODEL=gpt-4o
OLLAMA_MODEL=llama2
OLLAMA_HOST=http://localhost:11434
COMPLEXITY_THRESHOLD=0.65
PRIVACY_SENSITIVE_TOKENS=password,secret,token,key,credential
```

- Initialize Application:

```bash
# Install dependencies
pip install -r requirements.txt

# Start the application
uvicorn app.main:app --reload
```
Model Selection Criteria
The system determines which provider (OpenAI or Ollama) to use based on several criteria:
- Complexity Analysis:
  - Messages are analyzed for complexity based on length, specialized terminology, and sentence structure.
  - The `COMPLEXITY_THRESHOLD` setting (default: 0.65) determines when to route to OpenAI for more complex queries.
- Privacy Concerns:
  - Messages containing sensitive terms (configured in `PRIVACY_SENSITIVE_TOKENS`) are preferentially routed to Ollama.
  - This ensures sensitive information remains on local infrastructure.
- Tool Requirements:
  - Requests requiring tools/functions are routed to OpenAI as Ollama has limited native tool support.
  - The system simulates tool usage in Ollama using prompt engineering when necessary.
- Resource Constraints:
  - Token budget constraints can trigger routing to OpenAI for longer conversations.
  - Local hardware capabilities are considered when selecting Ollama models.
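The first three criteria above can be combined into a single routing decision. The following is a minimal sketch under stated assumptions, not the article's `ProviderService`: `route_request`, `RoutingDecision`, and the toy `complexity_score` heuristic are illustrative names, and checking privacy before tools is an assumed priority order. Only the two setting names and their defaults come from the configuration shown earlier.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional

# Defaults mirror COMPLEXITY_THRESHOLD and PRIVACY_SENSITIVE_TOKENS in app/config.py
COMPLEXITY_THRESHOLD = 0.65
PRIVACY_SENSITIVE_TOKENS = "password,secret,token,key,credential".split(",")


@dataclass
class RoutingDecision:
    provider: str  # "openai" or "ollama"
    reason: str    # which rule fired


def complexity_score(text: str) -> float:
    """Toy heuristic (an assumption): length and keywords each contribute up to 0.5."""
    lowered = text.lower()
    length_part = min(len(lowered.split()) / 100, 1.0)
    keywords = ("analyze", "compare", "comprehensive", "detailed", "technical")
    keyword_part = min(sum(k in lowered for k in keywords) / 3, 1.0)
    return 0.5 * length_part + 0.5 * keyword_part


def route_request(
    messages: List[Dict[str, str]],
    tools: Optional[List[Dict[str, Any]]] = None,
) -> RoutingDecision:
    """Pick a provider for the user-authored content of a conversation."""
    text = " ".join(m.get("content", "") for m in messages if m.get("role") == "user")
    lowered = text.lower()

    # 1. Privacy: sensitive terms keep the request on local infrastructure
    if any(token in lowered for token in PRIVACY_SENSITIVE_TOKENS):
        return RoutingDecision("ollama", "privacy")

    # 2. Tools: function calling is routed to OpenAI
    if tools:
        return RoutingDecision("openai", "tools")

    # 3. Complexity: scores at or above the threshold go to OpenAI
    if complexity_score(text) >= COMPLEXITY_THRESHOLD:
        return RoutingDecision("openai", "complexity")

    return RoutingDecision("ollama", "default")
```

Note that plain substring matching is deliberately crude: the default token `key` also matches words such as "keyboard", so a production filter would likely match on word boundaries.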
Ollama Model Selection
The system intelligently selects the appropriate Ollama model based on the query's requirements:
- For code generation: `codellama` (default) or `codellama:34b` (high performance)
- For creative tasks: `dolphin-mistral` or `neural-chat`
- For mathematical reasoning: `wizard-math`
- For general knowledge: `llama2` (base), `llama2:13b` (medium), or `llama2:70b` (high performance)
- For resource-constrained environments: `phi` or `orca-mini`
Performance Optimization
- Response Caching:
  - Common responses are cached to improve performance.
  - Cache TTL and maximum items are configurable.
- Dynamic Temperature Adjustment:
  - Each model has recommended temperature settings for optimal performance.
  - The system adjusts temperature based on the task type.
- Adaptive Routing:
  - The system learns from performance metrics and adjusts routing preferences over time.
  - Models with consistently poor performance receive fewer requests.
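The `analyze_performance` task shown earlier only logs slow models, so the adaptive behavior still has to be wired in. One possible approach is sketched below; `pick_model` is a hypothetical helper, not code from the application. It reads the same `calls` and `avg_time` fields that `_update_performance_metrics` records, and reuses the 5-call minimum and 5-second cutoff from `analyze_performance`.

```python
from typing import Dict, List

MIN_CALLS = 5                 # same minimum sample size analyze_performance uses
LATENCY_BUDGET_SECONDS = 5.0  # same "slow" cutoff analyze_performance uses


def pick_model(candidates: List[str], metrics: Dict[str, Dict[str, float]]) -> str:
    """Return the first candidate that is not known to be slow.

    `candidates` are "provider:model" keys in preference order. Models with
    fewer than MIN_CALLS recorded calls are given the benefit of the doubt.
    If every candidate is slow, the one with the lowest average time wins.
    """
    if not candidates:
        raise ValueError("candidates must not be empty")

    def is_slow(key: str) -> bool:
        m = metrics.get(key)
        return bool(m) and m["calls"] >= MIN_CALLS and m["avg_time"] > LATENCY_BUDGET_SECONDS

    for key in candidates:
        if not is_slow(key):
            return key

    # Every candidate is slow, so each one has metrics to compare
    return min(candidates, key=lambda k: metrics[k]["avg_time"])
```

Keeping the preference order and only skipping models that have proven slow means a single bad request cannot demote a model, which matches the "consistently poor performance" wording above.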
Fallback Mechanisms
The system implements robust fallback mechanisms:
- Provider Fallback:
  - If OpenAI is unavailable, the system falls back to Ollama.
  - If Ollama fails, the system falls back to OpenAI.
- Model Fallback:
  - If a requested model is unavailable, the system selects an appropriate alternative.
  - Fallback chains are configured for each model to ensure graceful degradation.
- Error Handling:
  - Network errors, timeout issues, and model limitations are handled gracefully.
  - The system provides informative error messages when fallbacks are exhausted.
Conclusion
The integration of Ollama with OpenAI's Agents SDK creates a hybrid architecture that combines the strengths of both local and cloud-based inference. This implementation provides:
- Enhanced privacy by keeping sensitive information local when appropriate
- Cost optimization by routing suitable queries to local infrastructure
- Robust fallbacks ensuring system resilience against failures
- Task-appropriate model selection based on sophisticated analysis
- Seamless integration with the agent framework and tools ecosystem
This architecture represents a significant advancement in responsible AI deployment, balancing the power of cloud-based models with the privacy and cost benefits of local inference. By intelligently routing requests based on their characteristics, the system provides optimal performance while respecting critical constraints around privacy, latency, and resource utilization.
Comprehensive Testing Strategy for OpenAI-Ollama Hybrid Agent System
Theoretical Framework for Validation Methodology
The integration of cloud-based and local inferencing capabilities within a unified agent architecture necessitates a multifaceted testing approach that encompasses both individual components and their systemic interactions. This document establishes a rigorous testing framework that addresses the unique challenges of validating a hybrid AI system across multiple dimensions of functionality, performance, and reliability.
Strategic Testing Layers
1. Unit Testing Framework
Core Component Isolation Testing
```python
# tests/unit/test_provider_service.py
import pytest
import asyncio
from unittest.mock import AsyncMock, patch, MagicMock
import json

from app.services.provider_service import ProviderService, Provider
from app.services.ollama_service import OllamaService

class TestProviderService:
    @pytest.fixture
    def provider_service(self):
        """Create a provider service with mocked dependencies for testing."""
        service = ProviderService()
        service.openai_client = AsyncMock()
        service.ollama_service = AsyncMock(spec=OllamaService)
        return service

    @pytest.mark.asyncio
    async def test_select_provider_and_model_explicit(self, provider_service):
        """Test explicit provider and model selection."""
        # Test explicit provider:model format
        provider, model = await provider_service._select_provider_and_model(
            messages=[{"role": "user", "content": "Hello"}],
            model="openai:gpt-4"
        )
        assert provider == Provider.OPENAI
        assert model == "gpt-4"

        # Test explicit provider with default model
        provider, model = await provider_service._select_provider_and_model(
            messages=[{"role": "user", "content": "Hello"}],
            provider="ollama"
        )
        assert provider == Provider.OLLAMA
        assert model == provider_service.default_ollama_model

    @pytest.mark.asyncio
    async def test_auto_routing_complex_content(self, provider_service):
        """Test auto-routing with complex content."""
        # Mock complexity assessment to return high complexity
        provider_service._assess_complexity = AsyncMock(return_value=0.8)
        provider_service.model_selection_criteria.complexity_threshold = 0.7

        provider = await provider_service._auto_route(
            messages=[{"role": "user", "content": "Complex technical question"}]
        )

        assert provider == Provider.OPENAI
        provider_service._assess_complexity.assert_called_once()

    @pytest.mark.asyncio
    async def test_auto_routing_privacy_sensitive(self, provider_service):
        """Test auto-routing with privacy sensitive content."""
        provider_service.model_selection_criteria.privacy_sensitive_tokens = ["password", "secret"]

        provider = await provider_service._auto_route(
            messages=[{"role": "user", "content": "What is my password?"}]
        )

        assert provider == Provider.OLLAMA

    @pytest.mark.asyncio
    async def test_auto_routing_with_tools(self, provider_service):
        """Test auto-routing with tool requirements."""
        provider = await provider_service._auto_route(
            messages=[{"role": "user", "content": "Simple question"}],
            tools=[{"type": "function", "function": {"name": "get_weather"}}]
        )

        assert provider == Provider.OPENAI

    @pytest.mark.asyncio
    async def test_generate_completion_openai(self, provider_service):
        """Test generating completion with OpenAI."""
        # Setup mock response
        mock_response = MagicMock()
        mock_response.model_dump.return_value = {
            "id": "test-id",
            "object": "chat.completion",
            "model": "gpt-4",
            "usage": {"total_tokens": 10},
            "message": {"content": "Test response"}
        }
        provider_service.openai_client.chat.completions.create = AsyncMock(return_value=mock_response)

        response = await provider_service._generate_openai_completion(
            messages=[{"role": "user", "content": "Hello"}],
            model="gpt-4"
        )

        assert response["message"]["content"] == "Test response"
        provider_service.openai_client.chat.completions.create.assert_called_once()

    @pytest.mark.asyncio
    async def test_generate_completion_ollama(self, provider_service):
        """Test generating completion with Ollama."""
        provider_service.ollama_service.generate_completion.return_value = {
            "id": "ollama-test",
            "model": "llama2",
            "provider": "ollama",
            "message": {"content": "Ollama response"}
        }

        response = await provider_service._generate_ollama_completion(
            messages=[{"role": "user", "content": "Hello"}],
            model="llama2"
        )

        assert response["message"]["content"] == "Ollama response"
        provider_service.ollama_service.generate_completion.assert_called_once()

    @pytest.mark.asyncio
    async def test_fallback_mechanism(self, provider_service):
        """Test fallback mechanism when primary provider fails."""
        # Mock the primary provider (OpenAI) to fail
        provider_service._generate_openai_completion = AsyncMock(side_effect=Exception("API error"))

        # Mock the fallback provider (Ollama) to succeed
        provider_service._generate_ollama_completion = AsyncMock(return_value={
            "id": "ollama-fallback",
            "provider": "ollama",
            "message": {"content": "Fallback response"}
        })

        # Test the generate_completion method with auto provider
        response = await provider_service.generate_completion(
            messages=[{"role": "user", "content": "Hello"}],
            provider="auto"
        )

        # Check that fallback was used
        assert response["provider"] == "ollama"
        assert response["message"]["content"] == "Fallback response"
        provider_service._generate_openai_completion.assert_called_once()
        provider_service._generate_ollama_completion.assert_called_once()
```
Model Selection Logic Testing
```python
# tests/unit/test_model_selection.py
import pytest
from unittest.mock import AsyncMock, patch
import json

from app.models.model_catalog import recommend_ollama_model, OLLAMA_MODELS
from app.agents.adaptive_agent import AdaptiveAgent

class TestModelSelection:
    @pytest.mark.parametrize("use_case,performance_tier,expected_model", [
        ("code_generation", "high", "codellama:34b"),
        ("creative_writing", "medium", "dolphin-mistral"),
        ("mathematical_reasoning", "low", "orca-mini"),
        ("conversational", "high", "neural-chat"),
        ("knowledge_intensive", "high", "llama2:70b"),
        ("resource_constrained", "low", "phi"),
    ])
    def test_model_recommendations(self, use_case, performance_tier, expected_model):
        """Test model recommendation logic for different use cases."""
        model = recommend_ollama_model(use_case, performance_tier)
        assert model == expected_model

    @pytest.mark.asyncio
    async def test_adaptive_agent_use_case_detection(self):
        """Test adaptive agent's use case detection logic."""
        provider_service = AsyncMock()
        agent = AdaptiveAgent(
            provider_service=provider_service,
            system_prompt="You are a helpful assistant."
        )

        # Test code-related message
        code_use_case = await agent._determine_use_case(
            "Can you help me write a Python function to calculate Fibonacci numbers?"
        )
        assert code_use_case == "code_generation"

        # Test creative writing message
        creative_use_case = await agent._determine_use_case(
            "Write a short story about a robot discovering emotions."
        )
        assert creative_use_case == "creative_writing"

        # Test mathematical reasoning message
        math_use_case = await agent._determine_use_case(
            "Solve this equation: 3x² + 2x - 5 = 0"
        )
        assert math_use_case == "mathematical_reasoning"

    @pytest.mark.asyncio
    async def test_complexity_assessment(self):
        """Test complexity assessment logic."""
        provider_service = AsyncMock()
        agent = AdaptiveAgent(
            provider_service=provider_service,
            system_prompt="You are a helpful assistant."
        )

        # Simple message
        simple_message = "What time is it?"
        is_complex_simple = await agent._is_complex_request(simple_message)
        assert not is_complex_simple

        # Complex message
        complex_message = "Can you provide a detailed analysis of the socioeconomic factors that contributed to the Industrial Revolution in England, and compare those with the conditions in contemporary developing economies?"
        is_complex_detailed = await agent._is_complex_request(complex_message)
        assert is_complex_detailed

        # Multiple questions
        multi_question = "What is quantum computing? How does it differ from classical computing? What are its potential applications?"
        is_complex_multi = await agent._is_complex_request(multi_question)
        assert is_complex_multi
```
Ollama Service Testing
```python
# tests/unit/test_ollama_service.py
import pytest
import json
import asyncio
from unittest.mock import AsyncMock, patch, MagicMock

from app.services.ollama_service import OllamaService

class TestOllamaService:
    @pytest.fixture
    def ollama_service(self):
        """Create an Ollama service with mocked session for testing."""
        service = OllamaService()
        service.session = AsyncMock()
        return service

    @pytest.mark.asyncio
    async def test_list_models(self, ollama_service):
        """Test listing available models."""
        mock_response = AsyncMock()
        mock_response.status = 200
        mock_response.json = AsyncMock(return_value={"models": [
            {"name": "llama2"},
            {"name": "mistral"}
        ]})

        # Mock the context manager. session.get() is called synchronously and
        # its result is entered with `async with`, so it must be a MagicMock;
        # an AsyncMock would return a coroutine that cannot be entered.
        ollama_service.session.get = MagicMock()
        ollama_service.session.get.return_value.__aenter__.return_value = mock_response

        models = await ollama_service.list_models()

        assert len(models) == 2
        assert models[0]["name"] == "llama2"
        assert models[1]["name"] == "mistral"

    @pytest.mark.asyncio
    async def test_generate_completion(self, ollama_service):
        """Test generating a completion."""
        # Mock the response
        mock_response = AsyncMock()
        mock_response.status = 200
        mock_response.json = AsyncMock(return_value={
            "id": "test-id",
            "response": "This is a test response",
            "created_at": 1677858242
        })

        # Mock the context manager (MagicMock for the same reason as above)
        ollama_service.session.post = MagicMock()
        ollama_service.session.post.return_value.__aenter__.return_value = mock_response

        # Test the completion generation
        response = await ollama_service._generate_completion_sync({
            "model": "llama2",
            "prompt": "Hello, world!",
            "stream": False,
            "options": {"temperature": 0.7}
        })

        # Check the formatted response
        assert "message" in response
        assert response["message"]["content"] == "This is a test response"
        assert response["provider"] == "ollama"

    @pytest.mark.asyncio
    async def test_format_messages_for_ollama(self, ollama_service):
        """Test formatting messages for Ollama."""
        messages = [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Hello!"},
            {"role": "assistant", "content": "Hi there!"},
            {"role": "user", "content": "How are you?"}
        ]

        formatted = ollama_service._format_messages_for_ollama(messages)

        assert "[System]" in formatted
        assert "[User]" in formatted
        assert "[Assistant]" in formatted
        assert "You are a helpful assistant." in formatted
        assert "Hello!" in formatted
        assert "How are you?" in formatted

    @pytest.mark.asyncio
    async def test_tool_call_extraction(self, ollama_service):
        """Test extracting tool calls from response text."""
        # Response with a tool call
        response_with_tool = """
        I'll help you get the weather information.

        <tool>
        {
            "name": "get_weather",
            "parameters": {
                "location": "New York",
                "unit": "celsius"
            }
        }
        </tool>

        Let me check the weather for you.
        """

        tool_calls = ollama_service._extract_tool_calls(response_with_tool)

        assert tool_calls is not None
        assert len(tool_calls) == 1
        assert tool_calls[0]["function"]["name"] == "get_weather"
        assert "New York" in tool_calls[0]["function"]["arguments"]

        # Response without a tool call
        response_without_tool = "The weather in New York is sunny."
        assert ollama_service._extract_tool_calls(response_without_tool) is None

    @pytest.mark.asyncio
    async def test_clean_tool_calls_from_text(self, ollama_service):
        """Test cleaning tool calls from response text."""
        response_with_tool = """
        I'll help you get the weather information.

        <tool>
        {
            "name": "get_weather",
            "parameters": {
                "location": "New York",
                "unit": "celsius"
            }
        }
        </tool>

        Let me check the weather for you.
        """

        cleaned = ollama_service._clean_tool_calls_from_text(response_with_tool)

        assert "<tool>" not in cleaned
        assert "get_weather" not in cleaned
        assert "I'll help you get the weather information." in cleaned
        assert "Let me check the weather for you." in cleaned
```
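When the code under test enters an HTTP call with `async with session.get(...)`, the mock for `get` has to be a plain callable that returns an object supporting `__aenter__` and `__aexit__`. The following is a minimal, standalone sketch of that mocking pattern using only the standard library. The helper `fetch_model_names` and the URL are hypothetical stand-ins for illustration; they are not the article's `OllamaService`.

```python
import asyncio
from unittest.mock import AsyncMock, MagicMock


async def fetch_model_names(session, url):
    """Hypothetical client helper: GET a URL and return the model names."""
    async with session.get(url) as response:
        payload = await response.json()
    return [model["name"] for model in payload["models"]]


def make_session(payload):
    """Build a mock session whose get() can be used with `async with`."""
    response = MagicMock()
    response.status = 200
    response.json = AsyncMock(return_value=payload)  # json() is awaited

    session = MagicMock()
    # get() itself is a normal call, so it stays a MagicMock. MagicMock
    # supports __aenter__/__aexit__ out of the box (Python 3.8+).
    session.get.return_value.__aenter__.return_value = response
    return session


session = make_session({"models": [{"name": "llama2"}, {"name": "mistral"}]})
names = asyncio.run(fetch_model_names(session, "http://localhost:11434/api/tags"))
print(names)
```

The same shape applies to `session.post`. Only the methods that are actually awaited (`json()` here) need to be `AsyncMock` instances.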
Tool Integration Testing
```python
# tests/unit/test_tool_integration.py
import pytest
from unittest.mock import AsyncMock, patch
import json

from app.agents.task_agent import TaskManagementAgent
from app.models.message import Message, MessageRole

class TestToolIntegration:
    @pytest.fixture
    def task_agent(self):
        """Create a task agent with mocked services."""
        provider_service = AsyncMock()
        task_service = AsyncMock()

        agent = TaskManagementAgent(
            provider_service=provider_service,
            task_service=task_service,
            system_prompt="You are a task management agent."
        )

        return agent

    @pytest.mark.asyncio
    async def test_process_tool_calls_list_tasks(self, task_agent):
        """Test processing the list_tasks tool call."""
        # Mock task service response
        task_agent.task_service.list_tasks.return_value = [
            {
                "id": "task1",
                "title": "Complete report",
                "status": "pending",
                "priority": "high",
                "due_date": "2023-04-15",
                "description": "Finish quarterly report"
            }
        ]

        # Create a tool call for list_tasks
        tool_calls = [{
            "id": "call_123",
            "function": {
                "name": "list_tasks",
                "arguments": json.dumps({
                    "status": "pending",
                    "limit": 5
                })
            }
        }]

        # Process the tool calls
        tool_responses = await task_agent._process_tool_calls(tool_calls, "user123")

        # Verify the response
        assert len(tool_responses) == 1
        assert tool_responses[0]["tool_call_id"] == "call_123"
        assert "Complete report" in tool_responses[0]["content"]
        assert "pending" in tool_responses[0]["content"]

        # Verify service was called correctly
        task_agent.task_service.list_tasks.assert_called_once_with(
            user_id="user123",
            status="pending",
            limit=5
        )

    @pytest.mark.asyncio
    async def test_process_tool_calls_create_task(self, task_agent):
        """Test processing the create_task tool call."""
        # Mock task service response
        task_agent.task_service.create_task.return_value = {
            "id": "new_task",
            "title": "New test task"
        }

        # Create a tool call for create_task
        tool_calls = [{
            "id": "call_456",
            "function": {
                "name": "create_task",
                "arguments": json.dumps({
                    "title": "New test task",
                    "description": "This is a test task",
                    "priority": "medium"
                })
            }
        }]

        # Process the tool calls
        tool_responses = await task_agent._process_tool_calls(tool_calls, "user123")

        # Verify the response
        assert len(tool_responses) == 1
        assert tool_responses[0]["tool_call_id"] == "call_456"
        assert "Task created successfully" in tool_responses[0]["content"]
        assert "New test task" in tool_responses[0]["content"]

        # Verify service was called correctly
        task_agent.task_service.create_task.assert_called_once_with(
            user_id="user123",
            title="New test task",
            description="This is a test task",
            due_date=None,
            priority="medium"
        )

    @pytest.mark.asyncio
    async def test_generate_response_with_tools(self, task_agent):
        """Test the full generate_response flow with tool usage."""
        # Set up the conversation history
        task_agent.state.conversation_history = [
            Message(role=MessageRole.SYSTEM, content="You are a task management agent."),
            Message(role=MessageRole.USER, content="List my pending tasks")
        ]

        # Mock provider service to return a response with tool calls first
        mock_response_with_tools = {
            "message": {
                "content": "I'll list your tasks",
                "tool_calls": [{
                    "id": "call_123",
                    "function": {
                        "name": "list_tasks",
                        "arguments": json.dumps({
                            "status": "pending",
                            "limit": 10
                        })
                    }
                }]
            },
            "tool_calls": [{
                "id": "call_123",
                "function": {
                    "name": "list_tasks",
                    "arguments": json.dumps({
                        "status": "pending",
                        "limit": 10
                    })
                }
            }]
        }

        # Mock task service
        task_agent.task_service.list_tasks.return_value = [
            {
                "id": "task1",
                "title": "Complete report",
                "status": "pending",
                "priority": "high",
                "due_date": "2023-04-15",
                "description": "Finish quarterly report"
            }
        ]

        # Mock final response after tool processing
        mock_final_response = {
            "message": {
                "content": "You have 1 pending task: Complete report (high priority, due Apr 15)"
            }
        }

        # Set up the mocked provider service
        task_agent.provider_service.generate_completion = AsyncMock()
        task_agent.provider_service.generate_completion.side_effect = [
            mock_response_with_tools,  # First call returns tool calls
            mock_final_response  # Second call returns final response
        ]

        # Generate the response
        response = await task_agent._generate_response("user123")

        # Verify the final response
        assert response == "You have 1 pending task: Complete report (high priority, due Apr 15)"

        # Verify the provider service was called twice
        assert task_agent.provider_service.generate_completion.call_count == 2

        # Verify the task service was called
        task_agent.task_service.list_tasks.assert_called_once()

        # Verify tool response was added to conversation history
        tool_messages = [msg for msg in task_agent.state.conversation_history if msg.role == MessageRole.TOOL]
        assert len(tool_messages) == 1
```
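The tool-usage test relies on giving an `AsyncMock` a list as its `side_effect`, so that successive awaits return successive canned replies (first a tool request, then the final answer). The sketch below isolates that technique with a hypothetical `run_tool_loop` function. It is an illustrative stand-in, not the article's `TaskManagementAgent`, and the response dictionaries only mimic the shape used in the tests.

```python
import asyncio
import json
from unittest.mock import AsyncMock


async def run_tool_loop(provider, tools, messages):
    """Hypothetical agent loop: query the model until it stops requesting tools."""
    while True:
        reply = await provider.generate_completion(messages=messages)
        tool_calls = reply.get("tool_calls")
        if not tool_calls:
            return reply["message"]["content"]
        for call in tool_calls:
            function = call["function"]
            handler = tools[function["name"]]
            result = await handler(**json.loads(function["arguments"]))
            # Feed the tool result back so the next model call can use it
            messages.append({
                "role": "tool",
                "tool_call_id": call["id"],
                "content": json.dumps(result),
            })


provider = AsyncMock()
# Each await of generate_completion returns the next item in this list
provider.generate_completion.side_effect = [
    {
        "message": {"content": ""},
        "tool_calls": [{
            "id": "call_123",
            "function": {
                "name": "list_tasks",
                "arguments": json.dumps({"status": "pending"}),
            },
        }],
    },
    {"message": {"content": "You have 1 pending task."}},
]
list_tasks = AsyncMock(return_value=[{"id": "task1", "title": "Complete report"}])

messages = [{"role": "user", "content": "List my pending tasks"}]
answer = asyncio.run(run_tool_loop(provider, {"list_tasks": list_tasks}, messages))
print(answer)
```

If the loop awaits the provider more times than there are items in the list, the mock raises, which makes runaway tool loops fail loudly in tests.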
2. Integration Testing Framework
API Endpoint Testing
```python
# tests/integration/test_api_endpoints.py
import pytest
from fastapi.testclient import TestClient
import json
import os
from unittest.mock import patch, AsyncMock

from app.main import app
from app.services.provider_service import ProviderService

client = TestClient(app)

class TestAPIEndpoints:
    @pytest.fixture(autouse=True)
    def setup_mocks(self):
        """Set up mocks for services."""
        # Patch the provider service
        with patch('app.controllers.agent_controller.get_agent_factory') as mock_factory:
            mock_provider = AsyncMock(spec=ProviderService)
            mock_factory.return_value.provider_service = mock_provider
            yield

    def test_health_endpoint(self):
        """Test the health check endpoint."""
        response = client.get("/api/health")
        assert response.status_code == 200
        assert response.json()["status"] == "ok"

    def test_chat_endpoint_auth_required(self):
        """Test that chat endpoint requires authentication."""
        response = client.post(
            "/api/v1/chat",
            json={"message": "Hello"}
        )
        assert response.status_code == 401  # Unauthorized

    def test_chat_endpoint_with_auth(self):
        """Test the chat endpoint with proper authentication."""
        # Mock the authentication
        with patch('app.services.auth_service.get_current_user') as mock_auth:
            mock_auth.return_value = {"id": "test_user"}

            # Mock the agent's process_message
            with patch('app.agents.base_agent.BaseAgent.process_message') as mock_process:
                mock_process.return_value = "Hello, I'm an AI assistant."

                response = client.post(
                    "/api/v1/chat",
                    json={"message": "Hi there"},
                    headers={"Authorization": "Bearer test_token"}
                )

                assert response.status_code == 200
                assert "response" in response.json()
                assert response.json()["response"] == "Hello, I'm an AI assistant."

    def test_model_recommendation_endpoint(self):
        """Test the model recommendation endpoint."""
        # Mock the authentication
        with patch('app.services.auth_service.get_current_user') as mock_auth:
            mock_auth.return_value = {"id": "test_user"}

            response = client.get(
                "/api/v1/agents/models/recommend?use_case=code_generation&performance_tier=high",
                headers={"Authorization": "Bearer test_token"}
            )

            assert response.status_code == 200
            data = response.json()
            assert "ollama_recommendation" in data
            assert data["use_case"] == "code_generation"
            assert data["performance_tier"] == "high"

    def test_streaming_endpoint(self):
        """Test the streaming endpoint."""
        # Mock the authentication
        with patch('app.services.auth_service.get_current_user') as mock_auth:
            mock_auth.return_value = {"id": "test_user"}

            # Mock the streaming generator
            async def mock_stream_generator():
                yield {"id": "1", "content": "Hello"}
                yield {"id": "2", "content": " World"}

            # Mock the stream method
            with patch('app.services.provider_service.ProviderService.stream_completion') as mock_stream:
                mock_stream.return_value = mock_stream_generator()

                response = client.post(
                    "/api/v1/chat/streaming",
                    json={"message": "Hi", "stream": True},
                    headers={"Authorization": "Bearer test_token"}
                )

                assert response.status_code == 200
                # The header usually carries a charset suffix
                # ("text/event-stream; charset=utf-8"), so match the prefix only
                assert response.headers["content-type"].startswith("text/event-stream")

                # Parse the streaming response
                content = response.content.decode()
                assert "data:" in content
                assert "Hello" in content
                assert "World" in content
```
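The streaming test only checks for substrings in the raw body. A stricter check can parse the Server-Sent Events framing and compare the reconstructed text. The helper below is a minimal sketch; `parse_sse_events` is a hypothetical name, and it assumes each event's `data:` field carries a JSON object with a `content` key, which the article does not confirm for its endpoint.

```python
import json


def parse_sse_events(body):
    """Parse an SSE body into a list of decoded JSON payloads.

    Events are separated by a blank line; an event's `data:` lines are
    joined with newlines before decoding.
    """
    events = []
    for block in body.replace("\r\n", "\n").split("\n\n"):
        data_lines = []
        for line in block.split("\n"):
            if line.startswith("data:"):
                value = line[len("data:"):]
                # A single leading space after the colon is not part of the value
                data_lines.append(value[1:] if value.startswith(" ") else value)
        if data_lines:
            events.append(json.loads("\n".join(data_lines)))
    return events


body = (
    'data: {"id": "1", "content": "Hello"}\n\n'
    'data: {"id": "2", "content": " World"}\n\n'
)
events = parse_sse_events(body)
text = "".join(event["content"] for event in events)
print(text)
```

In the endpoint test, `body` would be `response.content.decode()`, and the assertion would compare `text` against the full expected string rather than checking for fragments.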
End-to-End Agent Flow Testing
```python
# tests/integration/test_agent_flows.py
import pytest
import pytest_asyncio
import asyncio
from unittest.mock import AsyncMock, patch
import json

from app.agents.meta_agent import MetaAgent, AgentSubsystem
from app.agents.research_agent import ResearchAgent
from app.agents.conversation_manager import ConversationManager
from app.models.message import Message, MessageRole

class TestAgentFlows:
    # Async fixtures need pytest_asyncio.fixture; with a plain @pytest.fixture,
    # pytest-asyncio's strict mode hands the test an un-awaited coroutine.
    @pytest_asyncio.fixture
    async def meta_agent_setup(self):
        """Set up a meta agent with subsystems for testing."""
        # Create mocked services
        provider_service = AsyncMock()
        knowledge_service = AsyncMock()
        memory_service = AsyncMock()

        # Create subsystem agents
        research_agent = ResearchAgent(
            provider_service=provider_service,
            knowledge_service=knowledge_service,
            system_prompt="You are a research agent."
        )

        conversation_agent = ConversationManager(
            provider_service=provider_service,
            system_prompt="You are a conversation management agent."
        )

        # Create meta agent
        meta_agent = MetaAgent(
            provider_service=provider_service,
            system_prompt="You are a meta agent that coordinates specialized agents."
        )

        # Add subsystems
        meta_agent.add_subsystem(AgentSubsystem(
            name="research",
            agent=research_agent,
            role="Knowledge retrieval specialist"
        ))

        meta_agent.add_subsystem(AgentSubsystem(
            name="conversation",
            agent=conversation_agent,
            role="Conversation flow manager"
        ))

        # Return the setup
        return {
            "meta_agent": meta_agent,
            "provider_service": provider_service,
            "knowledge_service": knowledge_service,
            "research_agent": research_agent,
            "conversation_agent": conversation_agent
        }

    @pytest.mark.asyncio
    async def test_meta_agent_routing(self, meta_agent_setup):
        """Test the meta agent's routing logic."""
        meta_agent = meta_agent_setup["meta_agent"]
        provider_service = meta_agent_setup["provider_service"]

        # Setup conversation history
        meta_agent.state.conversation_history = [
            Message(role=MessageRole.SYSTEM, content="You are a meta agent."),
            Message(role=MessageRole.USER, content="Tell me about quantum computing")
        ]

        # Mock the routing response to use research subsystem
        routing_response = {
            "message": {
                "content": "I'll route this to the research subsystem"
            },
            "tool_calls": [{
                "id": "call_123",
                "function": {
                    "name": "route_to_subsystem",
                    "arguments": json.dumps({
                        "subsystem": "research",
                        "task": "Tell me about quantum computing",
                        "context": {}
                    })
                }
            }]
        }

        # Mock the research agent's response
        research_response = "Quantum computing is a type of computing that uses quantum-mechanical phenomena, such as superposition and entanglement, to perform operations on data."
        meta_agent_setup["research_agent"].process_message = AsyncMock(return_value=research_response)

        # Mock the provider service responses
        provider_service.generate_completion.side_effect = [
            routing_response,  # First call for routing decision
        ]

        # Generate response
        response = await meta_agent._generate_response("user123")

        # Verify routing happened correctly
        assert "[research" in response
        assert "Quantum computing" in response

        # Verify the research agent was called
        meta_agent_setup["research_agent"].process_message.assert_called_once_with(
            "Tell me about quantum computing", "user123"
        )

    @pytest.mark.asyncio
    async def test_meta_agent_parallel_processing(self, meta_agent_setup):
        """Test the meta agent's parallel processing logic."""
        meta_agent = meta_agent_setup["meta_agent"]
        provider_service = meta_agent_setup["provider_service"]

        # Setup conversation history
        meta_agent.state.conversation_history = [
            Message(role=MessageRole.SYSTEM, content="You are a meta agent."),
            Message(role=MessageRole.USER, content="Explain the impacts of AI on society")
        ]

        # Mock the routing response to use parallel processing
        routing_response = {
            "message": {
                "content": "I'll process this with multiple subsystems"
            },
            "tool_calls": [{
                "id": "call_456",
                "function": {
                    "name": "parallel_processing",
                    "arguments": json.dumps({
                        "task": "Explain the impacts of AI on society",
                        "subsystems": ["research", "conversation"]
                    })
                }
            }]
        }

        # Mock each agent's response
        research_response = "From a research perspective, AI impacts society through automation, economic transformation, and ethical considerations."
        conversation_response = "From a conversational perspective, AI is changing how we interact with technology and each other."

        meta_agent_setup["research_agent"].process_message = AsyncMock(return_value=research_response)
        meta_agent_setup["conversation_agent"].process_message = AsyncMock(return_value=conversation_response)

        # Mock synthesis response
        synthesis_response = {
            "message": {
                "content": "AI has multifaceted impacts on society. From a research perspective, it drives automation and economic transformation. From a conversational perspective, it changes human-technology interaction patterns."
            }
        }

        # Mock the provider service responses
        provider_service.generate_completion.side_effect = [
            routing_response,  # First call for routing decision
            synthesis_response  # Second call for synthesis
        ]

        # Generate response
        response = await meta_agent._generate_response("user123")

        # Verify synthesis happened correctly
        assert "multifaceted impacts" in response
        assert provider_service.generate_completion.call_count == 2

        # Verify both agents were called
        meta_agent_setup["research_agent"].process_message.assert_called_once()
        meta_agent_setup["conversation_agent"].process_message.assert_called_once()

    @pytest.mark.asyncio
    async def test_research_agent_knowledge_retrieval(self, meta_agent_setup):
        """Test the research agent's knowledge retrieval capabilities."""
        research_agent = meta_agent_setup["research_agent"]
        provider_service = meta_agent_setup["provider_service"]
        knowledge_service = meta_agent_setup["knowledge_service"]

        # Setup conversation history
        research_agent.state.conversation_history = [
            Message(role=MessageRole.SYSTEM, content="You are a research agent."),
            Message(role=MessageRole.USER, content="What are the latest developments in fusion energy?")
        ]

        # Mock knowledge retrieval results
        knowledge_service.search.return_value = [
            {
                "id": "doc1",
                "title": "Recent Fusion Breakthrough",
                "content": "Scientists achieved net energy gain in fusion reaction at NIF in December 2022.",
                "relevance_score": 0.95
            },
            {
                "id": "doc2",
                "title": "Commercial Fusion Startups",
                "content": "Several startups including Commonwealth Fusion Systems are working on commercial fusion reactors.",
                "relevance_score": 0.89
            }
        ]

        # Mock initial response with tool calls
        tool_call_response = {
            "message": {
                "content": "Let me search for information on fusion energy."
            },
            "tool_calls": [{
                "id": "call_789",
                "function": {
                    "name": "search_knowledge_base",
                    "arguments": json.dumps({
                        "query": "latest developments fusion energy",
                        "max_results": 3
                    })
                }
            }]
        }

        # Mock final response with knowledge incorporated
        final_response = {
            "message": {
                "content": "Recent developments in fusion energy include a breakthrough at NIF in December 2022 achieving net energy gain, and advances from startups like Commonwealth Fusion Systems working on commercial reactors."
            }
        }

        # Mock the provider service responses
        provider_service.generate_completion.side_effect = [
            tool_call_response,  # First call with tool request
            final_response  # Second call with knowledge incorporated
        ]

        # Generate response
        response = await research_agent._generate_response("user123")

        # Verify response includes knowledge
        assert "NIF" in response
        assert "Commonwealth Fusion Systems" in response

        # Verify knowledge service was called
        knowledge_service.search.assert_called_once_with(
            query="latest developments fusion energy",
            max_results=3
        )
```
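The parallel-processing test expects each subsystem to be invoked exactly once before a synthesis step. One common way to implement that fan-out is `asyncio.gather`. The sketch below is an assumption about how such a step could look; `fan_out` is a hypothetical helper and is not taken from the article's `MetaAgent`.

```python
import asyncio
from unittest.mock import AsyncMock


async def fan_out(task, user_id, agents):
    """Hypothetical helper: run one task on every agent concurrently.

    Returns a mapping of agent name to that agent's reply.
    """
    names = list(agents)
    replies = await asyncio.gather(
        *(agents[name].process_message(task, user_id) for name in names)
    )
    return dict(zip(names, replies))


# Stand-in agents; only process_message is exercised
agents = {"research": AsyncMock(), "conversation": AsyncMock()}
agents["research"].process_message.return_value = "research view"
agents["conversation"].process_message.return_value = "conversation view"

results = asyncio.run(
    fan_out("Explain the impacts of AI on society", "user123", agents)
)
print(results)
```

Because `asyncio.gather` preserves argument order, zipping the names with the replies keeps each reply attached to the agent that produced it, regardless of which finishes first.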
Cross-Provider Integration Testing
```python
# tests/integration/test_cross_provider.py
import pytest
import pytest_asyncio
import os
from unittest.mock import patch, AsyncMock
import json

from app.services.provider_service import ProviderService, Provider
from app.services.ollama_service import OllamaService

class TestCrossProviderIntegration:
    # Async fixtures need pytest_asyncio.fixture to work in strict mode
    @pytest_asyncio.fixture
    async def real_services(self):
        """Set up real services for integration testing."""
        # Skip tests if API keys aren't available in the environment
        if not os.environ.get("OPENAI_API_KEY"):
            pytest.skip("OPENAI_API_KEY environment variable not set")

        # Initialize real services
        ollama_service = OllamaService()
        provider_service = ProviderService()

        # Initialize the services
        try:
            await ollama_service.initialize()
            await provider_service.initialize()
        except Exception as e:
            pytest.skip(f"Failed to initialize services: {str(e)}")

        yield {
            "ollama_service": ollama_service,
            "provider_service": provider_service
        }

        # Cleanup
        await ollama_service.cleanup()
        await provider_service.cleanup()

    @pytest.mark.asyncio
    async def test_provider_selection_complex_query(self, real_services):
        """Test that complex queries route to OpenAI."""
        provider_service = real_services["provider_service"]

        # Adjust complexity threshold to ensure predictable routing
        provider_service.model_selection_criteria.complexity_threshold = 0.5

        # Complex query that should route to OpenAI
        complex_messages = [
            {"role": "user", "content": "Provide a detailed analysis of the philosophical implications of artificial general intelligence, considering perspectives from epistemology, ethics, and metaphysics."}
        ]

        # Select provider
        provider, model = await provider_service._select_provider_and_model(
            messages=complex_messages,
            provider="auto"
        )

        # Verify routing decision
        assert provider == Provider.OPENAI

    @pytest.mark.asyncio
    async def test_provider_selection_simple_query(self, real_services):
        """Test that simple queries route to Ollama."""
        provider_service = real_services["provider_service"]

        # Adjust complexity threshold to ensure predictable routing
        provider_service.model_selection_criteria.complexity_threshold = 0.5

        # Simple query that should route to Ollama
        simple_messages = [
            {"role": "user", "content": "What's the weather like today?"}
        ]

        # Select provider
        provider, model = await provider_service._select_provider_and_model(
            messages=simple_messages,
            provider="auto"
        )

        # Verify routing decision
        assert provider == Provider.OLLAMA

    @pytest.mark.asyncio
    async def test_fallback_mechanism_real(self, real_services):
        """Test the fallback mechanism with real services."""
        provider_service = real_services["provider_service"]

        # Intentionally cause OpenAI to fail by using an invalid model
        messages = [
            {"role": "user", "content": "Simple test message"}
        ]

        try:
            # This should fail with OpenAI but succeed with Ollama fallback
            response = await provider_service.generate_completion(
                messages=messages,
                model="openai:non-existent-model",  # Invalid model
                provider="auto"  # Enable auto-fallback
            )
        except Exception as e:
            pytest.fail(f"Fallback mechanism failed: {str(e)}")

        # If we get here, fallback worked. The assertions sit outside the
        # try block so a failed assertion is not reported as a fallback error.
        assert response["provider"] == "ollama"
        assert "content" in response["message"]

    @pytest.mark.asyncio
    async def test_ollama_response_format(self, real_services):
        """Test that Ollama responses are properly formatted to match OpenAI's structure."""
        ollama_service = real_services["ollama_service"]

        # Generate a basic response
        messages = [
            {"role": "user", "content": "What is 2+2?"}
        ]

        response = await ollama_service.generate_completion(
            messages=messages,
            model="llama2"  # Specify a model that should exist
        )

        # Verify response structure matches expected format
        assert "id" in response
        assert "object" in response
        assert "model" in response
        assert "usage" in response
        assert "message" in response
        assert "content" in response["message"]
        assert response["provider"] == "ollama"
```
3. Performance Testing Framework
Response Latency Benchmarking
```python
# tests/performance/test_latency.py
import pytest
import pytest_asyncio
import time
import asyncio
import statistics
from typing import List, Dict, Any
import pandas as pd
import matplotlib.pyplot as plt
import os

from app.services.provider_service import ProviderService, Provider
from app.services.ollama_service import OllamaService

# Skip tests if it's CI environment
SKIP_PERFORMANCE_TESTS = os.environ.get("CI") == "true"

@pytest.mark.skipif(SKIP_PERFORMANCE_TESTS, reason="Performance tests skipped in CI environment")
class TestResponseLatency:
    # Async fixtures need pytest_asyncio.fixture to work in strict mode
    @pytest_asyncio.fixture
    async def services(self):
        """Set up services for latency testing."""
        if not os.environ.get("OPENAI_API_KEY"):
            pytest.skip("OPENAI_API_KEY environment variable not set")

        # Initialize services
        ollama_service = OllamaService()
        provider_service = ProviderService()

        try:
            await ollama_service.initialize()
            await provider_service.initialize()
        except Exception as e:
            pytest.skip(f"Failed to initialize services: {str(e)}")

        yield {
            "ollama_service": ollama_service,
            "provider_service": provider_service
        }

        # Cleanup
        await ollama_service.cleanup()
        await provider_service.cleanup()

    async def measure_latency(self, provider_service, provider, model, messages):
        """Measure response latency for a given provider and model."""
        # perf_counter is monotonic and better suited to timing than time.time
        start_time = time.perf_counter()

        if provider == "openai":
            await provider_service._generate_openai_completion(
                messages=messages,
                model=model
            )
        else:  # ollama
            await provider_service._generate_ollama_completion(
                messages=messages,
                model=model
            )

        end_time = time.perf_counter()
        return end_time - start_time

    @pytest.mark.asyncio
    async def test_latency_comparison(self, services):
        """Compare latency between OpenAI and Ollama for different query types."""
        provider_service = services["provider_service"]

        # Test messages of different complexity
        test_messages = [
            {
                "name": "simple_factual",
                "messages": [{"role": "user", "content": "What is the capital of France?"}]
            },
            {
                "name": "medium_explanation",
                "messages": [{"role": "user", "content": "Explain how photosynthesis works in plants."}]
            },
            {
                "name": "complex_analysis",
                "messages": [{"role": "user", "content": "Analyze the economic factors that contributed to the 2008 financial crisis and their long-term impacts."}]
            }
        ]

        # Models to test
        models = {
            "openai": ["gpt-3.5-turbo", "gpt-4"],
            "ollama": ["llama2", "mistral"]
        }

        # Number of repetitions for each test
        repetitions = 3

        # Collect results
        results = []

        for message_type in test_messages:
            for provider in models:
                for model in models[provider]:
                    for i in range(repetitions):
                        try:
                            latency = await self.measure_latency(
                                provider_service,
                                provider,
                                model,
                                message_type["messages"]
                            )

                            results.append({
                                "provider": provider,
                                "model": model,
                                "message_type": message_type["name"],
                                "repetition": i,
                                "latency": latency
                            })

                            # Add a small delay to avoid rate limits
                            await asyncio.sleep(1)
                        except Exception as e:
                            print(f"Error testing {provider}:{model} - {str(e)}")

        # Fail early if every request errored; the analysis below
        # would otherwise raise on an empty DataFrame
        assert len(results) > 0, "No benchmark results collected"

        # Analyze results
        df = pd.DataFrame(results)

        # Calculate average latency by provider, model, and message type
        avg_latency = df.groupby(['provider', 'model', 'message_type'])['latency'].mean().reset_index()

        # Generate summary statistics
        summary = avg_latency.pivot_table(
            index=['provider', 'model'],
            columns='message_type',
            values='latency'
        ).reset_index()

        # Print summary
        print("\nLatency Benchmark Results (seconds):")
        print(summary)

        # Create visualization: one subplot per message type
        plt.figure(figsize=(12, 8))

        for position, message_type in enumerate(test_messages, start=1):
            subset = avg_latency[avg_latency['message_type'] == message_type['name']]
            x = range(len(subset))
            labels = [f"{row['provider']}\n{row['model']}" for _, row in subset.iterrows()]

            plt.subplot(1, len(test_messages), position)
            plt.bar(x, subset['latency'])
            plt.xticks(x, labels, rotation=45)
            plt.title(f"Latency: {message_type['name']}")
            plt.ylabel("Seconds")

        plt.tight_layout()
        plt.savefig('latency_benchmark.png')
```
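With only a few repetitions per model, a mean alone can hide outliers such as a cold model load on the first Ollama request. The sketch below shows the timing-and-summary step in isolation using only the standard library. `time_call`, `summarize`, and `fake_request` are hypothetical names, and `fake_request` merely sleeps in place of a real provider call.

```python
import asyncio
import statistics
import time


async def time_call(make_request):
    """Await one request and return its wall-clock duration in seconds."""
    start = time.perf_counter()
    await make_request()
    return time.perf_counter() - start


def summarize(latencies):
    """Summarize a list of latency samples (seconds)."""
    ordered = sorted(latencies)
    return {
        "mean": statistics.mean(ordered),
        "median": statistics.median(ordered),
        # stdev needs at least two samples
        "stdev": statistics.stdev(ordered) if len(ordered) > 1 else 0.0,
        "max": ordered[-1],
    }


async def fake_request():
    # Stand-in for a provider call; a real benchmark would await the service
    await asyncio.sleep(0.01)


async def benchmark(repetitions=5):
    """Run the request sequentially so samples do not compete for resources."""
    return [await time_call(fake_request) for _ in range(repetitions)]


samples = asyncio.run(benchmark())
print(summarize(samples))
```

Reporting the median and maximum next to the mean makes it easier to tell a consistently slow model from one that is fast after a slow first request.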
Memory Usage Monitoring
```python
# tests/performance/test_memory_usage.py
import pytest
import os
import asyncio
import psutil
import time
import matplotlib.pyplot as plt
import pandas as pd
from typing import List, Dict, Any

from app.services.provider_service import ProviderService, Provider
from app.services.ollama_service import OllamaService

# Skip tests in CI environments
SKIP_PERFORMANCE_TESTS = os.environ.get("CI") == "true"

@pytest.mark.skipif(SKIP_PERFORMANCE_TESTS, reason="Performance tests skipped in CI environment")
class TestMemoryUsage:
    @pytest.fixture
    async def services(self):
        """Set up services for memory testing."""
        if not os.environ.get("OPENAI_API_KEY"):
            pytest.skip("OPENAI_API_KEY environment variable not set")

        # Initialize services
        ollama_service = OllamaService()
        provider_service = ProviderService()

        try:
            await ollama_service.initialize()
            await provider_service.initialize()
        except Exception as e:
            pytest.skip(f"Failed to initialize services: {str(e)}")

        yield {
            "ollama_service": ollama_service,
            "provider_service": provider_service
        }

        # Cleanup
        await ollama_service.cleanup()
        await provider_service.cleanup()

    def get_memory_usage(self):
        """Get current memory usage (RSS) of the test process.

        This measures only the Python process running the tests,
        not the memory used by the Ollama server process.
        """
        process = psutil.Process(os.getpid())
        memory_info = process.memory_info()
        return memory_info.rss / (1024 * 1024)  # Convert to MB

    async def monitor_memory_during_request(self, provider_service, provider, model, messages):
        """Monitor memory usage during a request."""
        memory_samples = []

        # Flag read by the background sampling task
        monitoring = True

        async def memory_monitor():
            start_time = time.time()
            while monitoring:
                memory_samples.append({
                    "time": time.time() - start_time,
                    "memory_mb": self.get_memory_usage()
                })
                await asyncio.sleep(0.1)  # Sample every 100ms

        # Start monitoring
        monitor_task = asyncio.create_task(memory_monitor())

        # Make the request
        start_time = time.time()
        try:
            if provider == "openai":
                await provider_service._generate_openai_completion(
                    messages=messages,
                    model=model
                )
            else:  # ollama
                await provider_service._generate_ollama_completion(
                    messages=messages,
                    model=model
                )
        finally:
            end_time = time.time()

            # Stop monitoring even if the request raised, so the task cannot leak
            monitoring = False
            await monitor_task

        return {
            "samples": memory_samples,
            "duration": end_time - start_time,
            "peak_memory": max(sample["memory_mb"] for sample in memory_samples) if memory_samples else 0,
            "mean_memory": sum(sample["memory_mb"] for sample in memory_samples) / len(memory_samples) if memory_samples else 0
        }

    @pytest.mark.asyncio
    async def test_memory_usage_comparison(self, services):
        """Compare memory usage between OpenAI and Ollama."""
        provider_service = services["provider_service"]

        # Test message
        test_message = {"role": "user", "content": "Write a detailed essay about climate change and its global impact."}

        # Models to test
        models = {
            "openai": ["gpt-3.5-turbo"],
            "ollama": ["llama2"]
        }

        # Collect results
        results = []
        memory_data = {}

        for provider in models:
            for model in models[provider]:
                # Collect initial memory
                initial_memory = self.get_memory_usage()

                # Monitor during request
                memory_result = await self.monitor_memory_during_request(
                    provider_service,
                    provider,
                    model,
                    [test_message]
                )

                # Store results
                key = f"{provider}:{model}"
                memory_data[key] = memory_result["samples"]

                results.append({
                    "provider": provider,
                    "model": model,
                    "initial_memory_mb": initial_memory,
                    "peak_memory_mb": memory_result["peak_memory"],
                    "mean_memory_mb": memory_result["mean_memory"],
                    "memory_increase_mb": memory_result["peak_memory"] - initial_memory,
                    "duration_seconds": memory_result["duration"]
                })

                # Wait a bit to let memory stabilize
                await asyncio.sleep(2)

        # Analyze results
        df = pd.DataFrame(results)

        # Print summary
        print("\nMemory Usage Results:")
        print(df.to_string(index=False))

        # Create visualization
        plt.figure(figsize=(15, 10))

        # Plot memory over time
        plt.subplot(2, 1, 1)
        for key, samples in memory_data.items():
            times = [s["time"] for s in samples]
            memory = [s["memory_mb"] for s in samples]
            plt.plot(times, memory, label=key)

        plt.xlabel("Time (seconds)")
        plt.ylabel("Memory Usage (MB)")
        plt.title("Memory Usage Over Time During Request")
        plt.legend()
        plt.grid(True)

        # Plot peak and increase
        plt.subplot(2, 1, 2)
        providers = df["provider"].tolist()
        models = df["model"].tolist()
        labels = [f"{p}\n{m}" for p, m in zip(providers, models)]
        x = range(len(labels))

        plt.bar(x, df["memory_increase_mb"], label="Memory Increase")
        plt.xticks(x, labels)
        plt.ylabel("Memory (MB)")
        plt.title("Memory Increase by Provider/Model")
        plt.legend()
        plt.grid(True)

        plt.tight_layout()
        plt.savefig('memory_benchmark.png')

        # Assert something meaningful
        assert len(results) > 0, "No memory benchmark results collected"
```
Response Quality Benchmarking
```python
# tests/performance/test_response_quality.py
import pytest
import os
import asyncio
import json
import pandas as pd
import matplotlib.pyplot as plt
from typing import List, Dict, Any

from app.services.provider_service import ProviderService, Provider
from app.services.ollama_service import OllamaService

# Skip tests in CI environments
SKIP_PERFORMANCE_TESTS = os.environ.get("CI") == "true"

@pytest.mark.skipif(SKIP_PERFORMANCE_TESTS, reason="Performance tests skipped in CI environment")
class TestResponseQuality:
    @pytest.fixture
    async def services(self):
        """Set up services for quality testing."""
        if not os.environ.get("OPENAI_API_KEY"):
            pytest.skip("OPENAI_API_KEY environment variable not set")

        # Initialize services
        ollama_service = OllamaService()
        provider_service = ProviderService()

        try:
            await ollama_service.initialize()
            await provider_service.initialize()
        except Exception as e:
            pytest.skip(f"Failed to initialize services: {str(e)}")

        yield {
            "ollama_service": ollama_service,
            "provider_service": provider_service
        }

        # Cleanup
        await ollama_service.cleanup()
        await provider_service.cleanup()

    async def get_response(self, provider_service, provider, model, messages):
        """Get a response from a specific provider and model."""
        if provider == "openai":
            response = await provider_service._generate_openai_completion(
                messages=messages,
                model=model
            )
        else:  # ollama
            response = await provider_service._generate_ollama_completion(
                messages=messages,
                model=model
            )

        return response["message"]["content"]

    async def evaluate_response(self, provider_service, response, criteria):
        """Evaluate a response using GPT-4 as a judge."""
        evaluation_prompt = [
            {"role": "system", "content": """
            You are an expert evaluator of AI responses. Evaluate the given response based on the specified criteria.
            For each criterion, provide a score from 1-10 and a brief explanation.
            Format your response as valid JSON with the following structure:
            {
                "criteria": {
                    "accuracy": {"score": X, "explanation": "..."},
                    "completeness": {"score": X, "explanation": "..."},
                    "coherence": {"score": X, "explanation": "..."},
                    "relevance": {"score": X, "explanation": "..."}
                },
                "overall_score": X,
                "summary": "..."
            }
            """},
            {"role": "user", "content": f"""
            Evaluate this AI response based on {', '.join(criteria)}:

            RESPONSE TO EVALUATE:
            {response}
            """}
        ]

        # Use GPT-4 to evaluate.
        # Note: response_format={"type": "json_object"} is only accepted by models
        # with JSON mode; if the base "gpt-4" model rejects it, switch to one that does.
        evaluation = await provider_service._generate_openai_completion(
            messages=evaluation_prompt,
            model="gpt-4",
            response_format={"type": "json_object"}
        )

        try:
            return json.loads(evaluation["message"]["content"])
        except (json.JSONDecodeError, KeyError, TypeError):
            # Fallback if parsing fails
            return {
                "criteria": {c: {"score": 0, "explanation": "Failed to parse"} for c in criteria},
                "overall_score": 0,
                "summary": "Failed to parse evaluation"
            }

    @pytest.mark.asyncio
    async def test_response_quality_comparison(self, services):
        """Compare response quality between OpenAI and Ollama models."""
        provider_service = services["provider_service"]

        # Test scenarios
        test_scenarios = [
            {
                "name": "factual_knowledge",
                "query": "Explain the process of photosynthesis and its importance to life on Earth."
            },
            {
                "name": "reasoning",
                "query": "A bat and ball cost $1.10 in total. The bat costs $1.00 more than the ball. How much does the ball cost?"
            },
            {
                "name": "creative_writing",
                "query": "Write a short story about a robot discovering emotions."
            },
            {
                "name": "code_generation",
                "query": "Write a Python function to check if a string is a palindrome."
            }
        ]

        # Models to test
        models = {
            "openai": ["gpt-3.5-turbo"],
            "ollama": ["llama2", "mistral"]
        }

        # Evaluation criteria
        criteria = ["accuracy", "completeness", "coherence", "relevance"]

        # Collect results
        results = []

        for scenario in test_scenarios:
            for provider in models:
                for model in models[provider]:
                    try:
                        # Get response
                        response = await self.get_response(
                            provider_service,
                            provider,
                            model,
                            [{"role": "user", "content": scenario["query"]}]
                        )

                        # Evaluate response
                        evaluation = await self.evaluate_response(
                            provider_service,
                            response,
                            criteria
                        )

                        # Store results
                        results.append({
                            "scenario": scenario["name"],
                            "provider": provider,
                            "model": model,
                            "overall_score": evaluation["overall_score"],
                            **{f"{criterion}_score": evaluation["criteria"][criterion]["score"]
                               for criterion in criteria}
                        })

                        # Save raw responses for detailed analysis
                        with open(f"response_{provider}_{model}_{scenario['name']}.txt", "w") as f:
                            f.write(response)

                        # Add a delay to avoid rate limits
                        await asyncio.sleep(2)
                    except Exception as e:
                        print(f"Error evaluating {provider}:{model} on {scenario['name']}: {str(e)}")

        # Analyze results
        df = pd.DataFrame(results)

        # Save results
        df.to_csv("quality_benchmark_results.csv", index=False)

        # Print summary (numeric_only skips the non-numeric "scenario" column)
        print("\nResponse Quality Results:")
        summary = df.groupby(['provider', 'model']).mean(numeric_only=True).reset_index()
        print(summary.to_string(index=False))

        # Create visualization
        plt.figure(figsize=(15, 10))

        # Plot overall scores by scenario, one panel per scenario in a 2x2 grid
        for i, scenario in enumerate(test_scenarios):
            scenario_df = df[df['scenario'] == scenario['name']]
            providers = scenario_df["provider"].tolist()
            models = scenario_df["model"].tolist()
            labels = [f"{p}\n{m}" for p, m in zip(providers, models)]

            plt.subplot(2, 2, i+1)
            plt.bar(labels, scenario_df["overall_score"])
            plt.title(f"Quality Scores: {scenario['name']}")
            plt.ylabel("Score (1-10)")
            plt.ylim(0, 10)
            plt.xticks(rotation=45)

        plt.tight_layout()
        plt.savefig('quality_benchmark.png')

        # Assert something meaningful
        assert len(results) > 0, "No quality benchmark results collected"
```
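Judge models do not always return bare JSON; the reply may be wrapped in prose or a Markdown fence, or may omit a criterion. As a hedged illustration of the fallback idea used in `evaluate_response`, the sketch below validates the reply before it is used. The helper name `parse_evaluation` is hypothetical and not part of the guide's codebase.

```python
import json
from typing import Any, Dict, List


def parse_evaluation(raw: str, criteria: List[str]) -> Dict[str, Any]:
    """Parse a judge reply, returning a zero-score fallback when it is unusable."""
    fallback = {
        "criteria": {c: {"score": 0, "explanation": "Failed to parse"} for c in criteria},
        "overall_score": 0,
        "summary": "Failed to parse evaluation",
    }

    # Tolerate surrounding prose or code fences by slicing the outermost braces.
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        return fallback

    try:
        data = json.loads(raw[start:end + 1])
        scores_ok = all(
            isinstance(data["criteria"][c]["score"], (int, float)) for c in criteria
        )
        overall_ok = isinstance(data["overall_score"], (int, float))
    except (json.JSONDecodeError, KeyError, TypeError):
        return fallback

    return data if scores_ok and overall_ok else fallback


if __name__ == "__main__":
    reply = 'Here you go: {"criteria": {"accuracy": {"score": 8}}, "overall_score": 8}'
    print(parse_evaluation(reply, ["accuracy"])["overall_score"])
```

Validating up front keeps a single malformed reply from raising a `KeyError` later, when the per-criterion scores are copied into the results table.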
4. Reliability Testing Framework
Error Handling and Fallback Testing
```python
# tests/reliability/test_error_handling.py
import pytest
import asyncio
from unittest.mock import AsyncMock, patch, MagicMock
import aiohttp
import openai  # needed for openai.RateLimitError below

from app.services.provider_service import ProviderService, Provider
from app.services.ollama_service import OllamaService

class TestErrorHandling:
    @pytest.fixture
    def provider_service(self):
        """Create a provider service with mocked dependencies for testing."""
        service = ProviderService()
        service.openai_client = AsyncMock()
        service.ollama_service = AsyncMock(spec=OllamaService)
        return service

    @pytest.mark.asyncio
    async def test_openai_connection_error(self, provider_service):
        """Test handling of OpenAI connection errors."""
        # Mock OpenAI to raise a connection error
        provider_service._generate_openai_completion = AsyncMock(
            side_effect=aiohttp.ClientConnectionError("Connection refused")
        )

        # Mock Ollama to succeed
        provider_service._generate_ollama_completion = AsyncMock(return_value={
            "id": "ollama-fallback",
            "provider": "ollama",
            "message": {"content": "Fallback response"}
        })

        # Test with auto routing
        response = await provider_service.generate_completion(
            messages=[{"role": "user", "content": "Test message"}],
            provider="auto"
        )

        # Verify fallback worked
        assert response["provider"] == "ollama"
        assert response["message"]["content"] == "Fallback response"
        provider_service._generate_openai_completion.assert_called_once()
        provider_service._generate_ollama_completion.assert_called_once()

    @pytest.mark.asyncio
    async def test_ollama_connection_error(self, provider_service):
        """Test handling of Ollama connection errors."""
        # Mock the auto routing to select Ollama first
        provider_service._auto_route = AsyncMock(return_value=Provider.OLLAMA)

        # Mock Ollama to fail
        provider_service._generate_ollama_completion = AsyncMock(
            side_effect=aiohttp.ClientConnectionError("Connection refused")
        )

        # Mock OpenAI to succeed
        provider_service._generate_openai_completion = AsyncMock(return_value={
            "id": "openai-fallback",
            "provider": "openai",
            "message": {"content": "Fallback response"}
        })

        # Test with auto routing
        response = await provider_service.generate_completion(
            messages=[{"role": "user", "content": "Test message"}],
            provider="auto"
        )

        # Verify fallback worked
        assert response["provider"] == "openai"
        assert response["message"]["content"] == "Fallback response"
        provider_service._generate_ollama_completion.assert_called_once()
        provider_service._generate_openai_completion.assert_called_once()

    @pytest.mark.asyncio
    async def test_rate_limit_handling(self, provider_service):
        """Test handling of rate limit errors."""
        # Mock OpenAI to raise a rate limit error
        rate_limit_error = MagicMock()
        rate_limit_error.status_code = 429
        rate_limit_error.json.return_value = {"error": {"message": "Rate limit exceeded"}}

        # Assumes openai>=1.0, where status errors take keyword-only
        # `response` and `body` arguments.
        provider_service._generate_openai_completion = AsyncMock(
            side_effect=openai.RateLimitError(
                "Rate limit exceeded", response=rate_limit_error, body=None
            )
        )

        # Mock Ollama to succeed
        provider_service._generate_ollama_completion = AsyncMock(return_value={
            "id": "ollama-fallback",
            "provider": "ollama",
            "message": {"content": "Fallback response"}
        })

        # Test with auto routing
        response = await provider_service.generate_completion(
            messages=[{"role": "user", "content": "Test message"}],
            provider="auto"
        )

        # Verify fallback worked
        assert response["provider"] == "ollama"
        assert response["message"]["content"] == "Fallback response"

    @pytest.mark.asyncio
    async def test_timeout_handling(self, provider_service):
        """Test handling of timeout errors."""
        # Mock OpenAI to raise a timeout error
        provider_service._generate_openai_completion = AsyncMock(
            side_effect=asyncio.TimeoutError("Request timed out")
        )

        # Mock Ollama to succeed
        provider_service._generate_ollama_completion = AsyncMock(return_value={
            "id": "ollama-fallback",
            "provider": "ollama",
            "message": {"content": "Fallback response"}
        })

        # Test with auto routing
        response = await provider_service.generate_completion(
            messages=[{"role": "user", "content": "Test message"}],
            provider="auto"
        )

        # Verify fallback worked
        assert response["provider"] == "ollama"
        assert response["message"]["content"] == "Fallback response"

    @pytest.mark.asyncio
    async def test_all_providers_fail(self, provider_service):
        """Test case when all providers fail."""
        # Mock both providers to fail
        provider_service._generate_openai_completion = AsyncMock(
            side_effect=Exception("OpenAI failed")
        )

        provider_service._generate_ollama_completion = AsyncMock(
            side_effect=Exception("Ollama failed")
        )

        # Test with auto routing - should raise an exception
        with pytest.raises(Exception) as excinfo:
            await provider_service.generate_completion(
                messages=[{"role": "user", "content": "Test message"}],
                provider="auto"
            )

        # Verify the original exception is re-raised
        assert "OpenAI failed" in str(excinfo.value)
        provider_service._generate_openai_completion.assert_called_once()
        provider_service._generate_ollama_completion.assert_called_once()
```
Load Testing
```python
# tests/reliability/test_load.py
import pytest
import asyncio
import time
import os
import pandas as pd
import matplotlib.pyplot as plt
from aiohttp import ClientSession, TCPConnector

from app.services.provider_service import ProviderService, Provider

# Skip tests in CI environments
SKIP_LOAD_TESTS = os.environ.get("CI") == "true"

@pytest.mark.skipif(SKIP_LOAD_TESTS, reason="Load tests skipped in CI environment")
class TestLoadHandling:
    @pytest.fixture
    async def provider_service(self):
        """Set up provider service for load testing."""
        if not os.environ.get("OPENAI_API_KEY"):
            pytest.skip("OPENAI_API_KEY environment variable not set")

        # Initialize service
        service = ProviderService()

        try:
            await service.initialize()
        except Exception as e:
            pytest.skip(f"Failed to initialize service: {str(e)}")

        yield service

        # Cleanup
        await service.cleanup()

    async def send_request(self, provider_service, provider, model, message, request_id):
        """Send a single request and record performance."""
        start_time = time.time()
        success = False
        error = None

        try:
            response = await provider_service.generate_completion(
                messages=[{"role": "user", "content": message}],
                provider=provider,
                model=model
            )
            success = True
        except Exception as e:
            error = str(e)

        end_time = time.time()

        return {
            "request_id": request_id,
            "provider": provider,
            "model": model,
            "success": success,
            "error": error,
            "duration": end_time - start_time
        }

    @pytest.mark.asyncio
    async def test_concurrent_requests(self, provider_service):
        """Test handling of multiple concurrent requests."""
        # Test configurations
        providers = ["openai", "ollama", "auto"]
        request_count = 10  # 10 requests per provider

        # Test message (simple to avoid rate limits)
        message = "What is 2+2?"

        # Create tasks for all requests
        tasks = []
        request_id = 0

        for provider in providers:
            for _ in range(request_count):
                # Determine model based on provider
                if provider == "openai":
                    model = "gpt-3.5-turbo"
                elif provider == "ollama":
                    model = "llama2"
                else:
                    model = None  # Auto select

                tasks.append(self.send_request(
                    provider_service,
                    provider,
                    model,
                    message,
                    request_id
                ))
                request_id += 1

                # Small delay to avoid immediate rate limiting
                await asyncio.sleep(0.1)

        # Run requests concurrently with a reasonable concurrency limit
        concurrency_limit = 5
        results = []

        for i in range(0, len(tasks), concurrency_limit):
            batch = tasks[i:i+concurrency_limit]
            batch_results = await asyncio.gather(*batch)
            results.extend(batch_results)

            # Delay between batches to avoid rate limits
            await asyncio.sleep(2)

        # Analyze results
        df = pd.DataFrame(results)

        # Print summary
        print("\nConcurrent Request Test Results:")
        success_rate = df.groupby('provider')['success'].mean() * 100
        mean_duration = df.groupby('provider')['duration'].mean()

        summary = pd.DataFrame({
            'success_rate': success_rate,
            'mean_duration': mean_duration
        }).reset_index()

        print(summary.to_string(index=False))

        # Create visualization
        plt.figure(figsize=(12, 10))

        # Plot success rate
        plt.subplot(2, 1, 1)
        plt.bar(summary['provider'], summary['success_rate'])
        plt.title('Success Rate by Provider')
        plt.ylabel('Success Rate (%)')
        plt.ylim(0, 100)

        # Plot response times
        plt.subplot(2, 1, 2)
        for provider in providers:
            provider_df = df[df['provider'] == provider]
            plt.plot(provider_df['request_id'], provider_df['duration'], marker='o', label=provider)

        plt.title('Response Time by Request')
        plt.xlabel('Request ID')
        plt.ylabel('Duration (seconds)')
        plt.legend()
        plt.grid(True)

        plt.tight_layout()
        plt.savefig('load_test_results.png')

        # Assert reasonable success rate
        for provider in providers:
            provider_success = df[df['provider'] == provider]['success'].mean() * 100
            assert provider_success >= 70, f"Success rate for {provider} is below 70%"
```
Stability Testing for Extended Sessions
```python
# tests/reliability/test_stability.py
import pytest
import asyncio
import time
import os
import random
import pandas as pd
import matplotlib.pyplot as plt
from typing import List, Dict, Any

from app.services.provider_service import ProviderService, Provider
from app.agents.base_agent import BaseAgent, AgentState
from app.agents.research_agent import ResearchAgent
from app.models.message import Message, MessageRole

# Skip tests in CI environments
SKIP_STABILITY_TESTS = os.environ.get("CI") == "true"

@pytest.mark.skipif(SKIP_STABILITY_TESTS, reason="Stability tests skipped in CI environment")
class TestSystemStability:
    @pytest.fixture
    async def setup(self):
        """Set up test environment with services and agents."""
        if not os.environ.get("OPENAI_API_KEY"):
            pytest.skip("OPENAI_API_KEY environment variable not set")

        # Initialize service
        provider_service = ProviderService()

        try:
            await provider_service.initialize()
        except Exception as e:
            pytest.skip(f"Failed to initialize service: {str(e)}")

        # Create a test agent
        agent = ResearchAgent(
            provider_service=provider_service,
            knowledge_service=None,  # Mock would be better but we're testing stability
            system_prompt="You are a helpful research assistant."
        )

        yield {
            "provider_service": provider_service,
            "agent": agent
        }

        # Cleanup
        await provider_service.cleanup()

    async def run_conversation_turn(self, agent, message, turn_number):
        """Run a single conversation turn and record metrics."""
        start_time = time.time()
        success = False
        error = None
        memory_before = self.get_memory_usage()

        try:
            response = await agent.process_message(message, f"test_user_{turn_number}")
            success = True
        except Exception as e:
            error = str(e)
            response = None

        end_time = time.time()
        memory_after = self.get_memory_usage()

        return {
            "turn": turn_number,
            "success": success,
            "error": error,
            "duration": end_time - start_time,
            "memory_before": memory_before,
            "memory_after": memory_after,
            "memory_increase": memory_after - memory_before,
            "history_length": len(agent.state.conversation_history),
            "response_length": len(response) if response else 0
        }

    def get_memory_usage(self):
        """Get current memory usage in MB."""
        import psutil
        process = psutil.Process(os.getpid())
        memory_info = process.memory_info()
        return memory_info.rss / (1024 * 1024)  # Convert to MB

    @pytest.mark.asyncio
    async def test_extended_conversation(self, setup):
        """Test system stability over an extended conversation."""
        agent = setup["agent"]

        # List of test questions for the conversation
        questions = [
            "What is machine learning?",
            "Can you explain neural networks?",
            "What is the difference between supervised and unsupervised learning?",
            "How does reinforcement learning work?",
            "What are some applications of deep learning?",
            "Explain the concept of overfitting.",
            "What is transfer learning?",
            "How does backpropagation work?",
            "What are convolutional neural networks?",
            "Explain the transformer architecture.",
            "What is BERT and how does it work?",
            "What are GANs used for?",
            "Explain the concept of attention in neural networks.",
            "What is the difference between RNNs and LSTMs?",
            "How do recommendation systems work?"
        ]

        # Run an extended conversation
        results = []
        turn_limit = min(len(questions), 15)  # Limit to 15 turns for test duration

        for turn in range(turn_limit):
            # For later turns, occasionally refer to previous information
            if turn > 3 and random.random() < 0.3:
                message = f"Can you explain more about what you mentioned earlier regarding {random.choice(questions[:turn]).lower().replace('?', '')}"
            else:
                message = questions[turn]

            result = await self.run_conversation_turn(agent, message, turn)
            results.append(result)

            # Print progress
            status = "✓" if result["success"] else "✗"
            print(f"Turn {turn+1}/{turn_limit} {status} - Time: {result['duration']:.2f}s")

            # Delay between turns
            await asyncio.sleep(2)

        # Analyze results
        df = pd.DataFrame(results)

        # Print summary statistics
        print("\nExtended Conversation Test Results:")
        print(f"Success rate: {df['success'].mean()*100:.1f}%")
        print(f"Average response time: {df['duration'].mean():.2f}s")
        print(f"Final conversation history length: {df['history_length'].iloc[-1]}")
        print(f"Memory usage increase: {df['memory_after'].iloc[-1] - df['memory_before'].iloc[0]:.2f} MB")

        # Create visualization
        plt.figure(figsize=(15, 12))

        # Plot response times
        plt.subplot(3, 1, 1)
        plt.plot(df['turn'], df['duration'], marker='o')
        plt.title('Response Time by Conversation Turn')
        plt.xlabel('Turn')
        plt.ylabel('Duration (seconds)')
        plt.grid(True)

        # Plot memory usage
        plt.subplot(3, 1, 2)
        plt.plot(df['turn'], df['memory_after'], marker='o')
        plt.title('Memory Usage Over Conversation')
        plt.xlabel('Turn')
        plt.ylabel('Memory (MB)')
        plt.grid(True)

        # Plot history length and response length
        plt.subplot(3, 1, 3)
        plt.plot(df['turn'], df['history_length'], marker='o', label='History Length')
        plt.plot(df['turn'], df['response_length'], marker='x', label='Response Length')
        plt.title('Conversation Metrics')
        plt.xlabel('Turn')
        plt.ylabel('Length (chars/items)')
        plt.legend()
        plt.grid(True)

        plt.tight_layout()
        plt.savefig('stability_test_results.png')

        # Assert reasonable success rate
        assert df['success'].mean() >= 0.8, "Success rate below 80%"

        # Check for memory leaks (large, consistent growth would be concerning)
        memory_growth_rate = (df['memory_after'].iloc[-1] - df['memory_before'].iloc[0]) / turn_limit
        assert memory_growth_rate < 50, f"Excessive memory growth rate: {memory_growth_rate:.2f} MB/turn"
```
Automation Framework
Test Orchestration Script
```python
#!/usr/bin/env python
# scripts/run_tests.py
# (The shebang must be the first line of the file to take effect.)
import argparse
import os
import sys
import subprocess
import time
from datetime import datetime

def parse_args():
    parser = argparse.ArgumentParser(description='Run test suite for OpenAI-Ollama integration')
    parser.add_argument('--unit', action='store_true', help='Run unit tests')
    parser.add_argument('--integration', action='store_true', help='Run integration tests')
    parser.add_argument('--performance', action='store_true', help='Run performance tests')
    parser.add_argument('--reliability', action='store_true', help='Run reliability tests')
    parser.add_argument('--all', action='store_true', help='Run all tests')
    parser.add_argument('--html', action='store_true', help='Generate HTML report')
    parser.add_argument('--output-dir', default='test_results', help='Directory for test results')

    args = parser.parse_args()

    # If no specific test type is selected, run all
    if not (args.unit or args.integration or args.performance or args.reliability or args.all):
        args.all = True

    return args

def run_test_suite(test_type, output_dir, html=False):
    """Run a specific test suite and return success status."""
    print(f"\n{'='*80}\nRunning {test_type} tests\n{'='*80}")

    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    report_file = f"{output_dir}/{test_type}_report_{timestamp}"

    # Create command with appropriate flags
    cmd = ["pytest", f"tests/{test_type}", "-v"]

    if html:
        # --html and --self-contained-html are provided by the pytest-html plugin
        cmd.extend(["--html", f"{report_file}.html", "--self-contained-html"])

    # Add JUnit XML report for CI integration
    cmd.extend(["--junitxml", f"{report_file}.xml"])

    # Run the tests
    start_time = time.time()
    result = subprocess.run(cmd)
    duration = time.time() - start_time

    # Print summary
    status = "PASSED" if result.returncode == 0 else "FAILED"
    print(f"\n{test_type} tests {status} in {duration:.2f} seconds")

    if html:
        print(f"HTML report saved to {report_file}.html")

    print(f"XML report saved to {report_file}.xml")

    return result.returncode == 0

def main():
    args = parse_args()

    # Create output directory if it doesn't exist
    os.makedirs(args.output_dir, exist_ok=True)

    # Track overall success
    all_passed = True

    # Run selected test suites
    if args.all or args.unit:
        unit_passed = run_test_suite("unit", args.output_dir, args.html)
        all_passed = all_passed and unit_passed

    if args.all or args.integration:
        integration_passed = run_test_suite("integration", args.output_dir, args.html)
        all_passed = all_passed and integration_passed

    if args.all or args.performance:
        performance_passed = run_test_suite("performance", args.output_dir, args.html)
        # Performance tests might be informational, so don't fail the build

    if args.all or args.reliability:
        reliability_passed = run_test_suite("reliability", args.output_dir, args.html)
        all_passed = all_passed and reliability_passed

    # Print overall summary
    print(f"\n{'='*80}")
    print(f"Test Suite {'PASSED' if all_passed else 'FAILED'}")
    print(f"{'='*80}")

    # Return appropriate exit code
    return 0 if all_passed else 1

if __name__ == "__main__":
    sys.exit(main())
```
CI/CD Configuration
```yaml
# .github/workflows/test.yml
name: Test Suite

on:
  push:
    branches: [ main, develop ]
  pull_request:
    branches: [ main, develop ]
  workflow_dispatch:
    inputs:
      test_type:
        description: 'Test suite to run (unit, integration, all)'
        required: true
        default: 'unit'

jobs:
  test:
    runs-on: ubuntu-latest

    services:
      ollama:
        image: ollama/ollama:latest
        ports:
          - 11434:11434

    steps:
      - uses: actions/checkout@v3

      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.11'

      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install -r requirements.txt
          pip install -r requirements-dev.txt

      - name: Pull Ollama models
        run: |
          # Wait for Ollama service to be ready
          timeout 60 bash -c 'until curl -s -f http://localhost:11434/api/tags > /dev/null; do sleep 1; done'
          # Pull basic model for testing
          curl -X POST http://localhost:11434/api/pull -d '{"name":"llama2:7b-chat-q4_0"}'

      - name: Run unit tests
        if: ${{ github.event.inputs.test_type == 'unit' || github.event.inputs.test_type == 'all' || github.event.inputs.test_type == '' }}
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          OLLAMA_HOST: http://localhost:11434
        run: pytest tests/unit -v --junitxml=unit-test-results.xml

      - name: Run integration tests
        if: ${{ github.event.inputs.test_type == 'integration' || github.event.inputs.test_type == 'all' }}
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          OLLAMA_HOST: http://localhost:11434
        run: pytest tests/integration -v --junitxml=integration-test-results.xml

      - name: Upload test results
        if: always()
        # v3 of upload-artifact has been deprecated on GitHub-hosted runners; v4 accepts the same inputs used here
        uses: actions/upload-artifact@v4
        with:
          name: test-results
          path: '*-test-results.xml'

      - name: Publish Test Report
        uses: mikepenz/action-junit-report@v3
        if: always()
        with:
          report_paths: '*-test-results.xml'
          fail_on_failure: true
```
Comparative Benchmark Framework
Response Quality Evaluation Matrix
```python
# tests/benchmarks/quality_matrix.py
import pytest
import pytest_asyncio
import asyncio
import json
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os
from typing import List, Dict, Any

from app.services.provider_service import ProviderService, Provider
from app.services.ollama_service import OllamaService

# Test questions across multiple domains
BENCHMARK_QUESTIONS = {
    "factual_knowledge": [
        "What are the main causes of climate change?",
        "Explain how vaccines work in the human body.",
        "What were the key causes of World War I?",
        "Describe the process of photosynthesis.",
        "What is the difference between DNA and RNA?"
    ],
    "reasoning": [
        "If it takes 5 machines 5 minutes to make 5 widgets, how long would it take 100 machines to make 100 widgets?",
        "A bat and ball cost $1.10 in total. The bat costs $1.00 more than the ball. How much does the ball cost?",
        "In a lake, there is a patch of lily pads. Every day, the patch doubles in size. If it takes 48 days for the patch to cover the entire lake, how long would it take for the patch to cover half of the lake?",
        "If three people can paint three fences in three hours, how many people would be needed to paint six fences in six hours?",
        "Imagine a rope that goes around the Earth at the equator, lying flat on the ground. If you add 10 meters to the length of this rope and space it evenly above the ground, how high above the ground would the rope be?"
    ],
    "creative_writing": [
        "Write a short story about a robot discovering emotions.",
        "Create a poem about the changing seasons.",
        "Write a creative dialogue between the ocean and the moon.",
        "Describe a world where humans can photosynthesize like plants.",
        "Create a character sketch of a time-traveling historian."
    ],
    "code_generation": [
        "Write a Python function to check if a string is a palindrome.",
        "Create a JavaScript function that finds the most frequent element in an array.",
        "Write a SQL query to find the top 5 customers by purchase amount.",
        "Implement a binary search algorithm in the language of your choice.",
        "Write a function to detect a cycle in a linked list."
    ],
    "instruction_following": [
        "List 5 fruits, then number them in the reverse order, then highlight the one that starts with 'a' if any.",
        "Explain quantum computing in 3 paragraphs, then summarize each paragraph in one sentence, then create a single slogan based on these summaries.",
        "Create a table comparing 3 car models based on price, fuel efficiency, and safety. Then add a row showing which model is best in each category.",
        "Write a recipe for chocolate cake, then modify it to be vegan, then list only the ingredients that changed.",
        "Translate 'Hello, how are you?' to French, Spanish, and German, then identify which language uses the most words."
    ]
}


class TestQualityMatrix:
    # Async fixtures must use pytest_asyncio.fixture when pytest-asyncio
    # runs in its default strict mode.
    @pytest_asyncio.fixture
    async def services(self):
        """Set up services for benchmark testing."""
        if not os.environ.get("OPENAI_API_KEY"):
            pytest.skip("OPENAI_API_KEY environment variable not set")

        # Initialize services
        ollama_service = OllamaService()
        provider_service = ProviderService()

        try:
            await ollama_service.initialize()
            await provider_service.initialize()
        except Exception as e:
            pytest.skip(f"Failed to initialize services: {str(e)}")

        yield {
            "ollama_service": ollama_service,
            "provider_service": provider_service
        }

        # Cleanup
        await ollama_service.cleanup()
        await provider_service.cleanup()

    async def generate_response(self, provider_service, provider, model, question):
        """Generate a response from a specific provider and model."""
        try:
            if provider == "openai":
                response = await provider_service._generate_openai_completion(
                    messages=[{"role": "user", "content": question}],
                    model=model,
                    temperature=0.7
                )
            else:  # ollama
                response = await provider_service._generate_ollama_completion(
                    messages=[{"role": "user", "content": question}],
                    model=model,
                    temperature=0.7
                )

            return {
                "success": True,
                "content": response["message"]["content"],
                "metadata": {
                    "model": model,
                    "provider": provider
                }
            }
        except Exception as e:
            return {
                "success": False,
                "error": str(e),
                "metadata": {
                    "model": model,
                    "provider": provider
                }
            }

    async def evaluate_response(self, provider_service, question, response, category):
        """Evaluate a response using GPT-4 as a judge."""
        # Skip evaluation if response generation failed
        if not response.get("success", False):
            return {
                "scores": {
                    "correctness": 0,
                    "completeness": 0,
                    "coherence": 0,
                    "conciseness": 0,
                    "overall": 0
                },
                "explanation": f"Failed to generate response: {response.get('error', 'Unknown error')}"
            }

        evaluation_criteria = {
            "factual_knowledge": ["correctness", "completeness", "coherence", "citation"],
            "reasoning": ["logical_flow", "correctness", "explanation_quality", "step_by_step"],
            "creative_writing": ["originality", "coherence", "engagement", "language_use"],
            "code_generation": ["correctness", "efficiency", "readability", "explanation"],
            "instruction_following": ["accuracy", "completeness", "precision", "structure"]
        }

        # Get the appropriate criteria for this category
        criteria = evaluation_criteria.get(category, ["correctness", "completeness", "coherence", "overall"])

        evaluation_prompt = [
            {"role": "system", "content": f"""
            You are an expert evaluator of AI responses. Evaluate the given response to the question based on the following criteria:

            {', '.join(criteria)}

            For each criterion, provide a score from 1-10 and a brief explanation.
            Also provide an overall score from 1-10.

            Format your response as valid JSON with the following structure:
            {{
                "scores": {{
                    "{criteria[0]}": X,
                    "{criteria[1]}": X,
                    "{criteria[2]}": X,
                    "{criteria[3]}": X,
                    "overall": X
                }},
                "explanation": "Your overall assessment and suggestions for improvement"
            }}
            """},
            {"role": "user", "content": f"""
            Question: {question}

            Response to evaluate:
            {response["content"]}
            """}
        ]

        # Use GPT-4 to evaluate.
        # Note: JSON mode (response_format) is only accepted by newer models
        # such as gpt-4-turbo; adjust the judge model if the API rejects this.
        evaluation = await provider_service._generate_openai_completion(
            messages=evaluation_prompt,
            model="gpt-4",
            response_format={"type": "json_object"}
        )

        try:
            return json.loads(evaluation["message"]["content"])
        except (json.JSONDecodeError, KeyError, TypeError):
            # Fallback if parsing fails
            return {
                "scores": {criterion: 0 for criterion in criteria + ["overall"]},
                "explanation": "Failed to parse evaluation"
            }

    @pytest.mark.asyncio
    async def test_quality_matrix(self, services):
        """Generate a comprehensive quality comparison matrix."""
        provider_service = services["provider_service"]

        # Models to test
        models = {
            "openai": ["gpt-3.5-turbo", "gpt-4-turbo"],
            "ollama": ["llama2", "mistral", "codellama"]
        }

        # Select a subset of questions for each category to keep test duration reasonable
        test_questions = {}
        for category, questions in BENCHMARK_QUESTIONS.items():
            # Take the first 2 questions per category
            test_questions[category] = questions[:2]

        # Collect results
        all_results = []

        for category, questions in test_questions.items():
            for question in questions:
                for provider in models:
                    for model in models[provider]:
                        print(f"Testing {provider}:{model} on {category} question")

                        # Generate response
                        response = await self.generate_response(
                            provider_service,
                            provider,
                            model,
                            question
                        )

                        # Save raw response
                        model_safe_name = model.replace(":", "_")
                        os.makedirs("benchmark_responses", exist_ok=True)
                        with open(f"benchmark_responses/{provider}_{model_safe_name}_{category}.txt", "a") as f:
                            f.write(f"\nQuestion: {question}\n\n")
                            f.write(f"Response: {response.get('content', 'ERROR: ' + response.get('error', 'Unknown error'))}\n")
                            f.write("-" * 80 + "\n")

                        # If successful, evaluate the response
                        if response.get("success", False):
                            evaluation = await self.evaluate_response(
                                provider_service,
                                question,
                                response,
                                category
                            )

                            # Add to results
                            result = {
                                "category": category,
                                "question": question,
                                "provider": provider,
                                "model": model,
                                "success": response["success"]
                            }

                            # Add scores
                            for criterion, score in evaluation["scores"].items():
                                result[f"score_{criterion}"] = score

                            all_results.append(result)
                        else:
                            # Add failed result
                            all_results.append({
                                "category": category,
                                "question": question,
                                "provider": provider,
                                "model": model,
                                "success": False,
                                "score_overall": 0
                            })

                        # Add a delay to avoid rate limits
                        await asyncio.sleep(2)

        # Analyze results
        df = pd.DataFrame(all_results)

        # Save full results
        df.to_csv("benchmark_quality_matrix.csv", index=False)

        # Create summary by model and category
        summary = df.groupby(["provider", "model", "category"])["score_overall"].mean().reset_index()
        pivot_summary = summary.pivot_table(
            index=["provider", "model"],
            columns="category",
            values="score_overall"
        ).round(2)

        # Add average across categories
        pivot_summary["average"] = pivot_summary.mean(axis=1)

        # Save summary
        pivot_summary.to_csv("benchmark_quality_summary.csv")

        # Create visualization
        plt.figure(figsize=(15, 10))

        # Heatmap of scores
        plt.subplot(1, 1, 1)
        sns.heatmap(pivot_summary, annot=True, cmap="YlGnBu", vmin=1, vmax=10)
        plt.title("Model Performance by Category (Average Score 1-10)")

        plt.tight_layout()
        plt.savefig('benchmark_quality_matrix.png')

        # Print summary to console
        print("\nQuality Benchmark Results:")
        print(pivot_summary.to_string())

        # Assert something meaningful
        assert len(all_results) > 0, "No benchmark results collected"
```
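The evaluation step depends on the judge returning clean JSON, and any parse failure zeroes out every score. A more forgiving parser can be sketched as follows. This is a minimal illustration, not part of the benchmark above: `parse_judge_scores` is a hypothetical helper, and the clamping to the 0-10 range is an assumption about how out-of-range scores should be treated.

```python
import json
import re
from typing import Any, Dict, List


def parse_judge_scores(raw: str, criteria: List[str]) -> Dict[str, Any]:
    """Extract a scores dict from judge output, tolerating surrounding prose."""
    expected = list(criteria) + ["overall"]
    fallback = {
        "scores": {name: 0 for name in expected},
        "explanation": "Failed to parse evaluation",
    }

    # Take the span from the first "{" to the last "}" so that any text
    # before or after the JSON object is ignored.
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    if match is None:
        return fallback
    try:
        data = json.loads(match.group(0))
    except json.JSONDecodeError:
        return fallback

    raw_scores = data.get("scores") if isinstance(data, dict) else None
    if not isinstance(raw_scores, dict):
        return fallback

    scores = {}
    for name in expected:
        value = raw_scores.get(name, 0)
        # Missing or non-numeric scores count as 0; valid ones are clamped to 0-10.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            value = 0
        scores[name] = max(0, min(10, value))

    return {"scores": scores, "explanation": str(data.get("explanation", ""))}
```

Returning the same keys for every outcome keeps the downstream `score_*` columns consistent across rows, whether or not the judge output could be parsed.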
Latency and Cost Efficiency Analysis
```python
# tests/benchmarks/efficiency_analysis.py
import pytest
import pytest_asyncio
import asyncio
import time
import os
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from typing import List, Dict, Any

from app.services.provider_service import ProviderService, Provider
from app.services.ollama_service import OllamaService

# Test prompts of different lengths
BENCHMARK_PROMPTS = {
    "short": "What is artificial intelligence?",
    "medium": "Explain the differences between supervised, unsupervised, and reinforcement learning in machine learning.",
    "long": "Write a comprehensive essay on the ethical implications of artificial intelligence in healthcare, considering patient privacy, diagnostic accuracy, and accessibility issues.",
    "very_long": """
    Analyze the historical development of artificial intelligence from its conceptual origins to the present day.
    Include key milestones, technological breakthroughs, paradigm shifts in approaches, and influential researchers.
    Also discuss how AI has been portrayed in popular culture and how that has influenced public perception and research funding.
    Finally, provide a thoughtful discussion on where AI might be headed in the next 20 years and what ethical frameworks
    should be considered as we continue to advance the technology.
    """
}


class TestEfficiencyAnalysis:
    # Async fixtures must use pytest_asyncio.fixture when pytest-asyncio
    # runs in its default strict mode.
    @pytest_asyncio.fixture
    async def services(self):
        """Set up services for benchmark testing."""
        if not os.environ.get("OPENAI_API_KEY"):
            pytest.skip("OPENAI_API_KEY environment variable not set")

        # Initialize services
        ollama_service = OllamaService()
        provider_service = ProviderService()

        try:
            await ollama_service.initialize()
            await provider_service.initialize()
        except Exception as e:
            pytest.skip(f"Failed to initialize services: {str(e)}")

        yield {
            "ollama_service": ollama_service,
            "provider_service": provider_service
        }

        # Cleanup
        await ollama_service.cleanup()
        await provider_service.cleanup()

    async def measure_response_metrics(self, provider_service, provider, model, prompt, max_tokens=None):
        """Measure response time, token counts, and other metrics."""
        start_time = time.time()
        success = False
        error = None
        token_count = {"prompt": 0, "completion": 0, "total": 0}

        try:
            if provider == "openai":
                response = await provider_service._generate_openai_completion(
                    messages=[{"role": "user", "content": prompt}],
                    model=model,
                    max_tokens=max_tokens
                )
            else:  # ollama
                response = await provider_service._generate_ollama_completion(
                    messages=[{"role": "user", "content": prompt}],
                    model=model,
                    max_tokens=max_tokens
                )

            success = True

            # Extract token counts from usage if available
            if "usage" in response:
                token_count = {
                    "prompt": response["usage"].get("prompt_tokens", 0),
                    "completion": response["usage"].get("completion_tokens", 0),
                    "total": response["usage"].get("total_tokens", 0)
                }

            response_text = response["message"]["content"]

        except Exception as e:
            error = str(e)
            response_text = None

        end_time = time.time()
        duration = end_time - start_time

        # Estimate cost (for OpenAI)
        cost = 0.0
        if provider == "openai" and success:
            if "gpt-4" in model:
                # GPT-4 pricing (approximate)
                cost = token_count["prompt"] * 0.00003 + token_count["completion"] * 0.00006
            else:
                # GPT-3.5 pricing (approximate)
                cost = token_count["prompt"] * 0.0000015 + token_count["completion"] * 0.000002

        return {
            "success": success,
            "error": error,
            "duration": duration,
            "token_count": token_count,
            "response_length": len(response_text) if response_text else 0,
            "cost": cost,
            "tokens_per_second": token_count["completion"] / duration if success and duration > 0 else 0
        }

    @pytest.mark.asyncio
    async def test_efficiency_benchmark(self, services):
        """Perform comprehensive efficiency analysis."""
        provider_service = services["provider_service"]

        # Models to test
        models = {
            "openai": ["gpt-3.5-turbo", "gpt-4"],
            "ollama": ["llama2", "mistral:7b", "llama2:13b"]
        }

        # Number of repetitions for each test
        repetitions = 2

        # Results
        results = []

        for prompt_length, prompt in BENCHMARK_PROMPTS.items():
            for provider in models:
                for model in models[provider]:
                    print(f"Testing {provider}:{model} with {prompt_length} prompt")

                    for rep in range(repetitions):
                        try:
                            metrics = await self.measure_response_metrics(
                                provider_service,
                                provider,
                                model,
                                prompt
                            )

                            results.append({
                                "provider": provider,
                                "model": model,
                                "prompt_length": prompt_length,
                                "repetition": rep + 1,
                                **metrics
                            })

                            # Add a delay to avoid rate limits
                            await asyncio.sleep(2)
                        except Exception as e:
                            print(f"Error in benchmark: {str(e)}")

        # Create DataFrame
        df = pd.DataFrame(results)

        # Save raw results
        df.to_csv("benchmark_efficiency_raw.csv", index=False)

        # Create summary by model and prompt length
        latency_summary = df.groupby(["provider", "model", "prompt_length"])["duration"].mean().reset_index()
        latency_pivot = latency_summary.pivot_table(
            index=["provider", "model"],
            columns="prompt_length",
            values="duration"
        ).round(2)

        # Calculate efficiency metrics (tokens per second and cost per 1000 tokens)
        efficiency_df = df[df["success"]].copy()
        efficiency_df["cost_per_1k_tokens"] = efficiency_df.apply(
            lambda row: (row["cost"] * 1000 / row["token_count"]["total"])
            if row["provider"] == "openai" and row["token_count"]["total"] > 0
            else 0,
            axis=1
        )

        efficiency_summary = efficiency_df.groupby(["provider", "model"])[
            ["tokens_per_second", "cost_per_1k_tokens"]
        ].mean().round(3)

        # Save summaries
        latency_pivot.to_csv("benchmark_latency_summary.csv")
        efficiency_summary.to_csv("benchmark_efficiency_summary.csv")

        # Create visualizations
        plt.figure(figsize=(15, 10))

        # Latency by prompt length and model
        plt.subplot(2, 1, 1)
        ax = plt.gca()
        latency_pivot.plot(kind='bar', ax=ax)
        plt.title("Response Time by Prompt Length")
        plt.ylabel("Time (seconds)")
        plt.xticks(rotation=45)
        plt.legend(title="Prompt Length")

        # Tokens per second by model
        plt.subplot(2, 2, 3)
        efficiency_summary["tokens_per_second"].plot(kind='bar')
        plt.title("Generation Speed (Tokens/Second)")
        plt.ylabel("Tokens per Second")
        plt.xticks(rotation=45)

        # Cost per 1000 tokens (OpenAI only)
        plt.subplot(2, 2, 4)
        openai_efficiency = efficiency_summary.loc["openai"]
        openai_efficiency["cost_per_1k_tokens"].plot(kind='bar')
        plt.title("Cost per 1000 Tokens (OpenAI)")
        plt.ylabel("Cost ($)")
        plt.xticks(rotation=45)

        plt.tight_layout()
        plt.savefig('benchmark_efficiency.png')

        # Print summary to console
        print("\nLatency by Prompt Length (seconds):")
        print(latency_pivot.to_string())

        print("\nEfficiency Metrics:")
        print(efficiency_summary.to_string())

        # Comparison analysis
        if "ollama" in df["provider"].values and "openai" in df["provider"].values:
            # Calculate average speedup/slowdown ratio
            openai_avg = df[df["provider"] == "openai"]["duration"].mean()
            ollama_avg = df[df["provider"] == "ollama"]["duration"].mean()

            speedup = openai_avg / ollama_avg if ollama_avg > 0 else float('inf')

            print(f"\nAverage time ratio (OpenAI/Ollama): {speedup:.2f}")
            if speedup > 1:
                print(f"Ollama is {speedup:.2f}x faster on average")
            else:
                print(f"OpenAI is {1/speedup:.2f}x faster on average")

        # Assert something meaningful
        assert len(results) > 0, "No benchmark results collected"
```
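The cost estimate in the benchmark hard-codes per-token prices inside the measurement function. One way to make that easier to maintain is to move the prices into a lookup table. The sketch below is illustrative only: `PRICING` and `estimate_cost` are hypothetical names, and the prices are the approximate figures from the benchmark code, not current list prices.

```python
from typing import Dict, Tuple

# (prompt, completion) price in USD per token. These values are copied from
# the approximate figures in the benchmark and are assumptions, not live prices.
PRICING: Dict[str, Tuple[float, float]] = {
    "gpt-4": (0.00003, 0.00006),
    "gpt-3.5": (0.0000015, 0.000002),
}


def estimate_cost(provider: str, model: str, prompt_tokens: int, completion_tokens: int) -> float:
    """Estimate the USD cost of one request; local (non-OpenAI) runs cost 0."""
    if provider != "openai":
        return 0.0
    # Check longer prefixes first so a more specific entry wins over a shorter one.
    for prefix in sorted(PRICING, key=len, reverse=True):
        if model.startswith(prefix):
            prompt_price, completion_price = PRICING[prefix]
            return prompt_tokens * prompt_price + completion_tokens * completion_price
    raise ValueError(f"No pricing entry for model {model!r}")
```

Unlike the inline version, which silently applies GPT-3.5 pricing to any model whose name lacks "gpt-4", this sketch raises on unknown models so that a missing price is noticed rather than under-reported.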
Tool Usage Comparison
```python
# tests/benchmarks/tool_usage_comparison.py
import pytest
import pytest_asyncio
import asyncio
import json
import time
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os
from typing import List, Dict, Any

from app.services.provider_service import ProviderService, Provider
from app.services.ollama_service import OllamaService

# Test tools for benchmarking
BENCHMARK_TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather in a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "The city and state, e.g. San Francisco, CA"
                    },
                    "unit": {
                        "type": "string",
                        "enum": ["celsius", "fahrenheit"],
                        "description": "The temperature unit to use"
                    }
                },
                "required": ["location"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "search_hotels",
            "description": "Search for hotels in a specific location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "The city to search in"
                    },
                    "check_in": {
                        "type": "string",
                        "description": "Check-in date in YYYY-MM-DD format"
                    },
                    "check_out": {
                        "type": "string",
                        "description": "Check-out date in YYYY-MM-DD format"
                    },
                    "guests": {
                        "type": "integer",
                        "description": "Number of guests"
                    },
                    "price_range": {
                        "type": "string",
                        "description": "Price range, e.g. '$0-$100'"
                    }
                },
                "required": ["location", "check_in", "check_out"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "calculate_mortgage",
            "description": "Calculate monthly mortgage payment",
            "parameters": {
                "type": "object",
                "properties": {
                    "loan_amount": {
                        "type": "number",
                        "description": "The loan amount in dollars"
                    },
                    "interest_rate": {
                        "type": "number",
                        "description": "Annual interest rate (percentage)"
                    },
                    "loan_term": {
                        "type": "integer",
                        "description": "Loan term in years"
                    },
                    "down_payment": {
                        "type": "number",
                        "description": "Down payment amount in dollars"
                    }
                },
                "required": ["loan_amount", "interest_rate", "loan_term"]
            }
        }
    }
]

# Tool usage queries
TOOL_QUERIES = [
    "What's the weather like in Miami right now?",
    "Find me hotels in New York for next weekend for 2 people.",
    "Calculate the monthly payment for a $300,000 mortgage with 4.5% interest over 30 years.",
    "What's the weather in Tokyo and Paris this week?",
    "I need to calculate mortgage payments for different interest rates: 3%, 4%, and 5% on a $250,000 loan."
]


class TestToolUsageComparison:
    # Async fixtures must use pytest_asyncio.fixture when pytest-asyncio
    # runs in its default strict mode.
    @pytest_asyncio.fixture
    async def services(self):
        """Set up services for benchmark testing."""
        if not os.environ.get("OPENAI_API_KEY"):
            pytest.skip("OPENAI_API_KEY environment variable not set")

        # Initialize services
        ollama_service = OllamaService()
        provider_service = ProviderService()

        try:
            await ollama_service.initialize()
            await provider_service.initialize()
        except Exception as e:
            pytest.skip(f"Failed to initialize services: {str(e)}")

        yield {
            "ollama_service": ollama_service,
            "provider_service": provider_service
        }

        # Cleanup
        await ollama_service.cleanup()
        await provider_service.cleanup()

    async def generate_with_tools(self, provider_service, provider, model, query, tools):
        """Generate a response with tools and measure performance."""
        start_time = time.time()
        success = False
        error = None

        try:
            if provider == "openai":
                response = await provider_service._generate_openai_completion(
                    messages=[{"role": "user", "content": query}],
                    model=model,
                    tools=tools
                )
            else:  # ollama
                response = await provider_service._generate_ollama_completion(
                    messages=[{"role": "user", "content": query}],
                    model=model,
                    tools=tools
                )

            success = True
            tool_calls = response.get("tool_calls", [])
            # Content can be empty when the model answers only with tool calls
            message_content = response["message"]["content"] or ""

            # Determine if tools were used correctly
            tools_used = len(tool_calls) > 0

            # For Ollama (which might not have native tool support), check for tool-like patterns
            if not tools_used and provider == "ollama":
                # Check if response contains structured tool usage
                if "<tool>" in message_content:
                    tools_used = True

                # Look for patterns matching function names
                for tool in tools:
                    if tool["function"]["name"] in message_content:
                        tools_used = True
                        break

        except Exception as e:
            error = str(e)
            message_content = None
            tools_used = False
            tool_calls = []

        end_time = time.time()

        return {
            "success": success,
            "error": error,
            "duration": end_time - start_time,
            "message": message_content,
            "tools_used": tools_used,
            "tool_call_count": len(tool_calls),
            "tool_calls": tool_calls
        }

    @pytest.mark.asyncio
    async def test_tool_usage_benchmark(self, services):
        """Benchmark tool usage across providers and models."""
        provider_service = services["provider_service"]

        # Models to test
        models = {
            "openai": ["gpt-3.5-turbo", "gpt-4-turbo"],
            "ollama": ["llama2", "mistral"]
        }

        # Results
        results = []

        for query in TOOL_QUERIES:
            for provider in models:
                for model in models[provider]:
                    print(f"Testing {provider}:{model} with tools query: {query[:30]}...")

                    try:
                        metrics = await self.generate_with_tools(
                            provider_service,
                            provider,
                            model,
                            query,
                            BENCHMARK_TOOLS
                        )

                        results.append({
                            "provider": provider,
                            "model": model,
                            "query": query,
                            **metrics
                        })

                        # Save raw response. metrics["error"] is None on success,
                        # so build the text explicitly instead of concatenating.
                        if metrics["success"]:
                            response_text = metrics["message"]
                        else:
                            response_text = f"ERROR: {metrics['error'] or 'Unknown error'}"

                        model_safe_name = model.replace(":", "_")
                        os.makedirs("tool_benchmark_responses", exist_ok=True)
                        with open(f"tool_benchmark_responses/{provider}_{model_safe_name}.txt", "a") as f:
                            f.write(f"\nQuery: {query}\n\n")
                            f.write(f"Response: {response_text}\n")
                            if metrics.get('tool_calls'):
                                f.write("\nTool Calls:\n")
                                f.write(json.dumps(metrics['tool_calls'], indent=2))
                            f.write("\n" + "-" * 80 + "\n")

                        # Add a delay to avoid rate limits
                        await asyncio.sleep(2)
                    except Exception as e:
                        print(f"Error in benchmark: {str(e)}")

        # Create DataFrame
        df = pd.DataFrame(results)

        # Save raw results
        df.to_csv("benchmark_tool_usage_raw.csv", index=False)

        # Create summary
        tool_usage_summary = df.groupby(["provider", "model"])[
            ["success", "tools_used", "tool_call_count", "duration"]
        ].agg({
            "success": "mean",
            "tools_used": "mean",
            "tool_call_count": "mean",
            "duration": "mean"
        }).round(3)

        # Rename columns for clarity
        tool_usage_summary.columns = [
            "Success Rate",
            "Tool Usage Rate",
            "Avg Tool Calls",
            "Avg Duration (s)"
        ]

        # Save summary
        tool_usage_summary.to_csv("benchmark_tool_usage_summary.csv")

        # Create visualizations
        plt.figure(figsize=(15, 10))

        # Tool usage rate by model
        plt.subplot(2, 2, 1)
        tool_usage_summary["Tool Usage Rate"].plot(kind='bar')
        plt.title("Tool Usage Rate by Model")
        plt.ylabel("Rate (0-1)")
        plt.ylim(0, 1)
        plt.xticks(rotation=45)

        # Average tool calls by model
        plt.subplot(2, 2, 2)
        tool_usage_summary["Avg Tool Calls"].plot(kind='bar')
        plt.title("Average Tool Calls per Query")
        plt.ylabel("Count")
        plt.xticks(rotation=45)

        # Success rate by model
        plt.subplot(2, 2, 3)
        tool_usage_summary["Success Rate"].plot(kind='bar')
        plt.title("Success Rate")
        plt.ylabel("Rate (0-1)")
        plt.ylim(0, 1)
        plt.xticks(rotation=45)

        # Average duration by model
        plt.subplot(2, 2, 4)
        tool_usage_summary["Avg Duration (s)"].plot(kind='bar')
        plt.title("Average Response Time")
        plt.ylabel("Seconds")
        plt.xticks(rotation=45)

        plt.tight_layout()
        plt.savefig('benchmark_tool_usage.png')

        # Print summary to console
        print("\nTool Usage Benchmark Results:")
        print(tool_usage_summary.to_string())

        # Qualitative analysis - extract patterns in tool usage
        if len(df[df["tools_used"]]) > 0:
            print("\nQualitative Analysis of Tool Usage:")

            # Comparison between providers
            openai_correct = df[(df["provider"] == "openai") & (df["tools_used"])].shape[0]
            openai_total = df[df["provider"] == "openai"].shape[0]
            openai_rate = openai_correct / openai_total if openai_total > 0 else 0

            ollama_correct = df[(df["provider"] == "ollama") & (df["tools_used"])].shape[0]
            ollama_total = df[df["provider"] == "ollama"].shape[0]
            ollama_rate = ollama_correct / ollama_total if ollama_total > 0 else 0

            print(f"OpenAI tool usage rate: {openai_rate:.2f}")
            print(f"Ollama tool usage rate: {ollama_rate:.2f}")

            if openai_rate > 0 and ollama_rate > 0:
                ratio = openai_rate / ollama_rate
                print(f"OpenAI is {ratio:.2f}x more likely to use tools correctly")

        # Additional insights
        if "openai" in df["provider"].values and "ollama" in df["provider"].values:
            openai_time = df[df["provider"] == "openai"]["duration"].mean()
            ollama_time = df[df["provider"] == "ollama"]["duration"].mean()

            if openai_time > 0 and ollama_time > 0:
                time_ratio = openai_time / ollama_time
                print(f"Time ratio (OpenAI/Ollama): {time_ratio:.2f}")
                if time_ratio > 1:
                    print(f"Ollama is {time_ratio:.2f}x faster for tool-related queries")
                else:
                    print(f"OpenAI is {1/time_ratio:.2f}x faster for tool-related queries")

        # Assert something meaningful
        assert len(results) > 0, "No benchmark results collected"
```
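The benchmark counts any tool call, or for Ollama any mention of a tool name in the text, as tool usage, which can overstate how often a tool was used correctly. A stricter check can be sketched as below. It is a minimal illustration under stated assumptions: `is_valid_tool_call` is a hypothetical helper, tool calls are assumed to be dicts with a `function` entry holding `name` and `arguments`, and `arguments` is accepted either as a JSON string or as an already-parsed dict. The `EXAMPLE_TOOLS` list is a trimmed copy of the `get_weather` definition used above.

```python
import json
from typing import Any, Dict, List

# Trimmed example in the same shape as BENCHMARK_TOOLS above.
EXAMPLE_TOOLS: List[Dict[str, Any]] = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "parameters": {
                "type": "object",
                "properties": {"location": {"type": "string"}},
                "required": ["location"],
            },
        },
    }
]


def is_valid_tool_call(call: Dict[str, Any], tools: List[Dict[str, Any]]) -> bool:
    """Return True if the call names a known tool and supplies its required arguments."""
    schemas = {t["function"]["name"]: t["function"]["parameters"] for t in tools}
    function = call.get("function", {})
    schema = schemas.get(function.get("name"))
    if schema is None:
        return False

    arguments = function.get("arguments", {})
    if isinstance(arguments, str):
        # Arguments delivered as a JSON-encoded string must parse cleanly.
        try:
            arguments = json.loads(arguments)
        except json.JSONDecodeError:
            return False
    if not isinstance(arguments, dict):
        return False

    return all(name in arguments for name in schema.get("required", []))
```

A "correct tool usage" rate could then be computed as the share of queries where at least one call passes this check, which separates well-formed calls from incidental name mentions.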
Pytest Configuration
```ini
# pytest.ini
[pytest]
markers =
    unit: marks tests as unit tests
    integration: marks tests as integration tests
    performance: marks tests as performance tests
    reliability: marks tests as reliability tests
    benchmark: marks tests as benchmarks

testpaths = tests

python_files = test_*.py
python_classes = Test*
python_functions = test_*

# Don't run performance, reliability, or benchmark tests by default
addopts = -m "not performance and not reliability and not benchmark"

# Configure test outputs
junit_family = xunit2

# Add environment variables for default runs (requires the pytest-env plugin)
env =
    PYTHONPATH=.
    OPENAI_MODEL=gpt-3.5-turbo
    OLLAMA_MODEL=llama2
    OLLAMA_HOST=http://localhost:11434
```
Test Documentation
````markdown
# Testing Strategy for OpenAI-Ollama Integration

This document outlines the comprehensive testing approach for the hybrid AI system that integrates OpenAI and Ollama.

## 1. Unit Testing

Unit tests verify the functionality of individual components in isolation:

- **Provider Service**: Tests for provider selection logic, auto-routing, and fallback mechanisms
- **Ollama Service**: Tests for response formatting, tool extraction, and error handling
- **Model Selection**: Tests for use case detection and model recommendation logic
- **Tool Integration**: Tests for proper handling of tool calls and responses

Run unit tests with:
```bash
python -m pytest tests/unit -v
```

## 2. Integration Testing

Integration tests verify the interaction between components:

- **API Endpoints**: Tests for proper request handling, authentication, and response formatting
- **End-to-End Agent Flows**: Tests for agent behavior across different scenarios
- **Cross-Provider Integration**: Tests for seamless integration between OpenAI and Ollama

Run integration tests with:
```bash
python -m pytest tests/integration -v
```

## 3. Performance Testing

Performance tests measure system performance characteristics:

- **Response Latency**: Compares response times across providers and models
- **Memory Usage**: Measures memory consumption during request processing
- **Response Quality**: Evaluates the quality of responses using GPT-4 as a judge

Run performance tests with:
```bash
python -m pytest tests/performance -v
```

## 4. Reliability Testing

Reliability tests verify the system's behavior under various conditions:

- **Error Handling**: Tests for proper error detection and fallback mechanisms
- **Load Testing**: Measures system performance under concurrent requests
- **Stability Testing**: Evaluates system behavior during extended conversations

Run reliability tests with:
```bash
python -m pytest tests/reliability -v
```

## 5. Benchmark Framework

Comprehensive benchmarks for comparative analysis:

- **Quality Matrix**: Compares response quality across providers and models
- **Efficiency Analysis**: Measures performance/cost characteristics
- **Tool Usage Comparison**: Evaluates tool handling capabilities

Run benchmarks with:
```bash
python -m pytest tests/benchmarks -v
```

## Running the Complete Test Suite

Use the test orchestration script to run all test suites:
```bash
python scripts/run_tests.py --all
```

## CI/CD Integration

The test suite is integrated with the GitHub Actions workflow:
```bash
# Triggered on push to main/develop or manually via workflow_dispatch
git push origin main  # Automatically runs tests
```

## Prerequisites

- OpenAI API key in environment variables:
  ```bash
  export OPENAI_API_KEY=sk-...
  ```
- Running Ollama instance:
  ```bash
  ollama serve
  ```
- Required models for Ollama:
  ```bash
  ollama pull llama2
  ollama pull mistral
  ```
````
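The test documentation refers to `scripts/run_tests.py`, which this section does not show. A minimal sketch of what such an orchestration script might look like follows, assuming the `tests/<suite>` directory layout named in the documentation. Only the `--all` flag comes from the documentation; the `--suite` flag, the default of running only unit tests, and the function names are hypothetical.

```python
import argparse
import subprocess
import sys
from typing import List, Optional

# Suite name -> test directory, following the layout described in the documentation.
SUITES = {
    "unit": "tests/unit",
    "integration": "tests/integration",
    "performance": "tests/performance",
    "reliability": "tests/reliability",
    "benchmarks": "tests/benchmarks",
}


def build_commands(selected: List[str]) -> List[List[str]]:
    """Build one pytest command line per selected suite."""
    return [[sys.executable, "-m", "pytest", SUITES[name], "-v"] for name in selected]


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Run the project's test suites")
    parser.add_argument("--all", action="store_true", help="Run every suite")
    parser.add_argument(
        "--suite",
        action="append",
        choices=sorted(SUITES),
        default=[],
        help="Suite to run (hypothetical flag; may be repeated)",
    )
    return parser.parse_args(argv)


def main(argv: Optional[List[str]] = None) -> int:
    """Run the selected suites in order and return the first non-zero exit code."""
    args = parse_args(argv)
    selected = list(SUITES) if args.all else (args.suite or ["unit"])
    exit_code = 0
    for command in build_commands(selected):
        print("Running:", " ".join(command))
        result = subprocess.run(command)
        exit_code = exit_code or result.returncode
    return exit_code
```

To use this as a script, call `sys.exit(main())` under the usual `if __name__ == "__main__":` guard. Running the suites as separate pytest processes keeps a crash in one suite from hiding the results of the others.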
Conclusion
This comprehensive testing strategy provides a robust framework for validating the hybrid AI architecture that integrates OpenAI's cloud capabilities with Ollama's local model inference. By implementing this multi-faceted testing approach, we ensure:
1. Functional Correctness: Unit and integration tests verify that all components function as expected both individually and when integrated.
2. Performance Optimization: Benchmarks and performance tests provide quantitative data to guide resource allocation and routing decisions.
3. Reliability: Load and stability tests ensure the system remains responsive and produces consistent results under various conditions.
4. Quality Assurance: Response quality evaluations ensure that the system maintains high standards regardless of which provider handles the inference.
The test suite is designed to be extensible, allowing for additional test cases as the system evolves. By automating this testing strategy through CI/CD pipelines, we maintain ongoing quality assurance and enable continuous improvement of the hybrid AI architecture.
User Interface Design for Hybrid OpenAI-Ollama MCP System
Conceptual Framework for Interface Design
The Modern Computational Paradigm (MCP) system, which integrates cloud-based intelligence with local inference capabilities, requires a thoughtfully designed interface that balances simplicity with advanced functionality. This document presents a comprehensive design approach for both command-line and web interfaces that expose the system's capabilities while maintaining an intuitive user experience.
Command Line Interface (CLI) Design
CLI Architecture
```text
┌─────────────────────────────────────────────────────────────┐
│                                                             │
│                           MCP-CLI                           │
│                                                             │
│  ┌─────────────┐   ┌─────────────┐   ┌──────────────────┐   │
│  │ Core Module │   │   Config    │   │ Interactive Mode │   │
│  └─────────────┘   └─────────────┘   └──────────────────┘   │
│         │                 │                   │             │
│         ▼                 ▼                   ▼             │
│  ┌─────────────┐   ┌─────────────┐   ┌──────────────────┐   │
│  │  Agent API  │   │    Model    │   │     Session      │   │
│  │   Client    │   │ Management  │   │    Management    │   │
│  └─────────────┘   └─────────────┘   └──────────────────┘   │
│         │                 │                   │             │
│         └─────────────────┼───────────────────┘             │
│                           │                                 │
│                           ▼                                 │
│                    ┌─────────────┐                          │
│                    │   Output    │                          │
│                    │ Formatting  │                          │
│                    └─────────────┘                          │
│                                                             │
└─────────────────────────────────────────────────────────────┘
```
CLI Wireframes
Main Help Screen
```text
┌─────────────────────────────────────────────────────────────────────────┐
│                                                                         │
│  MCP CLI v1.0.0                                                         │
│                                                                         │
│  USAGE:                                                                 │
│    mcp [OPTIONS] COMMAND [ARGS]...                                      │
│                                                                         │
│  OPTIONS:                                                               │
│    --config PATH    Path to config file                                 │
│    --verbose        Enable verbose output                               │
│    --help           Show this message and exit                          │
│                                                                         │
│  COMMANDS:                                                              │
│    chat        Start a chat session                                     │
│    complete    Get a completion for a prompt                            │
│    models      List and manage available models                         │
│    config      Configure MCP settings                                   │
│    agents      Manage agent profiles                                    │
│    session     Manage saved sessions                                    │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘
```
#### Interactive Chat Mode
```text
┌─────────────────────────────────────────────────────────────────────────┐
│                                                                         │
│  MCP Chat Session - ID: chat_78f3d2                                     │
│  Model: auto-select | Provider: auto | Agent: research                  │
│                                                                         │
│  Type 'exit' to quit, 'help' for commands, 'models' to switch models    │
│  ─────────────────────────────────────────────────────────────────────  │
│                                                                         │
│  You: Tell me about quantum computing                                   │
│                                                                         │
│  MCP [OpenAI:gpt-4]: Quantum computing is a type of computation that    │
│  harnesses quantum mechanical phenomena like superposition and          │
│  entanglement to process information in ways that classical computers   │
│  cannot.                                                                │
│                                                                         │
│  Unlike classical bits that exist in a state of either 0 or 1, quantum  │
│  bits or "qubits" can exist in multiple states simultaneously due to    │
│  superposition. This potentially allows quantum computers to explore    │
│  multiple solutions to a problem at once.                               │
│                                                                         │
│  [Response continues for several more paragraphs...]                    │
│                                                                         │
│  You: Can you explain quantum entanglement more simply?                 │
│                                                                         │
│  MCP [Ollama:mistral]: █                                                │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘
```
#### Model Management Screen
```text
┌─────────────────────────────────────────────────────────────────────────┐
│                                                                         │
│  MCP Models                                                             │
│                                                                         │
│  AVAILABLE MODELS:                                                      │
│                                                                         │
│  OpenAI:                                                                │
│  [✓] gpt-4-turbo     - Advanced reasoning, current knowledge            │
│  [✓] gpt-3.5-turbo   - Fast, efficient for standard tasks               │
│                                                                         │
│  Ollama:                                                                │
│  [✓] llama2          - General purpose local model                      │
│  [✓] mistral         - Strong reasoning, 8k context window              │
│  [✓] codellama       - Specialized for code generation                  │
│  [ ] wizard-math     - Mathematical problem-solving                     │
│                                                                         │
│  COMMANDS:                                                              │
│                                                                         │
│  pull MODEL_NAME         - Download a model to Ollama                   │
│  info MODEL_NAME         - Show detailed model information              │
│  benchmark MODEL_NAME    - Run performance benchmark                    │
│  set-default MODEL_NAME  - Set as default model                         │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘
```
#### Agent Configuration Screen
```text
┌─────────────────────────────────────────────────────────────────────────┐
│                                                                         │
│  MCP Agent Configuration                                                │
│                                                                         │
│  AVAILABLE AGENTS:                                                      │
│                                                                         │
│  [✓] general         - General purpose assistant                        │
│  [✓] research        - Research specialist with knowledge tools         │
│  [✓] coding          - Code assistant with tool integration             │
│  [✓] creative        - Creative writing and content generation          │
│                                                                         │
│  CUSTOM AGENTS:                                                         │
│                                                                         │
│  [✓] my-math-tutor   - Mathematics teaching and problem solving         │
│  [✓] data-analyst    - Data analysis with visualization tools           │
│                                                                         │
│  COMMANDS:                                                              │
│                                                                         │
│  create NAME       - Create a new custom agent                          │
│  edit NAME         - Edit an existing agent                             │
│  delete NAME       - Delete a custom agent                              │
│  export NAME FILE  - Export agent configuration                         │
│  import FILE       - Import agent configuration                         │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘
```
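The `export NAME FILE` and `import FILE` commands in the wireframe imply some on-disk format for agent profiles. The guide does not define one, so the following is only a sketch under an assumed JSON schema: `AgentProfile` and its fields (`name`, `description`, `instructions`, `capabilities`) are hypothetical names chosen for illustration.

```python
import json
from dataclasses import asdict, dataclass, field, fields
from pathlib import Path
from typing import List


@dataclass
class AgentProfile:
    # Hypothetical schema: these field names are assumptions, not a defined MCP format.
    name: str
    description: str
    instructions: str = ""
    capabilities: List[str] = field(default_factory=list)


def export_agent(profile: AgentProfile, path: Path) -> None:
    """Write the profile to `path` as pretty-printed JSON."""
    path.write_text(json.dumps(asdict(profile), indent=2), encoding="utf-8")


def import_agent(path: Path) -> AgentProfile:
    """Read a profile back, rejecting files that lack the required fields."""
    data = json.loads(path.read_text(encoding="utf-8"))
    missing = {"name", "description"} - set(data)
    if missing:
        raise ValueError(f"agent file is missing fields: {sorted(missing)}")
    known = {f.name for f in fields(AgentProfile)}
    # Unknown keys are ignored so that files with extra fields still load.
    return AgentProfile(**{k: v for k, v in data.items() if k in known})
```

Keeping the profile a plain dataclass means an exported file round-trips to an equal object, which makes the export and import commands easy to test without a running backend.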
### CLI Interaction Flow
```text
┌─────────────┐     ┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│             │     │             │     │             │     │             │
│  Start CLI  │────▶│ Select Mode │────▶│ Set Config  │────▶│   Session   │
│             │     │             │     │             │     │ Interaction │
└─────────────┘     └─────────────┘     └─────────────┘     └──────┬──────┘
                                                                   │
┌─────────────┐     ┌─────────────┐     ┌─────────────┐     ┌──────▼──────┐
│             │     │             │     │             │     │             │
│   Export    │◀────│   Session   │◀────│  Generate   │◀────│    User     │
│   Results   │     │ Management  │     │  Response   │     │   Prompt    │
│             │     │             │     │             │     │             │
└─────────────┘     └─────────────┘     └─────────────┘     └─────────────┘
```
### CLI Implementation Example

```python
# mcp_cli.py
import argparse
import os
import sys
import time

import colorama
import requests
import yaml
from colorama import Fore, Style
from prompt_toolkit import PromptSession
from prompt_toolkit.auto_suggest import AutoSuggestFromHistory
from prompt_toolkit.completion import WordCompleter
from prompt_toolkit.formatted_text import ANSI
from prompt_toolkit.history import FileHistory
from rich.console import Console
from rich.markdown import Markdown
from rich.panel import Panel
from rich.progress import Progress

# Initialize colorama for cross-platform color support
colorama.init()
console = Console()

CONFIG_PATH = os.path.expanduser("~/.mcp/config.yaml")
HISTORY_PATH = os.path.expanduser("~/.mcp/history")  # directory for prompt history files
API_URL = "http://localhost:8000/api/v1"


def ensure_config_dir():
    """Ensure the config and history directories exist."""
    config_dir = os.path.dirname(CONFIG_PATH)
    if config_dir:
        os.makedirs(config_dir, exist_ok=True)
    # chat_command stores its history file inside HISTORY_PATH, so create that directory itself
    os.makedirs(HISTORY_PATH, exist_ok=True)


def load_config():
    """Load configuration from file."""
    ensure_config_dir()

    if not os.path.exists(CONFIG_PATH):
        # Create default config
        config = {
            "api": {
                "url": API_URL,
                "key": None
            },
            "defaults": {
                "model": "auto",
                "provider": "auto",
                "agent": "general"
            },
            "output": {
                "format": "markdown",
                "show_model_info": True
            }
        }

        with open(CONFIG_PATH, 'w') as f:
            yaml.dump(config, f, default_flow_style=False)

        console.print(f"Created default config at {CONFIG_PATH}", style="yellow")
        return config

    with open(CONFIG_PATH, 'r') as f:
        return yaml.safe_load(f)


def save_config(config):
    """Save configuration to file."""
    with open(CONFIG_PATH, 'w') as f:
        yaml.dump(config, f, default_flow_style=False)


def get_api_key(config):
    """Get API key from config or environment."""
    if config["api"]["key"]:
        return config["api"]["key"]

    env_key = os.environ.get("MCP_API_KEY")
    if env_key:
        return env_key

    # If no key is configured, prompt the user
    console.print("No API key found. Please enter your API key:", style="yellow")
    key = input("> ")

    if key:
        config["api"]["key"] = key
        save_config(config)
        return key

    console.print("No API key provided. Some features may not work.", style="red")
    return None


def make_api_request(endpoint, method="GET", data=None, config=None):
    """Make an API request to the MCP backend."""
    if config is None:
        config = load_config()

    api_key = get_api_key(config)
    headers = {
        "Content-Type": "application/json"
    }

    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"

    url = f"{config['api']['url']}/{endpoint.lstrip('/')}"

    try:
        if method == "GET":
            response = requests.get(url, headers=headers)
        elif method == "POST":
            response = requests.post(url, headers=headers, json=data)
        else:
            raise ValueError(f"Unsupported HTTP method: {method}")

        response.raise_for_status()
        return response.json()
    except requests.exceptions.RequestException as e:
        console.print(f"API request failed: {str(e)}", style="red")
        return None


def display_response(response_text, format_type="markdown"):
    """Display a response with appropriate formatting."""
    if format_type == "markdown":
        console.print(Markdown(response_text))
    else:
        console.print(response_text)


def chat_command(args, config):
    """Start an interactive chat session."""
    session_id = args.session_id
    model_name = args.model or config["defaults"]["model"]
    provider = args.provider or config["defaults"]["provider"]
    agent_type = args.agent or config["defaults"]["agent"]

    console.print(Panel(f"Starting MCP Chat Session\nModel: {model_name} | Provider: {provider} | Agent: {agent_type}"))
    console.print("Type 'exit' to quit, 'help' for commands", style="dim")

    # Set up prompt session with history
    ensure_config_dir()
    history_file = os.path.join(HISTORY_PATH, "chat_history")
    session = PromptSession(
        history=FileHistory(history_file),
        auto_suggest=AutoSuggestFromHistory(),
        completer=WordCompleter(['exit', 'help', 'models', 'clear', 'save', 'switch'])
    )

    # Initial session data
    if not session_id:
        # Create a new session
        pass

    while True:
        try:
            # Wrap the colorama escape codes in ANSI() so prompt_toolkit renders
            # them as colors instead of printing them literally
            user_input = session.prompt(ANSI(f"{Fore.GREEN}You: {Style.RESET_ALL}"))

            if user_input.lower() in ('exit', 'quit'):
                break

            if not user_input.strip():
                continue

            # Handle special commands
            if user_input.lower() == 'help':
                console.print(Panel("""
                Available commands:
                - exit/quit: Exit the chat session
                - clear: Clear the current conversation
                - save FILENAME: Save conversation to file
                - models: List available models
                - switch MODEL: Switch to a different model
                - provider PROVIDER: Switch to a different provider
                """))
                continue

            # For normal input, send to API
            with Progress() as progress:
                task = progress.add_task("[cyan]Generating response...", total=None)

                data = {
                    "message": user_input,
                    "session_id": session_id,
                    "model_params": {
                        "provider": provider,
                        "model": model_name,
                        "auto_select": provider == "auto"
                    }
                }

                response = make_api_request("chat", method="POST", data=data, config=config)
                progress.update(task, completed=100)

            if response:
                session_id = response["session_id"]
                model_used = response.get("model_used", model_name)
                provider_used = response.get("provider_used", provider)

                # Display provider and model info if configured.
                # markup=False keeps rich from reading "[provider:model]" as a style tag.
                if config["output"]["show_model_info"]:
                    console.print(f"\nMCP [{provider_used}:{model_used}]:", style="blue", markup=False)
                else:
                    console.print("\nMCP:", style="blue")

                display_response(response["response"], config["output"]["format"])
                console.print()  # Empty line for readability

        except KeyboardInterrupt:
            break
        except EOFError:
            break
        except Exception as e:
            console.print(f"Error: {str(e)}", style="red")

    console.print("Chat session ended")


def models_command(args, config):
    """List and manage available models."""
    if args.pull:
        # Pull a new model for Ollama
        console.print(f"Pulling Ollama model: {args.pull}")

        with Progress() as progress:
            task = progress.add_task(f"[cyan]Pulling {args.pull}...", total=None)

            # This would actually call Ollama API
            time.sleep(2)  # Simulating download

            progress.update(task, completed=100)

        console.print(f"Successfully pulled {args.pull}", style="green")
        return

    # List available models
    console.print(Panel("Available Models"))

    console.print("\n[bold]OpenAI Models:[/bold]")
    openai_models = [
        {"name": "gpt-4-turbo", "description": "Advanced reasoning, current knowledge"},
        {"name": "gpt-3.5-turbo", "description": "Fast, efficient for standard tasks"}
    ]

    for model in openai_models:
        console.print(f"  • {model['name']} - {model['description']}")

    console.print("\n[bold]Ollama Models:[/bold]")

    # In a real implementation, this would fetch from Ollama API
    ollama_models = [
        {"name": "llama2", "description": "General purpose local model", "installed": True},
        {"name": "mistral", "description": "Strong reasoning, 8k context window", "installed": True},
        {"name": "codellama", "description": "Specialized for code generation", "installed": True},
        {"name": "wizard-math", "description": "Mathematical problem-solving", "installed": False}
    ]

    for model in ollama_models:
        status = "[green]✓[/green]" if model["installed"] else "[red]✗[/red]"
        console.print(f"  {status} {model['name']} - {model['description']}")

    console.print("\nUse 'mcp models --pull MODEL_NAME' to download a model")


def config_command(args, config):
    """View or edit configuration."""
    if args.set:
        # Set a configuration value
        key, value = args.set.split('=', 1)
        keys = key.split('.')

        # Navigate to the nested key
        current = config
        for k in keys[:-1]:
            if k not in current:
                current[k] = {}
            current = current[k]

        # Set the value (with type conversion)
        if value.lower() == 'true':
            current[keys[-1]] = True
        elif value.lower() == 'false':
            current[keys[-1]] = False
        elif value.isdigit():
            current[keys[-1]] = int(value)
        else:
            current[keys[-1]] = value

        save_config(config)
        console.print(f"Configuration updated: {key} = {value}", style="green")
        return

    # Display current configuration
    console.print(Panel("MCP Configuration"))
    console.print(yaml.dump(config))
    console.print("\nUse 'mcp config --set key.path=value' to change settings")


def agent_command(args, config):
    """Manage agent profiles."""
    if args.create:
        # Create a new agent profile
        console.print(f"Creating agent profile: {args.create}")
        # Implementation would collect agent parameters
        return

    if args.edit:
        # Edit an existing agent profile
        console.print(f"Editing agent profile: {args.edit}")
        return

    # List available agents
    console.print(Panel("Available Agents"))

    console.print("\n[bold]System Agents:[/bold]")
    system_agents = [
        {"name": "general", "description": "General purpose assistant"},
        {"name": "research", "description": "Research specialist with knowledge tools"},
        {"name": "coding", "description": "Code assistant with tool integration"},
        {"name": "creative", "description": "Creative writing and content generation"}
    ]

    for agent in system_agents:
        console.print(f"  • {agent['name']} - {agent['description']}")

    # In a real implementation, this would load from user config
    custom_agents = [
        {"name": "my-math-tutor", "description": "Mathematics teaching and problem solving"},
        {"name": "data-analyst", "description": "Data analysis with visualization tools"}
    ]

    if custom_agents:
        console.print("\n[bold]Custom Agents:[/bold]")
        for agent in custom_agents:
            console.print(f"  • {agent['name']} - {agent['description']}")

    console.print("\nUse 'mcp agents --create NAME' to create a new agent")


def main():
    """Main entry point for the CLI."""
    global CONFIG_PATH

    parser = argparse.ArgumentParser(description="MCP Command Line Interface")
    parser.add_argument('--config', help="Path to config file")
    parser.add_argument('--verbose', action='store_true', help="Enable verbose output")

    subparsers = parser.add_subparsers(dest='command', help='Command to run')

    # Chat command
    chat_parser = subparsers.add_parser('chat', help='Start a chat session')
    chat_parser.add_argument('--model', help='Model to use')
    chat_parser.add_argument('--provider', choices=['openai', 'ollama', 'auto'], help='Provider to use')
    chat_parser.add_argument('--agent', help='Agent type to use')
    chat_parser.add_argument('--session-id', help='Resume an existing session')

    # Complete command (one-shot completion)
    complete_parser = subparsers.add_parser('complete', help='Get a completion for a prompt')
    complete_parser.add_argument('prompt', help='Prompt text')
    complete_parser.add_argument('--model', help='Model to use')
    complete_parser.add_argument('--provider', choices=['openai', 'ollama', 'auto'], help='Provider to use')

    # Models command
    models_parser = subparsers.add_parser('models', help='List and manage available models')
    models_parser.add_argument('--pull', metavar='MODEL_NAME', help='Download a model to Ollama')
    models_parser.add_argument('--info', metavar='MODEL_NAME', help='Show detailed model information')
    models_parser.add_argument('--benchmark', metavar='MODEL_NAME', help='Run performance benchmark')

    # Config command
    config_parser = subparsers.add_parser('config', help='Configure MCP settings')
    config_parser.add_argument('--set', metavar='KEY=VALUE', help='Set a configuration value')

    # Agents command
    agents_parser = subparsers.add_parser('agents', help='Manage agent profiles')
    agents_parser.add_argument('--create', metavar='NAME', help='Create a new custom agent')
    agents_parser.add_argument('--edit', metavar='NAME', help='Edit an existing agent')
    agents_parser.add_argument('--delete', metavar='NAME', help='Delete a custom agent')

    # Session command
    session_parser = subparsers.add_parser('session', help='Manage saved sessions')
    session_parser.add_argument('--list', action='store_true', help='List saved sessions')
    session_parser.add_argument('--delete', metavar='SESSION_ID', help='Delete a session')
    session_parser.add_argument('--export', metavar='SESSION_ID', help='Export a session')

    args = parser.parse_args()

    # Load configuration, honoring --config so every load and save uses the same file
    if args.config:
        if not os.path.exists(args.config):
            console.print(f"Config file not found: {args.config}", style="red")
            return 1
        CONFIG_PATH = args.config

    config = load_config()

    # Execute the appropriate command
    if args.command == 'chat':
        chat_command(args, config)
    elif args.command == 'complete':
        # Implementation for complete command
        pass
    elif args.command == 'models':
        models_command(args, config)
    elif args.command == 'config':
        config_command(args, config)
    elif args.command == 'agents':
        agent_command(args, config)
    elif args.command == 'session':
        # Implementation for session command
        pass
    else:
        # No command specified, show help
        parser.print_help()

    return 0


if __name__ == "__main__":
    sys.exit(main())
```
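The CLI example hard-codes its Ollama model list and notes that a real implementation would fetch it from the Ollama API. One possible way to do that is sketched below, assuming Ollama is listening on its default local address and using its `/api/tags` endpoint, which lists locally installed models. The helper names (`fetch_tags`, `installed_models`, `mark_installed`) and the injectable `fetch` argument are illustrative choices, not part of the original CLI.

```python
import json
from typing import Callable, Dict, List
from urllib.request import urlopen

OLLAMA_URL = "http://localhost:11434"  # assumed: Ollama's default local address


def fetch_tags(base_url: str = OLLAMA_URL) -> Dict:
    """Call Ollama's /api/tags endpoint, which lists locally installed models."""
    with urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return json.load(resp)


def installed_models(fetch: Callable[[], Dict] = fetch_tags) -> List[str]:
    """Return sorted base names of installed models, e.g. 'llama2' for 'llama2:latest'."""
    payload = fetch()
    return sorted({m["name"].split(":")[0] for m in payload.get("models", [])})


def mark_installed(catalog: List[Dict], installed: List[str]) -> List[Dict]:
    """Copy each catalog entry and add an 'installed' flag for display."""
    return [{**entry, "installed": entry["name"] in installed} for entry in catalog]
```

Passing `fetch` as a parameter lets the listing logic be exercised with a stub when Ollama is not running; in `models_command` the result of `mark_installed` could replace the hard-coded `ollama_models` list.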
Web Interface Design
Web Interface Architecture
text1┌────────────────────────────────────────────────────────────────────┐2│ │3│ React Frontend │4│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌───────────┐ │5│ │ Chat │ │ Model │ │ Agent │ │ Settings │ │6│ │ Interface │ │ Management │ │ Configuration│ │ Manager │ │7│ └──────────────┘ └──────────────┘ └──────────────┘ └───────────┘ │8│ │ │ │ │ │9│ └───────────────┼────────────────┼───────────────┘ │10│ │ │ │11│ ▼ ▼ │12│ ┌─────────────┐ ┌────────────┐ │13│ │ Auth │ │ API Client │ │14│ │ Management │ │ │ │15│ └─────────────┘ └────────────┘ │16│ │17└────────────────────────────────────────────────────────────────────┘18 │19 ▼20┌────────────────────────────────────────────────────────────────────┐21│ │22│ FastAPI Backend │23│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌───────────┐ │24│ │ Chat │ │ Model │ │ Agent │ │ User │ │25│ │ Controller │ │ Controller │ │ Controller │ │ Controller│ │26│ └──────────────┘ └──────────────┘ └──────────────┘ └───────────┘ │27│ │ │ │ │ │28│ └───────────────┼────────────────┼───────────────┘ │29│ │ │ │30│ ▼ ▼ │31│ ┌───────────────────┐ ┌────────────────────┐ │32│ │ Provider Service │ │ Agent Factory │ │33│ └───────────────────┘ └────────────────────┘ │34│ │ │ │35│ ▼ ▼ │36│ ┌─────────────┐ ┌─────────────┐ │37│ │ OpenAI API │ │ Ollama API │ │38│ └─────────────┘ └─────────────┘ │39│ │40└────────────────────────────────────────────────────────────────────┘
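In the architecture above, the Provider Service sits between the controllers and the two model APIs. A minimal sketch of that layer follows, assuming each provider can be reduced to a callable that maps a prompt to a response string; `ProviderService`, its constructor arguments, and `generate` are hypothetical names, and the real service would wrap the OpenAI and Ollama clients rather than plain functions.

```python
from typing import Callable, Dict, List, Optional, Tuple

Provider = Callable[[str], str]  # assumed shape: prompt in, response text out


class ProviderService:
    """Dispatch a prompt to one provider and fall back to the others on failure."""

    def __init__(self, providers: Dict[str, Provider], order: List[str]):
        self.providers = providers
        self.order = order  # default order in which providers are tried

    def generate(self, prompt: str, preferred: Optional[str] = None) -> Tuple[str, str]:
        """Return (provider_used, response), trying `preferred` first when given."""
        names = list(self.order)
        if preferred in self.providers:
            names = [preferred] + [n for n in names if n != preferred]

        errors = []
        for name in names:
            try:
                return name, self.providers[name](prompt)
            except Exception as exc:  # any provider error triggers the fallback
                errors.append(f"{name}: {exc}")
        raise RuntimeError("all providers failed: " + "; ".join(errors))
```

Returning the provider name alongside the text is what would let a controller populate a field such as `provider_used` in its response.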
Web Interface Wireframes
Chat Interface
text1┌─────────────────────────────────────────────────────────────────────────┐2│ MCP Assistant 🔄 New Chat ⚙️ │3├─────────────────────────────────────────────────────────────────────────┤4│ │5│ ┌─────────────────────────┐ ┌───────────────────────────────────────┐ │6│ │ Chat History │ │ │ │7│ │ │ │ User: Tell me about quantum computing │ │8│ │ Welcome │ │ │ │9│ │ Quantum Computing │ │ MCP: Quantum computing is a type of │ │10│ │ AI Ethics │ │ computation that harnesses quantum │ │11│ │ Python Tutorial │ │ mechanical phenomena like super- │ │12│ │ │ │ position and entanglement. │ │13│ │ │ │ │ │14│ │ │ │ Unlike classical bits that represent │ │15│ │ │ │ either 0 or 1, quantum bits or │ │16│ │ │ │ "qubits" can exist in multiple states │ │17│ │ │ │ simultaneously due to superposition. │ │18│ │ │ │ │ │19│ │ │ │ [Response continues...] │ │20│ │ │ │ │ │21│ │ │ │ User: How does quantum entanglement │ │22│ │ │ │ work? │ │23│ │ │ │ │ │24│ │ │ │ MCP is typing... │ │25│ │ │ │ │ │26│ └─────────────────────────┘ └───────────────────────────────────────┘ │27│ │28│ ┌─────────────────────────────────────────────────────────────────┐ │29│ │ Type your message... Send ▶ │ │30│ └─────────────────────────────────────────────────────────────────┘ │31│ │32│ Model: auto (OpenAI:gpt-4) | Mode: Research | Memory: Enabled │33│ │34└─────────────────────────────────────────────────────────────────────────┘
Model Settings Panel
text1┌─────────────────────────────────────────────────────────────────────────┐2│ MCP Assistant > Settings > Models ✖ │3├─────────────────────────────────────────────────────────────────────────┤4│ │5│ Model Selection │6│ ┌─────────────────────────────────────────────────────────────────┐ │7│ │ ● Auto-select model (recommended) │ │8│ │ ○ Specify model and provider │ │9│ └─────────────────────────────────────────────────────────────────┘ │10│ │11│ Provider Model │12│ ┌────────────┐ ┌────────────────────┐ │13│ │ OpenAI ▼ │ │ gpt-4-turbo ▼ │ │14│ └────────────┘ └────────────────────┘ │15│ │16│ Auto-Selection Preferences │17│ ┌─────────────────────────────────────────────────────────────────┐ │18│ │ Prioritize: ● Speed ○ Quality ○ Privacy ○ Cost │ │19│ │ │ │20│ │ Complexity threshold: ███████████░░░░░░░░░ 0.65 │ │21│ │ │ │22│ │ [✓] Prefer Ollama for privacy-sensitive content │ │23│ │ [✓] Use OpenAI for complex reasoning │ │24│ │ [✓] Automatically fall back if a provider fails │ │25│ └─────────────────────────────────────────────────────────────────┘ │26│ │27│ Available Ollama Models │28│ ┌─────────────────────────────────────────────────────────────────┐ │29│ │ ✓ llama2 ✓ mistral ✓ codellama │ │30│ │ ✓ wizard-math ✓ neural-chat ○ llama2:70b [Download] │ │31│ └─────────────────────────────────────────────────────────────────┘ │32│ │33│ [ Save Changes ] [ Cancel ] │34│ │35└─────────────────────────────────────────────────────────────────────────┘
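The auto-selection preferences in this panel can be read as a small decision rule. The sketch below is one possible interpretation, not the system's actual router: it assumes a complexity score in the range 0 to 1 is computed elsewhere, and the names `RoutingSettings` and `choose_provider`, as well as the choice of Ollama as the default for simple queries, are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class RoutingSettings:
    # Defaults mirror the wireframe; the field names are hypothetical.
    complexity_threshold: float = 0.65
    prefer_ollama_for_private: bool = True
    use_openai_for_complex: bool = True


def choose_provider(complexity: float, privacy_sensitive: bool,
                    settings: RoutingSettings = RoutingSettings()) -> str:
    """Pick 'ollama' or 'openai' for one query.

    `complexity` is assumed to be a score in [0, 1] produced by a separate
    estimator; how that score is computed is outside this sketch.
    """
    if not 0.0 <= complexity <= 1.0:
        raise ValueError("complexity must be between 0 and 1")
    # Privacy wins over everything else: sensitive content stays local.
    if privacy_sensitive and settings.prefer_ollama_for_private:
        return "ollama"
    # Queries at or above the threshold go to the cloud model.
    if complexity >= settings.complexity_threshold and settings.use_openai_for_complex:
        return "openai"
    # Assumed default: simple queries are served locally.
    return "ollama"
```

Checking the privacy flag before the complexity threshold means a sensitive but complex query still stays on the local model, which matches the order of the checkboxes but is a design choice a real deployment may want to revisit.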
Agent Configuration Panel
text1┌─────────────────────────────────────────────────────────────────────────┐2│ MCP Assistant > Settings > Agents ✖ │3├─────────────────────────────────────────────────────────────────────────┤4│ │5│ Current Agent: Research Assistant [Edit ✏] │6│ │7│ Agent Library │8│ ┌─────────────────────────────────────────────────────────────────┐ │9│ │ ● Research Assistant Knowledge-focused with search capability│ │10│ │ ○ Code Assistant Specialized for software development │ │11│ │ ○ Creative Writer Content creation and storytelling │ │12│ │ ○ Math Tutor Step-by-step problem solving │ │13│ │ ○ General Assistant Versatile helper for everyday tasks │ │14│ └─────────────────────────────────────────────────────────────────┘ │15│ │16│ Agent Capabilities │17│ ┌─────────────────────────────────────────────────────────────────┐ │18│ │ [✓] Knowledge retrieval [ ] Code execution │ │19│ │ [✓] Web search [ ] Data visualization │ │20│ │ [✓] Memory [ ] File operations │ │21│ │ [✓] Calendar awareness [ ] Email integration │ │22│ └─────────────────────────────────────────────────────────────────┘ │23│ │24│ System Instructions │25│ ┌─────────────────────────────────────────────────────────────────┐ │26│ │ You are a research assistant with expertise in finding and │ │27│ │ synthesizing information. Provide comprehensive, accurate │ │28│ │ answers with authoritative sources when available. │ │29│ │ │ │30│ │ │ │31│ └─────────────────────────────────────────────────────────────────┘ │32│ │33│ [ Save Agent ] [ Create New Agent ] [ Import ] [ Export ] │34│ │35└─────────────────────────────────────────────────────────────────────────┘
Dashboard View
text1┌─────────────────────────────────────────────────────────────────────────┐2│ MCP Assistant > Dashboard ⚙️ │3├─────────────────────────────────────────────────────────────────────────┤4│ │5│ System Status Last 24 Hours │6│ ┌────────────────────────────┐ ┌────────────────────────────────┐ │7│ │ OpenAI: ● Connected │ │ Requests: 143 │ │8│ │ Ollama: ● Connected │ │ OpenAI: 62% | Ollama: 38% │ │9│ │ Database: ● Operational │ │ Avg Response Time: 2.4s │ │10│ └────────────────────────────┘ └────────────────────────────────┘ │11│ │12│ Recent Conversations │13│ ┌─────────────────────────────────────────────────────────────────┐ │14│ │ ● Quantum Computing Research Today, 14:32 [Resume] │ │15│ │ ● Python Code Debugging Today, 10:15 [Resume] │ │16│ │ ● Travel Planning Yesterday [Resume] │ │17│ │ ● Financial Analysis 2 days ago [Resume] │ │18│ └─────────────────────────────────────────────────────────────────┘ │19│ │20│ Model Usage Agent Usage │21│ ┌────────────────────────────┐ ┌────────────────────────────────┐ │22│ │ ███ OpenAI:gpt-4 27% │ │ ███ Research Assistant 42% │ │23│ │ ███ OpenAI:gpt-3.5 35% │ │ ███ Code Assistant 31% │ │24│ │ ███ Ollama:mistral 20% │ │ ███ General Assistant 18% │ │25│ │ ███ Ollama:llama2 18% │ │ ███ Other 9% │ │26│ └────────────────────────────┘ └────────────────────────────────┘ │27│ │28│ API Credits │29│ ┌─────────────────────────────────────────────────────────────────┐ │30│ │ OpenAI: $4.32 used this month of $10.00 budget ████░░░░░ 43% │ │31│ │ Estimated savings from Ollama usage: $3.87 │ │32│ └─────────────────────────────────────────────────────────────────┘ │33│ │34│ [ New Chat ] [ View All Conversations ] [ System Settings ] │35│ │36└─────────────────────────────────────────────────────────────────────────┘
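The dashboard figures are related by simple arithmetic: the per-model usage percentages add up to the provider split, and the budget bar is the spend divided by the budget. A short sketch using the numbers from the wireframe (the function names are illustrative):

```python
from collections import defaultdict
from typing import Dict

# Per-model usage percentages as shown in the wireframe
MODEL_USAGE = {
    "OpenAI:gpt-4": 27,
    "OpenAI:gpt-3.5": 35,
    "Ollama:mistral": 20,
    "Ollama:llama2": 18,
}


def provider_share(usage: Dict[str, int]) -> Dict[str, int]:
    """Sum per-model percentages into per-provider percentages."""
    totals: Dict[str, int] = defaultdict(int)
    for key, percent in usage.items():
        provider = key.split(":", 1)[0]
        totals[provider] += percent
    return dict(totals)


def budget_used_percent(spent: float, budget: float) -> int:
    """Share of the monthly budget already spent, rounded to a whole percent."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    return round(100 * spent / budget)


print(provider_share(MODEL_USAGE))       # {'OpenAI': 62, 'Ollama': 38}
print(budget_used_percent(4.32, 10.00))  # 43
```

The results match the "OpenAI: 62% | Ollama: 38%" summary and the 43% budget bar in the wireframe.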
Web Interface Interaction Flow
text1┌──────────────┐ ┌───────────────┐ ┌────────────────┐2│ │ │ │ │ │3│ Login Page │────▶│ Dashboard │────▶│ Chat Interface│◀───┐4│ │ │ │ │ │ │5└──────────────┘ └───────┬───────┘ └────────┬───────┘ │6 │ │ │7 ▼ ▼ │8 ┌───────────────┐ ┌────────────────┐ │9 │ │ │ │ │10 │Settings Panel │ │ User Message │ │11 │ │ │ │ │12 └───┬───────────┘ └────────┬───────┘ │13 │ │ │14 ▼ ▼ │15 ┌────────────────┐ ┌────────────────┐ │16 │ │ │ │ │17 │Model Settings │ │API Processing │ │18 │ │ │ │ │19 └────────┬───────┘ └────────┬───────┘ │20 │ │ │21 ▼ ▼ │22 ┌────────────────┐ ┌────────────────┐ │23 │ │ │ │ │24 │Agent Settings │ │System Response │────┘25 │ │ │ │26 └────────────────┘ └────────────────┘
Key Web Components
ProviderSelector Component
```jsx
// ProviderSelector.jsx
import React, { useState, useEffect } from 'react';
// Select (not Dropdown) is the antd component that takes options/value/onChange
import { Select, Switch, Slider, Checkbox, Card, Alert } from 'antd';
import { ApiOutlined, QuestionCircleOutlined } from '@ant-design/icons';

const ProviderSelector = ({
  onProviderChange,
  onModelChange,
  initialProvider = 'auto',
  initialModel = null,
  showAdvanced = false
}) => {
  const [provider, setProvider] = useState(initialProvider);
  const [model, setModel] = useState(initialModel);
  const [autoSelect, setAutoSelect] = useState(initialProvider === 'auto');
  const [complexityThreshold, setComplexityThreshold] = useState(0.65);
  const [prioritizePrivacy, setPrioritizePrivacy] = useState(false);
  const [ollamaModels, setOllamaModels] = useState([]);
  const [ollamaStatus, setOllamaStatus] = useState('unknown'); // 'online', 'offline', 'unknown'
  const [openaiModels, setOpenaiModels] = useState([
    { value: 'gpt-4o', label: 'GPT-4o' },
    { value: 'gpt-4-turbo', label: 'GPT-4 Turbo' },
    { value: 'gpt-3.5-turbo', label: 'GPT-3.5 Turbo' }
  ]);

  // Fetch available Ollama models on component mount
  useEffect(() => {
    const fetchOllamaModels = async () => {
      try {
        const response = await fetch('/api/v1/models/ollama');
        if (response.ok) {
          const data = await response.json();
          setOllamaModels(data.models.map(m => ({
            value: m.name,
            label: m.name
          })));
          setOllamaStatus('online');
        } else {
          setOllamaStatus('offline');
        }
      } catch (error) {
        console.error('Error fetching Ollama models:', error);
        setOllamaStatus('offline');
      }
    };

    fetchOllamaModels();
  }, []);

  const handleProviderChange = (value) => {
    setProvider(value);
    onProviderChange(value);

    // Reset model when changing provider
    setModel(null);
    onModelChange(null);
  };

  const handleModelChange = (value) => {
    setModel(value);
    onModelChange(value);
  };

  const handleAutoSelectChange = (checked) => {
    setAutoSelect(checked);
    if (checked) {
      setProvider('auto');
      onProviderChange('auto');
      setModel(null);
      onModelChange(null);
    } else {
      // Default to OpenAI if disabling auto-select
      setProvider('openai');
      onProviderChange('openai');
      setModel('gpt-3.5-turbo');
      onModelChange('gpt-3.5-turbo');
    }
  };

  const providerOptions = [
    { value: 'openai', label: 'OpenAI' },
    { value: 'ollama', label: 'Ollama (Local)' },
    { value: 'auto', label: 'Auto-select' }
  ];

  return (
    <Card title="Model Selection" extra={<QuestionCircleOutlined />}>
      <div className="provider-selector">
        <div className="selector-row">
          <Switch
            checked={autoSelect}
            onChange={handleAutoSelectChange}
            checkedChildren="Auto-select"
            unCheckedChildren="Manual"
          />
          <span className="selector-label">
            {autoSelect ? 'Automatically select the best model for each query' : 'Manually choose provider and model'}
          </span>
        </div>

        {!autoSelect && (
          <div className="selector-row model-selection">
            <div className="provider-dropdown">
              <span>Provider:</span>
              <Select
                options={providerOptions}
                value={provider}
                onChange={handleProviderChange}
                disabled={autoSelect}
              />
            </div>

            <div className="model-dropdown">
              <span>Model:</span>
              <Select
                options={provider === 'openai' ? openaiModels : ollamaModels}
                value={model}
                onChange={handleModelChange}
                disabled={autoSelect}
                placeholder="Select a model"
              />
            </div>
          </div>
        )}

        {provider === 'ollama' && ollamaStatus === 'offline' && (
          <Alert
            message="Ollama is currently offline"
            description="Please start Ollama service to use local models."
            type="warning"
            showIcon
          />
        )}

        {showAdvanced && (
          <div className="advanced-settings">
            <div className="setting-header">Advanced Routing Settings</div>

            <div className="setting-row">
              <span>Complexity threshold:</span>
              <Slider
                value={complexityThreshold}
                onChange={setComplexityThreshold}
                min={0}
                max={1}
                step={0.05}
                disabled={!autoSelect}
              />
              <span className="setting-value">{complexityThreshold}</span>
            </div>

            <div className="setting-row">
              <Checkbox
                checked={prioritizePrivacy}
                onChange={e => setPrioritizePrivacy(e.target.checked)}
                disabled={!autoSelect}
              >
                Prioritize privacy (prefer Ollama for sensitive content)
              </Checkbox>
            </div>

            <div className="model-status">
              <div>
                <ApiOutlined /> OpenAI: <span className="status-online">Connected</span>
              </div>
              <div>
                <ApiOutlined /> Ollama: <span className={ollamaStatus === 'online' ? 'status-online' : 'status-offline'}>
                  {ollamaStatus === 'online' ? 'Connected' : 'Disconnected'}
                </span>
              </div>
            </div>
          </div>
        )}
      </div>
    </Card>
  );
};

export default ProviderSelector;
```
ChatInterface Component
```jsx
// ChatInterface.jsx
import React, { useState, useEffect, useRef } from 'react';
import { Input, Button, Spin, Avatar, Tooltip, Card, Typography, Dropdown, Menu } from 'antd';
import { SendOutlined, UserOutlined, RobotOutlined, SettingOutlined,
  SaveOutlined, CopyOutlined, DeleteOutlined, InfoCircleOutlined } from '@ant-design/icons';
import ReactMarkdown from 'react-markdown';
import { Prism as SyntaxHighlighter } from 'react-syntax-highlighter';
import { tomorrow } from 'react-syntax-highlighter/dist/esm/styles/prism';
import ProviderSelector from './ProviderSelector';

const { TextArea } = Input;
const { Text, Title } = Typography;

const ChatInterface = () => {
  const [messages, setMessages] = useState([]);
  const [input, setInput] = useState('');
  const [loading, setLoading] = useState(false);
  const [sessionId, setSessionId] = useState(null);
  const [provider, setProvider] = useState('auto');
  const [model, setModel] = useState(null);
  const [showSettings, setShowSettings] = useState(false);
  const messagesEndRef = useRef(null);

  // Scroll to bottom when messages change
  useEffect(() => {
    scrollToBottom();
  }, [messages]);

  const scrollToBottom = () => {
    messagesEndRef.current?.scrollIntoView({ behavior: 'smooth' });
  };

  const handleSend = async () => {
    if (!input.trim()) return;

    // Add user message to chat
    const userMessage = { role: 'user', content: input, timestamp: new Date() };
    setMessages(prev => [...prev, userMessage]);
    setInput('');
    setLoading(true);

    try {
      const response = await fetch('/api/v1/chat', {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({
          message: input,
          session_id: sessionId,
          model_params: {
            provider: provider,
            model: model,
            auto_select: provider === 'auto'
          }
        })
      });

      if (!response.ok) {
        throw new Error('Failed to get response');
      }

      const data = await response.json();

      // Update session ID if new
      if (data.session_id && !sessionId) {
        setSessionId(data.session_id);
      }

      // Add assistant message to chat
      const assistantMessage = {
        role: 'assistant',
        content: data.response,
        timestamp: new Date(),
        metadata: {
          model_used: data.model_used,
          provider_used: data.provider_used
        }
      };

      setMessages(prev => [...prev, assistantMessage]);

    } catch (error) {
      console.error('Error sending message:', error);
      // Add error message
      setMessages(prev => [...prev, {
        role: 'system',
        content: 'Error: Unable to get a response. Please try again.',
        error: true,
        timestamp: new Date()
      }]);
    } finally {
      setLoading(false);
    }
  };

  const handleKeyDown = (e) => {
    if (e.key === 'Enter' && !e.shiftKey) {
      e.preventDefault();
      handleSend();
    }
  };

  const handleCopyMessage = (content) => {
    navigator.clipboard.writeText(content);
    // Could show a toast notification here
  };

  const renderMessage = (message, index) => {
    const isUser = message.role === 'user';
    const isError = message.error;

    return (
      <div
        key={index}
        className={`message-container ${isUser ? 'user-message' : 'assistant-message'} ${isError ? 'error-message' : ''}`}
      >
        <div className="message-avatar">
          <Avatar
            icon={isUser ? <UserOutlined /> : <RobotOutlined />}
            style={{ backgroundColor: isUser ? '#1890ff' : '#52c41a' }}
          />
        </div>

        <div className="message-content">
          <div className="message-header">
            <Text strong>{isUser ? 'You' : 'MCP Assistant'}</Text>
            {message.metadata && (
              <Tooltip title="Model information">
                <Text type="secondary" className="model-info">
                  <InfoCircleOutlined /> {message.metadata.provider_used}:{message.metadata.model_used}
                </Text>
              </Tooltip>
            )}
            <Text type="secondary" className="message-time">
              {message.timestamp.toLocaleTimeString()}
            </Text>
          </div>

          <div className="message-body">
            <ReactMarkdown
              children={message.content}
              components={{
                code({node, inline, className, children, ...props}) {
                  const match = /language-(\w+)/.exec(className || '');
                  return !inline && match ? (
                    <SyntaxHighlighter
                      children={String(children).replace(/\n$/, '')}
                      style={tomorrow}
                      language={match[1]}
                      PreTag="div"
                      {...props}
                    />
                  ) : (
                    <code className={className} {...props}>
                      {children}
                    </code>
                  );
                }
              }}
            />
          </div>

          <div className="message-actions">
            <Button
              type="text"
              size="small"
              icon={<CopyOutlined />}
              onClick={() => handleCopyMessage(message.content)}
            >
              Copy
            </Button>
          </div>
        </div>
      </div>
    );
  };

  const settingsMenu = (
    <Card className="settings-panel">
      <Title level={4}>Chat Settings</Title>

      <ProviderSelector
        onProviderChange={setProvider}
        onModelChange={setModel}
        initialProvider={provider}
        initialModel={model}
        showAdvanced={true}
      />

      <div className="settings-actions">
        <Button type="primary" onClick={() => setShowSettings(false)}>
          Close Settings
        </Button>
      </div>
    </Card>
  );

  return (
    <div className="chat-interface">
      <div className="chat-header">
        <Title level={3}>MCP Assistant</Title>

        <div className="header-actions">
          <Button icon={<DeleteOutlined />} onClick={() => setMessages([])}>
            Clear Chat
          </Button>
          <Button
            icon={<SettingOutlined />}
            type={showSettings ? 'primary' : 'default'}
            onClick={() => setShowSettings(!showSettings)}
          >
            Settings
          </Button>
        </div>
      </div>

      {showSettings && settingsMenu}

      <div className="message-list">
        {messages.length === 0 && (
          <div className="empty-state">
            <Title level={4}>Start a conversation</Title>
            <Text>Ask a question or request information</Text>
          </div>
        )}

        {messages.map(renderMessage)}

        {loading && (
          <div className="message-container assistant-message">
            <div className="message-avatar">
              <Avatar icon={<RobotOutlined />} style={{ backgroundColor: '#52c41a' }} />
            </div>
            <div className="message-content">
              <div className="message-body typing-indicator">
                <Spin /> MCP is thinking...
              </div>
            </div>
          </div>
        )}

        <div ref={messagesEndRef} />
      </div>

      <div className="chat-input">
        <TextArea
          value={input}
          onChange={e => setInput(e.target.value)}
          onKeyDown={handleKeyDown}
          placeholder="Type your message..."
          autoSize={{ minRows: 1, maxRows: 4 }}
          disabled={loading}
        />
        <Button
          type="primary"
          icon={<SendOutlined />}
          onClick={handleSend}
          disabled={loading || !input.trim()}
        >
          Send
        </Button>
      </div>

      <div className="chat-footer">
        <Text type="secondary">
          Model: {provider === 'auto' ? 'Auto-select' : `${provider}:${model || 'default'}`}
        </Text>
        {sessionId && (
          <Text type="secondary">Session ID: {sessionId}</Text>
        )}
      </div>
    </div>
  );
};

export default ChatInterface;
```
AgentConfiguration Component
```jsx
// AgentConfiguration.jsx
import React, { useState, useEffect } from 'react';
import { Form, Input, Button, Select, Checkbox, Card, Typography, Tabs, message } from 'antd';
import { SaveOutlined, PlusOutlined, ImportOutlined, ExportOutlined } from '@ant-design/icons';

const { Title, Text } = Typography;
const { TextArea } = Input;
const { Option } = Select;
const { TabPane } = Tabs;

const AgentConfiguration = () => {
  const [form] = Form.useForm();
  const [agents, setAgents] = useState([]);
  const [currentAgent, setCurrentAgent] = useState(null);
  const [loading, setLoading] = useState(false);

  // Fetch available agents on component mount
  useEffect(() => {
    const fetchAgents = async () => {
      setLoading(true);
      try {
        const response = await fetch('/api/v1/agents');
        if (response.ok) {
          const data = await response.json();
          setAgents(data.agents);

          // Set current agent to the first one
          if (data.agents.length > 0) {
            setCurrentAgent(data.agents[0]);
            form.setFieldsValue(data.agents[0]);
          }
        }
      } catch (error) {
        console.error('Error fetching agents:', error);
        message.error('Failed to load agents');
      } finally {
        setLoading(false);
      }
    };

    fetchAgents();
  }, [form]);

  const handleAgentChange = (agentId) => {
    const selected = agents.find(a => a.id === agentId);
    if (selected) {
      setCurrentAgent(selected);
      form.setFieldsValue(selected);
    }
  };

  const handleSaveAgent = async (values) => {
    // currentAgent is null while a new agent is being created, so the
    // update URL cannot be built from currentAgent.id in that case.
    const isNew = !currentAgent;
    setLoading(true);
    try {
      // Assumption: the backend accepts POST /api/v1/agents for creation
      // and returns the created agent (including its id) as JSON.
      const url = isNew ? '/api/v1/agents' : `/api/v1/agents/${currentAgent.id}`;
      const response = await fetch(url, {
        method: isNew ? 'POST' : 'PUT',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify(values)
      });

      if (response.ok) {
        message.success('Agent configuration saved');
        // Update local state
        if (isNew) {
          const created = await response.json();
          setAgents([...agents, created]);
          setCurrentAgent(created);
        } else {
          const updatedAgents = agents.map(a =>
            a.id === currentAgent.id ? { ...a, ...values } : a
          );
          setAgents(updatedAgents);
          setCurrentAgent({ ...currentAgent, ...values });
        }
      } else {
        message.error('Failed to save agent configuration');
      }
    } catch (error) {
      console.error('Error saving agent:', error);
      message.error('Error saving agent configuration');
    } finally {
      setLoading(false);
    }
  };

  const handleCreateAgent = () => {
    form.resetFields();
    form.setFieldsValue({
      name: 'New Agent',
      description: 'Custom assistant',
      capabilities: [],
      system_prompt: 'You are a helpful assistant.'
    });

    setCurrentAgent(null); // Indicates we're creating a new agent
  };

  const handleExportAgent = () => {
    if (!currentAgent) return;

    // Serialize the agent and trigger a download through a temporary link
    const agentData = JSON.stringify(currentAgent, null, 2);
    const blob = new Blob([agentData], { type: 'application/json' });
    const url = URL.createObjectURL(blob);

    const a = document.createElement('a');
    a.href = url;
    a.download = `${currentAgent.name.replace(/\s+/g, '_').toLowerCase()}_agent.json`;
    document.body.appendChild(a);
    a.click();
    document.body.removeChild(a);
    URL.revokeObjectURL(url);
  };

  return (
    <div className="agent-configuration">
      <Card title={<Title level={4}>Agent Configuration</Title>}>
        <div className="agent-actions">
          <Button
            type="primary"
            icon={<PlusOutlined />}
            onClick={handleCreateAgent}
          >
            Create New Agent
          </Button>

          <Button
            icon={<ExportOutlined />}
            onClick={handleExportAgent}
            disabled={!currentAgent}
          >
            Export
          </Button>

          <Button icon={<ImportOutlined />}>
            Import
          </Button>
        </div>

        <div className="agent-selector">
          <Text strong>Select Agent:</Text>
          <Select
            style={{ width: 300 }}
            onChange={handleAgentChange}
            value={currentAgent?.id}
            loading={loading}
          >
            {agents.map(agent => (
              <Option key={agent.id} value={agent.id}>
                {agent.name} - {agent.description}
              </Option>
            ))}
          </Select>
        </div>

        <Form
          form={form}
          layout="vertical"
          onFinish={handleSaveAgent}
          className="agent-form"
        >
          <Tabs defaultActiveKey="basic">
            <TabPane tab="Basic Information" key="basic">
              <Form.Item
                name="name"
                label="Agent Name"
                rules={[{ required: true, message: 'Please enter an agent name' }]}
              >
                <Input placeholder="Agent name" />
              </Form.Item>

              <Form.Item
                name="description"
                label="Description"
                rules={[{ required: true, message: 'Please enter a description' }]}
              >
                <Input placeholder="Brief description of this agent's purpose" />
              </Form.Item>

              <Form.Item
                name="system_prompt"
                label="System Instructions"
                rules={[{ required: true, message: 'Please enter system instructions' }]}
              >
                <TextArea
                  placeholder="Instructions that define the agent's behavior"
                  autoSize={{ minRows: 4, maxRows: 8 }}
                />
              </Form.Item>
            </TabPane>

            <TabPane tab="Capabilities" key="capabilities">
              <Form.Item name="capabilities" label="Agent Capabilities">
                <Checkbox.Group>
                  <div className="capabilities-grid">
                    <Checkbox value="knowledge_retrieval">Knowledge Retrieval</Checkbox>
                    <Checkbox value="web_search">Web Search</Checkbox>
                    <Checkbox value="memory">Long-term Memory</Checkbox>
                    <Checkbox value="calendar">Calendar Awareness</Checkbox>
                    <Checkbox value="code_execution">Code Execution</Checkbox>
                    <Checkbox value="data_visualization">Data Visualization</Checkbox>
                    <Checkbox value="file_operations">File Operations</Checkbox>
                    <Checkbox value="email">Email Integration</Checkbox>
                  </div>
                </Checkbox.Group>
              </Form.Item>

              <Form.Item name="preferred_models" label="Preferred Models">
                <Select mode="multiple" placeholder="Select preferred models">
                  <Option value="openai:gpt-4">OpenAI: GPT-4</Option>
                  <Option value="openai:gpt-3.5-turbo">OpenAI: GPT-3.5 Turbo</Option>
                  <Option value="ollama:llama2">Ollama: Llama2</Option>
                  <Option value="ollama:mistral">Ollama: Mistral</Option>
                  <Option value="ollama:codellama">Ollama: CodeLlama</Option>
                </Select>
              </Form.Item>
            </TabPane>

            <TabPane tab="Advanced" key="advanced">
              <Form.Item name="tool_configuration" label="Tool Configuration">
                <TextArea
                  placeholder="JSON configuration for tools (advanced)"
                  autoSize={{ minRows: 4, maxRows: 8 }}
                />
              </Form.Item>

              <Form.Item name="temperature" label="Temperature">
                <Select placeholder="Response creativity level">
                  <Option value="0.2">0.2 - More deterministic/factual</Option>
                  <Option value="0.5">0.5 - Balanced</Option>
                  <Option value="0.8">0.8 - More creative/varied</Option>
                </Select>
              </Form.Item>
            </TabPane>
          </Tabs>

          <Form.Item>
            <Button
              type="primary"
              htmlType="submit"
              icon={<SaveOutlined />}
              loading={loading}
            >
              {currentAgent ? 'Save Changes' : 'Create Agent'}
            </Button>
          </Form.Item>
        </Form>
      </Card>
    </div>
  );
};

export default AgentConfiguration;
```
User Interaction Flows
New User Onboarding Flow
```text
┌────────────────┐     ┌────────────────┐     ┌────────────────┐
│                │     │                │     │                │
│ Welcome Screen │────▶│ Initial Setup  │────▶│ API Key Setup  │
│                │     │                │     │                │
└────────────────┘     └────────────────┘     └───────┬────────┘
                                                      │
┌────────────────┐     ┌────────────────┐     ┌───────▼────────┐
│                │     │                │     │                │
│  First Chat    │◀────│  Ollama Setup  │◀────│ Model Download │
│                │     │                │     │                │
└────────────────┘     └────────────────┘     └────────────────┘
```
Task-Based User Flow Example
```text
┌────────────────┐     ┌────────────────┐     ┌────────────────┐
│                │     │                │     │                │
│  Start Chat    │────▶│ Select Research│────▶│ Enter Research │
│                │     │     Agent      │     │     Query      │
└────────────────┘     └────────────────┘     └───────┬────────┘
                                                      │
┌────────────────┐     ┌────────────────┐     ┌───────▼────────┐
│                │     │                │     │                │
│  Save Results  │◀────│  Refine Query  │◀────│ View Response  │
│                │     │                │     │ (Using OpenAI) │
└────────────────┘     └────────────────┘     └────────────────┘
```
Advanced Settings Flow
```text
┌────────────────┐     ┌────────────────┐     ┌────────────────┐
│                │     │                │     │                │
│  Chat Screen   │────▶│ Settings Menu  │────▶│ Model Settings │
│                │     │                │     │                │
└────────────────┘     └────────────────┘     └───────┬────────┘
                                                      │
┌────────────────┐     ┌────────────────┐     ┌───────▼────────┐
│                │     │                │     │                │
│   Return to    │◀────│ Save Settings  │◀────│ Agent Settings │
│      Chat      │     │                │     │                │
└────────────────┘     └────────────────┘     └────────────────┘
```
Implementation Recommendations
- Responsive Design: Make the web interface usable on phones and tablets by laying out the chat and settings panels with responsive breakpoints
- Accessibility: Add ARIA attributes to interactive controls and make every action reachable by keyboard
- Progressive Enhancement: Where feasible, keep core functionality working when JavaScript is limited or unavailable; for a client-rendered React interface this typically requires server-side rendering
- State Management: Use context API or Redux for global state in more complex implementations
- Offline Support: Consider adding service workers for offline functionality in the web interface
- CLI Shortcuts: Implement tab completion and command history in the CLI for improved usability
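The CLI recommendation above can be sketched with Python's standard `readline` module. This is a minimal illustration, not the project's actual CLI: the command names in `COMMANDS`, the history file path, and the helper names `make_completer` and `enable_cli_shortcuts` are all assumptions made for the example.

```python
import atexit
import os

try:
    import readline  # Missing on some platforms (e.g. stock Python on Windows)
except ImportError:
    readline = None

# Hypothetical command names, for illustration only
COMMANDS = ["chat", "agents", "models", "config", "history", "quit"]


def make_completer(commands):
    """Build a readline-style completer.

    readline calls the completer repeatedly with state = 0, 1, 2, ...
    and expects the next match each time, or None when matches run out.
    """
    def complete(text, state):
        matches = [c for c in sorted(commands) if c.startswith(text)]
        return matches[state] if state < len(matches) else None
    return complete


def enable_cli_shortcuts(history_path="~/.mcp_history"):
    """Turn on tab completion and persistent history; returns False if unsupported."""
    if readline is None:
        return False
    path = os.path.expanduser(history_path)
    readline.set_completer(make_completer(COMMANDS))
    # GNU readline binding; libedit-based builds (common on macOS) may need
    # a different binding string.
    readline.parse_and_bind("tab: complete")
    try:
        readline.read_history_file(path)
    except OSError:
        pass  # No history yet, or the file is unreadable
    # Persist history when the interpreter exits
    atexit.register(readline.write_history_file, path)
    return True
```

Calling `enable_cli_shortcuts()` once before the first `input()` prompt is enough; keeping the completer as a plain function makes it testable without a terminal.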
Conclusion
The proposed user interface designs for the MCP system provide a balance between simplicity and power, enabling users to leverage the hybrid OpenAI-Ollama architecture effectively. The CLI offers a lightweight, scriptable interface for technical users and automation scenarios, while the web interface provides a rich, interactive experience for broader adoption.
Both interfaces expose the key capabilities of the system:
- Intelligent Model Routing: Users can leverage automatic model selection or manually choose specific models
- Agent Specialization: Configurable agents enable task-specific optimization
- Privacy Controls: Explicit options for privacy-sensitive content
- Performance Analytics: Visibility into system usage, costs, and efficiency
These interfaces serve as the critical touchpoint between users and the sophisticated underlying architecture, making complex AI capabilities accessible and manageable.
Optimization and Deployment Strategies for OpenAI-Ollama Hybrid AI System
Strategic Optimization Framework
The integration of cloud-based and local inference capabilities within a unified architecture presents unique opportunities for optimization across multiple dimensions. This document outlines comprehensive strategies for enhancing performance, reducing operational costs, and improving response accuracy, followed by detailed deployment methodologies for both local and cloud environments.
Performance Optimization Strategies
1. Query Routing Optimization
```python
# app/services/routing_optimizer.py
import logging
import numpy as np
from app.config import settings

logger = logging.getLogger(__name__)


class RoutingOptimizer:
    """Optimizes routing decisions based on historical performance data."""

    def __init__(self, cache_size: int = 1000):
        self.performance_history = {}
        self.cache_size = cache_size
        self.learning_rate = 0.05

        # Baseline thresholds
        self.complexity_threshold = settings.COMPLEXITY_THRESHOLD
        self.token_threshold = 800  # Approximate tokens before preferring cloud
        self.latency_requirement = 2.0  # Seconds

        # Performance weights
        self.weights = {
            "complexity": 0.4,
            "token_count": 0.2,
            "privacy_score": 0.3,
            "tool_requirement": 0.1
        }

    def update_performance_metrics(self,
                                   provider: str,
                                   model: str,
                                   query_complexity: float,
                                   token_count: int,
                                   response_time: float,
                                   success: bool) -> None:
        """Update performance metrics based on actual results."""
        model_key = f"{provider}:{model}"

        if model_key not in self.performance_history:
            self.performance_history[model_key] = {
                "queries": 0,
                "avg_response_time": 0,
                "success_rate": 0,
                "complexity_performance": {}  # Maps complexity ranges to success/time
            }

        metrics = self.performance_history[model_key]
        metrics["queries"] += 1
        queries = metrics["queries"]

        # Update response time with an exponential moving average.
        # The first sample seeds the average; starting from 0 would bias it low.
        if queries == 1:
            metrics["avg_response_time"] = response_time
        else:
            metrics["avg_response_time"] = (
                (1 - self.learning_rate) * metrics["avg_response_time"] +
                self.learning_rate * response_time
            )

        # Update success rate (running mean over all queries)
        old_success_rate = metrics["success_rate"]
        metrics["success_rate"] = (old_success_rate * (queries - 1) + (1 if success else 0)) / queries

        # Update complexity-specific performance
        complexity_bin = round(query_complexity * 10) / 10  # Round to nearest 0.1

        if complexity_bin not in metrics["complexity_performance"]:
            metrics["complexity_performance"][complexity_bin] = {
                "count": 0,
                "avg_time": 0,
                "success_rate": 0
            }

        bin_metrics = metrics["complexity_performance"][complexity_bin]
        bin_metrics["count"] += 1
        bin_metrics["avg_time"] = (
            (bin_metrics["count"] - 1) * bin_metrics["avg_time"] + response_time
        ) / bin_metrics["count"]

        bin_metrics["success_rate"] = (
            (bin_metrics["count"] - 1) * bin_metrics["success_rate"] + (1 if success else 0)
        ) / bin_metrics["count"]

        # Prune cache if needed: drop the least used models first
        excess = len(self.performance_history) - self.cache_size
        if excess > 0:
            sorted_models = sorted(
                self.performance_history.items(),
                key=lambda x: x[1]["queries"]
            )
            for key, _ in sorted_models[:excess]:
                del self.performance_history[key]

    def optimize_thresholds(self) -> None:
        """Periodically optimize routing thresholds based on collected metrics."""
        if not self.performance_history:
            return

        openai_models = [k for k in self.performance_history if k.startswith("openai:")]
        ollama_models = [k for k in self.performance_history if k.startswith("ollama:")]

        if not openai_models or not ollama_models:
            return  # Need data from both providers

        # Average Ollama response time drives the latency requirement below
        ollama_avg_time = np.mean([
            self.performance_history[model]["avg_response_time"]
            for model in ollama_models
        ])

        # Find optimal complexity threshold by analyzing where Ollama begins to struggle
        complexity_success_rates = {}

        for model in ollama_models:
            for complexity, metrics in self.performance_history[model]["complexity_performance"].items():
                complexity_success_rates.setdefault(complexity, []).append(metrics["success_rate"])

        # Find the complexity level where Ollama success rate drops significantly
        optimal_threshold = self.complexity_threshold  # Start with current

        if complexity_success_rates:
            complexities = sorted(complexity_success_rates.keys())
            avg_success_rates = [
                np.mean(complexity_success_rates[c]) for c in complexities
            ]

            # Find first major drop in success rate. An explicit flag avoids
            # comparing floats to detect whether a drop was found.
            found_drop = False
            for i in range(1, len(complexities)):
                if (avg_success_rates[i-1] - avg_success_rates[i]) > 0.15:  # 15% drop
                    optimal_threshold = complexities[i-1]
                    found_drop = True
                    break

            # If no clear drop, look for when it falls below 85%
            if not found_drop:
                for i, c in enumerate(complexities):
                    if avg_success_rates[i] < 0.85:
                        optimal_threshold = c
                        break

        # Update thresholds (with dampening to avoid oscillation)
        self.complexity_threshold = (
            0.8 * self.complexity_threshold +
            0.2 * optimal_threshold
        )

        # Update latency requirements based on current performance
        self.latency_requirement = max(1.0, min(ollama_avg_time * 1.2, 5.0))

        logger.info(f"Optimized routing thresholds: complexity={self.complexity_threshold:.2f}, latency={self.latency_requirement:.2f}s")

    def get_optimal_provider(self,
                             query_complexity: float,
                             privacy_score: float,
                             estimated_tokens: int,
                             requires_tools: bool) -> str:
        """Get the optimal provider based on current metrics and query characteristics."""
        # Calculate weighted score for routing decision
        openai_score = 0
        ollama_score = 0

        # Complexity factor
        if query_complexity > self.complexity_threshold:
            openai_score += self.weights["complexity"]
        else:
            ollama_score += self.weights["complexity"]

        # Token count factor
        if estimated_tokens > self.token_threshold:
            openai_score += self.weights["token_count"]
        else:
            ollama_score += self.weights["token_count"]

        # Privacy factor (higher privacy score means more sensitive)
        if privacy_score > 0.5:
            ollama_score += self.weights["privacy_score"]
        else:
            # Split proportionally
            ollama_privacy = self.weights["privacy_score"] * privacy_score * 2
            openai_privacy = self.weights["privacy_score"] * (1 - privacy_score * 2)
            ollama_score += ollama_privacy
            openai_score += openai_privacy

        # Tool requirements factor
        if requires_tools:
            openai_score += self.weights["tool_requirement"]

        # Return the provider with higher score (ties go to Ollama)
        return "openai" if openai_score > ollama_score else "ollama"
```
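To make the weighting concrete, the scoring step of `get_optimal_provider` can be reproduced as a standalone function and run on a few inputs. This is a sketch rather than part of the service: `score_providers` and `pick_provider` are hypothetical names, and the default `complexity_threshold` of 0.65 is an assumed stand-in for `settings.COMPLEXITY_THRESHOLD`, whose real value is not shown here.

```python
# Weights copied from RoutingOptimizer.__init__
WEIGHTS = {
    "complexity": 0.4,
    "token_count": 0.2,
    "privacy_score": 0.3,
    "tool_requirement": 0.1,
}


def score_providers(query_complexity, privacy_score, estimated_tokens,
                    requires_tools, complexity_threshold=0.65,
                    token_threshold=800):
    """Return the per-provider scores used for the routing decision.

    complexity_threshold=0.65 is an assumed value for illustration.
    """
    scores = {"openai": 0.0, "ollama": 0.0}

    # Complex or long queries favor the cloud provider
    complex_side = "openai" if query_complexity > complexity_threshold else "ollama"
    scores[complex_side] += WEIGHTS["complexity"]
    token_side = "openai" if estimated_tokens > token_threshold else "ollama"
    scores[token_side] += WEIGHTS["token_count"]

    # Sensitive queries put the full privacy weight on local inference;
    # otherwise the weight is split in proportion to the score
    if privacy_score > 0.5:
        scores["ollama"] += WEIGHTS["privacy_score"]
    else:
        scores["ollama"] += WEIGHTS["privacy_score"] * privacy_score * 2
        scores["openai"] += WEIGHTS["privacy_score"] * (1 - privacy_score * 2)

    if requires_tools:
        scores["openai"] += WEIGHTS["tool_requirement"]
    return scores


def pick_provider(scores):
    """Ties go to Ollama, matching the comparison in get_optimal_provider."""
    return "openai" if scores["openai"] > scores["ollama"] else "ollama"


# A complex but short, privacy-sensitive query stays local:
# openai gets 0.4 (complexity), ollama gets 0.2 (tokens) + 0.3 (privacy)
sensitive = score_providers(0.9, 0.9, 200, requires_tools=False)
print(pick_provider(sensitive), sensitive)
```

With these weights, privacy alone cannot outvote complexity (0.3 against 0.4); a sensitive query stays local only when the token factor also favors Ollama. If strict data sovereignty is required, a hard override on `privacy_score` would be a safer design than a weighted vote.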
2. Response Caching with Semantic Search
```python
# app/services/cache_service.py
# Note: this module targets the aioredis 1.x API (create_redis_pool, set(..., expire=...)).
import time
import hashlib
import json
from typing import Dict, List, Optional
from scipy.spatial.distance import cosine
import aioredis

from app.config import settings
from app.services.embedding_service import EmbeddingService


class SemanticCache:
    """Intelligent caching system using semantic similarity."""

    def __init__(self, embedding_service: EmbeddingService, ttl: int = 3600):
        self.embedding_service = embedding_service
        self.redis = None
        self.ttl = ttl
        self.similarity_threshold = 0.92  # Threshold for semantic similarity
        self.exact_cache_enabled = True
        self.semantic_cache_enabled = True

    async def initialize(self):
        """Initialize Redis connection."""
        # encoding="utf-8" makes Redis return str instead of bytes, so keys
        # can be safely interpolated into other key names.
        self.redis = await aioredis.create_redis_pool(settings.REDIS_URL, encoding="utf-8")

    async def close(self):
        """Close Redis connection."""
        if self.redis:
            self.redis.close()
            await self.redis.wait_closed()

    def _get_exact_cache_key(self, messages: List[Dict], provider: str, model: str) -> str:
        """Generate an exact cache key from request parameters."""
        # Normalize the request to ensure consistent keys
        normalized = {
            "messages": messages,
            "provider": provider,
            "model": model
        }
        serialized = json.dumps(normalized, sort_keys=True)
        return f"exact:{hashlib.md5(serialized.encode()).hexdigest()}"

    @staticmethod
    def _get_embedding_key(text: str) -> str:
        """Get the embedding key for a text string."""
        return f"emb:{hashlib.md5(text.encode()).hexdigest()}"

    @staticmethod
    def _last_user_message(messages: List[Dict]) -> Optional[str]:
        """Return the content of the most recent user message, if any."""
        for msg in reversed(messages):
            if msg.get("role") == "user" and msg.get("content"):
                return msg["content"]
        return None

    async def _store_embedding(self, text: str, embedding: List[float]) -> None:
        """Store an embedding in Redis."""
        key = self._get_embedding_key(text)
        await self.redis.set(key, json.dumps(embedding), expire=self.ttl)

    async def _get_embedding(self, text: str) -> Optional[List[float]]:
        """Get an embedding from Redis or compute it if not found."""
        cached = await self.redis.get(self._get_embedding_key(text))
        if cached:
            return json.loads(cached)

        # Generate new embedding
        embedding = await self.embedding_service.get_embedding(text)
        if embedding:
            await self._store_embedding(text, embedding)

        return embedding

    @staticmethod
    def _compute_similarity(embedding1: List[float], embedding2: List[float]) -> float:
        """Compute cosine similarity between embeddings."""
        return 1 - cosine(embedding1, embedding2)

    async def _semantic_lookup(self, messages: List[Dict], provider: str, model: str) -> Optional[Dict]:
        """Return the cached response for the most similar earlier query, if any."""
        query_text = self._last_user_message(messages)
        if not query_text:
            return None

        query_embedding = await self._get_embedding(query_text)
        if not query_embedding:
            return None

        # KEYS scans the whole keyspace (O(N)); acceptable for small caches only
        semantic_keys = await self.redis.keys("semantic:*")

        best_match = None
        best_similarity = 0.0

        for key in semantic_keys:
            # The pattern also matches the ":meta" companion keys; skip them
            if key.endswith(":meta"):
                continue

            meta_data = await self.redis.get(f"{key}:meta")
            if not meta_data:
                continue

            meta = json.loads(meta_data)
            cached_embedding = meta.get("embedding")
            if not cached_embedding:
                continue

            # Check provider/model compatibility
            if (provider != "auto" and meta.get("provider") != provider) or \
               (model and meta.get("model") != model):
                continue

            similarity = self._compute_similarity(query_embedding, cached_embedding)
            if similarity > self.similarity_threshold and similarity > best_similarity:
                best_match = key
                best_similarity = similarity

        if best_match:
            cached = await self.redis.get(best_match)
            if cached:
                return json.loads(cached)
        return None

    async def get(self, messages: List[Dict], provider: str, model: str) -> Optional[Dict]:
        """Get a cached response if available."""
        if not self.redis:
            return None

        # Try exact match first
        if self.exact_cache_enabled:
            exact_key = self._get_exact_cache_key(messages, provider, model)
            cached = await self.redis.get(exact_key)
            if cached:
                # Record cache hit analytics (read back by get_stats)
                await self.redis.incr("stats:exact_cache_hits")
                return json.loads(cached)

        # Try semantic search if enabled
        if self.semantic_cache_enabled:
            result = await self._semantic_lookup(messages, provider, model)
            if result is not None:
                await self.redis.incr("stats:semantic_cache_hits")
                return result

        # Record cache miss
        await self.redis.incr("stats:cache_misses")
        return None

    async def set(self, messages: List[Dict], provider: str, model: str, response: Dict) -> None:
        """Set a response in the cache."""
        if not self.redis:
            return

        # Set exact match cache
        if self.exact_cache_enabled:
            exact_key = self._get_exact_cache_key(messages, provider, model)
            await self.redis.set(exact_key, json.dumps(response), expire=self.ttl)

        # Set semantic cache
        if self.semantic_cache_enabled:
            query_text = self._last_user_message(messages)
            if not query_text:
                return

            query_embedding = await self._get_embedding(query_text)
            if not query_embedding:
                return

            # Generate semantic key
            semantic_key = f"semantic:{time.time()}:{hashlib.md5(query_text.encode()).hexdigest()}"

            # Store response
            await self.redis.set(semantic_key, json.dumps(response), expire=self.ttl)

            # Store metadata (for similarity search)
            meta_data = {
                "query": query_text,
                "embedding": query_embedding,
                "provider": response.get("provider", provider),
                "model": response.get("model", model),
                "timestamp": time.time()
            }

            await self.redis.set(f"{semantic_key}:meta", json.dumps(meta_data), expire=self.ttl)

    async def get_stats(self) -> Dict[str, float]:
        """Get cache statistics."""
        if not self.redis:
            return {"exact_hits": 0, "semantic_hits": 0, "total_hits": 0, "misses": 0, "hit_rate": 0}

        exact_hits = int(await self.redis.get("stats:exact_cache_hits") or 0)
        semantic_hits = int(await self.redis.get("stats:semantic_cache_hits") or 0)
        misses = int(await self.redis.get("stats:cache_misses") or 0)

        total_hits = exact_hits + semantic_hits
        total = total_hits + misses

        return {
            "exact_hits": exact_hits,
            "semantic_hits": semantic_hits,
            "total_hits": total_hits,
            "misses": misses,
            "hit_rate": total_hits / total if total > 0 else 0
        }
```
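The core of the semantic lookup is a nearest-neighbor search with a similarity cutoff. The sketch below isolates that step in pure Python, with no Redis or embedding service, so the effect of the 0.92 threshold is easy to see. The function names `cosine_similarity` and `find_best_match` and the two-dimensional toy vectors are assumptions for illustration; real embeddings have hundreds of dimensions or more.

```python
import math
from typing import Dict, List, Optional, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def find_best_match(query_embedding: List[float],
                    cached: Dict[str, List[float]],
                    threshold: float = 0.92) -> Optional[Tuple[str, float]]:
    """Return (key, similarity) of the closest cached entry above the threshold."""
    best: Optional[Tuple[str, float]] = None
    for key, embedding in cached.items():
        similarity = cosine_similarity(query_embedding, embedding)
        # Keep only candidates above the cutoff, preferring the most similar
        if similarity > threshold and (best is None or similarity > best[1]):
            best = (key, similarity)
    return best


# Toy cache: keys stand in for the "semantic:..." Redis keys
cache = {
    "semantic:a": [1.0, 0.0],
    "semantic:b": [0.0, 1.0],
    "semantic:c": [1.0, 0.1],
}

hit = find_best_match([1.0, 0.1], cache)   # identical direction to "semantic:c"
miss = find_best_match([1.0, 1.0], cache)  # roughly 45 degrees from every entry
print(hit, miss)
```

A linear scan like this costs one similarity computation per cached entry on every request. For larger caches, a vector index would be the usual replacement, at the price of extra infrastructure.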
3. Parallel Query Processing
python1# app/services/parallel_processor.py2import asyncio3from typing import List, Dict, Any, Optional, Tuple4import logging5import time67from app.services.provider_service import ProviderService8from app.config import settings910logger = logging.getLogger(__name__)1112class ParallelProcessor:13 """Processes complex queries by decomposing and running in parallel."""1415 def __init__(self, provider_service: ProviderService):16 self.provider_service = provider_service17 # Threshold for when to use parallel processing18 self.complexity_threshold = 0.819 self.parallel_enabled = settings.ENABLE_PARALLEL_PROCESSING2021 async def should_process_in_parallel(self, messages: List[Dict]) -> bool:22 """Determine if a query should be processed in parallel."""23 if not self.parallel_enabled:24 return False2526 # Get the last user message27 user_message = None28 for msg in reversed(messages):29 if msg.get("role") == "user":30 user_message = msg.get("content", "")31 break3233 if not user_message:34 return False3536 # Check message length37 if len(user_message.split()) < 50:38 return False3940 # Check for complexity indicators41 complexity_markers = [42 "compare", "analyze", "different perspectives", "pros and cons",43 "multiple aspects", "detail", "comprehensive", "multifaceted"44 ]4546 marker_count = sum(1 for marker in complexity_markers if marker in user_message.lower())4748 # Check for multiple questions49 question_count = user_message.count("?")5051 # Calculate complexity score52 complexity = (marker_count * 0.15) + (question_count * 0.2) + (len(user_message.split()) / 500)5354 return complexity > self.complexity_threshold5556 async def decompose_query(self, query: str) -> List[str]:57 """Decompose a complex query into simpler sub-queries."""58 # Use the provider service to generate the decomposition59 decompose_messages = [60 {"role": "system", "content": """61 You are a query decomposition specialist. 
Your job is to break down complex questions into62 simpler, independent sub-questions that can be answered separately and then combined.6364 Return a JSON array of strings, where each string is a sub-question.65 For example: ["What are the basics of quantum computing?", "How does quantum computing differ from classical computing?"]6667 Keep the total number of sub-questions between 2 and 5.68 """},69 {"role": "user", "content": f"Decompose this complex query into simpler sub-questions: {query}"}70 ]7172 try:73 response = await self.provider_service.generate_completion(74 messages=decompose_messages,75 provider="openai", # Use OpenAI for decomposition76 model="gpt-3.5-turbo", # Use a faster model for this task77 response_format={"type": "json_object"}78 )7980 if response and response.get("message", {}).get("content"):81 import json82 result = json.loads(response["message"]["content"])83 if isinstance(result, list) and all(isinstance(item, str) for item in result):84 return result85 elif isinstance(result, dict) and "sub_questions" in result:86 return result["sub_questions"]8788 # Fallback to simple decomposition89 return [query]9091 except Exception as e:92 logger.error(f"Error decomposing query: {str(e)}")93 # Fallback to simple decomposition94 return [query]9596 async def process_sub_query(self, sub_query: str, provider: str, model: str) -> Dict[str, Any]:97 """Process a single sub-query."""98 messages = [{"role": "user", "content": sub_query}]99100 start_time = time.time()101 response = await self.provider_service.generate_completion(102 messages=messages,103 provider=provider,104 model=model105 )106 duration = time.time() - start_time107108 return {109 "query": sub_query,110 "response": response,111 "content": response.get("message", {}).get("content", ""),112 "duration": duration113 }114115 async def synthesize_responses(self,116 original_query: str,117 sub_results: List[Dict]) -> str:118 """Synthesize the responses from sub-queries into a cohesive 
answer."""119 # Extract the responses120 synthesize_prompt = f"""121 Original question: {original_query}122123 I've broken this question down into parts and found the following information:124125 {126 ''.join([f"Sub-question: {r['query']}\nAnswer: {r['content']}\n\n" for r in sub_results])127 }128129 Please synthesize this information into a cohesive, comprehensive answer to the original question.130 Ensure the response is well-structured and flows naturally as if it were answering the original131 question directly. Maintain a consistent tone throughout.132 """133134 messages = [135 {"role": "system", "content": "You are an expert at synthesizing information from multiple sources into cohesive, comprehensive answers."},136 {"role": "user", "content": synthesize_prompt}137 ]138139 try:140 response = await self.provider_service.generate_completion(141 messages=messages,142 provider="openai", # Use OpenAI for synthesis143 model="gpt-4" # Use a more capable model for synthesis144 )145146 if response and response.get("message", {}).get("content"):147 return response["message"]["content"]148149 # Fallback150 return "\n\n".join([r['content'] for r in sub_results])151152 except Exception as e:153 logger.error(f"Error synthesizing responses: {str(e)}")154 # Fallback to simple concatenation155 return "\n\n".join([f"Regarding '{r['query']}':\n{r['content']}" for r in sub_results])156157 async def process_in_parallel(self,158 messages: List[Dict],159 provider: str = "auto",160 model: str = None) -> Dict[str, Any]:161 """Process a complex query by breaking it down and processing in parallel."""162 # Get the last user message163 user_message = None164 for msg in reversed(messages):165 if msg.get("role") == "user":166 user_message = msg.get("content", "")167 break168169 if not user_message:170 # Fallback to regular processing171 return await self.provider_service.generate_completion(172 messages=messages,173 provider=provider,174 model=model175 )176177 # Decompose the query178 
sub_queries = await self.decompose_query(user_message)179180 if len(sub_queries) <= 1:181 # Not complex enough to benefit from parallel processing182 return await self.provider_service.generate_completion(183 messages=messages,184 provider=provider,185 model=model186 )187188 # Process sub-queries in parallel189 tasks = [190 self.process_sub_query(query, provider, model)191 for query in sub_queries192 ]193194 sub_results = await asyncio.gather(*tasks)195196 # Synthesize the results197 final_content = await self.synthesize_responses(user_message, sub_results)198199 # Calculate aggregated metrics200 total_duration = sum(result["duration"] for result in sub_results)201 providers_used = [result["response"].get("provider") for result in sub_results202 if result["response"].get("provider")]203 models_used = [result["response"].get("model") for result in sub_results204 if result["response"].get("model")]205206 # Construct a response in the same format as provider_service.generate_completion207 return {208 "id": f"parallel_{int(time.time())}",209 "object": "chat.completion",210 "created": int(time.time()),211 "model": ", ".join(set(models_used)) if models_used else model,212 "provider": ", ".join(set(providers_used)) if providers_used else provider,213 "usage": {214 "prompt_tokens": sum(result["response"].get("usage", {}).get("prompt_tokens", 0)215 for result in sub_results),216 "completion_tokens": sum(result["response"].get("usage", {}).get("completion_tokens", 0)217 for result in sub_results),218 "total_tokens": sum(result["response"].get("usage", {}).get("total_tokens", 0)219 for result in sub_results)220 },221 "message": {222 "role": "assistant",223 "content": final_content224 },225 "parallel_processing": {226 "sub_queries": len(sub_queries),227 "total_duration": total_duration,228 "max_duration": max(result["duration"] for result in sub_results),229 "processing_efficiency": 1 - (max(result["duration"] for result in sub_results) / total_duration)230 if total_duration > 
0 else 0231 }232 }
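The fan-out and efficiency metric used above can be tried in isolation. The following is a minimal sketch, assuming a hypothetical `fake_completion` coroutine in place of `provider_service.generate_completion`; the helper names are illustrative and not part of the original service. It also builds the joined sub-answer text before interpolating it, since a backslash inside an f-string expression is a syntax error before Python 3.12.

```python
import asyncio
import time
from typing import Any, Dict, List


async def fake_completion(prompt: str, delay: float) -> Dict[str, Any]:
    """Hypothetical stand-in for provider_service.generate_completion."""
    await asyncio.sleep(delay)
    return {"message": {"content": f"answer to: {prompt}"}}


async def run_sub_query(prompt: str, delay: float) -> Dict[str, Any]:
    """Time one sub-query, mirroring process_sub_query."""
    start = time.perf_counter()
    response = await fake_completion(prompt, delay)
    return {
        "query": prompt,
        "content": response["message"]["content"],
        "duration": time.perf_counter() - start,
    }


def build_synthesis_block(sub_results: List[Dict[str, Any]]) -> str:
    """Join sub-answers first, then interpolate the result into a prompt."""
    return "".join(
        f"Sub-question: {r['query']}\nAnswer: {r['content']}\n\n"
        for r in sub_results
    )


def parallel_efficiency(durations: List[float]) -> float:
    """1 - (slowest / sum): the share of sequential time saved by overlap."""
    total = sum(durations)
    return 1 - max(durations) / total if total > 0 else 0.0


async def main() -> List[Dict[str, Any]]:
    # gather() runs the coroutines concurrently and keeps the input order.
    return await asyncio.gather(
        run_sub_query("What is a qubit?", 0.05),
        run_sub_query("What is superposition?", 0.05),
    )


if __name__ == "__main__":
    results = asyncio.run(main())
    print(build_synthesis_block(results))
    print(parallel_efficiency([r["duration"] for r in results]))
```

With three sub-queries taking 1, 1, and 2 seconds, the metric is 1 - 2/4 = 0.5: the parallel run takes half the time a sequential run would.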
4. Dynamic Batching for High-Load Scenarios
```python
# app/services/batch_processor.py
import asyncio
from typing import List, Dict, Any, Optional, Callable, Awaitable
import time
import logging
from collections import deque

logger = logging.getLogger(__name__)

class RequestBatcher:
    """
    Dynamically batches requests to optimize throughput under high load.
    """

    def __init__(self,
                 max_batch_size: int = 4,
                 max_wait_time: float = 0.1,
                 processor_fn: Optional[Callable] = None):
        self.max_batch_size = max_batch_size
        self.max_wait_time = max_wait_time
        self.processor_fn = processor_fn
        self.queue = deque()
        self.batch_task = None
        # Hold references to in-flight batch tasks so they are not
        # garbage-collected before they finish
        self._pending_tasks = set()
        self.active = False
        self.stats = {
            "total_requests": 0,
            "total_batches": 0,
            "avg_batch_size": 0,
            "max_queue_length": 0
        }

    async def start(self):
        """Start the batch processor."""
        if self.active:
            return

        self.active = True
        self.batch_task = asyncio.create_task(self._batch_processor())
        logger.info("Batch processor started")

    async def stop(self):
        """Stop the batch processor."""
        if not self.active:
            return

        self.active = False
        if self.batch_task:
            try:
                self.batch_task.cancel()
                await self.batch_task
            except asyncio.CancelledError:
                pass

        logger.info("Batch processor stopped")

    async def _batch_processor(self):
        """Background task to process batches."""
        while self.active:
            try:
                # Process any batches in the queue
                await self._process_next_batch()

                # Wait a small amount of time before checking again
                await asyncio.sleep(0.01)
            except Exception as e:
                logger.error(f"Error in batch processor: {str(e)}")
                await asyncio.sleep(1)  # Wait longer on error

    async def _process_next_batch(self):
        """Process the next batch from the queue."""
        if not self.queue:
            return

        # Start timing from oldest request
        oldest_request_time = self.queue[0][2]
        current_time = time.time()

        # Process if we have max batch size or max wait time elapsed
        if len(self.queue) >= self.max_batch_size or \
           (current_time - oldest_request_time) >= self.max_wait_time:

            # Extract batch (up to max_batch_size)
            batch_size = min(len(self.queue), self.max_batch_size)
            batch = []

            for _ in range(batch_size):
                request, future, _ = self.queue.popleft()
                batch.append((request, future))

            # Update stats (incremental running mean of the batch size)
            self.stats["total_batches"] += 1
            self.stats["avg_batch_size"] = ((self.stats["avg_batch_size"] * (self.stats["total_batches"] - 1)) + batch_size) / self.stats["total_batches"]

            # Process batch in the background, keeping a reference to the task
            task = asyncio.create_task(self._process_batch(batch))
            self._pending_tasks.add(task)
            task.add_done_callback(self._pending_tasks.discard)

    async def _process_batch(self, batch: List[tuple]):
        """Process a batch of requests."""
        if not self.processor_fn:
            for _, future in batch:
                if not future.done():
                    future.set_exception(ValueError("No processor function set"))
            return

        # Extract just the requests for processing
        requests = [req for req, _ in batch]

        try:
            # Process the batch
            results = await self.processor_fn(requests)

            # Match results to futures
            if results is not None and len(results) == len(batch):
                for i, (_, future) in enumerate(batch):
                    if not future.done():
                        future.set_result(results[i])
            else:
                # Handle mismatch in results (including a processor that returned None)
                received = len(results) if results is not None else 0
                logger.error(f"Batch result count mismatch: {received} results for {len(batch)} requests")
                for _, future in batch:
                    if not future.done():
                        future.set_exception(ValueError("Batch processing error: result count mismatch"))

        except Exception as e:
            logger.error(f"Error processing batch: {str(e)}")
            # Set exception for all futures in batch
            for _, future in batch:
                if not future.done():
                    future.set_exception(e)

    async def submit(self, request: Any) -> Any:
        """Submit a request for batched processing."""
        self.stats["total_requests"] += 1

        # Create a future bound to the running event loop
        future = asyncio.get_running_loop().create_future()

        # Add to queue with timestamp
        self.queue.append((request, future, time.time()))

        # Update max queue length stat
        queue_length = len(self.queue)
        if queue_length > self.stats["max_queue_length"]:
            self.stats["max_queue_length"] = queue_length

        # Wait for the batch containing this request and return its result
        return await future
```
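The two pieces of arithmetic inside the batcher, the size-or-timeout flush rule and the running average of batch sizes, are easy to check on their own. This is a small sketch under the same defaults as `RequestBatcher`; the function names `should_flush` and `update_running_mean` are illustrative and do not appear in the service.

```python
def should_flush(queue_len: int,
                 oldest_age: float,
                 max_batch_size: int = 4,
                 max_wait_time: float = 0.1) -> bool:
    """Flush when the queue is full or the oldest request has waited long enough."""
    if queue_len == 0:
        return False
    return queue_len >= max_batch_size or oldest_age >= max_wait_time


def update_running_mean(prev_mean: float, count: int, new_value: float) -> float:
    """Incremental mean; `count` is the number of samples including the new one."""
    return (prev_mean * (count - 1) + new_value) / count


if __name__ == "__main__":
    print(should_flush(queue_len=4, oldest_age=0.0))   # full queue
    print(should_flush(queue_len=1, oldest_age=0.05))  # neither condition met
    print(should_flush(queue_len=1, oldest_age=0.2))   # waited too long
    mean = update_running_mean(0.0, 1, 4)
    print(update_running_mean(mean, 2, 2))
```

The timeout bounds the latency a lone request can incur, while the size cap bounds how much work a single call to the processor function receives.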
5. Model-Specific Prompt Optimization
```python
# app/services/prompt_optimizer.py
import logging
from typing import List, Dict, Any, Optional
import re

logger = logging.getLogger(__name__)

class PromptOptimizer:
    """Optimizes prompts for specific models to improve response quality and reduce token usage."""

    def __init__(self):
        self.model_specific_templates = {
            # OpenAI models
            "gpt-4": {
                "prefix": "",  # GPT-4 doesn't need special prefixing
                "suffix": "",
                "instruction_format": "{instruction}"
            },
            "gpt-3.5-turbo": {
                "prefix": "",
                "suffix": "",
                "instruction_format": "{instruction}"
            },

            # Ollama models - they benefit from more explicit formatting
            "llama2": {
                "prefix": "",
                "suffix": "Think step-by-step and be thorough in your response.",
                "instruction_format": "{instruction}"
            },
            "llama2:70b": {
                "prefix": "",
                "suffix": "",
                "instruction_format": "{instruction}"
            },
            "mistral": {
                "prefix": "",
                "suffix": "Take a deep breath and work on this step-by-step.",
                "instruction_format": "{instruction}"
            },
            "codellama": {
                "prefix": "You are an expert programmer with years of experience. ",
                "suffix": "Make sure your code is correct and efficient.",
                "instruction_format": "Task: {instruction}"
            },
            "wizard-math": {
                "prefix": "You are a mathematics expert. ",
                "suffix": "Show your work step-by-step and explain your reasoning clearly.",
                "instruction_format": "Problem: {instruction}"
            }
        }

        # Default template to use when model not specifically defined
        self.default_template = {
            "prefix": "",
            "suffix": "",
            "instruction_format": "{instruction}"
        }

        # Task-specific optimizations
        self.task_templates = {
            "code_generation": {
                "prefix": "You are an expert programmer. ",
                "suffix": "Ensure your code is correct, efficient, and well-commented.",
                "instruction_format": "Programming Task: {instruction}"
            },
            "creative_writing": {
                "prefix": "You are a creative writer with excellent storytelling abilities. ",
                "suffix": "",
                "instruction_format": "Creative Writing Prompt: {instruction}"
            },
            "reasoning": {
                "prefix": "You are a logical thinker with strong reasoning skills. ",
                "suffix": "Think step-by-step and be precise in your analysis.",
                "instruction_format": "Reasoning Task: {instruction}"
            },
            "math": {
                "prefix": "You are a mathematics expert. ",
                "suffix": "Show your work step-by-step with explanations.",
                "instruction_format": "Math Problem: {instruction}"
            }
        }

    def detect_task_type(self, message: str) -> Optional[str]:
        """Detect the type of task from the message content."""
        message_lower = message.lower()

        # Code detection patterns
        code_patterns = [
            r"write (a|an|the)?\s?(code|function|program|script|class|method)",
            r"implement (a|an|the)?\s?(algorithm|function|class|method)",
            r"debug (this|the)?\s?(code|function|program)",
            # Lookarounds keep short names such as "go" or "js" from
            # matching inside ordinary words like "good"
            r"(?<!\w)(js|javascript|python|java|c\+\+|go|rust|typescript)(?!\w)"
        ]

        # Creative writing patterns
        creative_patterns = [
            r"write (a|an|the)?\s?(story|poem|essay|narrative|scene)",
            r"create (a|an|the)?\s?(story|character|dialogue|setting)",
            r"describe (a|an|the)?\s?(scene|character|setting|world)"
        ]

        # Math patterns
        math_patterns = [
            r"calculate",
            r"solve (this|the)?\s?(equation|problem|expression)",
            r"compute",
            r"what is (the)?\s?(value|result|answer)",
            r"find (the)?\s?(derivative|integral|product|sum|limit)"
        ]

        # Reasoning patterns
        reasoning_patterns = [
            r"analyze",
            r"compare (and|&) contrast",
            r"explain (why|how)",
            r"what are (the)?\s?(pros|cons|advantages|disadvantages)",
            r"evaluate"
        ]

        # Check each pattern set; the first matching category wins
        for pattern in code_patterns:
            if re.search(pattern, message_lower):
                return "code_generation"

        for pattern in creative_patterns:
            if re.search(pattern, message_lower):
                return "creative_writing"

        for pattern in math_patterns:
            if re.search(pattern, message_lower):
                return "math"

        for pattern in reasoning_patterns:
            if re.search(pattern, message_lower):
                return "reasoning"

        return None

    def optimize_system_prompt(self, original_prompt: str, model: str, task_type: Optional[str] = None) -> str:
        """Optimize the system prompt for the specific model and task."""
        # If no original prompt, return an appropriate default
        if not original_prompt:
            return "You are a helpful assistant. Provide accurate, detailed, and clear responses."

        # Get model-specific template
        template = self.model_specific_templates.get(model, self.default_template)

        # If task type is provided, incorporate task-specific optimizations
        if task_type and task_type in self.task_templates:
            task_template = self.task_templates[task_type]

            # Merge templates, with task template taking precedence for non-empty values
            merged_template = {
                "prefix": task_template["prefix"] if task_template["prefix"] else template["prefix"],
                "suffix": task_template["suffix"] if task_template["suffix"] else template["suffix"],
                "instruction_format": task_template["instruction_format"]
            }

            template = merged_template

        # Apply template
        optimized_prompt = f"{template['prefix']}{original_prompt}"

        # Add suffix if it doesn't appear to already be present
        if template["suffix"] and template["suffix"] not in optimized_prompt:
            optimized_prompt += f" {template['suffix']}"

        return optimized_prompt

    def optimize_user_prompt(self, original_prompt: str, model: str, task_type: Optional[str] = None) -> str:
        """Optimize the user prompt for the specific model and task."""
        if not original_prompt:
            return original_prompt

        # Auto-detect task type if not provided
        if not task_type:
            task_type = self.detect_task_type(original_prompt)

        # Get model-specific template
        template = self.model_specific_templates.get(model, self.default_template)

        # If task type is provided, incorporate task-specific optimizations
        if task_type and task_type in self.task_templates:
            task_template = self.task_templates[task_type]
            # Use task instruction format if available
            instruction_format = task_template["instruction_format"]
        else:
            instruction_format = template["instruction_format"]

        # Apply instruction format if the prompt doesn't already look formatted
        if "{instruction}" in instruction_format and not re.match(r"^(task|problem|prompt|question):", original_prompt.lower()):
            formatted_prompt = instruction_format.replace("{instruction}", original_prompt)
            return formatted_prompt

        return original_prompt

    def optimize_messages(self, messages: List[Dict[str, str]], model: str) -> List[Dict[str, str]]:
        """Optimize all messages in a conversation for the specific model."""
        if not messages:
            return messages

        # Try to detect task type from the user messages
        task_type = None
        for msg in messages:
            if msg.get("role") == "user" and msg.get("content"):
                detected_task = self.detect_task_type(msg["content"])
                if detected_task:
                    task_type = detected_task
                    break

        optimized = []

        for msg in messages:
            role = msg.get("role", "")
            content = msg.get("content", "")

            if role == "system" and content:
                optimized_content = self.optimize_system_prompt(content, model, task_type)
                optimized.append({"role": role, "content": optimized_content})
            elif role == "user" and content:
                optimized_content = self.optimize_user_prompt(content, model, task_type)
                optimized.append({"role": role, "content": optimized_content})
            else:
                # Keep other messages unchanged
                optimized.append(msg)

        return optimized
```
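The merge rule in `optimize_system_prompt` (task template wins for non-empty fields, otherwise the model template applies) can be seen with a small standalone sketch. The helper names `merge_templates` and `apply_system_template` are illustrative only; the template values are copied from the `mistral` and `creative_writing` entries above.

```python
from typing import Dict

Template = Dict[str, str]


def merge_templates(model_template: Template, task_template: Template) -> Template:
    """Task values take precedence when non-empty; otherwise fall back to the model's."""
    return {
        "prefix": task_template["prefix"] or model_template["prefix"],
        "suffix": task_template["suffix"] or model_template["suffix"],
        "instruction_format": task_template["instruction_format"],
    }


def apply_system_template(prompt: str, template: Template) -> str:
    """Prepend the prefix and append the suffix unless it is already present."""
    result = f"{template['prefix']}{prompt}"
    if template["suffix"] and template["suffix"] not in result:
        result += f" {template['suffix']}"
    return result


MISTRAL = {
    "prefix": "",
    "suffix": "Take a deep breath and work on this step-by-step.",
    "instruction_format": "{instruction}",
}
CREATIVE = {
    "prefix": "You are a creative writer with excellent storytelling abilities. ",
    "suffix": "",
    "instruction_format": "Creative Writing Prompt: {instruction}",
}

if __name__ == "__main__":
    merged = merge_templates(MISTRAL, CREATIVE)
    print(apply_system_template("Be concise.", merged))
```

Because the creative-writing template has an empty suffix, the merged template keeps Mistral's step-by-step suffix while taking the task's prefix and instruction format.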
Cost Reduction Strategies
1. Token Usage Optimization
```python
# app/services/token_optimizer.py
import logging
import re
from typing import List, Dict, Any, Optional, Tuple
import tiktoken

logger = logging.getLogger(__name__)

class TokenOptimizer:
    """Optimizes token usage to reduce costs."""

    def __init__(self):
        # Load tokenizers once
        try:
            self.gpt3_tokenizer = tiktoken.encoding_for_model("gpt-3.5-turbo")
            self.gpt4_tokenizer = tiktoken.encoding_for_model("gpt-4")
        except Exception as e:
            logger.warning(f"Could not load tokenizers: {str(e)}. Falling back to approximate counting.")
            self.gpt3_tokenizer = None
            self.gpt4_tokenizer = None

    def count_tokens(self, text: str, model: str = "gpt-3.5-turbo") -> int:
        """Count the number of tokens in a text string for a specific model."""
        if not text:
            return 0

        # Use appropriate tokenizer if available
        if model.startswith("gpt-4") and self.gpt4_tokenizer:
            return len(self.gpt4_tokenizer.encode(text))
        elif model.startswith("gpt-3") and self.gpt3_tokenizer:
            return len(self.gpt3_tokenizer.encode(text))

        # Fallback to approximation (~ 4 chars per token for English)
        return len(text) // 4 + 1

    def count_message_tokens(self, messages: List[Dict[str, str]], model: str = "gpt-3.5-turbo") -> int:
        """Count tokens in a full message array."""
        if not messages:
            return 0

        total = 0

        # Different models have different message formatting overheads
        if model.startswith("gpt-3.5-turbo"):
            # Per OpenAI's formula for message token counting
            # See: https://github.com/openai/openai-cookbook/blob/main/examples/How_to_count_tokens_with_tiktoken.ipynb
            total += 3  # Every message follows <im_start>{role/name}\n{content}<im_end>\n

            for message in messages:
                total += 3  # Role overhead
                for key, value in message.items():
                    if key == "name":  # Name is 1 token
                        total += 1
                    if key == "content" and value:
                        total += self.count_tokens(value, model)

            total += 3  # Assistant response overhead

        elif model.startswith("gpt-4"):
            # Similar formula for GPT-4
            total += 3

            for message in messages:
                total += 3
                for key, value in message.items():
                    if key == "name":
                        total += 1
                    if key == "content" and value:
                        total += self.count_tokens(value, model)

            total += 3

        else:
            # Simple approach for other models
            for message in messages:
                content = message.get("content", "")
                if content:
                    total += self.count_tokens(content, model)

        return total

    def truncate_messages(self,
                          messages: List[Dict[str, str]],
                          max_tokens: int,
                          model: str = "gpt-3.5-turbo",
                          preserve_system: bool = True,
                          preserve_last_n_exchanges: int = 2) -> List[Dict[str, str]]:
        """Truncate conversation history to fit within token limit."""
        if not messages:
            return messages

        # Clone messages to avoid modifying the original
        messages = [m.copy() for m in messages]

        # Remember each message's position by identity so the final sort is
        # stable even when two messages have identical content
        order = {id(m): i for i, m in enumerate(messages)}

        current_tokens = self.count_message_tokens(messages, model)

        # If already under the limit, return as is
        if current_tokens <= max_tokens:
            return messages

        # Identify system and user/assistant pairs
        system_messages = [m for m in messages if m.get("role") == "system"]
        system_tokens = sum(self.count_tokens(m.get("content", ""), model) for m in system_messages)

        # Extract exchanges (user followed by assistant message)
        exchanges = []
        current_exchange = []

        for m in messages:
            if m.get("role") == "system":
                continue

            current_exchange.append(m)

            # If we have a user+assistant pair, add to exchanges and reset
            if len(current_exchange) == 2 and current_exchange[0].get("role") == "user" and current_exchange[1].get("role") == "assistant":
                exchanges.append(current_exchange)
                current_exchange = []

        # Add any remaining messages
        if current_exchange:
            exchanges.append(current_exchange)

        # Calculate tokens needed for essential parts
        essential_tokens = system_tokens if preserve_system else 0

        # Add tokens for the last N exchanges
        # (guard n == 0 explicitly: exchanges[-0:] would return every exchange)
        last_n_exchanges = exchanges[-preserve_last_n_exchanges:] if preserve_last_n_exchanges > 0 else []
        last_n_tokens = sum(
            self.count_tokens(m.get("content", ""), model)
            for exchange in last_n_exchanges
            for m in exchange
        )

        essential_tokens += last_n_tokens

        # If essential parts already exceed the limit, we need more aggressive truncation
        if essential_tokens > max_tokens:
            logger.warning(f"Essential conversation parts exceed token limit: {essential_tokens} > {max_tokens}")

            # Start by keeping system messages if requested
            result = system_messages.copy() if preserve_system else []

            # Add as many of the last exchanges as we can fit
            remaining_tokens = max_tokens - sum(self.count_tokens(m.get("content", ""), model) for m in result)

            for exchange in reversed(last_n_exchanges):
                exchange_tokens = sum(self.count_tokens(m.get("content", ""), model) for m in exchange)

                if exchange_tokens <= remaining_tokens:
                    result.extend(exchange)
                    remaining_tokens -= exchange_tokens
                else:
                    # If we can't fit the whole exchange, try truncating the assistant response
                    if len(exchange) == 2:
                        user_msg = exchange[0]
                        assistant_msg = exchange[1].copy()
                        # The truncated copy keeps the original message's position
                        order[id(assistant_msg)] = order[id(exchange[1])]

                        user_tokens = self.count_tokens(user_msg.get("content", ""), model)

                        if user_tokens < remaining_tokens:
                            # We can include the user message
                            result.append(user_msg)
                            remaining_tokens -= user_tokens

                            # Truncate the assistant message to fit
                            assistant_content = assistant_msg.get("content", "")
                            if assistant_content:
                                # Simple truncation - in a real system, you'd want more intelligent truncation
                                chars_to_keep = int(remaining_tokens * 4)  # Approximate char count
                                truncated_content = assistant_content[:chars_to_keep] + "... [truncated]"
                                assistant_msg["content"] = truncated_content
                                result.append(assistant_msg)

                    break

            # Re-sort the messages to maintain the original order
            result.sort(key=lambda m: order.get(id(m), len(messages)))
            return result

        # If we get here, we can keep all essential parts and need to drop from the middle
        result = system_messages.copy() if preserve_system else []
        middle_exchanges = exchanges[:-preserve_last_n_exchanges] if preserve_last_n_exchanges > 0 else exchanges

        # Calculate how many tokens we can allocate to middle exchanges
        remaining_tokens = max_tokens - essential_tokens

        # Add exchanges from the middle, newest first, until we run out of tokens
        for exchange in reversed(middle_exchanges):
            exchange_tokens = sum(self.count_tokens(m.get("content", ""), model) for m in exchange)

            if exchange_tokens <= remaining_tokens:
                result.extend(exchange)
                remaining_tokens -= exchange_tokens
            else:
                break

        # Add the preserved last exchanges
        for exchange in last_n_exchanges:
            result.extend(exchange)

        # Sort messages to maintain the original order
        result.sort(key=lambda m: order.get(id(m), len(messages)))

        # Verify the result is within the token limit
        final_tokens = self.count_message_tokens(result, model)
        if final_tokens > max_tokens:
            logger.warning(f"Truncation failed to meet target: {final_tokens} > {max_tokens}")

        return result

    def compress_system_prompt(self, system_prompt: str, max_tokens: int, model: str = "gpt-3.5-turbo") -> str:
        """Compress a system prompt to use fewer tokens while preserving key information."""
        current_tokens = self.count_tokens(system_prompt, model)

        if current_tokens <= max_tokens:
            return system_prompt

        # Use a language model to compress the prompt
        # In a real implementation, you might want to call an external service

        # Fallback compression strategy: Use text summarization techniques
        # 1. Remove redundant phrases
        # ("Never" is deliberately not stripped: removing it would invert
        # the meaning of the instruction that follows)
        redundant_phrases = [
            "Please note that", "It's important to remember that", "Keep in mind that",
            "I want you to", "I'd like you to", "You should", "Make sure to",
            "Always", "Remember to"
        ]

        compressed = system_prompt
        for phrase in redundant_phrases:
            compressed = compressed.replace(phrase, "")

        # 2. Replace verbose constructions with shorter ones
        replacements = {
            "in order to": "to",
            "for the purpose of": "for",
            "due to the fact that": "because",
            "in the event that": "if",
            "on the condition that": "if",
            "with regard to": "about",
            "in relation to": "about"
        }

        for verbose, concise in replacements.items():
            compressed = compressed.replace(verbose, concise)

        # 3. Remove unnecessary whitespace
        compressed = re.sub(r'\s+', ' ', compressed).strip()

        # 4. If still over the limit, truncate with an ellipsis
        compressed_tokens = self.count_tokens(compressed, model)
        if compressed_tokens > max_tokens:
            # Approximation: 4 characters per token
            char_limit = max_tokens * 4
            compressed = compressed[:char_limit] + "..."

        return compressed

    def optimize_messages_for_cost(self,
                                   messages: List[Dict[str, str]],
                                   model: str,
                                   max_tokens: int = 4096) -> List[Dict[str, str]]:
        """Fully optimize messages for cost efficiency."""
        if not messages:
            return messages

        # 1. First, identify system messages for compression
        system_messages = []
        other_messages = []

        for msg in messages:
            if msg.get("role") == "system":
                system_messages.append(msg)
            else:
                other_messages.append(msg)

        # 2. Compress system messages if there are multiple
        if len(system_messages) > 1:
            # Combine multiple system messages
            combined_content = " ".join(msg.get("content", "") for msg in system_messages)
            compressed_content = self.compress_system_prompt(combined_content, 1024, model)

            # Replace with a single compressed message
            system_messages = [{"role": "system", "content": compressed_content}]
        elif len(system_messages) == 1 and self.count_tokens(system_messages[0].get("content", ""), model) > 1024:
            # Compress a single long system message (on a copy, so the caller's dict is untouched)
            system_messages = [{
                **system_messages[0],
                "content": self.compress_system_prompt(
                    system_messages[0].get("content", ""), 1024, model
                )
            }]

        # 3. Recombine and truncate the full conversation
        optimized = system_messages + other_messages
        reserved_completion_tokens = max(max_tokens // 4, 1024)  # Reserve 25% or at least 1024 tokens for completion
        max_prompt_tokens = max_tokens - reserved_completion_tokens

        return self.truncate_messages(
            optimized,
            max_prompt_tokens,
            model,
            preserve_system=True,
            preserve_last_n_exchanges=2
        )
```
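The budget split at the end of `optimize_messages_for_cost` and the character-based fallback in `count_tokens` are plain arithmetic, so a quick sketch makes the numbers concrete. The function names below are illustrative, not part of `TokenOptimizer`. Note that the formula as written yields a negative prompt budget when the context limit is below 1024 tokens, so it implicitly assumes larger context windows.

```python
from typing import Tuple


def split_token_budget(max_tokens: int) -> Tuple[int, int]:
    """Return (prompt_budget, completion_reserve) using the rule above:
    reserve 25% of the context, but never less than 1024 tokens."""
    completion_reserve = max(max_tokens // 4, 1024)
    return max_tokens - completion_reserve, completion_reserve


def approx_token_count(text: str) -> int:
    """Fallback estimate: roughly 4 characters per token for English text."""
    if not text:
        return 0
    return len(text) // 4 + 1


if __name__ == "__main__":
    print(split_token_budget(4096))    # 1024-token floor applies
    print(split_token_budget(16384))   # 25% rule applies
    print(approx_token_count("hello world"))
```

For a 4096-token context, the floor and the 25% rule coincide at 1024 tokens, leaving 3072 tokens for the prompt; at 16384 tokens the reserve grows to 4096.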
2. Model Tier Selection
python1# app/services/model_tier_service.py2import logging3from typing import Dict, List, Any, Optional, Tuple4import re5import time67from app.config import settings89logger = logging.getLogger(__name__)1011class ModelTierService:12 """Selects the appropriate model tier based on task requirements and budget constraints."""1314 def __init__(self):15 # Cost per 1000 tokens for different models (approximate)16 self.model_costs = {17 # OpenAI models input/output costs18 "gpt-4": {"input": 0.03, "output": 0.06},19 "gpt-4-32k": {"input": 0.06, "output": 0.12},20 "gpt-4-turbo": {"input": 0.01, "output": 0.03},21 "gpt-3.5-turbo": {"input": 0.0015, "output": 0.002},22 "gpt-3.5-turbo-16k": {"input": 0.003, "output": 0.004},2324 # Ollama models (local, so effectively zero API cost)25 "llama2": {"input": 0, "output": 0},26 "mistral": {"input": 0, "output": 0},27 "codellama": {"input": 0, "output": 0}28 }2930 # Model capabilities and appropriate use cases31 self.model_capabilities = {32 "gpt-4": ["complex_reasoning", "creative", "code", "math", "general"],33 "gpt-4-turbo": ["complex_reasoning", "creative", "code", "math", "general"],34 "gpt-3.5-turbo": ["simple_reasoning", "general", "summarization"],35 "llama2": ["simple_reasoning", "general", "summarization"],36 "mistral": ["simple_reasoning", "general", "creative"],37 "codellama": ["code"]38 }3940 # Default model selections for different task types41 self.task_model_mapping = {42 "complex_reasoning": {43 "high": "gpt-4-turbo",44 "medium": "gpt-4-turbo",45 "low": "gpt-3.5-turbo"46 },47 "simple_reasoning": {48 "high": "gpt-3.5-turbo",49 "medium": "gpt-3.5-turbo",50 "low": "mistral"51 },52 "creative": {53 "high": "gpt-4-turbo",54 "medium": "mistral",55 "low": "mistral"56 },57 "code": {58 "high": "gpt-4-turbo",59 "medium": "codellama",60 "low": "codellama"61 },62 "math": {63 "high": "gpt-4-turbo",64 "medium": "gpt-3.5-turbo",65 "low": "mistral"66 },67 "general": {68 "high": "gpt-3.5-turbo",69 "medium": "mistral",70 "low": 
"llama2"71 },72 "summarization": {73 "high": "gpt-3.5-turbo",74 "medium": "mistral",75 "low": "llama2"76 }77 }7879 # Budget tier thresholds - what percentage of budget is remaining?80 self.budget_tiers = {81 "high": 0.6, # >60% of budget remaining82 "medium": 0.3, # 30-60% of budget remaining83 "low": 0.0 # <30% of budget remaining84 }8586 # Initialize usage tracking87 self.monthly_budget = settings.MONTHLY_BUDGET88 self.usage_this_month = 089 self.month_start_timestamp = self._get_month_start_timestamp()9091 def _get_month_start_timestamp(self) -> int:92 """Get timestamp for the start of the current month."""93 import datetime94 now = datetime.datetime.now()95 month_start = datetime.datetime(now.year, now.month, 1)96 return int(month_start.timestamp())9798 def detect_task_type(self, query: str) -> str:99 """Detect the type of task from the query."""100 query_lower = query.lower()101102 # Check for code-related tasks103 code_indicators = [104 "code", "function", "program", "algorithm", "javascript",105 "python", "java", "c++", "typescript", "html", "css"106 ]107 if any(indicator in query_lower for indicator in code_indicators):108 return "code"109110 # Check for math problems111 math_indicators = [112 "calculate", "solve", "equation", "math problem", "compute",113 "derivative", "integral", "algebra", "calculus", "arithmetic"114 ]115 if any(indicator in query_lower for indicator in math_indicators):116 return "math"117118 # Check for creative tasks119 creative_indicators = [120 "story", "poem", "creative", "imagine", "fiction", "fantasy",121 "character", "novel", "script", "narrative", "write a"122 ]123 if any(indicator in query_lower for indicator in creative_indicators):124 return "creative"125126 # Check for complex reasoning127 complex_indicators = [128 "analyze", "critique", "evaluate", "compare and contrast",129 "implications", "consequences", "recommend", "strategy",130 "detailed explanation", "comprehensive", "thorough"131 ]132 if any(indicator in 
query_lower for indicator in complex_indicators):
            return "complex_reasoning"

        # Check for summarization
        summary_indicators = [
            "summarize", "summary", "tldr", "briefly explain", "short version",
            "key points", "main ideas"
        ]
        if any(indicator in query_lower for indicator in summary_indicators):
            return "summarization"

        # Check for simple reasoning
        simple_indicators = [
            "explain", "how", "why", "what", "when", "who", "where",
            "help me understand", "tell me about"
        ]
        if any(indicator in query_lower for indicator in simple_indicators):
            return "simple_reasoning"

        # Fall back to general when no specific category is detected
        return "general"

    def get_current_budget_tier(self) -> str:
        """Get the current budget tier based on monthly usage."""
        # Check if we're in a new month
        current_month_start = self._get_month_start_timestamp()
        if current_month_start > self.month_start_timestamp:
            # Reset for new month
            self.month_start_timestamp = current_month_start
            self.usage_this_month = 0

        if self.monthly_budget <= 0:
            # No budget constraints
            return "high"

        # Fraction of the monthly budget that is still available (0.0 to 1.0)
        remaining_percentage = 1 - (self.usage_this_month / self.monthly_budget)

        # Determine tier
        if remaining_percentage > self.budget_tiers["high"]:
            return "high"
        elif remaining_percentage > self.budget_tiers["medium"]:
            return "medium"
        else:
            return "low"

    def record_usage(self, model: str, input_tokens: int, output_tokens: int) -> None:
        """Record token usage for budget tracking."""
        if model not in self.model_costs:
            return

        # Costs are expressed per 1,000 tokens
        costs = self.model_costs[model]
        input_cost = (input_tokens / 1000) * costs["input"]
        output_cost = (output_tokens / 1000) * costs["output"]
        total_cost = input_cost + output_cost

        self.usage_this_month += total_cost

        # Log for monitoring
        logger.info(
            f"Usage recorded: {model}, {input_tokens} input tokens, "
            f"{output_tokens} output tokens, ${total_cost:.4f}"
        )

    def select_optimal_model(self,
                             query: str,
                             preferred_provider: Optional[str] = None,
                             force_tier: Optional[str] = None) -> Tuple[str, str]:
        """
        Select the optimal model based on the query and budget constraints.
        Returns a tuple of (provider, model)
        """
        # Models served locally through Ollama; everything else is treated as OpenAI
        ollama_models = ("llama2", "mistral", "codellama")

        # Detect task type
        task_type = self.detect_task_type(query)

        # Get budget tier (unless forced)
        budget_tier = force_tier if force_tier else self.get_current_budget_tier()

        # Get the recommended model for this task and budget tier
        recommended_model = self.task_model_mapping[task_type][budget_tier]

        # Determine provider based on model
        provider = "ollama" if recommended_model in ollama_models else "openai"

        # Override provider if specified and compatible
        if preferred_provider:
            if preferred_provider == "ollama" and provider == "openai":
                # Find an Ollama alternative for this task
                for model, capabilities in self.model_capabilities.items():
                    if task_type in capabilities and model in ollama_models:
                        recommended_model = model
                        provider = "ollama"
                        break
            elif preferred_provider == "openai" and provider == "ollama":
                # Find an OpenAI alternative for this task
                for model, capabilities in self.model_capabilities.items():
                    if task_type in capabilities and model not in ollama_models:
                        recommended_model = model
                        provider = "openai"
                        break

        logger.info(f"Selected model for task '{task_type}' (tier: {budget_tier}): {provider}:{recommended_model}")
        return provider, recommended_model

    def estimate_cost(self, model: str, input_tokens: int, expected_output_tokens: int) -> float:
        """Estimate the cost of a request."""
        if model not in self.model_costs:
            return 0.0

        costs = self.model_costs[model]
        input_cost = (input_tokens / 1000) * costs["input"]
        output_cost = (expected_output_tokens / 1000) * costs["output"]

        return input_cost + output_cost
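To make the cost and budget-tier arithmetic above concrete, here is a minimal standalone sketch. The model names, per-1K-token prices, and tier thresholds are placeholder assumptions chosen for illustration; they are not real provider prices, nor the values configured in the service above.

```python
# Hypothetical per-1K-token prices in USD (placeholders, not real price lists).
MODEL_COSTS = {
    "cloud-large": {"input": 0.01, "output": 0.03},
    "local-model": {"input": 0.0, "output": 0.0},
}

# Assumed thresholds: fraction of the monthly budget that must remain
# for a tier to apply.
BUDGET_TIERS = {"high": 0.5, "medium": 0.2}


def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Price a request from token counts; unknown models cost nothing."""
    costs = MODEL_COSTS.get(model)
    if costs is None:
        return 0.0
    return (input_tokens / 1000) * costs["input"] + (output_tokens / 1000) * costs["output"]


def budget_tier(usage: float, monthly_budget: float) -> str:
    """Map the remaining share of the budget to a tier name."""
    if monthly_budget <= 0:
        return "high"  # no budget constraint configured
    remaining = 1 - usage / monthly_budget
    if remaining > BUDGET_TIERS["high"]:
        return "high"
    if remaining > BUDGET_TIERS["medium"]:
        return "medium"
    return "low"


if __name__ == "__main__":
    # 2,000 input tokens and 500 output tokens on the hypothetical cloud model
    print(f"${estimate_cost('cloud-large', 2000, 500):.4f}")
    for spent in (30.0, 60.0, 85.0):
        print(spent, budget_tier(spent, 100.0))
```

With these assumed numbers, spending moves a 100-dollar budget from the "high" tier through "medium" to "low" as the remaining share drops below 50% and then 20%.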
3. Local Model Prioritization for Development
```python
# app/services/dev_mode_service.py
import logging
import os
import re
from typing import Any, Dict

logger = logging.getLogger(__name__)


class DevModeService:
    """
    Service that prioritizes local models during development to reduce costs.
    """

    def __init__(self):
        # Read environment to determine if we're in development mode
        self.is_dev_mode = os.environ.get("APP_ENV", "development").lower() == "development"
        self.dev_mode_forced = os.environ.get("FORCE_DEV_MODE", "false").lower() == "true"

        # Set up developer-focused settings
        self.allow_openai_for_patterns = [
            r"(complex|sophisticated|advanced)\s+(reasoning|analysis)",
            r"(gpt-4|gpt-3\.5|openai)"  # Explicit requests for OpenAI models
        ]

        # \b (word boundary) also matches a bare "hello" or "test:",
        # which a trailing \s would miss
        self.use_ollama_for_patterns = [
            r"^test\b",   # Queries starting with "test"
            r"^debug\b",  # Debugging queries
            r"^hello\b",  # Simple greetings
            r"^hi\b",
            r"^try\b"
        ]

        # Track usage for reporting
        self.openai_requests = 0
        self.ollama_requests = 0
        self.redirected_requests = 0

    def is_development_environment(self) -> bool:
        """Check if we're running in a development environment."""
        return self.is_dev_mode or self.dev_mode_forced

    def should_use_local_model(self, query: str) -> bool:
        """
        Determine if a query should use local models in development mode.
        In development, we default to local models unless specific patterns are matched.
        """
        if not self.is_development_environment():
            return False

        # Always use local models for specific patterns
        for pattern in self.use_ollama_for_patterns:
            if re.search(pattern, query, re.IGNORECASE):
                return True

        # Allow OpenAI for specific advanced patterns
        for pattern in self.allow_openai_for_patterns:
            if re.search(pattern, query, re.IGNORECASE):
                return False

        # In development, default to local models to save costs
        return True

    def get_dev_routing_decision(self, query: str, default_provider: str) -> str:
        """
        Make a routing decision based on development mode settings.
        Returns: "openai" or "ollama"
        """
        if not self.is_development_environment():
            return default_provider

        should_use_local = self.should_use_local_model(query)

        # Track for reporting
        if should_use_local:
            self.ollama_requests += 1
            if default_provider == "openai":
                self.redirected_requests += 1
        else:
            self.openai_requests += 1

        return "ollama" if should_use_local else "openai"

    def get_usage_report(self) -> Dict[str, Any]:
        """Get a report of usage patterns for monitoring costs."""
        total_requests = self.openai_requests + self.ollama_requests

        if total_requests == 0:
            ollama_percentage = 0
            redirected_percentage = 0
        else:
            ollama_percentage = (self.ollama_requests / total_requests) * 100
            redirected_percentage = (self.redirected_requests / total_requests) * 100

        return {
            "dev_mode_active": self.is_development_environment(),
            "total_requests": total_requests,
            "openai_requests": self.openai_requests,
            "ollama_requests": self.ollama_requests,
            "redirected_to_ollama": self.redirected_requests,
            "ollama_usage_percentage": ollama_percentage,
            # Share of requests moved from OpenAI to Ollama, used as a proxy for savings
            "cost_savings_percentage": redirected_percentage
        }
```
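The routing order used by the service (local patterns first, then the OpenAI allow-list, then a local default) can be tried in isolation. The sketch below is a compact, hypothetical re-implementation for experimentation, not the service itself; the `route` function and its `dev_mode` flag are illustrative names.

```python
import re

# Queries that always stay local in development (anchored at the start).
LOCAL_PATTERNS = [r"^(test|debug|hello|hi|try)\b"]

# Queries that may go to OpenAI even in development.
CLOUD_PATTERNS = [
    r"(complex|sophisticated|advanced)\s+(reasoning|analysis)",
    r"(gpt-4|gpt-3\.5|openai)",
]


def route(query: str, default_provider: str, dev_mode: bool) -> str:
    """Return "ollama" or "openai" for a query."""
    if not dev_mode:
        return default_provider  # production keeps the caller's choice
    if any(re.search(p, query, re.IGNORECASE) for p in LOCAL_PATTERNS):
        return "ollama"
    if any(re.search(p, query, re.IGNORECASE) for p in CLOUD_PATTERNS):
        return "openai"
    return "ollama"  # development default: stay local to save cost


if __name__ == "__main__":
    for q in ("test the endpoint",
              "I need advanced analysis of this contract",
              "Summarize this file"):
        print(f"{q!r} -> {route(q, 'openai', dev_mode=True)}")
```

Because the local patterns are checked first, a query such as "test gpt-4 output" stays on Ollama even though it names an OpenAI model; swap the two checks if explicit model requests should win.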
4. Request Batching and Rate Limiting
```python
# app/services/rate_limiter.py
import asyncio
import logging
import time
from typing import Any, Dict, Optional

import redis.asyncio as redis

from app.config import settings

logger = logging.getLogger(__name__)

# Length of each per-user counting window, in seconds
WINDOW_SECONDS = {"minute": 60, "hour": 3600, "day": 86400}


class RateLimiter:
    """
    Rate limiter to control API usage and costs.
    Implements tiered rate limiting based on user roles.
    """

    def __init__(self):
        self.redis = None
        self._replenish_task: Optional[asyncio.Task] = None

        # Rate limit tiers (requests per time window)
        self.rate_limit_tiers = {
            "free": {"minute": 5, "hour": 20, "day": 100},
            "basic": {"minute": 20, "hour": 100, "day": 1000},
            "premium": {"minute": 60, "hour": 1000, "day": 10000},
            "enterprise": {"minute": 120, "hour": 5000, "day": 50000}
        }

        # Provider-specific rate limits (global)
        self.provider_rate_limits = {
            "openai": {
                "minute": 60,  # Shared across all users
                "tokens_per_minute": 90000  # Token budget per minute
            },
            "ollama": {
                "minute": 100,  # Higher for local models
                "tokens_per_minute": 250000
            }
        }

        # Tracking for available token budgets
        self.token_budgets = {
            "openai": self.provider_rate_limits["openai"]["tokens_per_minute"],
            "ollama": self.provider_rate_limits["ollama"]["tokens_per_minute"]
        }
        self.last_budget_reset = time.time()

    async def initialize(self):
        """Initialize Redis connection."""
        self.redis = await redis.from_url(settings.REDIS_URL)

        # Start token budget replenishment task. Keep a reference so the task
        # is not garbage-collected while it is still running.
        self._replenish_task = asyncio.create_task(self._token_budget_replenishment())

    async def _token_budget_replenishment(self):
        """Periodically replenish token budgets."""
        while True:
            try:
                now = time.time()
                elapsed = now - self.last_budget_reset

                # Reset every minute
                if elapsed >= 60:
                    self.token_budgets = {
                        "openai": self.provider_rate_limits["openai"]["tokens_per_minute"],
                        "ollama": self.provider_rate_limits["ollama"]["tokens_per_minute"]
                    }
                    self.last_budget_reset = now

                # Partial replenishment for less than a minute
                else:
                    # Calculate replenishment based on elapsed time
                    openai_replenishment = int((elapsed / 60) * self.provider_rate_limits["openai"]["tokens_per_minute"])
                    ollama_replenishment = int((elapsed / 60) * self.provider_rate_limits["ollama"]["tokens_per_minute"])

                    # Replenish up to max
                    self.token_budgets["openai"] = min(
                        self.token_budgets["openai"] + openai_replenishment,
                        self.provider_rate_limits["openai"]["tokens_per_minute"]
                    )
                    self.token_budgets["ollama"] = min(
                        self.token_budgets["ollama"] + ollama_replenishment,
                        self.provider_rate_limits["ollama"]["tokens_per_minute"]
                    )

                    self.last_budget_reset = now
            except Exception as e:
                logger.error(f"Error in token budget replenishment: {str(e)}")

            # Update every 5 seconds
            await asyncio.sleep(5)

    async def check_rate_limit(self,
                               user_id: str,
                               tier: str = "free",
                               provider: str = "openai") -> Dict[str, Any]:
        """
        Check if a request is within rate limits.
        Returns: {"allowed": bool, "retry_after": Optional[int], "reason": Optional[str]}
        """
        if not self.redis:
            # If Redis is not available, allow the request but log a warning
            logger.warning("Redis not available for rate limiting")
            return {"allowed": True}

        # Get rate limits for this user's tier
        tier_limits = self.rate_limit_tiers.get(tier, self.rate_limit_tiers["free"])

        # Check user-specific rate limits
        for window, limit in tier_limits.items():
            key = f"rate:user:{user_id}:{window}"

            # Get current count
            count = await self.redis.get(key)
            count = int(count) if count else 0

            if count >= limit:
                ttl = await self.redis.ttl(key)
                return {
                    "allowed": False,
                    "retry_after": max(1, ttl),
                    "reason": f"Rate limit exceeded for {window}"
                }

        # Check provider-specific rate limits
        provider_limits = self.provider_rate_limits.get(provider, {})
        if "minute" in provider_limits:
            provider_key = f"rate:provider:{provider}:minute"
            provider_count = await self.redis.get(provider_key)
            provider_count = int(provider_count) if provider_count else 0

            if provider_count >= provider_limits["minute"]:
                ttl = await self.redis.ttl(provider_key)
                return {
                    "allowed": False,
                    "retry_after": max(1, ttl),
                    "reason": f"Global {provider} rate limit exceeded"
                }

        # Check token budget
        if provider in self.token_budgets and self.token_budgets[provider] <= 0:
            # Conservative estimate: the budget refills gradually every few
            # seconds, so tokens may become available sooner than this
            time_since_reset = time.time() - self.last_budget_reset
            time_until_refresh = max(1, int(60 - time_since_reset))

            return {
                "allowed": False,
                "retry_after": time_until_refresh,
                "reason": f"{provider} token budget exhausted"
            }

        # All checks passed
        return {"allowed": True}

    async def increment_counters(self,
                                 user_id: str,
                                 provider: str,
                                 token_count: int = 0) -> None:
        """Increment rate limit counters after a successful request."""
        if not self.redis:
            return

        # Per-user windows plus the global per-provider minute window
        keys_and_ttls = [
            (f"rate:user:{user_id}:{window}", seconds)
            for window, seconds in WINDOW_SECONDS.items()
        ]
        keys_and_ttls.append((f"rate:provider:{provider}:minute", 60))

        # Increment all counters in one round trip
        pipeline = self.redis.pipeline()
        for key, _ in keys_and_ttls:
            pipeline.incr(key)
        counts = await pipeline.execute()

        # Start the expiry clock only when a counter is first created.
        # Refreshing the expiry on every request would keep pushing the
        # window forward, so a busy user's counter would never reset.
        new_keys = [(key, ttl) for (key, ttl), count in zip(keys_and_ttls, counts) if count == 1]
        if new_keys:
            expiry_pipeline = self.redis.pipeline()
            for key, ttl in new_keys:
                expiry_pipeline.expire(key, ttl)
            await expiry_pipeline.execute()

        # Decrement token budget
        if provider in self.token_budgets and token_count > 0:
            self.token_budgets[provider] = max(0, self.token_budgets[provider] - token_count)

    async def get_user_usage(self, user_id: str) -> Dict[str, Any]:
        """Get current usage statistics for a user."""
        windows = list(WINDOW_SECONDS)

        if not self.redis:
            # Same shape as the Redis-backed result so callers need no special case
            return {
                window: {"usage": 0, "reset_in": WINDOW_SECONDS[window]}
                for window in windows
            }

        pipeline = self.redis.pipeline()

        # Get counts for all windows
        for window in windows:
            pipeline.get(f"rate:user:{user_id}:{window}")

        # Get TTLs (time remaining)
        for window in windows:
            pipeline.ttl(f"rate:user:{user_id}:{window}")

        results = await pipeline.execute()
        counts, ttls = results[:len(windows)], results[len(windows):]

        return {
            window: {
                "usage": int(count) if count else 0,
                "reset_in": ttl if ttl and ttl > 0 else WINDOW_SECONDS[window]
            }
            for window, count, ttl in zip(windows, counts, ttls)
        }
```
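The per-provider token budget can also be modeled as a classic token bucket that refills continuously instead of on a timer. The following in-memory sketch is an alternative illustration of that idea, not the Redis-backed implementation above; the `TokenBucket` class and its injectable `clock` parameter are assumptions introduced so the behavior can be exercised without waiting for real time to pass.

```python
import time
from typing import Callable


class TokenBucket:
    """Continuously refilled token budget (in-memory, single process)."""

    def __init__(self,
                 capacity: int,
                 window_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic):
        self.capacity = capacity
        self.rate = capacity / window_seconds  # tokens restored per second
        self.clock = clock
        self.tokens = float(capacity)
        self.last_refill = clock()

    def _refill(self) -> None:
        """Credit tokens for the time elapsed since the last refill."""
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.last_refill) * self.rate)
        self.last_refill = now

    def try_consume(self, amount: int) -> bool:
        """Spend `amount` tokens if available; otherwise leave the bucket untouched."""
        self._refill()
        if amount > self.tokens:
            return False
        self.tokens -= amount
        return True

    def seconds_until(self, amount: int) -> float:
        """How long a caller should wait before `amount` tokens are available."""
        self._refill()
        return max(0.0, amount - self.tokens) / self.rate


if __name__ == "__main__":
    fake_now = [0.0]  # controllable clock for the demonstration
    bucket = TokenBucket(capacity=90_000, clock=lambda: fake_now[0])
    print(bucket.try_consume(80_000))
    print(bucket.try_consume(20_000))
    print(round(bucket.seconds_until(20_000), 2))
    fake_now[0] += 10
    print(bucket.try_consume(20_000))
```

One design difference worth noting: this sketch reserves tokens before a request is sent, whereas the service above subtracts the actual token count after the request completes. `seconds_until` also gives a more precise `retry_after` value than a fixed one-minute estimate.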
5. Memory and Context Compression
```python
# app/services/context_compression.py
import logging
import re
from typing import Any, Dict, List

logger = logging.getLogger(__name__)


class ContextCompressor:
    """
    Compresses conversation history to reduce token usage while preserving context.
    """

    def __init__(self):
        self.max_summary_tokens = 300  # Target size for summaries

    async def compress_history(self,
                               messages: List[Dict[str, str]],
                               provider_service: Any) -> List[Dict[str, str]]:
        """
        Compress conversation history by summarizing older exchanges.
        Returns a new message list with compressed history.
        """
        # If fewer than 4 messages (system + maybe 1-2 exchanges), no compression needed
        if len(messages) < 4:
            return messages.copy()

        # Extract system message(s); they are always kept verbatim at the front
        system_messages = [m for m in messages if m.get("role") == "system"]

        # Find the cut point - we'll preserve the most recent exchanges
        if len(messages) <= 10:
            # For shorter conversations, keep the most recent 3 messages (1-2 exchanges)
            preserve_count = 3
        else:
            # For longer conversations, preserve the most recent 4-6 messages (2-3 exchanges)
            preserve_count = min(6, max(4, len(messages) // 5))

        # Exclude system messages from both halves so they are neither
        # summarized nor duplicated in the result
        compress_messages = [m for m in messages[:-preserve_count] if m.get("role") != "system"]
        preserve_messages = [m for m in messages[-preserve_count:] if m.get("role") != "system"]

        # If nothing to compress, return original
        if not compress_messages:
            return messages.copy()

        # Generate summary of the earlier conversation
        summary = await self._generate_conversation_summary(compress_messages, provider_service)

        # Create a new message list with the summary + preserved messages
        result = system_messages.copy()  # Start with system message(s)

        # Add summary as a system message
        if summary:
            result.append({
                "role": "system",
                "content": f"Previous conversation summary: {summary}"
            })

        # Add preserved recent messages
        result.extend(preserve_messages)

        return result

    async def _generate_conversation_summary(self,
                                             messages: List[Dict[str, str]],
                                             provider_service: Any) -> str:
        """Generate a summary of the conversation history."""
        if not messages:
            return ""

        # Format the conversation for summarization
        conversation_text = "\n".join([
            f"{m.get('role', 'unknown')}: {m.get('content', '')}"
            for m in messages if m.get('content')
        ])

        # Prepare the summarization prompt
        summary_prompt = [
            {"role": "system", "content":
                "You are a conversation summarizer. Create a concise summary of the key points "
                "from the conversation that would help maintain context for future responses. "
                "Focus on important information, user preferences, and outstanding questions. "
                "Keep the summary under 200 words."
            },
            {"role": "user", "content": f"Summarize this conversation:\n\n{conversation_text}"}
        ]

        # Get a summary using a smaller/faster model
        try:
            summary_response = await provider_service.generate_completion(
                messages=summary_prompt,
                provider="openai",  # Use OpenAI for reliability
                model="gpt-3.5-turbo",  # Use a smaller model for efficiency
                max_tokens=self.max_summary_tokens
            )

            if summary_response and summary_response.get("message", {}).get("content"):
                return summary_response["message"]["content"]

        except Exception as e:
            logger.error(f"Error generating conversation summary: {str(e)}")

        # Simple fallback summary generation
        topics = self._extract_topics(conversation_text)
        if topics:
            return f"Previous conversation covered: {', '.join(topics)}."

        return "The conversation covered various topics which have been summarized to save space."

    def _extract_topics(self, conversation_text: str) -> List[str]:
        """Simple topic extraction as a fallback mechanism."""
        # Extract potential topic indicators
        topic_phrases = [
            "discussed", "talked about", "mentioned", "referred to",
            "asked about", "inquired about", "wanted to know"
        ]

        topics = []

        for phrase in topic_phrases:
            # Capture the text after the phrase up to the next punctuation mark
            pattern = rf"{phrase} ([^\.,:;]+)"
            matches = re.findall(pattern, conversation_text, re.IGNORECASE)
            topics.extend(matches)

        # Deduplicate and limit
        unique_topics = list(set(topics))
        return unique_topics[:5]  # Return at most 5 topics

    async def compress_user_query(self,
                                  original_query: str,
                                  provider_service: Any) -> str:
        """
        Compress a long user query to reduce token usage while preserving intent.
        Used for very long inputs.
        """
        # If query is already reasonably sized, return as is
        if len(original_query.split()) < 100:
            return original_query

        # Prepare compression prompt
        compression_prompt = [
            {"role": "system", "content":
                "You are a query optimizer. Your job is to reformulate user queries to be more "
                "concise while preserving the core intent and all critical details. "
                "Remove redundant information and excessive elaboration, but maintain all "
                "specific requirements, constraints, and examples provided."
            },
            {"role": "user", "content": f"Optimize this query to be more concise while preserving all important details:\n\n{original_query}"}
        ]

        # Get a compressed query
        try:
            compression_response = await provider_service.generate_completion(
                messages=compression_prompt,
                provider="openai",
                model="gpt-3.5-turbo",
                # Word count is used as a rough stand-in for token count (~50% target)
                max_tokens=len(original_query.split()) // 2
            )

            if (compression_response and
                    compression_response.get("message", {}).get("content") and
                    len(compression_response["message"]["content"]) < len(original_query)):
                return compression_response["message"]["content"]

        except Exception as e:
            logger.error(f"Error compressing user query: {str(e)}")

        # If compression fails or doesn't reduce size, return original
        return original_query
```
Response Accuracy Optimization Strategies
1. Prompt Engineering Templates
```python
# app/services/prompt_templates.py
import textwrap
from typing import List, Optional


class PromptTemplates:
    """
    Provides optimized prompt templates for different use cases to improve response accuracy.
    """

    def __init__(self):
        # Core system prompt templates
        self.system_templates = {
            "general": """
                You are a helpful assistant with diverse knowledge and capabilities.
                Provide accurate, relevant, and concise responses to user queries.
                When you don't know something, admit it rather than making up information.
                Format your responses clearly using markdown when helpful.
            """,

            "coding": """
                You are a coding assistant with expertise in programming languages and software development.
                Provide correct, efficient, and well-documented code examples.
                Explain your code clearly and highlight important concepts.
                Format code blocks using markdown with appropriate syntax highlighting.
                Suggest best practices and consider edge cases in your solutions.
            """,

            "research": """
                You are a research assistant with access to broad knowledge.
                Provide comprehensive, accurate, and nuanced information.
                Consider different perspectives and cite limitations of your knowledge.
                Structure complex information clearly and logically.
                Indicate uncertainty when appropriate rather than speculating.
            """,

            "math": """
                You are a mathematics tutor with expertise in various mathematical domains.
                Provide step-by-step explanations for mathematical problems.
                Use clear notation and formatting for equations using markdown.
                Verify your solutions and check for errors or edge cases.
                When solving problems, explain the underlying concepts and techniques.
            """,

            "creative": """
                You are a creative assistant skilled in writing, storytelling, and idea generation.
                Provide original, engaging, and imaginative content based on user requests.
                Consider tone, style, and audience in your creative work.
                When generating stories or content, maintain internal consistency.
                Respect copyright and avoid plagiarizing existing creative works.
            """
        }

        # Task-specific prompt templates that can be inserted into system prompts
        self.task_templates = {
            "step_by_step": """
                Break down your explanation into clear, logical steps.
                Begin with foundational concepts before advancing to more complex ideas.
                Use numbered or bulleted lists for sequential instructions or key points.
                Provide examples to illustrate abstract concepts.
            """,

            "comparison": """
                Present a balanced and objective comparison.
                Identify clear categories for comparison (features, performance, use cases, etc.).
                Highlight both similarities and differences.
                Consider context and specific use cases in your evaluation.
                Avoid unjustified bias and present evidence for evaluative statements.
            """,

            "factual_accuracy": """
                Prioritize accuracy over comprehensiveness.
                Clearly distinguish between well-established facts, expert consensus, and speculation.
                Acknowledge limitations in your knowledge, especially for time-sensitive information.
                Avoid overgeneralizations and recognize exceptions where relevant.
            """,

            "technical_explanation": """
                Begin with a high-level overview before diving into technical details.
                Define specialized terminology when introduced.
                Use analogies to explain complex concepts when appropriate.
                Balance technical precision with accessibility based on the apparent expertise level of the user.
            """
        }

        # Output format templates
        self.format_templates = {
            "pros_cons": """
                Structure your response with clearly labeled sections for advantages and disadvantages.
                Use bullet points or numbered lists for each point.
                Consider different perspectives or use cases.
                If applicable, provide a balanced conclusion or recommendation.
            """,

            "academic": """
                Structure your response similar to an academic paper with introduction, body, and conclusion.
                Use formal language and precise terminology.
                Acknowledge limitations and alternative viewpoints.
                Refer to theoretical frameworks or methodologies where relevant.
            """,

            "tutorial": """
                Structure your response as a tutorial with clear sections:
                - Introduction explaining what will be covered and prerequisites
                - Step-by-step instructions with examples
                - Common pitfalls or troubleshooting tips
                - Summary of key takeaways
                Use headings and code blocks with appropriate formatting.
            """,

            "eli5": """
                Explain the concept as if to a 10-year-old with no specialized knowledge.
                Use simple language and concrete analogies.
                Break complex ideas into simple components.
                Avoid jargon, or define terms very clearly when they must be used.
            """
        }

    @staticmethod
    def _clean(template: str) -> str:
        """Remove the source-code indentation and surrounding blank lines from a template."""
        return textwrap.dedent(template).strip()

    def get_system_prompt(self, category: str, include_tasks: Optional[List[str]] = None) -> str:
        """Get a system prompt template with optional task-specific additions."""
        # Unknown categories fall back to the general template
        base_template = self._clean(
            self.system_templates.get(category, self.system_templates["general"])
        )

        if not include_tasks:
            return base_template

        # Add selected task templates, silently skipping unknown task names
        task_additions = [
            self._clean(self.task_templates[task])
            for task in include_tasks
            if task in self.task_templates
        ]

        if task_additions:
            return base_template + "\n\n" + "\n\n".join(task_additions)

        return base_template

    def enhance_user_prompt(self, original_prompt: str, format_type: Optional[str] = None) -> str:
        """Enhance a user prompt with formatting instructions."""
        if not format_type or format_type not in self.format_templates:
            return original_prompt

        format_instructions = self._clean(self.format_templates[format_type])
        enhanced_prompt = f"{original_prompt}\n\nPlease format your response as follows:\n{format_instructions}"

        return enhanced_prompt

    def detect_format_type(self, prompt: str) -> Optional[str]:
        """Detect what format type might be appropriate based on prompt content."""
        prompt_lower = prompt.lower()

        # Check for format indicators; the first matching group wins
        if any(phrase in prompt_lower for phrase in ["pros and cons", "advantages and disadvantages", "benefits and drawbacks"]):
            return "pros_cons"

        if any(phrase in prompt_lower for phrase in ["academic", "paper", "research", "literature", "theoretical"]):
            return "academic"

        if any(phrase in prompt_lower for phrase in ["tutorial", "how to", "guide", "step by step", "walkthrough"]):
            return "tutorial"

        if any(phrase in prompt_lower for phrase in ["explain like", "eli5", "simple terms", "layman's terms", "simply explain"]):
            return "eli5"

        return None
```
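Composing a system prompt from a base template plus task add-ons is easy to check in isolation. The sketch below uses shortened, made-up template text and a hypothetical `build_system_prompt` helper; it relies on `textwrap.dedent` so that indentation inside triple-quoted strings does not leak into the prompt sent to the model.

```python
from textwrap import dedent
from typing import List, Optional

# Shortened illustrative templates (not the full texts from the service above).
SYSTEM_TEMPLATES = {
    "general": """
        You are a helpful assistant.
        When you don't know something, say so.
    """,
    "coding": """
        You are a coding assistant.
        Provide correct, well-documented code.
    """,
}

TASK_TEMPLATES = {
    "step_by_step": """
        Break down your explanation into clear, logical steps.
    """,
}


def build_system_prompt(category: str, include_tasks: Optional[List[str]] = None) -> str:
    """Join the base template and any known task templates with blank lines."""
    base = dedent(SYSTEM_TEMPLATES.get(category, SYSTEM_TEMPLATES["general"])).strip()
    extras = [
        dedent(TASK_TEMPLATES[task]).strip()
        for task in (include_tasks or [])
        if task in TASK_TEMPLATES  # unknown task names are skipped
    ]
    return "\n\n".join([base, *extras])


if __name__ == "__main__":
    print(build_system_prompt("coding", ["step_by_step", "not_a_task"]))
```

Calling `.strip()` alone would remove only the leading and trailing whitespace of the whole string, leaving every inner line indented; dedenting first keeps the prompt compact and saves a few tokens per request.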
2. Context-Aware Chain of Thought
```python
# app/services/chain_of_thought.py
import logging
import re

logger = logging.getLogger(__name__)


class ChainOfThoughtService:
    """
    Enhances response accuracy by enabling step-by-step reasoning.
    """

    def __init__(self):
        # Configure when to use chain-of-thought prompting.
        # \b word boundaries keep "how" from matching inside words like "show".
        self.cot_triggers = [
            # Keywords indicating complex reasoning is needed
            r"\b(why|how|explain|analyze|reason|think|consider)\b",
            # Question patterns that benefit from step-by-step thinking
            r"\b(what (would|will|could|might) happen if)\b",
            r"\b(what (is|are) the (cause|reason|impact|effect|implication))",
            # Complexity indicators
            r"\b(complex|complicated|difficult|challenging|nuanced)\b",
            # Multi-step problems
            r"\b(steps|process|procedure|method|approach)\b"
        ]

        # Task-specific CoT templates
        self.cot_templates = {
            "general": "Let's think through this step-by-step.",

            "math": """
Let's solve this step-by-step:
1. First, understand what we're looking for
2. Identify the relevant information and equations
3. Work through the solution methodically
4. Verify the answer makes sense
""",

            "reasoning": """
Let's approach this systematically:
1. Identify the key elements of the problem
2. Consider relevant principles and constraints
3. Analyze potential approaches
4. Evaluate and compare alternatives
5. Draw a well-reasoned conclusion
""",

            "decision": """
Let's analyze this decision carefully:
1. Clarify the decision to be made
2. Identify the key criteria and constraints
3. Consider the available options
4. Evaluate each option against the criteria
5. Assess potential risks and trade-offs
6. Recommend the best course of action with justification
""",

            "causal": """
Let's analyze the causal relationships:
1. Identify the events or phenomena to be explained
2. Consider potential causes and mechanisms
3. Evaluate the evidence for each causal link
4. Consider alternative explanations
5. Draw conclusions about the most likely causal relationships
"""
        }

        # Internal vs. external CoT modes
        self.cot_modes = {
            "internal": {
                "prefix": "Think through this problem step-by-step before providing your final answer.",
                "format": "standard"  # No special formatting needed
            },
            "external": {
                "prefix": "Show your step-by-step reasoning process explicitly in your response.",
                "format": "markdown"  # Format as markdown
            }
        }

    def should_use_cot(self, query: str) -> bool:
        """Determine if chain-of-thought prompting should be used for this query."""
        query_lower = query.lower()

        # Check for CoT triggers
        for pattern in self.cot_triggers:
            if re.search(pattern, query_lower):
                return True

        # Check for task complexity indicators
        if len(query.split()) > 30:  # Longer queries often benefit from CoT
            return True

        # Check for explicit reasoning requests
        explicit_requests = [
            "step by step", "explain your reasoning", "think through",
            "show your work", "explain how you", "walk me through"
        ]

        if any(request in query_lower for request in explicit_requests):
            return True

        return False

    def detect_task_type(self, query: str) -> str:
        """Detect the type of reasoning task from the query."""
        query_lower = query.lower()

        # Check for mathematical content. These entries are regular
        # expressions; the last one treats any number in the query as a math signal.
        math_indicators = [
            "calculate", "compute", "solve", "equation", "formula",
            "find the value", "what is the result", r"\d+(\.\d+)?"
        ]

        if any(re.search(indicator, query_lower) for indicator in math_indicators):
            return "math"

        # Check for decision-making queries
        decision_indicators = [
            "should i", "which is better", "what's the best", "recommend",
            "decide between", "choose", "options"
        ]

        if any(indicator in query_lower for indicator in decision_indicators):
            return "decision"

        # Check for causal analysis
        causal_indicators = [
            "why did", "what caused", "reason for", "explain why",
            "how does", "what leads to", "effect of", "impact of"
        ]

        if any(indicator in query_lower for indicator in causal_indicators):
            return "causal"

        # Check for general analytical reasoning
        reasoning_indicators = [
            "explain", "analyze", "evaluate", "critique", "assess",
            "compare", "contrast", "discuss", "review"
        ]

        if any(indicator in query_lower for indicator in reasoning_indicators):
            return "reasoning"

        return "general"

    def enhance_prompt_with_cot(self,
                                query: str,
                                mode: str = "internal",
                                explicit_template: bool = False) -> str:
        """
        Enhance a prompt with chain-of-thought instructions.

        Args:
            query: The original user query
            mode: "internal" (for model thinking) or "external" (for visible reasoning)
            explicit_template: Whether to include the full template or just the instruction
        """
        if not self.should_use_cot(query):
            return query

        # Get CoT mode configuration
        cot_mode = self.cot_modes.get(mode, self.cot_modes["internal"])

        # Detect the task type
        task_type = self.detect_task_type(query)

        # Get the appropriate template
        template = self.cot_templates.get(task_type, self.cot_templates["general"])

        if explicit_template:
            # Add the full template
            enhanced = f"{query}\n\n{cot_mode['prefix']}\n\n{template.strip()}"
        else:
            # Just add the basic instruction
            enhanced = f"{query}\n\n{cot_mode['prefix']}"

        return enhanced

    def format_cot_for_response(self, reasoning: str, final_answer: str, mode: str = "external") -> str:
        """
        Format chain-of-thought reasoning and final answer for response.

        Args:
            reasoning: The step-by-step reasoning process
            final_answer: The final answer or conclusion
            mode: "internal" (hidden) or "external" (visible)
        """
        if mode == "internal":
            # For internal mode, just return the final answer
            return final_answer

        # For external mode, format the reasoning and answer as markdown sections
        formatted = (
            f"## Reasoning Process\n\n{reasoning}\n\n"
            f"## Conclusion\n\n{final_answer}"
        )
        return formatted.strip()
```
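Keyword triggers of this kind are sensitive to word boundaries: a pattern such as `(why|how|explain)` also matches inside longer words, so "show" triggers on "how". The short sketch below compares a loose pattern with a `\b`-anchored one; the pattern constants and the `needs_cot` and `enhance` helpers are illustrative names, not part of the service above.

```python
import re

# Same keywords, with and without word-boundary anchors.
LOOSE = re.compile(r"(why|how|explain)")
STRICT = re.compile(r"\b(why|how|explain)\b")

COT_PREFIX = "Think through this problem step-by-step before providing your final answer."


def needs_cot(query: str, pattern: re.Pattern) -> bool:
    """True when the query contains one of the trigger keywords."""
    return bool(pattern.search(query.lower()))


def enhance(query: str) -> str:
    """Append the chain-of-thought instruction only when a trigger matches."""
    if not needs_cot(query, STRICT):
        return query
    return f"{query}\n\n{COT_PREFIX}"


if __name__ == "__main__":
    for q in ("Show me the weather", "How do transformers work?"):
        print(q, "| loose:", needs_cot(q, LOOSE), "| strict:", needs_cot(q, STRICT))
```

Avoiding false positives matters for cost as well as accuracy, because every triggered query carries extra instruction tokens and usually produces a longer response.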
3. Self-Verification and Error Correction
    # app/services/verification_service.py
    import logging
    from typing import Dict, List, Any, Optional, Tuple
    import re
    import json

    logger = logging.getLogger(__name__)

    class VerificationService:
        """
        Improves response accuracy through self-verification and error correction.
        """

        def __init__(self):
            # Define verification categories
            self.verification_categories = [
                "factual_accuracy",
                "logical_consistency",
                "completeness",
                "code_correctness",
                "calculation_accuracy",
                "bias_detection"
            ]

            # High-risk categories that should always be verified
            self.high_risk_categories = [
                "medical",
                "legal",
                "financial",
                "security"
            ]

            # Verification prompt templates
            self.verification_templates = {
                "general": """
    Please verify your response for:
    1. Factual accuracy - Are all stated facts correct?
    2. Logical consistency - Is the reasoning sound and free of contradictions?
    3. Completeness - Does the answer address all aspects of the question?
    4. Clarity - Is the response clear and easy to understand?

    If you find any errors or omissions, please correct them in your response.
    """,

                "factual": """
    Critically verify the factual claims in your response:
    - Are dates, names, and definitions accurate?
    - Are statistics and measurements correct?
    - Are attributions to people, organizations, or sources accurate?
    - Have you distinguished between facts and opinions/interpretations?

    If you identify any factual errors, please correct them.
    """,

                "code": """
    Verify your code for:
    1. Syntax errors and typos
    2. Logical correctness - does it perform the intended function?
    3. Edge cases and error handling
    4. Efficiency and best practices
    5. Security vulnerabilities

    If you find any issues, please provide corrected code.
    """,

                "math": """
    Verify your mathematical work by:
    1. Re-checking each calculation step
    2. Verifying that formulas are applied correctly
    3. Confirming unit conversions if applicable
    4. Testing the solution with sample values if possible
    5. Checking for arithmetic errors

    If you find any errors, please recalculate and provide the correct answer.
    """,

                "bias": """
    Check your response for potential biases:
    1. Is the framing balanced and objective?
    2. Have you considered diverse perspectives?
    3. Are there cultural, geographic, or demographic assumptions?
    4. Does the language contain implicit value judgments?

    If you detect bias, please revise for greater objectivity.
    """
            }

        def detect_verification_needs(self, query: str) -> List[str]:
            """Detect which verification categories are needed based on the query."""
            query_lower = query.lower()
            needed_categories = []

            # Check for high-risk topics
            high_risk_detected = any(
                category in query_lower for category in self.high_risk_categories
            )

            # For high-risk topics, perform comprehensive verification
            if high_risk_detected:
                return ["factual_accuracy", "logical_consistency", "completeness", "bias_detection"]

            # Check for code-related content
            code_indicators = ["code", "function", "program", "algorithm", "syntax"]
            if any(indicator in query_lower for indicator in code_indicators):
                needed_categories.append("code_correctness")

            # Check for mathematical content
            math_indicators = ["calculate", "compute", "solve", "equation", "math problem"]
            if any(indicator in query_lower for indicator in math_indicators):
                needed_categories.append("calculation_accuracy")

            # Check for factual questions
            factual_indicators = ["fact", "information about", "when did", "who is", "history of"]
            if any(indicator in query_lower for indicator in factual_indicators):
                needed_categories.append("factual_accuracy")

            # Check for logical reasoning requirements
            logic_indicators = ["why", "explain", "reason", "because", "therefore", "hence"]
            if any(indicator in query_lower for indicator in logic_indicators):
                needed_categories.append("logical_consistency")

            # For comprehensive questions
            if len(query.split()) > 30 or "comprehensive" in query_lower or "detailed" in query_lower:
                needed_categories.append("completeness")

            # For sensitive or controversial topics
            sensitive_indicators = ["controversy", "debate", "opinion", "perspective", "ethical"]
            if any(indicator in query_lower for indicator in sensitive_indicators):
                needed_categories.append("bias_detection")

            # Default to basic verification if nothing specific detected
            if not needed_categories:
                needed_categories = ["factual_accuracy", "logical_consistency"]

            return needed_categories

        def get_verification_prompt(self, categories: List[str]) -> str:
            """Get the appropriate verification prompt based on needed categories."""
            if "code_correctness" in categories and len(categories) == 1:
                return self.verification_templates["code"]

            if "calculation_accuracy" in categories and len(categories) == 1:
                return self.verification_templates["math"]

            if "factual_accuracy" in categories and "bias_detection" not in categories:
                return self.verification_templates["factual"]

            if "bias_detection" in categories and len(categories) == 1:
                return self.verification_templates["bias"]

            # Default to general verification
            return self.verification_templates["general"]

        async def verify_response(self,
                                  query: str,
                                  initial_response: str,
                                  provider_service: Any) -> Tuple[str, bool]:
            """
            Verify and potentially correct a response.

            Returns:
                Tuple of (verified_response, was_corrected)
            """
            # Detect verification needs
            verification_categories = self.detect_verification_needs(query)

            # If no verification needed, return original
            if not verification_categories:
                return initial_response, False

            # Get verification prompt
            verification_prompt =
self.get_verification_prompt(verification_categories)175176 # Create verification messages177 verification_messages = [178 {"role": "system", "content":179 "You are a verification assistant. Your job is to verify the accuracy, "180 "consistency, and completeness of responses. Identify any errors or "181 "issues, and provide corrections when necessary."182 },183 {"role": "user", "content": query},184 {"role": "assistant", "content": initial_response},185 {"role": "user", "content": verification_prompt}186 ]187188 try:189 verification_response = await provider_service.generate_completion(190 messages=verification_messages,191 provider="openai", # Use OpenAI for verification192 model="gpt-4" # Use a more capable model for verification193 )194195 if verification_response and verification_response.get("message", {}).get("content"):196 # Check if verification found issues197 verification_text = verification_response["message"]["content"]198199 # Look for indicators of corrections200 correction_indicators = [201 "correction", "error", "mistake", "incorrect",202 "needs clarification", "inaccurate", "not quite right"203 ]204205 if any(indicator in verification_text.lower() for indicator in correction_indicators):206 # Attempt to correct the response207 corrected_response = await self._generate_corrected_response(208 query, initial_response, verification_text, provider_service209 )210 return corrected_response, True211212 # If verification found no issues, or was just minor clarifications213 minor_indicators = ["minor clarification", "additional note", "small correction"]214 if any(indicator in verification_text.lower() for indicator in minor_indicators):215 # Include the clarification in the response216 combined = f"{initial_response}\n\n**Note:** {verification_text}"217 return combined, True218219 # If verification failed or found no issues220 return initial_response, False221222 except Exception as e:223 logger.error(f"Error in response verification: {str(e)}")224 return 
initial_response, False225226 async def _generate_corrected_response(self,227 query: str,228 initial_response: str,229 verification_text: str,230 provider_service: Any) -> str:231 """Generate a corrected response based on verification feedback."""232 correction_prompt = [233 {"role": "system", "content":234 "You are a correction assistant. Your job is to provide a revised response "235 "that addresses the issues identified in the verification feedback. "236 "Create a complete, standalone corrected response."237 },238 {"role": "user", "content": f"Original question:\n{query}"},239 {"role": "assistant", "content": f"Initial response:\n{initial_response}"},240 {"role": "user", "content": f"Verification feedback:\n{verification_text}\n\nPlease provide a corrected response."}241 ]242243 try:244 correction_response = await provider_service.generate_completion(245 messages=correction_prompt,246 provider="openai",247 model="gpt-4"248 )249250 if correction_response and correction_response.get("message", {}).get("content"):251 return correction_response["message"]["content"]252253 except Exception as e:254 logger.error(f"Error generating corrected response: {str(e)}")255256 # Fallback - append verification notes to original257 return f"{initial_response}\n\n**Correction Note:** {verification_text}"
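The verification service decides whether a correction is needed by looking for words such as "error" or "mistake" in the verifier's reply. Plain substring matching also fires on replies like "I found no errors", so the sketch below shows one possible refinement that skips a term when a negation appears in the three words before it. The function name `needs_correction`, the term list, and the three-word window are illustrative assumptions, not part of the original service.

```python
import re

# Hypothetical helper for illustration; not part of VerificationService.
CORRECTION_TERMS = ("correction", "error", "mistake", "incorrect", "inaccurate")
NEGATIONS = {"no", "not", "without", "never"}


def needs_correction(verification_text: str) -> bool:
    """Return True if a correction term appears without a nearby negation."""
    words = re.findall(r"[a-z']+", verification_text.lower())
    for i, word in enumerate(words):
        if not any(word.startswith(term) for term in CORRECTION_TERMS):
            continue
        # Inspect the three words before the matched term.
        preceding = words[max(0, i - 3):i]
        negated = any(w in NEGATIONS or w.endswith("n't") for w in preceding)
        if not negated:
            return True
    return False


if __name__ == "__main__":
    print(needs_correction("I found no errors in the response."))
    print(needs_correction("There is an error in step 2."))
```

This remains a heuristic. Asking the verifier model to reply in a fixed structure (for example, a leading verdict line) would be a more dependable signal than keyword matching.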
4. Domain-Specific Knowledge Integration
```python
# app/services/domain_knowledge.py
import hashlib
import logging
import os
from typing import Any, Dict, List

import yaml

logger = logging.getLogger(__name__)


class DomainKnowledgeService:
    """
    Enhances response accuracy by integrating domain-specific knowledge.
    """

    def __init__(self, knowledge_dir: str = "knowledge"):
        self.knowledge_dir = knowledge_dir

        # Domain definitions
        self.domains = {
            "programming": {
                "keywords": ["coding", "programming", "software", "development", "algorithm", "function"],
                "languages": ["python", "javascript", "java", "c++", "ruby", "go", "rust", "php"]
            },
            "medicine": {
                "keywords": ["medical", "health", "disease", "treatment", "diagnosis", "symptom", "patient"],
                "specialties": ["cardiology", "neurology", "pediatrics", "oncology", "psychiatry"]
            },
            "finance": {
                "keywords": ["finance", "investment", "stock", "market", "trading", "portfolio", "asset"],
                "topics": ["stocks", "bonds", "cryptocurrency", "retirement", "taxes", "budgeting"]
            },
            "law": {
                "keywords": ["legal", "law", "regulation", "compliance", "contract", "liability"],
                "areas": ["corporate", "criminal", "civil", "constitutional", "intellectual property"]
            },
            "science": {
                "keywords": ["science", "research", "experiment", "theory", "hypothesis", "evidence"],
                "fields": ["physics", "chemistry", "biology", "astronomy", "geology", "ecology"]
            }
        }

        # Load domain knowledge
        self.domain_knowledge = self._load_domain_knowledge()

        # Track query->domain mappings to optimize repeated queries
        self.domain_cache = {}

    def _load_domain_knowledge(self) -> Dict[str, Any]:
        """Load domain knowledge from files."""
        knowledge = {}

        try:
            # Create knowledge dir if it doesn't exist
            os.makedirs(self.knowledge_dir, exist_ok=True)

            # List all domain knowledge files
            for domain in self.domains.keys():
                domain_path = os.path.join(self.knowledge_dir, f"{domain}.yaml")

                # Create empty file if it doesn't exist
                if not os.path.exists(domain_path):
                    with open(domain_path, 'w') as f:
                        yaml.dump({
                            "domain": domain,
                            "concepts": {},
                            "facts": [],
                            "common_misconceptions": [],
                            "best_practices": []
                        }, f)

                # Load domain knowledge
                try:
                    with open(domain_path, 'r') as f:
                        # safe_load returns None for an empty file
                        domain_data = yaml.safe_load(f) or {}
                        knowledge[domain] = domain_data
                except Exception as e:
                    logger.error(f"Error loading domain knowledge for {domain}: {str(e)}")
                    knowledge[domain] = {
                        "domain": domain,
                        "concepts": {},
                        "facts": [],
                        "common_misconceptions": [],
                        "best_practices": []
                    }
        except Exception as e:
            logger.error(f"Error loading domain knowledge: {str(e)}")

        return knowledge

    def detect_domains(self, query: str) -> List[str]:
        """Detect relevant domains for a query."""
        # Check cache first (MD5 is used only as a cache key, not for security)
        cache_key = hashlib.md5(query.encode()).hexdigest()
        if cache_key in self.domain_cache:
            return self.domain_cache[cache_key]

        query_lower = query.lower()
        relevant_domains = []

        # Check each domain for relevance
        for domain, definition in self.domains.items():
            # Check domain keywords
            keyword_match = any(keyword in query_lower for keyword in definition["keywords"])

            # Check specific domain topics
            topic_match = False
            for topic_category, topics in definition.items():
                if topic_category != "keywords":
                    if any(topic in query_lower for topic in topics):
                        topic_match = True
                        break

            if keyword_match or topic_match:
                relevant_domains.append(domain)

        # Cache result
        self.domain_cache[cache_key] = relevant_domains
        return relevant_domains

    def get_domain_knowledge(self, domains: List[str]) -> Dict[str, Any]:
        """Get knowledge for the specified domains."""
        combined_knowledge = {
            "concepts": {},
            "facts": [],
            "common_misconceptions": [],
            "best_practices": []
        }

        for domain in domains:
            if domain in self.domain_knowledge:
                domain_data = self.domain_knowledge[domain]

                # Merge concepts (dictionary)
                combined_knowledge["concepts"].update(domain_data.get("concepts", {}))

                # Extend lists
                for key in ["facts", "common_misconceptions", "best_practices"]:
                    combined_knowledge[key].extend(domain_data.get(key, []))

        return combined_knowledge

    def format_domain_knowledge(self, knowledge: Dict[str, Any]) -> str:
        """Format domain knowledge as a context string."""
        if not knowledge or all(not v for v in knowledge.values()):
            return ""

        formatted_parts = []

        # Format concepts
        if knowledge["concepts"]:
            concepts_list = []
            for concept, definition in knowledge["concepts"].items():
                concepts_list.append(f"- {concept}: {definition}")

            formatted_parts.append("Key concepts:\n" + "\n".join(concepts_list))

        # Format facts
        if knowledge["facts"]:
            formatted_parts.append("Important facts:\n- " + "\n- ".join(knowledge["facts"]))

        # Format misconceptions
        if knowledge["common_misconceptions"]:
            formatted_parts.append("Common misconceptions to avoid:\n- " + "\n- ".join(knowledge["common_misconceptions"]))

        # Format best practices
        if knowledge["best_practices"]:
            formatted_parts.append("Best practices:\n- " + "\n- ".join(knowledge["best_practices"]))

        return "\n\n".join(formatted_parts)

    def enhance_prompt_with_domain_knowledge(self, query: str, system_prompt: str) -> str:
        """Enhance a system prompt with relevant domain knowledge."""
        # Detect relevant domains
        domains = self.detect_domains(query)

        if not domains:
            return system_prompt

        # Get domain knowledge
        knowledge = self.get_domain_knowledge(domains)

        # Format knowledge as context
        knowledge_text = self.format_domain_knowledge(knowledge)

        if not knowledge_text:
            return system_prompt

        # Add to system prompt
        enhanced_prompt = f"{system_prompt}\n\nRelevant domain knowledge:\n{knowledge_text}"

        return enhanced_prompt
```
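Domain detection above relies on substring checks, so a short term such as "go" also matches "good" and "law" matches "lawn". A minimal sketch of whole-term matching follows, assuming a trimmed-down term table; `DOMAIN_TERMS`, `compile_domain_patterns`, and `detect_domains` here are hypothetical stand-ins, not the service's own methods.

```python
import re
from typing import Dict, List, Pattern

# Hypothetical, trimmed-down term table for illustration only.
DOMAIN_TERMS: Dict[str, List[str]] = {
    "programming": ["python", "go", "c++", "algorithm"],
    "law": ["law", "legal", "contract"],
}


def compile_domain_patterns(domain_terms: Dict[str, List[str]]) -> Dict[str, Pattern[str]]:
    """Build one case-insensitive pattern per domain that matches whole terms."""
    patterns = {}
    for domain, terms in domain_terms.items():
        alternatives = "|".join(re.escape(term) for term in terms)
        # Lookarounds act like \b but also work for terms ending in
        # non-word characters, such as "c++".
        patterns[domain] = re.compile(rf"(?<!\w)(?:{alternatives})(?!\w)", re.IGNORECASE)
    return patterns


def detect_domains(query: str, patterns: Dict[str, Pattern[str]]) -> List[str]:
    """Return the domains whose pattern occurs in the query."""
    return [domain for domain, pattern in patterns.items() if pattern.search(query)]


if __name__ == "__main__":
    compiled = compile_domain_patterns(DOMAIN_TERMS)
    print(detect_domains("How do I write a Go function in C++?", compiled))
    print(detect_domains("What a good lawn mower", compiled))
```

Compiling the patterns once, rather than per query, keeps the per-request cost close to that of the original substring loop.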
5. Dynamic Few-Shot Learning
```python
# app/services/few_shot_examples.py
import hashlib
import json
import logging
import os
import random
import re
from typing import Dict, List, Optional

logger = logging.getLogger(__name__)


class FewShotExampleService:
    """
    Enhances response accuracy using dynamic few-shot learning with examples.
    """

    def __init__(self, examples_dir: str = "examples"):
        self.examples_dir = examples_dir

        # Ensure examples directory exists
        os.makedirs(examples_dir, exist_ok=True)

        # Task categories for examples
        self.task_categories = {
            "code_generation": {
                "keywords": ["write code", "function", "implement", "program", "algorithm"],
                "patterns": [r"write a .* function", r"implement .* in (python|javascript|java|c\+\+)"]
            },
            "explanation": {
                "keywords": ["explain", "describe", "how does", "what is", "why is"],
                "patterns": [r"explain .* to me", r"what is the .* of", r"how does .* work"]
            },
            "classification": {
                "keywords": ["classify", "categorize", "identify", "is this", "determine"],
                "patterns": [r"is this .* or .*", r"which category", r"identify the .*"]
            },
            "comparison": {
                "keywords": ["compare", "contrast", "difference", "similarities", "versus"],
                "patterns": [r"compare .* and .*", r"what is the difference between", r".* vs .*"]
            },
            "summarization": {
                "keywords": ["summarize", "summary", "brief overview", "key points"],
                "patterns": [r"summarize .*", r"provide a summary", r"key points of"]
            }
        }

        # Load examples
        self.examples = self._load_examples()

    def _load_examples(self) -> Dict[str, List[Dict[str, str]]]:
        """Load examples from files."""
        examples = {category: [] for category in self.task_categories.keys()}

        # Load examples for each category
        for category in self.task_categories.keys():
            category_file = os.path.join(self.examples_dir, f"{category}.json")

            if os.path.exists(category_file):
                try:
                    with open(category_file, 'r') as f:
                        category_examples = json.load(f)
                        examples[category] = category_examples
                except Exception as e:
                    logger.error(f"Error loading examples for {category}: {str(e)}")

        return examples

    def detect_task_category(self, query: str) -> Optional[str]:
        """Detect the task category for a query."""
        query_lower = query.lower()

        # Check each category
        for category, definition in self.task_categories.items():
            # Check keywords
            if any(keyword in query_lower for keyword in definition["keywords"]):
                return category

            # Check regex patterns
            if any(re.search(pattern, query_lower) for pattern in definition["patterns"]):
                return category

        return None

    def select_examples(self,
                        query: str,
                        category: Optional[str] = None,
                        num_examples: int = 3) -> List[Dict[str, str]]:
        """Select the most relevant examples for a query."""
        # Detect category if not provided
        if not category:
            category = self.detect_task_category(query)

        if not category or category not in self.examples or not self.examples[category]:
            return []

        category_examples = self.examples[category]

        # If we have few examples, just return all of them (up to num_examples)
        if len(category_examples) <= num_examples:
            return category_examples

        # For simplicity, we're using random selection here
        # In a production system, this would use semantic similarity or other relevance metrics
        selected = random.sample(category_examples, min(num_examples, len(category_examples)))

        return selected

    def format_examples_for_prompt(self, examples: List[Dict[str, str]]) -> str:
        """Format examples for inclusion in a prompt."""
        if not examples:
            return ""

        formatted_examples = []

        for i, example in enumerate(examples, 1):
            query = example.get("query", "")
            response = example.get("response", "")

            formatted = f"Example {i}:\n\nUser: {query}\n\nAssistant: {response}\n"
            formatted_examples.append(formatted)

        return "\n".join(formatted_examples)

    def enhance_prompt_with_examples(self,
                                     query: str,
                                     system_prompt: str,
                                     num_examples: int = 2) -> str:
        """Enhance a system prompt with few-shot examples."""
        # Select relevant examples
        examples = self.select_examples(query, num_examples=num_examples)

        if not examples:
            return system_prompt

        # Format examples
        examples_text = self.format_examples_for_prompt(examples)

        # Add to system prompt
        enhanced_prompt = f"{system_prompt}\n\nHere are some examples of how to respond to similar queries:\n\n{examples_text}"

        return enhanced_prompt

    def add_example(self, category: str, query: str, response: str) -> bool:
        """Add a new example to the examples collection."""
        if category not in self.task_categories:
            logger.error(f"Invalid category: {category}")
            return False

        example = {
            "query": query,
            "response": response,
            "id": hashlib.md5(f"{category}:{query}".encode()).hexdigest()
        }

        # Add to in-memory collection
        if category not in self.examples:
            self.examples[category] = []

        # Check if this example already exists
        existing_ids = [e.get("id") for e in self.examples[category]]
        if example["id"] in existing_ids:
            return False  # Example already exists

        self.examples[category].append(example)

        # Save to file
        try:
            category_file = os.path.join(self.examples_dir, f"{category}.json")
            with open(category_file, 'w') as f:
                json.dump(self.examples[category], f, indent=2)
            return True
        except Exception as e:
            logger.error(f"Error saving example: {str(e)}")
            return False
```
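The example service picks few-shot examples at random and notes that a production system would rank them by relevance. As a rough illustration of that idea, the sketch below ranks examples by Jaccard token overlap with the query. This is a simple lexical stand-in for semantic similarity, and `select_by_overlap` is a hypothetical function rather than part of `FewShotExampleService`.

```python
import re
from typing import Dict, List, Set


def _tokens(text: str) -> Set[str]:
    """Lowercase alphanumeric tokens of a string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def select_by_overlap(query: str,
                      examples: List[Dict[str, str]],
                      num_examples: int = 2) -> List[Dict[str, str]]:
    """Rank examples by Jaccard overlap between their query and the new query."""
    query_tokens = _tokens(query)

    def score(example: Dict[str, str]) -> float:
        example_tokens = _tokens(example.get("query", ""))
        union = query_tokens | example_tokens
        return len(query_tokens & example_tokens) / len(union) if union else 0.0

    # sorted() is stable, so equally scored examples keep their stored order.
    return sorted(examples, key=score, reverse=True)[:num_examples]


if __name__ == "__main__":
    stored = [
        {"query": "write a python function to reverse a string", "response": "..."},
        {"query": "explain how dns works", "response": "..."},
        {"query": "write a python function to sort a list", "response": "..."},
    ]
    for example in select_by_overlap("write a python function to sort numbers", stored):
        print(example["query"])
```

Token overlap misses paraphrases, so an embedding-based score could replace `score` without changing the surrounding selection logic.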
Deployment Strategies
Local Development Environment
Setup Script for Local Deployment
```bash
#!/bin/bash
# local_setup.sh - Set up local development environment

set -e # Exit on error

# Check for required tools
echo "Checking prerequisites..."
command -v python3 >/dev/null 2>&1 || { echo "Python 3 is required but not installed. Aborting."; exit 1; }
command -v pip3 >/dev/null 2>&1 || { echo "pip3 is required but not installed. Aborting."; exit 1; }
command -v docker >/dev/null 2>&1 || { echo "Docker is required but not installed. Aborting."; exit 1; }
command -v docker-compose >/dev/null 2>&1 || { echo "Docker Compose is required but not installed. Aborting."; exit 1; }

# Create virtual environment
echo "Creating Python virtual environment..."
python3 -m venv venv
source venv/bin/activate

# Install dependencies
echo "Installing Python dependencies..."
pip install --upgrade pip
pip install -r requirements.txt
pip install -r requirements-dev.txt

# Set up environment file
if [ ! -f .env ]; then
    echo "Creating .env file..."
    cp .env.example .env

    # Prompt for OpenAI API key
    read -r -p "Enter your OpenAI API key (leave blank to skip): " openai_key
    if [ -n "$openai_key" ]; then
        # "-i.bak" works with both GNU sed (Linux) and BSD sed (macOS)
        sed -i.bak "s|^OPENAI_API_KEY=.*|OPENAI_API_KEY=${openai_key}|" .env && rm -f .env.bak
    fi

    # Set environment to development
    sed -i.bak "s|^APP_ENV=.*|APP_ENV=development|" .env && rm -f .env.bak

    echo ".env file created. Please review and update as needed."
else
    echo ".env file already exists. Skipping creation."
fi

# Check if Ollama is installed
if ! command -v ollama >/dev/null 2>&1; then
    echo "Ollama not found. Would you like to install it? (y/n)"
    read -r install_ollama

    if [ "$install_ollama" = "y" ]; then
        if [[ "$OSTYPE" == "darwin"* ]]; then
            # macOS: the install script targets Linux, so install manually
            echo "On macOS, download Ollama from https://ollama.com/download and re-run this script."
        else
            # Linux
            echo "Installing Ollama..."
            curl -fsSL https://ollama.com/install.sh | sh
        fi
    else
        echo "Skipping Ollama installation. You will need to install it manually."
    fi
else
    echo "Ollama already installed."
fi

# Pull required Ollama models
if command -v ollama >/dev/null 2>&1; then
    echo "Would you like to pull the recommended Ollama models? (y/n)"
    read -r pull_models

    if [ "$pull_models" = "y" ]; then
        echo "Pulling Ollama models..."
        ollama pull llama2
        ollama pull mistral
        ollama pull codellama
    fi
fi

# Start Redis for development
echo "Starting Redis with Docker..."
docker-compose up -d redis

# Initialize database
echo "Initializing database..."
python scripts/init_db.py

# Run tests to verify setup
echo "Running tests to verify setup..."
pytest tests/unit

echo "Setup complete! You can now start the development server with:"
echo "uvicorn app.main:app --reload"
```
Docker Compose for Local Services
```yaml
# docker-compose.yml
version: '3.8'

services:
  app:
    build:
      context: .
      dockerfile: Dockerfile.dev
    ports:
      - "8000:8000"
    volumes:
      - .:/app
    environment:
      - PYTHONPATH=/app
      - REDIS_URL=redis://redis:6379/0
      - OLLAMA_HOST=http://ollama:11434
      - APP_ENV=development
      - FORCE_DEV_MODE=true
    depends_on:
      - redis
      - ollama
    command: uvicorn app.main:app --host 0.0.0.0 --port 8000 --reload

  redis:
    image: redis:alpine
    ports:
      - "6379:6379"
    volumes:
      - redis_data:/data

  ollama:
    image: ollama/ollama:latest
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

  ui:
    build:
      context: ./ui
      dockerfile: Dockerfile.dev
    ports:
      - "3000:3000"
    volumes:
      - ./ui:/app
      - /app/node_modules
    environment:
      - API_URL=http://app:8000
    depends_on:
      - app
    command: npm start

volumes:
  redis_data:
  ollama_data:
```
Development Dockerfile
```dockerfile
# Dockerfile.dev
FROM python:3.11-slim

WORKDIR /app

# Install system dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
    curl \
    gcc \
    build-essential \
    && rm -rf /var/lib/apt/lists/*

# Install Python dependencies
COPY requirements.txt requirements-dev.txt ./
RUN pip install --no-cache-dir -r requirements.txt -r requirements-dev.txt

# Copy application code
COPY . .

# Set development environment
ENV PYTHONUNBUFFERED=1
ENV PYTHONDONTWRITEBYTECODE=1
ENV APP_ENV=development

# Make scripts executable
RUN chmod +x scripts/*.sh

# Default command
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000", "--reload"]
```
Configuration for Local Environment
```python
# app/config/local.py
"""Configuration for local development environment."""

import os
from typing import Dict, Any, List

# API configuration
API_HOST = "0.0.0.0"
API_PORT = 8000
API_RELOAD = True
API_DEBUG = True

# OpenAI configuration
OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY", "")
OPENAI_ORG_ID = os.environ.get("OPENAI_ORG_ID", "")
OPENAI_MODEL = "gpt-3.5-turbo"  # Default to cheaper model for development

# Ollama configuration
OLLAMA_HOST = os.environ.get("OLLAMA_HOST", "http://localhost:11434")
OLLAMA_MODEL = "llama2"  # Default local model
ENABLE_GPU = True

# App configuration
LOG_LEVEL = "DEBUG"
ENABLE_CORS = True
CORS_ORIGINS = ["http://localhost:3000", "http://127.0.0.1:3000"]

# Feature flags
ENABLE_CACHING = True
ENABLE_RATE_LIMITING = False  # Disable rate limiting in local development
ENABLE_PARALLEL_PROCESSING = True
ENABLE_RESPONSE_VERIFICATION = True

# Development-specific settings
FORCE_DEV_MODE = os.environ.get("FORCE_DEV_MODE", "false").lower() == "true"
DEV_OPENAI_QUOTA = 100  # Maximum OpenAI API calls per day in development

# Redis configuration
REDIS_URL = os.environ.get("REDIS_URL", "redis://localhost:6379/0")
```
Production Deployment
Kubernetes Manifests for Production
```yaml
# kubernetes/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mcp-api
  labels:
    app: mcp-api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: mcp-api
  template:
    metadata:
      labels:
        app: mcp-api
    spec:
      containers:
      - name: api
        image: ${DOCKER_REGISTRY}/mcp-api:${IMAGE_TAG}
        imagePullPolicy: Always
        ports:
        - containerPort: 8000
        env:
        - name: APP_ENV
          value: "production"
        - name: REDIS_URL
          valueFrom:
            secretKeyRef:
              name: mcp-secrets
              key: redis_url
        - name: OPENAI_API_KEY
          valueFrom:
            secretKeyRef:
              name: mcp-secrets
              key: openai_api_key
        - name: OLLAMA_HOST
          value: "http://ollama-service:11434"
        - name: MONTHLY_BUDGET
          value: "${MONTHLY_BUDGET}"
        resources:
          requests:
            cpu: 500m
            memory: 512Mi
          limits:
            cpu: 1000m
            memory: 1Gi
        readinessProbe:
          httpGet:
            path: /api/health
            port: 8000
          initialDelaySeconds: 10
          periodSeconds: 5
        livenessProbe:
          httpGet:
            path: /api/health
            port: 8000
          initialDelaySeconds: 20
          periodSeconds: 15
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
  labels:
    app: ollama
spec:
  replicas: 1 # Start with a single replica for Ollama
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama
    spec:
      containers:
      - name: ollama
        image: ollama/ollama:latest
        ports:
        - containerPort: 11434
        volumeMounts:
        - mountPath: /root/.ollama
          name: ollama-data
        resources:
          requests:
            cpu: 1000m
            memory: 4Gi
          limits:
            cpu: 4000m
            memory: 16Gi
        # If using GPU
        env:
        - name: NVIDIA_VISIBLE_DEVICES
          value: "all"
        - name: NVIDIA_DRIVER_CAPABILITIES
          value: "compute,utility"
      volumes:
      - name: ollama-data
        persistentVolumeClaim:
          claimName: ollama-pvc
---
apiVersion: v1
kind: Service
metadata:
  name: mcp-api-service
spec:
  selector:
    app: mcp-api
  ports:
  - port: 80
    targetPort: 8000
  type: ClusterIP
---
apiVersion: v1
kind: Service
metadata:
  name: ollama-service
spec:
  selector:
    app: ollama
  ports:
  - port: 11434
    targetPort: 11434
  type: ClusterIP
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: mcp-ingress
  annotations:
    kubernetes.io/ingress.class: "nginx"
    cert-manager.io/cluster-issuer: "letsencrypt-prod"
spec:
  tls:
  - hosts:
    - api.mcpservice.com
    secretName: mcp-tls
  rules:
  - host: api.mcpservice.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: mcp-api-service
            port:
              number: 80
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ollama-pvc
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 50Gi # Adjust based on your models
```
Horizontal Pod Autoscaling (HPA)
```yaml
# kubernetes/hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: mcp-api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: mcp-api
  minReplicas: 3
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
```
Deployment Script
```bash
#!/bin/bash
# deploy.sh - Production deployment script

set -e # Exit on error

# Check required environment variables.
# REDIS_PASSWORD and MONTHLY_BUDGET are used further down (Helm and envsubst),
# so they are validated here as well.
if [ -z "$DOCKER_REGISTRY" ] || [ -z "$IMAGE_TAG" ] || [ -z "$K8S_NAMESPACE" ] \
    || [ -z "$REDIS_PASSWORD" ] || [ -z "$MONTHLY_BUDGET" ]; then
    echo "Error: Required environment variables not set."
    echo "Please set DOCKER_REGISTRY, IMAGE_TAG, K8S_NAMESPACE, REDIS_PASSWORD, and MONTHLY_BUDGET."
    exit 1
fi

# Build and push Docker image
echo "Building and pushing Docker image..."
docker build -t "${DOCKER_REGISTRY}/mcp-api:${IMAGE_TAG}" -f Dockerfile.prod .
docker push "${DOCKER_REGISTRY}/mcp-api:${IMAGE_TAG}"

# Apply Kubernetes configuration
echo "Applying Kubernetes configuration..."

# Create namespace if it doesn't exist
kubectl get namespace "${K8S_NAMESPACE}" >/dev/null 2>&1 || kubectl create namespace "${K8S_NAMESPACE}"

# Apply secrets
echo "Applying secrets..."
kubectl apply -f kubernetes/secrets.yaml -n "${K8S_NAMESPACE}"

# Deploy Redis if needed (assumes the bitnami Helm repository is already configured)
echo "Deploying Redis..."
helm upgrade --install redis bitnami/redis \
    --namespace "${K8S_NAMESPACE}" \
    --set auth.password="${REDIS_PASSWORD}" \
    --set master.persistence.size=8Gi

# Deploy application
echo "Deploying application..."
# Replace variables in deployment file
envsubst < kubernetes/deployment.yaml | kubectl apply -f - -n "${K8S_NAMESPACE}"

# Apply HPA
kubectl apply -f kubernetes/hpa.yaml -n "${K8S_NAMESPACE}"

# Verify deployment
echo "Verifying deployment..."
kubectl rollout status deployment/mcp-api -n "${K8S_NAMESPACE}"
kubectl rollout status deployment/ollama -n "${K8S_NAMESPACE}"

# Initialize Ollama models if needed
echo "Would you like to initialize Ollama models? (y/n)"
read -r init_models

if [ "$init_models" = "y" ]; then
    echo "Initializing Ollama models..."
    # Get pod name
    OLLAMA_POD=$(kubectl get pods -l app=ollama -n "${K8S_NAMESPACE}" -o jsonpath="{.items[0].metadata.name}")

    # Pull models
    kubectl exec "${OLLAMA_POD}" -n "${K8S_NAMESPACE}" -- ollama pull llama2
    kubectl exec "${OLLAMA_POD}" -n "${K8S_NAMESPACE}" -- ollama pull mistral
    kubectl exec "${OLLAMA_POD}" -n "${K8S_NAMESPACE}" -- ollama pull codellama
fi

echo "Deployment complete!"
echo "API available at: https://api.mcpservice.com"
```
Production Dockerfile
```dockerfile
# Dockerfile.prod
FROM python:3.11-slim AS builder

WORKDIR /app

# Install build dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
    gcc \
    build-essential \
    && rm -rf /var/lib/apt/lists/*

# Install Python dependencies
COPY requirements.txt ./
RUN pip wheel --no-cache-dir --no-deps --wheel-dir /app/wheels -r requirements.txt

# Final stage
FROM python:3.11-slim

WORKDIR /app

# Copy wheels from builder stage
COPY --from=builder /app/wheels /wheels
RUN pip install --no-cache-dir /wheels/*

# Copy application code
COPY app /app/app
COPY scripts /app/scripts
COPY alembic.ini /app/

# Create non-root user
RUN useradd -m appuser && \
    chown -R appuser:appuser /app
USER appuser

# Set production environment
ENV PYTHONPATH=/app
ENV APP_ENV=production
ENV PYTHONUNBUFFERED=1

# Expose port
EXPOSE 8000

# Run using Gunicorn in production
CMD ["gunicorn", "-k", "uvicorn.workers.UvicornWorker", "-c", "app/config/gunicorn.py", "app.main:app"]
```
Gunicorn Configuration for Production
```python
# app/config/gunicorn.py
"""Gunicorn configuration for production deployment."""

import multiprocessing
import os

# Bind to 0.0.0.0:8000
bind = "0.0.0.0:8000"

# Worker configuration
workers = multiprocessing.cpu_count() * 2 + 1
worker_class = "uvicorn.workers.UvicornWorker"
worker_connections = 1000
timeout = 60
keepalive = 5

# Logging
accesslog = "-"
errorlog = "-"
loglevel = os.environ.get("LOG_LEVEL", "info").lower()

# Security
limit_request_line = 4094
limit_request_fields = 100
limit_request_field_size = 8190

# Process naming
proc_name = "mcp-api"
```
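The worker formula in the Gunicorn configuration uses `multiprocessing.cpu_count()`, which reports the host's cores rather than a container's CPU limit, so a pod limited to one CPU on a large node may start far more workers than it can serve. One way to bound this is sketched below, assuming a hypothetical `MAX_WORKERS` environment variable that the original configuration does not define.

```python
import multiprocessing
import os


def worker_count(cpus: int, cap: int) -> int:
    """Apply the (2 x cores) + 1 rule, bounded to the range [1, cap]."""
    return max(1, min(cpus * 2 + 1, cap))


# MAX_WORKERS is an assumed, illustrative variable name; the default of 4
# is arbitrary and should be tuned to the container's CPU and memory limits.
workers = worker_count(
    multiprocessing.cpu_count(),
    int(os.environ.get("MAX_WORKERS", "4")),
)

if __name__ == "__main__":
    print(workers)
```

Each worker loads the full application, so the cap also keeps memory use within the container's limit.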
Cloud Deployment (AWS)
AWS CloudFormation Template
```yaml
# aws/cloudformation.yaml
AWSTemplateFormatVersion: '2010-09-09'
Description: 'MCP OpenAI-Ollama Hybrid System'

Parameters:
  Environment:
    Description: Deployment environment
    Type: String
    Default: Production
    AllowedValues:
      - Development
      - Staging
      - Production

  ECRRepositoryName:
    Description: ECR Repository name
    Type: String
    Default: mcp-api

  VpcId:
    Description: VPC ID
    Type: AWS::EC2::VPC::Id

  SubnetIds:
    Description: Subnet IDs for the ECS tasks
    Type: List<AWS::EC2::Subnet::Id>

  OllamaInstanceType:
    Description: EC2 instance type for Ollama
    Type: String
    Default: g4dn.xlarge
    AllowedValues:
      - g4dn.xlarge
      - g5.xlarge
      - p3.2xlarge
      - c5.2xlarge  # CPU-only option

  ApiInstanceCount:
    Description: Number of API instances
    Type: Number
    Default: 2
    MinValue: 1
    MaxValue: 10

Resources:
  # ECR Repository
  ECRRepository:
    Type: AWS::ECR::Repository
    Properties:
      RepositoryName: !Ref ECRRepositoryName
      ImageScanningConfiguration:
        ScanOnPush: true
      LifecyclePolicy:
        LifecyclePolicyText: |
          {
            "rules": [
              {
                "rulePriority": 1,
                "description": "Keep only the 10 most recent images",
                "selection": {
                  "tagStatus": "any",
                  "countType": "imageCountMoreThan",
                  "countNumber": 10
                },
                "action": {
                  "type": "expire"
                }
              }
            ]
          }

  # ElastiCache Redis
  RedisSecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: Security group for Redis cluster
      VpcId: !Ref VpcId
      SecurityGroupIngress:
        - IpProtocol: tcp
          FromPort: 6379
          ToPort: 6379
          SourceSecurityGroupId: !GetAtt APISecurityGroup.GroupId

  RedisSubnetGroup:
    Type: AWS::ElastiCache::SubnetGroup
    Properties:
      Description: Subnet group for Redis
      SubnetIds: !Ref SubnetIds

  RedisCluster:
    Type: AWS::ElastiCache::CacheCluster
    Properties:
      Engine: redis
      CacheNodeType: cache.t3.medium
      NumCacheNodes: 1
      VpcSecurityGroupIds:
        - !GetAtt RedisSecurityGroup.GroupId
      CacheSubnetGroupName: !Ref RedisSubnetGroup
      AutoMinorVersionUpgrade: true

  # Ollama EC2 Instance
  OllamaSecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: Security group for Ollama EC2 instance
      VpcId: !Ref VpcId
      SecurityGroupIngress:
        - IpProtocol: tcp
          FromPort: 11434
          ToPort: 11434
          SourceSecurityGroupId: !GetAtt APISecurityGroup.GroupId
        - IpProtocol: tcp
          FromPort: 22
          ToPort: 22
          CidrIp: 0.0.0.0/0  # Restrict this in production

  OllamaInstanceRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service: ec2.amazonaws.com
            Action: sts:AssumeRole
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore

  OllamaInstanceProfile:
    Type: AWS::IAM::InstanceProfile
    Properties:
      Roles:
        - !Ref OllamaInstanceRole

  OllamaEBSVolume:
    Type: AWS::EC2::Volume
    Properties:
      # An EBS volume can only attach to an instance in the same Availability
      # Zone, so take the zone from the instance rather than from !GetAZs.
      AvailabilityZone: !GetAtt OllamaInstance.AvailabilityZone
      Size: 100
      VolumeType: gp3
      Encrypted: true
      Tags:
        - Key: Name
          Value: OllamaVolume

  OllamaInstance:
    Type: AWS::EC2::Instance
    Properties:
      InstanceType: !Ref OllamaInstanceType
      ImageId: ami-0261755bbcb8c4a84  # Amazon Linux 2 AMI - update as needed
      SecurityGroupIds:
        - !GetAtt OllamaSecurityGroup.GroupId
      SubnetId: !Select [0, !Ref SubnetIds]
      IamInstanceProfile: !Ref OllamaInstanceProfile
      BlockDeviceMappings:
        - DeviceName: /dev/xvda
          Ebs:
            VolumeSize: 30
            VolumeType: gp3
            DeleteOnTermination: true
      UserData:
        Fn::Base64: !Sub |
          #!/bin/bash
          # Install Docker
          amazon-linux-extras install docker -y
          systemctl start docker
          systemctl enable docker

          # Run Ollama in Docker. A separate host install of Ollama is omitted
          # because its service would compete with the container for port 11434.
          docker run -d --name ollama \
            -p 11434:11434 \
            -v ollama:/root/.ollama \
            ollama/ollama

          # Pull models
          docker exec ollama ollama pull llama2
          docker exec ollama ollama pull mistral
          docker exec ollama ollama pull codellama
      Tags:
        - Key: Name
          Value: !Sub "${AWS::StackName}-ollama"

  OllamaVolumeAttachment:
    Type: AWS::EC2::VolumeAttachment
    Properties:
      InstanceId: !Ref OllamaInstance
      VolumeId: !Ref OllamaEBSVolume
      Device: /dev/sdf

  # API ECS Cluster
  ECSCluster:
    Type: AWS::ECS::Cluster
    Properties:
      ClusterName: !Sub "${AWS::StackName}-cluster"
      CapacityProviders:
        - FARGATE
      DefaultCapacityProviderStrategy:
        - CapacityProvider: FARGATE
          Weight: 1

  APISecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: Security group for API ECS tasks
      VpcId: !Ref VpcId
      SecurityGroupIngress:
        - IpProtocol: tcp
          FromPort: 8000
          ToPort: 8000
          CidrIp: 0.0.0.0/0  # Restrict in production

  # ECS Task Definition
  ECSTaskExecutionRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service: ecs-tasks.amazonaws.com
            Action: sts:AssumeRole
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy

  ECSTaskRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service: ecs-tasks.amazonaws.com
            Action: sts:AssumeRole
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess

  APITaskDefinition:
    Type: AWS::ECS::TaskDefinition
    Properties:
      Family: !Sub "${AWS::StackName}-api"
      Cpu: '1024'
      Memory: '2048'
      NetworkMode: awsvpc
      RequiresCompatibilities:
        - FARGATE
      ExecutionRoleArn: !GetAtt ECSTaskExecutionRole.Arn
      TaskRoleArn: !GetAtt ECSTaskRole.Arn
      ContainerDefinitions:
        - Name: api
          Image: !Sub "${AWS::AccountId}.dkr.ecr.${AWS::Region}.amazonaws.com/${ECRRepositoryName}:latest"
          Essential: true
          PortMappings:
            - ContainerPort: 8000
          Environment:
            - Name: REDIS_URL
              Value: !Sub "redis://${RedisCluster.RedisEndpoint.Address}:${RedisCluster.RedisEndpoint.Port}/0"
            - Name: OLLAMA_HOST
              Value: !Sub "http://${OllamaInstance.PrivateIp}:11434"
            - Name: APP_ENV
              Value: !Ref Environment
          LogConfiguration:
            LogDriver: awslogs
            Options:
              awslogs-group: !Ref APILogGroup
              awslogs-region: !Ref AWS::Region
              awslogs-stream-prefix: api
          HealthCheck:
            Command:
              - CMD-SHELL
              - curl -f http://localhost:8000/api/health || exit 1
            Interval: 30
            Timeout: 5
            Retries: 3

  APILogGroup:
    Type: AWS::Logs::LogGroup
    Properties:
      LogGroupName: !Sub "/ecs/${AWS::StackName}-api"
      RetentionInDays: 7

  # ECS Service
  APIService:
    Type: AWS::ECS::Service
    DependsOn: ALBListener
    Properties:
      ServiceName: !Sub "${AWS::StackName}-api"
      Cluster: !Ref ECSCluster
      TaskDefinition: !Ref APITaskDefinition
      DesiredCount: !Ref ApiInstanceCount
      LaunchType: FARGATE
      NetworkConfiguration:
        AwsvpcConfiguration:
          AssignPublicIp: ENABLED
          SecurityGroups:
            - !GetAtt APISecurityGroup.GroupId
          Subnets: !Ref SubnetIds
      LoadBalancers:
        - TargetGroupArn: !Ref ALBTargetGroup
          ContainerName: api
          ContainerPort: 8000

  # Application Load Balancer
  ALB:
    Type: AWS::ElasticLoadBalancingV2::LoadBalancer
    Properties:
      Name: !Sub "${AWS::StackName}-alb"
      Type: application
      Scheme: internet-facing
      SecurityGroups:
        - !GetAtt ALBSecurityGroup.GroupId
      Subnets: !Ref SubnetIds
      LoadBalancerAttributes:
        - Key: idle_timeout.timeout_seconds
          Value: '60'

  ALBSecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: Security group for ALB
      VpcId: !Ref VpcId
      SecurityGroupIngress:
        - IpProtocol: tcp
          FromPort: 80
          ToPort: 80
          CidrIp: 0.0.0.0/0
        - IpProtocol: tcp
          FromPort: 443
          ToPort: 443
          CidrIp: 0.0.0.0/0

  ALBTargetGroup:
    Type: AWS::ElasticLoadBalancingV2::TargetGroup
    Properties:
      Name: !Sub "${AWS::StackName}-target-group"
      Port: 8000
      Protocol: HTTP
      TargetType: ip
      VpcId: !Ref VpcId
      HealthCheckPath: /api/health
      HealthCheckIntervalSeconds: 30
      HealthCheckTimeoutSeconds: 5
      HealthyThresholdCount: 3
      UnhealthyThresholdCount: 3

  ALBListener:
    Type: AWS::ElasticLoadBalancingV2::Listener
    Properties:
      LoadBalancerArn: !Ref ALB
      Port: 80
      Protocol: HTTP
      DefaultActions:
        - Type: forward
          TargetGroupArn: !Ref ALBTargetGroup

Outputs:
  APIEndpoint:
    Description: URL for API
    Value: !Sub "http://${ALB.DNSName}"

  OllamaEndpoint:
    Description: Ollama Server Private IP
    Value: !GetAtt OllamaInstance.PrivateIp

  ECRRepository:
    Description: ECR Repository URL
    Value: !Sub "${AWS::AccountId}.dkr.ecr.${AWS::Region}.amazonaws.com/${ECRRepositoryName}"

  RedisEndpoint:
    Description: Redis Endpoint
    Value: !Sub "${RedisCluster.RedisEndpoint.Address}:${RedisCluster.RedisEndpoint.Port}"
```
AWS Deployment Script
```bash
#!/bin/bash
# aws_deploy.sh - AWS deployment script

set -e  # Exit on error

# Check that the AWS CLI is installed
if ! command -v aws &> /dev/null; then
    echo "AWS CLI is required but not installed. Aborting."
    exit 1
fi

# AWS configuration
AWS_REGION="us-east-1"
STACK_NAME="mcp-hybrid-system"
CFN_TEMPLATE="aws/cloudformation.yaml"
IMAGE_TAG=$(git rev-parse --short HEAD)

# The template declares VpcId and SubnetIds without defaults, so they must be
# supplied. Reading them from the environment is a choice made by this script.
VPC_ID="${VPC_ID:?Set VPC_ID to the target VPC ID}"
SUBNET_IDS="${SUBNET_IDS:?Set SUBNET_IDS to a comma-separated list of subnet IDs}"

STACK_PARAMS=(
    "ParameterKey=Environment,ParameterValue=Production"
    "ParameterKey=OllamaInstanceType,ParameterValue=g4dn.xlarge"
    "ParameterKey=ApiInstanceCount,ParameterValue=2"
    "ParameterKey=VpcId,ParameterValue=${VPC_ID}"
    # Commas inside a list value must be escaped for the CLI shorthand syntax
    "ParameterKey=SubnetIds,ParameterValue=${SUBNET_IDS//,/\\,}"
)

# Check if stack exists
if aws cloudformation describe-stacks --stack-name "$STACK_NAME" --region "$AWS_REGION" &> /dev/null; then
    STACK_ACTION="update"
else
    STACK_ACTION="create"
fi

# Deploy CloudFormation stack
if [ "$STACK_ACTION" = "create" ]; then
    echo "Creating CloudFormation stack..."
    aws cloudformation create-stack \
        --stack-name "$STACK_NAME" \
        --template-body "file://$CFN_TEMPLATE" \
        --capabilities CAPABILITY_IAM \
        --parameters "${STACK_PARAMS[@]}" \
        --region "$AWS_REGION"

    # Wait for stack creation
    echo "Waiting for stack creation to complete..."
    aws cloudformation wait stack-create-complete \
        --stack-name "$STACK_NAME" \
        --region "$AWS_REGION"
else
    echo "Updating CloudFormation stack..."
    # update-stack returns an error when nothing changed; treat that as success
    # so an image-only deployment can still proceed.
    if UPDATE_OUTPUT=$(aws cloudformation update-stack \
        --stack-name "$STACK_NAME" \
        --template-body "file://$CFN_TEMPLATE" \
        --capabilities CAPABILITY_IAM \
        --parameters "${STACK_PARAMS[@]}" \
        --region "$AWS_REGION" 2>&1); then
        # Wait for stack update
        echo "Waiting for stack update to complete..."
        aws cloudformation wait stack-update-complete \
            --stack-name "$STACK_NAME" \
            --region "$AWS_REGION"
    elif echo "$UPDATE_OUTPUT" | grep -q "No updates are to be performed"; then
        echo "Stack is already up to date."
    else
        echo "$UPDATE_OUTPUT" >&2
        exit 1
    fi
fi

# Get stack outputs
echo "Getting stack outputs..."
ECR_REPOSITORY=$(aws cloudformation describe-stacks \
    --stack-name "$STACK_NAME" \
    --query "Stacks[0].Outputs[?OutputKey=='ECRRepository'].OutputValue" \
    --output text \
    --region "$AWS_REGION")

API_ENDPOINT=$(aws cloudformation describe-stacks \
    --stack-name "$STACK_NAME" \
    --query "Stacks[0].Outputs[?OutputKey=='APIEndpoint'].OutputValue" \
    --output text \
    --region "$AWS_REGION")

# Build and push Docker image
echo "Building and pushing Docker image to ECR..."
# Log in to the ECR registry (the host part of the repository URL)
ECR_REGISTRY="${ECR_REPOSITORY%%/*}"
aws ecr get-login-password --region "$AWS_REGION" | docker login --username AWS --password-stdin "$ECR_REGISTRY"

# Build and push
docker build -t "$ECR_REPOSITORY:$IMAGE_TAG" -t "$ECR_REPOSITORY:latest" -f Dockerfile.prod .
docker push "$ECR_REPOSITORY:$IMAGE_TAG"
docker push "$ECR_REPOSITORY:latest"

# Update ECS service to force deployment
echo "Updating ECS service..."
ECS_CLUSTER="${STACK_NAME}-cluster"
ECS_SERVICE="${STACK_NAME}-api"

aws ecs update-service \
    --cluster "$ECS_CLUSTER" \
    --service "$ECS_SERVICE" \
    --force-new-deployment \
    --region "$AWS_REGION"

echo "Deployment complete!"
echo "API Endpoint: $API_ENDPOINT"
```
Optimization and Deployment Strategies for OpenAI-Ollama Hybrid AI System (Continued)
Monitoring and Observability Configuration
Prometheus and Grafana Setup for Metrics
```yaml
# monitoring/prometheus-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
      evaluation_interval: 15s

    scrape_configs:
      - job_name: 'mcp-api'
        metrics_path: /metrics
        kubernetes_sd_configs:
          - role: pod
        relabel_configs:
          - source_labels: [__meta_kubernetes_pod_label_app]
            regex: mcp-api
            action: keep

      # Assumes something at this address serves Prometheus metrics on /metrics;
      # point this job at an exporter if your Ollama build does not.
      - job_name: 'ollama'
        metrics_path: /metrics
        static_configs:
          - targets: ['ollama-service:11434']

    alerting:
      alertmanagers:
        - static_configs:
            - targets: ['alertmanager:9093']
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prometheus
  template:
    metadata:
      labels:
        app: prometheus
    spec:
      containers:
        - name: prometheus
          image: prom/prometheus:v2.42.0
          ports:
            - containerPort: 9090
          volumeMounts:
            - name: config-volume
              mountPath: /etc/prometheus
            - name: prometheus-data
              mountPath: /prometheus
          args:
            - "--config.file=/etc/prometheus/prometheus.yml"
            - "--storage.tsdb.path=/prometheus"
            - "--web.console.libraries=/usr/share/prometheus/console_libraries"
            - "--web.console.templates=/usr/share/prometheus/consoles"
            - "--web.enable-lifecycle"
      volumes:
        - name: config-volume
          configMap:
            name: prometheus-config
        - name: prometheus-data
          persistentVolumeClaim:
            claimName: prometheus-pvc
---
apiVersion: v1
kind: Service
metadata:
  name: prometheus-service
spec:
  selector:
    app: prometheus
  ports:
    - port: 9090
      targetPort: 9090
  type: ClusterIP
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: grafana
spec:
  replicas: 1
  selector:
    matchLabels:
      app: grafana
  template:
    metadata:
      labels:
        app: grafana
    spec:
      containers:
        - name: grafana
          image: grafana/grafana:9.4.7
          ports:
            - containerPort: 3000
          volumeMounts:
            - name: grafana-data
              mountPath: /var/lib/grafana
          env:
            - name: GF_SECURITY_ADMIN_USER
              valueFrom:
                secretKeyRef:
                  name: grafana-secrets
                  key: admin_user
            - name: GF_SECURITY_ADMIN_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: grafana-secrets
                  key: admin_password
      volumes:
        - name: grafana-data
          persistentVolumeClaim:
            claimName: grafana-pvc
---
apiVersion: v1
kind: Service
metadata:
  name: grafana-service
spec:
  selector:
    app: grafana
  ports:
    - port: 3000
      targetPort: 3000
  type: ClusterIP
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: prometheus-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: grafana-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 5Gi
```
Grafana Dashboard Configuration
```json
{
  "annotations": {
    "list": [
      {
        "builtIn": 1,
        "datasource": "-- Grafana --",
        "enable": true,
        "hide": true,
        "iconColor": "rgba(0, 211, 255, 1)",
        "name": "Annotations & Alerts",
        "type": "dashboard"
      }
    ]
  },
  "editable": true,
  "gnetId": null,
  "graphTooltip": 0,
  "id": 1,
  "links": [],
  "panels": [
    {
      "aliasColors": {},
      "bars": false,
      "dashLength": 10,
      "dashes": false,
      "datasource": "Prometheus",
      "fieldConfig": {
        "defaults": {
          "custom": {}
        },
        "overrides": []
      },
      "fill": 1,
      "fillGradient": 0,
      "gridPos": {
        "h": 8,
        "w": 12,
        "x": 0,
        "y": 0
      },
      "hiddenSeries": false,
      "id": 2,
      "legend": {
        "avg": false,
        "current": false,
        "max": false,
        "min": false,
        "show": true,
        "total": false,
        "values": false
      },
      "lines": true,
      "linewidth": 1,
      "nullPointMode": "null",
      "options": {
        "alertThreshold": true
      },
      "percentage": false,
      "pluginVersion": "7.2.0",
      "pointradius": 2,
      "points": false,
      "renderer": "flot",
      "seriesOverrides": [],
      "spaceLength": 10,
      "stack": false,
      "steppedLine": false,
      "targets": [
        {
          "expr": "rate(api_requests_total[5m])",
          "interval": "",
          "legendFormat": "Requests ({{provider}})",
          "refId": "A"
        }
      ],
      "thresholds": [],
      "timeFrom": null,
      "timeRegions": [],
      "timeShift": null,
      "title": "Request Rate by Provider",
      "tooltip": {
        "shared": true,
        "sort": 0,
        "value_type": "individual"
      },
      "type": "graph",
      "xaxis": {
        "buckets": null,
        "mode": "time",
        "name": null,
        "show": true,
        "values": []
      },
      "yaxes": [
        {
          "format": "short",
          "label": "Requests/sec",
          "logBase": 1,
          "max": null,
          "min": null,
          "show": true
        },
        {
          "format": "short",
          "label": null,
          "logBase": 1,
          "max": null,
          "min": null,
          "show": true
        }
      ],
      "yaxis": {
        "align": false,
        "alignLevel": null
      }
    },
    {
      "aliasColors": {},
      "bars": false,
      "dashLength": 10,
      "dashes": false,
      "datasource": "Prometheus",
      "fieldConfig": {
        "defaults": {
          "custom": {}
        },
        "overrides": []
      },
      "fill": 1,
      "fillGradient": 0,
      "gridPos": {
        "h": 8,
        "w": 12,
        "x": 12,
        "y": 0
      },
      "hiddenSeries": false,
      "id": 3,
      "legend": {
        "avg": false,
        "current": false,
        "max": false,
        "min": false,
        "show": true,
        "total": false,
        "values": false
      },
      "lines": true,
      "linewidth": 1,
      "nullPointMode": "null",
      "options": {
        "alertThreshold": true
      },
      "percentage": false,
      "pluginVersion": "7.2.0",
      "pointradius": 2,
      "points": false,
      "renderer": "flot",
      "seriesOverrides": [],
      "spaceLength": 10,
      "stack": false,
      "steppedLine": false,
      "targets": [
        {
          "expr": "histogram_quantile(0.5, sum(rate(api_response_time_seconds_bucket[5m])) by (le, provider))",
          "interval": "",
          "legendFormat": "50th % ({{provider}})",
          "refId": "A"
        },
        {
          "expr": "histogram_quantile(0.9, sum(rate(api_response_time_seconds_bucket[5m])) by (le, provider))",
          "interval": "",
          "legendFormat": "90th % ({{provider}})",
          "refId": "B"
        },
        {
          "expr": "histogram_quantile(0.99, sum(rate(api_response_time_seconds_bucket[5m])) by (le, provider))",
          "interval": "",
          "legendFormat": "99th % ({{provider}})",
          "refId": "C"
        }
      ],
      "thresholds": [],
      "timeFrom": null,
      "timeRegions": [],
      "timeShift": null,
      "title": "Response Time by Provider",
      "tooltip": {
        "shared": true,
        "sort": 0,
        "value_type": "individual"
      },
      "type": "graph",
      "xaxis": {
        "buckets": null,
        "mode": "time",
        "name": null,
        "show": true,
        "values": []
      },
      "yaxes": [
        {
          "format": "s",
          "label": "Response Time",
          "logBase": 1,
          "max": null,
          "min": null,
          "show": true
        },
        {
          "format": "short",
          "label": null,
          "logBase": 1,
          "max": null,
          "min": null,
          "show": true
        }
      ],
      "yaxis": {
        "align": false,
        "alignLevel": null
      }
    },
    {
      "datasource": "Prometheus",
      "fieldConfig": {
        "defaults": {
          "custom": {},
          "mappings": [],
          "thresholds": {
            "mode": "absolute",
            "steps": [
              {
                "color": "green",
                "value": null
              },
              {
                "color": "red",
                "value": 80
              }
            ]
          }
        },
        "overrides": []
      },
      "gridPos": {
        "h": 8,
        "w": 8,
        "x": 0,
        "y": 8
      },
      "id": 4,
      "options": {
        "colorMode": "value",
        "graphMode": "area",
        "justifyMode": "auto",
        "orientation": "auto",
        "reduceOptions": {
          "calcs": [
            "mean"
          ],
          "fields": "",
          "values": false
        },
        "textMode": "auto"
      },
      "pluginVersion": "7.2.0",
      "targets": [
        {
          "expr": "sum(api_requests_total{provider=\"openai\"})",
          "interval": "",
          "legendFormat": "",
          "refId": "A"
        }
      ],
      "timeFrom": null,
      "timeShift": null,
      "title": "OpenAI Total Requests",
      "type": "stat"
    },
    {
      "datasource": "Prometheus",
      "fieldConfig": {
        "defaults": {
          "custom": {},
          "mappings": [],
          "thresholds": {
            "mode": "absolute",
            "steps": [
              {
                "color": "green",
                "value": null
              },
              {
                "color": "red",
                "value": 80
              }
            ]
          }
        },
        "overrides": []
      },
      "gridPos": {
        "h": 8,
        "w": 8,
        "x": 8,
        "y": 8
      },
      "id": 5,
      "options": {
        "colorMode": "value",
        "graphMode": "area",
        "justifyMode": "auto",
        "orientation": "auto",
        "reduceOptions": {
          "calcs": [
            "mean"
          ],
          "fields": "",
          "values": false
        },
        "textMode": "auto"
      },
      "pluginVersion": "7.2.0",
      "targets": [
        {
          "expr": "sum(api_requests_total{provider=\"ollama\"})",
          "interval": "",
          "legendFormat": "",
          "refId": "A"
        }
      ],
      "timeFrom": null,
      "timeShift": null,
      "title": "Ollama Total Requests",
      "type": "stat"
    },
    {
      "datasource": "Prometheus",
      "fieldConfig": {
        "defaults": {
          "custom": {},
          "mappings": [],
          "thresholds": {
            "mode": "absolute",
            "steps": [
              {
                "color": "green",
                "value": null
              },
              {
                "color": "red",
                "value": 80
              }
            ]
          },
          "unit": "currencyUSD"
        },
        "overrides": []
      },
      "gridPos": {
        "h": 8,
        "w": 8,
        "x": 16,
        "y": 8
      },
      "id": 6,
      "options": {
        "colorMode": "value",
        "graphMode": "area",
        "justifyMode": "auto",
        "orientation": "auto",
        "reduceOptions": {
          "calcs": [
            "sum"
          ],
          "fields": "",
          "values": false
        },
        "textMode": "auto"
      },
      "pluginVersion": "7.2.0",
      "targets": [
        {
          "expr": "sum(api_openai_cost_total)",
          "interval": "",
          "legendFormat": "",
          "refId": "A"
        }
      ],
      "timeFrom": null,
      "timeShift": null,
      "title": "OpenAI Cost",
      "type": "stat"
    },
    {
      "aliasColors": {},
      "bars": false,
      "dashLength": 10,
      "dashes": false,
      "datasource": "Prometheus",
      "fieldConfig": {
        "defaults": {
          "custom": {}
        },
        "overrides": []
      },
      "fill": 1,
      "fillGradient": 0,
      "gridPos": {
        "h": 8,
        "w": 12,
        "x": 0,
        "y": 16
      },
      "hiddenSeries": false,
      "id": 7,
      "legend": {
        "avg": false,
        "current": false,
        "max": false,
        "min": false,
        "show": true,
        "total": false,
        "values": false
      },
      "lines": true,
      "linewidth": 1,
      "nullPointMode": "null",
      "options": {
        "alertThreshold": true
      },
      "percentage": false,
      "pluginVersion": "7.2.0",
      "pointradius": 2,
      "points": false,
      "renderer": "flot",
      "seriesOverrides": [],
      "spaceLength": 10,
      "stack": false,
      "steppedLine": false,
      "targets": [
        {
          "expr": "rate(api_token_usage_total{type=\"prompt\"}[5m])",
          "interval": "",
          "legendFormat": "Prompt ({{provider}})",
          "refId": "A"
        },
        {
          "expr": "rate(api_token_usage_total{type=\"completion\"}[5m])",
          "interval": "",
          "legendFormat": "Completion ({{provider}})",
          "refId": "B"
        }
      ],
      "thresholds": [],
      "timeFrom": null,
      "timeRegions": [],
      "timeShift": null,
      "title": "Token Usage Rate by Type",
      "tooltip": {
        "shared": true,
        "sort": 0,
        "value_type": "individual"
      },
      "type": "graph",
      "xaxis": {
        "buckets": null,
        "mode": "time",
        "name": null,
        "show": true,
        "values": []
      },
      "yaxes": [
        {
          "format": "short",
          "label": "Tokens/sec",
          "logBase": 1,
          "max": null,
          "min": null,
          "show": true
        },
        {
          "format": "short",
          "label": null,
          "logBase": 1,
          "max": null,
          "min": null,
          "show": true
        }
      ],
      "yaxis": {
        "align": false,
        "alignLevel": null
      }
    },
    {
      "aliasColors": {},
      "bars": false,
      "dashLength": 10,
      "dashes": false,
      "datasource": "Prometheus",
      "fieldConfig": {
        "defaults": {
          "custom": {}
        },
        "overrides": []
      },
      "fill": 1,
      "fillGradient": 0,
      "gridPos": {
        "h": 8,
        "w": 12,
        "x": 12,
        "y": 16
      },
      "hiddenSeries": false,
      "id": 8,
      "legend": {
        "avg": false,
        "current": false,
        "max": false,
        "min": false,
        "show": true,
        "total": false,
        "values": false
      },
      "lines": true,
      "linewidth": 1,
      "nullPointMode": "null",
      "options": {
        "alertThreshold": true
      },
      "percentage": false,
      "pluginVersion": "7.2.0",
      "pointradius": 2,
      "points": false,
      "renderer": "flot",
      "seriesOverrides": [],
      "spaceLength": 10,
      "stack": false,
      "steppedLine": false,
      "targets": [
        {
          "expr": "rate(api_cache_hits_total[5m])",
          "interval": "",
          "legendFormat": "Cache Hits",
          "refId": "A"
        },
        {
          "expr": "rate(api_cache_misses_total[5m])",
          "interval": "",
          "legendFormat": "Cache Misses",
          "refId": "B"
        }
      ],
      "thresholds": [],
      "timeFrom": null,
      "timeRegions": [],
      "timeShift": null,
      "title": "Cache Performance",
      "tooltip": {
        "shared": true,
        "sort": 0,
        "value_type": "individual"
      },
      "type": "graph",
      "xaxis": {
        "buckets": null,
        "mode": "time",
        "name": null,
        "show": true,
        "values": []
      },
      "yaxes": [
        {
          "format": "short",
          "label": "Rate",
          "logBase": 1,
          "max": null,
          "min": null,
          "show": true
        },
        {
          "format": "short",
          "label": null,
          "logBase": 1,
          "max": null,
          "min": null,
          "show": true
        }
      ],
      "yaxis": {
        "align": false,
        "alignLevel": null
      }
    }
  ],
  "refresh": "10s",
  "schemaVersion": 26,
  "style": "dark",
  "tags": [],
  "templating": {
    "list": []
  },
  "time": {
    "from": "now-6h",
    "to": "now"
  },
  "timepicker": {
    "refresh_intervals": [
      "5s",
      "10s",
      "30s",
      "1m",
      "5m",
      "15m",
      "30m",
      "1h",
      "2h",
      "1d"
    ]
  },
  "timezone": "",
  "title": "MCP Hybrid System Dashboard",
  "uid": "mcp-dashboard",
  "version": 1
}
```
Implementing Metrics Collection in API
```python
# app/middleware/metrics.py
from fastapi import Request
import time
from prometheus_client import Counter, Histogram, Gauge
import logging

# Initialize metrics
REQUEST_COUNT = Counter(
    'api_requests_total',
    'Total count of API requests',
    ['method', 'endpoint', 'provider', 'model', 'status']
)

RESPONSE_TIME = Histogram(
    'api_response_time_seconds',
    'Response time in seconds',
    ['method', 'endpoint', 'provider']
)

TOKEN_USAGE = Counter(
    'api_token_usage_total',
    'Total token usage',
    ['provider', 'model', 'type']  # type: prompt or completion
)

OPENAI_COST = Counter(
    'api_openai_cost_total',
    'Total OpenAI API cost in USD',
    ['model']
)

ACTIVE_REQUESTS = Gauge(
    'api_active_requests',
    'Number of active requests',
    ['method']
)

CACHE_HITS = Counter(
    'api_cache_hits_total',
    'Total cache hits',
    ['cache_type']  # exact or semantic
)

CACHE_MISSES = Counter(
    'api_cache_misses_total',
    'Total cache misses',
    []
)

logger = logging.getLogger(__name__)

async def metrics_middleware(request: Request, call_next):
    """Middleware to collect metrics for API requests."""
    # Track active requests
    ACTIVE_REQUESTS.labels(method=request.method).inc()

    # Start timing
    start_time = time.time()

    # Default status code
    status_code = 500
    provider = "unknown"
    model = "unknown"

    try:
        # Process the request
        response = await call_next(request)
        status_code = response.status_code

        # Try to get provider and model from response headers if available
        provider = response.headers.get("X-Provider", "unknown")
        model = response.headers.get("X-Model", "unknown")

        return response
    except Exception:
        logger.exception("Unhandled exception in request")
        raise
    finally:
        # Calculate response time
        response_time = time.time() - start_time

        # Record metrics
        REQUEST_COUNT.labels(
            method=request.method,
            endpoint=request.url.path,
            provider=provider,
            model=model,
            status=status_code
        ).inc()

        RESPONSE_TIME.labels(
            method=request.method,
            endpoint=request.url.path,
            provider=provider
        ).observe(response_time)

        # Decrement active requests
        ACTIVE_REQUESTS.labels(method=request.method).dec()
```
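The middleware above declares token and cost counters (`api_token_usage_total`, `api_openai_cost_total`) but only the request handlers know how many tokens a call consumed. A minimal sketch of the cost arithmetic those handlers might feed into the counters follows. The `UsageRecord` type, the `estimate_cost` helper, and the price table are hypothetical and not part of the original code; the prices are placeholders, not real rates.

```python
from dataclasses import dataclass

# Hypothetical per-1K-token prices in USD. These are placeholder numbers;
# replace them with the current rates for the models you actually call.
PRICES_PER_1K = {
    "example-model": {"prompt": 0.0005, "completion": 0.0015},
}


@dataclass
class UsageRecord:
    """Token counts reported for a single completed request."""
    provider: str
    model: str
    prompt_tokens: int
    completion_tokens: int


def estimate_cost(record: UsageRecord) -> float:
    """Return the estimated USD cost of one request.

    Only OpenAI usage is billed; local Ollama usage is treated as free,
    and models missing from the price table are counted as zero cost.
    """
    if record.provider != "openai":
        return 0.0
    prices = PRICES_PER_1K.get(record.model)
    if prices is None:
        return 0.0
    prompt_cost = (record.prompt_tokens / 1000) * prices["prompt"]
    completion_cost = (record.completion_tokens / 1000) * prices["completion"]
    return prompt_cost + completion_cost
```

A handler could then pass the token counts to `TOKEN_USAGE.labels(provider=..., model=..., type="prompt").inc(n)` and the result of `estimate_cost` to `OPENAI_COST.labels(model=...).inc(cost)`, which keeps the dashboard's cost panel in step with actual usage.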
Scaling Strategies
Optimizing Ollama Scaling for High Loads
```python
# app/services/ollama_scaling.py
import logging
import asyncio
import time
from typing import Dict, List, Any, Optional
import random
import httpx

logger = logging.getLogger(__name__)

class OllamaScalingService:
    """
    Manages load balancing and scaling for multiple Ollama instances.
    """

    def __init__(self):
        self.ollama_instances = []
        self.instance_status = {}
        self.model_availability = {}
        self.health_check_interval = 60  # seconds
        self.enable_scaling = False
        self.min_instances = 1
        self.max_instances = 5
        self.health_check_task = None

    @staticmethod
    def _normalize_model(model: str) -> str:
        """Ollama lists models with an explicit tag, so treat 'llama2' as 'llama2:latest'."""
        return model if ":" in model else f"{model}:latest"

    async def initialize(self, instances: List[str]):
        """Initialize the service with a list of Ollama instances."""
        self.ollama_instances = instances
        self.instance_status = {instance: False for instance in instances}
        self.model_availability = {instance: [] for instance in instances}

        # Start health checking
        self.health_check_task = asyncio.create_task(self._health_check_loop())

        # Perform initial health check
        await self._check_all_instances()

        logger.info(f"Initialized Ollama scaling with {len(instances)} instances")

    async def shutdown(self):
        """Shutdown the service."""
        if self.health_check_task:
            self.health_check_task.cancel()
            try:
                await self.health_check_task
            except asyncio.CancelledError:
                pass

    async def _health_check_loop(self):
        """Periodically check health of all instances."""
        while True:
            try:
                await self._check_all_instances()
                await asyncio.sleep(self.health_check_interval)
            except asyncio.CancelledError:
                break
            except Exception as e:
                logger.error(f"Error in health check loop: {str(e)}")
                await asyncio.sleep(5)  # Shorter retry on error

    async def _check_all_instances(self):
        """Check health and model availability for all instances."""
        tasks = []
        for instance in self.ollama_instances:
            tasks.append(self._check_instance(instance))

        # Run all checks in parallel
        await asyncio.gather(*tasks, return_exceptions=True)

        # Log status
        healthy_count = sum(1 for status in self.instance_status.values() if status)
        logger.debug(f"Ollama health check: {healthy_count}/{len(self.ollama_instances)} instances healthy")

    async def _check_instance(self, instance: str):
        """Check health and model availability for a single instance."""
        try:
            async with httpx.AsyncClient(timeout=5.0) as client:
                response = await client.get(f"{instance}/api/version")

                if response.status_code == 200:
                    # Instance is healthy
                    self.instance_status[instance] = True

                    # Check available models
                    models_response = await client.get(f"{instance}/api/tags")
                    if models_response.status_code == 200:
                        data = models_response.json()
                        models = [model["name"] for model in data.get("models", [])]
                        self.model_availability[instance] = models
                else:
                    self.instance_status[instance] = False
        except Exception as e:
            logger.warning(f"Health check failed for {instance}: {str(e)}")
            self.instance_status[instance] = False

    def get_instance_for_model(self, model: str) -> Optional[str]:
        """Get the best instance for a specific model."""
        wanted = self._normalize_model(model)

        # Filter to healthy instances that have the model
        candidates = [
            instance for instance, status in self.instance_status.items()
            if status and wanted in self.model_availability.get(instance, [])
        ]

        if not candidates:
            return None

        # Use random selection for basic load balancing
        # A more sophisticated version would track load, response times, etc.
        return random.choice(candidates)

    def get_healthy_instance(self) -> Optional[str]:
        """Get any healthy instance."""
        candidates = [
            instance for instance, status in self.instance_status.items()
            if status
        ]

        if not candidates:
            return None

        return random.choice(candidates)

    async def ensure_model_availability(self, model: str) -> bool:
        """
        Ensure at least one instance has the required model.
        Returns True if model is available or successfully pulled.
        """
        wanted = self._normalize_model(model)

        # Check if any instance already has this model
        for instance, models in self.model_availability.items():
            if self.instance_status.get(instance, False) and wanted in models:
                return True

        # Try to pull the model on a healthy instance
        instance = self.get_healthy_instance()
        if not instance:
            logger.error(f"No healthy Ollama instances available to pull model {model}")
            return False

        # Try to pull the model
        try:
            async with httpx.AsyncClient(timeout=300.0) as client:  # Longer timeout for model pull
                response = await client.post(
                    f"{instance}/api/pull",
                    # stream=False asks for one final JSON object instead of
                    # a stream of progress updates
                    json={"name": model, "stream": False}
                )

                if response.status_code == 200:
                    logger.info(f"Successfully pulled model {model} on {instance}")
                    # Update model availability
                    models = self.model_availability.setdefault(instance, [])
                    if wanted not in models:
                        models.append(wanted)
                    return True
                else:
                    logger.error(f"Failed to pull model {model} on {instance}: {response.text}")
                    return False
        except Exception as e:
            logger.error(f"Error pulling model {model} on {instance}: {str(e)}")
            return False
```
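The scaling service picks an instance at random and notes in a comment that a more sophisticated version would track load. One possible direction, shown here as a minimal sketch, is a least-outstanding-requests selector. The `LeastLoadedSelector` class and its `acquire`/`release` methods are hypothetical names introduced for illustration; they are not part of the original service.

```python
from collections import defaultdict
from typing import Dict, List, Optional


class LeastLoadedSelector:
    """Pick the candidate instance with the fewest in-flight requests."""

    def __init__(self) -> None:
        # Maps instance URL -> number of requests currently being served
        self._in_flight: Dict[str, int] = defaultdict(int)

    def acquire(self, candidates: List[str]) -> Optional[str]:
        """Choose the least-loaded candidate and count one more request on it.

        Ties are broken by list order, so earlier candidates win.
        """
        if not candidates:
            return None
        chosen = min(candidates, key=lambda instance: self._in_flight[instance])
        self._in_flight[chosen] += 1
        return chosen

    def release(self, instance: str) -> None:
        """Mark one request on the instance as finished."""
        if self._in_flight[instance] > 0:
            self._in_flight[instance] -= 1
```

In use, the list of healthy candidates built inside `get_instance_for_model` would be passed to `acquire`, and `release` would be called in a `finally` block once the Ollama request completes, so a failed request does not leave the counter inflated.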
Autoscaling Configuration for Cloud Deployments
```yaml
# kubernetes/autoscaler-config.yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: mcp-api-vpa
spec:
  targetRef:
    apiVersion: "apps/v1"
    kind: Deployment
    name: mcp-api
  updatePolicy:
    updateMode: "Auto"
  resourcePolicy:
    containerPolicies:
      - containerName: '*'
        minAllowed:
          cpu: 250m
          memory: 256Mi
        maxAllowed:
          cpu: 2000m
          memory: 4Gi
        controlledResources: ["cpu", "memory"]
---
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: mcp-api-scaler
spec:
  scaleTargetRef:
    name: mcp-api
  minReplicaCount: 2
  maxReplicaCount: 20
  pollingInterval: 15
  cooldownPeriod: 300
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus-service:9090
        metricName: api_active_requests
        threshold: '10'
        query: sum(api_active_requests)
    - type: prometheus
      metadata:
        serverAddress: http://prometheus-service:9090
        metricName: api_response_time_p90
        threshold: '2.0'
        query: histogram_quantile(0.9, sum(rate(api_response_time_seconds_bucket[2m])) by (le))
```
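To get a feel for how the thresholds in the ScaledObject translate into pod counts, the arithmetic can be sketched in a few lines. This assumes the triggers behave as average-value targets, where the query result is divided by the threshold and rounded up, and that the largest proposal across triggers wins; treat it as an approximation for capacity planning, not a description of the autoscaler's internals. The function names are hypothetical.

```python
import math
from typing import List, Tuple


def desired_replicas(metric_total: float, threshold: float,
                     min_replicas: int, max_replicas: int) -> int:
    """Replica count for one trigger: ceil(metric / threshold), clamped to bounds."""
    if threshold <= 0:
        raise ValueError("threshold must be positive")
    raw = math.ceil(metric_total / threshold)
    return max(min_replicas, min(max_replicas, raw))


def combined_replicas(triggers: List[Tuple[float, float]],
                      min_replicas: int, max_replicas: int) -> int:
    """Take the largest proposal across (metric_total, threshold) pairs."""
    if not triggers:
        return min_replicas
    return max(
        desired_replicas(value, threshold, min_replicas, max_replicas)
        for value, threshold in triggers
    )
```

With the values from the configuration above (threshold 10, bounds 2 to 20), 45 active requests would call for 5 replicas under these assumptions, while 5 active requests would stay at the 2-replica floor.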
Cost Optimization - Monthly Budget Tracking
```python
# app/services/budget_service.py
# Written against the aioredis 1.x API (create_redis_pool, expire=, exist=).
import logging
import time
from datetime import datetime, timedelta
from typing import Any, Dict

import aioredis

logger = logging.getLogger(__name__)


class BudgetService:
    """
    Manages API budget tracking and quota enforcement.
    """

    def __init__(self, redis_url: str):
        self.redis = None
        self.redis_url = redis_url
        self.monthly_budget = 0.0
        self.daily_budget = 0.0
        self.alert_threshold = 0.8  # Alert at 80% of budget
        self.budget_lock_key = "budget:lock"
        self.last_reset_check = 0

    async def initialize(self, monthly_budget: float = 0.0):
        """Initialize the budget service."""
        self.redis = await aioredis.create_redis_pool(self.redis_url)
        self.monthly_budget = monthly_budget
        self.daily_budget = monthly_budget / 30 if monthly_budget > 0 else 0

        # Initialize monthly budget in Redis if not already set
        if not await self.redis.exists("budget:monthly:total"):
            await self.redis.set("budget:monthly:total", str(monthly_budget))

        # Initialize current usage if not already set
        if not await self.redis.exists("budget:monthly:used"):
            await self.redis.set("budget:monthly:used", "0.0")

        # Set the reset day (1st of month)
        if not await self.redis.exists("budget:reset_day"):
            await self.redis.set("budget:reset_day", "1")

        # Check if we need to reset the budget
        await self._check_budget_reset()

        logger.info(f"Budget service initialized with monthly budget: ${monthly_budget:.2f}")

    async def close(self):
        """Close the Redis connection."""
        if self.redis:
            self.redis.close()
            await self.redis.wait_closed()

    async def _check_budget_reset(self):
        """Check if the budget needs to be reset (new month)."""
        now = time.time()
        # Only check once per hour to avoid excessive checks
        if now - self.last_reset_check < 3600:
            return

        self.last_reset_check = now

        # Try to acquire the lock so only one process performs the reset.
        # This happens outside the try block so that a process that did NOT
        # get the lock never deletes another process's lock in `finally`.
        lock = await self.redis.set(
            self.budget_lock_key, "1",
            expire=60, exist="SET_IF_NOT_EXIST"
        )
        if not lock:
            return  # Another process is handling the reset

        try:
            # Get the reset day (default to 1st of month)
            reset_day = int(await self.redis.get("budget:reset_day") or "1")

            # Get last reset timestamp
            last_reset = float(await self.redis.get("budget:last_reset") or "0")

            # Check if we're in a new month since the last reset
            last_reset_date = datetime.fromtimestamp(last_reset)
            now_date = datetime.now()

            # If it's a new month and we've passed the reset day
            if (now_date.year > last_reset_date.year or
                    (now_date.year == last_reset_date.year and
                     now_date.month > last_reset_date.month)) and \
                    now_date.day >= reset_day:

                # Archive the previous month's usage BEFORE clearing it;
                # archiving after the reset would always store "0.0".
                prev_month = last_reset_date.strftime("%Y-%m")
                prev_usage = await self.redis.get("budget:monthly:used") or "0.0"
                await self.redis.set(f"budget:archive:{prev_month}", prev_usage)

                # Reset monthly usage
                await self.redis.set("budget:monthly:used", "0.0")

                # Update last reset timestamp
                await self.redis.set("budget:last_reset", str(now))

                # Log the reset
                logger.info("Monthly budget reset performed")
        finally:
            # Release the lock (only reached if we acquired it)
            await self.redis.delete(self.budget_lock_key)

    async def record_usage(self, cost: float, provider: str, model: str):
        """Record API usage cost."""
        if cost <= 0:
            return

        # Only track costs for OpenAI
        if provider != "openai":
            return

        # Check if we need to reset first
        await self._check_budget_reset()

        # Update monthly usage
        await self.redis.incrbyfloat("budget:monthly:used", cost)

        # Update model-specific usage
        await self.redis.incrbyfloat(f"budget:model:{model}", cost)

        # Update daily usage
        today = datetime.now().strftime("%Y-%m-%d")
        await self.redis.incrbyfloat(f"budget:daily:{today}", cost)

        # Log high-cost operations
        if cost > 0.1:  # Log individual requests that cost more than 10 cents
            logger.info(f"High-cost API request: ${cost:.4f} for {provider}:{model}")

        # Check if we've exceeded the alert threshold
        usage = float(await self.redis.get("budget:monthly:used") or "0")
        budget = float(await self.redis.get("budget:monthly:total") or "0")

        if budget > 0 and usage >= budget * self.alert_threshold:
            # Check if we've already alerted for this threshold
            alert_key = f"budget:alerted:{int(self.alert_threshold * 100)}"
            alerted = await self.redis.get(alert_key)

            if not alerted:
                percentage = (usage / budget) * 100
                logger.warning(f"Budget alert: Used ${usage:.2f} of ${budget:.2f} ({percentage:.1f}%)")

                # Mark as alerted for this threshold
                await self.redis.set(
                    alert_key, "1",
                    expire=86400  # Expire after 1 day
                )

    async def check_budget_available(self, estimated_cost: float) -> bool:
        """
        Check if there's enough budget for an estimated operation.
        Returns True if operation is allowed, False if it would exceed budget.
        """
        if estimated_cost <= 0:
            return True

        if self.monthly_budget <= 0:
            return True  # No budget constraints

        # Get current usage
        usage = float(await self.redis.get("budget:monthly:used") or "0")
        budget = float(await self.redis.get("budget:monthly:total") or "0")

        # Check if operation would exceed budget
        return (usage + estimated_cost) <= budget

    async def get_usage_stats(self) -> Dict[str, Any]:
        """Get current budget usage statistics."""
        usage = float(await self.redis.get("budget:monthly:used") or "0")
        budget = float(await self.redis.get("budget:monthly:total") or "0")

        # Get daily usage for the last 30 days
        daily_usage = {}
        today = datetime.now()

        for i in range(30):
            date = (today - timedelta(days=i)).strftime("%Y-%m-%d")
            day_usage = float(await self.redis.get(f"budget:daily:{date}") or "0")
            daily_usage[date] = day_usage

        # Get usage by model
        model_keys = await self.redis.keys("budget:model:*")
        model_usage = {}

        for key in model_keys:
            model = key.decode('utf-8').replace("budget:model:", "")
            model_cost = float(await self.redis.get(key) or "0")
            model_usage[model] = model_cost

        # Calculate percentage used
        percentage_used = (usage / budget) * 100 if budget > 0 else 0

        return {
            "current_usage": usage,
            "monthly_budget": budget,
            "percentage_used": percentage_used,
            "daily_usage": daily_usage,
            "model_usage": model_usage,
            "remaining_budget": budget - usage if budget > 0 else 0
        }
```
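The month-rollover test inside `_check_budget_reset` is easier to reason about (and to unit test without Redis) when isolated as a pure function. The sketch below is only an illustration: `needs_monthly_reset` is a hypothetical helper name that does not appear in the service, and it assumes the same rule as the code above (a later calendar month, and the reset day reached).

```python
from datetime import datetime


def needs_monthly_reset(last_reset: datetime, now: datetime, reset_day: int = 1) -> bool:
    """Hypothetical helper: True when `now` is in a later calendar month
    than `last_reset` and the configured reset day has been reached."""
    # Tuple comparison covers both the year rollover and the month rollover.
    in_later_month = (now.year, now.month) > (last_reset.year, last_reset.month)
    return in_later_month and now.day >= reset_day


if __name__ == "__main__":
    # January usage, checked on February 1st with the default reset day
    print(needs_monthly_reset(datetime(2024, 1, 15), datetime(2024, 2, 1)))
```

Comparing `(year, month)` tuples keeps the December-to-January case correct without a separate branch.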
Conclusion
The optimization and deployment strategies outlined in this document provide a comprehensive framework for implementing an efficient, cost-effective, and highly accurate hybrid AI system that leverages both OpenAI's cloud capabilities and Ollama's local inference.
Key aspects of this implementation include:
- Performance Optimization:
  - Query routing optimization based on complexity analysis
  - Semantic response caching for frequent queries
  - Parallel processing for complex queries
  - Dynamic batching for high-load scenarios
  - Model-specific prompt optimization
- Cost Reduction:
  - Intelligent token usage optimization
  - Tiered model selection based on task requirements
  - Local model prioritization for development
  - Request batching and rate limiting
  - Memory and context compression
- Response Accuracy:
  - Advanced prompt templating for different scenarios
  - Chain-of-thought reasoning for complex queries
  - Self-verification and error correction
  - Domain-specific knowledge integration
  - Dynamic few-shot learning with examples
- Deployment Options:
  - Local development environment with Docker Compose
  - Production Kubernetes deployment with autoscaling
  - AWS cloud deployment with CloudFormation
  - Comprehensive monitoring with Prometheus and Grafana
  - Budget tracking and cost optimization
These strategies work in concert to create a system that intelligently balances the tradeoffs between performance, cost, and accuracy, adapting to specific requirements and constraints in different deployment scenarios.
By implementing this hybrid approach, organizations can significantly reduce API costs while maintaining high quality responses, with the added benefits of enhanced privacy for sensitive data and reduced dependency on external services. The local inference capabilities also provide resilience against API outages and rate limiting, ensuring consistent service availability.
MCP (Modern Computational Paradigm) System
Comprehensive Documentation
This documentation provides a complete guide to understanding, installing, configuring, and using the MCP system - a hybrid architecture that integrates OpenAI's API capabilities with Ollama's local inference to create an optimized, cost-effective AI solution.
Table of Contents
- Introduction
- System Architecture
- Installation Guide
- Configuration
- API Reference
- Usage Examples
- Performance Optimization
- Cost Optimization
- Monitoring and Observability
- Troubleshooting
- Contributing
- License
README.md
# MCP - Modern Computational Paradigm

MCP is a hybrid AI system that intelligently integrates OpenAI's cloud capabilities with Ollama's local inference. This architecture optimizes for cost, performance, and privacy while maintaining response quality.

## Key Features

- **Intelligent Query Routing**: Automatically selects between OpenAI and Ollama based on query complexity, privacy requirements, and performance needs
- **Advanced Agent Framework**: Configurable AI agents with specialized capabilities
- **Cost Optimization**: Reduces API costs by up to 70% through local model usage, caching, and token optimization
- **Privacy Control**: Keeps sensitive information local when appropriate
- **Performance Optimization**: Parallel processing, response caching, and dynamic batching for high throughput
- **Comprehensive Monitoring**: Built-in metrics and observability

## Quick Start

### Prerequisites

- Python 3.11+
- Docker and Docker Compose (for containerized deployment)
- Ollama (for local model inference)
- OpenAI API key

### Installation

1. Clone the repository:
   ```bash
   git clone https://github.com/yourusername/mcp.git
   cd mcp
   ```
2. Create and activate a virtual environment:
   ```bash
   python -m venv venv
   source venv/bin/activate  # On Windows: venv\Scripts\activate
   ```
3. Install dependencies:
   ```bash
   pip install -r requirements.txt
   ```
4. Set up environment variables:
   ```bash
   cp .env.example .env
   # Edit .env with your configuration
   ```
5. Start Ollama (if not already running):
   ```bash
   ollama serve
   ```
6. Start the application:
   ```bash
   uvicorn app.main:app --reload
   ```
The API will be available at http://localhost:8000.
Docker Deployment
For containerized deployment:
```bash
docker-compose up -d
```
Documentation
For complete documentation, see:
Architecture
MCP uses a sophisticated routing architecture to determine the optimal inference provider for each request:
```text
┌─────────────────┐     ┌──────────────────┐     ┌─────────────┐
│                 │     │                  │     │             │
│ Client Request  │────▶│ Routing Decision │────▶│ OpenAI API  │
│                 │     │                  │     │             │
└─────────────────┘     └──────────────────┘     └─────────────┘
                                 │
                                 │
                                 ▼
                          ┌─────────────┐
                          │             │
                          │ Ollama API  │
                          │             │
                          └─────────────┘
```
License
MIT License - see LICENSE for details.
Contributing
Contributions are welcome! Please see CONTRIBUTING.md for details.
---

# Installation Guide

## Prerequisites

Before installing the MCP system, ensure your environment meets the following requirements:

### System Requirements

- **Operating System**: Linux (recommended), macOS, or Windows
- **CPU**: 4+ cores recommended
- **RAM**: Minimum 8GB, 16GB+ recommended
- **Disk Space**: 10GB minimum for installation, 50GB+ recommended for model storage
- **GPU**: Optional but recommended for Ollama (NVIDIA with CUDA support)

### Software Requirements

- **Python**: Version 3.11 or higher
- **Docker**: Version 20.10 or higher (for containerized deployment)
- **Docker Compose**: Version 2.0 or higher
- **Kubernetes**: Version 1.21+ (for Kubernetes deployment)
- **Ollama**: Latest version (for local model inference)
- **Redis**: Version 6.0+ (for caching and rate limiting)

### Required API Keys

- **OpenAI API Key**: Register at [OpenAI Platform](https://platform.openai.com/)

## Local Development Setup

Follow these steps to set up a local development environment:

### 1. Clone the Repository

```bash
git clone https://github.com/yourusername/mcp.git
cd mcp
```
2. Set Up Virtual Environment
```bash
# Create virtual environment
python -m venv venv

# Activate virtual environment
# On Linux/macOS:
source venv/bin/activate
# On Windows:
venv\Scripts\activate
```
3. Install Dependencies
```bash
pip install --upgrade pip
pip install -r requirements.txt
pip install -r requirements-dev.txt  # For development tools
```
4. Install and Configure Ollama
```bash
# macOS (using Homebrew)
brew install ollama

# Linux
curl -fsSL https://ollama.com/install.sh | sh

# Start Ollama service
ollama serve
```
5. Pull Required Models
```bash
# Pull basic models
ollama pull llama2
ollama pull mistral
ollama pull codellama
```
6. Set Up Environment Variables
```bash
# Copy the example environment file
cp .env.example .env

# Edit the file with your configuration
# At minimum, set OPENAI_API_KEY
nano .env
```
7. Initialize Local Services
```bash
# Start Redis using Docker
docker-compose up -d redis

# Initialize database (if applicable)
python scripts/init_db.py
```
8. Start Development Server
```bash
# Start with auto-reload for development
uvicorn app.main:app --reload --port 8000
```
9. Verify Installation
Open your browser and navigate to:
- API documentation: http://localhost:8000/docs
- Health check: http://localhost:8000/api/v1/health
Docker Deployment
For a containerized deployment using Docker Compose:
1. Ensure Docker and Docker Compose are Installed
```bash
# Verify installation
docker --version
docker-compose --version
```
2. Configure Environment Variables
```bash
# Copy and edit environment variables
cp .env.example .env
nano .env
```
3. Start Services with Docker Compose
```bash
# Build and start all services
docker-compose up -d

# View logs
docker-compose logs -f
```
The application will be available at http://localhost:8000.
4. Stopping the Services
```bash
docker-compose down
```
Kubernetes Deployment
For production deployment on Kubernetes:
1. Prerequisites
- Kubernetes cluster
- kubectl configured
- Helm (optional, for Redis deployment)
2. Set Up Namespace and Secrets
```bash
# Create namespace
kubectl create namespace mcp

# Create secrets
kubectl create secret generic mcp-secrets \
  --from-literal=openai-api-key=YOUR_OPENAI_API_KEY \
  --from-literal=redis-password=YOUR_REDIS_PASSWORD \
  -n mcp
```
3. Deploy Redis (if needed)
```bash
# Using Helm
helm repo add bitnami https://charts.bitnami.com/bitnami
helm install redis bitnami/redis \
  --namespace mcp \
  --set auth.password=YOUR_REDIS_PASSWORD \
  --set master.persistence.size=8Gi
```
4. Deploy MCP Components
```bash
# Apply Kubernetes manifests
kubectl apply -f kubernetes/deployment.yaml -n mcp
kubectl apply -f kubernetes/service.yaml -n mcp
kubectl apply -f kubernetes/ingress.yaml -n mcp
```
5. Set Up Autoscaling (Optional)
```bash
kubectl apply -f kubernetes/hpa.yaml -n mcp
```
6. Check Deployment Status
```bash
kubectl get pods -n mcp
kubectl get services -n mcp
kubectl get ingress -n mcp
```
AWS Deployment
For deployment on AWS Cloud:
1. Prerequisites
- AWS CLI configured
- Appropriate IAM permissions
2. CloudFormation Deployment
```bash
# Deploy using CloudFormation template
aws cloudformation create-stack \
  --stack-name mcp-hybrid-system \
  --template-body file://aws/cloudformation.yaml \
  --capabilities CAPABILITY_IAM \
  --parameters \
    ParameterKey=Environment,ParameterValue=Production \
    ParameterKey=OllamaInstanceType,ParameterValue=g4dn.xlarge

# Check deployment status
aws cloudformation describe-stacks --stack-name mcp-hybrid-system
```
3. Deploy API Image to ECR
```bash
# Log in to ECR
aws ecr get-login-password | docker login --username AWS --password-stdin YOUR_AWS_ACCOUNT_ID.dkr.ecr.YOUR_REGION.amazonaws.com

# Build and push image
docker build -t YOUR_AWS_ACCOUNT_ID.dkr.ecr.YOUR_REGION.amazonaws.com/mcp-api:latest -f Dockerfile.prod .
docker push YOUR_AWS_ACCOUNT_ID.dkr.ecr.YOUR_REGION.amazonaws.com/mcp-api:latest
```
4. Update ECS Service
```bash
# Force new deployment to use the updated image
aws ecs update-service --cluster mcp-hybrid-system-cluster --service mcp-hybrid-system-api --force-new-deployment
```
API Reference
Authentication
The MCP API uses API key authentication. Include your API key in all requests using either:
Bearer Token Authentication
Authorization: Bearer YOUR_API_KEY
Query Parameter
?api_key=YOUR_API_KEY
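On the server side, supporting both schemes comes down to checking two places for the key. The following is a minimal sketch, not the MCP implementation: `extract_api_key` is a hypothetical helper, the header-before-query precedence is an assumption, and the headers are modelled as a plain dict (real frameworks usually provide case-insensitive header lookups).

```python
from typing import Mapping, Optional


def extract_api_key(headers: Mapping[str, str], query: Mapping[str, str]) -> Optional[str]:
    """Hypothetical helper: return the key from a Bearer header if present,
    otherwise from the ?api_key= query parameter, otherwise None."""
    auth = headers.get("Authorization", "")
    scheme, _, token = auth.partition(" ")
    if scheme.lower() == "bearer" and token.strip():
        return token.strip()
    # Assumed fallback order: the query parameter is only consulted
    # when no usable Bearer token was supplied.
    return query.get("api_key") or None


if __name__ == "__main__":
    print(extract_api_key({"Authorization": "Bearer YOUR_API_KEY"}, {}))
    print(extract_api_key({}, {"api_key": "YOUR_API_KEY"}))
```

Keys passed as query parameters tend to end up in access logs and browser history, so the header form is usually the safer default.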
Chat Endpoints
Create Chat Completion
Generates a completion for a given conversation.
Endpoint: POST /api/v1/chat/completions
Request Body:
```json
{
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello, who are you?"}
  ],
  "model": "auto",
  "temperature": 0.7,
  "max_tokens": 1024,
  "stream": false,
  "routing_preferences": {
    "force_provider": null,
    "privacy_level": "standard",
    "latency_preference": "balanced"
  },
  "tools": []
}
```
Parameters:
| Name | Type | Description |
|------|------|-------------|
| messages | array | Array of message objects representing the conversation history |
| model | string | The model to use, or "auto" for automatic selection |
| temperature | number | Controls randomness (0-1) |
| max_tokens | integer | Maximum tokens in response |
| stream | boolean | Whether to stream the response |
| routing_preferences | object | Preferences for provider selection |
| tools | array | List of tools the assistant can use |
Response:
```json
{
  "id": "resp_abc123",
  "object": "chat.completion",
  "created": 1677858242,
  "provider": "openai",
  "model": "gpt-4o",
  "usage": {
    "prompt_tokens": 56,
    "completion_tokens": 325,
    "total_tokens": 381
  },
  "message": {
    "role": "assistant",
    "content": "Hello! I'm an AI assistant...",
    "tool_calls": []
  },
  "routing_metrics": {
    "complexity_score": 0.78,
    "privacy_impact": "low",
    "decision_factors": ["complexity", "tool_requirements"]
  }
}
```
Stream Chat Completion
Stream a completion for a conversation.
Endpoint: POST /api/v1/chat/streaming
Request Body: Same as /api/v1/chat/completions but stream must be true.
Response: Server-sent events (SSE) stream of partial completions.
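A client consumes such a stream by reading it line by line and decoding each `data:` field. The sketch below is a minimal, hypothetical parser: the document does not specify the event payload, so the one-JSON-object-per-`data:`-line layout, the `delta` field, and the `[DONE]` end marker (borrowed from OpenAI's streaming convention) are all assumptions.

```python
import json
from typing import Any, Dict, Iterable, Iterator


def iter_sse_data(lines: Iterable[str]) -> Iterator[Dict[str, Any]]:
    """Yield the decoded JSON payload of each `data:` line in an SSE stream."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators, comments, and other SSE fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":  # assumed end-of-stream marker
            break
        yield json.loads(payload)


if __name__ == "__main__":
    # Hypothetical stream contents, for illustration only
    sample = ['data: {"delta": "Hel"}', "", 'data: {"delta": "lo"}', "", "data: [DONE]"]
    print("".join(event["delta"] for event in iter_sse_data(sample)))
```

With `requests`, the same generator can be fed from `response.iter_lines(decode_unicode=True)` on a request made with `stream=True`.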
Hybrid Chat
Intelligent routing between OpenAI and Ollama based on query characteristics.
Endpoint: POST /api/v1/chat/hybrid
Request Body:
```json
{
  "messages": [
    {"role": "user", "content": "Explain quantum computing"}
  ],
  "mode": "auto",
  "options": {
    "prioritize_privacy": false,
    "prioritize_speed": false
  }
}
```
Response: Same format as /api/v1/chat/completions.
Agent Endpoints
Run Agent
Execute an agent with specific configuration.
Endpoint: POST /api/v1/agents/run
Request Body:
```json
{
  "agent_config": {
    "instructions": "You are a research assistant...",
    "model": "gpt-4o",
    "tools": [
      {
        "type": "function",
        "function": {
          "name": "search_knowledge_base",
          "description": "Search for information",
          "parameters": {
            "type": "object",
            "properties": {
              "query": {
                "type": "string"
              }
            },
            "required": ["query"]
          }
        }
      }
    ]
  },
  "messages": [
    {"role": "user", "content": "Find information about renewable energy"}
  ],
  "metadata": {
    "session_id": "user_session_123"
  }
}
```
Response:
```json
{
  "run_id": "run_abc123",
  "status": "in_progress",
  "created_at": 1677858242,
  "estimated_completion_time": 1677858260,
  "polling_url": "/api/v1/agents/status/run_abc123"
}
```
Get Agent Status
Check the status of a running agent.
Endpoint: GET /api/v1/agents/status/{run_id}
Response:
```json
{
  "run_id": "run_abc123",
  "status": "completed",
  "result": {
    "output": "Renewable energy comes from sources that are...",
    "tool_calls": []
  },
  "created_at": 1677858242,
  "completed_at": 1677858260
}
```
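Because a run starts as `in_progress` and is later reported as `completed`, clients need a polling loop, and that loop should give up eventually rather than spin forever. The sketch below shows one way to bound it. `wait_for_run` is a hypothetical helper; the status fetch is injected as a callable so the logic stays independent of any HTTP library, and treating `failed` as a terminal status is an assumption based on the usage examples in this guide.

```python
import time
from typing import Any, Callable, Dict


def wait_for_run(fetch_status: Callable[[], Dict[str, Any]],
                 timeout: float = 60.0,
                 interval: float = 1.0,
                 sleep: Callable[[float], None] = time.sleep) -> Dict[str, Any]:
    """Poll `fetch_status` until the run reaches a terminal status or `timeout` elapses."""
    deadline = time.monotonic() + timeout
    while True:
        status = fetch_status()
        if status.get("status") in ("completed", "failed"):
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError("agent run did not finish before the timeout")
        sleep(interval)


if __name__ == "__main__":
    # Simulated responses standing in for GET /api/v1/agents/status/{run_id}
    responses = iter([{"status": "in_progress"}, {"status": "completed", "result": {"output": "done"}}])
    final = wait_for_run(lambda: next(responses), sleep=lambda _: None)
    print(final["status"])
```

In real use, `fetch_status` would wrap the HTTP GET against the `polling_url` returned when the run was created.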
List Available Agents
List all available agent configurations.
Endpoint: GET /api/v1/agents
Response:
```json
{
  "agents": [
    {
      "id": "research",
      "name": "Research Assistant",
      "description": "Specialized in finding and synthesizing information"
    },
    {
      "id": "coding",
      "name": "Code Assistant",
      "description": "Helps with programming tasks"
    }
  ]
}
```
Model Management Endpoints
List Models
List all available models.
Endpoint: GET /api/v1/models
Response:
```json
{
  "openai_models": [
    {
      "id": "gpt-4o",
      "name": "GPT-4o",
      "capabilities": ["general", "code", "reasoning"],
      "context_window": 128000
    },
    {
      "id": "gpt-3.5-turbo",
      "name": "GPT-3.5 Turbo",
      "capabilities": ["general"],
      "context_window": 16000
    }
  ],
  "ollama_models": [
    {
      "id": "llama2",
      "name": "Llama 2",
      "capabilities": ["general"],
      "context_window": 4096
    },
    {
      "id": "mistral",
      "name": "Mistral",
      "capabilities": ["general", "reasoning"],
      "context_window": 8192
    }
  ]
}
```
Get Model Details
Get detailed information about a specific model.
Endpoint: GET /api/v1/models/{model_id}
Response:
```json
{
  "id": "mistral",
  "name": "Mistral",
  "provider": "ollama",
  "capabilities": ["general", "reasoning"],
  "context_window": 8192,
  "recommended_usage": "General purpose tasks with reasoning requirements",
  "performance_characteristics": {
    "average_response_time": 2.4,
    "tokens_per_second": 45
  }
}
```
Pull Ollama Model
Pull a new model for Ollama.
Endpoint: POST /api/v1/models/ollama/pull
Request Body:
```json
{
  "model": "wizard-math"
}
```
Response:
```json
{
  "status": "pulling",
  "model": "wizard-math",
  "estimated_time": 120
}
```
System Endpoints
Health Check
Check system health.
Endpoint: GET /api/v1/health
Response:
```json
{
  "status": "ok",
  "version": "1.0.0",
  "providers": {
    "openai": "connected",
    "ollama": "connected"
  },
  "uptime": 3600
}
```
System Configuration
Get current system configuration.
Endpoint: GET /api/v1/config
Response:
```json
{
  "routing": {
    "complexity_threshold": 0.65,
    "privacy_sensitive_patterns": ["password", "secret", "key"],
    "default_provider": "auto"
  },
  "caching": {
    "enabled": true,
    "ttl": 3600
  },
  "optimization": {
    "token_optimization": true,
    "parallel_processing": true
  },
  "monitoring": {
    "metrics_collection": true,
    "log_level": "info"
  }
}
```
Update Configuration
Update system configuration.
Endpoint: POST /api/v1/config
Request Body:
```json
{
  "routing": {
    "complexity_threshold": 0.7
  },
  "caching": {
    "ttl": 7200
  }
}
```
Response:
```json
{
  "status": "updated",
  "updated_fields": ["routing.complexity_threshold", "caching.ttl"]
}
```
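The request carries only the fields to change, and the response names them as dotted paths, which suggests a recursive merge of the patch into the existing configuration. The following is a sketch of that idea under stated assumptions: `apply_config_update` is a hypothetical function, and the real endpoint may validate or persist values differently.

```python
from typing import Any, Dict, List


def apply_config_update(current: Dict[str, Any], patch: Dict[str, Any], prefix: str = "") -> List[str]:
    """Merge `patch` into `current` in place and return the dotted paths that were set."""
    updated: List[str] = []
    for key, value in patch.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict) and isinstance(current.get(key), dict):
            # Recurse so sibling settings in the same section are preserved
            updated.extend(apply_config_update(current[key], value, prefix=f"{path}."))
        else:
            current[key] = value
            updated.append(path)
    return updated


if __name__ == "__main__":
    config = {
        "routing": {"complexity_threshold": 0.65, "default_provider": "auto"},
        "caching": {"enabled": True, "ttl": 3600},
    }
    patch = {"routing": {"complexity_threshold": 0.7}, "caching": {"ttl": 7200}}
    print(apply_config_update(config, patch))
```

Merging section by section, instead of replacing whole sections, is what keeps `default_provider` and `enabled` intact in this example.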
System Metrics
Get system performance metrics.
Endpoint: GET /api/v1/metrics
Response:
```json
{
  "requests": {
    "total": 15420,
    "last_minute": 42,
    "last_hour": 1254
  },
  "routing": {
    "openai_requests": 6210,
    "ollama_requests": 9210,
    "auto_routing_accuracy": 0.94
  },
  "performance": {
    "average_response_time": 2.3,
    "p95_response_time": 6.1,
    "cache_hit_rate": 0.37
  },
  "cost": {
    "total_openai_cost": 135.42,
    "estimated_savings": 98.67,
    "cost_per_request": 0.0088
  }
}
```
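The sample figures are internally consistent, which gives a hint about how some fields relate. The short calculation below reproduces `cost_per_request` from the other values; that it is the OpenAI cost averaged over all requests (including those served by Ollama) is an inference from the numbers, not something the document states. How `estimated_savings` is derived is not specified, so it is left out.

```python
# Values copied from the sample /api/v1/metrics response
metrics = {
    "requests": {"total": 15420},
    "routing": {"openai_requests": 6210, "ollama_requests": 9210},
    "cost": {"total_openai_cost": 135.42, "cost_per_request": 0.0088},
}

total = metrics["requests"]["total"]

# Assumed definition: OpenAI spend spread across every request served
cost_per_request = metrics["cost"]["total_openai_cost"] / total

# Fraction of traffic handled locally by Ollama
ollama_share = metrics["routing"]["ollama_requests"] / total

print(f"cost per request: {cost_per_request:.4f}")
print(f"ollama share: {ollama_share:.1%}")
```

Averaged over only the 6,210 OpenAI requests, the figure would be roughly 0.0218 instead, so it matters which denominator a dashboard uses.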
Configuration
Environment Variables
The MCP system can be configured using the following environment variables:
Core Configuration
| Variable | Description | Default Value |
|----------|-------------|---------------|
| OPENAI_API_KEY | OpenAI API Key | (Required) |
| OPENAI_ORG_ID | OpenAI Organization ID | (Optional) |
| OPENAI_MODEL | Default OpenAI model | gpt-4o |
| OLLAMA_HOST | Ollama host URL | http://localhost:11434 |
| OLLAMA_MODEL | Default Ollama model | llama2 |
| APP_ENV | Environment (development, staging, production) | development |
| LOG_LEVEL | Logging level | INFO |
| PORT | API server port | 8000 |
Redis Configuration
| Variable | Description | Default Value |
|----------|-------------|---------------|
| REDIS_URL | Redis connection URL | redis://localhost:6379/0 |
| REDIS_PASSWORD | Redis password | (Optional) |
| ENABLE_CACHING | Enable response caching | true |
| CACHE_TTL | Cache TTL in seconds | 3600 |
Routing Configuration
| Variable | Description | Default Value |
|----------|-------------|---------------|
| COMPLEXITY_THRESHOLD | Threshold for routing to OpenAI | 0.65 |
| PRIVACY_SENSITIVE_TOKENS | Comma-separated list of privacy-sensitive tokens | password,secret,key |
| DEFAULT_PROVIDER | Default provider if not specified | auto |
| FORCE_OLLAMA | Force using Ollama for all requests | false |
| FORCE_OPENAI | Force using OpenAI for all requests | false |
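How these settings might combine into a single routing decision can be sketched as follows. This is not the MCP router: `choose_provider` is a hypothetical function, the precedence order (force flags, then privacy tokens, then complexity) is an assumption, and the complexity score is taken as an input in the range 0 to 1.

```python
from typing import Sequence


def choose_provider(query: str,
                    complexity: float,
                    *,
                    complexity_threshold: float = 0.65,
                    privacy_tokens: Sequence[str] = ("password", "secret", "key"),
                    force_ollama: bool = False,
                    force_openai: bool = False) -> str:
    """Return "ollama" or "openai" for a query (assumed precedence, see lead-in)."""
    if force_ollama:
        return "ollama"
    if force_openai:
        return "openai"
    lowered = query.lower()
    # Privacy-sensitive content stays on local infrastructure
    if any(token in lowered for token in privacy_tokens):
        return "ollama"
    # Queries at or above the threshold go to the cloud model
    return "openai" if complexity >= complexity_threshold else "ollama"


if __name__ == "__main__":
    print(choose_provider("What is the capital of France?", complexity=0.2))
    print(choose_provider("Prove this theorem step by step", complexity=0.9))
    print(choose_provider("Rotate my database password", complexity=0.9))
```

Plain substring matching is deliberately simple here and will also match words such as "keyboard"; a production router would likely need word-boundary or pattern-based matching.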
Performance Configuration
| Variable | Description | Default Value |
|----------|-------------|---------------|
| ENABLE_PARALLEL_PROCESSING | Enable parallel processing for complex queries | true |
| MAX_PARALLEL_REQUESTS | Maximum number of parallel requests | 4 |
| ENABLE_BATCHING | Enable request batching | true |
| MAX_BATCH_SIZE | Maximum batch size | 5 |
| REQUEST_TIMEOUT | Request timeout in seconds | 120 |
Cost Optimization
| Variable | Description | Default Value |
|----------|-------------|---------------|
| MONTHLY_BUDGET | Monthly budget cap for OpenAI usage (USD) | 0 (no limit) |
| ENABLE_TOKEN_OPTIMIZATION | Enable token usage optimization | true |
| TOKEN_BUDGET | Token budget per request | 0 (no limit) |
| DEV_MODE_TOKEN_LIMIT | Token limit in development mode | 1000 |
Monitoring
| Variable | Description | Default Value |
|----------|-------------|---------------|
| ENABLE_METRICS | Enable metrics collection | true |
| METRICS_PORT | Prometheus metrics port | 9090 |
| ENABLE_TRACING | Enable distributed tracing | false |
| SENTRY_DSN | Sentry DSN for error tracking | (Optional) |
Advanced Configuration
Configuration File
For more advanced configuration, create a YAML configuration file at config/config.yaml:
```yaml
routing:
  # Complexity assessment weights
  complexity_weights:
    length: 0.3
    specialized_terms: 0.4
    sentence_structure: 0.3

  # Ollama model routing
  ollama_routing:
    code_generation: "codellama"
    mathematical: "wizard-math"
    creative: "dolphin-mistral"
    general: "mistral"

  # OpenAI model routing
  openai_routing:
    complex_reasoning: "gpt-4o"
    general: "gpt-3.5-turbo"

caching:
  # Semantic caching configuration
  semantic:
    enabled: true
    similarity_threshold: 0.92
    max_cached_items: 1000

  # Exact match caching
  exact:
    enabled: true
    max_cached_items: 500

optimization:
  # Chain of thought settings
  chain_of_thought:
    enabled: true
    task_types: ["reasoning", "math", "decision"]

  # Response verification
  verification:
    enabled: true
    high_risk_categories: ["financial", "legal", "medical"]

monitoring:
  # Logging configuration
  logging:
    format: "json"
    include_request_body: false
    mask_sensitive_data: true

  # Alert thresholds
  alerts:
    high_latency_threshold: 5.0  # seconds
    error_rate_threshold: 0.05  # 5%
    budget_warning_threshold: 0.8  # 80% of budget
```
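The `complexity_weights` entries sum to 1.0, which is consistent with a weighted average of per-feature scores. A minimal sketch of that interpretation follows; `complexity_score` is a hypothetical function, and it assumes each feature has already been normalized to the range 0 to 1 by some upstream analysis that is not shown here.

```python
from typing import Mapping

# Weights copied from the complexity_weights section of config/config.yaml
COMPLEXITY_WEIGHTS = {
    "length": 0.3,
    "specialized_terms": 0.4,
    "sentence_structure": 0.3,
}


def complexity_score(features: Mapping[str, float],
                     weights: Mapping[str, float] = COMPLEXITY_WEIGHTS) -> float:
    """Weighted sum of normalized feature scores; missing features count as 0."""
    return sum(weight * features.get(name, 0.0) for name, weight in weights.items())


if __name__ == "__main__":
    # Hypothetical feature values for a short but jargon-heavy query
    features = {"length": 0.5, "specialized_terms": 1.0, "sentence_structure": 0.5}
    score = complexity_score(features)
    print(f"score={score:.2f}, routes to OpenAI: {score >= 0.65}")
```

The resulting score is what would be compared against `COMPLEXITY_THRESHOLD` (0.65 by default) when choosing a provider.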
Custom Provider Configuration
To configure additional inference providers, add a providers.yaml file:
```yaml
providers:
  - name: azure-openai
    type: openai-compatible
    base_url: https://your-deployment.openai.azure.com
    api_key_env: AZURE_OPENAI_API_KEY
    models:
      - id: gpt-4
        deployment_id: your-gpt4-deployment
      - id: gpt-35-turbo
        deployment_id: your-gpt35-deployment

  - name: local-inference
    type: ollama-compatible
    base_url: http://localhost:8080
    models:
      - id: local-model
        capabilities: ["general"]
```
Model Selection
Model Tiers
MCP uses a tiered approach to model selection:
| Tier | OpenAI Models | Ollama Models | Use Cases |
|------|---------------|---------------|-----------|
| High | gpt-4o, gpt-4 | llama2:70b, codellama:34b | Complex reasoning, creative tasks, code generation |
| Medium | gpt-3.5-turbo | mistral, codellama | General purpose, standard code tasks |
| Low | gpt-3.5-turbo | llama2, phi | Simple queries, development testing |
Task-Specific Model Mapping
MCP maps specific task types to appropriate models:
| Task Type | High Tier | Medium Tier | Low Tier |
|-----------|-----------|-------------|----------|
| Code Generation | gpt-4o | codellama | codellama |
| Creative Writing | gpt-4o | mistral | mistral |
| Mathematical | gpt-4o | gpt-3.5-turbo | wizard-math |
| General Knowledge | gpt-3.5-turbo | mistral | llama2 |
| Summarization | gpt-3.5-turbo | mistral | llama2 |
To override the automatic model selection, specify the model explicitly in your request using the provider:model form. For example, to force OpenAI's GPT-4o:
```json
{
  "model": "openai:gpt-4o"
}
```
Or, to force Mistral on Ollama:
```json
{
  "model": "ollama:mistral"
}
```
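Splitting such a value needs some care, because Ollama model tags can themselves contain a colon (for example `llama2:70b`). The sketch below shows one way to handle that. `parse_model_spec` is a hypothetical helper, and returning `None` for the provider when the string has no recognized prefix (leaving the choice to the router) is an assumption.

```python
from typing import Optional, Tuple

KNOWN_PROVIDERS = ("openai", "ollama")


def parse_model_spec(spec: str) -> Tuple[Optional[str], Optional[str]]:
    """Split a "provider:model" string into (provider, model).

    "auto" yields (None, None); a string without a known provider prefix
    is treated as a bare model name with no forced provider.
    """
    if spec == "auto":
        return None, None
    provider, sep, model = spec.partition(":")
    if sep and provider in KNOWN_PROVIDERS and model:
        return provider, model
    return None, spec


if __name__ == "__main__":
    for spec in ("openai:gpt-4o", "ollama:llama2:70b", "llama2:70b", "auto"):
        print(spec, "->", parse_model_spec(spec))
```

Only the first colon is treated as the provider separator, so a tag suffix such as `:70b` stays attached to the model name.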
Usage Examples
Basic Chat Interaction
Python Example
```python
import requests
import json

API_URL = "http://localhost:8000/api/v1"
API_KEY = "your_api_key_here"

headers = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {API_KEY}"
}

# Basic chat completion
def chat(message, history=None):
    history = history or []
    history.append({"role": "user", "content": message})

    response = requests.post(
        f"{API_URL}/chat/completions",
        headers=headers,
        json={
            "messages": history,
            "model": "auto",  # Let the system decide
            "temperature": 0.7
        }
    )

    if response.status_code == 200:
        result = response.json()
        assistant_message = result["message"]["content"]
        history.append({"role": "assistant", "content": assistant_message})

        print(f"Model used: {result['model']} via {result['provider']}")
        return assistant_message, history
    else:
        print(f"Error: {response.status_code}")
        print(response.text)
        return None, history

# Example conversation
history = []
response, history = chat("Hello! What can you tell me about artificial intelligence?", history)
print(f"Assistant: {response}\n")

response, history = chat("What are some practical applications?", history)
print(f"Assistant: {response}")
```
cURL Example
```bash
# Simple completion
curl -X POST http://localhost:8000/api/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer your_api_key_here" \
  -d '{
    "messages": [
      {"role": "user", "content": "Explain how photosynthesis works"}
    ],
    "model": "auto",
    "temperature": 0.7
  }'

# Streaming response
curl -X POST http://localhost:8000/api/v1/chat/streaming \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer your_api_key_here" \
  -d '{
    "messages": [
      {"role": "user", "content": "Write a short poem about robots"}
    ],
    "model": "auto",
    "stream": true
  }'
```
Working with Agents
Python Example
```python
import requests
import json
import time

API_URL = "http://localhost:8000/api/v1"
API_KEY = "your_api_key_here"

headers = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {API_KEY}"
}

# Run an agent with tools
def run_research_agent(query):
    # Define agent configuration with tools
    agent_config = {
        "instructions": "You are a research assistant specialized in finding information.",
        "model": "gpt-4o",
        "tools": [
            {
                "type": "function",
                "function": {
                    "name": "search_web",
                    "description": "Search the web for information",
                    "parameters": {
                        "type": "object",
                        "properties": {
                            "query": {
                                "type": "string",
                                "description": "Search query"
                            },
                            "num_results": {
                                "type": "integer",
                                "description": "Number of results to return"
                            }
                        },
                        "required": ["query"]
                    }
                }
            }
        ]
    }

    # Run the agent
    response = requests.post(
        f"{API_URL}/agents/run",
        headers=headers,
        json={
            "agent_config": agent_config,
            "messages": [
                {"role": "user", "content": query}
            ]
        }
    )

    if response.status_code != 200:
        print(f"Error: {response.status_code}")
        print(response.text)
        return None

    result = response.json()
    run_id = result["run_id"]

    # Poll for completion
    while True:
        status_response = requests.get(
            f"{API_URL}/agents/status/{run_id}",
            headers=headers
        )

        if status_response.status_code != 200:
            print(f"Error checking status: {status_response.status_code}")
            return None

        status_data = status_response.json()

        if status_data["status"] == "completed":
            return status_data["result"]["output"]
        elif status_data["status"] == "failed":
            print(f"Agent run failed: {status_data.get('error')}")
            return None

        time.sleep(1)  # Poll every second

# Example usage
result = run_research_agent("What are the latest advancements in fusion energy?")
print(result)
```
cURL Example
```bash
# Run an agent
curl -X POST http://localhost:8000/api/v1/agents/run \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer your_api_key_here" \
  -d '{
    "agent_config": {
      "instructions": "You are a coding assistant.",
      "model": "gpt-4o",
      "tools": [
        {
          "type": "function",
          "function": {
            "name": "generate_code",
            "description": "Generate code in a specific language",
            "parameters": {
              "type": "object",
              "properties": {
                "language": {
                  "type": "string",
                  "description": "Programming language"
                },
                "task": {
                  "type": "string",
                  "description": "Task description"
                }
              },
              "required": ["language", "task"]
            }
          }
        }
      ]
    },
    "messages": [
      {"role": "user", "content": "Write a Python function to detect palindromes"}
    ]
  }'

# Check status
curl -X GET http://localhost:8000/api/v1/agents/status/run_abc123 \
  -H "Authorization: Bearer your_api_key_here"
```
Customizing Model Selection
Python Example
```python
import requests

API_URL = "http://localhost:8000/api/v1"
API_KEY = "your_api_key_here"

headers = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {API_KEY}"
}

# Custom routing preferences
def custom_routing_chat(message, routing_preferences):
    response = requests.post(
        f"{API_URL}/chat/completions",
        headers=headers,
        json={
            "messages": [
                {"role": "user", "content": message}
            ],
            "routing_preferences": routing_preferences
        }
    )

    if response.status_code == 200:
        result = response.json()
        print(f"Provider: {result['provider']}, Model: {result['model']}")
        return result["message"]["content"]
    else:
        print(f"Error: {response.status_code}")
        print(response.text)
        return None

# Examples with different routing preferences
response = custom_routing_chat(
    "What is the capital of France?",
    {
        "force_provider": "ollama",  # Force Ollama
        "privacy_level": "standard",
        "latency_preference": "balanced"
    }
)
print(f"Response: {response}\n")

response = custom_routing_chat(
    "Analyze the philosophical implications of artificial general intelligence.",
    {
        "force_provider": "openai",  # Force OpenAI
        "privacy_level": "standard",
        "latency_preference": "quality"  # Prefer quality over speed
    }
)
print(f"Response: {response}\n")

response = custom_routing_chat(
    "What is my personal password?",
    {
        "force_provider": None,  # Auto-select
        "privacy_level": "high",  # Privacy-sensitive query
        "latency_preference": "balanced"
    }
)
print(f"Response: {response}")
```
cURL Example
```bash
# Force Ollama for this request
curl -X POST http://localhost:8000/api/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer your_api_key_here" \
  -d '{
    "messages": [
      {"role": "user", "content": "What is the capital of Sweden?"}
    ],
    "routing_preferences": {
      "force_provider": "ollama",
      "privacy_level": "standard",
      "latency_preference": "speed"
    }
  }'

# Force specific model
curl -X POST http://localhost:8000/api/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer your_api_key_here" \
  -d '{
    "messages": [
      {"role": "user", "content": "Write Python code to implement merge sort"}
    ],
    "model": "ollama:codellama"
  }'
```
Tool Integration
Python Example
```python
import json

import requests

API_URL = "http://localhost:8000/api/v1"
API_KEY = "your_api_key_here"

headers = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {API_KEY}"
}


# Chat with tool integration
def chat_with_tools(message, tools):
    response = requests.post(
        f"{API_URL}/chat/completions",
        headers=headers,
        json={
            "messages": [
                {"role": "user", "content": message}
            ],
            "tools": tools
        }
    )

    if response.status_code != 200:
        print(f"Error: {response.status_code}")
        print(response.text)
        return None

    result = response.json()

    # If the model did not request a tool call, return the direct response
    tool_calls = result["message"].get("tool_calls")
    if not tool_calls:
        return result["message"]["content"]

    print(f"Tool calls requested: {len(tool_calls)}")

    # Process each tool call and collect one "tool" message per call
    tool_messages = []
    for tool_call in tool_calls:
        # In a real implementation, you would execute the actual tool here
        # For this example, we'll just simulate it
        function_name = tool_call["function"]["name"]
        arguments = json.loads(tool_call["function"]["arguments"])

        print(f"Executing tool: {function_name}")
        print(f"Arguments: {arguments}")

        # Simulate tool execution
        if function_name == "get_weather":
            tool_result = f"Weather in {arguments['location']}: Sunny, 22°C"
        elif function_name == "search_database":
            tool_result = f"Database results for {arguments['query']}: 3 records found"
        else:
            tool_result = "Unknown tool"

        tool_messages.append({
            "role": "tool",
            "tool_call_id": tool_call["id"],
            "content": tool_result
        })

    # Send all tool results back in a single follow-up request
    response = requests.post(
        f"{API_URL}/chat/completions",
        headers=headers,
        json={
            "messages": [
                {"role": "user", "content": message},
                {
                    "role": "assistant",
                    "content": result["message"].get("content"),
                    "tool_calls": tool_calls
                },
                *tool_messages
            ]
        }
    )

    if response.status_code != 200:
        print(f"Error in tool response: {response.status_code}")
        return None

    return response.json()["message"]["content"]


# Define available tools
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather in a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "City name"
                    },
                    "unit": {
                        "type": "string",
                        "enum": ["celsius", "fahrenheit"],
                        "description": "Temperature unit"
                    }
                },
                "required": ["location"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "search_database",
            "description": "Search a database for information",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {
                        "type": "string",
                        "description": "Search query"
                    },
                    "limit": {
                        "type": "integer",
                        "description": "Maximum number of results"
                    }
                },
                "required": ["query"]
            }
        }
    }
]

# Example usage
response = chat_with_tools("What's the weather like in Paris?", tools)
print(f"Final response: {response}")
```
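As the number of tools grows, the `if`/`elif` chain in the example above becomes awkward to maintain. One common alternative is a lookup table that maps tool names to Python callables. The sketch below is an illustration only: `TOOL_REGISTRY` and `run_tool_call` are hypothetical names, not part of the MCP API, and the tool bodies are still simulated.

```python
import json


def get_weather(location, unit="celsius"):
    # Simulated result, mirroring the example above (unit is accepted but unused)
    return f"Weather in {location}: Sunny, 22°C"


def search_database(query, limit=10):
    # Simulated result, mirroring the example above (limit is accepted but unused)
    return f"Database results for {query}: 3 records found"


# Hypothetical registry: tool name -> callable
TOOL_REGISTRY = {
    "get_weather": get_weather,
    "search_database": search_database,
}


def run_tool_call(tool_call):
    """Execute one tool call and return the matching "tool" message."""
    name = tool_call["function"]["name"]
    handler = TOOL_REGISTRY.get(name)
    if handler is None:
        content = "Unknown tool"
    else:
        try:
            arguments = json.loads(tool_call["function"]["arguments"])
            content = handler(**arguments)
        except (json.JSONDecodeError, TypeError) as exc:
            # Malformed JSON or arguments that do not match the function signature
            content = f"Tool error: {exc}"
    return {"role": "tool", "tool_call_id": tool_call["id"], "content": content}
```

Returning an error string instead of raising lets the model see what went wrong and possibly retry with corrected arguments; whether that is desirable depends on the application.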
Troubleshooting
Common Issues
Installation Issues
Ollama Installation Fails
Symptoms:
- Error messages during Ollama installation
- `ollama serve` command not found
Possible Solutions:
- Check system requirements (minimum 8GB RAM recommended)
- For Linux, ensure you have the required dependencies:
```bash
sudo apt-get update
sudo apt-get install -y ca-certificates curl
```
- Try the manual installation from ollama.ai
- Check if Ollama is running:
```bash
ps aux | grep ollama
```
Python Dependency Errors
Symptoms:
- `pip install` fails with compatibility errors
- Import errors when starting the application
Possible Solutions:
- Ensure you're using Python 3.11 or higher:
```bash
python --version
```
- Try creating a fresh virtual environment:
```bash
rm -rf venv
python -m venv venv
source venv/bin/activate
pip install --upgrade pip
```
- Install the listed packages without their transitive dependencies to narrow down which package fails:
```bash
pip install -r requirements.txt --no-deps
```
- Check for conflicts with pip:
```bash
pip check
```
API Connection Issues
OpenAI API Key Invalid
Symptoms:
- Error messages about authentication
- "Invalid API key" errors
Possible Solutions:
- Verify your API key is correct and active in the OpenAI dashboard
- Check if the key is properly set in your `.env` file or environment variables
- Ensure there are no spaces or unexpected characters in the key
- Test the key with a simple OpenAI API request:
```bash
curl https://api.openai.com/v1/models \
  -H "Authorization: Bearer YOUR_API_KEY"
```
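Stray whitespace and quote characters are easy to miss by eye, especially when a key is pasted into a `.env` file. A minimal sketch of a formatting check follows; `api_key_problems` is a hypothetical helper, and `OPENAI_API_KEY` is assumed to be the variable name in use (it is the conventional one, but your configuration may differ). The check only inspects formatting and does not confirm that the key is valid.

```python
import os


def api_key_problems(key):
    """Return a list of formatting problems found in an API key string."""
    if key is None or not key.strip():
        return ["key is empty or not set"]
    problems = []
    stripped = key.strip()
    if stripped != key:
        problems.append("leading or trailing whitespace")
    if any(ch.isspace() for ch in stripped):
        problems.append("whitespace inside the key")
    if stripped[0] in "\"'" or stripped[-1] in "\"'":
        problems.append("quote characters around the key")
    if not stripped.isascii():
        problems.append("non-ASCII characters")
    return problems


if __name__ == "__main__":
    # OPENAI_API_KEY is an assumed variable name; adjust it to your setup
    for problem in api_key_problems(os.getenv("OPENAI_API_KEY")):
        print(f"OPENAI_API_KEY: {problem}")
```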
Ollama Connection Failed
Symptoms:
- "Connection refused" errors when connecting to Ollama
- API requests to Ollama timeout
Possible Solutions:
- Verify Ollama is running:
```bash
ollama list  # Should show available models
```
- If not running, start the Ollama service:
```bash
ollama serve
```
- Check if the Ollama port is accessible:
```bash
curl http://localhost:11434/api/tags
```
- Verify your `OLLAMA_HOST` setting in the configuration
- If using Docker, ensure proper network configuration between containers
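The same connectivity check can be scripted, for example as a startup guard. The sketch below assumes Ollama's default address `http://localhost:11434` and uses only the standard library; `ollama_reachable` is a hypothetical helper name, not something MCP provides.

```python
import json
import urllib.error
import urllib.request


def ollama_reachable(base_url="http://localhost:11434", timeout=3.0):
    """Return (ok, detail) after a GET request to Ollama's /api/tags endpoint."""
    url = f"{base_url.rstrip('/')}/api/tags"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            payload = json.load(resp)
    except (urllib.error.URLError, OSError, ValueError) as exc:
        # URLError/OSError: refused connection, DNS failure, or timeout
        # ValueError: the response body was not valid JSON
        return False, f"cannot reach {url}: {exc}"
    models = payload.get("models", []) if isinstance(payload, dict) else []
    return True, f"{len(models)} model(s) available"
```

Calling `ollama_reachable()` returns a tuple such as `(False, "cannot reach ...")` when the service is down, which makes "Connection refused" failures visible before the first real request is sent.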
Performance Issues
High Latency with Ollama
Symptoms:
- Very slow responses from Ollama models
- Timeouts during inference
Possible Solutions:
- Check if you have GPU support enabled:
```bash
nvidia-smi  # Should show GPU usage
```
- Try a smaller model:
```bash
ollama pull tinyllama
```
- Adjust model parameters in your request:
```json
{
  "model": "ollama:llama2",
  "max_tokens": 512,
  "temperature": 0.7
}
```
- Check system resource usage:
```bash
htop
```
- Increase the timeout in your configuration
Memory Usage Too High
Symptoms:
- Out of memory errors
- System becomes unresponsive
Possible Solutions:
- Use smaller models (e.g., `mistral:7b` instead of larger variants)
- Reduce batch sizes in configuration
- Implement memory limits:
```yaml
# In docker-compose.yml
services:
  ollama:
    deploy:
      resources:
        limits:
          memory: 12G
```
- Enable context window optimization: `ENABLE_TOKEN_OPTIMIZATION=true`
Routing and Model Issues
All Requests Going to One Provider
Symptoms:
- All requests route to OpenAI despite configuration
- All requests route to Ollama regardless of complexity
Possible Solutions:
- Check for environment variables forcing a provider:
```text
FORCE_OLLAMA=false
FORCE_OPENAI=false
```
- Verify complexity threshold setting: `COMPLEXITY_THRESHOLD=0.65`
- Review routing preferences in requests:
```json
{
  "routing_preferences": {
    "force_provider": null
  }
}
```
- Check logs for routing decisions
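When debugging routing, it can help to write down the order in which these settings would be expected to apply. The sketch below is a guess at that order, not the actual MCP router: it assumes environment-level force flags win over a per-request `force_provider`, and that a complexity score at or above the threshold goes to OpenAI while anything below stays on Ollama. `choose_provider` is a hypothetical function name.

```python
def choose_provider(complexity, threshold=0.65, force_provider=None,
                    force_ollama=False, force_openai=False):
    """Pick "ollama" or "openai" for a request (assumed precedence, see text)."""
    if force_ollama and force_openai:
        # Contradictory settings are a likely source of surprising routing
        raise ValueError("FORCE_OLLAMA and FORCE_OPENAI are both set")
    # 1. Environment-level overrides (assumed to take priority)
    if force_ollama:
        return "ollama"
    if force_openai:
        return "openai"
    # 2. Per-request preference from routing_preferences.force_provider
    if force_provider in ("ollama", "openai"):
        return force_provider
    # 3. Fall back to the complexity threshold
    return "openai" if complexity >= threshold else "ollama"
```

If every request lands on one provider, walking through these three steps with the values from your environment and request body usually shows which setting is responsible.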
Model Not Found
Symptoms:
- "Model not found" errors
- Models available but not being used
Possible Solutions:
- List available models:
```bash
ollama list
```
- Pull the missing model:
```bash
ollama pull mistral
```
- Verify model names match exactly what you're requesting
- Check model mapping in configuration
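A frequent cause of name mismatches is the tag: Ollama stores a model pulled as `mistral` under `mistral:latest`, so a plain string comparison against the `ollama list` output fails. A minimal sketch of a tag-aware comparison follows. The helper names are hypothetical, and the sketch assumes plain Ollama model names, so a provider prefix such as the `ollama:` in `ollama:llama2` would need to be stripped first.

```python
def normalize_model_name(name):
    """Append Ollama's default ':latest' tag when the name has no tag."""
    name = name.strip()
    return name if ":" in name else f"{name}:latest"


def missing_models(requested, installed):
    """Return the requested model names that are not among the installed ones."""
    installed_set = {normalize_model_name(n) for n in installed}
    return [r for r in requested if normalize_model_name(r) not in installed_set]
```

For example, `missing_models(["mistral", "llama2:13b"], ["mistral:latest"])` reports only `llama2:13b` as missing.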
Diagnostics
Log Analysis
MCP logs contain valuable diagnostic information. Use the following commands to analyze logs:
```bash
# View API logs
docker-compose logs -f app

# View Ollama logs
docker-compose logs -f ollama

# Search for errors
docker-compose logs | grep -i error

# Check routing decisions
docker-compose logs app | grep "Routing decision"
```
Health Check
Use the health check endpoint to verify system status:
```bash
curl http://localhost:8000/api/v1/health

# For more detailed health information
curl http://localhost:8000/api/v1/health/details
```
Debug Mode
Enable debug logging for more detailed information:
```bash
# Set environment variable
export LOG_LEVEL=DEBUG

# Or modify in .env file
LOG_LEVEL=DEBUG
```
Performance Testing
Use the built-in benchmark tool to test system performance:
```bash
python scripts/benchmark.py --provider both --queries 10 --complexity mixed
```
Log Management
Log Levels
MCP uses the following log levels:
- `ERROR`: Critical errors that require immediate attention
- `WARNING`: Non-critical issues that might indicate problems
- `INFO`: General operational information
- `DEBUG`: Detailed information for debugging purposes
Log Formats
Logs can be formatted as text or JSON:
```bash
# Set JSON logging
export LOG_FORMAT=json

# Set text logging (default)
export LOG_FORMAT=text
```
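For readers wiring similar switches into their own service, the sketch below shows one way `LOG_LEVEL` and `LOG_FORMAT` could drive Python's standard `logging` module. It is not the MCP implementation: `JsonFormatter`, `build_handler`, and `configure_logging` are hypothetical names, and the JSON field names are an arbitrary choice.

```python
import json
import logging
import os

# The four levels listed above, mapped to the standard library constants
LEVELS = {
    "ERROR": logging.ERROR,
    "WARNING": logging.WARNING,
    "INFO": logging.INFO,
    "DEBUG": logging.DEBUG,
}


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record):
        return json.dumps({
            "time": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })


def build_handler(log_format=None):
    """Create a stream handler for LOG_FORMAT ("json" or "text", default text)."""
    log_format = (log_format or os.getenv("LOG_FORMAT", "text")).lower()
    handler = logging.StreamHandler()
    if log_format == "json":
        handler.setFormatter(JsonFormatter())
    else:
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
        )
    return handler


def configure_logging():
    """Install the handler on the root logger and apply LOG_LEVEL (default INFO)."""
    level = LEVELS.get(os.getenv("LOG_LEVEL", "INFO").upper(), logging.INFO)
    root = logging.getLogger()
    root.handlers = [build_handler()]
    root.setLevel(level)
```

One JSON object per line is the shape most log forwarders expect, which is why the JSON option pairs naturally with external log management.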
External Log Management
For production environments, consider forwarding logs to an external system:
```bash
# Using Fluentd
docker-compose -f docker-compose.yml -f docker-compose.logging.yml up -d
```
Or configure log drivers in Docker:
```yaml
# In docker-compose.yml
services:
  app:
    logging:
      driver: "json-file"
      options:
        max-size: "10m"
        max-file: "3"
```
Contributing
Contributions to the MCP system are welcome! Please follow these guidelines:
Getting Started
- Fork the Repository

  Fork the repository on GitHub and clone your fork locally:

  ```bash
  git clone https://github.com/YOUR-USERNAME/mcp.git
  cd mcp
  ```

- Set Up Development Environment

  Follow the installation instructions in the Installation Guide section.

- Create a Branch

  Create a branch for your feature or bugfix:

  ```bash
  git checkout -b feature/your-feature-name
  # or
  git checkout -b fix/your-bugfix-name
  ```
Development Guidelines
Code Style
- Follow PEP 8 style guidelines for Python code
- Use type hints for all function definitions
- Format code with Black
- Verify style with flake8
```bash
# Install development tools
pip install black flake8 mypy

# Format code
black app tests

# Check style
flake8 app tests

# Run type checking
mypy app
```
Testing
- Write unit tests for all new functionality
- Ensure existing tests pass before submitting a PR
- Maintain or improve code coverage
```bash
# Run tests
pytest

# Run tests with coverage
pytest --cov=app tests/

# Run only unit tests
pytest tests/unit/

# Run integration tests
pytest tests/integration/
```
Documentation
- Update documentation for any new features or changes
- Document all public APIs with docstrings
- Keep the README and guides up to date
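Taken together, the code style, testing, and documentation guidelines might look like this for a small helper. The function is purely illustrative: `trim_history` is a hypothetical name and is not part of the MCP codebase.

```python
from typing import Dict, List

Message = Dict[str, str]


def trim_history(messages: List[Message], max_messages: int) -> List[Message]:
    """Return the most recent ``max_messages`` chat messages.

    Args:
        messages: Chat messages in chronological order.
        max_messages: Maximum number of messages to keep; must be positive.

    Raises:
        ValueError: If ``max_messages`` is not positive.
    """
    if max_messages <= 0:
        raise ValueError("max_messages must be positive")
    return messages[-max_messages:]


def test_trim_history_keeps_latest_messages() -> None:
    # pytest discovers functions whose names start with "test_"
    history = [{"role": "user", "content": str(i)} for i in range(5)]
    assert trim_history(history, 2) == history[3:]
```

In a real contribution the test would live in its own file under `tests/unit/` rather than next to the function.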
Submitting Changes
- Commit Your Changes

  Make focused commits with meaningful commit messages:

  ```bash
  git add .
  git commit -m "Add feature: detailed description of changes"
  ```

- Pull Latest Changes

  Rebase your branch on the latest main:

  ```bash
  # Requires an "upstream" remote that points at the original repository
  git checkout main
  git pull upstream main
  git checkout your-branch
  git rebase main
  ```

- Push to Your Fork

  ```bash
  git push origin your-branch
  ```

- Create a Pull Request

  Open a pull request from your fork to the main repository:

  - Provide a clear title and description
  - Reference any related issues
  - Describe testing performed
  - Include screenshots for UI changes
Code of Conduct
- Be respectful and inclusive in all interactions
- Provide constructive feedback
- Focus on the issues, not the people
- Welcome contributors of all backgrounds and experience levels
License
By contributing to this project, you agree that your contributions will be licensed under the project's MIT License.
License
MIT License
```text
Copyright (c) 2023 MCP Contributors

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
```
Third-Party Licenses
This project incorporates several third-party open-source libraries, each with its own license:
- FastAPI: MIT License
- Pydantic: MIT License
- Uvicorn: BSD 3-Clause License
- OpenAI Python: MIT License
- Redis-py: MIT License
- Prometheus Client: Apache License 2.0
- Ollama: MIT License
Full license texts are included in the LICENSE-3RD-PARTY file in the repository.
Usage Restrictions
While the MCP system itself is open source, usage of the OpenAI API is subject to OpenAI's terms of service and usage policies. Please ensure your use of the API complies with these terms.
