November 15, 2025·39 min

Building and Evaluating a Local-First Research Assistant with GraphRAG and vero-eval

Complete technical guide to building a production-ready research assistant using GraphRAG, Neo4j knowledge graphs, Ollama local LLMs, and vero-eval evaluation framework for rigorous AI system testing.

Daniel Kliewer

Author, Sovereign AI

Tags: AI, GraphRAG, Local LLM, Neo4j, Ollama, vero-eval, Research Assistant, Knowledge Graph, RAG, AI Evaluation

From the Book

This is from Sovereign AI: Building Local-First Intelligent Systems.


A comprehensive guide to creating a persona-driven AI assistant with rigorous evaluation using Neo4j, Ollama, and the vero-eval framework

Introduction: Why Local GraphRAG Matters for Research Workflows

If you're building AI-powered applications in 2025, you've likely hit two major pain points: limited context and a lack of systematic evaluation. Large Language Models are powerful, but they struggle with long-term memory and with consistent performance across edge cases. Enter GraphRAG, a methodology that combines knowledge graphs with retrieval-augmented generation so your assistant can draw on a persistent, queryable store of facts and relationships instead of relying on the context window alone.

In this guide, we'll build a Local Research Assistant that:

Stores and retrieves research papers, notes, and conversations in a Neo4j knowledge graph
Uses Ollama for completely local inference (no API costs, full privacy)
Implements persona-driven responses that adapt based on RLHF feedback
Most importantly: Measures performance rigorously using the vero-eval framework

This isn't another "hello world" tutorial. We're building production-ready infrastructure that you can deploy for real research workflows, with proper testing and evaluation baked in from day one.

Prerequisites and Starting Point

Before we dive in, you'll need:

System Requirements:

Python 3.9+
Node.js 18+
Docker (for Neo4j)
16GB+ RAM recommended

Core Technologies:

Ollama for local LLM inference
Neo4j for graph database
vero-eval for evaluation
Next.js + FastAPI (from the starter template)

Clone the Starter Repository:

bash
git clone https://github.com/kliewerdaniel/chrisbot.git research-assistant
cd research-assistant

This gives us a solid foundation with the frontend, basic chat interface, and project structure already in place. We'll extend it to build our research-focused GraphRAG system.

Part 1: Understanding the Architecture

Our Research Assistant follows the PersonaGen architecture pattern outlined by Daniel Kliewer, but applied to academic research workflows:

text
┌─────────────────────────────────────────────────────────┐
│                     User Interface                      │
│                (Next.js Chat Interface)                 │
└────────────────────┬────────────────────────────────────┘
                     │
                     ▼
┌─────────────────────────────────────────────────────────┐
│                     Reasoning Agent                     │
│          (Tool Calling + RLHF Threshold Logic)          │
└────────────────────┬────────────────────────────────────┘
                     │
          ┌──────────┴──────────┐
          ▼                     ▼
┌──────────────────┐   ┌──────────────────┐
│   Neo4j Graph    │   │  Ollama LLM      │
│   RAG System     │   │  (Mistral/Llama) │
│                  │   │                  │
│ • Papers         │   │ • Generation     │
│ • Authors        │   │ • Embeddings     │
│ • Concepts       │   │ • Extraction     │
│ • Citations      │   │                  │
└──────────────────┘   └──────────────────┘
          │
          ▼
┌─────────────────────────────────────────────────────────┐
│                   vero-eval Framework                   │
│  • Test Dataset Generation                              │
│  • Retrieval Metrics (Precision, Recall, MRR)           │
│  • Generation Metrics (Faithfulness, BERTScore)         │
│  • Persona Stress Testing                               │
└─────────────────────────────────────────────────────────┘

Key Insight: The persona system adapts its behavior based on evaluation feedback. If vero-eval shows poor retrieval for technical queries, the RLHF thresholds adjust to require more context before responding.

Part 2: Setting Up Neo4j GraphRAG

Neo4j is our memory layer. Following the official Neo4j GenAI integration patterns, we'll create a graph schema optimized for research.

Installing Neo4j GraphRAG for Python

bash
# Install the official Neo4j GraphRAG package
pip install neo4j-graphrag

# Install Ollama integration
pip install "neo4j-graphrag[ollama]"

# Install the other libraries the scripts below import
pip install ollama PyPDF2

# Start Neo4j (using Docker)
docker run \
    --name research-neo4j \
    -p 7474:7474 -p 7687:7687 \
    -e NEO4J_AUTH=neo4j/research2025 \
    -v "$PWD/neo4j-data:/data" \
    neo4j:latest

Neo4j Knowledge Graph Setup for Research Assistant

Defining the Research Knowledge Schema

Create scripts/graph_schema.py:

python
from neo4j_graphrag import GraphSchema
from dataclasses import dataclass

@dataclass
class ResearchSchema(GraphSchema):
    """
    Knowledge graph schema for research assistant.
    
    Nodes:
    - Paper: Research papers with metadata
    - Author: Paper authors with affiliation
    - Concept: Extracted key concepts/topics
    - Note: User's research notes
    - Question: User queries with context
    
    Relationships:
    - AUTHORED: Author -> Paper
    - CITES: Paper -> Paper
    - DISCUSSES: Paper -> Concept
    - RELATES_TO: Concept -> Concept
    - ANSWERS: Paper -> Question
    - ANNOTATES: Note -> Paper
    """
    
    node_types = {
        'Paper': {
            'properties': ['title', 'abstract', 'year', 'doi', 'pdf_path'],
            'embedding_property': 'abstract_embedding'
        },
        'Author': {
            'properties': ['name', 'affiliation', 'h_index'],
            'embedding_property': None
        },
        'Concept': {
            'properties': ['name', 'definition', 'domain'],
            'embedding_property': 'definition_embedding'
        },
        'Note': {
            'properties': ['content', 'timestamp', 'tags'],
            'embedding_property': 'content_embedding'
        },
        'Question': {
            'properties': ['query', 'timestamp', 'answered'],
            'embedding_property': 'query_embedding'
        }
    }
    
    relationship_types = {
        'AUTHORED': ('Author', 'Paper'),
        'CITES': ('Paper', 'Paper'),
        'DISCUSSES': ('Paper', 'Concept'),
        'RELATES_TO': ('Concept', 'Concept'),
        'ANSWERS': ('Paper', 'Question'),
        'ANNOTATES': ('Note', 'Paper')
    }

Why this schema? Research workflows have natural graph structures:

Papers cite each other (transitive relationships)
Concepts relate to multiple papers
Authors collaborate across papers
User notes connect to specific papers

This lets us traverse the graph to find: "What papers discussing transformer architectures were cited by papers on RAG systems after 2023?"
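
That question maps onto a Cypher pattern fairly directly. Here is a minimal sketch that builds the query and its parameters; the helper name build_citation_query is illustrative, and it assumes concept names are free-text strings matched case-insensitively and that "after 2023" applies to the citing (RAG) papers.

```python
def build_citation_query(citing_concept: str, cited_concept: str, min_year: int):
    """Return (cypher, params) for: papers discussing `cited_concept` that are
    cited by papers discussing `citing_concept` and published after `min_year`."""
    cypher = """
    MATCH (citing:Paper)-[:DISCUSSES]->(c1:Concept)
    WHERE toLower(c1.name) CONTAINS toLower($citing_concept)
      AND citing.year > $min_year
    MATCH (citing)-[:CITES]->(cited:Paper)-[:DISCUSSES]->(c2:Concept)
    WHERE toLower(c2.name) CONTAINS toLower($cited_concept)
    RETURN DISTINCT cited.title AS title, cited.year AS year
    ORDER BY year DESC
    """
    params = {
        "citing_concept": citing_concept,
        "cited_concept": cited_concept,
        "min_year": min_year,
    }
    return cypher, params


if __name__ == "__main__":
    cypher, params = build_citation_query("RAG", "transformer", 2023)
    print(cypher)
    print(params)
    # With a live driver: session.run(cypher, **params).data()
```

Keeping the values in params rather than formatting them into the string lets Neo4j cache the query plan and avoids injection issues.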

Building the Graph Ingestion Pipeline

Create scripts/ingest_research_data.py:

python
import json
from pathlib import Path

import ollama
import PyPDF2
from neo4j import GraphDatabase
from neo4j_graphrag import GraphRAG

class ResearchGraphBuilder:
    def __init__(self, neo4j_uri="bolt://localhost:7687", 
                 neo4j_user="neo4j", 
                 neo4j_password="research2025",
                 ollama_model="mistral"):
        
        self.driver = GraphDatabase.driver(neo4j_uri, 
                                          auth=(neo4j_user, neo4j_password))
        self.ollama_model = ollama_model
        self.graph_rag = GraphRAG(self.driver)
        
    def extract_paper_metadata(self, pdf_path: Path) -> dict:
        """Extract title, authors, abstract, and key concepts from PDF"""
        with open(pdf_path, 'rb') as file:
            reader = PyPDF2.PdfReader(file)
            
            # Extract first 3 pages (usually contains abstract)
            text = ""
            for page in reader.pages[:3]:
                text += page.extract_text() or ""
        
        # Use Ollama to extract structured metadata. Naming the JSON keys
        # in the prompt keeps the output aligned with create_paper_node().
        prompt = f"""Extract from this research paper excerpt:
        1. title (string)
        2. authors (list of strings)
        3. abstract (string)
        4. concepts (list of 5-7 main topics)
        5. year (integer publication year, or null if unknown)
        
        Text: {text[:4000]}
        
        Return a JSON object with exactly these keys:
        title, authors, abstract, concepts, year."""
        
        response = ollama.generate(
            model=self.ollama_model,
            prompt=prompt,
            format='json'
        )
        
        return json.loads(response['response'])
    
    def create_paper_node(self, metadata: dict, pdf_path: Path):
        """Create Paper node with embeddings"""
        
        # Generate embedding for abstract
        abstract_embedding = ollama.embeddings(
            model='nomic-embed-text',
            prompt=metadata['abstract']
        )['embedding']
        
        with self.driver.session() as session:
            session.run("""
                CREATE (p:Paper {
                    title: $title,
                    abstract: $abstract,
                    year: $year,
                    pdf_path: $pdf_path,
                    abstract_embedding: $embedding
                })
                // FOREACH avoids the row multiplication that chained
                // UNWINDs cause, and still runs when a list is empty
                FOREACH (author_name IN $authors |
                    MERGE (a:Author {name: author_name})
                    MERGE (a)-[:AUTHORED]->(p)
                )
                FOREACH (concept_name IN $concepts |
                    MERGE (c:Concept {name: concept_name})
                    MERGE (p)-[:DISCUSSES]->(c)
                )
                """,
                title=metadata['title'],
                abstract=metadata['abstract'],
                year=metadata.get('year') or 2024,
                pdf_path=str(pdf_path),
                embedding=abstract_embedding,
                authors=metadata.get('authors', []),
                concepts=metadata.get('concepts', [])
            )
    
    def ingest_directory(self, papers_dir: Path):
        """Ingest all PDFs in a directory"""
        pdf_files = list(papers_dir.glob("*.pdf"))
        
        print(f"Found {len(pdf_files)} papers to ingest...")
        
        for pdf_path in pdf_files:
            print(f"Processing: {pdf_path.name}")
            try:
                metadata = self.extract_paper_metadata(pdf_path)
                self.create_paper_node(metadata, pdf_path)
                print(f"✓ Ingested: {metadata['title']}")
            except Exception as e:
                print(f"✗ Failed {pdf_path.name}: {e}")

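Key Pattern: We're using Ollama for both extraction (via generate) and embeddings (via embeddings). This keeps everything local. The embeddings are stored as properties on the nodes themselves. For production, you might also cache embeddings so unchanged text isn't re-embedded, and you'll want a vector index so similarity search stays fast.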
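
One way to approach the caching idea is a small in-memory wrapper around whatever embedding call you use. This is a sketch under stated assumptions: EmbeddingCache is a hypothetical helper, not part of Ollama or Neo4j, and it takes the embedding function as an argument so it can be exercised without a running model.

```python
import hashlib
from typing import Callable, Dict, List


class EmbeddingCache:
    """Memoize embeddings by a hash of the input text (hypothetical helper)."""

    def __init__(self, embed_fn: Callable[[str], List[float]]):
        self._embed_fn = embed_fn
        self._store: Dict[str, List[float]] = {}
        self.misses = 0  # how many times the underlying model was called

    def get(self, text: str) -> List[float]:
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._embed_fn(text)
        return self._store[key]


if __name__ == "__main__":
    # Stand-in embedder for demonstration. With Ollama you could pass:
    # lambda t: ollama.embeddings(model='nomic-embed-text', prompt=t)['embedding']
    cache = EmbeddingCache(lambda t: [float(len(t))])
    cache.get("graph retrieval")
    cache.get("graph retrieval")  # served from the cache
    print(cache.misses)
```

An in-memory dict is lost when the process exits; persisting the store to disk or to the graph would be the next step for large ingests.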

Creating Vector Indexes for Hybrid Search

Following Neo4j's GenAI integration guide, we create vector indexes:

python
def create_vector_indexes(self):
    """Create vector indexes for similarity search"""
    with self.driver.session() as session:
        # Abstract embeddings (768 dimensions for nomic-embed-text)
        session.run("""
            CREATE VECTOR INDEX paper_abstracts IF NOT EXISTS
            FOR (p:Paper)
            ON p.abstract_embedding
            OPTIONS {
                indexConfig: {
                    `vector.dimensions`: 768,
                    `vector.similarity_function`: 'cosine'
                }
            }
        """)
        
        # Concept embeddings
        session.run("""
            CREATE VECTOR INDEX concept_definitions IF NOT EXISTS
            FOR (c:Concept)
            ON c.definition_embedding
            OPTIONS {
                indexConfig: {
                    `vector.dimensions`: 768,
                    `vector.similarity_function`: 'cosine'
                }
            }
        """)
        
        # Note embeddings
        session.run("""
            CREATE VECTOR INDEX note_contents IF NOT EXISTS
            FOR (n:Note)
            ON n.content_embedding
            OPTIONS {
                indexConfig: {
                    `vector.dimensions`: 768,
                    `vector.similarity_function`: 'cosine'
                }
            }
        """)

Critical: The dimension count must match your embedding model. nomic-embed-text produces 768-dimensional vectors; if you switch to all-MiniLM-L6-v2, you'd need 384. A mismatch means vectors can't be indexed or queried correctly.
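
A cheap safeguard is to compare the length of one real embedding against the configured index dimension before creating any index. The helper below is a hypothetical sketch, not a Neo4j or Ollama API.

```python
from typing import Sequence


def check_embedding_dimension(embedding: Sequence[float], index_dimensions: int) -> None:
    """Raise ValueError if an embedding's length differs from the index setting."""
    if len(embedding) != index_dimensions:
        raise ValueError(
            f"Embedding has {len(embedding)} dimensions but the vector index "
            f"is configured for {index_dimensions}"
        )


if __name__ == "__main__":
    # In practice, `probe` would come from your embedding model, e.g.
    # ollama.embeddings(model='nomic-embed-text', prompt='probe')['embedding']
    probe = [0.0] * 8
    check_embedding_dimension(probe, index_dimensions=8)  # passes silently
```

Running this once at startup turns a silent retrieval problem into an immediate, readable error.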

Part 3: Implementing Hybrid Retrieval

Now we implement the retrieval layer that combines vector similarity with graph traversal:

Hybrid Retrieval System Architecture

python
import ollama

class HybridRetriever:
    def __init__(self, driver, ollama_model="mistral"):
        self.driver = driver
        self.ollama_model = ollama_model
    
    def retrieve_context(self, query: str, limit: int = 5) -> list[dict]:
        """
        Hybrid retrieval combining:
        1. Vector similarity search
        2. Graph traversal for related concepts
        3. Citation network expansion
        """
        
        # Generate query embedding
        query_embedding = ollama.embeddings(
            model='nomic-embed-text',
            prompt=query
        )['embedding']
        
        with self.driver.session() as session:
            # Vector similarity search. OPTIONAL MATCH keeps papers that
            # have no linked authors or concepts.
            vector_results = session.run("""
                CALL db.index.vector.queryNodes(
                    'paper_abstracts', 
                    $limit, 
                    $query_embedding
                )
                YIELD node, score
                OPTIONAL MATCH (node)<-[:AUTHORED]-(author:Author)
                OPTIONAL MATCH (node)-[:DISCUSSES]->(concept:Concept)
                
                RETURN 
                    node.title AS title,
                    node.abstract AS abstract,
                    node.year AS year,
                    score AS relevance_score,
                    collect(DISTINCT author.name) AS authors,
                    collect(DISTINCT concept.name) AS concepts,
                    'vector_search' AS retrieval_method
                ORDER BY relevance_score DESC
                """,
                query_embedding=query_embedding,
                limit=limit
            ).data()
            
            # Graph traversal for cited papers
            graph_results = []
            if vector_results:
                top_paper_title = vector_results[0]['title']
                
                graph_results = session.run("""
                    MATCH (seed:Paper {title: $seed_title})-[:CITES]->(cited:Paper)
                    OPTIONAL MATCH (cited)<-[:AUTHORED]-(author:Author)
                    OPTIONAL MATCH (cited)-[:DISCUSSES]->(concept:Concept)
                    // Aggregate first: collect() is not allowed inside WHERE
                    WITH cited,
                         collect(DISTINCT author.name) AS authors,
                         collect(DISTINCT concept.name) AS concepts
                    WHERE any(c IN concepts WHERE toLower(c) IN $query_concepts)
                    
                    RETURN 
                        cited.title AS title,
                        cited.abstract AS abstract,
                        cited.year AS year,
                        0.7 AS relevance_score,  // fixed heuristic score
                        authors,
                        concepts,
                        'citation_traversal' AS retrieval_method
                    LIMIT $limit
                    """,
                    seed_title=top_paper_title,
                    query_concepts=self._extract_query_concepts(query),
                    limit=max(1, limit // 2)
                ).data()
            
            # Combine and deduplicate
            all_results = vector_results + graph_results
            seen_titles = set()
            unique_results = []
            
            for result in all_results:
                if result['title'] not in seen_titles:
                    seen_titles.add(result['title'])
                    unique_results.append(result)
            
            return sorted(unique_results, 
                         key=lambda x: x['relevance_score'], 
                         reverse=True)[:limit]
    
    def _extract_query_concepts(self, query: str) -> list[str]:
        """Extract key concepts from query using LLM (lowercased for matching)"""
        response = ollama.generate(
            model=self.ollama_model,
            prompt=f"Extract 3-5 key technical concepts from this query: {query}. Return as comma-separated list.",
            options={'temperature': 0.1}
        )
        return [c.strip().lower() for c in response['response'].split(',') if c.strip()]

Why hybrid? Pure vector search might miss important papers that don't match semantically but are cited by relevant papers. Graph traversal captures these relationships.
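
The merge step is easy to test in isolation. The sketch below is a slight variation on the deduplication in retrieve_context: when the same title arrives from both sources it keeps the higher-scoring copy rather than the first one seen. The function name merge_results and the toy data are illustrative assumptions.

```python
from typing import Dict, List


def merge_results(vector_results: List[dict], graph_results: List[dict],
                  limit: int) -> List[dict]:
    """Deduplicate by title, keep the best score per title, return the top `limit`."""
    best: Dict[str, dict] = {}
    for result in vector_results + graph_results:
        title = result["title"]
        if title not in best or result["relevance_score"] > best[title]["relevance_score"]:
            best[title] = result
    ranked = sorted(best.values(), key=lambda r: r["relevance_score"], reverse=True)
    return ranked[:limit]


if __name__ == "__main__":
    vector = [
        {"title": "A", "relevance_score": 0.9, "retrieval_method": "vector_search"},
        {"title": "B", "relevance_score": 0.6, "retrieval_method": "vector_search"},
    ]
    graph = [
        {"title": "B", "relevance_score": 0.7, "retrieval_method": "citation_traversal"},
        {"title": "C", "relevance_score": 0.7, "retrieval_method": "citation_traversal"},
    ]
    for row in merge_results(vector, graph, limit=3):
        print(row["title"], row["relevance_score"], row["retrieval_method"])
```

Because citation results carry a fixed score, where they land relative to vector hits depends entirely on that constant, so it is worth tuning against your evaluation results.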

Part 4: The Reasoning Agent and Persona Layer

The reasoning agent decides when to query the graph and how to format responses based on RLHF-adjusted thresholds:

Persona-Driven Response System Architecture

python
# In scripts/reasoning_agent.py

import json
from pathlib import Path
from typing import Optional

import ollama
from neo4j import GraphDatabase

# HybridRetriever is the class from Part 3. The module name below is an
# assumption; import it from wherever you saved that class.
from hybrid_retriever import HybridRetriever

class PersonaReasoningAgent:
    def __init__(self, persona_config_path: Path = Path("data/persona.json"),
                 driver=None, ollama_model: str = "mistral"):
        self.persona_config_path = persona_config_path
        self.persona_config = self._load_persona(persona_config_path)
        self.ollama_model = ollama_model
        # Default to the local Neo4j instance started in Part 2
        self.driver = driver or GraphDatabase.driver(
            "bolt://localhost:7687", auth=("neo4j", "research2025")
        )
        self.retriever = HybridRetriever(self.driver, ollama_model)
        
    def _load_persona(self, config_path: Path) -> dict:
        """Load persona configuration with RLHF thresholds"""
        with open(config_path) as f:
            return json.load(f)
    
    def should_retrieve_context(self, query: str) -> bool:
        """
        Decide if we need to retrieve context based on:
        1. Query complexity
        2. RLHF confidence threshold
        3. Recent retrieval success rate
        """
        
        # Simple heuristic: technical terms or specific paper requests
        technical_indicators = [
            'paper', 'research', 'study', 'findings',
            'method', 'algorithm', 'experiment', 'results'
        ]
        
        needs_retrieval = any(term in query.lower() 
                             for term in technical_indicators)
        
        # Check RLHF threshold (a higher value means retrieve more readily)
        confidence_threshold = self.persona_config['rlhf_thresholds']['retrieval_required']
        
        # If recent queries had low-quality responses, lean harder on retrieval
        if self.persona_config.get('recent_success_rate', 1.0) < 0.7:
            confidence_threshold = min(1.0, confidence_threshold * 1.25)
        
        return needs_retrieval or confidence_threshold > 0.5
    
    def generate_response(self, query: str, chat_history: Optional[list] = None) -> dict:
        """
        Main orchestration logic:
        1. Decide if retrieval needed
        2. Retrieve context if necessary
        3. Generate response with persona coloring
        4. Grade output (RLHF scoring)

        chat_history is a list of {'role': ..., 'content': ...} messages.
        """
        
        # Step 1: Retrieval decision
        needs_context = self.should_retrieve_context(query)
        
        context_docs = []
        if needs_context:
            context_docs = self.retriever.retrieve_context(query, limit=5)
        
        # Step 2: Format context for LLM
        context_str = self._format_context(context_docs)
        
        # Step 3: Generate with persona. ollama.chat accepts prior turns as
        # messages; generate()'s `context` argument expects token IDs instead.
        system_prompt = self._build_persona_prompt(context_str)
        messages = [{'role': 'system', 'content': system_prompt}]
        messages.extend(chat_history or [])
        messages.append({'role': 'user', 'content': query})
        
        response = ollama.chat(model=self.ollama_model, messages=messages)
        answer = response['message']['content']
        
        # Step 4: RLHF grading
        quality_grade = self._grade_response(query, answer, context_docs)
        
        # Update RLHF thresholds based on grade
        self._update_persona_thresholds(quality_grade)
        
        return {
            'response': answer,
            'context_used': context_docs,
            'quality_grade': quality_grade,
            'retrieval_method': context_docs[0]['retrieval_method'] if context_docs else None
        }
    
    def _format_context(self, docs: list) -> str:
        """Render retrieved papers as a plain-text block for the prompt"""
        return "\n\n".join(
            f"{doc['title']} ({doc.get('year', 'n.d.')})\n{doc['abstract']}"
            for doc in docs
        )
    
    def _build_persona_prompt(self, context: str) -> str:
        """
        Build system prompt from persona configuration.
        This is the 'coloring' step mentioned in the architecture.
        """
        base_template = self.persona_config['system_prompt_template']
        
        # Insert context if available
        if context:
            base_template += f"\n\nRelevant Research Context:\n{context}"
        
        # Add persona modifiers based on RLHF values
        formality = self.persona_config['rlhf_thresholds']['formality_level']
        if formality > 0.7:
            base_template += "\n\nUse academic, formal language with proper citations."
        else:
            base_template += "\n\nExplain concepts clearly and conversationally."
        
        return base_template
    
    def _grade_response(self, query: str, response: str, context: list) -> float:
        """
        RLHF grading: 0 (needs improvement) to 1 (excellent).
        In production, this would be human feedback, but we start with heuristics.
        """
        
        # Heuristic checks:
        # 1. Did we use retrieved context?
        used_context = any(
            doc['title'].lower() in response.lower() 
            for doc in context
        ) if context else True
        
        # 2. Is response substantive (not too short)?
        is_substantive = len(response.split()) > 50
        
        # 3. Does response directly address query?
        query_terms = set(query.lower().split())
        response_terms = set(response.lower().split())
        # Guard against an empty query to avoid dividing by zero
        overlap = len(query_terms & response_terms) / len(query_terms) if query_terms else 0.0
        
        # Weighted score
        score = (
            0.4 * float(used_context) +
            0.3 * float(is_substantive) +
            0.3 * overlap
        )
        
        return min(1.0, score)
    
    def _update_persona_thresholds(self, quality_grade: float):
        """
        Update RLHF thresholds based on response quality.
        This is the adaptive learning mechanism.
        """
        
        # If grade < 0.5, we need more context
        if quality_grade < 0.5:
            self.persona_config['rlhf_thresholds']['retrieval_required'] += 0.05
        else:
            # Successful response, can relax threshold slightly
            self.persona_config['rlhf_thresholds']['retrieval_required'] -= 0.02
        
        # Clamp values
        self.persona_config['rlhf_thresholds']['retrieval_required'] = max(
            0.0, 
            min(1.0, self.persona_config['rlhf_thresholds']['retrieval_required'])
        )
        
        # Save updated config to the same file it was loaded from
        with open(self.persona_config_path, 'w') as f:
            json.dump(self.persona_config, f, indent=2)

Key Insight: The persona adapts over time. If vero-eval (which we'll integrate next) shows poor performance, these thresholds shift to require more evidence before responding.
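
The agent reads several keys from data/persona.json, but the file itself isn't shown. Here is a minimal sketch of a starting configuration. The key names are the ones the code in this guide reads (retrieval_limit is used by the feedback loop in Part 6); every value, including the prompt text, is a placeholder assumption to tune for your own setup.

```python
import json
from pathlib import Path

# Placeholder starting values; only the key names come from the agent code.
DEFAULT_PERSONA = {
    "system_prompt_template": (
        "You are a research assistant. Ground your answers in the provided "
        "papers and say so when the context is insufficient."
    ),
    "recent_success_rate": 1.0,
    "rlhf_thresholds": {
        "retrieval_required": 0.5,
        "formality_level": 0.8,
        "retrieval_limit": 5,
    },
}


def write_default_persona(path: Path) -> dict:
    """Write the default persona config to `path`, creating parent directories."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(DEFAULT_PERSONA, indent=2), encoding="utf-8")
    return DEFAULT_PERSONA


if __name__ == "__main__":
    write_default_persona(Path("data/persona.json"))
```

Starting retrieval_required at 0.5 puts the agent right at the decision boundary, so the first few graded responses determine which way it drifts.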

Part 5: Integrating vero-eval for Rigorous Testing

This is where evaluation comes in. vero-eval goes beyond a single accuracy number: it tests edge cases, persona stress scenarios, and real-world failure modes.

vero-eval Testing Framework for AI Research Assistant

Installing and Configuring vero-eval

bash
# Install vero-eval
pip install vero-eval

# Initialize evaluation directory
mkdir -p evaluation/datasets evaluation/results

Generating a Research-Specific Test Dataset

vero-eval can generate test datasets tailored to your domain:

python
# evaluation/generate_test_dataset.py

from vero.test_dataset_generator import generate_and_save
from pathlib import Path

def generate_research_test_dataset():
    """
    Generate challenging test queries for research assistant.
    vero-eval creates persona-based edge cases automatically.
    """
    
    # Point to your research papers directory
    data_path = Path('data/research_papers')
    
    # Define the use case
    use_case = """
    This is a research assistant that helps academics:
    - Find relevant papers on specific topics
    - Understand connections between research areas
    - Get summaries of complex papers
    - Discover citation networks
    - Answer technical questions about methodologies
    
    Edge cases to test:
    - Queries about very recent papers (after knowledge cutoff)
    - Multi-hop reasoning (papers that cite papers that discuss X)
    - Ambiguous author names
    - Requests for specific experimental results
    - Cross-domain queries (e.g., physics papers relevant to biology)
    """
    
    # Generate dataset with persona variations
    generate_and_save(
        data_path=str(data_path),
        usecase=use_case,
        save_path_dir='evaluation/datasets/research_assistant_v1',
        n_queries=150,  # Generate 150 test queries
        
        # Persona variations
        personas=[
            {
                'name': 'PhD Student',
                'characteristics': 'Detail-oriented, asks follow-up questions, wants methodology details'
            },
            {
                'name': 'Senior Researcher',
                'characteristics': 'Broad queries, interested in connections, asks about citations'
            },
            {
                'name': 'Industry Practitioner',
                'characteristics': 'Practical focus, wants applicable results, less theory'
            }
        ],
        
        # vero-eval will use Ollama for generation
        llm_provider='ollama',
        model_name='mistral'
    )
    
    print("✓ Generated test dataset with persona variations")
    print("  Check: evaluation/datasets/research_assistant_v1/")

if __name__ == "__main__":
    generate_research_test_dataset()

Run this:

bash
python evaluation/generate_test_dataset.py

This creates a JSON file with queries like:

json
{
  "query": "What papers discuss attention mechanisms in the context of graph neural networks published after 2022?",
  "persona": "Senior Researcher",
  "expected_characteristics": ["multi-hop", "temporal_constraint", "domain_crossing"],
  "ground_truth_chunk_ids": ["paper_47", "paper_89", "paper_102"],
  "complexity_score": 0.85
}
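
Since these records are generated by an LLM, it can be worth sanity-checking them before running a long evaluation. The sketch below checks only the fields the evaluation script reads; validate_test_query is a hypothetical helper, and the assumption that complexity_score lies in [0, 1] is inferred from the sample above rather than from vero-eval's documentation.

```python
# Fields the evaluation loop reads, with the types seen in the sample record
REQUIRED_FIELDS = {
    "query": str,
    "persona": str,
    "ground_truth_chunk_ids": list,
}


def validate_test_query(record: dict) -> list:
    """Return a list of human-readable problems; an empty list means the record looks usable."""
    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            problems.append(f"{field} should be {expected_type.__name__}")
    score = record.get("complexity_score")
    # Assumption: complexity scores are normalized to the range [0, 1]
    if score is not None and not (0.0 <= score <= 1.0):
        problems.append("complexity_score should be in [0, 1]")
    return problems


if __name__ == "__main__":
    sample = {
        "query": "What papers discuss attention mechanisms?",
        "persona": "Senior Researcher",
        "ground_truth_chunk_ids": ["paper_47", "paper_89"],
        "complexity_score": 0.85,
    }
    print(validate_test_query(sample))
```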

Running the Evaluation Suite

Now we test our system against this dataset:

python
# evaluation/run_evaluation.py

import json
import sys
from pathlib import Path

import numpy as np
from vero.evaluator import Evaluator
from vero.metrics import (
    PrecisionMetric, RecallMetric, SufficiencyMetric,
    FaithfulnessMetric, BERTScoreMetric, RougeMetric,
    MRRMetric, MAPMetric, NDCGMetric
)

# reasoning_agent.py lives in scripts/, so add that directory to the import path
sys.path.insert(0, str(Path(__file__).resolve().parent.parent / 'scripts'))
from reasoning_agent import PersonaReasoningAgent

def run_full_evaluation():
    """
    Run comprehensive evaluation using vero-eval framework.
    Tests both retrieval and generation quality.
    """
    
    # Initialize our system
    agent = PersonaReasoningAgent()
    
    # Load test dataset
    with open('evaluation/datasets/research_assistant_v1/queries.json') as f:
        test_queries = json.load(f)
    
    # Initialize vero-eval
    evaluator = Evaluator(
        test_dataset=test_queries,
        trace_db_path='evaluation/trace.db'  # Logs all queries
    )
    
    # Define evaluation metrics
    retrieval_metrics = [
        PrecisionMetric(k=5),
        RecallMetric(k=5),
        SufficiencyMetric(),  # Are retrieved docs sufficient to answer?
    ]
    
    generation_metrics = [
        FaithfulnessMetric(),  # Is response faithful to retrieved docs?
        BERTScoreMetric(),     # Semantic similarity to reference answers
        RougeMetric()          # Token overlap with references
    ]
    
    ranking_metrics = [
        MRRMetric(),  # Mean Reciprocal Rank
        MAPMetric(),  # Mean Average Precision
        NDCGMetric()  # Normalized Discounted Cumulative Gain
    ]
    
    results = {
        'retrieval': {},
        'generation': {},
        'ranking': {},
        'per_persona': {}
    }
    
    # Run evaluation for each query
    for query_data in test_queries:
        query = query_data['query']
        persona = query_data['persona']
        ground_truth = query_data['ground_truth_chunk_ids']
        
        # Generate response using our system
        response_data = agent.generate_response(query)
        
        # Extract retrieved document IDs. These must use the same identifiers
        # as ground_truth_chunk_ids; the title is only a fallback.
        retrieved_ids = [
            doc.get('paper_id', doc['title']) 
            for doc in response_data['context_used']
        ]
        
        # Log to vero-eval's trace database
        evaluator.log_query(
            query=query,
            retrieved_docs=retrieved_ids,
            generated_response=response_data['response'],
            metadata={'persona': persona}
        )
        
        # Evaluate retrieval and ranking (this assumes the ranking metrics
        # take the same compute() arguments as the retrieval metrics)
        for category, metrics in (('retrieval', retrieval_metrics),
                                  ('ranking', ranking_metrics)):
            for metric in metrics:
                score = metric.compute(
                    retrieved=retrieved_ids,
                    relevant=ground_truth
                )
                
                metric_name = metric.__class__.__name__
                results[category].setdefault(metric_name, []).append(score)
        
        # Evaluate generation
        for metric in generation_metrics:
            score = metric.compute(
                generated=response_data['response'],
                reference=query_data.get('reference_answer', ''),
                context=response_data['context_used']
            )
            
            metric_name = metric.__class__.__name__
            results['generation'].setdefault(metric_name, []).append(score)
        
        # Track per-persona performance
        if persona not in results['per_persona']:
            results['per_persona'][persona] = {
                'precision': [],
                'faithfulness': []
            }
        
        results['per_persona'][persona]['precision'].append(
            results['retrieval']['PrecisionMetric'][-1]
        )
        results['per_persona'][persona]['faithfulness'].append(
            results['generation']['FaithfulnessMetric'][-1]
        )
    
    # Aggregate results (cast to float so the values serialize cleanly to JSON)
    for category in ['retrieval', 'generation', 'ranking']:
        for metric_name, scores in results[category].items():
            results[category][metric_name] = {
                'mean': float(np.mean(scores)),
                'min': float(min(scores)),
                'max': float(max(scores)),
                'std': float(np.std(scores))
            }
    
    # Save results
    with open('evaluation/results/full_evaluation.json', 'w') as f:
        json.dump(results, f, indent=2)
    
    print("✓ Evaluation complete!")
    print(f"  Retrieval Precision@5: {results['retrieval']['PrecisionMetric']['mean']:.3f}")
    print(f"  Retrieval Recall@5: {results['retrieval']['RecallMetric']['mean']:.3f}")
    print(f"  Generation Faithfulness: {results['generation']['FaithfulnessMetric']['mean']:.3f}")
    
    return results

if __name__ == "__main__":
    results = run_full_evaluation()

Run the evaluation:

bash
python evaluation/run_evaluation.py

Generating Performance Reports

vero-eval includes a report generator:

python
from vero.report import ReportGenerator

# Generate comprehensive HTML report
generator = ReportGenerator(
    trace_db_path='evaluation/trace.db',
    results_path='evaluation/results/full_evaluation.json'
)

generator.generate_report(
    output_path='evaluation/results/performance_report.html',
    include_sections=[
        'executive_summary',
        'retrieval_analysis',
        'generation_analysis',
        'persona_breakdown',
        'failure_cases',
        'recommendations'
    ]
)

print("✓ Report generated: evaluation/results/performance_report.html")

This creates an interactive HTML report showing:

Overall metrics with confidence intervals
Per-persona performance breakdown
Failure case analysis (queries where system performed poorly)
Recommendations for improvement
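
To sanity-check the confidence intervals in the report, you can approximate them by hand from the per-query score lists. The following is a minimal sketch, assuming a normal-approximation interval; the helper name `mean_with_ci` and the sample scores are made up for illustration, and this is not necessarily how vero-eval computes its intervals.

```python
import math
import statistics

def mean_with_ci(scores: list[float], z: float = 1.96) -> tuple[float, float, float]:
    """Return (mean, lower, upper) using a normal-approximation interval.

    z = 1.96 corresponds to a 95% interval. With very few samples the
    interval is only a rough guide.
    """
    if not scores:
        raise ValueError("scores must not be empty")
    mean = statistics.fmean(scores)
    if len(scores) < 2:
        # A single score has no spread to estimate
        return mean, mean, mean
    stderr = statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, mean - z * stderr, mean + z * stderr

# Illustrative per-query precision scores (not real results)
mean, low, high = mean_with_ci([0.6, 0.8, 0.7, 0.9, 0.5])
print(f"precision: {mean:.3f} (95% CI {low:.3f} to {high:.3f})")
```

A wide interval usually means the test set is too small to tell two configurations apart, which is worth knowing before adjusting any thresholds.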

Part 6: The RLHF Feedback Loop

Now we close the loop: use vero-eval results to update the persona's RLHF thresholds:

python
# evaluation/update_persona_from_results.py

import json

# Grounding instruction appended to the system prompt when faithfulness is low
CITATION_INSTRUCTION = (
    "\n\nIMPORTANT: Always cite specific papers when making claims. "
    "Do not speculate beyond what the retrieved papers state."
)

def update_persona_thresholds(evaluation_results: dict):
    """
    Analyze vero-eval results and adjust persona thresholds.
    This is the core RLHF mechanism.
    """

    # Load current persona config
    with open('data/persona.json') as f:
        persona_config = json.load(f)

    thresholds = persona_config['rlhf_thresholds']

    # Analyze retrieval performance
    retrieval_recall = evaluation_results['retrieval']['RecallMetric']['mean']

    if retrieval_recall < 0.6:
        # Low recall → need to retrieve more documents
        thresholds['retrieval_limit'] += 2
        # Cap at 1.0 (assumes a 0-1 scale) so repeated runs stay in range
        thresholds['retrieval_required'] = min(
            1.0, thresholds['retrieval_required'] + 0.1
        )

        print("⚠️  Low recall detected. Increasing retrieval aggressiveness.")

    # Analyze generation faithfulness
    faithfulness = evaluation_results['generation']['FaithfulnessMetric']['mean']

    if faithfulness < 0.7:
        # Responses not faithful to sources → need stronger grounding
        thresholds['minimum_context_overlap'] = 0.4
        # Append the instruction only once, otherwise every run grows the prompt
        if CITATION_INSTRUCTION not in persona_config['system_prompt_template']:
            persona_config['system_prompt_template'] += CITATION_INSTRUCTION

        print("⚠️  Low faithfulness detected. Strengthening citation requirements.")

    # Per-persona adjustments
    for persona_name, metrics in evaluation_results['per_persona'].items():
        if not metrics['precision']:
            continue  # No queries were evaluated for this persona

        avg_precision = sum(metrics['precision']) / len(metrics['precision'])

        if avg_precision < 0.5:
            print(f"⚠️  {persona_name} persona underperforming (Precision: {avg_precision:.2f})")

            # Could adjust persona-specific prompts here
            # For now, log for manual review

    # Save updated config
    with open('data/persona.json', 'w') as f:
        json.dump(persona_config, f, indent=2)

    print("✓ Persona thresholds updated based on evaluation results")

# Usage after evaluation
if __name__ == "__main__":
    with open('evaluation/results/full_evaluation.json') as f:
        results = json.load(f)

    update_persona_thresholds(results)

The workflow becomes:

Run system on test queries
vero-eval measures performance
Script analyzes metrics
Persona thresholds adjust automatically
Re-evaluate to confirm improvement

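This borrows the feedback-loop idea from reinforcement learning from human feedback (RLHF), with two differences: no model weights are updated, only the persona configuration, and the signal comes from rigorous automated evaluation rather than ad-hoc human ratings.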
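
The five-step workflow above can be sketched as a bounded loop. This is a toy illustration, not part of the project code: `feedback_loop`, `fake_evaluate`, and `bump_limit` are hypothetical names, and the fake evaluator simply pretends that recall rises with the retrieval limit. In the real system, `evaluate` would run the evaluation script and `adjust` would apply the threshold updates.

```python
from typing import Callable

def feedback_loop(
    evaluate: Callable[[dict], dict],
    adjust: Callable[[dict, dict], dict],
    config: dict,
    target_recall: float = 0.6,
    max_rounds: int = 5,
) -> tuple[dict, list]:
    """Evaluate, adjust, and re-evaluate until recall reaches the target
    or the round budget is exhausted."""
    history = []
    for round_no in range(1, max_rounds + 1):
        metrics = evaluate(config)
        history.append((round_no, metrics["recall"], dict(config)))
        if metrics["recall"] >= target_recall:
            break  # Good enough: stop adjusting
        config = adjust(config, metrics)
    return config, history

# Toy stand-ins: recall improves as more documents are retrieved
def fake_evaluate(config: dict) -> dict:
    return {"recall": min(1.0, 0.1 * config["retrieval_limit"])}

def bump_limit(config: dict, metrics: dict) -> dict:
    return {**config, "retrieval_limit": config["retrieval_limit"] + 2}

final_config, history = feedback_loop(fake_evaluate, bump_limit, {"retrieval_limit": 3})
for round_no, recall, cfg in history:
    print(f"round {round_no}: recall={recall:.2f} with {cfg}")
```

The round budget matters: without it, a metric that never reaches the target would keep inflating the retrieval limit indefinitely.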

Part 7: Integrating with the Frontend

Now we wire this into the Next.js chat interface. Update src/app/api/chat/route.ts:

typescript
import { NextRequest } from 'next/server'
import { spawn } from 'child_process'
import path from 'path'

export async function POST(request: NextRequest) {
  const { message, messages, graphRAG = true } = await request.json()

  if (!graphRAG) {
    // Regular chat without RAG (handleRegularChat is assumed to be defined elsewhere)
    return handleRegularChat(message, messages)
  }

  // Call our Python reasoning agent
  const agentPath = path.join(process.cwd(), 'scripts', 'reasoning_agent.py')

  const result = await new Promise<{response: string, context: any[]}>((resolve, reject) => {
    const pythonProcess = spawn('python3', [
      agentPath,
      'generate',
      JSON.stringify({ query: message, chat_history: messages })
    ])

    let stdout = ''
    let stderr = ''

    pythonProcess.stdout.on('data', (data) => {
      stdout += data.toString()
    })

    pythonProcess.stderr.on('data', (data) => {
      stderr += data.toString()
    })

    // Fires when the process cannot be started at all (e.g. python3 not on PATH)
    pythonProcess.on('error', (err) => {
      reject(new Error(`Failed to start agent: ${err.message}`))
    })

    pythonProcess.on('close', (code) => {
      if (code === 0) {
        try {
          const parsed = JSON.parse(stdout)
          resolve(parsed)
        } catch (e) {
          reject(new Error(`Failed to parse response: ${e}`))
        }
      } else {
        reject(new Error(`Agent failed: ${stderr}`))
      }
    })
  })

  // Send the complete response to the client as a single-chunk stream
  const stream = new ReadableStream({
    start(controller) {
      // Send response with context metadata
      const formatted = `${result.response}\n\n---\n**Sources:**\n${
        result.context.map((doc, i) => 
          `[${i+1}] ${doc.title} (${doc.year})`
        ).join('\n')
      }`

      controller.enqueue(new TextEncoder().encode(formatted))
      controller.close()
    }
  })

  return new Response(stream, {
    headers: {
      'Content-Type': 'text/plain; charset=utf-8',
    },
  })
}

Update the chat UI to show retrieval metadata:

typescript
// In src/components/Chat.tsx

{message.role === 'assistant' && message.context && (
  <div className="mt-2 text-xs text-muted-foreground">
    <details>
      <summary className="cursor-pointer hover:text-foreground">
        📚 {message.context.length} sources retrieved
      </summary>
      <ul className="mt-2 space-y-1">
        {message.context.map((doc, i) => (
          <li key={i} className="flex items-center gap-2">
            <span className="font-mono">
              {doc.retrieval_method === 'vector_search' ? '🔍' : '🔗'}
            </span>
            <span>{doc.title}</span>
            <span className="text-muted-foreground">
              (relevance: {(doc.relevance_score * 100).toFixed(0)}%)
            </span>
          </li>
        ))}
      </ul>
    </details>
  </div>
)}

Now users can see which papers were retrieved and how (vector search vs. citation traversal).

Part 8: Running the Complete System

Setup Script

Create setup.sh:

bash
#!/bin/bash

# Stop at the first failing step instead of continuing with a half-configured system
set -e

echo "🔬 Setting up Research Assistant GraphRAG System"

# 1. Install Python dependencies
echo "📦 Installing Python dependencies..."
pip install -r requirements.txt

# 2. Start Neo4j
echo "🗄️  Starting Neo4j..."
docker-compose up -d neo4j

# Wait for Neo4j to be ready
echo "⏳ Waiting for Neo4j..."
until curl -s http://localhost:7474 > /dev/null; do
  sleep 2
done
echo "✓ Neo4j ready"

# 3. Start Ollama
echo "🤖 Checking Ollama..."
if ! command -v ollama &> /dev/null; then
    echo "Please install Ollama from https://ollama.ai"
    exit 1
fi

# Start the server only if one is not already listening on the default port,
# then wait until it answers instead of sleeping for a fixed time
if ! curl -s http://localhost:11434 > /dev/null; then
    ollama serve &
    until curl -s http://localhost:11434 > /dev/null; do
        sleep 1
    done
fi

# Pull required models
ollama pull mistral
ollama pull nomic-embed-text

# 4. Initialize Neo4j graph schema
echo "📊 Initializing graph schema..."
python scripts/init_graph_schema.py

# 5. Ingest sample research papers
echo "📚 Ingesting sample papers..."
python scripts/ingest_research_data.py --directory data/sample_papers

# 6. Generate test dataset
echo "🧪 Generating evaluation dataset..."
python evaluation/generate_test_dataset.py

# 7. Run initial evaluation
echo "📈 Running initial evaluation..."
python evaluation/run_evaluation.py

# 8. Start Next.js frontend
echo "🌐 Starting frontend..."
npm install
npm run dev &

echo ""
echo "✅ Setup complete!"
echo ""
echo "🔗 Access points:"
echo "   Frontend: http://localhost:3000"
echo "   Neo4j Browser: http://localhost:7474"
echo "   Evaluation Reports: evaluation/results/"
echo ""
echo "📖 Next steps:"
echo "   1. Add your research papers to data/research_papers/"
echo "   2. Run: python scripts/ingest_research_data.py --directory data/research_papers"
echo "   3. Chat with your research assistant at localhost:3000"
echo "   4. Check evaluation results in evaluation/results/"

Run it:

bash
chmod +x setup.sh
./setup.sh

Part 9: Practical Use Cases and Patterns

Use Case 1: Literature Review Assistant

python
# Example query patterns for literature reviews

queries = [
    "What are the main approaches to attention mechanisms in transformers since 2020?",
    "Find papers that cite Vaswani et al. 2017 and discuss efficiency improvements",
    "What experimental setups are common in graph neural network papers?",
    "Compare the methodologies used in top-cited RAG papers"
]

for query in queries:
    response = agent.generate_response(query)

    # System automatically:
    # 1. Retrieves relevant papers using hybrid search
    # 2. Traverses citation network
    # 3. Formats response with proper attributions
    # 4. Logs everything to vero-eval trace DB

Use Case 2: Cross-Domain Research Discovery

python
# Finding connections between domains

query = """
Are there any techniques from computer vision that have been 
successfully applied to natural language processing in the last 3 years?
"""

# The graph traversal will:
# 1. Find CV papers discussing specific techniques
# 2. Find NLP papers citing those CV papers
# 3. Identify the bridging concepts
# 4. Present a coherent narrative

response = agent.generate_response(query)

Use Case 3: Methodology Extraction

python
# Extracting specific methodological details

query = """
What evaluation metrics are most commonly used in papers about 
few-shot learning for NLP tasks?
"""

# Behind the scenes:
# 1. Retrieve few-shot NLP papers
# 2. Extract methodology sections (using LLM)
# 3. Aggregate metrics across papers
# 4. Present frequency analysis
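
Steps 3 and 4 above (aggregation and frequency analysis) need nothing beyond the standard library once the metrics have been extracted per paper. Here is a minimal sketch, assuming each paper is a dict with a `metrics` list; that field name, the helper `metric_frequencies`, and the sample papers are illustrative assumptions rather than part of the project's schema.

```python
from collections import Counter

def metric_frequencies(papers: list[dict]) -> list[tuple[str, int]]:
    """Count how many papers report each metric, most common first."""
    counts = Counter()
    for paper in papers:
        # Deduplicate within a paper and normalize case so "F1" and "f1" match
        counts.update({m.strip().lower() for m in paper.get("metrics", [])})
    return counts.most_common()

# Made-up extraction output for three papers
papers = [
    {"title": "Paper A", "metrics": ["Accuracy", "F1"]},
    {"title": "Paper B", "metrics": ["accuracy", "BLEU"]},
    {"title": "Paper C", "metrics": ["F1", "accuracy", "Accuracy"]},
]

for name, count in metric_frequencies(papers):
    print(f"{name}: {count}/{len(papers)} papers")
```

Counting papers rather than raw mentions keeps a single paper that repeats a metric many times from dominating the result.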

Part 10: Measuring Success with vero-eval

After running the system for a while, check the vero-eval dashboard:

python
# evaluation/generate_dashboard.py

from vero.dashboard import create_dashboard
from vero.trace_db import TraceDB

# Load trace database
trace_db = TraceDB('evaluation/trace.db')

# Create interactive dashboard
create_dashboard(
    trace_db=trace_db,
    output_path='evaluation/dashboard.html',
    metrics=[
        'retrieval_precision',
        'retrieval_recall',
        'generation_faithfulness',
        'response_time',
        'context_sufficiency'
    ],
    groupby=['persona', 'query_complexity']
)

This generates an interactive Plotly dashboard showing:

Metric trends over time (is the system improving?)
Persona performance comparison (which user types are we serving well?)
Query complexity vs. accuracy (where do we struggle?)
Retrieval method effectiveness (vector vs. graph traversal success rates)
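
The last comparison, retrieval method effectiveness, is easy to reproduce outside the dashboard. The sketch below assumes each retrieved document carries the `retrieval_method` and `relevance_score` fields that the chat UI displays. The `citation_traversal` label, the 0.5 success threshold, and the sample records are assumptions for illustration.

```python
from collections import defaultdict

def success_rate_by_method(docs: list[dict], threshold: float = 0.5) -> dict[str, float]:
    """Fraction of retrieved documents per method whose relevance meets the threshold."""
    totals: dict[str, int] = defaultdict(int)
    hits: dict[str, int] = defaultdict(int)
    for doc in docs:
        method = doc["retrieval_method"]
        totals[method] += 1
        if doc["relevance_score"] >= threshold:
            hits[method] += 1
    return {method: hits[method] / totals[method] for method in totals}

# Made-up retrieval records
docs = [
    {"retrieval_method": "vector_search", "relevance_score": 0.82},
    {"retrieval_method": "vector_search", "relevance_score": 0.41},
    {"retrieval_method": "citation_traversal", "relevance_score": 0.67},
    {"retrieval_method": "citation_traversal", "relevance_score": 0.74},
]

rates = success_rate_by_method(docs)
for method, rate in rates.items():
    print(f"{method}: {rate:.0%} of retrieved documents were relevant")
```

If one method consistently lags, that is a hint to rebalance how many results each retrieval path contributes.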

Advanced Patterns and Optimizations

Pattern 1: Caching Embeddings

For production, cache embeddings to avoid recomputation:

python
import hashlib
import pickle
from pathlib import Path

import ollama

class EmbeddingCache:
    def __init__(self, cache_dir: Path = Path('cache/embeddings')):
        self.cache_dir = cache_dir
        self.cache_dir.mkdir(parents=True, exist_ok=True)

    def get_embedding(self, text: str, model: str = 'nomic-embed-text') -> list[float]:
        # Create hash of text for cache key (MD5 is fine here: not security-sensitive)
        cache_key = hashlib.md5(text.encode()).hexdigest()
        cache_path = self.cache_dir / f"{cache_key}_{model}.pkl"

        if cache_path.exists():
            with open(cache_path, 'rb') as f:
                return pickle.load(f)

        # Generate new embedding
        embedding = ollama.embeddings(model=model, prompt=text)['embedding']

        # Cache it
        with open(cache_path, 'wb') as f:
            pickle.dump(embedding, f)

        return embedding

Pattern 2: Batch Processing for Large Collections

When ingesting 1000+ papers:

python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

# driver, extract_paper_metadata and create_paper_node come from the ingestion script

def ingest_batch(papers: list[Path], batch_size: int = 10):
    """Process papers in batches to manage memory"""

    for i in range(0, len(papers), batch_size):
        batch = papers[i:i+batch_size]

        # Extract metadata in parallel; list() collects results in input order
        with ThreadPoolExecutor(max_workers=batch_size) as executor:
            metadata_list = list(executor.map(extract_paper_metadata, batch))

        # Insert into Neo4j in single transaction
        with driver.session() as session:
            with session.begin_transaction() as tx:
                for metadata, pdf_path in zip(metadata_list, batch):
                    create_paper_node(tx, metadata, pdf_path)

                tx.commit()

        # The final batch may be short, so cap the progress count
        print(f"✓ Processed {min(i + batch_size, len(papers))}/{len(papers)} papers")

Pattern 3: Incremental Evaluation

Don't wait to run full evaluation. Track metrics continuously:

python
from collections import deque

class ContinuousEvaluator:
    def __init__(self, alert_threshold: float = 0.6):
        self.alert_threshold = alert_threshold
        # Rolling window: the deque drops the oldest score once 50 are stored
        self.recent_scores = deque(maxlen=50)

    def evaluate_response(self, query: str, response: dict):
        # Quick evaluation on the fly
        score = self._quick_score(response)
        self.recent_scores.append(score)

        # Alert if average drops (wait for 10 scores to avoid noisy alerts)
        if len(self.recent_scores) >= 10:
            avg = sum(self.recent_scores) / len(self.recent_scores)
            if avg < self.alert_threshold:
                self._send_alert(avg)

    def _quick_score(self, response: dict) -> float:
        # Lightweight scoring: 70% for having any context, 30% for response length
        has_context = len(response['context_used']) > 0
        response_length = len(response['response'].split())

        return 0.7 * has_context + 0.3 * min(1.0, response_length / 100)

    def _send_alert(self, avg: float):
        # Placeholder: replace with your email/Slack/pager integration
        print(f"⚠️  Rolling quality score {avg:.2f} is below {self.alert_threshold:.2f}")

Troubleshooting Common Issues

Issue 1: Neo4j Connection Errors

python
# Test Neo4j connection
from neo4j import GraphDatabase

def test_connection():
    try:
        # The context manager closes the driver even if the query fails
        with GraphDatabase.driver(
            "bolt://localhost:7687",
            auth=("neo4j", "research2025")
        ) as driver:
            with driver.session() as session:
                record = session.run("RETURN 1 AS num").single()
                if record["num"] == 1:
                    print("✓ Neo4j connection successful")

    except Exception as e:
        print(f"✗ Connection failed: {e}")
        print("  Make sure Neo4j is running: docker ps")

if __name__ == "__main__":
    test_connection()

Issue 2: Ollama Model Not Found

bash
# Check available models
ollama list

# Pull missing models
ollama pull mistral
ollama pull nomic-embed-text

# Verify they work
ollama run mistral "Test query"

Issue 3: Low Retrieval Scores

Check your embeddings:

python
# Verify embeddings are being generated correctly
from ingest_research_data import ResearchGraphBuilder

builder = ResearchGraphBuilder()

# Test on a sample paper
test_text = "Transformers are a type of neural network architecture..."
embedding = builder.graph_rag.generate_embedding(test_text)

# Expect 768 with nomic-embed-text; 4096 means a Mistral-family model produced the embedding
print(f"Embedding dimension: {len(embedding)}")
print(f"Sample values: {embedding[:5]}")
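
Dimension alone does not tell you whether the embeddings are meaningful. A quick follow-up check is to confirm that related texts score higher than unrelated ones under cosine similarity. The sketch below uses tiny hand-written vectors so it runs without Ollama; in practice you would pass in embeddings produced by the same call as above. The helper name and the toy vectors are illustrative assumptions.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError(f"dimension mismatch: {len(a)} vs {len(b)}")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # An all-zero vector usually means the embedding call failed silently
        raise ValueError("zero vector passed to cosine_similarity")
    return dot / (norm_a * norm_b)

# Toy 3-dimensional vectors standing in for real embeddings
related_a = [0.9, 0.1, 0.0]
related_b = [0.8, 0.2, 0.1]
unrelated = [0.0, 0.1, 0.9]

print(f"related:   {cosine_similarity(related_a, related_b):.3f}")
print(f"unrelated: {cosine_similarity(related_a, unrelated):.3f}")
```

If related and unrelated pairs score about the same on real embeddings, look for a model mismatch between ingestion and query time, or for text that was truncated or left empty before embedding.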

Conclusion and Next Steps

You now have a production-ready Research Assistant with:

✅ Local-first architecture (no API costs, full privacy)
✅ Neo4j knowledge graph (papers, authors, concepts, citations)
✅ Hybrid retrieval (vector similarity + graph traversal)
✅ Persona-driven responses with RLHF adaptation
✅ Comprehensive evaluation via vero-eval framework
✅ Automated improvement through feedback loops

Recommended Next Steps:

Expand the Dataset: Ingest your actual research papers

bash
python scripts/ingest_research_data.py --directory ~/Documents/Research

Run Weekly Evaluations: Set up a cron job

bash
0 2 * * 0 cd /path/to/research-assistant && python evaluation/run_evaluation.py

Fine-tune Personas: Create persona configs for different user types:

- PhD Student persona (detail-oriented, wants methodology)
- Senior Researcher persona (big picture, cross-domain)
- Industry persona (practical applications)

Integrate Additional Sources:

- arXiv API for latest papers
- Connected Papers for visualization
- Semantic Scholar for citation data

Scale Up:

- Use a vector database (Pinecone, Weaviate) for 10K+ papers
- Implement query result caching
- Add paper summarization pipeline
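
Of the scale-up items, query result caching is the quickest to prototype. The following is a minimal in-memory sketch, assuming an LRU policy with a time-to-live; the class name `QueryCache`, its parameters, and the sample query are hypothetical, and a production deployment would more likely use Redis or Memcached.

```python
import time
from collections import OrderedDict
from typing import Any, Callable, Optional

class QueryCache:
    """Small in-memory LRU cache with per-entry expiry."""

    def __init__(self, max_entries: int = 256, ttl_seconds: float = 3600.0,
                 clock: Callable[[], float] = time.monotonic):
        self.max_entries = max_entries
        self.ttl_seconds = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._entries = OrderedDict()

    @staticmethod
    def _key(query: str) -> str:
        # Normalize case and whitespace so trivially different queries share an entry
        return " ".join(query.lower().split())

    def get(self, query: str) -> Optional[Any]:
        key = self._key(query)
        entry = self._entries.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if self._clock() - stored_at > self.ttl_seconds:
            del self._entries[key]  # expired
            return None
        self._entries.move_to_end(key)  # mark as recently used
        return value

    def put(self, query: str, value: Any) -> None:
        key = self._key(query)
        self._entries[key] = (self._clock(), value)
        self._entries.move_to_end(key)
        while len(self._entries) > self.max_entries:
            self._entries.popitem(last=False)  # evict least recently used

cache = QueryCache(max_entries=2, ttl_seconds=60)
cache.put("What is GraphRAG?", {"response": "..."})
print(cache.get("what is   graphrag?"))
```

Cached answers go stale when new papers are ingested, so clear the cache (or keep the TTL short) after each ingestion run.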

Resources for Going Deeper

Neo4j GenAI Integration: Official Documentation
llama.cpp: Mastering Local LLM Integration
vero-eval Framework: GitHub Repository

Production Deployment Checklist

Before deploying to production, ensure you've addressed:

python
# deployment/production_checklist.py

PRODUCTION_CHECKLIST = {
    'Infrastructure': [
        '☐ Neo4j running with persistent volumes',
        '☐ Ollama configured with appropriate model cache',
        '☐ Redis/Memcached for query result caching',
        '☐ Load balancer for API endpoints',
        '☐ CDN for static assets'
    ],
    'Security': [
        '☐ API authentication implemented',
        '☐ Rate limiting configured (per user/IP)',
        '☐ Input sanitization for all user queries',
        '☐ Neo4j credentials rotated and secured',
        '☐ HTTPS enabled with valid certificates'
    ],
    'Monitoring': [
        '☐ Prometheus metrics exported',
        '☐ Grafana dashboards for system health',
        '☐ vero-eval continuous evaluation running',
        '☐ Error tracking (Sentry/Rollbar)',
        '☐ Query latency monitoring'
    ],
    'Data Management': [
        '☐ Automated backups of Neo4j database',
        '☐ Embedding cache backup strategy',
        '☐ Data retention policies defined',
        '☐ GDPR compliance for user queries',
        '☐ Paper metadata update pipeline'
    ],
    'Performance': [
        '☐ Embedding generation batched/cached',
        '☐ Neo4j indexes optimized',
        '☐ Query result caching implemented',
        '☐ Connection pooling configured',
        '☐ Async processing for long-running queries'
    ]
}

Part 11: Advanced vero-eval Techniques

Now let's dive deeper into what makes vero-eval exceptional for production AI systems.

Stress Testing with Adversarial Queries

vero-eval can generate adversarial test cases that expose edge cases:

python
# evaluation/adversarial_testing.py

import json

from vero.adversarial import AdversarialGenerator
from reasoning_agent import PersonaReasoningAgent

def run_adversarial_tests():
    """
    Generate adversarial queries designed to break the system.
    This reveals weaknesses before users find them.
    """

    agent = PersonaReasoningAgent()

    # Initialize adversarial generator
    # load_valid_queries() is not defined in this file; it must return the seed queries
    adv_gen = AdversarialGenerator(
        base_queries=load_valid_queries(),
        attack_types=[
            'jailbreak',          # Try to bypass safety guardrails
            'context_overflow',   # Queries requiring huge context
            'ambiguous_reference', # "the paper mentioned earlier" without context
            'temporal_confusion', # Mixing past/future tenses
            'multi_hop_complex',  # Require 3+ reasoning steps
            'contradictory',      # Ask for contradicting information
            'out_of_domain'       # Queries completely outside research
        ]
    )

    adversarial_queries = adv_gen.generate(n=50)
    total = len(adversarial_queries)

    failures = []

    for query_data in adversarial_queries:
        query = query_data['query']
        attack_type = query_data['attack_type']

        print(f"Testing: {attack_type} - {query[:60]}...")

        try:
            response = agent.generate_response(query)

            # Check for failure modes
            if len(response['response']) < 10:
                failures.append({
                    'query': query,
                    'attack_type': attack_type,
                    'failure_mode': 'empty_response'
                })

            # detect_hallucinations returns a list; any entry counts as a failure
            elif detect_hallucinations(
                response['response'], 
                response['context_used']
            ):
                failures.append({
                    'query': query,
                    'attack_type': attack_type,
                    'failure_mode': 'hallucination'
                })

        except Exception as e:
            failures.append({
                'query': query,
                'attack_type': attack_type,
                'failure_mode': 'exception',
                'error': str(e)
            })

    # Generate failure report
    with open('evaluation/results/adversarial_failures.json', 'w') as f:
        json.dump(failures, f, indent=2)

    print(f"\n⚠️  Found {len(failures)} failure cases out of {total} adversarial queries")
    print(f"   Failure rate: {len(failures) / max(total, 1) * 100:.1f}%")

    # Categorize failures
    failure_by_type = {}
    for failure in failures:
        attack_type = failure['attack_type']
        failure_by_type[attack_type] = failure_by_type.get(attack_type, 0) + 1

    print("\n📊 Failures by attack type:")
    for attack_type, count in sorted(failure_by_type.items(), 
                                     key=lambda x: x[1], 
                                     reverse=True):
        print(f"   {attack_type}: {count}")

    return failures

def detect_hallucinations(response: str, context_docs: list) -> list:
    """
    Detect potential hallucinations by checking if claims in response
    are supported by retrieved context.
    """

    hallucinations = []

    # Extract claims from response (sentences making factual statements)
    claims = extract_claims(response)

    # Create context text corpus (skip documents without an abstract)
    context_text = "\n".join([doc.get('abstract') or '' for doc in context_docs])
    context_tokens = set(context_text.lower().split())

    for claim in claims:
        # Check if claim is substantiated by context
        # Use simple token overlap for now (could use entailment model)
        claim_tokens = set(claim.lower().split())

        overlap = len(claim_tokens & context_tokens) / len(claim_tokens)

        if overlap < 0.3:  # Less than 30% overlap suggests hallucination
            hallucinations.append({
                'claim': claim,
                'overlap_score': overlap,
                'severity': 'high' if overlap < 0.1 else 'medium'
            })

    return hallucinations

def extract_claims(response: str) -> list[str]:
    """Extract factual claims from response."""
    # Simple heuristic: sentences with "is", "are", "shows", "demonstrates"
    claim_indicators = {'is', 'are', 'shows', 'demonstrates', 'found', 'reports'}

    claims = []
    for sent in response.split('.'):
        words = sent.lower().split()
        # Match whole words so that "this" does not count as "is",
        # and require more than 5 words for a substantial claim
        if len(words) > 5 and claim_indicators & set(words):
            claims.append(sent.strip())

    return claims

if __name__ == "__main__":
    failures = run_adversarial_tests()

Run this regularly:

bash
# Weekly adversarial testing
0 3 * * 1 cd /path/to/research-assistant && python evaluation/adversarial_testing.py

Continuous Monitoring with vero-eval

Set up real-time quality monitoring:

python
# evaluation/continuous_monitor.py

import smtplib
import sys
from datetime import datetime, timedelta
from email.mime.text import MIMEText

from vero.monitor import QualityMonitor

class ProductionMonitor:
    def __init__(self, trace_db_path: str):
        self.monitor = QualityMonitor(trace_db_path)
        self.alert_thresholds = {
            'precision_drop': 0.15,      # Alert if precision drops by 0.15 (absolute)
            'latency_spike': 2.0,        # Alert if latency > 2 seconds
            'error_rate': 0.05,          # Alert if error rate > 5%
            'faithfulness_drop': 0.20    # Alert if faithfulness drops by 0.20 (absolute)
        }

    def check_system_health(self):
        """
        Run every hour to check if system performance is degrading.
        """

        # Take one timestamp so both windows share the same reference point
        now = datetime.now()

        # Get metrics for last 24 hours
        recent_metrics = self.monitor.get_metrics(
            start_time=now - timedelta(hours=24),
            end_time=now
        )

        # Get baseline metrics (average over the six days before that)
        baseline_metrics = self.monitor.get_metrics(
            start_time=now - timedelta(days=7),
            end_time=now - timedelta(days=1)
        )

        alerts = []

        # Check for precision drop
        precision_drop = (
            baseline_metrics['precision'] - recent_metrics['precision']
        )
        if precision_drop > self.alert_thresholds['precision_drop']:
            alerts.append({
                'severity': 'high',
                'metric': 'precision',
                'message': f"Precision dropped by {precision_drop:.2%}",
                'baseline': baseline_metrics['precision'],
                'current': recent_metrics['precision']
            })

        # Check for latency spikes
        if recent_metrics['avg_latency'] > self.alert_thresholds['latency_spike']:
            alerts.append({
                'severity': 'medium',
                'metric': 'latency',
                'message': f"Average latency: {recent_metrics['avg_latency']:.2f}s",
                'baseline': baseline_metrics['avg_latency'],
                'current': recent_metrics['avg_latency']
            })

        # Check error rate
        if recent_metrics['error_rate'] > self.alert_thresholds['error_rate']:
            alerts.append({
                'severity': 'critical',
                'metric': 'error_rate',
                'message': f"Error rate: {recent_metrics['error_rate']:.2%}",
                'baseline': baseline_metrics['error_rate'],
                'current': recent_metrics['error_rate']
            })

        # Check faithfulness
        faithfulness_drop = (
            baseline_metrics['faithfulness'] - recent_metrics['faithfulness']
        )
        if faithfulness_drop > self.alert_thresholds['faithfulness_drop']:
            alerts.append({
                'severity': 'high',
                'metric': 'faithfulness',
                'message': f"Faithfulness dropped by {faithfulness_drop:.2%}",
                'baseline': baseline_metrics['faithfulness'],
                'current': recent_metrics['faithfulness']
            })

        # Send alerts if any
        if alerts:
            self.send_alerts(alerts)

        # Log to monitoring system
        self.log_health_check(recent_metrics, alerts)

        return alerts

    def send_alerts(self, alerts: list):
        """Send alerts via email/Slack/PagerDuty"""

        critical_alerts = [a for a in alerts if a['severity'] == 'critical']

        if critical_alerts:
            # Page on-call engineer
            self.page_oncall(critical_alerts)

        # Email summary
        email_body = self.format_alert_email(alerts)
        self.send_email(
            to='team@example.com',
            subject=f"🚨 Research Assistant Quality Alert - {len(alerts)} issues",
            body=email_body
        )

    def send_email(self, to: str, subject: str, body: str):
        """Send an HTML email. Assumes an SMTP relay on localhost; adjust for your setup."""
        msg = MIMEText(body, 'html')
        msg['Subject'] = subject
        msg['From'] = 'research-assistant@example.com'  # Placeholder sender address
        msg['To'] = to

        with smtplib.SMTP('localhost') as server:
            server.send_message(msg)

    def page_oncall(self, alerts: list):
        """Placeholder: connect this to your paging service (PagerDuty, Opsgenie, ...)."""
        for alert in alerts:
            print(f"[PAGE] {alert['metric']}: {alert['message']}")

    def format_alert_email(self, alerts: list) -> str:
        """Format alerts as HTML email"""

        html = """
        <h2>Research Assistant Quality Alerts</h2>
        <p>The following performance degradations were detected:</p>
        <table border="1" cellpadding="10">
            <tr>
                <th>Severity</th>
                <th>Metric</th>
                <th>Baseline</th>
                <th>Current</th>
                <th>Message</th>
            </tr>
        """

        for alert in alerts:
            severity_color = {
                'critical': '#ff0000',
                'high': '#ff6600',
                'medium': '#ffaa00'
            }[alert['severity']]

            html += f"""
            <tr>
                <td style="background-color: {severity_color}; color: white;">
                    {alert['severity'].upper()}
                </td>
                <td>{alert['metric']}</td>
                <td>{alert['baseline']:.3f}</td>
                <td>{alert['current']:.3f}</td>
                <td>{alert['message']}</td>
            </tr>
            """

        html += """
        </table>
        <p>
        <a href="http://your-monitoring-url/dashboard">View Full Dashboard</a>
        </p>
        """

        return html

    def log_health_check(self, metrics: dict, alerts: list):
        """Log to your monitoring system (Prometheus/Datadog/etc)"""

        # Example: Push to Prometheus Pushgateway
        # In production, you'd use actual client library

        print(f"[{datetime.now()}] Health Check:")
        print(f"  Precision: {metrics['precision']:.3f}")
        print(f"  Recall: {metrics['recall']:.3f}")
        print(f"  Faithfulness: {metrics['faithfulness']:.3f}")
        print(f"  Avg Latency: {metrics['avg_latency']:.2f}s")
        print(f"  Error Rate: {metrics['error_rate']:.2%}")

        if alerts:
            print(f"  ⚠️  {len(alerts)} alerts triggered")
        else:
            print("  ✓ All metrics within normal range")

# Run as scheduled job
if __name__ == "__main__":
    monitor = ProductionMonitor('evaluation/trace.db')
    alerts = monitor.check_system_health()

    if alerts:
        sys.exit(1)  # Non-zero exit code for alerting systems

Set up as cron job:

bash
# Check every hour
0 * * * * cd /path/to/research-assistant && python evaluation/continuous_monitor.py

Part 12: Scaling Beyond 10K Papers

As your research collection grows, you'll need to optimize:

1. Migrate to a Dedicated Vector Database

For 10K+ papers, vector search in Neo4j can become a bottleneck. If profiling confirms this, move the embeddings to a specialized vector DB and keep Neo4j for graph relationships:

python
# scripts/migrate_to_pinecone.py
# Note: this uses the legacy pinecone-client v2 interface (pinecone.init).
# Newer client releases replace it with a Pinecone class; adapt as needed.

import os

import ollama
import pinecone
from neo4j import GraphDatabase

def migrate_embeddings_to_pinecone():
    """
    Migrate embeddings from Neo4j to Pinecone for faster retrieval.
    Keep Neo4j for graph relationships, Pinecone for vector search.
    """

    # Initialize Pinecone
    pinecone.init(
        api_key=os.getenv("PINECONE_API_KEY"),
        environment="us-west1-gcp"
    )

    # Create index if it doesn't exist
    if "research-papers" not in pinecone.list_indexes():
        pinecone.create_index(
            name="research-papers",
            dimension=768,  # nomic-embed-text produces 768-dimensional vectors
            metric="cosine",
            pods=2,
            replicas=1,
            pod_type="p1.x1"
        )

    index = pinecone.Index("research-papers")

    # Extract embeddings from Neo4j
    driver = GraphDatabase.driver(
        "bolt://localhost:7687",
        auth=("neo4j", "research2025")
    )

    offset = 0
    with driver.session() as session:
        # Get papers in batches
        batch_size = 100

        while True:
            # ID(p) is a tie-breaker so SKIP/LIMIT pages are stable when
            # many papers share the same year
            papers = session.run("""
                MATCH (p:Paper)
                WHERE p.abstract_embedding IS NOT NULL
                RETURN p.title AS title,
                       p.abstract AS abstract,
                       p.abstract_embedding AS embedding,
                       p.year AS year,
                       ID(p) AS neo4j_id
                ORDER BY p.year DESC, ID(p)
                SKIP $offset
                LIMIT $batch_size
                """,
                offset=offset,
                batch_size=batch_size
            ).data()

            if not papers:
                break

            # Prepare vectors for Pinecone
            vectors = []
            for paper in papers:
                vectors.append({
                    'id': str(paper['neo4j_id']),
                    'values': paper['embedding'],
                    'metadata': {
                        'title': paper['title'],
                        'abstract': (paper['abstract'] or '')[:500],  # Truncate
                        'year': paper['year'],
                        'neo4j_id': paper['neo4j_id']
                    }
                })

            # Upsert to Pinecone
            index.upsert(vectors=vectors, namespace="papers")

            # Advance by the number actually migrated so the final count is exact
            offset += len(papers)
            print(f"✓ Migrated {offset} papers")

    driver.close()
    print(f"\n✅ Migration complete! {offset} papers in Pinecone")

# Update retriever to use Pinecone
class HybridRetrieverWithPinecone:
    def __init__(self, neo4j_driver, pinecone_index_name="research-papers"):
        self.neo4j_driver = neo4j_driver
        self.pinecone_index = pinecone.Index(pinecone_index_name)

    def retrieve_context(self, query: str, limit: int = 5) -> list[dict]:
        """Hybrid retrieval using Pinecone + Neo4j graph"""

        # 1. Vector search with Pinecone
        query_embedding = ollama.embeddings(
            model='nomic-embed-text',
            prompt=query
        )['embedding']

        pinecone_results = self.pinecone_index.query(
            vector=query_embedding,
            top_k=limit * 2,
            include_metadata=True,
            namespace="papers"
        )

        # 2. Get Neo4j IDs from Pinecone results
        neo4j_ids = [
            int(match['metadata']['neo4j_id'])
            for match in pinecone_results['matches']
        ]

        # 3. Enrich with graph relationships from Neo4j
        with self.neo4j_driver.session() as session:
            enriched = session.run("""
                UNWIND $neo4j_ids AS paper_id
                MATCH (p:Paper) WHERE ID(p) = paper_id
                OPTIONAL MATCH (p)<-[:AUTHORED]-(a:Author)
                OPTIONAL MATCH (p)-[:DISCUSSES]->(c:Concept)
                OPTIONAL MATCH (p)-[:CITES]->(cited:Paper)

                RETURN
                    ID(p) AS neo4j_id,
                    p.title AS title,
                    p.abstract AS abstract,
                    p.year AS year,
                    collect(DISTINCT a.name) AS authors,
                    collect(DISTINCT c.name) AS concepts,
                    collect(DISTINCT cited.title) AS citations
                """,
                neo4j_ids=neo4j_ids
            ).data()

        # Index rows by ID: row order is not guaranteed to follow Pinecone's
        # ranking, and a paper missing from Neo4j returns no row at all
        enriched_by_id = {row['neo4j_id']: row for row in enriched}

        # 4. Combine Pinecone scores with Neo4j metadata
        results = []
        for match in pinecone_results['matches']:
            metadata = match['metadata']
            neo4j_data = enriched_by_id.get(int(metadata['neo4j_id']), {})

            results.append({
                'title': neo4j_data.get('title', metadata['title']),
                'abstract': neo4j_data.get('abstract', metadata['abstract']),
                'year': neo4j_data.get('year', metadata['year']),
                'authors': neo4j_data.get('authors', []),
                'concepts': neo4j_data.get('concepts', []),
                'citations': neo4j_data.get('citations', []),
                'relevance_score': match['score'],
                'retrieval_method': 'pinecone_vector_search'
            })

        return results[:limit]

Benefits of this architecture:

Pinecone handles 10M+ vectors easily
Neo4j focuses on graph relationships (citations, authorship)
Best of both worlds: fast vector search + rich graph traversal
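
One detail worth getting right when combining the two stores is how vector matches are joined to graph rows. A minimal sketch, assuming matches and rows shaped like the dictionaries in the retriever (the `merge_by_id` helper and the sample data are hypothetical): join on `neo4j_id` rather than on list position, so a paper missing from Neo4j cannot shift metadata onto the wrong match.

```python
def merge_by_id(matches: list[dict], graph_rows: list[dict], limit: int = 5) -> list[dict]:
    """Join vector-search matches to graph rows on neo4j_id, keeping match order."""
    rows_by_id = {row["neo4j_id"]: row for row in graph_rows}
    merged = []
    for match in matches:
        meta = match["metadata"]
        # Numeric metadata may come back as a float, so normalise to int
        row = rows_by_id.get(int(meta["neo4j_id"]), {})
        merged.append({
            "title": row.get("title", meta["title"]),
            "authors": row.get("authors", []),
            "relevance_score": match["score"],
        })
    return merged[:limit]


matches = [
    {"score": 0.92, "metadata": {"neo4j_id": 7.0, "title": "Paper A"}},
    {"score": 0.88, "metadata": {"neo4j_id": 3.0, "title": "Paper B"}},  # no graph row
    {"score": 0.81, "metadata": {"neo4j_id": 9.0, "title": "Paper C"}},
]
graph_rows = [
    {"neo4j_id": 9, "title": "Paper C", "authors": ["Lee"]},
    {"neo4j_id": 7, "title": "Paper A", "authors": ["Kim", "Patel"]},
]

for item in merge_by_id(matches, graph_rows):
    print(item["title"], item["authors"], item["relevance_score"])
```

Here "Paper B" falls back to its vector-store metadata with an empty author list, while the other two keep their own authors even though the graph rows arrive in a different order.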

2. Implement Query Result Caching

python
# lib/query_cache.py

import hashlib
import json
from datetime import timedelta
from typing import Optional

import redis

# PersonaReasoningAgent is the agent class built earlier in this guide;
# import it from wherever it lives in your project.

class QueryCache:
    def __init__(self, redis_url: str = "redis://localhost:6379"):
        self.redis = redis.from_url(redis_url)
        self.ttl = timedelta(hours=24)  # Cache for 24 hours

    def get_cached_response(self, query: str, persona_config: dict) -> Optional[dict]:
        """
        Check if we have a cached response for this query+persona combination.
        """

        # Create cache key from query + persona config
        cache_key = self._create_cache_key(query, persona_config)

        cached = self.redis.get(cache_key)
        if cached:
            print(f"✓ Cache hit for query: {query[:50]}...")
            return json.loads(cached)

        return None

    def cache_response(self, query: str, persona_config: dict, response: dict):
        """Store response in cache"""

        cache_key = self._create_cache_key(query, persona_config)

        self.redis.setex(
            cache_key,
            self.ttl,
            json.dumps(response)
        )

    def _create_cache_key(self, query: str, persona_config: dict) -> str:
        """Create deterministic cache key"""

        # sort_keys=True makes the hash independent of dict ordering
        persona_hash = hashlib.md5(
            json.dumps(persona_config, sort_keys=True).encode()
        ).hexdigest()

        query_hash = hashlib.md5(query.encode()).hexdigest()

        return f"query_cache:{query_hash}:{persona_hash}"

    def invalidate_cache(self):
        """Invalidate all cached queries (e.g., after persona update)"""

        # SCAN iterates incrementally instead of blocking Redis like KEYS
        keys = list(self.redis.scan_iter(match="query_cache:*"))
        if keys:
            self.redis.delete(*keys)
            print(f"✓ Invalidated {len(keys)} cached queries")

# Integrate into reasoning agent
class CachedReasoningAgent(PersonaReasoningAgent):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.cache = QueryCache()

    def generate_response(self, query: str, chat_history: Optional[list] = None) -> dict:
        """Generate response with caching"""

        # The cache key ignores chat_history, so only standalone queries
        # are safe to serve from (or store in) the cache
        use_cache = not chat_history

        # Check cache first
        if use_cache:
            cached = self.cache.get_cached_response(query, self.persona_config)
            if cached:
                return cached

        # Generate fresh response
        response = super().generate_response(query, chat_history)

        # Cache if quality is good
        if use_cache and response.get('quality_grade', 0) > 0.7:
            self.cache.cache_response(query, self.persona_config, response)

        return response
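
The cache logic can be exercised without a running Redis server. The sketch below is an in-memory stand-in (the `InMemoryQueryCache` class is hypothetical, not part of the guide's codebase) that shows the two properties the Redis version depends on: `sort_keys=True` makes the key independent of dictionary ordering, and entries expire after the TTL. It uses SHA-256 rather than MD5, an arbitrary choice here since the hash only needs to be deterministic.

```python
import hashlib
import json
import time
from typing import Optional


def create_cache_key(query: str, persona_config: dict) -> str:
    """Deterministic key: same query and persona settings give the same key."""
    persona_hash = hashlib.sha256(
        json.dumps(persona_config, sort_keys=True).encode()
    ).hexdigest()
    query_hash = hashlib.sha256(query.encode()).hexdigest()
    return f"query_cache:{query_hash}:{persona_hash}"


class InMemoryQueryCache:
    """Dictionary-backed cache with per-entry expiry, for tests and local runs."""

    def __init__(self, ttl_seconds: float = 24 * 3600):
        self.ttl_seconds = ttl_seconds
        self._store = {}  # key -> (expiry timestamp, JSON payload)

    def get(self, query: str, persona_config: dict) -> Optional[dict]:
        key = create_cache_key(query, persona_config)
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, payload = entry
        if time.monotonic() >= expires_at:
            del self._store[key]  # Expired: drop it and report a miss
            return None
        return json.loads(payload)

    def set(self, query: str, persona_config: dict, response: dict) -> None:
        key = create_cache_key(query, persona_config)
        self._store[key] = (time.monotonic() + self.ttl_seconds, json.dumps(response))


cache = InMemoryQueryCache(ttl_seconds=60)
cache.set("What is GraphRAG?", {"formality": 0.5, "depth": 0.8}, {"answer": "..."})
# Same persona settings in a different key order still hit the cache
print(cache.get("What is GraphRAG?", {"depth": 0.8, "formality": 0.5}))
```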

3. Batch Embedding Generation

When ingesting large collections:

python
# scripts/batch_embedding_generator.py

import time
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

import ollama

# extract_paper_metadata, create_paper_node, and neo4j_driver come from the
# ingestion code earlier in this guide.

class BatchEmbeddingGenerator:
    def __init__(self, model: str = 'nomic-embed-text', max_workers: int = 4):
        self.model = model
        self.max_workers = max_workers
        self.rate_limit_delay = 0.1  # 100ms between requests

    def generate_embeddings_batch(self, texts: list[str]) -> list[list[float]]:
        """
        Generate embeddings for multiple texts in parallel with rate limiting.
        """

        embeddings = []

        with ThreadPoolExecutor(max_workers=self.max_workers) as executor:
            # Submit all tasks
            futures = []
            for i, text in enumerate(texts):
                future = executor.submit(self._generate_single, text, i)
                futures.append(future)

                # Rate limiting: space out submissions
                time.sleep(self.rate_limit_delay)

            # Collect results (a failed task re-raises here)
            for future in futures:
                embedding, index = future.result()
                embeddings.append((index, embedding))

        # Sort by original index
        embeddings.sort(key=lambda x: x[0])

        return [emb for _, emb in embeddings]

    def _generate_single(self, text: str, index: int) -> tuple[list[float], int]:
        """Generate single embedding with retry logic"""

        max_retries = 3
        for attempt in range(max_retries):
            try:
                response = ollama.embeddings(
                    model=self.model,
                    prompt=text[:8192]  # Rough character-based guard against over-long inputs
                )
                return response['embedding'], index

            except Exception as e:
                if attempt == max_retries - 1:
                    raise

                print(f"⚠️  Retry {attempt+1}/{max_retries} for text {index}: {e}")
                time.sleep(2 ** attempt)  # Exponential backoff

# Use in ingestion pipeline
def ingest_large_collection(papers: list[Path]):
    """Efficiently ingest 1000+ papers"""

    generator = BatchEmbeddingGenerator(max_workers=8)

    # Process in batches of 50
    batch_size = 50
    # Round up so an exact multiple of batch_size is not over-counted
    total_batches = (len(papers) + batch_size - 1) // batch_size

    for i in range(0, len(papers), batch_size):
        batch = papers[i:i+batch_size]

        print(f"Processing batch {i//batch_size + 1}/{total_batches}")

        # Extract abstracts
        abstracts = []
        metadata_list = []
        for paper_path in batch:
            metadata = extract_paper_metadata(paper_path)
            abstracts.append(metadata['abstract'])
            metadata_list.append(metadata)

        # Generate embeddings in parallel
        embeddings = generator.generate_embeddings_batch(abstracts)

        # Insert into database
        with neo4j_driver.session() as session:
            for metadata, embedding in zip(metadata_list, embeddings):
                metadata['abstract_embedding'] = embedding
                create_paper_node(session, metadata)

        print(f"✓ Ingested batch {i//batch_size + 1}")

Part 13: Real-World Production Case Study

Let's walk through a complete example from a hypothetical research lab:

Scenario: Computational Biology Research Lab

Requirements:

5,000 existing papers in their collection
Weekly updates with new publications
15 active researchers with different expertise levels
Need to find cross-domain connections (CS ↔ Biology)
High precision required (wrong papers waste researcher time)

Implementation:

python
# config/bio_lab_config.py

RESEARCH_LAB_CONFIG = {
    'name': 'Computational Biology Lab',
    'paper_sources': [
        'local_collection',  # Existing 5K papers
        'pubmed_api',        # Weekly updates
        'biorxiv_api',       # Preprints
        'arxiv_bio'          # CS bio papers
    ],
    'personas': {
        'wet_lab_biologist': {
            'description': 'Bench scientists with limited CS background',
            'rlhf_thresholds': {
                'technical_detail': 0.3,   # Less technical jargon
                'methodology_depth': 0.8,  # High experimental detail
                'formality': 0.5
            },
            'preferred_sources': ['Nature', 'Cell', 'Science']
        },
        'computational_biologist': {
            'description': 'Hybrid CS/Bio expertise',
            'rlhf_thresholds': {
                'technical_detail': 0.8,   # Can handle complexity
                'methodology_depth': 0.9,  # Wants algorithm details
                'formality': 0.7
            },
            'preferred_sources': ['Nature Methods', 'Bioinformatics', 'PLOS Comp Bio']
        },
        'pi_researcher': {
            'description': 'Principal investigator, needs big picture',
            'rlhf_thresholds': {
                'technical_detail': 0.5,   # Balanced
                'methodology_depth': 0.4,  # Focus on conclusions
                'formality': 0.9           # Very formal
            },
            'preferred_sources': ['High-impact journals', 'Review articles']
        }
    },
    'quality_requirements': {
        'min_precision': 0.85,     # Must retrieve >85% relevant papers
        'min_faithfulness': 0.90,  # Responses must be 90% faithful to sources
        'max_latency': 3.0         # 3 second max response time
    }
}
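
The `quality_requirements` block only has an effect if something enforces it. As a minimal sketch, the hypothetical `check_quality_requirements` helper below compares measured metrics against those thresholds and returns readable violations. The metric names (`precision`, `faithfulness`, `avg_latency`) are assumed to match the keys reported by the monitoring script.

```python
QUALITY_REQUIREMENTS = {
    "min_precision": 0.85,
    "min_faithfulness": 0.90,
    "max_latency": 3.0,
}


def check_quality_requirements(metrics: dict, requirements: dict) -> list[str]:
    """Return one message per violated requirement (empty list means pass)."""
    violations = []
    for key, threshold in requirements.items():
        bound, _, metric = key.partition("_")  # "min_precision" -> "min", "precision"
        # Assumption: latency is reported under "avg_latency"
        metric_key = "avg_latency" if metric == "latency" else metric
        value = metrics.get(metric_key)
        if value is None:
            violations.append(f"{metric_key}: no measurement available")
        elif bound == "min" and value < threshold:
            violations.append(f"{metric_key}: {value:.2f} is below minimum {threshold:.2f}")
        elif bound == "max" and value > threshold:
            violations.append(f"{metric_key}: {value:.2f} exceeds maximum {threshold:.2f}")
    return violations


measured = {"precision": 0.88, "faithfulness": 0.86, "avg_latency": 2.4}
for problem in check_quality_requirements(measured, QUALITY_REQUIREMENTS):
    print(problem)
```

A check like this could run at the end of the evaluation step and block deployment when the returned list is non-empty.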

Setup Script:

bash
#!/bin/bash
# setup_bio_lab.sh

# Stop at the first failing step so a broken ingest or evaluation
# never reaches the deployment step
set -euo pipefail

echo "🧬 Setting up Computational Biology Research Assistant"

# 1. Ingest existing collection
echo "📚 Ingesting 5,000 existing papers..."
python scripts/ingest_research_data.py \
    --directory /data/lab_papers \
    --batch-size 50 \
    --parallel-workers 8

# 2. Set up automated paper updates
echo "📰 Configuring automated updates..."
python scripts/setup_paper_updates.py \
    --sources pubmed,biorxiv,arxiv \
    --schedule daily \
    --filter "computational biology OR bioinformatics"

# 3. Generate persona-specific test datasets
echo "🧪 Generating evaluation datasets..."
python evaluation/generate_test_dataset.py \
    --personas wet_lab,computational,pi \
    --queries-per-persona 50

# 4. Run initial evaluation
echo "📊 Running baseline evaluation..."
python evaluation/run_evaluation.py \
    --config config/bio_lab_config.py

# 5. Deploy to production
echo "🚀 Deploying to production..."
docker-compose -f docker-compose.bio-lab.yml up -d

echo "✅ Setup complete!"
echo "   Dashboard: http://lab-research-assistant.local"
echo "   Monitoring: http://lab-research-assistant.local/metrics"

Weekly Evaluation Report Email:

python
# scripts/weekly_report.py

import json
import os
import smtplib
from email.mime.image import MIMEImage
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText
from html import escape

import matplotlib
matplotlib.use("Agg")  # Headless backend so the script can run from cron
import matplotlib.pyplot as plt
from vero.report import ReportGenerator

PRECISION_TARGET = 0.85

def target_status(precision: float) -> str:
    """Label a persona's precision against the 85% target."""
    return '✓' if precision > PRECISION_TARGET else '⚠️ Below target'

def generate_weekly_report(weekly_data: dict, summary: dict):
    """
    Automated weekly report sent to PI and lab members.

    Assumption: `weekly_data` holds the series plotted below and `summary`
    holds the scalar values used in the email body. Both are placeholders
    for whatever you aggregate from your own evaluation results.
    """

    # Generate vero-eval report
    generator = ReportGenerator(
        trace_db_path='evaluation/trace.db',
        results_path='evaluation/results/weekly.json'
    )

    # Create visualizations
    fig, axes = plt.subplots(2, 2, figsize=(12, 10))

    # 1. Precision trends by persona
    axes[0, 0].plot(
        weekly_data['wet_lab_precision'],
        label='Wet Lab',
        marker='o'
    )
    axes[0, 0].plot(
        weekly_data['computational_precision'],
        label='Computational',
        marker='s'
    )
    axes[0, 0].plot(
        weekly_data['pi_precision'],
        label='PI',
        marker='^'
    )
    axes[0, 0].set_title('Retrieval Precision by Persona')
    axes[0, 0].set_xlabel('Week')
    axes[0, 0].set_ylabel('Precision@5')
    axes[0, 0].legend()
    axes[0, 0].grid(True, alpha=0.3)

    # 2. Faithfulness over time
    axes[0, 1].plot(
        weekly_data['faithfulness'],
        color='green',
        marker='o'
    )
    axes[0, 1].axhline(y=0.90, color='r', linestyle='--',
                       label='Target (90%)')
    axes[0, 1].set_title('Response Faithfulness')
    axes[0, 1].set_xlabel('Week')
    axes[0, 1].set_ylabel('Faithfulness Score')
    axes[0, 1].legend()
    axes[0, 1].grid(True, alpha=0.3)

    # 3. Query latency distribution
    axes[1, 0].hist(
        weekly_data['latencies'],
        bins=30,
        edgecolor='black'
    )
    axes[1, 0].axvline(x=3.0, color='r', linestyle='--',
                       label='Max Latency (3s)')
    axes[1, 0].set_title('Query Latency Distribution')
    axes[1, 0].set_xlabel('Latency (seconds)')
    axes[1, 0].set_ylabel('Frequency')
    axes[1, 0].legend()

    # 4. Top failure categories
    failure_categories = weekly_data['failure_categories']
    axes[1, 1].barh(
        list(failure_categories.keys()),
        list(failure_categories.values())
    )
    axes[1, 1].set_title('Top Failure Categories')
    axes[1, 1].set_xlabel('Count')

    plt.tight_layout()
    plt.savefig('evaluation/results/weekly_report.png', dpi=150)
    plt.close(fig)

    # Week-over-week changes, derived from the summary values
    s = summary
    change = s['current_precision'] - s['last_precision']
    faith_change = s['current_faithfulness'] - s['last_faithfulness']
    latency_change = s['current_latency'] - s['last_latency']
    queries_change = s['current_queries'] - s['last_queries']
    issues_html = "\n".join(
        f"        <li>{escape(issue)}</li>" for issue in s['issues']
    )

    # Create email
    msg = MIMEMultipart()
    msg['Subject'] = f"Research Assistant Weekly Report - Week {s['week_number']}"
    msg['From'] = 'research-assistant@lab.edu'
    msg['To'] = 'pi@lab.edu, lab-members@lab.edu'

    # Email body
    html_body = f"""
    <html>
    <body>
    <h2>Research Assistant Performance Report</h2>
    <h3>Week {s['week_number']} - {s['date_range']}</h3>

    <h4>📊 Key Metrics</h4>
    <table border="1" cellpadding="10">
        <tr>
            <th>Metric</th>
            <th>This Week</th>
            <th>Last Week</th>
            <th>Change</th>
        </tr>
        <tr>
            <td>Avg Precision@5</td>
            <td>{s['current_precision']:.2%}</td>
            <td>{s['last_precision']:.2%}</td>
            <td style="color: {'green' if change > 0 else 'red'};">
                {change:+.2%}
            </td>
        </tr>
        <tr>
            <td>Faithfulness</td>
            <td>{s['current_faithfulness']:.2%}</td>
            <td>{s['last_faithfulness']:.2%}</td>
            <td style="color: {'green' if faith_change > 0 else 'red'};">
                {faith_change:+.2%}
            </td>
        </tr>
        <tr>
            <td>Avg Latency</td>
            <td>{s['current_latency']:.2f}s</td>
            <td>{s['last_latency']:.2f}s</td>
            <td style="color: {'green' if latency_change < 0 else 'red'};">
                {latency_change:+.2f}s
            </td>
        </tr>
        <tr>
            <td>Queries Served</td>
            <td>{s['current_queries']}</td>
            <td>{s['last_queries']}</td>
            <td>{queries_change:+d}</td>
        </tr>
    </table>

    <h4>🎯 Performance by Persona</h4>
    <ul>
        <li><strong>Wet Lab Biologists:</strong>
            Precision: {s['wet_lab_precision']:.2%}
            (Target: >85% {target_status(s['wet_lab_precision'])})
        </li>
        <li><strong>Computational Biologists:</strong>
            Precision: {s['comp_bio_precision']:.2%}
            (Target: >85% {target_status(s['comp_bio_precision'])})
        </li>
        <li><strong>PI Queries:</strong>
            Precision: {s['pi_precision']:.2%}
            (Target: >85% {target_status(s['pi_precision'])})
        </li>
    </ul>

    <h4>⚠️ Issues & Recommendations</h4>
    <ul>
{issues_html}
    </ul>

    <p>See attached visualization for detailed trends.</p>

    <p>
    <a href="http://lab-research-assistant.local/dashboard">
        View Interactive Dashboard
    </a>
    </p>

    </body>
    </html>
    """

    msg.attach(MIMEText(html_body, 'html'))

    # Attach visualization
    with open('evaluation/results/weekly_report.png', 'rb') as f:
        img = MIMEImage(f.read())
        img.add_header('Content-Disposition', 'attachment',
                      filename='weekly_trends.png')
        msg.attach(img)

    # Send email
    with smtplib.SMTP('smtp.lab.edu', 587) as smtp:
        smtp.starttls()
        smtp.login('research-assistant@lab.edu', os.getenv('EMAIL_PASSWORD'))
        smtp.send_message(msg)

    print("✓ Weekly report sent to lab members")

if __name__ == "__main__":
    # Assumption: weekly.json contains "weekly_data" and "summary" objects
    # with the keys used above; adjust to match your own results file.
    with open('evaluation/results/weekly.json') as f:
        results = json.load(f)
    generate_weekly_report(results['weekly_data'], results['summary'])
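
The table rows in the email repeat one pattern: current value, previous value, signed change, and a colour that depends on whether higher or lower is better. A sketch of that pattern as a helper (the `metric_row` function and its parameters are hypothetical, not part of the script above):

```python
def metric_row(label: str, current: float, previous: float,
               fmt: str = ".2%", higher_is_better: bool = True) -> str:
    """Build one HTML table row with a colour-coded week-over-week change."""
    change = current - previous
    improved = change > 0 if higher_is_better else change < 0
    if change == 0:
        color = "black"  # No movement: neither good nor bad
    else:
        color = "green" if improved else "red"
    return (
        f"<tr><td>{label}</td>"
        f"<td>{current:{fmt}}</td>"
        f"<td>{previous:{fmt}}</td>"
        f'<td style="color: {color};">{change:+{fmt}}</td></tr>'
    )


print(metric_row("Avg Precision@5", 0.875, 0.85))
print(metric_row("Avg Latency", 2.1, 2.4, fmt=".2f", higher_is_better=False))
```

Passing `higher_is_better=False` for latency keeps the colour logic in one place instead of flipping the comparison by hand in each row.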


Conclusion: The Complete Picture

You now have everything needed to build, evaluate, and deploy a production-ready Research Assistant:

Core Architecture:

✅ Neo4j knowledge graph for research papers
✅ Ollama for local LLM inference
✅ Hybrid retrieval (vector + graph)
✅ Persona-driven responses with RLHF

Evaluation & Quality:

✅ vero-eval for rigorous testing
✅ Automated adversarial testing
✅ Continuous monitoring with alerts
✅ Weekly performance reports

Production Features:

✅ Caching for performance
✅ Batch processing for scale
✅ Automated paper updates
✅ Multi-persona support

The vero-eval Advantage:

What makes this system production-ready is the evaluation framework. Unlike traditional RAG systems that rely on gut feeling and spot-checking, we have:

Systematic edge case testing - adversarial queries expose weaknesses
Persona stress testing - ensures all user types are served well
Automated regression detection - alerts when quality degrades
Actionable metrics - precision/recall/faithfulness directly inform improvements
Continuous learning - RLHF loop closes based on real performance data
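
For readers who want to see what the retrieval metrics measure, here is a minimal sketch of precision@k and recall@k computed by hand. It illustrates the definitions only; it is not vero-eval's API, and the paper IDs are made up.

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k retrieved items that are relevant."""
    top_k = retrieved[:k]
    if not top_k:
        return 0.0
    return sum(1 for item in top_k if item in relevant) / len(top_k)


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of all relevant items that appear in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


retrieved = ["p1", "p7", "p3", "p9", "p4"]
relevant = {"p1", "p3", "p4", "p8"}

print(precision_at_k(retrieved, relevant, k=5))  # 3 of the 5 retrieved are relevant
print(recall_at_k(retrieved, relevant, k=5))     # 3 of the 4 relevant were found
```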

This is the difference between a demo and a system you'd trust with real research workflows.

Next Steps:

Clone the starter repo and follow the setup script
Ingest your first 100 papers to test the pipeline
Run vero-eval to establish your baseline
Iterate on retrieval and persona prompts
Deploy to staging and gather feedback
Use weekly reports to drive improvements

Remember: The goal isn't perfect accuracy on day one. It's building a system that measurably improves over time through evaluation-driven iteration.

Now go build something that makes research more efficient! 🚀

Resources:

Questions? Open an issue in the repo or reach out to the community.

Sovereign AI: Building Local-First Intelligent Systems

by Daniel Kliewer · Paperback · 72 pages

The hands-on guide to building AI that runs on your hardware, keeps your data private, and eliminates cloud dependence. Working code included.
