
Building and Evaluating a Local-First Research Assistant with GraphRAG and vero-eval

A complete technical guide to building a production-ready research assistant using GraphRAG, Neo4j knowledge graphs, local LLMs served by Ollama, and the vero-eval evaluation framework for rigorous AI system testing.


Daniel Kliewer

Author, Sovereign AI

Tags: AI, GraphRAG, Local LLM, Neo4j, Ollama, vero-eval, Research Assistant, Knowledge Graph, RAG, AI Evaluation

From the Book

This is from Sovereign AI: Building Local-First Intelligent Systems.


A comprehensive guide to creating a persona-driven AI assistant with rigorous evaluation using Neo4j, Ollama, and the vero-eval framework

Introduction: Why Local GraphRAG Matters for Research Workflows

If you're building AI-powered applications in 2025, you've likely hit two major pain points: context limitations and a lack of systematic evaluation. Large Language Models are powerful, but they struggle with long-term memory and with consistent performance across edge cases. GraphRAG addresses the first problem by combining knowledge graphs with retrieval-augmented generation, giving your AI persistent, queryable memory and contextual awareness.

In this guide, we'll build a Local Research Assistant that:

  • Stores and retrieves research papers, notes, and conversations in a Neo4j knowledge graph
  • Uses Ollama for completely local inference (no API costs, full privacy)
  • Implements persona-driven responses that adapt based on RLHF feedback
  • Most importantly: Measures performance rigorously using the vero-eval framework

This isn't another "hello world" tutorial. We're building production-ready infrastructure that you can deploy for real research workflows, with proper testing and evaluation baked in from day one.

Prerequisites and Starting Point

Before we dive in, you'll need:

System Requirements:

  • Python 3.9+
  • Node.js 18+
  • Docker (for Neo4j)
  • 16GB+ RAM recommended

Core Technologies:

  • Ollama for local LLM inference
  • Neo4j for graph database
  • vero-eval for evaluation
  • Next.js + FastAPI (from the starter template)

Clone the Starter Repository:

bash
git clone https://github.com/kliewerdaniel/chrisbot.git research-assistant
cd research-assistant

This gives us a solid foundation with the frontend, basic chat interface, and project structure already in place. We'll extend it to build our research-focused GraphRAG system.

Part 1: Understanding the Architecture

Our Research Assistant follows the PersonaGen architecture pattern outlined by Daniel Kliewer, but applied to academic research workflows:

text
┌─────────────────────────────────────────────────────────┐
│                     User Interface                      │
│                (Next.js Chat Interface)                 │
└────────────────────┬────────────────────────────────────┘
                     │
                     ▼
┌─────────────────────────────────────────────────────────┐
│                     Reasoning Agent                     │
│          (Tool Calling + RLHF Threshold Logic)          │
└────────────────────┬────────────────────────────────────┘
                     │
          ┌──────────┴──────────┐
          ▼                     ▼
┌──────────────────┐  ┌──────────────────┐
│ Neo4j Graph      │  │ Ollama LLM       │
│ RAG System       │  │ (Mistral/Llama)  │
│                  │  │                  │
│ • Papers         │  │ • Generation     │
│ • Authors        │  │ • Embeddings     │
│ • Concepts       │  │ • Extraction     │
│ • Citations      │  │                  │
└──────────────────┘  └──────────────────┘
                     │
                     ▼
┌─────────────────────────────────────────────────────────┐
│  vero-eval Framework                                    │
│  • Test Dataset Generation                              │
│  • Retrieval Metrics (Precision, Recall, MRR)           │
│  • Generation Metrics (Faithfulness, BERTScore)         │
│  • Persona Stress Testing                               │
└─────────────────────────────────────────────────────────┘

Key Insight: The persona system adapts its behavior based on evaluation feedback. If vero-eval shows poor retrieval for technical queries, the RLHF thresholds adjust to require more context before responding.

Part 2: Setting Up Neo4j GraphRAG

Neo4j is our memory layer. Following the official Neo4j GenAI integration patterns, we'll create a graph schema optimized for research.

Installing Neo4j GraphRAG for Python

bash
# Install the official Neo4j GraphRAG package
pip install neo4j-graphrag

# Install Ollama integration
pip install "neo4j-graphrag[ollama]"

# Start Neo4j (using Docker)
docker run \
  --name research-neo4j \
  -p 7474:7474 -p 7687:7687 \
  -e NEO4J_AUTH=neo4j/research2025 \
  -v "$PWD/neo4j-data:/data" \
  neo4j:latest

[Figure: Neo4j Knowledge Graph Setup for Research Assistant]

Defining the Research Knowledge Schema

Create scripts/graph_schema.py:

python
# Plain-Python description of the graph schema. It documents the labels,
# properties, and relationships that the ingestion and retrieval scripts
# rely on; it is not a neo4j-graphrag API object.


class ResearchSchema:
    """
    Knowledge graph schema for research assistant.

    Nodes:
    - Paper: Research papers with metadata
    - Author: Paper authors with affiliation
    - Concept: Extracted key concepts/topics
    - Note: User's research notes
    - Question: User queries with context

    Relationships:
    - AUTHORED: Author -> Paper
    - CITES: Paper -> Paper
    - DISCUSSES: Paper -> Concept
    - RELATES_TO: Concept -> Concept
    - ANSWERS: Paper -> Question
    - ANNOTATES: Note -> Paper
    """

    node_types = {
        'Paper': {
            'properties': ['title', 'abstract', 'year', 'doi', 'pdf_path'],
            'embedding_property': 'abstract_embedding'
        },
        'Author': {
            'properties': ['name', 'affiliation', 'h_index'],
            'embedding_property': None
        },
        'Concept': {
            'properties': ['name', 'definition', 'domain'],
            'embedding_property': 'definition_embedding'
        },
        'Note': {
            'properties': ['content', 'timestamp', 'tags'],
            'embedding_property': 'content_embedding'
        },
        'Question': {
            'properties': ['query', 'timestamp', 'answered'],
            'embedding_property': 'query_embedding'
        }
    }

    # Each relationship maps to a (source label, target label) pair
    relationship_types = {
        'AUTHORED': ('Author', 'Paper'),
        'CITES': ('Paper', 'Paper'),
        'DISCUSSES': ('Paper', 'Concept'),
        'RELATES_TO': ('Concept', 'Concept'),
        'ANSWERS': ('Paper', 'Question'),
        'ANNOTATES': ('Note', 'Paper')
    }

Why this schema? Research workflows have natural graph structures:

  • Papers cite each other (forming multi-hop citation chains)
  • Concepts relate to multiple papers
  • Authors collaborate across papers
  • User notes connect to specific papers

This lets us traverse the graph to find: "What papers discussing transformer architectures were cited by papers on RAG systems after 2023?"
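
The example question maps onto the schema as a two-hop pattern. The following is a minimal sketch rather than code from the repository: the helper name `build_cited_by_query` is hypothetical, the concept names are illustrative values, and it assumes `year` is stored as an integer and that "after 2023" applies to the citing papers.

```python
from typing import Dict, Tuple


def build_cited_by_query(citing_concept: str, cited_concept: str,
                         min_year: int) -> Tuple[str, Dict[str, object]]:
    """Return (cypher, params) for: papers discussing `cited_concept` that are
    cited by papers discussing `citing_concept` published after `min_year`."""
    cypher = """
    MATCH (citing:Paper)-[:DISCUSSES]->(:Concept {name: $citing_concept})
    WHERE citing.year > $min_year
    MATCH (citing)-[:CITES]->(cited:Paper)-[:DISCUSSES]->(:Concept {name: $cited_concept})
    RETURN DISTINCT cited.title AS title, cited.year AS year
    ORDER BY year DESC
    """
    params = {
        "citing_concept": citing_concept,
        "cited_concept": cited_concept,
        "min_year": min_year,
    }
    return cypher, params


cypher, params = build_cited_by_query("RAG systems", "transformer architectures", 2023)
# With a live driver this would run as: session.run(cypher, **params)
```

Keeping the values in `params` rather than formatting them into the query string lets Neo4j reuse the query plan and avoids injection problems when concept names come from user input.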

Building the Graph Ingestion Pipeline

Create scripts/ingest_research_data.py:

python
import json
from pathlib import Path

import ollama
import PyPDF2
from neo4j import GraphDatabase


class ResearchGraphBuilder:
    def __init__(self, neo4j_uri="bolt://localhost:7687",
                 neo4j_user="neo4j",
                 neo4j_password="research2025",
                 ollama_model="mistral"):

        self.driver = GraphDatabase.driver(neo4j_uri,
                                           auth=(neo4j_user, neo4j_password))
        self.ollama_model = ollama_model

    def extract_paper_metadata(self, pdf_path: Path) -> dict:
        """Extract title, authors, abstract, and key concepts from a PDF"""
        with open(pdf_path, 'rb') as file:
            reader = PyPDF2.PdfReader(file)

            # Extract first 3 pages (usually contains abstract)
            text = ""
            for page in reader.pages[:3]:
                # extract_text() can return None for image-only pages
                text += page.extract_text() or ""

        # Use Ollama to extract structured metadata. Naming the JSON keys
        # keeps the output aligned with what create_paper_node() reads.
        prompt = f"""Extract from this research paper excerpt:
1. title
2. authors (list of names)
3. abstract
4. concepts (list of 5-7 main topics)
5. year (publication year as an integer, if stated)

Text: {text[:4000]}

Return as JSON with exactly the keys: title, authors, abstract, concepts, year."""

        response = ollama.generate(
            model=self.ollama_model,
            prompt=prompt,
            format='json'
        )

        return json.loads(response['response'])

    def create_paper_node(self, metadata: dict, pdf_path: Path):
        """Create Paper node with embeddings"""

        # Generate embedding for abstract
        abstract_embedding = ollama.embeddings(
            model='nomic-embed-text',
            prompt=metadata['abstract']
        )['embedding']

        # FOREACH (rather than chained UNWINDs) keeps one row per paper, so
        # concepts are not re-linked once per author, and an empty author
        # list does not skip the concept step. MERGE avoids duplicate edges.
        with self.driver.session() as session:
            session.run("""
                CREATE (p:Paper {
                    title: $title,
                    abstract: $abstract,
                    year: $year,
                    pdf_path: $pdf_path,
                    abstract_embedding: $embedding
                })
                FOREACH (author_name IN $authors |
                    MERGE (a:Author {name: author_name})
                    MERGE (a)-[:AUTHORED]->(p)
                )
                FOREACH (concept_name IN $concepts |
                    MERGE (c:Concept {name: concept_name})
                    MERGE (p)-[:DISCUSSES]->(c)
                )
                """,
                title=metadata['title'],
                abstract=metadata['abstract'],
                year=metadata.get('year', 2024),
                pdf_path=str(pdf_path),
                embedding=abstract_embedding,
                authors=metadata.get('authors', []),
                concepts=metadata.get('concepts', [])
            )

    def ingest_directory(self, papers_dir: Path):
        """Ingest all PDFs in a directory"""
        pdf_files = list(papers_dir.glob("*.pdf"))

        print(f"Found {len(pdf_files)} papers to ingest...")

        for pdf_path in pdf_files:
            print(f"Processing: {pdf_path.name}")
            try:
                metadata = self.extract_paper_metadata(pdf_path)
                self.create_paper_node(metadata, pdf_path)
                print(f"✓ Ingested: {metadata['title']}")
            except Exception as e:
                print(f"✗ Failed {pdf_path.name}: {e}")

Key Pattern: We're using Ollama for both extraction (via generate) and embeddings (via embeddings). This keeps everything local. For production, you might cache embeddings in a vector index.
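
Re-embedding the same text on every run is wasted work. As a complement to storing embeddings in the graph, a small in-process cache can skip repeated calls. This is a minimal sketch under stated assumptions: `EmbeddingCache` and `fake_embed` are hypothetical names, and the stand-in embedder exists only so the example runs without Ollama.

```python
import hashlib
from typing import Callable, Dict, List


class EmbeddingCache:
    """Memoize embeddings by a hash of the input text."""

    def __init__(self, embed_fn: Callable[[str], List[float]]):
        self._embed_fn = embed_fn
        self._store: Dict[str, List[float]] = {}
        self.misses = 0  # number of times the underlying embedder was called

    def get(self, text: str) -> List[float]:
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._embed_fn(text)
        return self._store[key]


def fake_embed(text: str) -> List[float]:
    # Stand-in embedder (assumption). In the pipeline you would pass e.g.
    # lambda t: ollama.embeddings(model="nomic-embed-text", prompt=t)["embedding"]
    return [float(len(text)), 0.0]


cache = EmbeddingCache(fake_embed)
cache.get("graph retrieval")
cache.get("graph retrieval")  # served from the cache, no second embed call
```

The cache lives only for the lifetime of the process; persisting it (for example to disk) would be a separate design decision.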

Creating Vector Indexes for Hybrid Search

Following Neo4j's GenAI integration guide, we create vector indexes:

python
def create_vector_indexes(self):
    """Create vector indexes for similarity search"""
    with self.driver.session() as session:
        # Abstract embeddings (768 dimensions for nomic-embed-text)
        session.run("""
            CREATE VECTOR INDEX paper_abstracts IF NOT EXISTS
            FOR (p:Paper)
            ON p.abstract_embedding
            OPTIONS {
                indexConfig: {
                    `vector.dimensions`: 768,
                    `vector.similarity_function`: 'cosine'
                }
            }
        """)

        # Concept embeddings
        session.run("""
            CREATE VECTOR INDEX concept_definitions IF NOT EXISTS
            FOR (c:Concept)
            ON c.definition_embedding
            OPTIONS {
                indexConfig: {
                    `vector.dimensions`: 768,
                    `vector.similarity_function`: 'cosine'
                }
            }
        """)

        # Note embeddings
        session.run("""
            CREATE VECTOR INDEX note_contents IF NOT EXISTS
            FOR (n:Note)
            ON n.content_embedding
            OPTIONS {
                indexConfig: {
                    `vector.dimensions`: 768,
                    `vector.similarity_function`: 'cosine'
                }
            }
        """)

Critical: The dimension count must match your embedding model. nomic-embed-text produces 768-dimensional vectors; if you switch to all-MiniLM-L6-v2, you'd need 384.
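
A cheap guard against this kind of mismatch is to check the vector length before writing it to the graph. The sketch below is illustrative only: `check_embedding_dimensions` is a hypothetical helper, and the example uses the 384-dimension figure quoted for all-MiniLM-L6-v2.

```python
from typing import Sequence


def check_embedding_dimensions(embedding: Sequence[float],
                               index_dimensions: int) -> None:
    """Raise early if an embedding does not fit the configured vector index."""
    if len(embedding) != index_dimensions:
        raise ValueError(
            f"Embedding has {len(embedding)} dimensions, "
            f"but the vector index expects {index_dimensions}"
        )


# Passes silently: a 384-dimensional vector against a 384-dimensional index
check_embedding_dimensions([0.0] * 384, 384)
```

Calling this once per model at startup, with a single test embedding, is usually enough to catch a misconfigured index before any data is ingested.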

Part 3: Implementing Hybrid Retrieval

Now we implement the retrieval layer that combines vector similarity with graph traversal:

[Figure: Hybrid Retrieval System Architecture]

python
import ollama


class HybridRetriever:
    def __init__(self, driver, ollama_model="mistral"):
        self.driver = driver
        self.ollama_model = ollama_model

    def retrieve_context(self, query: str, limit: int = 5) -> list[dict]:
        """
        Hybrid retrieval combining:
        1. Vector similarity search
        2. Graph traversal for related concepts
        3. Citation network expansion
        """

        # Generate query embedding
        query_embedding = ollama.embeddings(
            model='nomic-embed-text',
            prompt=query
        )['embedding']

        with self.driver.session() as session:
            # Vector similarity search.
            # OPTIONAL MATCH keeps papers that have no authors or concepts yet.
            vector_results = session.run("""
                CALL db.index.vector.queryNodes(
                    'paper_abstracts',
                    $limit,
                    $query_embedding
                )
                YIELD node, score
                OPTIONAL MATCH (node)<-[:AUTHORED]-(author:Author)
                OPTIONAL MATCH (node)-[:DISCUSSES]->(concept:Concept)

                RETURN
                    node.title AS title,
                    node.abstract AS abstract,
                    node.year AS year,
                    score AS relevance_score,
                    collect(DISTINCT author.name) AS authors,
                    collect(DISTINCT concept.name) AS concepts,
                    'vector_search' AS retrieval_method
                ORDER BY relevance_score DESC
                """,
                query_embedding=query_embedding,
                limit=limit
            ).data()

            # Graph traversal for cited papers
            graph_results = []
            if vector_results:
                top_paper_title = vector_results[0]['title']

                # Aggregates are not allowed inside WHERE, so collect the
                # concepts in a WITH clause first and filter afterwards.
                graph_results = session.run("""
                    MATCH (seed:Paper {title: $seed_title})
                    MATCH (seed)-[:CITES]->(cited:Paper)
                    OPTIONAL MATCH (cited)<-[:AUTHORED]-(author:Author)
                    OPTIONAL MATCH (cited)-[:DISCUSSES]->(concept:Concept)
                    WITH cited,
                         collect(DISTINCT author.name) AS authors,
                         collect(DISTINCT concept.name) AS concepts
                    WHERE any(c IN concepts WHERE toLower(c) IN $query_concepts)

                    RETURN
                        cited.title AS title,
                        cited.abstract AS abstract,
                        cited.year AS year,
                        0.7 AS relevance_score,
                        authors,
                        concepts,
                        'citation_traversal' AS retrieval_method
                    LIMIT $limit
                    """,
                    seed_title=top_paper_title,
                    query_concepts=self._extract_query_concepts(query),
                    limit=max(1, limit // 2)
                ).data()

        # Combine and deduplicate
        all_results = vector_results + graph_results
        seen_titles = set()
        unique_results = []

        for result in all_results:
            if result['title'] not in seen_titles:
                seen_titles.add(result['title'])
                unique_results.append(result)

        return sorted(unique_results,
                      key=lambda x: x['relevance_score'],
                      reverse=True)[:limit]

    def _extract_query_concepts(self, query: str) -> list[str]:
        """Extract key concepts from query using LLM (lowercased for matching)"""
        response = ollama.generate(
            model=self.ollama_model,
            prompt=f"Extract 3-5 key technical concepts from this query: {query}. Return as comma-separated list.",
            options={'temperature': 0.1}
        )
        return [c.strip().lower()
                for c in response['response'].split(',')
                if c.strip()]

Why hybrid? Pure vector search might miss important papers that don't match semantically but are cited by relevant papers. Graph traversal captures these relationships.
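
One way to merge the two result lists without relying on hand-picked scores is reciprocal rank fusion (RRF). This is a swapped-in technique, not what the retriever in this guide does; the sketch assumes each retrieval method returns an ordered list of paper titles, and the function name and sample titles are illustrative.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists: each item scores sum(1 / (k + rank))."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, title in enumerate(ranking, start=1):
            scores[title] += 1.0 / (k + rank)
    # Highest fused score first; ties broken alphabetically for stable output
    return sorted(scores, key=lambda title: (-scores[title], title))


vector_hits = ["Paper A", "Paper B", "Paper C"]
citation_hits = ["Paper C", "Paper D"]
fused = reciprocal_rank_fusion([vector_hits, citation_hits])
# "Paper C" ranks first because both retrieval methods returned it
```

Because RRF only uses rank positions, it sidesteps the question of how a cosine similarity should compare with a fixed citation-traversal score.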

Part 4: The Reasoning Agent and Persona Layer

The reasoning agent decides when to query the graph and how to format responses based on RLHF-adjusted thresholds:

[Figure: Persona-Driven Response System Architecture]

python
# In scripts/reasoning_agent.py

import json
from pathlib import Path

import ollama
from neo4j import GraphDatabase

# HybridRetriever is the class from Part 3. The module name below is an
# assumption; adjust the import to wherever you saved that class.
from hybrid_retriever import HybridRetriever


class PersonaReasoningAgent:
    def __init__(self, persona_config_path: Path = Path("data/persona.json"),
                 driver=None, ollama_model: str = "mistral"):
        self.persona_config_path = Path(persona_config_path)
        self.persona_config = self._load_persona(self.persona_config_path)
        self.ollama_model = ollama_model
        # Fall back to the local Neo4j instance started in Part 2
        self.driver = driver or GraphDatabase.driver(
            "bolt://localhost:7687", auth=("neo4j", "research2025"))
        self.retriever = HybridRetriever(self.driver, ollama_model)

    def _load_persona(self, config_path: Path) -> dict:
        """Load persona configuration with RLHF thresholds"""
        with open(config_path) as f:
            return json.load(f)

    def should_retrieve_context(self, query: str) -> bool:
        """
        Decide if we need to retrieve context based on:
        1. Query complexity
        2. RLHF confidence threshold
        3. Recent retrieval success rate
        """

        # Simple heuristic: technical terms or specific paper requests
        technical_indicators = [
            'paper', 'research', 'study', 'findings',
            'method', 'algorithm', 'experiment', 'results'
        ]

        needs_retrieval = any(term in query.lower()
                              for term in technical_indicators)

        # Check RLHF threshold (higher value = retrieve more readily)
        confidence_threshold = self.persona_config['rlhf_thresholds']['retrieval_required']

        # If recent queries had low-quality responses, lean towards retrieval.
        # Raising the value matches _update_persona_thresholds(), which also
        # increases it after a poor response.
        if self.persona_config.get('recent_success_rate', 1.0) < 0.7:
            confidence_threshold = min(1.0, confidence_threshold * 1.25)

        return needs_retrieval or confidence_threshold > 0.5

    def generate_response(self, query: str, chat_history: list = None) -> dict:
        """
        Main orchestration logic:
        1. Decide if retrieval needed
        2. Retrieve context if necessary
        3. Generate response with persona coloring
        4. Grade output (RLHF scoring)
        """

        # Step 1: Retrieval decision
        needs_context = self.should_retrieve_context(query)

        context_docs = []
        if needs_context:
            context_docs = self.retriever.retrieve_context(query, limit=5)

        # Step 2: Format context for LLM
        context_str = self._format_context(context_docs)

        # Step 3: Generate with persona.
        # chat_history is expected as a list of {'role', 'content'} dicts, so
        # we use ollama.chat; the `context` argument of ollama.generate takes
        # a token context from a previous call, not chat messages.
        system_prompt = self._build_persona_prompt(context_str)
        messages = [{'role': 'system', 'content': system_prompt}]
        messages += chat_history or []
        messages.append({'role': 'user', 'content': query})

        response = ollama.chat(model=self.ollama_model, messages=messages)
        answer = response['message']['content']

        # Step 4: RLHF grading
        quality_grade = self._grade_response(query, answer, context_docs)

        # Update RLHF thresholds based on grade
        self._update_persona_thresholds(quality_grade)

        return {
            'response': answer,
            'context_used': context_docs,
            'quality_grade': quality_grade,
            'retrieval_method': context_docs[0]['retrieval_method'] if context_docs else None
        }

    def _format_context(self, docs: list) -> str:
        """Render retrieved papers as a numbered list for the system prompt"""
        return "\n\n".join(
            f"[{i}] {doc['title']} ({doc['year']})\n{doc['abstract']}"
            for i, doc in enumerate(docs, start=1)
        )

    def _build_persona_prompt(self, context: str) -> str:
        """
        Build system prompt from persona configuration.
        This is the 'coloring' step mentioned in the architecture.
        """
        base_template = self.persona_config['system_prompt_template']

        # Insert context if available
        if context:
            base_template += f"\n\nRelevant Research Context:\n{context}"

        # Add persona modifiers based on RLHF values
        formality = self.persona_config['rlhf_thresholds']['formality_level']
        if formality > 0.7:
            base_template += "\n\nUse academic, formal language with proper citations."
        else:
            base_template += "\n\nExplain concepts clearly and conversationally."

        return base_template

    def _grade_response(self, query: str, response: str, context: list) -> float:
        """
        RLHF grading: 0 (needs improvement) to 1 (excellent).
        In production, this would be human feedback, but we start with heuristics.
        """

        # Heuristic checks:
        # 1. Did we use retrieved context?
        used_context = any(
            doc['title'].lower() in response.lower()
            for doc in context
        ) if context else True

        # 2. Is response substantive (not too short)?
        is_substantive = len(response.split()) > 50

        # 3. Does response directly address query?
        query_terms = set(query.lower().split())
        response_terms = set(response.lower().split())
        # Guard against an empty query to avoid dividing by zero
        overlap = (len(query_terms & response_terms) / len(query_terms)
                   if query_terms else 0.0)

        # Weighted score
        score = (
            0.4 * float(used_context) +
            0.3 * float(is_substantive) +
            0.3 * overlap
        )

        return min(1.0, score)

    def _update_persona_thresholds(self, quality_grade: float):
        """
        Update RLHF thresholds based on response quality.
        This is the adaptive learning mechanism.
        """
        thresholds = self.persona_config['rlhf_thresholds']

        # If grade < 0.5, we need more context
        if quality_grade < 0.5:
            thresholds['retrieval_required'] += 0.05
        else:
            # Successful response, can relax threshold slightly
            thresholds['retrieval_required'] -= 0.02

        # Clamp values
        thresholds['retrieval_required'] = max(
            0.0,
            min(1.0, thresholds['retrieval_required'])
        )

        # Save updated config to the same file it was loaded from
        with open(self.persona_config_path, 'w') as f:
            json.dump(self.persona_config, f, indent=2)

Key Insight: The persona adapts over time. If vero-eval (which we'll integrate next) shows poor performance, these thresholds shift to require more evidence before responding.
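
The agent reads its settings from data/persona.json, which is not shown above. The following is a minimal sketch of a starting configuration containing only the keys the agent code reads; every value is an assumption chosen for illustration, and `write_default_persona` is a hypothetical helper.

```python
import json
from pathlib import Path

# Starting values are illustrative assumptions, not tuned settings
DEFAULT_PERSONA = {
    "system_prompt_template": (
        "You are a research assistant. Answer using the provided papers."
    ),
    "recent_success_rate": 1.0,
    "rlhf_thresholds": {
        "retrieval_required": 0.6,  # above 0.5 means retrieval is always on
        "formality_level": 0.8,     # above 0.7 selects the formal style
    },
}


def write_default_persona(path: Path) -> dict:
    """Write the default persona file, creating parent directories if needed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(DEFAULT_PERSONA, indent=2))
    return DEFAULT_PERSONA


# Usage: write_default_persona(Path("data/persona.json"))
```

Keeping the thresholds in a plain JSON file makes every adjustment visible in version control, which helps when you need to explain why the assistant's behavior changed.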

Part 5: Integrating vero-eval for Rigorous Testing

This is where the magic happens. vero-eval provides production-grade evaluation that goes far beyond simple accuracy metrics. It tests edge cases, persona stress scenarios, and real-world failure modes.

[Figure: vero-eval Testing Framework for AI Research Assistant]

Installing and Configuring vero-eval

bash
# Install vero-eval
pip install vero-eval

# Initialize evaluation directory
mkdir -p evaluation/datasets evaluation/results

Generating a Research-Specific Test Dataset

vero-eval can generate test datasets tailored to your domain:

python
# evaluation/generate_test_dataset.py

from vero.test_dataset_generator import generate_and_save
from pathlib import Path


def generate_research_test_dataset():
    """
    Generate challenging test queries for research assistant.
    vero-eval creates persona-based edge cases automatically.
    """

    # Point to your research papers directory
    data_path = Path('data/research_papers')

    # Define the use case
    use_case = """
    This is a research assistant that helps academics:
    - Find relevant papers on specific topics
    - Understand connections between research areas
    - Get summaries of complex papers
    - Discover citation networks
    - Answer technical questions about methodologies

    Edge cases to test:
    - Queries about very recent papers (after knowledge cutoff)
    - Multi-hop reasoning (papers that cite papers that discuss X)
    - Ambiguous author names
    - Requests for specific experimental results
    - Cross-domain queries (e.g., physics papers relevant to biology)
    """

    # Generate dataset with persona variations
    generate_and_save(
        data_path=str(data_path),
        usecase=use_case,
        save_path_dir='evaluation/datasets/research_assistant_v1',
        n_queries=150,  # Generate 150 test queries

        # Persona variations
        personas=[
            {
                'name': 'PhD Student',
                'characteristics': 'Detail-oriented, asks follow-up questions, wants methodology details'
            },
            {
                'name': 'Senior Researcher',
                'characteristics': 'Broad queries, interested in connections, asks about citations'
            },
            {
                'name': 'Industry Practitioner',
                'characteristics': 'Practical focus, wants applicable results, less theory'
            }
        ],

        # vero-eval will use Ollama for generation
        llm_provider='ollama',
        model_name='mistral'
    )

    print("✓ Generated test dataset with persona variations")
    print("  Check: evaluation/datasets/research_assistant_v1/")


if __name__ == "__main__":
    generate_research_test_dataset()

Run this:

bash
python evaluation/generate_test_dataset.py

This creates a JSON file with queries like:

json
{
  "query": "What papers discuss attention mechanisms in the context of graph neural networks published after 2022?",
  "persona": "Senior Researcher",
  "expected_characteristics": ["multi-hop", "temporal_constraint", "domain_crossing"],
  "ground_truth_chunk_ids": ["paper_47", "paper_89", "paper_102"],
  "complexity_score": 0.85
}
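
To make the retrieval metrics concrete, here is a hand computation for the sample query above. It is a plain-Python sketch that does not use vero-eval, and the retrieved list is invented purely for illustration.

```python
from typing import List, Set


def precision_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k retrieved IDs that are relevant."""
    if k <= 0:
        return 0.0
    return sum(1 for doc_id in retrieved[:k] if doc_id in relevant) / k


def recall_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant IDs that appear in the top-k retrieved IDs."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved: List[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant ID, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


ground_truth = {"paper_47", "paper_89", "paper_102"}
# Hypothetical retrieval output for the sample query
retrieved = ["paper_12", "paper_89", "paper_47", "paper_5", "paper_33"]

p5 = precision_at_k(retrieved, ground_truth, k=5)   # 2 of 5 retrieved are relevant
r5 = recall_at_k(retrieved, ground_truth, k=5)      # 2 of 3 relevant were found
rr = reciprocal_rank(retrieved, ground_truth)       # first hit is at rank 2
```

Averaging the reciprocal rank over all test queries gives the Mean Reciprocal Rank (MRR) listed in the architecture diagram.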

Running the Evaluation Suite

Now we test our system against this dataset:

python
# evaluation/run_evaluation.py

import json
import sys
from pathlib import Path

import numpy as np

from vero.evaluator import Evaluator
from vero.metrics import (
    PrecisionMetric, RecallMetric, SufficiencyMetric,
    FaithfulnessMetric, BERTScoreMetric, RougeMetric,
    MRRMetric, MAPMetric, NDCGMetric
)

# reasoning_agent.py lives in scripts/, so make that directory importable
sys.path.insert(0, str(Path(__file__).resolve().parent.parent / 'scripts'))
from reasoning_agent import PersonaReasoningAgent


def run_full_evaluation():
    """
    Run comprehensive evaluation using vero-eval framework.
    Tests both retrieval and generation quality.
    """

    # Initialize our system
    agent = PersonaReasoningAgent()

    # Load test dataset
    with open('evaluation/datasets/research_assistant_v1/queries.json') as f:
        test_queries = json.load(f)

    # Initialize vero-eval
    evaluator = Evaluator(
        test_dataset=test_queries,
        trace_db_path='evaluation/trace.db'  # Logs all queries
    )

    # Define evaluation metrics
    retrieval_metrics = [
        PrecisionMetric(k=5),
        RecallMetric(k=5),
        SufficiencyMetric(),  # Are retrieved docs sufficient to answer?
    ]

    generation_metrics = [
        FaithfulnessMetric(),  # Is response faithful to retrieved docs?
        BERTScoreMetric(),     # Semantic similarity to reference answers
        RougeMetric()          # Token overlap with references
    ]

    ranking_metrics = [
        MRRMetric(),   # Mean Reciprocal Rank
        MAPMetric(),   # Mean Average Precision
        NDCGMetric()   # Normalized Discounted Cumulative Gain
    ]

    results = {
        'retrieval': {},
        'generation': {},
        'ranking': {},
        'per_persona': {}
    }

    def record(category: str, metric, score: float):
        """Append a per-query score under the metric's class name"""
        results[category].setdefault(metric.__class__.__name__, []).append(score)

    # Run evaluation for each query
    for query_data in test_queries:
        query = query_data['query']
        persona = query_data['persona']
        ground_truth = query_data['ground_truth_chunk_ids']

        # Generate response using our system
        response_data = agent.generate_response(query)

        # Extract retrieved document IDs
        retrieved_ids = [
            doc.get('paper_id', doc['title'])
            for doc in response_data['context_used']
        ]

        # Log to vero-eval's trace database
        evaluator.log_query(
            query=query,
            retrieved_docs=retrieved_ids,
            generated_response=response_data['response'],
            metadata={'persona': persona}
        )

        # Evaluate retrieval
        for metric in retrieval_metrics:
            record('retrieval', metric, metric.compute(
                retrieved=retrieved_ids,
                relevant=ground_truth
            ))

        # Evaluate ranking (assumes the same compute() arguments as the
        # retrieval metrics above)
        for metric in ranking_metrics:
            record('ranking', metric, metric.compute(
                retrieved=retrieved_ids,
                relevant=ground_truth
            ))

        # Evaluate generation
        for metric in generation_metrics:
            record('generation', metric, metric.compute(
                generated=response_data['response'],
                reference=query_data.get('reference_answer', ''),
                context=response_data['context_used']
            ))

        # Track per-persona performance
        persona_scores = results['per_persona'].setdefault(
            persona, {'precision': [], 'faithfulness': []})
        persona_scores['precision'].append(
            results['retrieval']['PrecisionMetric'][-1]
        )
        persona_scores['faithfulness'].append(
            results['generation']['FaithfulnessMetric'][-1]
        )

    # Aggregate results
    for category in ['retrieval', 'generation', 'ranking']:
        for metric_name, scores in results[category].items():
            results[category][metric_name] = {
                'mean': sum(scores) / len(scores),
                'min': min(scores),
                'max': max(scores),
                'std': float(np.std(scores))  # plain float for json.dump
            }

    # Save results
    with open('evaluation/results/full_evaluation.json', 'w') as f:
        json.dump(results, f, indent=2)

    print("✓ Evaluation complete!")
    print(f"  Retrieval Precision@5: {results['retrieval']['PrecisionMetric']['mean']:.3f}")
    print(f"  Retrieval Recall@5: {results['retrieval']['RecallMetric']['mean']:.3f}")
    print(f"  Generation Faithfulness: {results['generation']['FaithfulnessMetric']['mean']:.3f}")

    return results


if __name__ == "__main__":
    results = run_full_evaluation()

Run the evaluation:

bash
python evaluation/run_evaluation.py

Generating Performance Reports

vero-eval includes a report generator:

python
from vero.report import ReportGenerator

# Generate comprehensive HTML report
generator = ReportGenerator(
    trace_db_path='evaluation/trace.db',
    results_path='evaluation/results/full_evaluation.json'
)

generator.generate_report(
    output_path='evaluation/results/performance_report.html',
    include_sections=[
        'executive_summary',
        'retrieval_analysis',
        'generation_analysis',
        'persona_breakdown',
        'failure_cases',
        'recommendations'
    ]
)

print("✓ Report generated: evaluation/results/performance_report.html")

This creates an interactive HTML report showing:

  • Overall metrics with confidence intervals
  • Per-persona performance breakdown
  • Failure case analysis (queries where system performed poorly)
  • Recommendations for improvement

Part 6: The RLHF Feedback Loop

Now we close the loop: use vero-eval results to update the persona's RLHF thresholds:

python
# evaluation/update_persona_from_results.py

import json

CITATION_RULE = (
    "\n\nIMPORTANT: Always cite specific papers when making claims. "
    "Do not speculate beyond what the retrieved papers state."
)


def update_persona_thresholds(evaluation_results: dict):
    """
    Analyze vero-eval results and adjust persona thresholds.
    This is the core RLHF mechanism.
    """

    # Load current persona config
    with open('data/persona.json') as f:
        persona_config = json.load(f)

    thresholds = persona_config['rlhf_thresholds']

    # Analyze retrieval performance
    retrieval_recall = evaluation_results['retrieval']['RecallMetric']['mean']

    if retrieval_recall < 0.6:
        # Low recall → need to retrieve more documents.
        # Default to 5 (the limit used by the agent) if the key is not set yet.
        thresholds['retrieval_limit'] = thresholds.get('retrieval_limit', 5) + 2
        thresholds['retrieval_required'] = min(
            1.0, thresholds['retrieval_required'] + 0.1)

        print("⚠️ Low recall detected. Increasing retrieval aggressiveness.")

    # Analyze generation faithfulness
    faithfulness = evaluation_results['generation']['FaithfulnessMetric']['mean']

    if faithfulness < 0.7:
        # Responses not faithful to sources → need stronger grounding
        thresholds['minimum_context_overlap'] = 0.4
        # Append the rule only once, so repeated runs do not bloat the prompt
        if CITATION_RULE not in persona_config['system_prompt_template']:
            persona_config['system_prompt_template'] += CITATION_RULE

        print("⚠️ Low faithfulness detected. Strengthening citation requirements.")

    # Per-persona adjustments
    for persona_name, metrics in evaluation_results['per_persona'].items():
        avg_precision = sum(metrics['precision']) / len(metrics['precision'])

        if avg_precision < 0.5:
            print(f"⚠️ {persona_name} persona underperforming (Precision: {avg_precision:.2f})")

            # Could adjust persona-specific prompts here
            # For now, log for manual review

    # Save updated config
    with open('data/persona.json', 'w') as f:
        json.dump(persona_config, f, indent=2)

    print("✓ Persona thresholds updated based on evaluation results")


# Usage after evaluation
if __name__ == "__main__":
    with open('evaluation/results/full_evaluation.json') as f:
        results = json.load(f)

    update_persona_thresholds(results)

The workflow becomes:

  1. Run system on test queries
  2. vero-eval measures performance
  3. Script analyzes metrics
  4. Persona thresholds adjust automatically
  5. Re-evaluate to confirm improvement
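
The five steps can be sketched as a bounded loop. This is an illustrative sketch, not part of vero-eval: `tune_until_stable`, `fake_evaluate`, and `fake_adjust` are hypothetical names, and the toy stand-ins replace the real evaluation run so the example executes on its own.

```python
from typing import Callable, Dict, List, Tuple


def tune_until_stable(
    evaluate: Callable[[Dict], float],
    adjust: Callable[[Dict, float], Dict],
    config: Dict,
    target: float = 0.6,
    max_rounds: int = 5,
) -> Tuple[Dict, List[float]]:
    """Evaluate, adjust, and re-evaluate until recall reaches `target`
    or `max_rounds` evaluations have been run."""
    history: List[float] = []
    for _ in range(max_rounds):
        recall = evaluate(config)
        history.append(recall)
        if recall >= target:
            break
        config = adjust(config, recall)
    return config, history


def fake_evaluate(config: Dict) -> float:
    # Toy model (assumption): recall grows with the retrieval limit
    return min(1.0, config["retrieval_limit"] / 10)


def fake_adjust(config: Dict, recall: float) -> Dict:
    # Same +2 step the threshold script applies on low recall
    return {**config, "retrieval_limit": config["retrieval_limit"] + 2}


final_config, history = tune_until_stable(
    fake_evaluate, fake_adjust, {"retrieval_limit": 3})
```

The `max_rounds` cap matters in practice: if a change never improves recall, the loop stops instead of raising the retrieval limit indefinitely.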

This is a feedback loop in the spirit of reinforcement learning from human feedback (RLHF), but guided by rigorous automated evaluation rather than ad-hoc human ratings.

Part 7: Integrating with the Frontend

Now we wire this into the Next.js chat interface. Update src/app/api/chat/route.ts:

typescript
import { NextRequest } from 'next/server'
import { spawn } from 'child_process'
import path from 'path'

// Shape of the JSON printed by the Python agent (see generate_response)
type AgentResult = { response: string; context_used: any[] }

export async function POST(request: NextRequest) {
  const { message, messages, graphRAG = true } = await request.json()

  if (!graphRAG) {
    // Regular chat without RAG (handleRegularChat is assumed to be
    // provided elsewhere in the project)
    return handleRegularChat(message, messages)
  }

  // Call our Python reasoning agent
  const agentPath = path.join(process.cwd(), 'scripts', 'reasoning_agent.py')

  const result = await new Promise<AgentResult>((resolve, reject) => {
    const pythonProcess = spawn('python3', [
      agentPath,
      'generate',
      JSON.stringify({ query: message, chat_history: messages })
    ])

    let stdout = ''
    let stderr = ''

    pythonProcess.stdout.on('data', (data) => {
      stdout += data.toString()
    })

    pythonProcess.stderr.on('data', (data) => {
      stderr += data.toString()
    })

    pythonProcess.on('close', (code) => {
      if (code === 0) {
        try {
          const parsed = JSON.parse(stdout)
          resolve(parsed)
        } catch (e) {
          reject(new Error(`Failed to parse response: ${e}`))
        }
      } else {
        reject(new Error(`Agent failed: ${stderr}`))
      }
    })
  })

  // Send the complete response back to the client as a single chunk
  const stream = new ReadableStream({
    start(controller) {
      // Send response with context metadata
      const formatted = `${result.response}\n\n---\n**Sources:**\n${
        result.context_used.map((doc, i) =>
          `[${i + 1}] ${doc.title} (${doc.year})`
        ).join('\n')
      }`

      controller.enqueue(new TextEncoder().encode(formatted))
      controller.close()
    }
  })

  return new Response(stream, {
    headers: {
      'Content-Type': 'text/plain; charset=utf-8',
    },
  })
}
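
The route above spawns the agent as `reasoning_agent.py generate '<json>'` and parses its stdout, so the Python script needs a matching command-line entry point. The following is a minimal sketch of one; `run_cli` is a hypothetical helper, and the `generate` callable is injected so the argument handling can be exercised without Neo4j or Ollama running.

```python
import json
import sys
from typing import Callable, List


def run_cli(argv: List[str], generate: Callable[..., dict]) -> str:
    """Handle `reasoning_agent.py generate '<json>'` and return the JSON
    string that should be printed to stdout."""
    if len(argv) != 3 or argv[1] != "generate":
        raise SystemExit("usage: reasoning_agent.py generate '<json payload>'")
    payload = json.loads(argv[2])
    result = generate(payload["query"], payload.get("chat_history"))
    return json.dumps(result)


# In scripts/reasoning_agent.py this might be wired up as (assumption):
#
# if __name__ == "__main__":
#     print(run_cli(sys.argv, PersonaReasoningAgent().generate_response))
```

Because the route calls `JSON.parse` on everything the process writes to stdout, any progress or debug output from the agent should go to stderr instead.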

Update the chat UI to show retrieval metadata:

typescript
// In src/components/Chat.tsx

{message.role === 'assistant' && message.context && (
  <div className="mt-2 text-xs text-muted-foreground">
    <details>
      <summary className="cursor-pointer hover:text-foreground">
        📚 {message.context.length} sources retrieved
      </summary>
      <ul className="mt-2 space-y-1">
        {message.context.map((doc, i) => (
          <li key={i} className="flex items-center gap-2">
            <span className="font-mono">
              {doc.retrieval_method === 'vector_search' ? '🔍' : '🔗'}
            </span>
            <span>{doc.title}</span>
            <span className="text-muted-foreground">
              (relevance: {(doc.relevance_score * 100).toFixed(0)}%)
            </span>
          </li>
        ))}
      </ul>
    </details>
  </div>
)}

Now users can see which papers were retrieved and how (vector search vs. citation traversal).

Part 8: Running the Complete System

Setup Script

Create setup.sh:

bash
#!/bin/bash
# Stop on the first failing step instead of continuing with a half-built system
set -euo pipefail

echo "🔬 Setting up Research Assistant GraphRAG System"

# 1. Install Python dependencies
echo "📦 Installing Python dependencies..."
pip install -r requirements.txt

# 2. Start Neo4j
echo "🗄️ Starting Neo4j..."
docker-compose up -d neo4j

# Wait for Neo4j to be ready
echo "⏳ Waiting for Neo4j..."
until curl -s http://localhost:7474 > /dev/null; do
  sleep 2
done
echo "✓ Neo4j ready"

# 3. Start Ollama
echo "🤖 Checking Ollama..."
if ! command -v ollama &> /dev/null; then
  echo "Please install Ollama from https://ollama.ai"
  exit 1
fi

# Start the server only if it is not already listening on its default port
if ! curl -s http://localhost:11434 > /dev/null; then
  ollama serve &
fi

# Poll until the server answers instead of relying on a fixed sleep
until curl -s http://localhost:11434 > /dev/null; do
  sleep 1
done

# Pull required models
ollama pull mistral
ollama pull nomic-embed-text

# 4. Initialize Neo4j graph schema
echo "📊 Initializing graph schema..."
python scripts/init_graph_schema.py

# 5. Ingest sample research papers
echo "📚 Ingesting sample papers..."
python scripts/ingest_research_data.py --directory data/sample_papers

# 6. Generate test dataset
echo "🧪 Generating evaluation dataset..."
python evaluation/generate_test_dataset.py

# 7. Run initial evaluation
echo "📈 Running initial evaluation..."
python evaluation/run_evaluation.py

# 8. Start Next.js frontend
echo "🌐 Starting frontend..."
npm install
npm run dev &

echo ""
echo "✅ Setup complete!"
echo ""
echo "🔗 Access points:"
echo " Frontend: http://localhost:3000"
echo " Neo4j Browser: http://localhost:7474"
echo " Evaluation Reports: evaluation/results/"
echo ""
echo "📖 Next steps:"
echo " 1. Add your research papers to data/research_papers/"
echo " 2. Run: python scripts/ingest_research_data.py --directory data/research_papers"
echo " 3. Chat with your research assistant at localhost:3000"
echo " 4. Check evaluation results in evaluation/results/"

Run it:

bash
chmod +x setup.sh
./setup.sh

Part 9: Practical Use Cases and Patterns

Use Case 1: Literature Review Assistant

python
# Example query patterns for literature reviews
from reasoning_agent import PersonaReasoningAgent

agent = PersonaReasoningAgent()

queries = [
    "What are the main approaches to attention mechanisms in transformers since 2020?",
    "Find papers that cite Vaswani et al. 2017 and discuss efficiency improvements",
    "What experimental setups are common in graph neural network papers?",
    "Compare the methodologies used in top-cited RAG papers"
]

for query in queries:
    response = agent.generate_response(query)

    # System automatically:
    # 1. Retrieves relevant papers using hybrid search
    # 2. Traverses citation network
    # 3. Formats response with proper attributions
    # 4. Logs everything to vero-eval trace DB

Use Case 2: Cross-Domain Research Discovery

python
# Finding connections between domains

query = """
Are there any techniques from computer vision that have been
successfully applied to natural language processing in the last 3 years?
"""

# The graph traversal will:
# 1. Find CV papers discussing specific techniques
# 2. Find NLP papers citing those CV papers
# 3. Identify the bridging concepts
# 4. Present a coherent narrative

response = agent.generate_response(query)

Use Case 3: Methodology Extraction

python
# Extracting specific methodological details

query = """
What evaluation metrics are most commonly used in papers about
few-shot learning for NLP tasks?
"""

# Behind the scenes:
# 1. Retrieve few-shot NLP papers
# 2. Extract methodology sections (using LLM)
# 3. Aggregate metrics across papers
# 4. Present frequency analysis

response = agent.generate_response(query)

Part 10: Measuring Success with vero-eval

After running the system for a while, check the vero-eval dashboard:

python
# evaluation/generate_dashboard.py

from vero.dashboard import create_dashboard
from vero.trace_db import TraceDB

# Load trace database
trace_db = TraceDB('evaluation/trace.db')

# Create interactive dashboard
create_dashboard(
    trace_db=trace_db,
    output_path='evaluation/dashboard.html',
    metrics=[
        'retrieval_precision',
        'retrieval_recall',
        'generation_faithfulness',
        'response_time',
        'context_sufficiency'
    ],
    groupby=['persona', 'query_complexity']
)

This generates an interactive Plotly dashboard showing:

  • Metric trends over time (is the system improving?)
  • Persona performance comparison (which user types are we serving well?)
  • Query complexity vs. accuracy (where do we struggle?)
  • Retrieval method effectiveness (vector vs. graph traversal success rates)
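
To make the first two dashboard metrics concrete, here is a minimal sketch of how retrieval precision and recall can be computed for a single query. It does not use vero-eval's API; the function name and the paper IDs are hypothetical and only illustrate the arithmetic behind the numbers.

```python
def retrieval_precision_recall(retrieved: list[str], relevant: set[str]) -> tuple[float, float]:
    """Precision = hits / retrieved papers; recall = hits / relevant papers."""
    unique_retrieved = set(retrieved)
    if not unique_retrieved or not relevant:
        return 0.0, 0.0
    hits = len(unique_retrieved & relevant)
    return hits / len(unique_retrieved), hits / len(relevant)


# Hypothetical example: 4 papers retrieved, 3 papers actually relevant, 2 in common
precision, recall = retrieval_precision_recall(
    retrieved=["p1", "p2", "p3", "p4"],
    relevant={"p1", "p3", "p5"},
)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Half of the retrieved papers were relevant (precision 0.50), and two of the three relevant papers were found (recall about 0.67).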

Advanced Patterns and Optimizations

Pattern 1: Caching Embeddings

For production, cache embeddings to avoid recomputation:

python
import hashlib
import pickle
from pathlib import Path

import ollama


class EmbeddingCache:
    def __init__(self, cache_dir: Path = Path('cache/embeddings')):
        self.cache_dir = cache_dir
        self.cache_dir.mkdir(parents=True, exist_ok=True)

    def get_embedding(self, text: str, model: str = 'nomic-embed-text') -> list[float]:
        # Hash the text to build the cache key (MD5 is fine here: not security-sensitive)
        cache_key = hashlib.md5(text.encode()).hexdigest()
        cache_path = self.cache_dir / f"{cache_key}_{model}.pkl"

        # Return the cached embedding if this text was seen before
        if cache_path.exists():
            with open(cache_path, 'rb') as f:
                return pickle.load(f)

        # Generate new embedding
        embedding = ollama.embeddings(model=model, prompt=text)['embedding']

        # Cache it
        with open(cache_path, 'wb') as f:
            pickle.dump(embedding, f)

        return embedding

Pattern 2: Batch Processing for Large Collections

When ingesting 1000+ papers:

python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def ingest_batch(papers: list[Path], batch_size: int = 10):
    """Process papers in batches to manage memory"""
    # Relies on driver, extract_paper_metadata and create_paper_node from the ingestion script

    for i in range(0, len(papers), batch_size):
        batch = papers[i:i+batch_size]

        # Extract metadata in parallel
        with ThreadPoolExecutor(max_workers=batch_size) as executor:
            metadata_list = list(executor.map(extract_paper_metadata, batch))

        # Insert into Neo4j in a single transaction
        with driver.session() as session:
            with session.begin_transaction() as tx:
                for metadata, pdf_path in zip(metadata_list, batch):
                    create_paper_node(tx, metadata, pdf_path)

                tx.commit()

        # Cap the counter so the last (partial) batch does not overshoot
        print(f"✓ Processed {min(i + batch_size, len(papers))}/{len(papers)} papers")

Pattern 3: Incremental Evaluation

Don't wait for a full evaluation run. Track metrics continuously:

python
from collections import deque


class ContinuousEvaluator:
    def __init__(self, alert_threshold: float = 0.6):
        self.alert_threshold = alert_threshold
        # Keep only the last 50 scores; older ones drop off automatically
        self.recent_scores = deque(maxlen=50)

    def evaluate_response(self, query: str, response: dict):
        # Quick evaluation on the fly
        score = self._quick_score(response)
        self.recent_scores.append(score)

        # Alert if the rolling average drops (needs at least 10 samples)
        if len(self.recent_scores) >= 10:
            avg = sum(self.recent_scores) / len(self.recent_scores)
            if avg < self.alert_threshold:
                self._send_alert(avg)

    def _quick_score(self, response: dict) -> float:
        # Lightweight scoring: 70% for having any context, 30% for response length
        has_context = len(response['context_used']) > 0
        response_length = len(response['response'].split())

        return 0.7 * has_context + 0.3 * min(1.0, response_length / 100)

    def _send_alert(self, avg: float):
        # Placeholder: replace with your own notification channel
        print(f"⚠️ Rolling quality score {avg:.2f} is below {self.alert_threshold}")
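
A quick worked example shows what the scoring formula rewards. This is a standalone sketch: `quick_score` repeats the same arithmetic as `_quick_score`, and the two response dicts are made up for illustration.

```python
def quick_score(response: dict) -> float:
    """Same arithmetic as _quick_score: 0.7 for context, up to 0.3 for length."""
    has_context = len(response["context_used"]) > 0
    word_count = len(response["response"].split())
    return 0.7 * has_context + 0.3 * min(1.0, word_count / 100)


# Hypothetical responses: a short grounded answer and a long ungrounded one
grounded = {"context_used": [{"title": "Example paper"}], "response": "word " * 40}
ungrounded = {"context_used": [], "response": "word " * 200}

print(round(quick_score(grounded), 2))    # 0.7 + 0.3 * 0.4
print(round(quick_score(ungrounded), 2))  # 0.0 + 0.3 * 1.0
```

The grounded 40-word answer scores about 0.82, while the 200-word answer without context tops out at 0.30. With the default threshold of 0.6, a run of responses without retrieved context will trigger the alert no matter how long they are.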

Troubleshooting Common Issues

Issue 1: Neo4j Connection Errors

python
# Test Neo4j connection
from neo4j import GraphDatabase


def test_connection():
    try:
        # The context manager closes the driver when the check is done
        with GraphDatabase.driver(
            "bolt://localhost:7687",
            auth=("neo4j", "research2025")
        ) as driver:
            driver.verify_connectivity()

            with driver.session() as session:
                record = session.run("RETURN 1 AS num").single()
                print(f"✓ Neo4j connection successful (got {record['num']})")

    except Exception as e:
        print(f"✗ Connection failed: {e}")
        print(" Make sure Neo4j is running: docker ps")


if __name__ == "__main__":
    test_connection()

Issue 2: Ollama Model Not Found

bash
# Check available models
ollama list

# Pull missing models
ollama pull mistral
ollama pull nomic-embed-text

# Verify they work
ollama run mistral "Test query"

Issue 3: Low Retrieval Scores

Check your embeddings:

python
# Verify embeddings are being generated correctly
from ingest_research_data import ResearchGraphBuilder

builder = ResearchGraphBuilder()

# Test on a sample paper
test_text = "Transformers are a type of neural network architecture..."
embedding = builder.graph_rag.generate_embedding(test_text)

# nomic-embed-text produces 768-dimensional vectors; the Neo4j vector index
# must be created with the same dimension or similarity search will fail
print(f"Embedding dimension: {len(embedding)}")  # Should be 768
print(f"Sample values: {embedding[:5]}")

Conclusion and Next Steps

You now have a production-ready Research Assistant with:

  • Local-first architecture (no API costs, full privacy)
  • Neo4j knowledge graph (papers, authors, concepts, citations)
  • Hybrid retrieval (vector similarity + graph traversal)
  • Persona-driven responses with RLHF adaptation
  • Comprehensive evaluation via vero-eval framework
  • Automated improvement through feedback loops

Recommended Next Steps:

  1. Expand the Dataset: Ingest your actual research papers

    bash
    python scripts/ingest_research_data.py --directory ~/Documents/Research
  2. Run Weekly Evaluations: Set up a cron job

    bash
    0 2 * * 0 cd /path/to/research-assistant && python evaluation/run_evaluation.py
  3. Fine-tune Personas: Create persona configs for different user types:

    • PhD Student persona (detail-oriented, wants methodology)
    • Senior Researcher persona (big picture, cross-domain)
    • Industry persona (practical applications)
  4. Integrate Additional Sources:

    • arXiv API for latest papers
    • Connected Papers for visualization
    • Semantic Scholar for citation data
  5. Scale Up:

    • Use a vector database (Pinecone, Weaviate) for 10K+ papers
    • Implement query result caching
    • Add paper summarization pipeline

Resources for Going Deeper

Production Deployment Checklist

Before deploying to production, ensure you've addressed:

python
1# deployment/production_checklist.py
2
3PRODUCTION_CHECKLIST = {
4 'Infrastructure': [
5 '☐ Neo4j running with persistent volumes',
6 '☐ Ollama configured with appropriate model cache',
7 '☐ Redis/Memcached for query result caching',
8 '☐ Load balancer for API endpoints',
9 '☐ CDN for static assets'
10 ],
11 'Security': [
12 '☐ API authentication implemented',
13 '☐ Rate limiting configured (per user/IP)',
14 '☐ Input sanitization for all user queries',
15 '☐ Neo4j credentials rotated and secured',
16 '☐ HTTPS enabled with valid certificates'
17 ],
18 'Monitoring': [
19 '☐ Prometheus metrics exported',
20 '☐ Grafana dashboards for system health',
21 '☐ vero-eval continuous evaluation running',
22 '☐ Error tracking (Sentry/Rollbar)',
23 '☐ Query latency monitoring'
24 ],
25 'Data Management': [
26 '☐ Automated backups of Neo4j database',
27 '☐ Embedding cache backup strategy',
28 '☐ Data retention policies defined',
29 '☐ GDPR compliance for user queries',
30 '☐ Paper metadata update pipeline'
31 ],
32 'Performance': [
33 '☐ Embedding generation batched/cached',
34 '☐ Neo4j indexes optimized',
35 '☐ Query result caching implemented',
36 '☐ Connection pooling configured',
37 '☐ Async processing for long-running queries'
38 ]
39}

Part 11: Advanced vero-eval Techniques

Now let's dive deeper into what makes vero-eval exceptional for production AI systems.

Stress Testing with Adversarial Queries

vero-eval can generate adversarial test cases that expose edge cases:

python
# evaluation/adversarial_testing.py

import json
import re

from vero.adversarial import AdversarialGenerator
from reasoning_agent import PersonaReasoningAgent


def run_adversarial_tests():
    """
    Generate adversarial queries designed to break the system.
    This reveals weaknesses before users find them.
    """

    agent = PersonaReasoningAgent()

    # Initialize adversarial generator
    # Note: load_valid_queries() is not defined in this guide; supply your own
    # loader that returns a list of known-good queries
    adv_gen = AdversarialGenerator(
        base_queries=load_valid_queries(),
        attack_types=[
            'jailbreak',            # Try to bypass safety guardrails
            'context_overflow',     # Queries requiring huge context
            'ambiguous_reference',  # "the paper mentioned earlier" without context
            'temporal_confusion',   # Mixing past/future tenses
            'multi_hop_complex',    # Require 3+ reasoning steps
            'contradictory',        # Ask for contradicting information
            'out_of_domain'         # Queries completely outside research
        ]
    )

    adversarial_queries = adv_gen.generate(n=50)

    failures = []

    for query_data in adversarial_queries:
        query = query_data['query']
        attack_type = query_data['attack_type']

        print(f"Testing: {attack_type} - {query[:60]}...")

        try:
            response = agent.generate_response(query)

            # Check for failure modes
            if len(response['response']) < 10:
                failures.append({
                    'query': query,
                    'attack_type': attack_type,
                    'failure_mode': 'empty_response'
                })

            # detect_hallucinations returns a list of flagged claims,
            # so any non-empty result counts as a failure
            elif detect_hallucinations(
                response['response'],
                response['context_used']
            ):
                failures.append({
                    'query': query,
                    'attack_type': attack_type,
                    'failure_mode': 'hallucination'
                })

        except Exception as e:
            failures.append({
                'query': query,
                'attack_type': attack_type,
                'failure_mode': 'exception',
                'error': str(e)
            })

    # Generate failure report
    with open('evaluation/results/adversarial_failures.json', 'w') as f:
        json.dump(failures, f, indent=2)

    # Use the actual number of generated queries rather than a hard-coded 50
    total = len(adversarial_queries)
    print(f"\n⚠️ Found {len(failures)} failure cases out of {total} adversarial queries")
    if total:
        print(f" Failure rate: {len(failures)/total*100:.1f}%")

    # Categorize failures
    failure_by_type = {}
    for failure in failures:
        attack_type = failure['attack_type']
        failure_by_type[attack_type] = failure_by_type.get(attack_type, 0) + 1

    print("\n📊 Failures by attack type:")
    for attack_type, count in sorted(failure_by_type.items(),
                                     key=lambda x: x[1],
                                     reverse=True):
        print(f" {attack_type}: {count}")

    return failures


def detect_hallucinations(response: str, context_docs: list) -> list:
    """
    Detect potential hallucinations by checking if claims in response
    are supported by retrieved context.
    """

    hallucinations = []

    # Extract claims from response (sentences making factual statements)
    claims = extract_claims(response)

    # Create context text corpus (tolerate documents without an abstract)
    context_text = "\n".join([doc.get('abstract') or '' for doc in context_docs])
    context_tokens = set(context_text.lower().split())

    for claim in claims:
        # Check if claim is substantiated by context
        # Use simple token overlap for now (could use entailment model)
        claim_tokens = set(claim.lower().split())
        if not claim_tokens:
            continue

        overlap = len(claim_tokens & context_tokens) / len(claim_tokens)

        if overlap < 0.3:  # Less than 30% overlap suggests hallucination
            hallucinations.append({
                'claim': claim,
                'overlap_score': overlap,
                'severity': 'high' if overlap < 0.1 else 'medium'
            })

    return hallucinations


def extract_claims(response: str) -> list[str]:
    """Extract factual claims from response."""
    # Simple heuristic: sentences containing "is", "are", "shows", "demonstrates", ...
    # Splitting on '.' is crude (it also breaks on decimals and abbreviations)
    sentences = response.split('.')

    claim_indicators = {'is', 'are', 'shows', 'demonstrates', 'found', 'reports'}

    claims = [
        sent.strip() for sent in sentences
        # Match whole words so that "this" does not count as "is"
        if claim_indicators & set(re.findall(r"[a-z]+", sent.lower()))
        and len(sent.split()) > 5  # Substantial claim
    ]

    return claims


if __name__ == "__main__":
    failures = run_adversarial_tests()
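
To see how the token-overlap check in `detect_hallucinations` behaves, here is a small standalone sketch of the same calculation. The function name `overlap_score` and the sample sentences are illustrative, not part of the script above.

```python
def overlap_score(claim: str, context: str) -> float:
    """Fraction of the claim's distinct tokens that also appear in the context."""
    claim_tokens = set(claim.lower().split())
    context_tokens = set(context.lower().split())
    if not claim_tokens:
        return 0.0
    return len(claim_tokens & context_tokens) / len(claim_tokens)


# Made-up retrieved context and two candidate claims
context = "graph attention networks improve node classification accuracy"
supported = "graph attention networks improve accuracy"
unsupported = "quantum annealing reduces training cost dramatically"

print(overlap_score(supported, context))    # every claim token is in the context
print(overlap_score(unsupported, context))  # no claim token is in the context
```

The supported claim scores 1.0 and the unsupported one 0.0, so only the second falls below the 0.3 cutoff. The check is purely lexical, though: a faithful paraphrase that uses different words would also score low, which is why the code comment suggests an entailment model as a later upgrade.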

Run this regularly:

bash
# Weekly adversarial testing (Mondays at 03:00)
0 3 * * 1 cd /path/to/research-assistant && python evaluation/adversarial_testing.py

Continuous Monitoring with vero-eval

Set up real-time quality monitoring:

python
# evaluation/continuous_monitor.py

import smtplib
import sys
from datetime import datetime, timedelta
from email.mime.text import MIMEText

from vero.monitor import QualityMonitor


class ProductionMonitor:
    def __init__(self, trace_db_path: str):
        self.monitor = QualityMonitor(trace_db_path)
        # Drop thresholds are absolute differences (0.15 = 15 percentage points)
        self.alert_thresholds = {
            'precision_drop': 0.15,    # Alert if precision drops by more than 0.15
            'latency_spike': 2.0,      # Alert if latency > 2 seconds
            'error_rate': 0.05,        # Alert if error rate > 5%
            'faithfulness_drop': 0.20  # Alert if faithfulness drops by more than 0.20
        }

    def check_system_health(self):
        """
        Run every hour to check if system performance is degrading.
        """

        now = datetime.now()

        # Get metrics for last 24 hours
        recent_metrics = self.monitor.get_metrics(
            start_time=now - timedelta(hours=24),
            end_time=now
        )

        # Get baseline metrics (previous six days, excluding the last 24 hours)
        baseline_metrics = self.monitor.get_metrics(
            start_time=now - timedelta(days=7),
            end_time=now - timedelta(days=1)
        )

        alerts = []

        # Check for precision drop
        precision_drop = (
            baseline_metrics['precision'] - recent_metrics['precision']
        )
        if precision_drop > self.alert_thresholds['precision_drop']:
            alerts.append({
                'severity': 'high',
                'metric': 'precision',
                'message': f"Precision dropped by {precision_drop:.2%}",
                'baseline': baseline_metrics['precision'],
                'current': recent_metrics['precision']
            })

        # Check for latency spikes
        if recent_metrics['avg_latency'] > self.alert_thresholds['latency_spike']:
            alerts.append({
                'severity': 'medium',
                'metric': 'latency',
                'message': f"Average latency: {recent_metrics['avg_latency']:.2f}s",
                'baseline': baseline_metrics['avg_latency'],
                'current': recent_metrics['avg_latency']
            })

        # Check error rate
        if recent_metrics['error_rate'] > self.alert_thresholds['error_rate']:
            alerts.append({
                'severity': 'critical',
                'metric': 'error_rate',
                'message': f"Error rate: {recent_metrics['error_rate']:.2%}",
                'baseline': baseline_metrics['error_rate'],
                'current': recent_metrics['error_rate']
            })

        # Check faithfulness
        faithfulness_drop = (
            baseline_metrics['faithfulness'] - recent_metrics['faithfulness']
        )
        if faithfulness_drop > self.alert_thresholds['faithfulness_drop']:
            alerts.append({
                'severity': 'high',
                'metric': 'faithfulness',
                'message': f"Faithfulness dropped by {faithfulness_drop:.2%}",
                'baseline': baseline_metrics['faithfulness'],
                'current': recent_metrics['faithfulness']
            })

        # Send alerts if any
        if alerts:
            self.send_alerts(alerts)

        # Log to monitoring system
        self.log_health_check(recent_metrics, alerts)

        return alerts

    def send_alerts(self, alerts: list):
        """Send alerts via email/Slack/PagerDuty"""

        critical_alerts = [a for a in alerts if a['severity'] == 'critical']

        if critical_alerts:
            # Page on-call engineer
            self.page_oncall(critical_alerts)

        # Email summary
        email_body = self.format_alert_email(alerts)
        self.send_email(
            to='team@example.com',
            subject=f"🚨 Research Assistant Quality Alert - {len(alerts)} issues",
            body=email_body
        )

    def page_oncall(self, alerts: list):
        """Placeholder: wire this to your paging tool of choice."""
        for alert in alerts:
            print(f"PAGE: [{alert['metric']}] {alert['message']}")

    def send_email(self, to: str, subject: str, body: str):
        """Send an HTML email. Assumes an SMTP relay on localhost; adjust as needed."""
        msg = MIMEText(body, 'html')
        msg['Subject'] = subject
        msg['From'] = 'research-assistant@example.com'  # placeholder sender
        msg['To'] = to

        with smtplib.SMTP('localhost') as smtp:
            smtp.send_message(msg)

    def format_alert_email(self, alerts: list) -> str:
        """Format alerts as HTML email"""

        html = """
        <h2>Research Assistant Quality Alerts</h2>
        <p>The following performance degradations were detected:</p>
        <table border="1" cellpadding="10">
        <tr>
        <th>Severity</th>
        <th>Metric</th>
        <th>Baseline</th>
        <th>Current</th>
        <th>Message</th>
        </tr>
        """

        for alert in alerts:
            severity_color = {
                'critical': '#ff0000',
                'high': '#ff6600',
                'medium': '#ffaa00'
            }[alert['severity']]

            html += f"""
            <tr>
            <td style="background-color: {severity_color}; color: white;">
            {alert['severity'].upper()}
            </td>
            <td>{alert['metric']}</td>
            <td>{alert['baseline']:.3f}</td>
            <td>{alert['current']:.3f}</td>
            <td>{alert['message']}</td>
            </tr>
            """

        html += """
        </table>
        <p>
        <a href="http://your-monitoring-url/dashboard">View Full Dashboard</a>
        </p>
        """

        return html

    def log_health_check(self, metrics: dict, alerts: list):
        """Log to your monitoring system (Prometheus/Datadog/etc)"""

        # Example: Push to Prometheus Pushgateway
        # In production, you'd use actual client library

        print(f"[{datetime.now()}] Health Check:")
        print(f" Precision: {metrics['precision']:.3f}")
        print(f" Recall: {metrics['recall']:.3f}")
        print(f" Faithfulness: {metrics['faithfulness']:.3f}")
        print(f" Avg Latency: {metrics['avg_latency']:.2f}s")
        print(f" Error Rate: {metrics['error_rate']:.2%}")

        if alerts:
            print(f" ⚠️ {len(alerts)} alerts triggered")
        else:
            print(" ✓ All metrics within normal range")


# Run as scheduled job
if __name__ == "__main__":
    monitor = ProductionMonitor('evaluation/trace.db')
    alerts = monitor.check_system_health()

    if alerts:
        sys.exit(1)  # Non-zero exit code for alerting systems

Set up as cron job:

bash
# Check every hour, on the hour
0 * * * * cd /path/to/research-assistant && python evaluation/continuous_monitor.py
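
The core of the health check is a comparison of recent metrics against a baseline. The sketch below isolates that step with plain dicts so it can be tried without a trace database; `find_regressions` and the metric values are hypothetical.

```python
def find_regressions(baseline: dict, recent: dict, max_drop: dict) -> list[str]:
    """Return the metrics whose absolute drop from baseline exceeds its limit."""
    return [
        name
        for name, limit in max_drop.items()
        if baseline[name] - recent[name] > limit
    ]


# Made-up numbers: precision fell by 0.18, faithfulness by 0.03
baseline = {"precision": 0.82, "faithfulness": 0.91}
recent = {"precision": 0.64, "faithfulness": 0.88}

regressions = find_regressions(
    baseline,
    recent,
    max_drop={"precision": 0.15, "faithfulness": 0.20},
)
print(regressions)
```

Only precision is flagged here. Note that the limits are absolute differences: a threshold of 0.15 means 15 percentage points, not a 15% relative decline from the baseline value.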

Part 12: Scaling Beyond 10K Papers

As your research collection grows, you'll need to optimize:

1. Migrate to a Dedicated Vector Database

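Neo4j's vector index may become a bottleneck once the collection grows well past 10K papers. If profiling confirms that, you can move vector search to a specialized vector DB (keeping in mind that a hosted service such as Pinecone trades away part of the local-first setup):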

python
# scripts/migrate_to_pinecone.py
# Note: written against the pinecone-client v2 API (pinecone.init / pinecone.Index);
# newer client versions expose a different interface.

import os

import ollama
import pinecone
from neo4j import GraphDatabase


def migrate_embeddings_to_pinecone():
    """
    Migrate embeddings from Neo4j to Pinecone for faster retrieval.
    Keep Neo4j for graph relationships, Pinecone for vector search.
    """

    # Initialize Pinecone
    pinecone.init(
        api_key=os.getenv("PINECONE_API_KEY"),
        environment="us-west1-gcp"
    )

    # Create index if it doesn't exist
    if "research-papers" not in pinecone.list_indexes():
        pinecone.create_index(
            name="research-papers",
            dimension=768,  # nomic-embed-text output size
            metric="cosine",
            pods=2,
            replicas=1,
            pod_type="p1.x1"
        )

    index = pinecone.Index("research-papers")

    # Extract embeddings from Neo4j
    driver = GraphDatabase.driver(
        "bolt://localhost:7687",
        auth=("neo4j", "research2025")
    )

    with driver.session() as session:
        # Get papers in batches
        batch_size = 100
        offset = 0
        migrated = 0

        while True:
            # Skip papers without an embedding and add ID(p) as a tie-breaker
            # so SKIP/LIMIT paging is stable across batches
            papers = session.run("""
                MATCH (p:Paper)
                WHERE p.abstract_embedding IS NOT NULL
                RETURN p.title AS title,
                       p.abstract AS abstract,
                       p.abstract_embedding AS embedding,
                       p.year AS year,
                       ID(p) AS neo4j_id
                ORDER BY p.year DESC, ID(p)
                SKIP $offset
                LIMIT $batch_size
                """,
                offset=offset,
                batch_size=batch_size
            ).data()

            if not papers:
                break

            # Prepare vectors for Pinecone
            vectors = []
            for paper in papers:
                vectors.append({
                    'id': str(paper['neo4j_id']),
                    'values': paper['embedding'],
                    'metadata': {
                        'title': paper['title'],
                        'abstract': (paper['abstract'] or '')[:500],  # Truncate
                        'year': paper['year'],
                        'neo4j_id': paper['neo4j_id']
                    }
                })

            # Upsert to Pinecone
            index.upsert(vectors=vectors, namespace="papers")

            migrated += len(papers)
            print(f"✓ Migrated {migrated} papers")
            offset += batch_size

    driver.close()
    print(f"\n✅ Migration complete! {migrated} papers in Pinecone")


# Update retriever to use Pinecone
class HybridRetrieverWithPinecone:
    def __init__(self, neo4j_driver, pinecone_index_name="research-papers"):
        self.neo4j_driver = neo4j_driver
        self.pinecone_index = pinecone.Index(pinecone_index_name)

    def retrieve_context(self, query: str, limit: int = 5) -> list[dict]:
        """Hybrid retrieval using Pinecone + Neo4j graph"""

        # 1. Vector search with Pinecone (fast!)
        query_embedding = ollama.embeddings(
            model='nomic-embed-text',
            prompt=query
        )['embedding']

        pinecone_results = self.pinecone_index.query(
            vector=query_embedding,
            top_k=limit * 2,
            include_metadata=True,
            namespace="papers"
        )

        # 2. Get Neo4j IDs from Pinecone results
        neo4j_ids = [
            int(match['metadata']['neo4j_id'])
            for match in pinecone_results['matches']
        ]

        # 3. Enrich with graph relationships from Neo4j
        with self.neo4j_driver.session() as session:
            enriched = session.run("""
                UNWIND $neo4j_ids AS paper_id
                MATCH (p:Paper) WHERE ID(p) = paper_id
                OPTIONAL MATCH (p)<-[:AUTHORED]-(a:Author)
                OPTIONAL MATCH (p)-[:DISCUSSES]->(c:Concept)
                OPTIONAL MATCH (p)-[:CITES]->(cited:Paper)

                RETURN
                    ID(p) AS neo4j_id,
                    p.title AS title,
                    p.abstract AS abstract,
                    p.year AS year,
                    collect(DISTINCT a.name) AS authors,
                    collect(DISTINCT c.name) AS concepts,
                    collect(DISTINCT cited.title) AS citations
                """,
                neo4j_ids=neo4j_ids
            ).data()

        # Index graph rows by ID: a paper missing from Neo4j drops its row,
        # so joining by list position would pair scores with the wrong papers
        graph_by_id = {row['neo4j_id']: row for row in enriched}

        # 4. Combine Pinecone scores with Neo4j metadata
        results = []
        for match in pinecone_results['matches']:
            neo4j_data = graph_by_id.get(int(match['metadata']['neo4j_id']), {})

            results.append({
                'title': neo4j_data.get('title', match['metadata']['title']),
                'abstract': neo4j_data.get('abstract', match['metadata']['abstract']),
                'year': neo4j_data.get('year', match['metadata']['year']),
                'authors': neo4j_data.get('authors', []),
                'concepts': neo4j_data.get('concepts', []),
                'citations': neo4j_data.get('citations', []),
                'relevance_score': match['score'],
                'retrieval_method': 'pinecone_vector_search'
            })

        return results[:limit]

Benefits of this architecture:

  • Pinecone handles 10M+ vectors easily
  • Neo4j focuses on graph relationships (citations, authorship)
  • Best of both worlds: fast vector search + rich graph traversal
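
The step that ties the two stores together is the join between vector hits and graph rows. The following standalone sketch shows one way to do that join by paper ID, using plain dicts in place of real Pinecone and Neo4j results; the function name and sample data are hypothetical.

```python
def merge_hits_with_graph(matches: list[dict], graph_rows: list[dict], limit: int) -> list[dict]:
    """Join vector hits to graph rows by ID, keeping the vector-search ranking."""
    rows_by_id = {row["neo4j_id"]: row for row in graph_rows}
    merged = []
    for match in matches:
        meta = match["metadata"]
        row = rows_by_id.get(meta["neo4j_id"], {})  # empty if the node is gone
        merged.append({
            "title": row.get("title", meta["title"]),
            "authors": row.get("authors", []),
            "relevance_score": match["score"],
        })
    return merged[:limit]


# Made-up results: the graph has no row for paper 7 (e.g. it was deleted)
matches = [
    {"score": 0.91, "metadata": {"neo4j_id": 7, "title": "Paper A"}},
    {"score": 0.84, "metadata": {"neo4j_id": 3, "title": "Paper B"}},
]
graph_rows = [{"neo4j_id": 3, "title": "Paper B", "authors": ["Author B"]}]

merged = merge_hits_with_graph(matches, graph_rows, limit=5)
print(merged)
```

Joining by ID rather than by list position matters when the two stores drift apart: here "Paper A" keeps its own score and falls back to the vector metadata, instead of silently inheriting the authors of "Paper B".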

2. Implement Query Result Caching

python
# lib/query_cache.py

import hashlib
import json
from datetime import timedelta

import redis

from reasoning_agent import PersonaReasoningAgent


class QueryCache:
    def __init__(self, redis_url: str = "redis://localhost:6379"):
        self.redis = redis.from_url(redis_url)
        self.ttl = timedelta(hours=24)  # Cache for 24 hours

    def get_cached_response(self, query: str, persona_config: dict) -> dict | None:
        """
        Check if we have a cached response for this query+persona combination.
        """

        # Create cache key from query + persona config
        cache_key = self._create_cache_key(query, persona_config)

        cached = self.redis.get(cache_key)
        if cached:
            print(f"✓ Cache hit for query: {query[:50]}...")
            return json.loads(cached)

        return None

    def cache_response(self, query: str, persona_config: dict, response: dict):
        """Store response in cache"""

        cache_key = self._create_cache_key(query, persona_config)

        self.redis.setex(
            cache_key,
            self.ttl,
            json.dumps(response)
        )

    def _create_cache_key(self, query: str, persona_config: dict) -> str:
        """Create deterministic cache key"""

        # Include relevant persona config aspects
        # (sort_keys makes the hash independent of dict ordering)
        persona_hash = hashlib.md5(
            json.dumps(persona_config, sort_keys=True).encode()
        ).hexdigest()

        query_hash = hashlib.md5(query.encode()).hexdigest()

        return f"query_cache:{query_hash}:{persona_hash}"

    def invalidate_cache(self):
        """Invalidate all cached queries (e.g., after persona update)"""

        # scan_iter walks the keyspace incrementally instead of blocking Redis like KEYS
        keys = list(self.redis.scan_iter(match="query_cache:*"))
        if keys:
            self.redis.delete(*keys)
        print(f"✓ Invalidated {len(keys)} cached queries")


# Integrate into reasoning agent
class CachedReasoningAgent(PersonaReasoningAgent):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.cache = QueryCache()

    def generate_response(self, query: str, chat_history: list = None) -> dict:
        """Generate response with caching"""

        # The cache key ignores chat history, so only use the cache for
        # standalone queries; follow-ups depend on the conversation so far
        use_cache = not chat_history

        # Check cache first
        if use_cache:
            cached = self.cache.get_cached_response(query, self.persona_config)
            if cached:
                return cached

        # Generate fresh response
        response = super().generate_response(query, chat_history)

        # Cache if quality is good (treat a missing grade as not cacheable)
        if use_cache and response.get('quality_grade', 0) > 0.7:
            self.cache.cache_response(query, self.persona_config, response)

        return response

3. Batch Embedding Generation

When ingesting large collections:

python
# scripts/batch_embedding_generator.py

import math
import time
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

import ollama


class BatchEmbeddingGenerator:
    def __init__(self, model: str = 'nomic-embed-text', max_workers: int = 4):
        self.model = model
        self.max_workers = max_workers
        self.rate_limit_delay = 0.1  # 100ms between requests

    def generate_embeddings_batch(self, texts: list[str]) -> list[list[float]]:
        """
        Generate embeddings for multiple texts in parallel with rate limiting.
        """

        embeddings = []

        with ThreadPoolExecutor(max_workers=self.max_workers) as executor:
            # Submit all tasks
            futures = []
            for i, text in enumerate(texts):
                future = executor.submit(self._generate_single, text, i)
                futures.append(future)

                # Rate limiting
                time.sleep(self.rate_limit_delay)

            # Collect results
            for future in futures:
                embedding, index = future.result()
                embeddings.append((index, embedding))

        # Sort by original index
        embeddings.sort(key=lambda x: x[0])

        return [emb for _, emb in embeddings]

    def _generate_single(self, text: str, index: int) -> tuple[list[float], int]:
        """Generate single embedding with retry logic"""

        max_retries = 3
        for attempt in range(max_retries):
            try:
                response = ollama.embeddings(
                    model=self.model,
                    # Rough character-level guard; the model's limit is in tokens
                    prompt=text[:8192]
                )
                return response['embedding'], index

            except Exception as e:
                if attempt == max_retries - 1:
                    raise

                print(f"⚠️ Retry {attempt+1}/{max_retries} for text {index}: {e}")
                time.sleep(2 ** attempt)  # Exponential backoff


# Use in ingestion pipeline
# Relies on neo4j_driver, extract_paper_metadata and create_paper_node
# from the ingestion script
def ingest_large_collection(papers: list[Path]):
    """Efficiently ingest 1000+ papers"""

    generator = BatchEmbeddingGenerator(max_workers=8)

    # Process in batches of 50
    batch_size = 50
    total_batches = math.ceil(len(papers) / batch_size)

    for i in range(0, len(papers), batch_size):
        batch = papers[i:i+batch_size]
        batch_number = i // batch_size + 1

        print(f"Processing batch {batch_number}/{total_batches}")

        # Extract abstracts
        abstracts = []
        metadata_list = []
        for paper_path in batch:
            metadata = extract_paper_metadata(paper_path)
            abstracts.append(metadata['abstract'])
            metadata_list.append(metadata)

        # Generate embeddings in parallel
        embeddings = generator.generate_embeddings_batch(abstracts)

        # Insert into database
        with neo4j_driver.session() as session:
            for metadata, embedding in zip(metadata_list, embeddings):
                metadata['abstract_embedding'] = embedding
                create_paper_node(session, metadata)

        print(f"✓ Ingested batch {batch_number}")

Part 13: Real-World Production Case Study

Let's walk through a complete example from a hypothetical research lab:

Scenario: Computational Biology Research Lab

Requirements:

  • 5,000 existing papers in their collection
  • Weekly updates with new publications
  • 15 active researchers with different expertise levels
  • Need to find cross-domain connections (CS ↔ Biology)
  • High precision required (wrong papers waste researcher time)

Implementation:

python
# config/bio_lab_config.py

RESEARCH_LAB_CONFIG = {
    'name': 'Computational Biology Lab',
    'paper_sources': [
        'local_collection',  # Existing 5K papers
        'pubmed_api',        # Weekly updates
        'biorxiv_api',       # Preprints
        'arxiv_bio'          # CS bio papers
    ],
    'personas': {
        'wet_lab_biologist': {
            'description': 'Bench scientists with limited CS background',
            'rlhf_thresholds': {
                'technical_detail': 0.3,   # Less technical jargon
                'methodology_depth': 0.8,  # High experimental detail
                'formality': 0.5
            },
            'preferred_sources': ['Nature', 'Cell', 'Science']
        },
        'computational_biologist': {
            'description': 'Hybrid CS/Bio expertise',
            'rlhf_thresholds': {
                'technical_detail': 0.8,   # Can handle complexity
                'methodology_depth': 0.9,  # Wants algorithm details
                'formality': 0.7
            },
            'preferred_sources': ['Nature Methods', 'Bioinformatics', 'PLOS Comp Bio']
        },
        'pi_researcher': {
            'description': 'Principal investigator, needs big picture',
            'rlhf_thresholds': {
                'technical_detail': 0.5,   # Balanced
                'methodology_depth': 0.4,  # Focus on conclusions
                'formality': 0.9           # Very formal
            },
            'preferred_sources': ['High-impact journals', 'Review articles']
        }
    },
    'quality_requirements': {
        'min_precision': 0.85,     # Must retrieve >85% relevant papers
        'min_faithfulness': 0.90,  # Responses must be 90% faithful to sources
        'max_latency': 3.0         # 3 second max response time
    }
}
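The configuration stores `rlhf_thresholds` as numbers, but this section does not show how those numbers reach the model. One plausible approach, sketched here purely as an assumption, is to bucket each threshold and translate it into a plain-language instruction for the system prompt. The names `band` and `style_instructions`, the 0.4 and 0.7 cut-offs, and the instruction wording are all hypothetical.

```python
# Hypothetical sketch: turn persona thresholds into prompt instructions.
# The cut-offs (0.4, 0.7) and the wording are assumptions, not part of
# the original configuration.

def band(value: float) -> str:
    """Bucket a 0-1 threshold into 'low', 'medium', or 'high'."""
    if value < 0.4:
        return "low"
    if value < 0.7:
        return "medium"
    return "high"


INSTRUCTIONS = {
    "technical_detail": {
        "low": "Avoid jargon and define any technical term you use.",
        "medium": "Use technical terms where they help, with brief context.",
        "high": "Use precise technical language without simplification.",
    },
    "methodology_depth": {
        "low": "Focus on conclusions; mention methods only briefly.",
        "medium": "Summarize the key methods behind each finding.",
        "high": "Describe methods and experimental details thoroughly.",
    },
    "formality": {
        "low": "Write in a relaxed, conversational tone.",
        "medium": "Write in a neutral, professional tone.",
        "high": "Write in a formal, academic tone.",
    },
}


def style_instructions(thresholds: dict) -> list:
    """Map each known threshold to the instruction for its band."""
    return [
        INSTRUCTIONS[name][band(value)]
        for name, value in thresholds.items()
        if name in INSTRUCTIONS
    ]


# Thresholds copied from the wet_lab_biologist persona above.
wet_lab = {"technical_detail": 0.3, "methodology_depth": 0.8, "formality": 0.5}
for line in style_instructions(wet_lab):
    print(line)
```

Keeping the mapping in a table rather than in prompt text makes it easy to review what each persona setting actually asks the model to do.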

Setup Script:

bash
#!/bin/bash
# setup_bio_lab.sh

# Stop at the first failed step so a failed ingest or evaluation never reaches the deploy.
set -euo pipefail

echo "🧬 Setting up Computational Biology Research Assistant"

# 1. Ingest existing collection
echo "📚 Ingesting 5,000 existing papers..."
python scripts/ingest_research_data.py \
    --directory /data/lab_papers \
    --batch-size 50 \
    --parallel-workers 8

# 2. Set up automated paper updates
echo "📰 Configuring automated updates..."
python scripts/setup_paper_updates.py \
    --sources pubmed,biorxiv,arxiv \
    --schedule daily \
    --filter "computational biology OR bioinformatics"

# 3. Generate persona-specific test datasets
echo "🧪 Generating evaluation datasets..."
python evaluation/generate_test_dataset.py \
    --personas wet_lab,computational,pi \
    --queries-per-persona 50

# 4. Run initial evaluation
echo "📊 Running baseline evaluation..."
python evaluation/run_evaluation.py \
    --config config/bio_lab_config.py

# 5. Deploy to production
echo "🚀 Deploying to production..."
docker-compose -f docker-compose.bio-lab.yml up -d

echo "✅ Setup complete!"
echo "   Dashboard: http://lab-research-assistant.local"
echo "   Monitoring: http://lab-research-assistant.local/metrics"
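The setup script runs the baseline evaluation and then deploys, but nothing in it compares the results against `quality_requirements`. A small gate between steps 4 and 5 could do that. The sketch below is an assumption about how such a gate might look: the `metrics` dictionary and its key names (`precision`, `faithfulness`, `latency`) are hypothetical, since vero-eval's actual output format is not shown in this guide.

```python
# Hypothetical quality gate. The requirement values come from the lab
# config; the shape of `metrics` is an assumption.

QUALITY_REQUIREMENTS = {
    "min_precision": 0.85,
    "min_faithfulness": 0.90,
    "max_latency": 3.0,
}


def failed_requirements(metrics: dict, requirements: dict) -> list:
    """Return one message for every requirement the metrics miss."""
    failures = []
    for key, limit in requirements.items():
        # "min_precision" -> kind "min", metric "precision"
        kind, _, metric = key.partition("_")
        value = metrics[metric]
        if kind == "min" and value < limit:
            failures.append(f"{metric}: {value:.2f} is below the minimum {limit:.2f}")
        elif kind == "max" and value > limit:
            failures.append(f"{metric}: {value:.2f} is above the maximum {limit:.2f}")
    return failures


# Illustrative numbers only, not measured results.
example_metrics = {"precision": 0.88, "faithfulness": 0.87, "latency": 2.4}
problems = failed_requirements(example_metrics, QUALITY_REQUIREMENTS)
for problem in problems:
    print("FAIL", problem)
```

A wrapper around this function could exit with a non-zero status whenever the list is non-empty, giving the shell script something to check before it deploys.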

Weekly Evaluation Report Email:

python
# scripts/weekly_report.py

import json
import os
import smtplib
from email.mime.image import MIMEImage
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

import matplotlib.pyplot as plt
from vero.report import ReportGenerator

RESULTS_PATH = 'evaluation/results/weekly.json'
PLOT_PATH = 'evaluation/results/weekly_report.png'
PRECISION_TARGET = 0.85  # min_precision from the lab config


def target_status(value, target=PRECISION_TARGET):
    """Label a persona's precision against the target."""
    return '✓' if value > target else '⚠️ Below target'


def generate_weekly_report():
    """
    Automated weekly report sent to PI and lab members.
    """

    # Generate vero-eval report
    generator = ReportGenerator(
        trace_db_path='evaluation/trace.db',
        results_path=RESULTS_PATH
    )

    # Assumption: weekly.json holds the per-week series under 'weekly_data'
    # and this week's scalar values under 'summary'. Adapt these keys to the
    # schema your evaluation run actually writes.
    with open(RESULTS_PATH) as f:
        results = json.load(f)
    weekly_data = results['weekly_data']
    s = results['summary']

    # Week-over-week changes shown in the metrics table
    change = s['current_precision'] - s['last_precision']
    faith_change = s['current_faithfulness'] - s['last_faithfulness']
    latency_change = s['current_latency'] - s['last_latency']
    queries_change = s['current_queries'] - s['last_queries']
    issues_html = ''.join(f'<li>{issue}</li>' for issue in s['issues'])

    # Create visualizations
    fig, axes = plt.subplots(2, 2, figsize=(12, 10))

    # 1. Precision trends by persona
    axes[0, 0].plot(weekly_data['wet_lab_precision'], label='Wet Lab', marker='o')
    axes[0, 0].plot(weekly_data['computational_precision'], label='Computational', marker='s')
    axes[0, 0].plot(weekly_data['pi_precision'], label='PI', marker='^')
    axes[0, 0].set_title('Retrieval Precision by Persona')
    axes[0, 0].set_xlabel('Week')
    axes[0, 0].set_ylabel('Precision@5')
    axes[0, 0].legend()
    axes[0, 0].grid(True, alpha=0.3)

    # 2. Faithfulness over time
    axes[0, 1].plot(weekly_data['faithfulness'], color='green', marker='o')
    axes[0, 1].axhline(y=0.90, color='r', linestyle='--', label='Target (90%)')
    axes[0, 1].set_title('Response Faithfulness')
    axes[0, 1].set_xlabel('Week')
    axes[0, 1].set_ylabel('Faithfulness Score')
    axes[0, 1].legend()
    axes[0, 1].grid(True, alpha=0.3)

    # 3. Query latency distribution
    axes[1, 0].hist(weekly_data['latencies'], bins=30, edgecolor='black')
    axes[1, 0].axvline(x=3.0, color='r', linestyle='--', label='Max Latency (3s)')
    axes[1, 0].set_title('Query Latency Distribution')
    axes[1, 0].set_xlabel('Latency (seconds)')
    axes[1, 0].set_ylabel('Frequency')
    axes[1, 0].legend()

    # 4. Top failure categories
    failure_categories = weekly_data['failure_categories']
    axes[1, 1].barh(
        list(failure_categories.keys()),
        list(failure_categories.values())
    )
    axes[1, 1].set_title('Top Failure Categories')
    axes[1, 1].set_xlabel('Count')

    plt.tight_layout()
    plt.savefig(PLOT_PATH, dpi=150)
    plt.close(fig)  # Release the figure's memory in long-running jobs

    # Create email
    msg = MIMEMultipart()
    msg['Subject'] = f"Research Assistant Weekly Report - Week {s['week_number']}"
    msg['From'] = 'research-assistant@lab.edu'
    msg['To'] = 'pi@lab.edu, lab-members@lab.edu'

    # Email body
    html_body = f"""
    <html>
    <body>
    <h2>Research Assistant Performance Report</h2>
    <h3>Week {s['week_number']} - {s['date_range']}</h3>

    <h4>📊 Key Metrics</h4>
    <table border="1" cellpadding="10">
      <tr>
        <th>Metric</th>
        <th>This Week</th>
        <th>Last Week</th>
        <th>Change</th>
      </tr>
      <tr>
        <td>Avg Precision@5</td>
        <td>{s['current_precision']:.2%}</td>
        <td>{s['last_precision']:.2%}</td>
        <td style="color: {'green' if change > 0 else 'red'};">
          {change:+.2%}
        </td>
      </tr>
      <tr>
        <td>Faithfulness</td>
        <td>{s['current_faithfulness']:.2%}</td>
        <td>{s['last_faithfulness']:.2%}</td>
        <td style="color: {'green' if faith_change > 0 else 'red'};">
          {faith_change:+.2%}
        </td>
      </tr>
      <tr>
        <td>Avg Latency</td>
        <td>{s['current_latency']:.2f}s</td>
        <td>{s['last_latency']:.2f}s</td>
        <td style="color: {'green' if latency_change < 0 else 'red'};">
          {latency_change:+.2f}s
        </td>
      </tr>
      <tr>
        <td>Queries Served</td>
        <td>{s['current_queries']}</td>
        <td>{s['last_queries']}</td>
        <td>{queries_change:+d}</td>
      </tr>
    </table>

    <h4>🎯 Performance by Persona</h4>
    <ul>
      <li><strong>Wet Lab Biologists:</strong>
        Precision: {s['wet_lab_precision']:.2%}
        (Target: >85% {target_status(s['wet_lab_precision'])})
      </li>
      <li><strong>Computational Biologists:</strong>
        Precision: {s['comp_bio_precision']:.2%}
        (Target: >85% {target_status(s['comp_bio_precision'])})
      </li>
      <li><strong>PI Queries:</strong>
        Precision: {s['pi_precision']:.2%}
        (Target: >85% {target_status(s['pi_precision'])})
      </li>
    </ul>

    <h4>⚠️ Issues & Recommendations</h4>
    <ul>
      {issues_html}
    </ul>

    <p>See attached visualization for detailed trends.</p>

    <p>
      <a href="http://lab-research-assistant.local/dashboard">
        View Interactive Dashboard
      </a>
    </p>

    </body>
    </html>
    """

    msg.attach(MIMEText(html_body, 'html'))

    # Attach visualization
    with open(PLOT_PATH, 'rb') as f:
        img = MIMEImage(f.read())
    img.add_header('Content-Disposition', 'attachment',
                   filename='weekly_trends.png')
    msg.attach(img)

    # Send email (os.environ raises KeyError if the password is not set)
    with smtplib.SMTP('smtp.lab.edu', 587) as smtp:
        smtp.starttls()
        smtp.login('research-assistant@lab.edu', os.environ['EMAIL_PASSWORD'])
        smtp.send_message(msg)

    print("✓ Weekly report sent to lab members")


if __name__ == "__main__":
    generate_weekly_report()
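The report tracks Precision@5 without defining it. Under the standard definition, it is the fraction of the top five retrieved papers that are relevant to the query. A minimal sketch follows; the paper IDs are made up for illustration, and vero-eval may differ in details such as how it treats result lists shorter than k.

```python
def precision_at_k(retrieved: list, relevant: set, k: int = 5) -> float:
    """Fraction of the top-k retrieved items that appear in the relevant set."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = retrieved[:k]
    hits = sum(1 for item in top_k if item in relevant)
    # Dividing by k (not len(top_k)) penalizes returning fewer than k results.
    return hits / k


# Made-up paper IDs: ranked retrieval output and the ground-truth relevant set.
retrieved = ["p12", "p07", "p33", "p41", "p02", "p19"]
relevant = {"p07", "p12", "p02", "p88"}

score = precision_at_k(retrieved, relevant)
print(f"Precision@5 = {score:.2f}")  # prints "Precision@5 = 0.60"
```

Here three of the top five results (p12, p07, p02) are relevant, so the score is 3/5. Against the lab's 0.85 target, a single query like this one would count as a miss.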

Conclusion: The Complete Picture

You now have everything needed to build, evaluate, and deploy a production-ready Research Assistant:

Core Architecture:

✅ Neo4j knowledge graph for research papers
✅ Ollama for local LLM inference
✅ Hybrid retrieval (vector + graph)
✅ Persona-driven responses with RLHF

Evaluation & Quality:

✅ vero-eval for rigorous testing
✅ Automated adversarial testing
✅ Continuous monitoring with alerts
✅ Weekly performance reports

Production Features:

✅ Caching for performance
✅ Batch processing for scale
✅ Automated paper updates
✅ Multi-persona support

The vero-eval Advantage:

What makes this system production-ready is the evaluation framework. Unlike traditional RAG systems that rely on gut feeling and spot-checking, we have:

  1. Systematic edge case testing - adversarial queries expose weaknesses
  2. Persona stress testing - ensures all user types are served well
  3. Automated regression detection - alerts when quality degrades
  4. Actionable metrics - precision/recall/faithfulness directly inform improvements
  5. Continuous learning - RLHF loop closes based on real performance data

This is the difference between a demo and a system you'd trust with real research workflows.
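Automated regression detection (item 3 above) can start out very simple: compare the latest score with a rolling baseline. The sketch below flags a regression when the current value falls more than a tolerance below the mean of the preceding weeks. The function name, the four-week window, and the 0.03 tolerance are illustrative assumptions, not vero-eval settings.

```python
from statistics import mean


def is_regression(history: list, current: float,
                  window: int = 4, tolerance: float = 0.03) -> bool:
    """True when `current` falls more than `tolerance` below the recent mean."""
    recent = history[-window:]
    if not recent:
        return False  # No baseline yet, so nothing to compare against
    return current < mean(recent) - tolerance


# Illustrative weekly Precision@5 values, not measured results.
weekly_precision = [0.86, 0.88, 0.87, 0.89]

print(is_regression(weekly_precision, 0.87))  # False: within tolerance of the 0.875 mean
print(is_regression(weekly_precision, 0.82))  # True: more than 0.03 below the mean
```

A tolerance band avoids alerting on ordinary week-to-week noise, at the cost of missing slow drifts; comparing against a fixed baseline as well would catch those.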

Next Steps:

  1. Clone the starter repo and follow the setup script
  2. Ingest your first 100 papers to test the pipeline
  3. Run vero-eval to establish your baseline
  4. Iterate on retrieval and persona prompts
  5. Deploy to staging and gather feedback
  6. Use weekly reports to drive improvements

Remember: The goal isn't perfect accuracy on day one. It's building a system that measurably improves over time through evaluation-driven iteration.

Now go build something that makes research more efficient! 🚀


Questions? Open an issue in the repo or reach out to the community.

Sovereign AI: Building Local-First Intelligent Systems

by Daniel Kliewer · Paperback · 72 pages

The hands-on guide to building AI that runs on your hardware, keeps your data private, and eliminates cloud dependence. Working code included.