Building and Evaluating a Local-First Research Assistant with GraphRAG and vero-eval
A complete technical guide to building a production-ready research assistant using GraphRAG, Neo4j knowledge graphs, local LLMs served by Ollama, and the vero-eval evaluation framework for rigorous AI system testing.
Daniel Kliewer
Author, Sovereign AI

A comprehensive guide to creating a persona-driven AI assistant with rigorous evaluation using Neo4j, Ollama, and the vero-eval framework
Introduction: Why Local GraphRAG Matters for Research Workflows
If you're building AI-powered applications in 2025, you've likely hit two major pain points: context limitations and lack of systematic evaluation. Large Language Models are powerful, but they struggle with long-term memory and consistent performance across edge cases. Enter GraphRAG—a methodology that combines knowledge graphs with retrieval-augmented generation to give your AI genuine memory and contextual awareness.
In this guide, we'll build a Local Research Assistant that:
- Stores and retrieves research papers, notes, and conversations in a Neo4j knowledge graph
- Uses Ollama for completely local inference (no API costs, full privacy)
- Implements persona-driven responses that adapt based on RLHF feedback
- Most importantly: Measures performance rigorously using the vero-eval framework
This isn't another "hello world" tutorial. We're building production-ready infrastructure that you can deploy for real research workflows, with proper testing and evaluation baked in from day one.
Prerequisites and Starting Point
Before we dive in, you'll need:
System Requirements:
- Python 3.9+
- Node.js 18+
- Docker (for Neo4j)
- 16GB+ RAM recommended
Core Technologies:
- Ollama for local LLM inference
- Neo4j for graph database
- vero-eval for evaluation
- Next.js + FastAPI (from the starter template)
Clone the Starter Repository:
```bash
git clone https://github.com/kliewerdaniel/chrisbot.git research-assistant
cd research-assistant
```
This gives us a solid foundation with the frontend, basic chat interface, and project structure already in place. We'll extend it to build our research-focused GraphRAG system.
Part 1: Understanding the Architecture
Our Research Assistant follows the PersonaGen architecture pattern outlined by Daniel Kliewer, but applied to academic research workflows:
```text
┌─────────────────────────────────────────────────────────┐
│                     User Interface                      │
│                (Next.js Chat Interface)                 │
└────────────────────┬────────────────────────────────────┘
                     │
                     ▼
┌─────────────────────────────────────────────────────────┐
│                     Reasoning Agent                     │
│          (Tool Calling + RLHF Threshold Logic)          │
└────────────────────┬────────────────────────────────────┘
                     │
          ┌──────────┴──────────┐
          ▼                     ▼
┌──────────────────┐  ┌──────────────────┐
│   Neo4j Graph    │  │   Ollama LLM     │
│   RAG System     │  │ (Mistral/Llama)  │
│                  │  │                  │
│  • Papers        │  │  • Generation    │
│  • Authors       │  │  • Embeddings    │
│  • Concepts      │  │  • Extraction    │
│  • Citations     │  │                  │
└──────────────────┘  └──────────────────┘
                     │
                     ▼
┌─────────────────────────────────────────────────────────┐
│                   vero-eval Framework                   │
│  • Test Dataset Generation                              │
│  • Retrieval Metrics (Precision, Recall, MRR)           │
│  • Generation Metrics (Faithfulness, BERTScore)         │
│  • Persona Stress Testing                               │
└─────────────────────────────────────────────────────────┘
```
Key Insight: The persona system adapts its behavior based on evaluation feedback. If vero-eval shows poor retrieval for technical queries, the RLHF thresholds adjust to require more context before responding.
Part 2: Setting Up Neo4j GraphRAG
Neo4j is our memory layer. Following the official Neo4j GenAI integration patterns, we'll create a graph schema optimized for research.
Installing Neo4j GraphRAG for Python
```bash
# Install the official Neo4j GraphRAG package
pip install neo4j-graphrag

# Install Ollama integration
pip install "neo4j-graphrag[ollama]"

# Start Neo4j (using Docker)
docker run \
  --name research-neo4j \
  -p 7474:7474 -p 7687:7687 \
  -e NEO4J_AUTH=neo4j/research2025 \
  -v $PWD/neo4j-data:/data \
  neo4j:latest
```

Defining the Research Knowledge Schema
Create scripts/graph_schema.py:
```python
# Plain-Python description of the graph model. It documents the schema for the
# rest of the scripts and does not depend on a library base class.
class ResearchSchema:
    """
    Knowledge graph schema for research assistant.

    Nodes:
    - Paper: Research papers with metadata
    - Author: Paper authors with affiliation
    - Concept: Extracted key concepts/topics
    - Note: User's research notes
    - Question: User queries with context

    Relationships:
    - AUTHORED: Author -> Paper
    - CITES: Paper -> Paper
    - DISCUSSES: Paper -> Concept
    - RELATES_TO: Concept -> Concept
    - ANSWERS: Paper -> Question
    - ANNOTATES: Note -> Paper
    """

    node_types = {
        'Paper': {
            'properties': ['title', 'abstract', 'year', 'doi', 'pdf_path'],
            'embedding_property': 'abstract_embedding'
        },
        'Author': {
            'properties': ['name', 'affiliation', 'h_index'],
            'embedding_property': None
        },
        'Concept': {
            'properties': ['name', 'definition', 'domain'],
            'embedding_property': 'definition_embedding'
        },
        'Note': {
            'properties': ['content', 'timestamp', 'tags'],
            'embedding_property': 'content_embedding'
        },
        'Question': {
            'properties': ['query', 'timestamp', 'answered'],
            'embedding_property': 'query_embedding'
        }
    }

    # Each value is (source label, target label)
    relationship_types = {
        'AUTHORED': ('Author', 'Paper'),
        'CITES': ('Paper', 'Paper'),
        'DISCUSSES': ('Paper', 'Concept'),
        'RELATES_TO': ('Concept', 'Concept'),
        'ANSWERS': ('Paper', 'Question'),
        'ANNOTATES': ('Note', 'Paper')
    }
```
Why this schema? Research workflows have natural graph structures:
- Papers cite each other (transitive relationships)
- Concepts relate to multiple papers
- Authors collaborate across papers
- User notes connect to specific papers
This lets us traverse the graph to find: "What papers discussing transformer architectures were cited by papers on RAG systems after 2023?"
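As a rough illustration, the question above could be expressed as one Cypher query against this schema. This is a sketch under assumptions: `find_cited_papers` is a hypothetical helper, the topic strings are whatever `Concept` names exist in your graph, and "after 2023" is read as a constraint on the citing paper's year.

```python
# Hypothetical helper, not part of the starter repository.
CITED_BY_TOPIC_QUERY = """
MATCH (citing:Paper)-[:DISCUSSES]->(:Concept {name: $citing_topic})
WHERE citing.year > $after_year
MATCH (citing)-[:CITES]->(cited:Paper)-[:DISCUSSES]->(:Concept {name: $cited_topic})
RETURN DISTINCT cited.title AS title, cited.year AS year
ORDER BY year DESC
"""


def find_cited_papers(driver, cited_topic, citing_topic, after_year):
    """Return papers on `cited_topic` that are cited by later papers on `citing_topic`."""
    with driver.session() as session:
        result = session.run(
            CITED_BY_TOPIC_QUERY,
            cited_topic=cited_topic,
            citing_topic=citing_topic,
            after_year=after_year,
        )
        # Each record becomes a plain dict: {"title": ..., "year": ...}
        return [record.data() for record in result]
```

With a connected Neo4j driver, a call might look like `find_cited_papers(driver, "transformer architectures", "RAG systems", 2023)`. Exact-match concept names are a simplification; real concept labels extracted by an LLM will vary in wording.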
Building the Graph Ingestion Pipeline
Create scripts/ingest_research_data.py:
```python
import argparse
import json
from pathlib import Path

import ollama
import PyPDF2
from neo4j import GraphDatabase


class ResearchGraphBuilder:
    def __init__(self, neo4j_uri="bolt://localhost:7687",
                 neo4j_user="neo4j",
                 neo4j_password="research2025",
                 ollama_model="mistral"):

        self.driver = GraphDatabase.driver(neo4j_uri,
                                           auth=(neo4j_user, neo4j_password))
        self.ollama_model = ollama_model
        # The original constructed an unused GraphRAG object here; it is
        # omitted because ingestion only needs the driver.

    def extract_paper_metadata(self, pdf_path: Path) -> dict:
        """Extract title, authors, abstract, and key concepts from a PDF"""
        with open(pdf_path, 'rb') as file:
            reader = PyPDF2.PdfReader(file)

            # Extract first 3 pages (usually contains abstract)
            text = ""
            for page in reader.pages[:3]:
                text += page.extract_text() or ""

        # Use Ollama to extract structured metadata. The keys are spelled out
        # because create_paper_node reads them by name.
        prompt = f"""Extract from this research paper excerpt:
1. title
2. authors (list of names)
3. abstract
4. concepts (list of 5-7 main topics)
5. year (publication year as an integer, or null if not stated)

Text: {text[:4000]}

Return as JSON with exactly these keys: title, authors, abstract, concepts, year."""

        response = ollama.generate(
            model=self.ollama_model,
            prompt=prompt,
            format='json'
        )

        return json.loads(response['response'])

    def create_paper_node(self, metadata: dict, pdf_path: Path):
        """Create Paper node with embeddings"""

        # Generate embedding for abstract
        abstract_embedding = ollama.embeddings(
            model='nomic-embed-text',
            prompt=metadata['abstract']
        )['embedding']

        with self.driver.session() as session:
            session.run("""
                CREATE (p:Paper {
                    title: $title,
                    abstract: $abstract,
                    year: $year,
                    pdf_path: $pdf_path,
                    abstract_embedding: $embedding
                })
                WITH p
                UNWIND $authors AS author_name
                MERGE (a:Author {name: author_name})
                MERGE (a)-[:AUTHORED]->(p)

                // Collapse back to one row per paper; otherwise every concept
                // would be linked once per author
                WITH DISTINCT p
                UNWIND $concepts AS concept_name
                MERGE (c:Concept {name: concept_name})
                MERGE (p)-[:DISCUSSES]->(c)
            """,
                title=metadata['title'],
                abstract=metadata['abstract'],
                year=metadata.get('year'),  # left unset when the model finds no year
                pdf_path=str(pdf_path),
                embedding=abstract_embedding,
                authors=metadata['authors'],
                concepts=metadata['concepts']
            )

    def ingest_directory(self, papers_dir: Path):
        """Ingest all PDFs in a directory"""
        pdf_files = list(papers_dir.glob("*.pdf"))

        print(f"Found {len(pdf_files)} papers to ingest...")

        for pdf_path in pdf_files:
            print(f"Processing: {pdf_path.name}")
            try:
                metadata = self.extract_paper_metadata(pdf_path)
                self.create_paper_node(metadata, pdf_path)
                print(f"✓ Ingested: {metadata['title']}")
            except Exception as e:
                print(f"✗ Failed {pdf_path.name}: {e}")


if __name__ == "__main__":
    # Lets setup.sh call this script with --directory
    parser = argparse.ArgumentParser(description="Ingest research PDFs into Neo4j")
    parser.add_argument("--directory", type=Path, default=Path("data/research_papers"))
    args = parser.parse_args()
    ResearchGraphBuilder().ingest_directory(args.directory)
```
Key Pattern: We're using Ollama for both extraction (via generate) and embeddings (via embeddings). This keeps everything local. For production, you might cache embeddings in a vector index.
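One lightweight way to avoid recomputing embeddings during repeated ingestion runs is a small on-disk cache keyed by a hash of the text. This is a simpler alternative to relying on the vector index itself, and everything in it (`EmbeddingCache`, the JSON file layout) is a hypothetical sketch rather than part of the starter repository. The embedding function is injected so the cache does not depend on any particular model.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable, List


class EmbeddingCache:
    """Disk-backed cache so identical text is embedded only once."""

    def __init__(self, embed_fn: Callable[[str], List[float]], cache_path: Path):
        self.embed_fn = embed_fn
        self.cache_path = Path(cache_path)
        # Load previously stored vectors if the cache file already exists
        self._cache = (
            json.loads(self.cache_path.read_text())
            if self.cache_path.exists() else {}
        )

    def get(self, text: str) -> List[float]:
        """Return the cached embedding for `text`, computing it on a miss."""
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in self._cache:
            self._cache[key] = self.embed_fn(text)
            # Rewrite the whole file on each miss; fine for small corpora
            self.cache_path.write_text(json.dumps(self._cache))
        return self._cache[key]
```

With Ollama, `embed_fn` could wrap the same call the ingestion script uses, for example `lambda t: ollama.embeddings(model='nomic-embed-text', prompt=t)['embedding']`. The cache must be cleared if you change the embedding model, since vectors from different models are not comparable.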
Creating Vector Indexes for Hybrid Search
Following Neo4j's GenAI integration guide, we create vector indexes:
```python
def create_vector_indexes(self):
    """Create vector indexes for similarity search"""
    with self.driver.session() as session:
        # Abstract embeddings (768 dimensions for nomic-embed-text)
        session.run("""
            CREATE VECTOR INDEX paper_abstracts IF NOT EXISTS
            FOR (p:Paper)
            ON p.abstract_embedding
            OPTIONS {
                indexConfig: {
                    `vector.dimensions`: 768,
                    `vector.similarity_function`: 'cosine'
                }
            }
        """)

        # Concept embeddings
        session.run("""
            CREATE VECTOR INDEX concept_definitions IF NOT EXISTS
            FOR (c:Concept)
            ON c.definition_embedding
            OPTIONS {
                indexConfig: {
                    `vector.dimensions`: 768,
                    `vector.similarity_function`: 'cosine'
                }
            }
        """)

        # Note embeddings
        session.run("""
            CREATE VECTOR INDEX note_contents IF NOT EXISTS
            FOR (n:Note)
            ON n.content_embedding
            OPTIONS {
                indexConfig: {
                    `vector.dimensions`: 768,
                    `vector.similarity_function`: 'cosine'
                }
            }
        """)
```
Critical: The dimension count must match your embedding model. nomic-embed-text produces 768-dimensional vectors; if you switch to all-MiniLM-L6-v2, you'd need 384.
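Rather than hard-coding the dimension, one option is to measure it by embedding a short probe string and building the index statement from the result. The sketch below assumes two hypothetical helpers, `detect_embedding_dimension` and `vector_index_statement`; the embedding function is passed in, so it works with whichever model you use.

```python
from typing import Callable, List


def detect_embedding_dimension(embed_fn: Callable[[str], List[float]]) -> int:
    """Embed a short probe string and return the length of the vector."""
    return len(embed_fn("dimension probe"))


def vector_index_statement(index_name: str, label: str, prop: str, dimensions: int) -> str:
    """Build a CREATE VECTOR INDEX statement for the measured dimension.

    Identifiers are interpolated directly, so pass only trusted, hard-coded names.
    """
    return (
        f"CREATE VECTOR INDEX {index_name} IF NOT EXISTS\n"
        f"FOR (n:{label}) ON n.{prop}\n"
        "OPTIONS {indexConfig: {\n"
        f"  `vector.dimensions`: {dimensions},\n"
        "  `vector.similarity_function`: 'cosine'\n"
        "}}"
    )
```

The resulting string can be passed to `session.run(...)` in place of the literal statements. If the model changes later, an existing index with the old dimension has to be dropped and the stored embeddings regenerated.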
Part 3: Implementing Hybrid Retrieval
Now we implement the retrieval layer that combines vector similarity with graph traversal:

```python
import ollama


class HybridRetriever:
    def __init__(self, driver, ollama_model="mistral"):
        self.driver = driver
        self.ollama_model = ollama_model

    def retrieve_context(self, query: str, limit: int = 5) -> list[dict]:
        """
        Hybrid retrieval combining:
        1. Vector similarity search
        2. Graph traversal for related concepts
        3. Citation network expansion
        """

        # Generate query embedding
        query_embedding = ollama.embeddings(
            model='nomic-embed-text',
            prompt=query
        )['embedding']

        with self.driver.session() as session:
            # Vector similarity search. OPTIONAL MATCH keeps papers that have
            # no linked authors or concepts.
            vector_results = session.run("""
                CALL db.index.vector.queryNodes(
                    'paper_abstracts',
                    $limit,
                    $query_embedding
                )
                YIELD node, score
                OPTIONAL MATCH (node)<-[:AUTHORED]-(author:Author)
                OPTIONAL MATCH (node)-[:DISCUSSES]->(concept:Concept)

                RETURN
                    node.title AS title,
                    node.abstract AS abstract,
                    node.year AS year,
                    score AS relevance_score,
                    collect(DISTINCT author.name) AS authors,
                    collect(DISTINCT concept.name) AS concepts,
                    'vector_search' AS retrieval_method
                ORDER BY relevance_score DESC
            """,
                query_embedding=query_embedding,
                limit=limit
            ).data()

            # Graph traversal for cited papers
            graph_results = []
            if vector_results:
                top_paper_title = vector_results[0]['title']

                # Aggregates are not allowed inside WHERE, so concepts are
                # collected in a WITH clause first and filtered afterwards.
                graph_results = session.run("""
                    MATCH (seed:Paper {title: $seed_title})
                    MATCH (seed)-[:CITES]->(cited:Paper)
                    OPTIONAL MATCH (cited)<-[:AUTHORED]-(author:Author)
                    OPTIONAL MATCH (cited)-[:DISCUSSES]->(concept:Concept)
                    WITH cited,
                         collect(DISTINCT author.name) AS authors,
                         collect(DISTINCT concept.name) AS concepts
                    WHERE any(c IN $query_concepts WHERE c IN concepts)

                    RETURN
                        cited.title AS title,
                        cited.abstract AS abstract,
                        cited.year AS year,
                        0.7 AS relevance_score,
                        authors,
                        concepts,
                        'citation_traversal' AS retrieval_method
                    LIMIT $limit
                """,
                    seed_title=top_paper_title,
                    query_concepts=self._extract_query_concepts(query),
                    limit=max(1, limit // 2)
                ).data()

        # Combine and deduplicate
        all_results = vector_results + graph_results
        seen_titles = set()
        unique_results = []

        for result in all_results:
            if result['title'] not in seen_titles:
                seen_titles.add(result['title'])
                unique_results.append(result)

        return sorted(unique_results,
                      key=lambda x: x['relevance_score'],
                      reverse=True)[:limit]

    def _extract_query_concepts(self, query: str) -> list[str]:
        """Extract key concepts from query using LLM"""
        response = ollama.generate(
            model=self.ollama_model,
            prompt=f"Extract 3-5 key technical concepts from this query: {query}. Return as comma-separated list.",
            options={'temperature': 0.1}
        )
        return [c.strip() for c in response['response'].split(',')]
```
Why hybrid? Pure vector search might miss important papers that don't match semantically but are cited by relevant papers. Graph traversal captures these relationships.
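When the same paper is found by both methods, it can be useful to keep the higher score and remember every method that surfaced it. The function below is one possible merge step, written as a standalone sketch; `merge_results` and the `retrieval_methods` field are hypothetical names, and the input dictionaries are assumed to carry the `title`, `relevance_score`, and `retrieval_method` keys used by the retriever.

```python
from typing import Dict, List


def merge_results(vector_results: List[Dict], graph_results: List[Dict], limit: int) -> List[Dict]:
    """Deduplicate by title, keep the best score, and record all retrieval methods."""
    merged: Dict[str, Dict] = {}
    for result in vector_results + graph_results:
        title = result["title"]
        if title not in merged:
            # First sighting: copy the result and start the list of methods
            merged[title] = dict(result, retrieval_methods=[result["retrieval_method"]])
            continue
        kept = merged[title]
        kept["retrieval_methods"].append(result["retrieval_method"])
        kept["relevance_score"] = max(kept["relevance_score"], result["relevance_score"])
    ranked = sorted(merged.values(), key=lambda r: r["relevance_score"], reverse=True)
    return ranked[:limit]
```

A paper reached through both vector search and citation traversal is arguably stronger evidence than one reached through either alone, so the `retrieval_methods` list could later feed a score boost. That weighting is a design choice left open here.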
Part 4: The Reasoning Agent and Persona Layer
The reasoning agent decides when to query the graph and how to format responses based on RLHF-adjusted thresholds:

```python
# In scripts/reasoning_agent.py

import json
import sys
from pathlib import Path

import ollama
from neo4j import GraphDatabase

# Assumed module name: the guide does not say which file holds HybridRetriever
from hybrid_retriever import HybridRetriever


class PersonaReasoningAgent:
    def __init__(self, persona_config_path: Path = Path("data/persona.json"),
                 neo4j_uri="bolt://localhost:7687",
                 neo4j_user="neo4j",
                 neo4j_password="research2025",
                 ollama_model="mistral"):
        self.persona_config_path = Path(persona_config_path)
        self.persona_config = self._load_persona(self.persona_config_path)
        self.ollama_model = ollama_model
        driver = GraphDatabase.driver(neo4j_uri, auth=(neo4j_user, neo4j_password))
        self.retriever = HybridRetriever(driver, ollama_model)

    def _load_persona(self, config_path: Path) -> dict:
        """Load persona configuration with RLHF thresholds"""
        with open(config_path) as f:
            return json.load(f)

    def should_retrieve_context(self, query: str) -> bool:
        """
        Decide if we need to retrieve context based on:
        1. Query complexity
        2. RLHF confidence threshold
        3. Recent retrieval success rate
        """

        # Simple heuristic: technical terms or specific paper requests
        technical_indicators = [
            'paper', 'research', 'study', 'findings',
            'method', 'algorithm', 'experiment', 'results'
        ]

        needs_retrieval = any(term in query.lower()
                              for term in technical_indicators)

        # Check RLHF threshold. A higher value means "retrieve more often",
        # matching _update_persona_thresholds below.
        confidence_threshold = self.persona_config['rlhf_thresholds']['retrieval_required']

        # If recent queries had low-quality responses, lean toward retrieval.
        # (The original multiplied by 0.8, which made retrieval less likely.)
        if self.persona_config.get('recent_success_rate', 1.0) < 0.7:
            confidence_threshold = min(1.0, confidence_threshold / 0.8)

        return needs_retrieval or confidence_threshold > 0.5

    def generate_response(self, query: str, chat_history: list = None) -> dict:
        """
        Main orchestration logic:
        1. Decide if retrieval needed
        2. Retrieve context if necessary
        3. Generate response with persona coloring
        4. Grade output (RLHF scoring)
        """

        # Step 1: Retrieval decision
        needs_context = self.should_retrieve_context(query)

        context_docs = []
        if needs_context:
            context_docs = self.retriever.retrieve_context(query, limit=5)

        # Step 2: Format context for LLM
        context_str = self._format_context(context_docs)

        # Step 3: Generate with persona. ollama.chat takes the history as
        # role/content messages; generate()'s `context` argument expects token
        # IDs from a previous call, not chat messages.
        system_prompt = self._build_persona_prompt(context_str)
        messages = [{'role': 'system', 'content': system_prompt}]
        messages += [{'role': m['role'], 'content': m['content']}
                     for m in (chat_history or [])]
        messages.append({'role': 'user', 'content': query})

        response = ollama.chat(model=self.ollama_model, messages=messages)
        answer = response['message']['content']

        # Step 4: RLHF grading
        quality_grade = self._grade_response(query, answer, context_docs)

        # Update RLHF thresholds based on grade
        self._update_persona_thresholds(quality_grade)

        return {
            'response': answer,
            'context_used': context_docs,
            'quality_grade': quality_grade,
            'retrieval_method': context_docs[0]['retrieval_method'] if context_docs else None
        }

    def _format_context(self, docs: list) -> str:
        """Render retrieved papers as a numbered list for the system prompt"""
        return "\n\n".join(
            f"[{i}] {doc['title']} ({doc['year']})\n{doc['abstract']}"
            for i, doc in enumerate(docs, start=1)
        )

    def _build_persona_prompt(self, context: str) -> str:
        """
        Build system prompt from persona configuration.
        This is the 'coloring' step mentioned in the architecture.
        """
        base_template = self.persona_config['system_prompt_template']

        # Insert context if available
        if context:
            base_template += f"\n\nRelevant Research Context:\n{context}"

        # Add persona modifiers based on RLHF values
        formality = self.persona_config['rlhf_thresholds']['formality_level']
        if formality > 0.7:
            base_template += "\n\nUse academic, formal language with proper citations."
        else:
            base_template += "\n\nExplain concepts clearly and conversationally."

        return base_template

    def _grade_response(self, query: str, response: str, context: list) -> float:
        """
        RLHF grading: 0 (needs improvement) to 1 (excellent).
        In production, this would be human feedback, but we start with heuristics.
        """

        # Heuristic checks:
        # 1. Did we use retrieved context?
        used_context = any(
            doc['title'].lower() in response.lower()
            for doc in context
        ) if context else True

        # 2. Is response substantive (not too short)?
        is_substantive = len(response.split()) > 50

        # 3. Does response directly address query?
        query_terms = set(query.lower().split())
        response_terms = set(response.lower().split())
        # Guard against an empty query to avoid dividing by zero
        overlap = len(query_terms & response_terms) / len(query_terms) if query_terms else 0.0

        # Weighted score
        score = (
            0.4 * float(used_context) +
            0.3 * float(is_substantive) +
            0.3 * overlap
        )

        return min(1.0, score)

    def _update_persona_thresholds(self, quality_grade: float):
        """
        Update RLHF thresholds based on response quality.
        This is the adaptive learning mechanism.
        """
        thresholds = self.persona_config['rlhf_thresholds']

        # If grade < 0.5, we need more context
        if quality_grade < 0.5:
            thresholds['retrieval_required'] += 0.05
        else:
            # Successful response, can relax threshold slightly
            thresholds['retrieval_required'] -= 0.02

        # Clamp values
        thresholds['retrieval_required'] = max(
            0.0,
            min(1.0, thresholds['retrieval_required'])
        )

        # Save updated config to the same file it was loaded from
        with open(self.persona_config_path, 'w') as f:
            json.dump(self.persona_config, f, indent=2)


if __name__ == "__main__":
    # CLI used by the Next.js route:
    #   python3 reasoning_agent.py generate '{"query": "...", "chat_history": [...]}'
    payload = json.loads(sys.argv[2])
    agent = PersonaReasoningAgent()
    print(json.dumps(agent.generate_response(payload['query'],
                                             payload.get('chat_history'))))
```
Key Insight: The persona adapts over time. If vero-eval (which we'll integrate next) shows poor performance, these thresholds shift to require more evidence before responding.
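The agent reads its settings from `data/persona.json`, whose layout is implied by the keys the code accesses. A minimal starting file might look like the sketch below. Every value here is an illustrative assumption to tune, and `write_default_persona` is a hypothetical helper, not something shipped with the starter template.

```python
import json
from pathlib import Path

# Illustrative starting values; every number and the prompt text are assumptions.
DEFAULT_PERSONA = {
    "system_prompt_template": (
        "You are a research assistant. Ground your answers in the provided papers."
    ),
    "rlhf_thresholds": {
        "retrieval_required": 0.6,  # above 0.5 the agent retrieves for every query
        "formality_level": 0.8,     # above 0.7 selects the formal, academic style
    },
    "recent_success_rate": 1.0,
}


def write_default_persona(path: Path = Path("data/persona.json")) -> dict:
    """Create the persona file if it is missing and return its contents."""
    path = Path(path)
    if not path.exists():
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(json.dumps(DEFAULT_PERSONA, indent=2))
    return json.loads(path.read_text())
```

Because the agent rewrites this file after each graded response, keeping it under version control makes it easier to see how the thresholds drift over time.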
Part 5: Integrating vero-eval for Rigorous Testing
This is where the magic happens. vero-eval provides production-grade evaluation that goes far beyond simple accuracy metrics. It tests edge cases, persona stress scenarios, and real-world failure modes.

Installing and Configuring vero-eval
```bash
# Install vero-eval
pip install vero-eval

# Initialize evaluation directory
mkdir -p evaluation/datasets evaluation/results
```
Generating a Research-Specific Test Dataset
vero-eval can generate test datasets tailored to your domain:
```python
# evaluation/generate_test_dataset.py

from vero.test_dataset_generator import generate_and_save
from pathlib import Path


def generate_research_test_dataset():
    """
    Generate challenging test queries for research assistant.
    vero-eval creates persona-based edge cases automatically.
    """

    # Point to your research papers directory
    data_path = Path('data/research_papers')

    # Define the use case
    use_case = """
    This is a research assistant that helps academics:
    - Find relevant papers on specific topics
    - Understand connections between research areas
    - Get summaries of complex papers
    - Discover citation networks
    - Answer technical questions about methodologies

    Edge cases to test:
    - Queries about very recent papers (after knowledge cutoff)
    - Multi-hop reasoning (papers that cite papers that discuss X)
    - Ambiguous author names
    - Requests for specific experimental results
    - Cross-domain queries (e.g., physics papers relevant to biology)
    """

    # Generate dataset with persona variations
    generate_and_save(
        data_path=str(data_path),
        usecase=use_case,
        save_path_dir='evaluation/datasets/research_assistant_v1',
        n_queries=150,  # Generate 150 test queries

        # Persona variations
        personas=[
            {
                'name': 'PhD Student',
                'characteristics': 'Detail-oriented, asks follow-up questions, wants methodology details'
            },
            {
                'name': 'Senior Researcher',
                'characteristics': 'Broad queries, interested in connections, asks about citations'
            },
            {
                'name': 'Industry Practitioner',
                'characteristics': 'Practical focus, wants applicable results, less theory'
            }
        ],

        # vero-eval will use Ollama for generation
        llm_provider='ollama',
        model_name='mistral'
    )

    print("✓ Generated test dataset with persona variations")
    print("  Check: evaluation/datasets/research_assistant_v1/")


if __name__ == "__main__":
    generate_research_test_dataset()
```
Run this:
```bash
python evaluation/generate_test_dataset.py
```
This creates a JSON file with queries like:
```json
{
  "query": "What papers discuss attention mechanisms in the context of graph neural networks published after 2022?",
  "persona": "Senior Researcher",
  "expected_characteristics": ["multi-hop", "temporal_constraint", "domain_crossing"],
  "ground_truth_chunk_ids": ["paper_47", "paper_89", "paper_102"],
  "complexity_score": 0.85
}
```
Running the Evaluation Suite
Now we test our system against this dataset:
```python
# evaluation/run_evaluation.py

import json
import sys
from pathlib import Path

import numpy as np
from vero.evaluator import Evaluator
from vero.metrics import (
    PrecisionMetric, RecallMetric, SufficiencyMetric,
    FaithfulnessMetric, BERTScoreMetric, RougeMetric,
    MRRMetric, MAPMetric, NDCGMetric
)

# reasoning_agent.py lives in scripts/, so make that directory importable
sys.path.insert(0, str(Path(__file__).resolve().parent.parent / 'scripts'))
from reasoning_agent import PersonaReasoningAgent


def run_full_evaluation():
    """
    Run comprehensive evaluation using vero-eval framework.
    Tests both retrieval and generation quality.
    """

    # Initialize our system
    agent = PersonaReasoningAgent()

    # Load test dataset
    with open('evaluation/datasets/research_assistant_v1/queries.json') as f:
        test_queries = json.load(f)

    # Initialize vero-eval
    evaluator = Evaluator(
        test_dataset=test_queries,
        trace_db_path='evaluation/trace.db'  # Logs all queries
    )

    # Define evaluation metrics
    retrieval_metrics = [
        PrecisionMetric(k=5),
        RecallMetric(k=5),
        SufficiencyMetric(),  # Are retrieved docs sufficient to answer?
    ]

    generation_metrics = [
        FaithfulnessMetric(),  # Is response faithful to retrieved docs?
        BERTScoreMetric(),     # Semantic similarity to reference answers
        RougeMetric()          # Token overlap with references
    ]

    ranking_metrics = [
        MRRMetric(),   # Mean Reciprocal Rank
        MAPMetric(),   # Mean Average Precision
        NDCGMetric()   # Normalized Discounted Cumulative Gain
    ]

    results = {
        'retrieval': {},
        'generation': {},
        'ranking': {},
        'per_persona': {}
    }

    # Run evaluation for each query
    for query_data in test_queries:
        query = query_data['query']
        persona = query_data['persona']
        ground_truth = query_data['ground_truth_chunk_ids']

        # Generate response using our system
        response_data = agent.generate_response(query)

        # Extract retrieved document IDs. The retriever does not return a
        # paper_id, so this falls back to titles; the IDs must use the same
        # scheme as ground_truth_chunk_ids or every retrieval score will be 0.
        retrieved_ids = [
            doc.get('paper_id', doc['title'])
            for doc in response_data['context_used']
        ]

        # Log to vero-eval's trace database
        evaluator.log_query(
            query=query,
            retrieved_docs=retrieved_ids,
            generated_response=response_data['response'],
            metadata={'persona': persona}
        )

        # Evaluate retrieval and ranking (both compare retrieved vs. relevant IDs)
        for category, metrics in (('retrieval', retrieval_metrics),
                                  ('ranking', ranking_metrics)):
            for metric in metrics:
                score = metric.compute(
                    retrieved=retrieved_ids,
                    relevant=ground_truth
                )
                metric_name = metric.__class__.__name__
                results[category].setdefault(metric_name, []).append(score)

        # Evaluate generation
        for metric in generation_metrics:
            score = metric.compute(
                generated=response_data['response'],
                reference=query_data.get('reference_answer', ''),
                context=response_data['context_used']
            )
            metric_name = metric.__class__.__name__
            results['generation'].setdefault(metric_name, []).append(score)

        # Track per-persona performance
        if persona not in results['per_persona']:
            results['per_persona'][persona] = {
                'precision': [],
                'faithfulness': []
            }

        results['per_persona'][persona]['precision'].append(
            results['retrieval']['PrecisionMetric'][-1]
        )
        results['per_persona'][persona]['faithfulness'].append(
            results['generation']['FaithfulnessMetric'][-1]
        )

    # Aggregate results
    for category in ['retrieval', 'generation', 'ranking']:
        for metric_name, scores in results[category].items():
            results[category][metric_name] = {
                'mean': sum(scores) / len(scores),
                'min': min(scores),
                'max': max(scores),
                'std': float(np.std(scores))
            }

    # Save results
    with open('evaluation/results/full_evaluation.json', 'w') as f:
        json.dump(results, f, indent=2)

    print("✓ Evaluation complete!")
    print(f"  Retrieval Precision@5: {results['retrieval']['PrecisionMetric']['mean']:.3f}")
    print(f"  Retrieval Recall@5: {results['retrieval']['RecallMetric']['mean']:.3f}")
    print(f"  Generation Faithfulness: {results['generation']['FaithfulnessMetric']['mean']:.3f}")

    return results


if __name__ == "__main__":
    results = run_full_evaluation()
```
Run the evaluation:
```bash
python evaluation/run_evaluation.py
```
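To make the retrieval numbers easier to interpret, here is a plain-Python sketch of three of the metrics named above. It is independent of vero-eval and may differ from that library's exact definitions; the function names are hypothetical, and precision here divides by k, which is the common convention.

```python
from typing import List


def precision_at_k(retrieved: List[str], relevant: List[str], k: int) -> float:
    """Fraction of the top-k slots filled by relevant documents."""
    relevant_set = set(relevant)
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant_set)
    return hits / k if k > 0 else 0.0


def recall_at_k(retrieved: List[str], relevant: List[str], k: int) -> float:
    """Fraction of all relevant documents that appear in the top k."""
    relevant_set = set(relevant)
    if not relevant_set:
        return 0.0
    hits = len(relevant_set & set(retrieved[:k]))
    return hits / len(relevant_set)


def reciprocal_rank(retrieved: List[str], relevant: List[str]) -> float:
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    relevant_set = set(relevant)
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant_set:
            return 1.0 / rank
    return 0.0
```

Using the sample query shown earlier, if the system returned `["paper_47", "paper_3", "paper_89", "paper_5", "paper_8"]` against the ground truth `["paper_47", "paper_89", "paper_102"]`, two of five results are relevant (precision@5 of 0.4), two of three relevant papers were found (recall@5 of about 0.67), and the first hit is at rank 1 (reciprocal rank of 1.0). MRR is the mean of the reciprocal rank over all queries.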
Generating Performance Reports
vero-eval includes a report generator:
```python
from vero.report import ReportGenerator

# Generate comprehensive HTML report
generator = ReportGenerator(
    trace_db_path='evaluation/trace.db',
    results_path='evaluation/results/full_evaluation.json'
)

generator.generate_report(
    output_path='evaluation/results/performance_report.html',
    include_sections=[
        'executive_summary',
        'retrieval_analysis',
        'generation_analysis',
        'persona_breakdown',
        'failure_cases',
        'recommendations'
    ]
)

print("✓ Report generated: evaluation/results/performance_report.html")
```
This creates an interactive HTML report showing:
- Overall metrics with confidence intervals
- Per-persona performance breakdown
- Failure case analysis (queries where system performed poorly)
- Recommendations for improvement
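For readers who want to sanity-check an interval by hand, one common approach is a normal-approximation confidence interval around the mean score. How the report computes its own intervals is not covered here, so treat this as an independent sketch; `mean_confidence_interval` is a hypothetical helper.

```python
import math
import statistics
from typing import List, Tuple


def mean_confidence_interval(scores: List[float], z: float = 1.96) -> Tuple[float, float, float]:
    """Return (mean, lower, upper) using a normal approximation (z=1.96 for ~95%)."""
    if len(scores) < 2:
        raise ValueError("need at least two scores to estimate spread")
    mean = statistics.mean(scores)
    # Standard error of the mean, based on the sample standard deviation
    half_width = z * statistics.stdev(scores) / math.sqrt(len(scores))
    # Bounds are not clamped, so they can fall outside [0, 1] for small samples
    return mean, mean - half_width, mean + half_width
```

With only a few dozen queries per persona the interval will be wide, which is a useful reminder not to over-read small differences between personas.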
Part 6: The RLHF Feedback Loop
Now we close the loop: use vero-eval results to update the persona's RLHF thresholds:
```python
# evaluation/update_persona_from_results.py

import json

CITATION_REMINDER = (
    "\n\nIMPORTANT: Always cite specific papers when making claims. "
    "Do not speculate beyond what the retrieved papers state."
)


def update_persona_thresholds(evaluation_results: dict):
    """
    Analyze vero-eval results and adjust persona thresholds.
    This is the core RLHF mechanism.
    """

    # Load current persona config
    with open('data/persona.json') as f:
        persona_config = json.load(f)

    thresholds = persona_config['rlhf_thresholds']

    # Analyze retrieval performance
    retrieval_recall = evaluation_results['retrieval']['RecallMetric']['mean']

    if retrieval_recall < 0.6:
        # Low recall → need to retrieve more documents.
        # retrieval_limit may not exist yet, so start from the default of 5.
        thresholds['retrieval_limit'] = thresholds.get('retrieval_limit', 5) + 2
        thresholds['retrieval_required'] = min(1.0, thresholds['retrieval_required'] + 0.1)

        print("⚠️ Low recall detected. Increasing retrieval aggressiveness.")

    # Analyze generation faithfulness
    faithfulness = evaluation_results['generation']['FaithfulnessMetric']['mean']

    if faithfulness < 0.7:
        # Responses not faithful to sources → need stronger grounding
        thresholds['minimum_context_overlap'] = 0.4
        # Append the reminder only once, so repeated runs do not bloat the prompt
        if CITATION_REMINDER not in persona_config['system_prompt_template']:
            persona_config['system_prompt_template'] += CITATION_REMINDER

        print("⚠️ Low faithfulness detected. Strengthening citation requirements.")

    # Per-persona adjustments
    for persona_name, metrics in evaluation_results['per_persona'].items():
        avg_precision = sum(metrics['precision']) / len(metrics['precision'])

        if avg_precision < 0.5:
            print(f"⚠️ {persona_name} persona underperforming (Precision: {avg_precision:.2f})")

            # Could adjust persona-specific prompts here
            # For now, log for manual review

    # Save updated config
    with open('data/persona.json', 'w') as f:
        json.dump(persona_config, f, indent=2)

    print("✓ Persona thresholds updated based on evaluation results")


if __name__ == "__main__":
    # Usage after evaluation
    with open('evaluation/results/full_evaluation.json') as f:
        results = json.load(f)

    update_persona_thresholds(results)
```
The workflow becomes:
- Run system on test queries
- vero-eval measures performance
- Script analyzes metrics
- Persona thresholds adjust automatically
- Re-evaluate to confirm improvement
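This borrows the feedback-loop idea behind reinforcement learning from human feedback (RLHF), with two differences: the signal comes from automated evaluation rather than ad-hoc human ratings, and it adjusts configuration thresholds and prompts rather than model weights.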
Part 7: Integrating with the Frontend
Now we wire this into the Next.js chat interface. Update src/app/api/chat/route.ts:
```typescript
import { NextRequest } from 'next/server'
import { spawn } from 'child_process'
import path from 'path'

export async function POST(request: NextRequest) {
  const { message, messages, graphRAG = true } = await request.json()

  if (!graphRAG) {
    // Regular chat without RAG (handleRegularChat is assumed to come from the starter template)
    return handleRegularChat(message, messages)
  }

  // Call our Python reasoning agent
  const agentPath = path.join(process.cwd(), 'scripts', 'reasoning_agent.py')

  // The agent prints its result dict as JSON; retrieved papers are under `context_used`
  const result = await new Promise<{response: string, context_used: any[]}>((resolve, reject) => {
    const pythonProcess = spawn('python3', [
      agentPath,
      'generate',
      JSON.stringify({ query: message, chat_history: messages })
    ])

    let stdout = ''
    let stderr = ''

    pythonProcess.stdout.on('data', (data) => {
      stdout += data.toString()
    })

    pythonProcess.stderr.on('data', (data) => {
      stderr += data.toString()
    })

    // Fires when the process cannot be started at all (e.g. python3 not found)
    pythonProcess.on('error', reject)

    pythonProcess.on('close', (code) => {
      if (code === 0) {
        try {
          const result = JSON.parse(stdout)
          resolve(result)
        } catch (e) {
          reject(new Error(`Failed to parse response: ${e}`))
        }
      } else {
        reject(new Error(`Agent failed: ${stderr}`))
      }
    })
  })

  // Stream response back to client
  const stream = new ReadableStream({
    start(controller) {
      // Send response with context metadata
      const formatted = `${result.response}\n\n---\n**Sources:**\n${
        result.context_used.map((doc, i) =>
          `[${i+1}] ${doc.title} (${doc.year})`
        ).join('\n')
      }`

      controller.enqueue(new TextEncoder().encode(formatted))
      controller.close()
    }
  })

  return new Response(stream, {
    headers: {
      'Content-Type': 'text/plain; charset=utf-8',
    },
  })
}
```
Update the chat UI to show retrieval metadata:
```typescript
// In src/components/Chat.tsx

{message.role === 'assistant' && message.context && (
  <div className="mt-2 text-xs text-muted-foreground">
    <details>
      <summary className="cursor-pointer hover:text-foreground">
        📚 {message.context.length} sources retrieved
      </summary>
      <ul className="mt-2 space-y-1">
        {message.context.map((doc, i) => (
          <li key={i} className="flex items-center gap-2">
            <span className="font-mono">
              {doc.retrieval_method === 'vector_search' ? '🔍' : '🔗'}
            </span>
            <span>{doc.title}</span>
            <span className="text-muted-foreground">
              (relevance: {(doc.relevance_score * 100).toFixed(0)}%)
            </span>
          </li>
        ))}
      </ul>
    </details>
  </div>
)}
```
Now users can see which papers were retrieved and how (vector search vs. citation traversal).
Part 8: Running the Complete System
Setup Script
Create setup.sh:
```bash
#!/bin/bash

echo "🔬 Setting up Research Assistant GraphRAG System"

# 1. Install Python dependencies
echo "📦 Installing Python dependencies..."
pip install -r requirements.txt

# 2. Start Neo4j (assumes a docker-compose.yml that defines a "neo4j" service)
echo "🗄️ Starting Neo4j..."
docker-compose up -d neo4j

# Wait for Neo4j to be ready
echo "⏳ Waiting for Neo4j..."
until curl -s http://localhost:7474 > /dev/null; do
  sleep 2
done
echo "✓ Neo4j ready"

# 3. Start Ollama
echo "🤖 Checking Ollama..."
if ! command -v ollama &> /dev/null; then
  echo "Please install Ollama from https://ollama.ai"
  exit 1
fi

# If Ollama is already running, this second instance exits and the existing one is used
ollama serve &
sleep 5

# Pull required models
ollama pull mistral
ollama pull nomic-embed-text

# 4. Initialize Neo4j graph schema
# (assumes this script exists and creates the vector indexes shown earlier)
echo "📊 Initializing graph schema..."
python scripts/init_graph_schema.py

# 5. Ingest sample research papers
echo "📚 Ingesting sample papers..."
python scripts/ingest_research_data.py --directory data/sample_papers

# 6. Generate test dataset
echo "🧪 Generating evaluation dataset..."
python evaluation/generate_test_dataset.py

# 7. Run initial evaluation
echo "📈 Running initial evaluation..."
python evaluation/run_evaluation.py

# 8. Start Next.js frontend
echo "🌐 Starting frontend..."
npm install
npm run dev &

echo ""
echo "✅ Setup complete!"
echo ""
echo "🔗 Access points:"
echo "  Frontend: http://localhost:3000"
echo "  Neo4j Browser: http://localhost:7474"
echo "  Evaluation Reports: evaluation/results/"
echo ""
echo "📖 Next steps:"
echo "  1. Add your research papers to data/research_papers/"
echo "  2. Run: python scripts/ingest_research_data.py"
echo "  3. Chat with your research assistant at localhost:3000"
echo "  4. Check evaluation results in evaluation/results/"
```
Run it:
```bash
chmod +x setup.sh
./setup.sh
```
Part 9: Practical Use Cases and Patterns
Use Case 1: Literature Review Assistant
```python
# Example query patterns for literature reviews

queries = [
    "What are the main approaches to attention mechanisms in transformers since 2020?",
    "Find papers that cite Vaswani et al. 2017 and discuss efficiency improvements",
    "What experimental setups are common in graph neural network papers?",
    "Compare the methodologies used in top-cited RAG papers"
]

for query in queries:
    response = agent.generate_response(query)

    # System automatically:
    # 1. Retrieves relevant papers using hybrid search
    # 2. Traverses citation network
    # 3. Formats response with proper attributions
    # 4. Logs everything to vero-eval trace DB
```
Use Case 2: Cross-Domain Research Discovery
```python
# Finding connections between domains

query = """
Are there any techniques from computer vision that have been
successfully applied to natural language processing in the last 3 years?
"""

# The graph traversal will:
# 1. Find CV papers discussing specific techniques
# 2. Find NLP papers citing those CV papers
# 3. Identify the bridging concepts
# 4. Present a coherent narrative

response = agent.generate_response(query)
```
Use Case 3: Methodology Extraction
```python
# Extracting specific methodological details

query = """
What evaluation metrics are most commonly used in papers about
few-shot learning for NLP tasks?
"""

# Behind the scenes:
# 1. Retrieve few-shot NLP papers
# 2. Extract methodology sections (using LLM)
# 3. Aggregate metrics across papers
# 4. Present frequency analysis
```
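Steps 3 and 4 above (aggregate, then report frequencies) can be sketched in a few lines. This is a minimal illustration, not the pipeline's actual code: it assumes an earlier extraction step has already attached a `metrics` list to each paper, and both that field name and the `aggregate_metric_mentions` helper are hypothetical.

```python
from collections import Counter


def aggregate_metric_mentions(papers: list[dict]) -> list[tuple[str, int]]:
    """Count how many papers report each evaluation metric.

    Assumes each paper dict has a 'metrics' list produced by an earlier
    LLM extraction step (hypothetical field name).
    """
    counts = Counter()
    for paper in papers:
        # Normalise case and de-duplicate within a paper so that one paper
        # contributes at most one count per metric.
        unique = {m.strip().lower() for m in paper.get("metrics", [])}
        counts.update(unique)
    return counts.most_common()


sample = [
    {"title": "Paper A", "metrics": ["Accuracy", "F1"]},
    {"title": "Paper B", "metrics": ["accuracy", "Exact Match"]},
    {"title": "Paper C", "metrics": ["F1", "accuracy", "Accuracy"]},
]

for metric, n in aggregate_metric_mentions(sample):
    print(f"{metric}: reported in {n} of {len(sample)} papers")
```

Counting papers rather than raw mentions keeps one verbose paper from dominating the frequency table.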
Part 10: Measuring Success with vero-eval
After running the system for a while, check the vero-eval dashboard:
```python
# evaluation/generate_dashboard.py

from vero.dashboard import create_dashboard
from vero.trace_db import TraceDB

# Load trace database
trace_db = TraceDB('evaluation/trace.db')

# Create interactive dashboard
create_dashboard(
    trace_db=trace_db,
    output_path='evaluation/dashboard.html',
    metrics=[
        'retrieval_precision',
        'retrieval_recall',
        'generation_faithfulness',
        'response_time',
        'context_sufficiency'
    ],
    groupby=['persona', 'query_complexity']
)
```
This generates an interactive Plotly dashboard showing:
- Metric trends over time (is the system improving?)
- Persona performance comparison (which user types are we serving well?)
- Query complexity vs. accuracy (where do we struggle?)
- Retrieval method effectiveness (vector vs. graph traversal success rates)
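To make the retrieval metrics on the dashboard concrete, here is a small sketch of precision@k and recall@k using their standard textbook definitions. It is illustrative only; vero-eval's own implementations and metric names may differ, and the paper IDs are made up.

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k retrieved documents that are relevant."""
    top_k = retrieved[:k]
    if not top_k:
        return 0.0
    hits = sum(1 for doc_id in top_k if doc_id in relevant)
    return hits / len(top_k)


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of all relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


# Hypothetical IDs: the retriever returned five papers, three are truly relevant
retrieved_ids = ["p1", "p7", "p3", "p9", "p4"]
relevant_ids = {"p1", "p3", "p5"}

print(f"precision@5 = {precision_at_k(retrieved_ids, relevant_ids, 5):.2f}")
print(f"recall@5    = {recall_at_k(retrieved_ids, relevant_ids, 5):.2f}")
```

Here two of the five retrieved papers are relevant, and two of the three relevant papers were found, so the two numbers answer different questions: how much noise the researcher sees versus how much they miss.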
Advanced Patterns and Optimizations
Pattern 1: Caching Embeddings
For production, cache embeddings to avoid recomputation:
```python
import hashlib
import pickle
from pathlib import Path

import ollama


class EmbeddingCache:
    def __init__(self, cache_dir: Path = Path('cache/embeddings')):
        self.cache_dir = cache_dir
        self.cache_dir.mkdir(parents=True, exist_ok=True)

    def get_embedding(self, text: str, model: str = 'nomic-embed-text') -> list[float]:
        # Hash the text to get a stable, filesystem-safe cache key
        cache_key = hashlib.md5(text.encode()).hexdigest()
        cache_path = self.cache_dir / f"{cache_key}_{model}.pkl"

        # Return the cached embedding if we have one
        if cache_path.exists():
            with open(cache_path, 'rb') as f:
                return pickle.load(f)

        # Generate new embedding
        embedding = ollama.embeddings(model=model, prompt=text)['embedding']

        # Cache it
        with open(cache_path, 'wb') as f:
            pickle.dump(embedding, f)

        return embedding
```
Pattern 2: Batch Processing for Large Collections
When ingesting 1000+ papers:
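```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

# `driver`, `extract_paper_metadata` and `create_paper_node` are assumed to be
# defined in the ingestion script from the earlier parts of this guide.


def ingest_batch(papers: list[Path], batch_size: int = 10):
    """Process papers in batches to manage memory"""

    for i in range(0, len(papers), batch_size):
        batch = papers[i:i+batch_size]

        # Extract metadata in parallel (list() forces the work to finish here)
        with ThreadPoolExecutor(max_workers=batch_size) as executor:
            metadata_list = list(executor.map(extract_paper_metadata, batch))

        # Insert into Neo4j in single transaction
        with driver.session() as session:
            with session.begin_transaction() as tx:
                for metadata, pdf_path in zip(metadata_list, batch):
                    create_paper_node(tx, metadata, pdf_path)

                tx.commit()

        # Clamp the counter so the last partial batch is reported correctly
        done = min(i + batch_size, len(papers))
        print(f"✓ Processed {done}/{len(papers)} papers")
```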
Pattern 3: Incremental Evaluation
Don't wait for a full evaluation run. Track metrics continuously:
```python
from collections import deque


class ContinuousEvaluator:
    def __init__(self, alert_threshold: float = 0.6):
        self.alert_threshold = alert_threshold
        # Rolling window: only the last 50 scores are kept
        self.recent_scores = deque(maxlen=50)

    def evaluate_response(self, query: str, response: dict):
        # Quick evaluation on the fly
        score = self._quick_score(response)
        self.recent_scores.append(score)

        # Alert if the rolling average drops (needs at least 10 samples)
        if len(self.recent_scores) >= 10:
            avg = sum(self.recent_scores) / len(self.recent_scores)
            if avg < self.alert_threshold:
                self._send_alert(avg)

    def _quick_score(self, response: dict) -> float:
        # Lightweight scoring: did we retrieve context, and is the answer substantial?
        has_context = len(response['context_used']) > 0
        response_length = len(response['response'].split())

        return 0.7 * has_context + 0.3 * min(1.0, response_length / 100)

    def _send_alert(self, avg: float):
        # Placeholder: replace with your own notification channel
        print(f"⚠️ Rolling quality score {avg:.2f} is below {self.alert_threshold:.2f}")
```
Troubleshooting Common Issues
Issue 1: Neo4j Connection Errors
```python
# Test Neo4j connection
from neo4j import GraphDatabase


def test_connection():
    try:
        # The context manager closes the driver when we are done
        with GraphDatabase.driver(
            "bolt://localhost:7687",
            auth=("neo4j", "research2025")
        ) as driver:
            with driver.session() as session:
                session.run("RETURN 1 AS num").single()
                print("✓ Neo4j connection successful")

    except Exception as e:
        print(f"✗ Connection failed: {e}")
        print("  Make sure Neo4j is running: docker ps")


if __name__ == "__main__":
    test_connection()
```
Issue 2: Ollama Model Not Found
```bash
# Check available models
ollama list

# Pull missing models
ollama pull mistral
ollama pull nomic-embed-text

# Verify they work
ollama run mistral "Test query"
```
Issue 3: Low Retrieval Scores
Check your embeddings:
```python
# Verify embeddings are being generated correctly
from ingest_research_data import ResearchGraphBuilder

builder = ResearchGraphBuilder()

# Test on a sample paper
test_text = "Transformers are a type of neural network architecture..."
embedding = builder.graph_rag.generate_embedding(test_text)

# nomic-embed-text produces 768-dimensional vectors; whatever model you use,
# this number must match the dimension of your Neo4j vector index
print(f"Embedding dimension: {len(embedding)}")
print(f"Sample values: {embedding[:5]}")
```
Conclusion and Next Steps
You now have a production-ready Research Assistant with:
✅ Local-first architecture (no API costs, full privacy)
✅ Neo4j knowledge graph (papers, authors, concepts, citations)
✅ Hybrid retrieval (vector similarity + graph traversal)
✅ Persona-driven responses with RLHF adaptation
✅ Comprehensive evaluation via vero-eval framework
✅ Automated improvement through feedback loops
Recommended Next Steps:
- Expand the Dataset: Ingest your actual research papers

```bash
python scripts/ingest_research_data.py --directory ~/Documents/Research
```

- Run Weekly Evaluations: Set up a cron job

```bash
0 2 * * 0 cd /path/to/research-assistant && python evaluation/run_evaluation.py
```

- Fine-tune Personas: Create persona configs for different user types:
  - PhD Student persona (detail-oriented, wants methodology)
  - Senior Researcher persona (big picture, cross-domain)
  - Industry persona (practical applications)
- Integrate Additional Sources:
  - arXiv API for latest papers
  - Connected Papers for visualization
  - Semantic Scholar for citation data
- Scale Up:
  - Use a vector database (Pinecone, Weaviate) for 10K+ papers
  - Implement query result caching
  - Add paper summarization pipeline
Resources for Going Deeper
- Neo4j GenAI Integration: Official Documentation
- llama.cpp: Mastering Local LLM Integration
- vero-eval Framework: GitHub Repository
Production Deployment Checklist
Before deploying to production, ensure you've addressed:
```python
# deployment/production_checklist.py

PRODUCTION_CHECKLIST = {
    'Infrastructure': [
        '☐ Neo4j running with persistent volumes',
        '☐ Ollama configured with appropriate model cache',
        '☐ Redis/Memcached for query result caching',
        '☐ Load balancer for API endpoints',
        '☐ CDN for static assets'
    ],
    'Security': [
        '☐ API authentication implemented',
        '☐ Rate limiting configured (per user/IP)',
        '☐ Input sanitization for all user queries',
        '☐ Neo4j credentials rotated and secured',
        '☐ HTTPS enabled with valid certificates'
    ],
    'Monitoring': [
        '☐ Prometheus metrics exported',
        '☐ Grafana dashboards for system health',
        '☐ vero-eval continuous evaluation running',
        '☐ Error tracking (Sentry/Rollbar)',
        '☐ Query latency monitoring'
    ],
    'Data Management': [
        '☐ Automated backups of Neo4j database',
        '☐ Embedding cache backup strategy',
        '☐ Data retention policies defined',
        '☐ GDPR compliance for user queries',
        '☐ Paper metadata update pipeline'
    ],
    'Performance': [
        '☐ Embedding generation batched/cached',
        '☐ Neo4j indexes optimized',
        '☐ Query result caching implemented',
        '☐ Connection pooling configured',
        '☐ Async processing for long-running queries'
    ]
}
```
Part 11: Advanced vero-eval Techniques
Now let's dive deeper into what makes vero-eval exceptional for production AI systems.
Stress Testing with Adversarial Queries
vero-eval can generate adversarial test cases that expose edge cases:
```python
# evaluation/adversarial_testing.py

import json
import re

from vero.adversarial import AdversarialGenerator
from reasoning_agent import PersonaReasoningAgent


def run_adversarial_tests():
    """
    Generate adversarial queries designed to break the system.
    This reveals weaknesses before users find them.
    """

    agent = PersonaReasoningAgent()

    # Initialize adversarial generator
    # NOTE: load_valid_queries() is not defined in this file; it is assumed
    # to return your known-good baseline queries.
    adv_gen = AdversarialGenerator(
        base_queries=load_valid_queries(),
        attack_types=[
            'jailbreak',            # Try to bypass safety guardrails
            'context_overflow',     # Queries requiring huge context
            'ambiguous_reference',  # "the paper mentioned earlier" without context
            'temporal_confusion',   # Mixing past/future tenses
            'multi_hop_complex',    # Require 3+ reasoning steps
            'contradictory',        # Ask for contradicting information
            'out_of_domain'         # Queries completely outside research
        ]
    )

    adversarial_queries = adv_gen.generate(n=50)
    total = len(adversarial_queries)

    failures = []

    for query_data in adversarial_queries:
        query = query_data['query']
        attack_type = query_data['attack_type']

        print(f"Testing: {attack_type} - {query[:60]}...")

        try:
            response = agent.generate_response(query)

            # Check for failure modes
            if len(response['response']) < 10:
                failures.append({
                    'query': query,
                    'attack_type': attack_type,
                    'failure_mode': 'empty_response'
                })

            # detect_hallucinations returns a list; non-empty means a problem
            elif detect_hallucinations(
                response['response'],
                response['context_used']
            ):
                failures.append({
                    'query': query,
                    'attack_type': attack_type,
                    'failure_mode': 'hallucination'
                })

        except Exception as e:
            failures.append({
                'query': query,
                'attack_type': attack_type,
                'failure_mode': 'exception',
                'error': str(e)
            })

    # Generate failure report
    with open('evaluation/results/adversarial_failures.json', 'w') as f:
        json.dump(failures, f, indent=2)

    print(f"\n⚠️ Found {len(failures)} failure cases out of {total} adversarial queries")
    if total:
        print(f"   Failure rate: {len(failures)/total*100:.1f}%")

    # Categorize failures
    failure_by_type = {}
    for failure in failures:
        attack_type = failure['attack_type']
        failure_by_type[attack_type] = failure_by_type.get(attack_type, 0) + 1

    print("\n📊 Failures by attack type:")
    for attack_type, count in sorted(failure_by_type.items(),
                                     key=lambda x: x[1],
                                     reverse=True):
        print(f"   {attack_type}: {count}")

    return failures


def detect_hallucinations(response: str, context_docs: list) -> list:
    """
    Detect potential hallucinations by checking if claims in response
    are supported by retrieved context.
    """

    hallucinations = []

    # Extract claims from response (sentences making factual statements)
    claims = extract_claims(response)

    # Create context text corpus (tolerate missing abstracts)
    context_text = "\n".join([doc.get('abstract') or '' for doc in context_docs])
    context_tokens = set(context_text.lower().split())

    for claim in claims:
        # Check if claim is substantiated by context
        # Use simple token overlap for now (could use entailment model)
        claim_tokens = set(claim.lower().split())
        if not claim_tokens:
            continue

        overlap = len(claim_tokens & context_tokens) / len(claim_tokens)

        if overlap < 0.3:  # Less than 30% overlap suggests hallucination
            hallucinations.append({
                'claim': claim,
                'overlap_score': overlap,
                'severity': 'high' if overlap < 0.1 else 'medium'
            })

    return hallucinations


def extract_claims(response: str) -> list[str]:
    """Extract factual claims from response."""
    # Simple heuristic: sentences with "is", "are", "shows", "demonstrates"
    sentences = response.split('.')

    claim_indicators = {'is', 'are', 'shows', 'demonstrates', 'found', 'reports'}

    claims = [
        sent.strip() for sent in sentences
        # Match whole words so "is" does not fire on "this"
        if claim_indicators & set(re.findall(r"[a-z]+", sent.lower()))
        and len(sent.split()) > 5  # Substantial claim
    ]

    return claims


if __name__ == "__main__":
    failures = run_adversarial_tests()
```
Run this regularly:
```bash
# Weekly adversarial testing
0 3 * * 1 cd /path/to/research-assistant && python evaluation/adversarial_testing.py
```
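To see how the token-overlap heuristic in `detect_hallucinations` behaves, it helps to run it on two hand-made claims. The sketch below isolates the overlap calculation in a standalone `token_overlap` helper (a name introduced here for illustration); the context and claims are invented examples.

```python
def token_overlap(claim: str, context: str) -> float:
    """Share of the claim's distinct tokens that also occur in the context."""
    claim_tokens = set(claim.lower().split())
    context_tokens = set(context.lower().split())
    if not claim_tokens:
        return 0.0
    return len(claim_tokens & context_tokens) / len(claim_tokens)


# Invented example data
context = "Transformers rely on self-attention to capture dependencies"
supported = "Transformers use self-attention to model long-range dependencies"
unsupported = "The model achieves 99 percent accuracy on ImageNet"

for claim in (supported, unsupported):
    score = token_overlap(claim, context)
    flagged = score < 0.3  # same 30% cut-off as detect_hallucinations
    print(f"{score:.2f}  flagged={flagged}  {claim}")
```

The supported claim shares four of its seven tokens with the context and passes, while the unsupported one shares only a stop word and is flagged. That last point also shows the heuristic's weakness: common words inflate overlap, so filtering stop words or switching to an entailment model would make the check more reliable.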
Continuous Monitoring with vero-eval
Set up real-time quality monitoring:
```python
# evaluation/continuous_monitor.py

import smtplib
import sys
from datetime import datetime, timedelta
from email.mime.text import MIMEText

from vero.monitor import QualityMonitor


class ProductionMonitor:
    def __init__(self, trace_db_path: str):
        self.monitor = QualityMonitor(trace_db_path)
        self.alert_thresholds = {
            'precision_drop': 0.15,    # Alert if precision drops by 0.15 (absolute)
            'latency_spike': 2.0,      # Alert if latency > 2 seconds
            'error_rate': 0.05,        # Alert if error rate > 5%
            'faithfulness_drop': 0.20  # Alert if faithfulness drops by 0.20 (absolute)
        }

    def check_system_health(self):
        """
        Run every hour to check if system performance is degrading.
        """

        # Get metrics for last 24 hours
        recent_metrics = self.monitor.get_metrics(
            start_time=datetime.now() - timedelta(hours=24),
            end_time=datetime.now()
        )

        # Get baseline metrics (last week average)
        baseline_metrics = self.monitor.get_metrics(
            start_time=datetime.now() - timedelta(days=7),
            end_time=datetime.now() - timedelta(days=1)
        )

        alerts = []

        # Check for precision drop
        precision_drop = (
            baseline_metrics['precision'] - recent_metrics['precision']
        )
        if precision_drop > self.alert_thresholds['precision_drop']:
            alerts.append({
                'severity': 'high',
                'metric': 'precision',
                'message': f"Precision dropped by {precision_drop:.2%}",
                'baseline': baseline_metrics['precision'],
                'current': recent_metrics['precision']
            })

        # Check for latency spikes
        if recent_metrics['avg_latency'] > self.alert_thresholds['latency_spike']:
            alerts.append({
                'severity': 'medium',
                'metric': 'latency',
                'message': f"Average latency: {recent_metrics['avg_latency']:.2f}s",
                'baseline': baseline_metrics['avg_latency'],
                'current': recent_metrics['avg_latency']
            })

        # Check error rate
        if recent_metrics['error_rate'] > self.alert_thresholds['error_rate']:
            alerts.append({
                'severity': 'critical',
                'metric': 'error_rate',
                'message': f"Error rate: {recent_metrics['error_rate']:.2%}",
                'baseline': baseline_metrics['error_rate'],
                'current': recent_metrics['error_rate']
            })

        # Check faithfulness
        faithfulness_drop = (
            baseline_metrics['faithfulness'] - recent_metrics['faithfulness']
        )
        if faithfulness_drop > self.alert_thresholds['faithfulness_drop']:
            alerts.append({
                'severity': 'high',
                'metric': 'faithfulness',
                'message': f"Faithfulness dropped by {faithfulness_drop:.2%}",
                'baseline': baseline_metrics['faithfulness'],
                'current': recent_metrics['faithfulness']
            })

        # Send alerts if any
        if alerts:
            self.send_alerts(alerts)

        # Log to monitoring system
        self.log_health_check(recent_metrics, alerts)

        return alerts

    def send_alerts(self, alerts: list):
        """Send alerts via email/Slack/PagerDuty"""

        critical_alerts = [a for a in alerts if a['severity'] == 'critical']

        if critical_alerts:
            # Page on-call engineer
            self.page_oncall(critical_alerts)

        # Email summary
        email_body = self.format_alert_email(alerts)
        self.send_email(
            to='team@example.com',
            subject=f"🚨 Research Assistant Quality Alert - {len(alerts)} issues",
            body=email_body
        )

    def send_email(self, to: str, subject: str, body: str):
        """Send an HTML email. Assumes an SMTP relay on localhost; adjust as needed."""
        msg = MIMEText(body, 'html')
        msg['Subject'] = subject
        msg['From'] = 'research-assistant@example.com'  # placeholder sender
        msg['To'] = to
        with smtplib.SMTP('localhost') as server:
            server.send_message(msg)

    def page_oncall(self, alerts: list):
        """Placeholder: wire this to your paging service."""
        for alert in alerts:
            print(f"🚨 CRITICAL: {alert['message']}")

    def format_alert_email(self, alerts: list) -> str:
        """Format alerts as HTML email"""

        html = """
        <h2>Research Assistant Quality Alerts</h2>
        <p>The following performance degradations were detected:</p>
        <table border="1" cellpadding="10">
            <tr>
                <th>Severity</th>
                <th>Metric</th>
                <th>Baseline</th>
                <th>Current</th>
                <th>Message</th>
            </tr>
        """

        for alert in alerts:
            severity_color = {
                'critical': '#ff0000',
                'high': '#ff6600',
                'medium': '#ffaa00'
            }[alert['severity']]

            html += f"""
            <tr>
                <td style="background-color: {severity_color}; color: white;">
                    {alert['severity'].upper()}
                </td>
                <td>{alert['metric']}</td>
                <td>{alert['baseline']:.3f}</td>
                <td>{alert['current']:.3f}</td>
                <td>{alert['message']}</td>
            </tr>
            """

        html += """
        </table>
        <p>
            <a href="http://your-monitoring-url/dashboard">View Full Dashboard</a>
        </p>
        """

        return html

    def log_health_check(self, metrics: dict, alerts: list):
        """Log to your monitoring system (Prometheus/Datadog/etc)"""

        # Example: Push to Prometheus Pushgateway
        # In production, you'd use actual client library

        print(f"[{datetime.now()}] Health Check:")
        print(f"  Precision: {metrics['precision']:.3f}")
        print(f"  Recall: {metrics['recall']:.3f}")
        print(f"  Faithfulness: {metrics['faithfulness']:.3f}")
        print(f"  Avg Latency: {metrics['avg_latency']:.2f}s")
        print(f"  Error Rate: {metrics['error_rate']:.2%}")

        if alerts:
            print(f"  ⚠️ {len(alerts)} alerts triggered")
        else:
            print("  ✓ All metrics within normal range")


# Run as scheduled job
if __name__ == "__main__":
    monitor = ProductionMonitor('evaluation/trace.db')
    alerts = monitor.check_system_health()

    if alerts:
        sys.exit(1)  # Non-zero exit code for alerting systems
```
Set up as cron job:
```bash
# Check every hour
0 * * * * cd /path/to/research-assistant && python evaluation/continuous_monitor.py
```
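One detail worth checking before relying on these alerts: `ProductionMonitor` compares `baseline - current` against the threshold, which is an absolute difference in score, not a percentage of the baseline. The sketch below contrasts the two readings with made-up numbers; the helper names are illustrative and not part of vero-eval.

```python
def absolute_drop(baseline: float, current: float) -> float:
    """Difference in score points, as computed in check_system_health."""
    return baseline - current


def relative_drop(baseline: float, current: float) -> float:
    """Drop expressed as a fraction of the baseline value."""
    if baseline == 0:
        return 0.0
    return (baseline - current) / baseline


threshold = 0.15
baseline, current = 0.60, 0.50  # made-up precision values

abs_d = absolute_drop(baseline, current)
rel_d = relative_drop(baseline, current)

print(f"absolute drop: {abs_d:.3f} -> alert: {abs_d > threshold}")
print(f"relative drop: {rel_d:.3f} -> alert: {rel_d > threshold}")
```

With a baseline of 0.60, a fall to 0.50 does not trip an absolute 0.15 threshold but does exceed a relative 15% one. Which reading you want depends on how low your baseline is; for systems with modest baseline scores, the relative form tends to catch regressions earlier.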
Part 12: Scaling Beyond 10K Papers
As your research collection grows, you'll need to optimize:
1. Migrate to a Dedicated Vector Database
For 10K+ papers, Neo4j's vector indexes can become slow. Use a specialized vector DB:
```python
# scripts/migrate_to_pinecone.py
# NOTE: this uses the legacy pinecone-client 2.x API (pinecone.init). Newer SDK
# releases replaced it with a client object, so pin the version or adapt the calls.

import os

import ollama
import pinecone
from neo4j import GraphDatabase


def migrate_embeddings_to_pinecone():
    """
    Migrate embeddings from Neo4j to Pinecone for faster retrieval.
    Keep Neo4j for graph relationships, Pinecone for vector search.
    """

    # Initialize Pinecone
    pinecone.init(
        api_key=os.getenv("PINECONE_API_KEY"),
        environment="us-west1-gcp"
    )

    # Create index if it doesn't exist
    if "research-papers" not in pinecone.list_indexes():
        pinecone.create_index(
            name="research-papers",
            dimension=768,  # nomic-embed-text output size
            metric="cosine",
            pods=2,
            replicas=1,
            pod_type="p1.x1"
        )

    index = pinecone.Index("research-papers")

    # Extract embeddings from Neo4j
    driver = GraphDatabase.driver(
        "bolt://localhost:7687",
        auth=("neo4j", "research2025")
    )

    offset = 0
    with driver.session() as session:
        # Get papers in batches
        batch_size = 100

        while True:
            # ORDER BY ID(p) keeps SKIP/LIMIT pagination stable across batches
            papers = session.run("""
                MATCH (p:Paper)
                WHERE p.abstract_embedding IS NOT NULL
                RETURN p.title AS title,
                       p.abstract AS abstract,
                       p.abstract_embedding AS embedding,
                       p.year AS year,
                       ID(p) AS neo4j_id
                ORDER BY ID(p)
                SKIP $offset
                LIMIT $batch_size
                """,
                offset=offset,
                batch_size=batch_size
            ).data()

            if not papers:
                break

            # Prepare vectors for Pinecone
            vectors = []
            for paper in papers:
                vectors.append({
                    'id': str(paper['neo4j_id']),
                    'values': paper['embedding'],
                    'metadata': {
                        'title': paper['title'],
                        'abstract': (paper['abstract'] or '')[:500],  # Truncate
                        'year': paper['year'],
                        'neo4j_id': paper['neo4j_id']
                    }
                })

            # Upsert to Pinecone
            index.upsert(vectors=vectors, namespace="papers")

            # Advance by the number actually migrated so the total stays exact
            offset += len(papers)
            print(f"✓ Migrated {offset} papers")

    driver.close()
    print(f"\n✅ Migration complete! {offset} papers in Pinecone")


# Update retriever to use Pinecone
class HybridRetrieverWithPinecone:
    def __init__(self, neo4j_driver, pinecone_index_name="research-papers"):
        self.neo4j_driver = neo4j_driver
        self.pinecone_index = pinecone.Index(pinecone_index_name)

    def retrieve_context(self, query: str, limit: int = 5) -> list[dict]:
        """Hybrid retrieval using Pinecone + Neo4j graph"""

        # 1. Vector search with Pinecone
        query_embedding = ollama.embeddings(
            model='nomic-embed-text',
            prompt=query
        )['embedding']

        pinecone_results = self.pinecone_index.query(
            vector=query_embedding,
            top_k=limit * 2,
            include_metadata=True,
            namespace="papers"
        )

        # 2. Get Neo4j IDs from Pinecone results
        neo4j_ids = [
            int(match['metadata']['neo4j_id'])
            for match in pinecone_results['matches']
        ]

        # 3. Enrich with graph relationships from Neo4j
        with self.neo4j_driver.session() as session:
            enriched = session.run("""
                UNWIND $neo4j_ids AS paper_id
                MATCH (p:Paper) WHERE ID(p) = paper_id
                OPTIONAL MATCH (p)<-[:AUTHORED]-(a:Author)
                OPTIONAL MATCH (p)-[:DISCUSSES]->(c:Concept)
                OPTIONAL MATCH (p)-[:CITES]->(cited:Paper)

                RETURN
                    ID(p) AS neo4j_id,
                    p.title AS title,
                    p.abstract AS abstract,
                    p.year AS year,
                    collect(DISTINCT a.name) AS authors,
                    collect(DISTINCT c.name) AS concepts,
                    collect(DISTINCT cited.title) AS citations
                """,
                neo4j_ids=neo4j_ids
            ).data()

        # Index rows by ID: Neo4j may return fewer rows, or a different order,
        # than the Pinecone matches, so joining by position would be unsafe
        enriched_by_id = {row['neo4j_id']: row for row in enriched}

        # 4. Combine Pinecone scores with Neo4j metadata
        results = []
        for match in pinecone_results['matches']:
            neo4j_data = enriched_by_id.get(int(match['metadata']['neo4j_id']), {})

            results.append({
                'title': neo4j_data.get('title', match['metadata']['title']),
                'abstract': neo4j_data.get('abstract', match['metadata']['abstract']),
                'year': neo4j_data.get('year', match['metadata']['year']),
                'authors': neo4j_data.get('authors', []),
                'concepts': neo4j_data.get('concepts', []),
                'citations': neo4j_data.get('citations', []),
                'relevance_score': match['score'],
                'retrieval_method': 'pinecone_vector_search'
            })

        return results[:limit]
```
Benefits of this architecture:
- Pinecone handles 10M+ vectors easily
- Neo4j focuses on graph relationships (citations, authorship)
- Best of both worlds: fast vector search + rich graph traversal
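One way to actually use both signals is to re-rank the vector hits with a graph-derived score. The sketch below is an assumption on my part rather than something the retriever above does: it blends `relevance_score` with a normalised count of the `citations` list, and both the `fuse_scores` helper and the `alpha` weight are hypothetical and would need tuning against your evaluation set.

```python
def fuse_scores(results: list[dict], alpha: float = 0.8) -> list[dict]:
    """Blend vector similarity with a simple graph signal.

    alpha weights the vector score; (1 - alpha) weights the graph score,
    here the number of outgoing citations normalised to [0, 1].
    """
    max_links = max((len(r.get("citations", [])) for r in results), default=0)

    fused = []
    for r in results:
        links = len(r.get("citations", []))
        graph_score = links / max_links if max_links else 0.0
        combined = alpha * r["relevance_score"] + (1 - alpha) * graph_score
        fused.append({**r, "combined_score": combined})

    # Highest combined score first
    return sorted(fused, key=lambda r: r["combined_score"], reverse=True)


# Made-up retrieval results in the shape returned by retrieve_context
hits = [
    {"title": "A", "relevance_score": 0.80, "citations": []},
    {"title": "B", "relevance_score": 0.78, "citations": ["w", "x", "y", "z"]},
]

for r in fuse_scores(hits):
    print(f"{r['title']}: {r['combined_score']:.3f}")
```

In this example the well-connected paper B overtakes A despite a slightly lower similarity score. Whether that is desirable is an empirical question, which is exactly what the precision and faithfulness metrics from the evaluation runs are for.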
2. Implement Query Result Caching
```python
# lib/query_cache.py

import hashlib
import json
from datetime import timedelta
from typing import Optional

import redis

from reasoning_agent import PersonaReasoningAgent


class QueryCache:
    def __init__(self, redis_url: str = "redis://localhost:6379"):
        self.redis = redis.from_url(redis_url)
        self.ttl = timedelta(hours=24)  # Cache for 24 hours

    def get_cached_response(self, query: str, persona_config: dict) -> Optional[dict]:
        """
        Check if we have a cached response for this query+persona combination.
        """

        # Create cache key from query + persona config
        cache_key = self._create_cache_key(query, persona_config)

        cached = self.redis.get(cache_key)
        if cached:
            print(f"✓ Cache hit for query: {query[:50]}...")
            return json.loads(cached)

        return None

    def cache_response(self, query: str, persona_config: dict, response: dict):
        """Store response in cache"""

        cache_key = self._create_cache_key(query, persona_config)

        self.redis.setex(
            cache_key,
            self.ttl,
            json.dumps(response)
        )

    def _create_cache_key(self, query: str, persona_config: dict) -> str:
        """Create deterministic cache key"""

        # sort_keys makes the hash independent of dict ordering
        persona_hash = hashlib.md5(
            json.dumps(persona_config, sort_keys=True).encode()
        ).hexdigest()

        query_hash = hashlib.md5(query.encode()).hexdigest()

        return f"query_cache:{query_hash}:{persona_hash}"

    def invalidate_cache(self):
        """Invalidate all cached queries (e.g., after persona update)"""

        # scan_iter walks the keyspace incrementally instead of blocking Redis
        keys = list(self.redis.scan_iter(match="query_cache:*"))
        if keys:
            self.redis.delete(*keys)
            print(f"✓ Invalidated {len(keys)} cached queries")


# Integrate into reasoning agent
class CachedReasoningAgent(PersonaReasoningAgent):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.cache = QueryCache()

    def generate_response(self, query: str, chat_history: list = None) -> dict:
        """Generate response with caching"""

        # Only cache standalone queries: with chat history, the same query
        # text can legitimately need a different answer
        use_cache = not chat_history

        # Check cache first
        if use_cache:
            cached = self.cache.get_cached_response(query, self.persona_config)
            if cached:
                return cached

        # Generate fresh response
        response = super().generate_response(query, chat_history)

        # Cache if quality is good
        if use_cache and response.get('quality_grade', 0) > 0.7:
            self.cache.cache_response(query, self.persona_config, response)

        return response
```
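The cache only works if `_create_cache_key` is deterministic, so it is worth pinning that behaviour down with a quick check. The sketch below copies the key logic into a standalone function (named `create_cache_key` here for illustration) so it can run without Redis; the persona values are made up.

```python
import hashlib
import json


def create_cache_key(query: str, persona_config: dict) -> str:
    """Standalone copy of the key logic from QueryCache._create_cache_key."""
    persona_hash = hashlib.md5(
        json.dumps(persona_config, sort_keys=True).encode()
    ).hexdigest()
    query_hash = hashlib.md5(query.encode()).hexdigest()
    return f"query_cache:{query_hash}:{persona_hash}"


query = "What are common few-shot evaluation metrics?"

# Same settings, different insertion order
key_a = create_cache_key(query, {"formality": 0.7, "technical_detail": 0.8})
key_b = create_cache_key(query, {"technical_detail": 0.8, "formality": 0.7})

# A changed persona setting
key_c = create_cache_key(query, {"formality": 0.9, "technical_detail": 0.8})

print("order-independent:", key_a == key_b)
print("persona-sensitive:", key_a != key_c)
```

Because `sort_keys=True` normalises dictionary ordering, two equal configs always share a key, while any RLHF-driven change to a persona value produces a new key and so naturally bypasses stale entries.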
3. Batch Embedding Generation
When ingesting large collections:
```python
# scripts/batch_embedding_generator.py

import math
import time
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

import ollama


class BatchEmbeddingGenerator:
    def __init__(self, model: str = 'nomic-embed-text', max_workers: int = 4):
        self.model = model
        self.max_workers = max_workers
        self.rate_limit_delay = 0.1  # 100ms between requests

    def generate_embeddings_batch(self, texts: list[str]) -> list[list[float]]:
        """
        Generate embeddings for multiple texts in parallel with rate limiting.
        """

        with ThreadPoolExecutor(max_workers=self.max_workers) as executor:
            # Submit all tasks, pausing briefly between submissions
            futures = []
            for i, text in enumerate(texts):
                futures.append(executor.submit(self._generate_single, text, i))
                time.sleep(self.rate_limit_delay)

            # Futures were submitted in input order, so collecting them in
            # the same order preserves the alignment with `texts`
            return [future.result()[0] for future in futures]

    def _generate_single(self, text: str, index: int) -> tuple[list[float], int]:
        """Generate single embedding with retry logic"""

        max_retries = 3
        for attempt in range(max_retries):
            try:
                response = ollama.embeddings(
                    model=self.model,
                    # Rough guard: this truncates characters, not tokens
                    prompt=text[:8192]
                )
                return response['embedding'], index

            except Exception as e:
                if attempt == max_retries - 1:
                    raise

                print(f"⚠️ Retry {attempt+1}/{max_retries} for text {index}: {e}")
                time.sleep(2 ** attempt)  # Exponential backoff


# Use in ingestion pipeline
# `neo4j_driver`, `extract_paper_metadata` and `create_paper_node` are assumed
# to come from the ingestion script built earlier in this guide.
def ingest_large_collection(papers: list[Path]):
    """Efficiently ingest 1000+ papers"""

    generator = BatchEmbeddingGenerator(max_workers=8)

    # Process in batches of 50
    batch_size = 50
    total_batches = math.ceil(len(papers) / batch_size)

    for i in range(0, len(papers), batch_size):
        batch = papers[i:i+batch_size]
        batch_number = i // batch_size + 1

        print(f"Processing batch {batch_number}/{total_batches}")

        # Extract abstracts
        abstracts = []
        metadata_list = []
        for paper_path in batch:
            metadata = extract_paper_metadata(paper_path)
            abstracts.append(metadata['abstract'])
            metadata_list.append(metadata)

        # Generate embeddings in parallel
        embeddings = generator.generate_embeddings_batch(abstracts)

        # Insert into database
        with neo4j_driver.session() as session:
            for metadata, embedding in zip(metadata_list, embeddings):
                metadata['abstract_embedding'] = embedding
                create_paper_node(session, metadata)

        print(f"✓ Ingested batch {batch_number}")
```
Part 13: Real-World Production Case Study
Let's walk through a complete example from a hypothetical research lab:
Scenario: Computational Biology Research Lab
Requirements:
- 5,000 existing papers in their collection
- Weekly updates with new publications
- 15 active researchers with different expertise levels
- Need to find cross-domain connections (CS ↔ Biology)
- High precision required (wrong papers waste researcher time)
Implementation:
```python
# config/bio_lab_config.py

RESEARCH_LAB_CONFIG = {
    'name': 'Computational Biology Lab',
    'paper_sources': [
        'local_collection',  # Existing 5K papers
        'pubmed_api',        # Weekly updates
        'biorxiv_api',       # Preprints
        'arxiv_bio'          # CS bio papers
    ],
    'personas': {
        'wet_lab_biologist': {
            'description': 'Bench scientists with limited CS background',
            'rlhf_thresholds': {
                'technical_detail': 0.3,   # Less technical jargon
                'methodology_depth': 0.8,  # High experimental detail
                'formality': 0.5
            },
            'preferred_sources': ['Nature', 'Cell', 'Science']
        },
        'computational_biologist': {
            'description': 'Hybrid CS/Bio expertise',
            'rlhf_thresholds': {
                'technical_detail': 0.8,   # Can handle complexity
                'methodology_depth': 0.9,  # Wants algorithm details
                'formality': 0.7
            },
            'preferred_sources': ['Nature Methods', 'Bioinformatics', 'PLOS Comp Bio']
        },
        'pi_researcher': {
            'description': 'Principal investigator, needs big picture',
            'rlhf_thresholds': {
                'technical_detail': 0.5,   # Balanced
                'methodology_depth': 0.4,  # Focus on conclusions
                'formality': 0.9           # Very formal
            },
            'preferred_sources': ['High-impact journals', 'Review articles']
        }
    },
    'quality_requirements': {
        'min_precision': 0.85,     # Must retrieve >85% relevant papers
        'min_faithfulness': 0.90,  # Responses must be 90% faithful to sources
        'max_latency': 3.0         # 3 second max response time
    }
}
```
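The `quality_requirements` block is only useful if something enforces it. A minimal sketch of such a gate follows, assuming the evaluation run yields a dict with `precision`, `faithfulness` and `avg_latency` keys (the key names used by the monitoring code earlier); the `check_quality` helper and the sample numbers are hypothetical.

```python
# Thresholds copied from the 'quality_requirements' section of the lab config
QUALITY_REQUIREMENTS = {
    "min_precision": 0.85,
    "min_faithfulness": 0.90,
    "max_latency": 3.0,
}


def check_quality(metrics: dict, requirements: dict) -> list[str]:
    """Return a human-readable list of violated requirements (empty if none)."""
    failures = []
    if metrics["precision"] < requirements["min_precision"]:
        failures.append(
            f"precision {metrics['precision']:.2f} < {requirements['min_precision']:.2f}"
        )
    if metrics["faithfulness"] < requirements["min_faithfulness"]:
        failures.append(
            f"faithfulness {metrics['faithfulness']:.2f} < {requirements['min_faithfulness']:.2f}"
        )
    if metrics["avg_latency"] > requirements["max_latency"]:
        failures.append(
            f"latency {metrics['avg_latency']:.2f}s > {requirements['max_latency']:.2f}s"
        )
    return failures


# Made-up evaluation output for one persona
sample_metrics = {"precision": 0.88, "faithfulness": 0.86, "avg_latency": 2.1}

problems = check_quality(sample_metrics, QUALITY_REQUIREMENTS)
print("PASS" if not problems else "FAIL: " + "; ".join(problems))
```

Running a check like this per persona after each evaluation, and failing the deployment step when the list is non-empty, turns the config's targets into an actual release gate.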
Setup Script:
```bash
#!/bin/bash
# setup_bio_lab.sh

echo "🧬 Setting up Computational Biology Research Assistant"

# 1. Ingest existing collection
echo "📚 Ingesting 5,000 existing papers..."
python scripts/ingest_research_data.py \
  --directory /data/lab_papers \
  --batch-size 50 \
  --parallel-workers 8

# 2. Set up automated paper updates
echo "📰 Configuring automated updates..."
python scripts/setup_paper_updates.py \
  --sources pubmed,biorxiv,arxiv \
  --schedule daily \
  --filter "computational biology OR bioinformatics"

# 3. Generate persona-specific test datasets
echo "🧪 Generating evaluation datasets..."
python evaluation/generate_test_dataset.py \
  --personas wet_lab,computational,pi \
  --queries-per-persona 50

# 4. Run initial evaluation
echo "📊 Running baseline evaluation..."
python evaluation/run_evaluation.py \
  --config config/bio_lab_config.py

# 5. Deploy to production
echo "🚀 Deploying to production..."
docker-compose -f docker-compose.bio-lab.yml up -d

echo "✅ Setup complete!"
echo "  Dashboard: http://lab-research-assistant.local"
echo "  Monitoring: http://lab-research-assistant.local/metrics"
```
Weekly Evaluation Report Email:
```python
# scripts/weekly_report.py

import json
import os
import smtplib
from email.mime.image import MIMEImage
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText
from html import escape

import matplotlib
matplotlib.use('Agg')  # headless backend so the script can run from a scheduler
import matplotlib.pyplot as plt
from vero.report import ReportGenerator

PRECISION_TARGET = 0.85


def target_status(value, target=PRECISION_TARGET):
    """Return a status marker for a metric measured against its target."""
    return '✓' if value > target else '⚠️ Below target'


def generate_weekly_report(weekly_data, summary):
    """
    Automated weekly report sent to PI and lab members.

    weekly_data: trend series and distributions used by the plots.
    summary: scalar values used by the email body.
    Assumption: the caller prepares both; the keys are the ones read below.
    """

    # Generate vero-eval report
    # (the metrics below are passed in by the caller; adapt this to however
    # your vero-eval version exposes results)
    generator = ReportGenerator(
        trace_db_path='evaluation/trace.db',
        results_path='evaluation/results/weekly.json'
    )

    # Unpack the scalar metrics and derive week-over-week changes
    week_number = summary['week_number']
    date_range = summary['date_range']
    current_precision = summary['current_precision']
    last_precision = summary['last_precision']
    current_faithfulness = summary['current_faithfulness']
    last_faithfulness = summary['last_faithfulness']
    current_latency = summary['current_latency']
    last_latency = summary['last_latency']
    current_queries = summary['current_queries']
    last_queries = summary['last_queries']

    change = current_precision - last_precision
    faith_change = current_faithfulness - last_faithfulness
    latency_change = current_latency - last_latency
    queries_change = current_queries - last_queries

    # Latest weekly value per persona
    # (assumes each series is ordered oldest to newest)
    wet_lab_precision = weekly_data['wet_lab_precision'][-1]
    comp_bio_precision = weekly_data['computational_precision'][-1]
    pi_precision = weekly_data['pi_precision'][-1]

    issues_html = '\n'.join(
        f'<li>{escape(issue)}</li>' for issue in summary['issues']
    )

    # Create visualizations
    fig, axes = plt.subplots(2, 2, figsize=(12, 10))

    # 1. Precision trends by persona
    axes[0, 0].plot(weekly_data['wet_lab_precision'],
                    label='Wet Lab', marker='o')
    axes[0, 0].plot(weekly_data['computational_precision'],
                    label='Computational', marker='s')
    axes[0, 0].plot(weekly_data['pi_precision'],
                    label='PI', marker='^')
    axes[0, 0].set_title('Retrieval Precision by Persona')
    axes[0, 0].set_xlabel('Week')
    axes[0, 0].set_ylabel('Precision@5')
    axes[0, 0].legend()
    axes[0, 0].grid(True, alpha=0.3)

    # 2. Faithfulness over time
    axes[0, 1].plot(weekly_data['faithfulness'], color='green', marker='o')
    axes[0, 1].axhline(y=0.90, color='r', linestyle='--',
                       label='Target (90%)')
    axes[0, 1].set_title('Response Faithfulness')
    axes[0, 1].set_xlabel('Week')
    axes[0, 1].set_ylabel('Faithfulness Score')
    axes[0, 1].legend()
    axes[0, 1].grid(True, alpha=0.3)

    # 3. Query latency distribution
    axes[1, 0].hist(weekly_data['latencies'], bins=30, edgecolor='black')
    axes[1, 0].axvline(x=3.0, color='r', linestyle='--',
                       label='Max Latency (3s)')
    axes[1, 0].set_title('Query Latency Distribution')
    axes[1, 0].set_xlabel('Latency (seconds)')
    axes[1, 0].set_ylabel('Frequency')
    axes[1, 0].legend()

    # 4. Top failure categories
    failure_categories = weekly_data['failure_categories']
    axes[1, 1].barh(
        list(failure_categories.keys()),
        list(failure_categories.values())
    )
    axes[1, 1].set_title('Top Failure Categories')
    axes[1, 1].set_xlabel('Count')

    plt.tight_layout()
    plt.savefig('evaluation/results/weekly_report.png', dpi=150)
    plt.close(fig)

    # Create email
    msg = MIMEMultipart()
    msg['Subject'] = f'Research Assistant Weekly Report - Week {week_number}'
    msg['From'] = 'research-assistant@lab.edu'
    msg['To'] = 'pi@lab.edu, lab-members@lab.edu'

    # Email body
    html_body = f"""
    <html>
    <body>
      <h2>Research Assistant Performance Report</h2>
      <h3>Week {week_number} - {date_range}</h3>

      <h4>📊 Key Metrics</h4>
      <table border="1" cellpadding="10">
        <tr>
          <th>Metric</th>
          <th>This Week</th>
          <th>Last Week</th>
          <th>Change</th>
        </tr>
        <tr>
          <td>Avg Precision@5</td>
          <td>{current_precision:.2%}</td>
          <td>{last_precision:.2%}</td>
          <td style="color: {'green' if change > 0 else 'red'};">
            {change:+.2%}
          </td>
        </tr>
        <tr>
          <td>Faithfulness</td>
          <td>{current_faithfulness:.2%}</td>
          <td>{last_faithfulness:.2%}</td>
          <td style="color: {'green' if faith_change > 0 else 'red'};">
            {faith_change:+.2%}
          </td>
        </tr>
        <tr>
          <td>Avg Latency</td>
          <td>{current_latency:.2f}s</td>
          <td>{last_latency:.2f}s</td>
          <td style="color: {'green' if latency_change < 0 else 'red'};">
            {latency_change:+.2f}s
          </td>
        </tr>
        <tr>
          <td>Queries Served</td>
          <td>{current_queries}</td>
          <td>{last_queries}</td>
          <td>{queries_change:+d}</td>
        </tr>
      </table>

      <h4>🎯 Performance by Persona</h4>
      <ul>
        <li><strong>Wet Lab Biologists:</strong>
          Precision: {wet_lab_precision:.2%}
          (Target: >{PRECISION_TARGET:.0%} {target_status(wet_lab_precision)})
        </li>
        <li><strong>Computational Biologists:</strong>
          Precision: {comp_bio_precision:.2%}
          (Target: >{PRECISION_TARGET:.0%} {target_status(comp_bio_precision)})
        </li>
        <li><strong>PI Queries:</strong>
          Precision: {pi_precision:.2%}
          (Target: >{PRECISION_TARGET:.0%} {target_status(pi_precision)})
        </li>
      </ul>

      <h4>⚠️ Issues & Recommendations</h4>
      <ul>
        {issues_html}
      </ul>

      <p>See attached visualization for detailed trends.</p>

      <p>
        <a href="http://lab-research-assistant.local/dashboard">
          View Interactive Dashboard
        </a>
      </p>

    </body>
    </html>
    """

    msg.attach(MIMEText(html_body, 'html'))

    # Attach visualization
    with open('evaluation/results/weekly_report.png', 'rb') as f:
        img = MIMEImage(f.read())
        img.add_header('Content-Disposition', 'attachment',
                       filename='weekly_trends.png')
        msg.attach(img)

    # Send email (os.environ raises KeyError if the password is not set,
    # instead of attempting a login with None)
    with smtplib.SMTP('smtp.lab.edu', 587) as smtp:
        smtp.starttls()
        smtp.login('research-assistant@lab.edu', os.environ['EMAIL_PASSWORD'])
        smtp.send_message(msg)

    print("✓ Weekly report sent to lab members")


if __name__ == "__main__":
    # Assumption: the results file stores the report inputs under the keys
    # 'weekly_data' and 'summary'; adjust to your actual results layout.
    with open('evaluation/results/weekly.json') as f:
        report_inputs = json.load(f)
    generate_weekly_report(report_inputs['weekly_data'],
                           report_inputs['summary'])
```
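The report script reads its trend series from a `weekly_data` dictionary. One way to assemble that dictionary from per-query evaluation records is sketched below. The record fields (`week`, `persona`, `precision_at_5`, `faithfulness`, `latency_s`, `failure_category`) and the function name `build_weekly_data` are assumptions made for illustration, not part of vero-eval; adapt them to whatever your trace database actually stores.

```python
import math
from collections import Counter, defaultdict
from statistics import mean


def build_weekly_data(records, personas=("wet_lab", "computational", "pi")):
    """Aggregate per-query records into the series the weekly report plots.

    Hypothetical record keys: week, persona, precision_at_5, faithfulness,
    latency_s, failure_category (None when the query succeeded).
    """
    weeks = sorted({r["week"] for r in records})
    precision = defaultdict(list)     # (week, persona) -> precision values
    faithfulness = defaultdict(list)  # week -> faithfulness values
    for r in records:
        precision[(r["week"], r["persona"])].append(r["precision_at_5"])
        faithfulness[r["week"]].append(r["faithfulness"])

    def weekly_mean(persona):
        # NaN marks weeks in which a persona issued no queries
        return [
            mean(precision[(w, persona)]) if (w, persona) in precision else math.nan
            for w in weeks
        ]

    failures = Counter(
        r["failure_category"] for r in records if r["failure_category"]
    )
    data = {f"{p}_precision": weekly_mean(p) for p in personas}
    data["faithfulness"] = [mean(faithfulness[w]) for w in weeks]
    data["latencies"] = [r["latency_s"] for r in records]
    data["failure_categories"] = dict(failures.most_common(5))
    return data


# Illustrative records only; real values come from your evaluation runs
sample_records = [
    {"week": 1, "persona": "wet_lab", "precision_at_5": 0.5,
     "faithfulness": 1.0, "latency_s": 1.2, "failure_category": None},
    {"week": 1, "persona": "wet_lab", "precision_at_5": 1.0,
     "faithfulness": 0.5, "latency_s": 2.4,
     "failure_category": "missing_citation"},
    {"week": 2, "persona": "pi", "precision_at_5": 0.8,
     "faithfulness": 0.9, "latency_s": 3.1,
     "failure_category": "missing_citation"},
]
weekly_data = build_weekly_data(sample_records)
```

The persona keys match the `--personas wet_lab,computational,pi` flag from the setup script, so the resulting dictionary has the `wet_lab_precision`, `computational_precision`, and `pi_precision` series the plots expect.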

Conclusion: The Complete Picture
You now have everything needed to build, evaluate, and deploy a production-ready Research Assistant:
Core Architecture:
✅ Neo4j knowledge graph for research papers
✅ Ollama for local LLM inference
✅ Hybrid retrieval (vector + graph)
✅ Persona-driven responses with RLHF
Evaluation & Quality:
✅ vero-eval for rigorous testing
✅ Automated adversarial testing
✅ Continuous monitoring with alerts
✅ Weekly performance reports
Production Features:
✅ Caching for performance
✅ Batch processing for scale
✅ Automated paper updates
✅ Multi-persona support
The vero-eval Advantage:
What makes this system production-ready is the evaluation framework. Unlike traditional RAG systems that rely on gut feeling and spot-checking, we have:
- Systematic edge case testing - adversarial queries expose weaknesses
- Persona stress testing - ensures all user types are served well
- Automated regression detection - alerts when quality degrades
- Actionable metrics - precision/recall/faithfulness directly inform improvements
- Continuous learning - the RLHF loop is driven by real performance data
This is the difference between a demo and a system you'd trust with real research workflows.
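To make the "actionable metrics" point concrete, the sketch below computes Precision@k and Recall@k for a single query using their conventional definitions. It is plain Python written for illustration, not vero-eval's implementation, and the function names and paper IDs are hypothetical.

```python
def precision_at_k(retrieved, relevant, k=5):
    """Fraction of the top-k slots filled with relevant items.

    Divides by k, so returning fewer than k results is penalized.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / k


def recall_at_k(retrieved, relevant, k=5):
    """Fraction of all relevant items that appear in the top k."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / len(relevant)


# Hypothetical IDs: ranked retrieval output and the ground-truth set
retrieved_ids = ["p1", "p7", "p3", "p9", "p4"]
relevant_ids = {"p1", "p3", "p5"}

print(precision_at_k(retrieved_ids, relevant_ids))  # 0.4 (2 hits in 5 slots)
print(recall_at_k(retrieved_ids, relevant_ids))     # 2 of 3 relevant papers found
```

Low precision with high recall suggests the retriever finds the right papers but ranks noise alongside them; the reverse suggests relevant papers are missing from the candidate set altogether.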
Next Steps:
- Clone the starter repo and follow the setup script
- Ingest your first 100 papers to test the pipeline
- Run vero-eval to establish your baseline
- Iterate on retrieval and persona prompts
- Deploy to staging and gather feedback
- Use weekly reports to drive improvements
Remember: The goal isn't perfect accuracy on day one. It's building a system that measurably improves over time through evaluation-driven iteration.
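Evaluation-driven iteration needs a way to tell whether a new run is better or worse than the baseline. A minimal sketch of such a check follows; the function name `find_regressions`, the metric names, and the 5% relative tolerance are assumptions chosen for illustration, not settings from vero-eval.

```python
def find_regressions(baseline, current, rel_tolerance=0.05,
                     lower_is_better=("latency_s",)):
    """Return metrics that got worse than the baseline by more than the tolerance.

    baseline, current: dicts mapping metric name -> value.
    The result maps each regressed metric to its absolute change.
    """
    regressions = {}
    for name, base in baseline.items():
        if name not in current or base == 0:
            continue  # nothing meaningful to compare against
        rel_change = (current[name] - base) / abs(base)
        if name in lower_is_better:
            rel_change = -rel_change  # flip sign so negative always means worse
        if rel_change < -rel_tolerance:
            regressions[name] = current[name] - base
    return regressions


# Illustrative numbers only
baseline_metrics = {"precision_at_5": 0.80, "faithfulness": 0.92, "latency_s": 2.0}
current_metrics = {"precision_at_5": 0.70, "faithfulness": 0.93, "latency_s": 2.05}

for metric, delta in find_regressions(baseline_metrics, current_metrics).items():
    print(f"Regression in {metric}: {delta:+.2f}")
```

With these numbers only `precision_at_5` is flagged: faithfulness improved, and the latency increase stays within the tolerance. A relative tolerance keeps one threshold usable across metrics with different scales, at the cost of being noisy for values near zero.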
Now go build something that makes research more efficient! 🚀
Questions? Open an issue in the repo or reach out to the community.
