Building Autonomous Sovereign AI: How Autoresearch Loops and Expert Fine-Tuning Create Self-Improving Local AI Systems
How to build self-improving AI systems using autoresearch loops, agent recipes, and domain-specific fine-tuning with open-source tools. A complete implementation guide connecting the latest research from Introspection, Bridgewater AIA Labs, and Thinking Machines Lab.
Daniel Kliewer
Author, Sovereign AI

Autoresearch Loops and Differentiated Intelligence
Two Converging Blueprints for Self-Improving AI Systems
Date: July 2, 2026
Introduction: The Shift from Models to Systems That Improve Themselves
Two major threads in AI research converged almost simultaneously.
On one side, Introspection's "autoresearch" framework reframes AI systems not as static models, but as self-improving loops. On the other, Thinking Machines Lab and Bridgewater AIA Labs demonstrated something more concrete: carefully trained open-weight models can outperform frontier LLMs on tasks requiring expert judgment—at lower cost and higher accuracy.
Taken together, they point to a new design principle:
The unit of intelligence is no longer the model. It is the loop.
This post synthesizes both perspectives into a single architecture for building sovereign, self-improving AI systems—systems that continuously refine their own behavior through evaluation, feedback, and fine-tuning.
Part 1: Autoresearch — When the Loop Becomes the Product
Roland Gavrilescu's framing at Introspection introduces a shift in how we think about agent systems.
1. The Loop Is the Product
Traditional AI systems are static:
Train → Deploy → Maintain
Autoresearch systems are dynamic:
Observe → Evaluate → Improve → Repeat
The key idea is that the feedback loop itself becomes the product surface.
But the hard problem isn't building loops—it's designing signals that are meaningful enough for improvement without collapsing into noisy optimization.
Cheap signals (likes, heuristics, weak metrics) lead to "slop optimization." Expensive signals (expert review, structured evals) are what actually move capability.
2. Agent Recipes: Capturing How Systems Evolve
A core concept is the agent recipe.
An agent recipe is not configuration—it is history:
- The model + harness configuration
- The evaluation suite used over time
- The human expertise embedded in the system
- The failure cases that led to new evaluations
- The decisions that shaped the system's current behavior
If you inherited a production agent system, the code alone would not explain why it behaves the way it does. The recipe captures that missing context.
It is, effectively: A versioned memory of how intelligence was shaped.
3. Inner Loop vs Outer Loop
Autoresearch systems split into two interacting systems:
Inner loop:
- Executes tasks
- Produces outputs
- Interfaces with users
Outer loop:
- Observes performance
- Identifies failure patterns
- Creates new evaluations
- Updates prompts, tools, or training data
The outer loop is where improvement happens. The inner loop is where value is delivered.
The key design challenge is ensuring the outer loop remains cost-bounded and signal-efficient, not a runaway optimization engine.
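One way to make "cost-bounded" concrete is to give the outer loop an explicit budget that every improvement step must draw from. The sketch below is only an illustration of that idea, not something taken from the Introspection material: `OuterLoopBudget`, `select_improvements`, the cost unit, and the example numbers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class OuterLoopBudget:
    """Hypothetical guard that caps what the outer loop may spend per cycle."""
    max_cost: float      # assumed unit, e.g. dollars or GPU-minutes per cycle
    spent: float = 0.0

    def try_spend(self, cost: float) -> bool:
        """Reserve `cost` if it fits in the remaining budget, otherwise refuse."""
        if cost < 0:
            raise ValueError("cost must be non-negative")
        if self.spent + cost > self.max_cost:
            return False
        self.spent += cost
        return True


def select_improvements(candidates, budget):
    """Pick candidates by estimated value per unit cost until the budget runs out.

    `candidates` is a list of (name, estimated_value, cost) tuples. The value
    estimates are assumed to come from the evaluation layer.
    """
    ranked = sorted(candidates, key=lambda c: c[1] / max(c[2], 1e-9), reverse=True)
    chosen = []
    for name, _value, cost in ranked:
        if budget.try_spend(cost):
            chosen.append(name)
    return chosen


budget = OuterLoopBudget(max_cost=10.0)
chosen = select_improvements(
    [("new_eval_suite", 5.0, 4.0), ("fine_tune_run", 9.0, 8.0), ("prompt_tweak", 1.0, 0.5)],
    budget,
)
print(chosen, budget.spent)  # ['prompt_tweak', 'new_eval_suite'] 4.5
```

The fine-tuning run is skipped here because it would exceed the cycle's budget, which is the behavior that keeps the outer loop from turning into a runaway optimizer.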
4. Humans as Tools in the Loop
A subtle but important shift:
Humans are not outside the system. They are callable components inside the loop, especially early on.
As systems accumulate examples of human decisions, they reduce their reliance on explicit queries. This mirrors apprenticeship: early heavy supervision → gradual autonomy.
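A minimal sketch of the "human as a callable component" idea follows. It assumes the model reports a confidence value and that expert answers can be stored and reused; `HumanTool`, `decide`, and the 0.9 threshold are hypothetical names and values chosen for illustration. The exact-match lookup is a stand-in for the real mechanism, where stored decisions would become training data.

```python
from typing import Callable, Dict


class HumanTool:
    """Hypothetical wrapper that treats an expert as a callable with memory."""

    def __init__(self, ask_expert: Callable[[str], str]):
        self.ask_expert = ask_expert          # e.g. a review queue (assumed)
        self.decisions: Dict[str, str] = {}   # accumulated human decisions
        self.queries = 0                      # how often the expert was asked

    def __call__(self, case: str) -> str:
        if case not in self.decisions:
            self.queries += 1
            self.decisions[case] = self.ask_expert(case)
        return self.decisions[case]


def decide(case, model_guess, confidence, human, threshold=0.9):
    """Use the model's answer when it is confident, otherwise call the human tool."""
    if confidence >= threshold:
        return model_guess
    return human(case)


human = HumanTool(ask_expert=lambda case: "relevant")
first = decide("fed minutes", "irrelevant", 0.55, human)    # asks the expert
second = decide("fed minutes", "irrelevant", 0.55, human)   # reuses the stored decision
third = decide("earnings recap", "relevant", 0.97, human)   # model answers alone
print(human.queries)  # 1
```

Tracking `queries` over time gives a simple measure of the apprenticeship curve: the count should grow more slowly as the system accumulates decisions.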
Part 2: The Expert Judgment Problem
Autoresearch loops matter because of a deeper empirical limitation in current frontier models.
Where Frontier Models Break
Bridgewater AIA Labs evaluated frontier models on six tasks involving real investment workflows:
- Financial article relevance
- Central bank document interpretation
- Boilerplate detection in research
- Email truncation detection
- Signal extraction from macroeconomic text
- General document relevance filtering
These are not reasoning-heavy tasks. They are judgment-heavy tasks. And that distinction matters.
Even with strong prompting, frontier models plateaued at roughly 78% accuracy, below the threshold required for real-world deployment in expert workflows.
The Core Limitation: Tacit Judgment
Prompts can only encode what experts can articulate. The most important judgments are often non-verbalizable.
This is where prompting stops working.
Why Fine-Tuning Wins
Fine-tuning bypasses articulation entirely. Instead of translating intuition into instructions, it learns directly from examples of decisions.
The result:
- Base model: ~44% accuracy
- With GRPO + structured training: ~73%
- Final system: ~84.7% accuracy
And critically:
- ~30% fewer errors than frontier models
- ~13.8× lower inference cost
This is not incremental improvement. It is a regime shift in how capability is produced.
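The gap between 84.7% and roughly 78% can look small until it is expressed as errors rather than accuracy. The short calculation below shows how the two figures quoted in this post relate to the "~30% fewer errors" claim. Treating the ~78% plateau as the frontier baseline for that comparison is an assumption, and `relative_error_reduction` is a helper introduced here for illustration.

```python
def relative_error_reduction(acc_new: float, acc_baseline: float) -> float:
    """Fraction of the baseline's errors that the new system removes."""
    err_new = 1.0 - acc_new
    err_baseline = 1.0 - acc_baseline
    return 1.0 - err_new / err_baseline


# Figures quoted in this post; using ~78% as the frontier baseline is an assumption.
reduction = relative_error_reduction(acc_new=0.847, acc_baseline=0.78)
print(f"{reduction:.1%}")  # 30.5%
```

In other words, the error rate drops from about 22% to about 15.3%, so roughly three in ten of the baseline's mistakes disappear.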
What Actually Mattered in Training
The gains did not come from a single trick. They came from structured system design:
- GRPO-style RL: largest jump in performance
- Interleaved batching: improves cross-task generalization
- Loss function design (CISPO): stabilizes optimization
- On-policy distillation: prevents degradation over time
- Carefully curated expert feedback loops: highest leverage factor
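For readers unfamiliar with the first item, the core of GRPO as it is commonly described is a group-relative advantage: sample several outputs for the same prompt, score each one, and normalize the scores within that group. The sketch below shows only that normalization step. It is not the training code behind the results above, and the choice of population standard deviation and the `eps` value are assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of samples drawn for the same prompt.

    Outputs that beat the group average get a positive advantage and the rest
    get a negative one. `eps` (an assumed constant) avoids division by zero
    when every sample in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to one prompt, scored 1.0 (correct) or 0.0 (wrong).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print([round(a, 3) for a in advantages])  # [1.0, -1.0, -1.0, 1.0]
```

Because the baseline is the group's own mean, no separate value model is needed, which is part of why this style of RL is practical for small, specialized models.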
But the most important bottleneck wasn't architecture—it was data quality and labeling strategy.
A key technique:
Train on cheap labels → route disagreements to experts → iterate
This turns expensive expert time into a targeted refinement signal rather than a brute-force labeling requirement.
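The routing step can be sketched as follows, assuming the cheap labels come from several inexpensive labelers (heuristics or small models) and that disagreement between them is the trigger for expert review. `route_labels`, `min_agreement`, and the toy keyword labelers are hypothetical and exist only to show the control flow.

```python
from collections import Counter


def route_labels(items, cheap_labelers, min_agreement=1.0):
    """Split items into auto-labeled pairs and items that need an expert.

    An item is auto-labeled only when the share of cheap labelers voting for
    the most common label reaches `min_agreement` (1.0 means unanimous).
    """
    auto_labeled, needs_expert = [], []
    for item in items:
        votes = Counter(labeler(item) for labeler in cheap_labelers)
        label, count = votes.most_common(1)[0]
        if count / len(cheap_labelers) >= min_agreement:
            auto_labeled.append((item, label))
        else:
            needs_expert.append(item)
    return auto_labeled, needs_expert


# Toy labelers (hypothetical): two keyword heuristics for article relevance.
def by_rates(text):
    return "relevant" if "rates" in text else "irrelevant"


def by_central_bank(text):
    return "relevant" if "central bank" in text else "irrelevant"


docs = [
    "central bank raises rates",
    "celebrity gossip roundup",
    "mortgage rates tick higher",
]
auto, review = route_labels(docs, [by_rates, by_central_bank])
print(review)  # ['mortgage rates tick higher']
```

Only the third document reaches an expert, because it is the only one the cheap labelers disagree on. Expert labels collected this way can then be added to the training set before the next iteration.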
Part 3: What This Means — The New AI Architecture Stack
When you combine autoresearch loops with fine-tuning results, a consistent architecture emerges.
1. Separate Inner and Outer Loops Explicitly
- Inner loop: fast inference, stable behavior, user-facing reliability
- Outer loop: slow optimization, experimentation, evaluation-driven updates
They must be independently constrained.
2. Treat "Recipes" as First-Class Artifacts
Agent systems should not be defined by prompts or configs. They should be defined by:
- Evaluation history
- Failure cases
- Data lineage
- Human correction traces
This is the difference between a system that works today and one that improves tomorrow.
3. Prompting Has a Ceiling
Prompt engineering works for:
- Knowledge retrieval
- Structured reasoning
- Clear rule-based tasks
It fails for:
- Tacit judgment
- Domain-specific intuition
- Expert-style filtering decisions
When the task depends on "feel," you need data, not prompts.
4. Fine-Tuning Is Not Optional for Expert Systems
If a task meets this condition: "An expert cannot fully explain how they decide," then the correct solution is:
- Not better prompting
- Not longer context windows
- But supervised + RL fine-tuning pipelines
5. Cost Efficiency Comes from Specialization
The economic advantage is structural. Smaller, specialized models:
- Beat frontier models on narrow expert tasks
- Cost an order of magnitude less
- Run locally with sovereignty guarantees
This is the foundation of differentiated intelligence.
Part 4: Sovereign AI Systems — The Practical Architecture
The implementation pattern that emerges looks like this:
Core Components
- Local inference layer
  - Ollama or similar runtime
  - Open-weight models (Qwen, Llama, Mistral)
- Agent harness
  - Task execution layer
  - Tool calling + orchestration
  - Deterministic control flow
- Evaluation system
  - Domain-specific judges
  - Failure detection logic
  - Automated regression tests
- Outer loop system
  - Logs performance over time
  - Generates new evaluations
  - Updates recipes and datasets
- Fine-tuning pipeline
  - GRPO / RL-based optimization
  - LoRA-based efficient training
  - Distillation from stronger teachers
- Knowledge layer
  - Vector database (semantic memory)
  - Knowledge graph (structured relationships)
  - Persona routing (expert specialization)
Part 5: The Key Insight — Intelligence Is Becoming Infrastructure
The convergence here is not accidental. Both lines of work point to the same shift:
Old paradigm: Intelligence = model capability
New paradigm: Intelligence = system that improves itself
The model becomes just one component in a larger feedback architecture.
The real differentiator is:
- How you collect feedback
- How you structure evaluation
- How you convert experience into training signal
- How you close the loop
Conclusion: From Models to Living Systems
The next generation of AI systems will not be defined by parameter count or context length. They will be defined by:
- How quickly they learn from failure
- How well they encode expert judgment
- How tightly feedback loops are integrated into their architecture
- How cheaply they improve over time
Autoresearch provides the system design. Fine-tuning research provides the empirical validation.
Together, they define a single direction: AI systems are becoming self-improving infrastructures for capturing and refining human expertise.
The model is no longer the product. The loop is.
Sources
- Autoresearch: The feedback loop behind self-improving agents - Latent.Space
- Learning to replicate expert judgment in financial tasks - Thinking Machines Lab
Addendum: Implementation Notes and Minimal Code Examples for a Sovereign Autoresearch System
This addendum translates the architecture described above into concrete, minimal implementations. The goal is not production completeness, but to show how the pieces actually connect: inner loop, outer loop, evaluation layer, and fine-tuning pipeline.
1. Core Idea: Everything Reduces to a Loop
At runtime, every sovereign AI system collapses into the same structure:
```python
def run_system(task):
    result = inner_loop(task)
    score = evaluate(result)
    feedback = outer_loop(task, result, score)
    update_system(feedback)
    return result
```
Everything else—agents, RAG, fine-tuning—is just implementation detail around this structure.
2. Inner Loop: Agent Execution Layer
The inner loop is the "worker." It must be stable, deterministic enough to evaluate, and cheap enough to run repeatedly.
Example: Local Agent with Ollama
```python
from ollama import chat

class InnerLoopAgent:
    def __init__(self, model="qwen2.5:7b"):
        self.model = model

    def run(self, task, context=""):
        prompt = f"""
        You are an expert system.
        Context:
        {context}
        Task:
        {task}
        Return a structured answer.
        """
        response = chat(
            model=self.model,
            messages=[{"role": "user", "content": prompt}]
        )
        return response["message"]["content"]
```
Key point: The inner loop should NOT evolve itself. It only executes.
3. Evaluators: Turning Judgment into Code
Evaluators are where "taste" becomes computable.
Example: Simple domain evaluator
```python
def relevance_evaluator(output: str, task: str) -> float:
    """
    Scores whether output matches expected domain constraints.
    In practice, this can be:
    - heuristics
    - small judge model
    - embedding similarity
    """
    keywords = ["market", "risk", "macro", "liquidity"]
    score = sum(1 for k in keywords if k in output.lower())
    return min(score / len(keywords), 1.0)
```
Better version: LLM-as-judge
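```python
import re
from ollama import chat

def parse_score(text):
    # Hypothetical helper (not defined in the original post): take the first
    # number in the judge's reply and clamp it to [0, 1]; 0.0 if none is found.
    match = re.search(r"\d*\.?\d+", text)
    return min(max(float(match.group()), 0.0), 1.0) if match else 0.0

def llm_judge(output, task, model="llama3.1:8b"):
    prompt = f"""
    Evaluate this output for correctness and relevance.
    Task:
    {task}
    Output:
    {output}
    Score from 0 to 1 with explanation.
    """
    res = chat(model=model, messages=[{"role": "user", "content": prompt}])
    return parse_score(res["message"]["content"])
```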
4. Outer Loop: Autoresearch Engine
The outer loop is the "researcher." It looks at failures and modifies the system.
Minimal implementation
```python
from collections import defaultdict

class OuterLoop:
    def __init__(self):
        self.failures = []

    def record(self, task, output, score):
        if score < 0.8:
            self.failures.append((task, output, score))

    def analyze_patterns(self):
        patterns = defaultdict(int)
        for task, output, score in self.failures:
            if "market" in output:
                patterns["market_bias"] += 1
            if len(output) < 50:
                patterns["verbosity_issue"] += 1
        return patterns
```
5. Turning Failures into New Evaluators
This is the key autoresearch step: the system adds its own tests. In this minimal version, each known failure pattern maps to a hand-written evaluator template.
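```python
def generate_new_evaluator(pattern_name):
    if pattern_name == "verbosity_issue":
        def evaluator(output, task):
            return 1.0 if len(output) > 100 else 0.0
    elif pattern_name == "market_bias":
        def evaluator(output, task):
            banned = ["guaranteed profit", "risk-free"]
            return 0.0 if any(b in output.lower() for b in banned) else 1.0
    else:
        # Fail loudly instead of returning None, which would crash later at call time.
        raise ValueError(f"No evaluator template for pattern: {pattern_name}")
    # Name the evaluator after its pattern so recipe exports stay readable.
    evaluator.__name__ = pattern_name
    return evaluator
```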
Then the outer loop injects this back into the system:
```python
class System:
    def __init__(self):
        self.evaluators = [relevance_evaluator]

    def update(self, new_eval):
        self.evaluators.append(new_eval)
```
6. Agent Recipe: The Versioned Intelligence Artifact
This is where system memory becomes structured.
```python
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class AgentRecipe:
    name: str
    model: str
    evaluators: list
    history: list = field(default_factory=list)

    def log_failure(self, task, output, score):
        self.history.append({
            "task": task,
            "output": output,
            "score": score,
            "time": datetime.now().isoformat()
        })

    def export(self):
        return {
            "name": self.name,
            "model": self.model,
            "evaluators": [e.__name__ for e in self.evaluators],
            "history": self.history
        }
```
Key idea: Recipes are not config files. They are compressed learning histories.
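If recipes are versioned artifacts, each exported snapshot needs a stable identifier. One simple option, offered here as an assumption rather than as part of the original design, is a content hash over the exported dictionary. The dictionary shape mirrors what `AgentRecipe.export()` returns above; `recipe_version` and the recipe name `macro-filter` are hypothetical.

```python
import hashlib
import json


def recipe_version(recipe_export: dict) -> str:
    """Short content hash of an exported recipe.

    Keys are sorted before hashing, so two exports with the same content get
    the same ID regardless of key order, and any change yields a new ID.
    """
    canonical = json.dumps(recipe_export, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


v1 = {
    "name": "macro-filter",
    "model": "qwen2.5:7b",
    "evaluators": ["relevance_evaluator"],
    "history": [],
}
# Same recipe after the outer loop added one evaluator.
v2 = {**v1, "evaluators": ["relevance_evaluator", "verbosity_issue"]}

print(recipe_version(v1) != recipe_version(v2))  # True
```

Storing each export under its hash gives a lightweight audit trail: every behavior change in the system can be traced to a specific recipe version.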
7. Full Autoresearch Loop (Putting It Together)
```python
class AutoresearchSystem:
    def __init__(self, agent, recipe):
        self.agent = agent
        self.recipe = recipe
        self.outer = OuterLoop()
        self.added_patterns = set()  # patterns that already produced an evaluator

    def step(self, task):
        output = self.agent.run(task)
        score = self.evaluate(output, task)
        self.outer.record(task, output, score)
        # Despite its name, log_failure records every run, so the history also holds successes.
        self.recipe.log_failure(task, output, score)
        return output, score

    def evaluate(self, output, task):
        scores = [e(output, task) for e in self.recipe.evaluators]
        return sum(scores) / len(scores)

    def improve(self):
        patterns = self.outer.analyze_patterns()
        for pattern, count in patterns.items():
            # Skip patterns handled earlier so repeated calls do not add duplicates.
            if count > 3 and pattern not in self.added_patterns:
                new_eval = generate_new_evaluator(pattern)
                self.recipe.evaluators.append(new_eval)
                self.added_patterns.add(pattern)
```
8. Fine-Tuning Hook: Closing the Loop with Learning
Once enough scored runs accumulate in the recipe history, we convert them into training data.
```python
def build_dataset(recipe):
    dataset = []
    for entry in recipe.history:
        dataset.append({
            "input": entry["task"],
            "output": entry["output"],
            "label": entry["score"]
        })
    return dataset
```
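Then fine-tune. The sketch below is plain supervised fine-tuning on full weights; a LoRA adapter would wrap the model before the loop:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def fine_tune(model_name, dataset, lr=1e-5, min_score=0.8):
    # Plain supervised sketch: full weights, one example per step, no LoRA adapters.
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    # min_score is an assumed cutoff: imitate only outputs the evaluators scored well.
    for example in (e for e in dataset if e["label"] >= min_score):
        text = f'{example["input"]}\n{example["output"]}'
        batch = tokenizer(text, return_tensors="pt", truncation=True)
        loss = model(**batch, labels=batch["input_ids"]).loss  # next-token loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
    return model
```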
9. Knowledge Graph Hook (Optional but Powerful)
To move from "memory" to "structure":
```python
import networkx as nx

class KnowledgeGraph:
    def __init__(self):
        self.graph = nx.DiGraph()

    def add_fact(self, subject, relation, obj):
        self.graph.add_edge(subject, obj, relation=relation)

    def query(self, node):
        return list(self.graph.neighbors(node))
```
Example usage:
```python
kg = KnowledgeGraph()
kg.add_fact("inflation", "impacts", "interest_rates")
kg.add_fact("interest_rates", "impacts", "equities")
```
Now reasoning becomes graph traversal instead of pure generation.
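To make "graph traversal" concrete, the sketch below answers a multi-hop question ("what does inflation ultimately affect?") with a breadth-first search over the same two facts. It is written without networkx so it runs on its own; `downstream` is a hypothetical helper, and the triples use the same (subject, relation, object) order as `add_fact` above.

```python
from collections import deque


def downstream(edges, start):
    """Return every node reachable from `start`, in breadth-first order."""
    adjacency = {}
    for subject, _relation, obj in edges:
        adjacency.setdefault(subject, []).append(obj)

    seen, order, queue = {start}, [], deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency.get(node, []):
            if neighbor not in seen:   # guards against cycles in the graph
                seen.add(neighbor)
                order.append(neighbor)
                queue.append(neighbor)
    return order


facts = [
    ("inflation", "impacts", "interest_rates"),
    ("interest_rates", "impacts", "equities"),
]
print(downstream(facts, "inflation"))  # ['interest_rates', 'equities']
```

With the networkx-backed `KnowledgeGraph`, `nx.descendants(kg.graph, "inflation")` should return the same nodes as an unordered set. The single-hop `query` method would return only `interest_rates`.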
10. The Complete System in One View
```text
agent → inner loop execution
  ↓
evaluation layer (judges)
  ↓
outer loop (failure analysis)
  ↓
recipe update (system memory)
  ↓
fine-tuning dataset generation
  ↓
model improvement
  ↓
back to agent
```
This is the full autoresearch cycle. Not a metaphor. A literal closed system.
Closing Insight
Once implemented, something important becomes visible:
Intelligence is no longer stored in the model.
It is distributed across:
- evaluation functions
- failure history
- training data generation
- update rules
- and loop structure itself
The model is just the execution substrate. The loop is where intelligence actually accumulates.
