July 2, 2026·12 min

Building Autonomous Sovereign AI: How Autoresearch Loops and Expert Fine-Tuning Create Self-Improving Local AI Systems

How to build self-improving AI systems using autoresearch loops, agent recipes, and domain-specific fine-tuning with open-source tools. A complete implementation guide connecting the latest research from Introspection, Bridgewater AIA Labs, and Thinking Machines Lab.

Daniel Kliewer

Author, Sovereign AI

autonomous-agents, sovereign-ai, autoresearch, fine-tuning, local-first, open-source, agent-recipes, reinforcement-learning, sovereign-architecture, local-llms, ollama, smolagents, langgraph, deerflow

From the Book

This is from Sovereign AI: Building Local-First Intelligent Systems.

Autoresearch Loops and Differentiated Intelligence

Two Converging Blueprints for Self-Improving AI Systems

Introduction: The Shift from Models to Systems That Improve Themselves

Two major threads in AI research converged almost simultaneously.

On one side, Introspection's "autoresearch" framework reframes AI systems not as static models, but as self-improving loops. On the other, Thinking Machines Lab and Bridgewater AIA Labs demonstrated something more concrete: carefully trained open-weight models can outperform frontier LLMs on tasks requiring expert judgment—at lower cost and higher accuracy.

Taken together, they point to a new design principle:

The unit of intelligence is no longer the model. It is the loop.

This post synthesizes both perspectives into a single architecture for building sovereign, self-improving AI systems—systems that continuously refine their own behavior through evaluation, feedback, and fine-tuning.

Part 1: Autoresearch — When the Loop Becomes the Product

Roland Gavrilescu's framing at Introspection introduces a shift in how we think about agent systems.

1. The Loop Is the Product

Traditional AI systems are static:

Train → Deploy → Maintain

Autoresearch systems are dynamic:

Observe → Evaluate → Improve → Repeat

The key idea is that the feedback loop itself becomes the product surface.

But the hard problem isn't building loops—it's designing signals that are meaningful enough for improvement without collapsing into noisy optimization.

Cheap signals (likes, heuristics, weak metrics) lead to "slop optimization." Expensive signals (expert review, structured evals) are what actually move capability.

2. Agent Recipes: Capturing How Systems Evolve

A core concept is the agent recipe.

An agent recipe is not configuration—it is history:

The model + harness configuration
The evaluation suite used over time
The human expertise embedded in the system
The failure cases that led to new evaluations
The decisions that shaped the system's current behavior

If you inherited a production agent system, the code alone would not explain why it behaves the way it does. The recipe captures that missing context.

It is, effectively: A versioned memory of how intelligence was shaped.

3. Inner Loop vs Outer Loop

Autoresearch systems split into two interacting loops:

Inner loop:

Executes tasks
Produces outputs
Interfaces with users

Outer loop:

Observes performance
Identifies failure patterns
Creates new evaluations
Updates prompts, tools, or training data

The outer loop is where improvement happens. The inner loop is where value is delivered.

The key design challenge is ensuring the outer loop remains cost-bounded and signal-efficient, not a runaway optimization engine.
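One way to keep the outer loop cost-bounded is to give it an explicit budget and rank candidate experiments by expected gain per unit cost. The sketch below is only an illustration: `OuterLoopBudget`, `run_outer_loop`, and the experiment fields are hypothetical names, and the cost and gain numbers are made up.

```python
from dataclasses import dataclass


@dataclass
class OuterLoopBudget:
    """Hypothetical spending cap for outer-loop work (evals, judge calls, retraining)."""
    max_cost: float      # total budget in whatever unit you track (dollars, GPU-minutes)
    spent: float = 0.0

    def try_spend(self, cost: float) -> bool:
        """Reserve `cost` if it fits in the remaining budget; otherwise refuse."""
        if cost < 0:
            raise ValueError("cost must be non-negative")
        if self.spent + cost > self.max_cost:
            return False
        self.spent += cost
        return True


def run_outer_loop(candidate_experiments, budget):
    """Pick experiments by expected gain per unit cost until the budget runs out.

    Assumes every experiment has a strictly positive cost.
    """
    ranked = sorted(
        candidate_experiments,
        key=lambda e: e["expected_gain"] / e["cost"],
        reverse=True,
    )
    executed = []
    for exp in ranked:
        if budget.try_spend(exp["cost"]):
            executed.append(exp["name"])   # a real system would run the experiment here
    return executed


# Made-up candidates: the expensive retraining run does not fit in this budget.
experiments = [
    {"name": "add_truncation_eval", "cost": 2.0, "expected_gain": 0.04},
    {"name": "retrain_lora", "cost": 8.0, "expected_gain": 0.06},
    {"name": "rewrite_prompt", "cost": 1.0, "expected_gain": 0.01},
]
budget = OuterLoopBudget(max_cost=5.0)
print(run_outer_loop(experiments, budget))
```

The point of the design is that the outer loop can only do as much as the budget allows, no matter how many failure patterns it finds.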

4. Humans as Tools in the Loop

A subtle but important shift:

Humans are not outside the system. They are callable components inside the loop, especially early on.

As systems accumulate examples of human decisions, they reduce their reliance on explicit queries. This mirrors apprenticeship: early heavy supervision → gradual autonomy.
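A minimal sketch of the "human as callable tool" idea, assuming the simplest possible memory: an exact-match cache of past decisions. `HumanOracle` and `ask_human` are hypothetical names; a real system would more likely use similarity search or a model trained on the recorded decisions.

```python
class HumanOracle:
    """Hypothetical wrapper that treats a human reviewer as a callable tool."""

    def __init__(self, ask_human):
        self.ask_human = ask_human   # callable: question -> decision
        self.decisions = {}          # remembered decisions, keyed by normalized question
        self.queries = 0             # how often a human was actually interrupted

    @staticmethod
    def _key(question):
        # Normalize case and whitespace so trivially different phrasings match.
        return " ".join(question.lower().split())

    def decide(self, question):
        key = self._key(question)
        if key not in self.decisions:
            # No precedent yet: fall back to the human and remember the answer.
            self.queries += 1
            self.decisions[key] = self.ask_human(question)
        return self.decisions[key]


oracle = HumanOracle(ask_human=lambda q: "relevant")   # stand-in for a real review UI
oracle.decide("Is this Fed memo relevant?")
oracle.decide("is this  fed memo relevant?")           # answered from memory
print(oracle.queries)
```

Tracking `queries` over time gives a rough measure of how much supervision the system still needs.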

Part 2: The Expert Judgment Problem

Autoresearch loops matter because of a deeper empirical limitation in current frontier models.

Where Frontier Models Break

Bridgewater AIA Labs evaluated frontier models on six tasks involving real investment workflows:

Financial article relevance
Central bank document interpretation
Boilerplate detection in research
Email truncation detection
Signal extraction from macroeconomic text
General document relevance filtering

These are not reasoning-heavy tasks. They are judgment-heavy tasks. And that distinction matters.

Even with strong prompting, frontier models plateaued at roughly 78% accuracy, below the threshold required for real-world deployment in expert workflows.

The Core Limitation: Tacit Judgment

Prompts can only encode what experts can articulate. The most important judgments are often non-verbalizable.

This is where prompting stops working.

Why Fine-Tuning Wins

Fine-tuning bypasses articulation entirely. Instead of translating intuition into instructions, it learns directly from examples of decisions.

The result:

Base model: ~44% accuracy
With GRPO + structured training: ~73%
Final system: ~84.7% accuracy

And critically:

~30% fewer errors than frontier models
~13.8× lower inference cost
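The error-reduction figure can be checked against the accuracy numbers above. This assumes the roughly 78% frontier plateau is the baseline being compared against, which the post does not state explicitly.

```python
frontier_acc = 0.78    # frontier plateau quoted above (approximate)
tuned_acc = 0.847      # final fine-tuned system

frontier_err = 1 - frontier_acc   # about 22% of decisions wrong
tuned_err = 1 - tuned_acc         # about 15.3% of decisions wrong

# Relative reduction in error rate, not the difference in accuracy points.
relative_reduction = (frontier_err - tuned_err) / frontier_err
print(f"{relative_reduction:.1%}")   # 30.5%
```

A gain of under seven accuracy points therefore removes nearly a third of the remaining errors, which is why the result matters more than the raw accuracy gap suggests.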

This is not incremental improvement. It is a regime shift in how capability is produced.

What Actually Mattered in Training

The gains did not come from a single trick. They came from structured system design:

GRPO-style RL: largest jump in performance
Interleaved batching: improves cross-task generalization
Loss function design (CISPO): stabilizes optimization
On-policy distillation: prevents degradation over time
Carefully curated expert feedback loops: highest leverage factor
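For readers unfamiliar with GRPO (Group Relative Policy Optimization): its central step, as commonly described, is to score each sampled response relative to the other responses drawn for the same prompt, instead of using a learned value function. The sketch below shows only that normalization, with made-up rewards; it is not the training setup used in the study, and the use of population standard deviation is one of several reasonable choices.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of responses sampled for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    # eps avoids division by zero when every response earned the same reward.
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for four sampled answers to one task
# (1.0 = matches the expert label, 0.0 = does not).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print([round(a, 3) for a in advantages])
```

Responses that beat their group average get a positive advantage and are reinforced; when all responses score the same, the advantages are zero and that group contributes no gradient signal.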

But the most important bottleneck wasn't architecture—it was data quality and labeling strategy.

A key technique:

Train on cheap labels → route disagreements to experts → iterate

This turns expensive expert time into a targeted refinement signal rather than a brute-force labeling requirement.
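The routing step can be sketched as follows. All names here (`route_for_expert_review`, `toy_model`, the example fields) are hypothetical, and prioritizing the most confident disagreements is an assumption about what makes a good review queue, not something the study specifies.

```python
def route_for_expert_review(examples, model_predict, budget):
    """Return up to `budget` examples where the model disagrees with the cheap label.

    The most confident disagreements come first, on the assumption that those
    are the cases where the cheap label is most likely to be wrong.
    """
    disagreements = []
    for ex in examples:
        label, confidence = model_predict(ex["text"])
        if label != ex["cheap_label"]:
            disagreements.append((confidence, ex))
    disagreements.sort(key=lambda pair: pair[0], reverse=True)
    return [ex for _, ex in disagreements[:budget]]


def toy_model(text):
    # Stand-in for a model trained on the cheap labels: returns (label, confidence).
    if "rate decision" in text:
        return "relevant", 0.9
    return "irrelevant", 0.6


examples = [
    {"id": 1, "text": "Fed rate decision preview", "cheap_label": "relevant"},
    {"id": 2, "text": "ECB rate decision recap", "cheap_label": "irrelevant"},
    {"id": 3, "text": "Office closure notice", "cheap_label": "relevant"},
    {"id": 4, "text": "Cafeteria menu", "cheap_label": "irrelevant"},
]
queue = route_for_expert_review(examples, toy_model, budget=1)
print([ex["id"] for ex in queue])
```

Experts see only the queued items; their corrected labels replace the cheap ones before the next training round.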

Part 3: What This Means — The New AI Architecture Stack

When you combine autoresearch loops with fine-tuning results, a consistent architecture emerges.

1. Separate Inner and Outer Loops Explicitly

Inner loop: fast inference, stable behavior, user-facing reliability
Outer loop: slow optimization, experimentation, evaluation-driven updates

They must be independently constrained.

2. Treat "Recipes" as First-Class Artifacts

Agent systems should not be defined by prompts or configs. They should be defined by:

Evaluation history
Failure cases
Data lineage
Human correction traces

This is the difference between a system that works today and one that improves tomorrow.

3. Prompting Has a Ceiling

Prompt engineering works for:

Knowledge retrieval
Structured reasoning
Clear rule-based tasks

It fails for:

Tacit judgment
Domain-specific intuition
Expert-style filtering decisions

When the task depends on "feel," you need data, not prompts.

4. Fine-Tuning Is Not Optional for Expert Systems

If a task meets this condition: "An expert cannot fully explain how they decide," then the correct solution is:

Not better prompting
Not longer context windows
But supervised + RL fine-tuning pipelines

5. Cost Efficiency Comes from Specialization

The economic advantage is structural. Smaller, specialized models:

Beat frontier models on narrow expert tasks
Cost an order of magnitude less
Run locally with sovereignty guarantees

This is the foundation of differentiated intelligence.

Part 4: Sovereign AI Systems — The Practical Architecture

The implementation pattern that emerges looks like this:

Core Components

Local inference layer
- Ollama or similar runtime
- Open-weight models (Qwen, Llama, Mistral)
Agent harness
- Task execution layer
- Tool calling + orchestration
- Deterministic control flow
Evaluation system
- Domain-specific judges
- Failure detection logic
- Automated regression tests
Outer loop system
- Logs performance over time
- Generates new evaluations
- Updates recipes and datasets
Fine-tuning pipeline
- GRPO / RL-based optimization
- LoRA-based efficient training
- Distillation from stronger teachers
Knowledge layer
- Vector database (semantic memory)
- Knowledge graph (structured relationships)
- Persona routing (expert specialization)

Part 5: The Key Insight — Intelligence Is Becoming Infrastructure

The convergence here is not accidental. Both lines of research point to the same shift:

Old paradigm: Intelligence = model capability

New paradigm: Intelligence = system that improves itself

The model becomes just one component in a larger feedback architecture.

The real differentiator is:

How you collect feedback
How you structure evaluation
How you convert experience into training signal
How you close the loop

Conclusion: From Models to Living Systems

The next generation of AI systems will not be defined by parameter count or context length. They will be defined by:

How quickly they learn from failure
How well they encode expert judgment
How tightly feedback loops are integrated into their architecture
How cheaply they improve over time

Autoresearch provides the system design. Fine-tuning research provides the empirical validation.

Together, they define a single direction: AI systems are becoming self-improving infrastructures for capturing and refining human expertise.

The model is no longer the product. The loop is.

Sources

Autoresearch: The feedback loop behind self-improving agents - Latent.Space
Learning to replicate expert judgment in financial tasks - Thinking Machines Lab

Addendum: Implementation Notes and Minimal Code Examples for a Sovereign Autoresearch System

This addendum translates the architecture described above into concrete, minimal implementations. The goal is not production completeness, but to show how the pieces actually connect: inner loop, outer loop, evaluation layer, and fine-tuning pipeline.

1. Core Idea: Everything Reduces to a Loop

At runtime, every sovereign AI system collapses into the same structure:

python
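def run_system(task):
    # inner_loop, evaluate, outer_loop, and update_system are placeholders;
    # the sections below sketch one possible version of each.
    result = inner_loop(task)                    # do the work
    score = evaluate(result)                     # judge the output
    feedback = outer_loop(task, result, score)   # decide what to change
    update_system(feedback)                      # apply the change
    return result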

Everything else—agents, RAG, fine-tuning—is just implementation detail around this structure.
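To see the loop actually change behavior, here is a self-contained toy version with the four steps passed in as functions. Everything in it is a stand-in: the "model" just repeats a word, the evaluator only checks length, and the `state` dictionary plays the role of the system being updated.

```python
def run_system(task, inner_loop, evaluate, outer_loop, update_system):
    result = inner_loop(task)
    score = evaluate(result)
    feedback = outer_loop(task, result, score)
    update_system(feedback)
    return result


# Toy system state that the outer loop is allowed to modify.
state = {"min_words": 1, "updates": 0}


def inner_loop(task):
    # Placeholder for a model call: emits `min_words` words.
    return " ".join(["answer"] * state["min_words"])


def evaluate(result):
    # Toy evaluator: an output passes once it has at least three words.
    return 1.0 if len(result.split()) >= 3 else 0.0


def outer_loop(task, result, score):
    # Toy failure analysis: any failing score means "make outputs longer".
    return {"lengthen": score < 1.0}


def update_system(feedback):
    if feedback["lengthen"]:
        state["min_words"] += 1
        state["updates"] += 1


for _ in range(5):
    run_system("summarize", inner_loop, evaluate, outer_loop, update_system)
print(state)   # {'min_words': 3, 'updates': 2}
```

After two failing rounds the system reaches a passing configuration and the outer loop stops changing it.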

2. Inner Loop: Agent Execution Layer

The inner loop is the "worker." It must be stable, deterministic enough to evaluate, and cheap enough to run repeatedly.

Example: Local Agent with Ollama

python
from ollama import chat

class InnerLoopAgent:
    def __init__(self, model="qwen2.5:7b"):
        self.model = model

    def run(self, task, context=""):
        prompt = f"""
        You are an expert system.
        Context:
        {context}
        Task:
        {task}
        Return a structured answer.
        """
        response = chat(
            model=self.model,
            messages=[{"role": "user", "content": prompt}]
        )
        return response["message"]["content"]

Key point: The inner loop should NOT evolve itself. It only executes.

3. Evaluators: Turning Judgment into Code

Evaluators are where "taste" becomes computable.

Example: Simple domain evaluator

python
def relevance_evaluator(output: str, task: str) -> float:
    """
    Scores whether output matches expected domain constraints.
    In practice, this can be:
    - heuristics
    - small judge model
    - embedding similarity
    """
    keywords = ["market", "risk", "macro", "liquidity"]
    score = sum(1 for k in keywords if k in output.lower())
    return min(score / len(keywords), 1.0)

Better version: LLM-as-judge

python
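import re

from ollama import chat

def parse_score(text: str) -> float:
    # Illustrative helper: take the first number in the judge's reply,
    # cap it at 1.0, and fall back to 0.0 if no number is found.
    match = re.search(r"\d*\.?\d+", text)
    return min(float(match.group()), 1.0) if match else 0.0

def llm_judge(output, task, model="llama3.1:8b"):
    prompt = f"""
    Evaluate this output for correctness and relevance.
    Task:
    {task}
    Output:
    {output}
    Reply with a score from 0 to 1 on the first line, then an explanation.
    """
    res = chat(model=model, messages=[{"role": "user", "content": prompt}])
    return parse_score(res["message"]["content"])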

4. Outer Loop: Autoresearch Engine

The outer loop is the "researcher." It looks at failures and modifies the system.

Minimal implementation

python
from collections import defaultdict

class OuterLoop:
    def __init__(self):
        self.failures = []

    def record(self, task, output, score):
        # Anything scoring below 0.8 counts as a failure worth studying.
        if score < 0.8:
            self.failures.append((task, output, score))

    def analyze_patterns(self):
        # Count crude, hand-written failure signatures across recorded failures.
        patterns = defaultdict(int)
        for task, output, score in self.failures:
            if "market" in output.lower():   # case-insensitive, like the evaluators
                patterns["market_bias"] += 1
            if len(output) < 50:             # output too short
                patterns["verbosity_issue"] += 1
        return patterns

5. Turning Failures into New Evaluators

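This is the key autoresearch step: the system adds tests in response to its own failures. In this minimal version the new evaluators come from hand-written templates keyed by failure pattern.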

python
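def generate_new_evaluator(pattern_name):
    if pattern_name == "verbosity_issue":
        def evaluator(output, task):
            # Penalize outputs too short to be a complete answer.
            return 1.0 if len(output) > 100 else 0.0
    elif pattern_name == "market_bias":
        def evaluator(output, task):
            # Penalize overconfident market claims.
            banned = ["guaranteed profit", "risk-free"]
            return 0.0 if any(b in output.lower() for b in banned) else 1.0
    else:
        # Fail loudly instead of silently returning None for unknown patterns.
        raise ValueError(f"no evaluator template for pattern: {pattern_name}")
    # Name the evaluator after its pattern so exported recipes stay readable.
    evaluator.__name__ = f"{pattern_name}_evaluator"
    return evaluator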

Then the outer loop injects this back into the system:

python
class System:
    def __init__(self):
        self.evaluators = [relevance_evaluator]

    def update(self, new_eval):
        self.evaluators.append(new_eval)

6. Agent Recipe: The Versioned Intelligence Artifact

This is where system memory becomes structured.

python
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class AgentRecipe:
    name: str
    model: str
    evaluators: list
    history: list = field(default_factory=list)

    def log_failure(self, task, output, score):
        self.history.append({
            "task": task,
            "output": output,
            "score": score,
            "time": datetime.now().isoformat()
        })

    def export(self):
        return {
            "name": self.name,
            "model": self.model,
            "evaluators": [e.__name__ for e in self.evaluators],
            "history": self.history
        }

Key idea: Recipes are not config files. They are compressed learning histories.
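One hypothetical way to make a recipe "versioned" in practice is to derive an ID from the content of each exported snapshot, so any change to the evaluators or history yields a new version. `snapshot_recipe` and the example recipe values below are illustrative assumptions; the input is a plain dictionary shaped like the output of `export()`.

```python
import hashlib
import json


def snapshot_recipe(exported: dict) -> dict:
    """Attach a content-derived version ID to an exported recipe dictionary."""
    # sort_keys makes the serialization, and therefore the hash, order-independent.
    payload = json.dumps(exported, sort_keys=True).encode("utf-8")
    version = hashlib.sha256(payload).hexdigest()[:12]
    return {"version": version, "recipe": exported}


before = snapshot_recipe({
    "name": "macro-filter",
    "model": "qwen2.5:7b",
    "evaluators": ["relevance_evaluator"],
    "history": [],
})
after = snapshot_recipe({
    "name": "macro-filter",
    "model": "qwen2.5:7b",
    "evaluators": ["relevance_evaluator", "verbosity_issue_evaluator"],
    "history": [],
})
print(before["version"] != after["version"])   # True: the recipe changed, so the ID changed
```

Storing each snapshot (for example as a JSON file named after its version) gives a simple audit trail of how the system's evaluation suite evolved.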

7. Full Autoresearch Loop (Putting It Together)

python
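class AutoresearchSystem:
    def __init__(self, agent, recipe):
        self.agent = agent
        self.recipe = recipe
        self.outer = OuterLoop()
        self.applied = set()   # patterns that already produced an evaluator

    def step(self, task):
        output = self.agent.run(task)
        score = self.evaluate(output, task)
        self.outer.record(task, output, score)
        # Despite its name, log_failure is called on every step, so the
        # recipe history holds scored successes as well as failures.
        self.recipe.log_failure(task, output, score)
        return output, score

    def evaluate(self, output, task):
        # Unweighted mean over all evaluators; 0.0 if none are registered.
        scores = [e(output, task) for e in self.recipe.evaluators]
        return sum(scores) / len(scores) if scores else 0.0

    def improve(self):
        patterns = self.outer.analyze_patterns()
        for pattern, count in patterns.items():
            # Add at most one evaluator per pattern, even if improve() runs again.
            if count > 3 and pattern not in self.applied:
                new_eval = generate_new_evaluator(pattern)
                if new_eval is not None:
                    self.recipe.evaluators.append(new_eval)
                    self.applied.add(pattern)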

8. Fine-Tuning Hook: Closing the Loop with Learning

Once enough scored examples accumulate in the recipe history, we convert them into training data.

python
def build_dataset(recipe):
    dataset = []
    for entry in recipe.history:
        dataset.append({
            "input": entry["task"],
            "output": entry["output"],
            "label": entry["score"]
        })
    return dataset

Then fine-tune (LoRA-style sketch):

python
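import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer

def fine_tune(model_name, dataset, min_score=0.8, lr=1e-4):
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    # LoRA: train small adapter matrices instead of the full weights.
    model = get_peft_model(model, LoraConfig(r=8, lora_alpha=16, task_type="CAUSAL_LM"))
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    # Assumption: plain supervised loss on high-scoring examples only, not GRPO.
    for ex in (e for e in dataset if e["label"] >= min_score):
        batch = tokenizer(ex["input"] + "\n" + ex["output"], return_tensors="pt", truncation=True)
        loss = model(**batch, labels=batch["input_ids"]).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
    return model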

9. Knowledge Graph Hook (Optional but Powerful)

To move from "memory" to "structure":

python
import networkx as nx

class KnowledgeGraph:
    def __init__(self):
        self.graph = nx.DiGraph()

    def add_fact(self, subject, relation, obj):
        self.graph.add_edge(subject, obj, relation=relation)

    def query(self, node):
        return list(self.graph.neighbors(node))

Example usage:

python
kg = KnowledgeGraph()
kg.add_fact("inflation", "impacts", "interest_rates")
kg.add_fact("interest_rates", "impacts", "equities")

Now reasoning becomes graph traversal instead of pure generation.
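As a small illustration of what "graph traversal" means here, the sketch below rebuilds the same two facts directly in `networkx` and asks two multi-hop questions. The variable names are illustrative; `nx.descendants` and `nx.shortest_path` are standard `networkx` functions.

```python
import networkx as nx

graph = nx.DiGraph()
graph.add_edge("inflation", "interest_rates", relation="impacts")
graph.add_edge("interest_rates", "equities", relation="impacts")

# Everything reachable from "inflation", however many hops away.
downstream = nx.descendants(graph, "inflation")

# The explicit chain of relations linking two concepts.
path = nx.shortest_path(graph, "inflation", "equities")
chain = [(a, graph.edges[a, b]["relation"], b) for a, b in zip(path, path[1:])]
print(chain)
```

The resulting chain can be handed to the inner loop as context, so the model explains a relationship the graph already established instead of inventing one.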

10. The Complete System in One View

text
agent → inner loop execution
            ↓
evaluation layer (judges)
            ↓
outer loop (failure analysis)
            ↓
recipe update (system memory)
            ↓
fine-tuning dataset generation
            ↓
model improvement
            ↓
back to agent

This is the full autoresearch cycle. Not a metaphor. A literal closed system.

Closing Insight

Once implemented, something important becomes visible:

Intelligence is no longer stored in the model.

It is distributed across:

evaluation functions
failure history
training data generation
update rules
and loop structure itself

The model is just the execution substrate. The loop is where intelligence actually accumulates.
