Why Teams Avoid Fine-Tuning
Fine-tuning works. It has produced real gains for narrow, stable tasks. But for production AI agents, teams keep running into the same five problems:
1. GPU costs scale with iteration speed. Every fine-tuning run requires compute. When your agent handles evolving domains — new policies, new integrations, new edge cases — you're retraining constantly. The bill grows with the frequency of change.
2. Training data doesn't exist yet. Fine-tuning needs labeled input-output pairs. For agents, that means someone manually reviewing conversation traces and labeling them as good or bad. Most teams have thousands of raw traces and zero labeled datasets.
3. Model lock-in kills flexibility. A fine-tuned GPT-4o model is a GPT-4o model. When a better model launches — or your team wants to switch providers — the fine-tuned behaviors don't transfer. You start over.
4. Debugging is guesswork. When a fine-tuned model behaves unexpectedly, there is no way to inspect which training examples caused it. Weight changes are opaque. You can't trace a specific failure back to a specific learned behavior.
5. Frontier models keep leapfrogging. A fine-tuned model is a snapshot. When the next generation of base models arrives, it often outperforms the fine-tuned version of the previous generation — out of the box. The investment depreciates with every model release.
These aren't theoretical objections. They're the operational reasons teams abandon fine-tuning pipelines after building them.
The Alternative: In-Context Learning
In-context learning improves agent behavior by changing what goes into the prompt, not what's baked into the weights. The model stays the same. The instructions get better.
This is the principle behind Kayba. Instead of modifying a model's parameters, Kayba analyzes what your agent actually did, extracts the patterns that matter, and generates improved system prompts that carry those lessons into future runs.
The key insight: most agent failures aren't about the model lacking capability. They're about the system prompt not covering the right scenarios. A support agent that forgets to verify return eligibility doesn't need weight updates. It needs an instruction that says "always check return eligibility before processing a refund."
In-context learning captures this kind of operational knowledge and injects it at test time — when the agent is actually running.
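Mechanically, this is simple enough to sketch in a few lines of Python. The snippet below is illustrative only (none of these names come from Kayba's API), but it shows the core move: learned instructions are appended to the system prompt at test time, and the model itself never changes.

```python
# Illustrative sketch: test-time learning as prompt composition.
# All names here are placeholders, not Kayba's API.

BASE_PROMPT = "You are a customer support agent for Acme Co."

# Operational knowledge extracted from past runs.
LESSONS = [
    "Always check return eligibility before processing a refund.",
    "Confirm the order ID against records before quoting shipping costs.",
]

def build_system_prompt(base: str, lessons: list[str]) -> str:
    """Inject learned instructions into the prompt; the weights stay untouched."""
    if not lessons:
        return base
    bullets = "\n".join(f"- {lesson}" for lesson in lessons)
    return f"{base}\n\nLessons from previous runs:\n{bullets}"

print(build_system_prompt(BASE_PROMPT, LESSONS))
```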
How Kayba Does It
Kayba's pipeline turns raw execution traces into improved prompts through four stages:
1. Trace Analysis
Feed your agent's execution traces into Kayba — conversation logs, action sequences, tool calls, outcomes. Any format. No labeling required. The Recursive Reflector analyzes traces via REPL-based code execution, identifying where decisions went wrong and why.
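Because traces can arrive in any format, there is no required schema. One plausible shape for a single trace record (an assumption for illustration, not a spec) looks like this:

```python
# One plausible trace record. Kayba accepts traces in any format,
# so treat this shape as an illustrative assumption, not a schema.
trace = {
    "trace_id": "t-1042",
    "messages": [
        {"role": "user", "content": "I want to return order #88231."},
        {"role": "assistant", "content": "Sure, what's your shipping address?"},
    ],
    "tool_calls": [
        {"name": "lookup_order", "args": {"order_id": "88231"}},
    ],
    "outcome": "failure",  # refund processed outside the return window
}
```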
2. Skill Extraction
Patterns are extracted as skills — concrete, human-readable strategies:
- "When processing a return, verify the order is within the return window before asking for shipping details"
- "If the user provides an order ID that doesn't match any records, ask for the email address associated with the order rather than asking them to re-enter the ID"
- "After escalating to a human agent, confirm the escalation with the user and summarize what has been discussed"
Each skill links back to the trace evidence that produced it. Nothing is a black box.
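In code, a skill can be as simple as text plus pointers to its evidence. This dataclass is a sketch under assumed field names, not Kayba's internal representation:

```python
from dataclasses import dataclass, field

# Sketch of an extracted skill: human-readable text plus the trace IDs
# that justify it. Field names are assumptions, not Kayba's schema.
@dataclass
class Skill:
    text: str                                          # the strategy itself
    evidence: list[str] = field(default_factory=list)  # supporting trace IDs

skill = Skill(
    text="When processing a return, verify the order is within the "
         "return window before asking for shipping details",
    evidence=["t-1042", "t-1101"],  # traces where skipping this check failed
)
```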
3. Skillbook Curation
Skills accumulate in a Skillbook — a structured knowledge base with helpful/harmful counters, evidence links, and organized sections. You review and approve skills through a human-in-the-loop workflow. Remove a bad skill. Edit a good one. The Skillbook is yours to control.
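A minimal sketch of that curation loop, again under assumed names: each entry carries its counters and an approval flag, and review is just editing the structure.

```python
from dataclasses import dataclass

# Sketch of Skillbook curation. Field names are assumptions.
@dataclass
class SkillEntry:
    text: str
    helpful: int = 0      # times this skill contributed to a success
    harmful: int = 0      # times this skill contributed to a failure
    approved: bool = False

skillbook: dict[str, SkillEntry] = {
    "check-return-window": SkillEntry(
        "Verify the order is within the return window first",
        helpful=7, harmful=0),
    "overeager-escalation": SkillEntry(
        "Escalate any refund over $10 to a human",
        helpful=0, harmful=4),
}

# Human-in-the-loop: approve the good skill, remove the bad one.
skillbook["check-return-window"].approved = True
del skillbook["overeager-escalation"]
```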
4. Prompt Generation
Approved skills are compiled into updated system prompts. One click. The agent's next run carries the accumulated learnings from all previous traces.
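The compile step itself amounts to filtering for approved skills and appending them as a section of the system prompt. A self-contained sketch, with the skill structure assumed as above:

```python
# Minimal compile step: approved skills become a section of the prompt.
# The skill structure is an assumption for illustration.
def compile_prompt(base_prompt: str, skills: list[dict]) -> str:
    approved = [s["text"] for s in skills if s.get("approved")]
    if not approved:
        return base_prompt
    section = "\n".join(f"- {text}" for text in approved)
    return f"{base_prompt}\n\n## Learned skills\n{section}"

print(compile_prompt(
    "You are a customer support agent for Acme Co.",
    [{"text": "Always check return eligibility before processing a refund.",
      "approved": True}],
))
```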
The cycle is continuous. New traces produce new skills. The Skillbook grows. The agent gets better over time without anyone touching model weights.
What You Keep vs. What You Lose
Switching from fine-tuning to in-context learning involves real tradeoffs. Here's an honest breakdown:
What You Keep
- Behavioral improvement — agents learn from past performance and make fewer repeated mistakes
- Domain adaptation — the agent accumulates domain-specific strategies over time
- Measurable gains — up to 2x consistency improvement on benchmarks (see results below)
What You Gain
- Model freedom — switch from GPT-4o to Claude to Gemini without losing learned behaviors. The Skillbook transfers across providers.
- Zero GPU costs — no training compute. Kayba uses LLM API calls for analysis, not GPU hours for gradient descent.
- No training data pipeline — raw traces go in, skills come out. No labeling, no formatting, no data engineering.
- Full transparency — every learned behavior is a readable skill with evidence. You can inspect, edit, or remove anything.
- Granular control — remove a single skill instead of retraining the entire model. Rollback is surgical, not wholesale.
- Continuous learning — new traces incrementally refine the Skillbook. No retraining cycles.
What You Lose
- Internalized knowledge — fine-tuning bakes patterns into the model itself, so they're always available without consuming prompt space. In-context learning spends prompt tokens instead.
- Prompt length savings — a fine-tuned model can exhibit learned behaviors with zero additional prompt tokens. A Skillbook-enhanced prompt is longer than a generic one (see the back-of-envelope estimate after this list).
- Specialized output formats — if your agent needs to produce a very specific output format (structured JSON schemas, domain-specific notation), fine-tuning on format examples can be more reliable.
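How big is that prompt cost in practice? A rough estimate, where every number is an assumption rather than a measurement:

```python
# Back-of-envelope estimate of Skillbook prompt overhead.
# Every number here is an assumption, not a measurement.
skills_in_prompt = 30      # assumed size of a mature, curated Skillbook
tokens_per_skill = 25      # assumed average length of one skill bullet
overhead_tokens = skills_in_prompt * tokens_per_skill   # 750 tokens

# At an assumed $2.50 per million input tokens:
cost_per_call = overhead_tokens / 1_000_000 * 2.50
print(f"{overhead_tokens} extra tokens, ~${cost_per_call:.4f} per call")
```

Whether that overhead matters depends on your token budget and call volume; for most agents it is small next to the conversation itself.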
For most production agent use cases, the gains far outweigh the losses. The agents that benefit most from learning aren't limited by what the model knows — they're limited by what the prompt tells them to do.
Results
Benchmark: τ2-bench (Sierra Research)
Kayba was evaluated on τ2-bench, a benchmark that tests agents on complex enterprise customer service tasks. The Recursive Reflector analyzed past runs, extracted strategies into a Skillbook, and appended them to the agent's policy.
| Metric | Baseline | With Kayba | Relative Gain |
|---|---|---|---|
| pass@1 | 41.2% | 52.5% | +27.4% |
| pass@2 | 28.3% | 44.2% | +56.2% |
| pass@3 | 22.5% | 41.2% | +83.1% |
| pass@4 | 20.0% | 40.0% | +100.0% |
The improvement scales with consistency requirements. At pass@4 — the agent must succeed four consecutive times — Kayba doubles the baseline. This is purely in-context learning. No weights were changed.
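The relative gains in the table follow directly from the raw pass rates, which you can verify in a few lines:

```python
# Recompute the table's relative gains from the raw pass rates.
results = {  # metric: (baseline %, with Kayba %)
    "pass@1": (41.2, 52.5),
    "pass@2": (28.3, 44.2),
    "pass@3": (22.5, 41.2),
    "pass@4": (20.0, 40.0),
}
for metric, (baseline, kayba) in results.items():
    gain = (kayba - baseline) / baseline * 100
    print(f"{metric}: +{gain:.1f}%")  # 27.4, 56.2, 83.1, 100.0
```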
Browser Agents (Field Results)
Browser agents navigating real websites with Kayba's Skillbook:
- Success rate: improved from 30% to 100%
- Efficiency: 82% fewer steps to complete tasks
- Cost: 65% lower token costs
The Skillbook accumulated site-specific navigation strategies, form-filling patterns, and failure recovery procedures — exactly the kind of operational knowledge that fine-tuning struggles to capture from raw interaction data.
No Fine-Tuning Was Involved
Every result above was achieved without changing model weights. The same base models, with better prompts. This is what test-time learning looks like in practice.
When You Might Still Need Fine-Tuning
In-context learning isn't a universal replacement. Fine-tuning is genuinely better when:
- You need a new output format. If the model must produce domain-specific notation, structured data in a precise schema, or outputs in a format it has never seen, fine-tuning on format examples is more reliable than prompting.
- Prompt length is a hard constraint. If your use case has strict token budgets and every token counts, internalizing behaviors via fine-tuning saves prompt space.
- The task is narrow and static. If the agent does exactly one thing and the requirements never change, fine-tuning once and deploying is simpler than maintaining a learning pipeline.
- You have clean, abundant training data. If you've already invested in labeled datasets and training infrastructure, fine-tuning leverages that investment directly.
For everyone else — teams with evolving domains, raw traces instead of labeled data, multi-provider strategies, or agents that need to keep improving after deployment — in-context learning is the faster, cheaper, more flexible path.
Getting Started
Kayba is open-source, framework-agnostic, and works with any LLM provider.
```
pip install ace-framework
```
Point it at your agent's traces. No GPUs. No training data. No model lock-in.
- Documentation — Setup guides and API reference
- GitHub — Source code and examples
- Dashboard — Hosted version with visual Skillbook management
- Kayba vs Fine-Tuning — Side-by-side comparison if you're still evaluating both approaches