The Short Answer
Both Kayba and ZeroEval market themselves as platforms for "self-improving AI agents," but they improve different things. ZeroEval focuses on making LLM-as-judge evaluations more reliable through calibration and then uses those judges to optimize prompts. Kayba analyzes agent execution traces to extract reusable skills and generate improved prompts from what actually happened.
ZeroEval improves how you evaluate agents. Kayba improves how agents behave.
What Each Tool Does
ZeroEval
ZeroEval (YC S25) provides calibrated LLM judges and automated prompt optimization:
- Calibrated judges: Tunes LLM-as-judge evaluators to reduce bias and improve scoring consistency
- Autotune: Uses judge signals to automatically optimize agent prompts
- Self-improving loop: Judges evaluate agent outputs, Autotune adjusts prompts, judges re-evaluate (sketched conceptually after this list)
- Managed service: Closed-source platform handling the eval-optimize cycle
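To make that loop concrete, here is a minimal, hypothetical sketch of a judge-driven eval-optimize cycle in Python. The function names, scoring logic, and data shapes are illustrative placeholders for the pattern described above, not ZeroEval's actual API or Autotune internals:

```python
# Conceptual sketch of a judge-then-optimize loop (illustrative only; not ZeroEval's API).
import random

def judge_score(prompt: str, example: dict) -> float:
    """Stand-in for a calibrated LLM judge scoring one agent output on a 0-1 scale."""
    return random.random()  # placeholder; a real judge would call an LLM with a rubric

def propose_prompt(base_prompt: str, round_num: int) -> str:
    """Stand-in for an optimizer that revises the prompt using judge feedback."""
    return f"{base_prompt}\n# revision {round_num}"

def optimize(base_prompt: str, dataset: list[dict], rounds: int = 5) -> str:
    best_prompt, best_score = base_prompt, 0.0
    for r in range(rounds):
        candidate = propose_prompt(best_prompt, r)
        # score every example with the judge and keep the highest-scoring prompt
        score = sum(judge_score(candidate, ex) for ex in dataset) / len(dataset)
        if score > best_score:
            best_prompt, best_score = candidate, score
    return best_prompt

print(optimize("You are a support agent.", [{"input": "refund request"}]))
```

The signal driving the search is a scalar score per output; everything else follows from how good that score is.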
ZeroEval is tackling a real problem -- LLM judges are notoriously unreliable, and better evaluation signals lead to better optimization. Their bet is that the bottleneck to agent improvement is eval quality.
Kayba
Kayba is an open-source learning layer (MIT, 2k+ stars) that synthesizes three published research papers into a unified framework:
- Recursive Reflector: REPL-based trace analysis that programmatically examines agent execution -- grounded in the ACE framework (arXiv:2510.04618) and Reflective LLM Methods (arXiv:2512.24601)
- Skill extraction: Failures and successes are distilled into atomic, reusable skills with helpful/harmful counters
- Skillbook: A persistent, transparent collection of everything the agent has learned -- organized, auditable, with provenance tracking back to source traces. Inspired by the Dynamic Cheatsheet approach (arXiv:2504.07952)
- Prompt generation: Approved skills are compiled into optimized system prompts
- Continuous learning: Delta updates refine the Skillbook incrementally as new traces come in
The framework is agent-agnostic and requires no fine-tuning -- it works by improving the context your agent receives, not by retraining weights.
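To make that pipeline concrete, here is a minimal sketch of the trace-to-skill loop in Python. Every name below is a hypothetical illustration of the pattern (reflect over individual steps, record provenance and helpful/harmful counters, compile approved skills into the system prompt), not the actual ace-framework API:

```python
# Illustrative sketch of a trace -> skills -> prompt loop (hypothetical names, not the ace-framework API).
from dataclasses import dataclass, field

@dataclass
class Skill:
    rule: str                                                   # atomic, reusable lesson
    source_trace_ids: list[str] = field(default_factory=list)  # provenance back to traces
    helpful: int = 0                                            # counter: correlated with success
    harmful: int = 0                                            # counter: correlated with failure

def reflect(trace: dict) -> list[Skill]:
    """Stand-in for the Recursive Reflector: inspect each step, not just the final output."""
    skills = []
    for step in trace["steps"]:
        if step.get("error"):
            skills.append(Skill(rule=f"Avoid {step['action']}: {step['error']}",
                                source_trace_ids=[trace["id"]]))
    return skills

def compile_prompt(base: str, skillbook: list[Skill]) -> str:
    """Compile approved skills into an optimized system prompt; delta updates would append here."""
    lessons = "\n".join(f"- {s.rule}" for s in skillbook if s.helpful >= s.harmful)
    return f"{base}\n\nLessons learned:\n{lessons}"

trace = {"id": "t-001", "steps": [
    {"action": "calling /v1/orders", "error": "user asked about invoices, not orders"},
]}
print(compile_prompt("You are a billing agent.", reflect(trace)))
```

The point of the pattern is that the learned artifact is inspectable: each line of the compiled prompt can be traced back to the execution that produced it.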
The Key Difference: Judges vs. Traces
ZeroEval's approach starts with evaluation. The premise is that if you can score agent outputs more accurately, you can optimize prompts more effectively. Calibrated judges produce better signals, and Autotune uses those signals to search for better prompts.
Kayba's approach starts with understanding. Rather than scoring outputs with a judge, the Recursive Reflector programmatically analyzes full execution traces -- the sequence of actions, tool calls, reasoning steps, and outcomes. Skills are extracted from this analysis, not from evaluation scores.
This leads to a meaningful difference in what gets learned:
| Dimension | ZeroEval | Kayba |
|---|---|---|
| Signal source | Judge scores on outputs | Trace analysis of execution |
| What it captures | "This output was good/bad" | "The agent failed because it did X instead of Y in step 3" |
| Knowledge format | Optimized prompts (via Autotune) | Skills with provenance (traceable to specific traces) |
| Granularity | Output-level evaluation | Step-level behavioral analysis |
A judge can tell you an agent's output scored 0.7. A trace analysis can tell you the agent failed because it called the wrong API endpoint after misinterpreting the user's intent in step 2 -- and that insight becomes a reusable skill.
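To illustrate that granularity gap with hypothetical data (not either product's schema), compare an output-level judge signal with a step-level trace finding:

```python
# Hypothetical examples of the two signal types.
judge_signal = {"run_id": "run-42", "score": 0.7}  # output-level: says what, not why

trace_finding = {                                  # step-level: says why, and where
    "run_id": "run-42",
    "failed_step": 2,
    "cause": "misread user intent and called the orders endpoint instead of invoices",
    "extracted_skill": "Confirm whether the user means orders or invoices before choosing an endpoint",
}
```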
Comparison
| Dimension | Kayba | ZeroEval |
|---|---|---|
| Open source | Yes, MIT license | No, closed-source |
| Core approach | Trace analysis and skill extraction | Calibrated LLM judges and prompt optimization |
| What improves | Agent behavior (via Skillbook and prompts) | Eval quality (via calibrated judges) and prompts (via Autotune) |
| Research backing | 3 published papers (ACE, RLM, Dynamic Cheatsheet) | No published research |
| Human review | Built-in -- approve, edit, or reject skills before deployment | Not documented |
| Self-hosting | Yes, run entirely on your infrastructure | No, managed service only |
| Framework dependency | Framework-agnostic (any agent, any trace format) | Integration-dependent |
| Pricing | Free (OSS) / $29/month (hosted dashboard) | Not publicly listed |
| Maturity | Production-ready, 2k+ GitHub stars, active community | Early-stage (YC S25) |
Benchmarks
Kayba's trace-based approach is validated on public benchmarks:
- τ²-bench: pass@1 improvement of +27.4%, scaling to +100% at pass@4
- Browser agents: Success rate from 30% to 100%, with 82% fewer steps and 65% lower costs
These results come from the published research papers and are reproducible with the open-source framework. ZeroEval has not published benchmark results at the time of writing.
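For readers unfamiliar with the metric, pass@k is the probability that at least one of k sampled attempts solves a task. A minimal sketch of the standard unbiased estimator (the Chen et al. 2021 definition; the numbers below are illustrative and unrelated to Kayba's reported results):

```python
# Standard pass@k estimator: n attempts sampled per task, c of them correct.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=10, c=3, k=1))  # ≈ 0.30
print(pass_at_k(n=10, c=3, k=4))  # ≈ 0.83
```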
When to Choose ZeroEval
ZeroEval may be a fit if:
- Your primary bottleneck is eval quality -- you need more reliable scoring of agent outputs before you can optimize anything
- You want a managed judge calibration service without building your own evaluation pipeline
- Output-level scoring is sufficient for your optimization needs (you don't need step-level trace analysis)
- You're comfortable with a closed-source platform and demo-led sales process
When to Choose Kayba
Kayba is the stronger choice if:
- You need to understand exactly what went wrong in an agent's execution, not just that the output scored poorly
- Auditability matters -- every improvement traces back to a specific execution, a specific failure, a specific skill
- You want to own your learning data and run on your own infrastructure
- Open source is important -- inspect the code, contribute, fork if needed
- You want research-backed methods with published, reproducible results
- You need framework-agnostic support across different agent architectures
Getting Started
Kayba is open-source and ready to use today:
`pip install ace-framework`
- Documentation -- Setup guides and API reference
- GitHub -- Source code and examples
- Dashboard -- Hosted version with visual Skillbook management