The Short Answer
Both Kayba and ZeroEval market themselves as platforms for "self-improving AI agents," but they improve different things. ZeroEval focuses on making LLM-as-judge evaluations more reliable through calibration and then uses those judges to optimize prompts. Kayba analyzes agent execution traces to extract reusable skills and generate improved prompts from what actually happened.
ZeroEval improves how you evaluate agents. Kayba improves how agents behave.
What Each Tool Does
ZeroEval
ZeroEval (YC S25) provides calibrated LLM judges and automated prompt optimization:
- Calibrated judges: Tunes LLM-as-judge evaluators to reduce bias and improve scoring consistency
- Autotune: Uses judge signals to automatically optimize agent prompts
- Self-improving loop: Judges evaluate agent outputs, Autotune adjusts prompts, judges re-evaluate (sketched conceptually after this list)
- Managed service: Closed-source platform handling the eval-optimize cycle
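To make that loop concrete, here is a minimal, hypothetical sketch of a judge-driven eval-optimize cycle in Python. The function names, scoring logic, and data shapes are illustrative placeholders for the pattern described above, not ZeroEval's actual API or Autotune internals:

```python
# Conceptual sketch of a judge-then-optimize loop (illustrative only; not ZeroEval's API).
import random

def judge_score(prompt: str, example: dict) -> float:
    """Stand-in for a calibrated LLM judge scoring one agent output on a 0-1 scale."""
    return random.random()  # placeholder; a real judge would call an LLM with a rubric

def propose_prompt(base_prompt: str, round_num: int) -> str:
    """Stand-in for an optimizer that revises the prompt using judge feedback."""
    return f"{base_prompt}\n# revision {round_num}"

def optimize(base_prompt: str, dataset: list[dict], rounds: int = 5) -> str:
    best_prompt, best_score = base_prompt, 0.0
    for r in range(rounds):
        candidate = propose_prompt(best_prompt, r)
        # score every example with the judge and keep the highest-scoring prompt
        score = sum(judge_score(candidate, ex) for ex in dataset) / len(dataset)
        if score > best_score:
            best_prompt, best_score = candidate, score
    return best_prompt

print(optimize("You are a support agent.", [{"input": "refund request"}]))
```

The signal driving the search is a scalar score per output; everything else follows from how good that score is.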
ZeroEval is tackling a real problem -- LLM judges are notoriously unreliable, and better evaluation signals lead to better optimization. Their bet is that the bottleneck to agent improvement is eval quality.
Kayba
Kayba is an open-source learning layer (MIT, 2k+ stars) that synthesizes three published research papers into a unified framework:
- Recursive Reflector: REPL-based trace analysis that programmatically examines agent execution -- grounded in the ACE framework (arXiv:2510.04618) and Reflective LLM Methods (arXiv:2512.24601)
- Skill extraction: Failures and successes are distilled into atomic, reusable skills with helpful/harmful counters
- Skillbook: A persistent, transparent collection of everything the agent has learned -- organized, auditable, with provenance tracking back to source traces. Inspired by the Dynamic Cheatsheet approach (arXiv:2504.07952)
- Prompt generation: Approved skills are compiled into optimized system prompts
- Continuous learning: Delta updates refine the Skillbook incrementally as new traces come in
The framework is agent-agnostic and requires no fine-tuning -- it works by improving the context your agent receives, not by retraining weights.
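To make that pipeline concrete, here is a minimal sketch of the trace-to-skill loop in Python. Every name below is a hypothetical illustration of the pattern (reflect over individual steps, record provenance and helpful/harmful counters, compile approved skills into the system prompt), not the actual ace-framework API:

```python
# Illustrative sketch of a trace -> skills -> prompt loop (hypothetical names, not the ace-framework API).
from dataclasses import dataclass, field

@dataclass
class Skill:
    rule: str                                                   # atomic, reusable lesson
    source_trace_ids: list[str] = field(default_factory=list)  # provenance back to traces
    helpful: int = 0                                            # counter: correlated with success
    harmful: int = 0                                            # counter: correlated with failure

def reflect(trace: dict) -> list[Skill]:
    """Stand-in for the Recursive Reflector: inspect each step, not just the final output."""
    skills = []
    for step in trace["steps"]:
        if step.get("error"):
            skills.append(Skill(rule=f"Avoid {step['action']}: {step['error']}",
                                source_trace_ids=[trace["id"]]))
    return skills

def compile_prompt(base: str, skillbook: list[Skill]) -> str:
    """Compile approved skills into an optimized system prompt; delta updates would append here."""
    lessons = "\n".join(f"- {s.rule}" for s in skillbook if s.helpful >= s.harmful)
    return f"{base}\n\nLessons learned:\n{lessons}"

trace = {"id": "t-001", "steps": [
    {"action": "calling /v1/orders", "error": "user asked about invoices, not orders"},
]}
print(compile_prompt("You are a billing agent.", reflect(trace)))
```

The point of the pattern is that the learned artifact is inspectable: each line of the compiled prompt can be traced back to the execution that produced it.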
The Key Difference: Judges vs. Traces
ZeroEval's approach starts with evaluation. The premise is that if you can score agent outputs more accurately, you can optimize prompts more effectively. Calibrated judges produce better signals, and Autotune uses those signals to search for better prompts.
Kayba's approach starts with understanding. Rather than scoring outputs with a judge, the Recursive Reflector programmatically analyzes full execution traces -- the sequence of actions, tool calls, reasoning steps, and outcomes. Skills are extracted from this analysis, not from evaluation scores.
This leads to a meaningful difference in what gets learned:
| Dimension | ZeroEval | Kayba |
|---|---|---|
| Signal source | Judge scores on outputs | Trace analysis of execution |
| What it captures | "This output was good/bad" | "The agent failed because it did X instead of Y in step 3" |
| Knowledge format | Optimized prompts (via Autotune) | Skills with provenance (traceable to specific traces) |
| Granularity | Output-level evaluation | Step-level behavioral analysis |
A judge can tell you an agent's output scored 0.7. A trace analysis can tell you the agent failed because it called the wrong API endpoint after misinterpreting the user's intent in step 2 -- and that insight becomes a reusable skill.
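To illustrate that granularity gap with hypothetical data (not either product's schema), compare an output-level judge signal with a step-level trace finding:

```python
# Hypothetical examples of the two signal types.
judge_signal = {"run_id": "run-42", "score": 0.7}  # output-level: says what, not why

trace_finding = {                                  # step-level: says why, and where
    "run_id": "run-42",
    "failed_step": 2,
    "cause": "misread user intent and called the orders endpoint instead of invoices",
    "extracted_skill": "Confirm whether the user means orders or invoices before choosing an endpoint",
}
```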
Comparison
| Dimension | Kayba | ZeroEval |
|---|---|---|
| Open source | Yes, MIT license | No, closed-source |
| Core approach | Trace analysis and skill extraction | Calibrated LLM judges and prompt optimization |
| What improves | Agent behavior (via Skillbook and prompts) | Eval quality (via calibrated judges) and prompts (via Autotune) |
| Research backing | 3 published papers (ACE, RLM, Dynamic Cheatsheet) | No published research |
| Human review | Built-in -- approve, edit, or reject skills before deployment | Not documented |
| Self-hosting | Yes, run entirely on your infrastructure | No, managed service only |
| Framework dependency | Framework-agnostic (any agent, any trace format) | Integration-dependent |
| Pricing | Free (OSS) / $29/month (hosted dashboard) | Not publicly listed |
| Maturity | Production-ready, 2k+ GitHub stars, active community | Early-stage (YC S25) |
Benchmarks
Kayba's trace-based approach is validated on public benchmarks:
- τ²-bench: pass@1 improvement of +27.4%, scaling to +100% at pass@4
- Browser agents: Success rate from 30% to 100%, with 82% fewer steps and 65% lower costs
These results come from the published research papers and are reproducible with the open-source framework. ZeroEval has not published benchmark results at the time of writing.
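For readers unfamiliar with the metric, pass@k is the probability that at least one of k sampled attempts solves a task. A minimal sketch of the standard unbiased estimator (the Chen et al. 2021 definition; the numbers below are illustrative and unrelated to Kayba's reported results):

```python
# Standard pass@k estimator: n attempts sampled per task, c of them correct.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=10, c=3, k=1))  # ≈ 0.30
print(pass_at_k(n=10, c=3, k=4))  # ≈ 0.83
```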
When to Choose ZeroEval
ZeroEval may be a fit if:
- Your primary bottleneck is eval quality -- you need more reliable scoring of agent outputs before you can optimize anything
- You want a managed judge calibration service without building your own evaluation pipeline
- Output-level scoring is sufficient for your optimization needs (you don't need step-level trace analysis)
- You're comfortable with a closed-source platform and demo-led sales process
When to Choose Kayba
Kayba is the stronger choice if:
- You need to understand exactly what went wrong in an agent's execution, not just that the output scored poorly
- Auditability matters -- every improvement traces back to a specific execution, a specific failure, a specific skill
- You want to own your learning data and run on your own infrastructure
- Open source is important -- inspect the code, contribute, fork if needed
- You want research-backed methods with published, reproducible results
- You need framework-agnostic support across different agent architectures
Getting Started
Kayba is open-source and ready to use today:
`pip install ace-framework`
- Documentation -- Setup guides and API reference
- GitHub -- Source code and examples
- Dashboard -- Hosted version with visual Skillbook management