
Kayba vs ZeroEval

Compare Kayba's trace-based agent learning with ZeroEval's LLM judge optimization: open-source Skillbook learning versus calibrated-judge Autotune.

March 11, 2026
Comparison · ZeroEval · LLM Judges · Self-Improving Agents

The Short Answer

Both Kayba and ZeroEval market themselves as "self-improving AI agents," but they improve different things. ZeroEval focuses on making LLM-as-judge evaluations more reliable through calibration and then uses those judges to optimize prompts. Kayba analyzes agent execution traces to extract reusable skills and generate improved prompts from what actually happened.

ZeroEval improves how you evaluate agents. Kayba improves how agents behave.

What Each Tool Does

ZeroEval

ZeroEval (YC S25) provides calibrated LLM judges and automated prompt optimization:

  • Calibrated judges: Tunes LLM-as-judge evaluators to reduce bias and improve scoring consistency
  • Autotune: Uses judge signals to automatically optimize agent prompts
  • Self-improving loop: Judges evaluate agent outputs, Autotune adjusts prompts, judges re-evaluate
  • Managed service: Closed-source platform handling the eval-optimize cycle

ZeroEval is tackling a real problem -- LLM judges are notoriously unreliable, and better evaluation signals lead to better optimization. Their bet is that the bottleneck to agent improvement is eval quality.
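ZeroEval's calibration method is not public, but the general idea behind judge calibration can be sketched in a few lines: collect a handful of human-labeled examples, measure how the LLM judge's raw scores deviate from them, and fit a correction. The linear fit below is a generic illustration of that idea, not ZeroEval's actual algorithm.

```python
# Generic illustration of judge calibration (NOT ZeroEval's method):
# fit a simple linear correction so raw judge scores track human labels.

def fit_linear_calibration(judge_scores, human_labels):
    """Least-squares fit of human = a * judge + b over paired examples."""
    n = len(judge_scores)
    mean_j = sum(judge_scores) / n
    mean_h = sum(human_labels) / n
    cov = sum((j - mean_j) * (h - mean_h)
              for j, h in zip(judge_scores, human_labels))
    var = sum((j - mean_j) ** 2 for j in judge_scores)
    a = cov / var if var else 0.0
    b = mean_h - a * mean_j
    return a, b

def calibrate(score, a, b):
    # Apply the correction, clamped to the judge's [0, 1] scoring range.
    return max(0.0, min(1.0, a * score + b))

# Example: a judge that systematically over-scores by about 0.2.
judge = [0.9, 0.8, 0.7, 0.5]
human = [0.7, 0.6, 0.5, 0.3]
a, b = fit_linear_calibration(judge, human)
```

Real systems use richer corrections (per-rubric bias terms, consistency checks across repeated judgments), but the shape is the same: a small amount of trusted ground truth anchors a cheap, scalable scorer.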

Kayba

Kayba is an open-source learning layer (MIT, 2k+ stars) that synthesizes three published research papers into a unified framework:

  • Recursive Reflector: REPL-based trace analysis that programmatically examines agent execution -- grounded in the ACE framework (arXiv:2510.04618) and Reflective LLM Methods (arXiv:2512.24601)
  • Skill extraction: Failures and successes are distilled into atomic, reusable skills with helpful/harmful counters
  • Skillbook: A persistent, transparent collection of everything the agent has learned -- organized, auditable, with provenance tracking back to source traces. Inspired by the Dynamic Cheatsheet approach (arXiv:2504.07952)
  • Prompt generation: Approved skills are compiled into optimized system prompts
  • Continuous learning: Delta updates refine the Skillbook incrementally as new traces come in

The framework is agent-agnostic and requires no fine-tuning -- it works by improving the context your agent receives, not by retraining weights.
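To make the skill/Skillbook concepts concrete, here is a minimal sketch of what a skill record with helpful/harmful counters and trace provenance might look like. The field names and `to_prompt` compilation step are illustrative assumptions, not Kayba's actual schema or API.

```python
# Hypothetical sketch of a Skillbook (field names are assumptions,
# not Kayba's real schema).
from dataclasses import dataclass, field

@dataclass
class Skill:
    text: str            # the atomic, reusable lesson
    source_trace: str    # provenance: which execution it was learned from
    helpful: int = 0     # times applying this skill helped
    harmful: int = 0     # times applying this skill hurt

@dataclass
class Skillbook:
    skills: list = field(default_factory=list)

    def add(self, skill: Skill) -> None:
        self.skills.append(skill)

    def to_prompt(self, min_net: int = 1) -> str:
        """Compile net-helpful skills into a system-prompt section."""
        kept = [s for s in self.skills if s.helpful - s.harmful >= min_net]
        return "\n".join(f"- {s.text} (learned from {s.source_trace})"
                         for s in kept)

book = Skillbook()
book.add(Skill("Confirm the API endpoint matches the user's stated intent "
               "before calling it.", source_trace="trace-0042", helpful=3))
```

The point of the counters is that skills are falsifiable: a skill that keeps getting contradicted by new traces accumulates harmful marks and drops out of the compiled prompt, rather than lingering as stale advice.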

The Key Difference: Judges vs. Traces

ZeroEval's approach starts with evaluation. The premise is that if you can score agent outputs more accurately, you can optimize prompts more effectively. Calibrated judges produce better signals, and Autotune uses those signals to search for better prompts.

Kayba's approach starts with understanding. Rather than scoring outputs with a judge, the Recursive Reflector programmatically analyzes full execution traces -- the sequence of actions, tool calls, reasoning steps, and outcomes. Skills are extracted from this analysis, not from evaluation scores.

This leads to a meaningful difference in what gets learned:

|                  | ZeroEval                          | Kayba                                                        |
|------------------|-----------------------------------|--------------------------------------------------------------|
| Signal source    | Judge scores on outputs           | Trace analysis of execution                                  |
| What it captures | "This output was good/bad"        | "The agent failed because it did X instead of Y in step 3"   |
| Knowledge format | Optimized prompts (via Autotune)  | Skills with provenance (traceable to specific traces)        |
| Granularity      | Output-level evaluation           | Step-level behavioral analysis                               |

A judge can tell you an agent's output scored 0.7. A trace analysis can tell you the agent failed because it called the wrong API endpoint after misinterpreting the user's intent in step 2 -- and that insight becomes a reusable skill.
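The difference is easy to see on a toy trace. Below, the same run is summarized two ways: a judge-style scalar and a trace-style diagnosis. The data shapes are assumptions for illustration, not either product's API.

```python
# Illustrative contrast (assumed data shapes, not either product's API):
# a judge reduces a run to one number; trace analysis pinpoints the step
# where behavior diverged.

trace = [
    {"step": 1, "action": "parse_request", "ok": True},
    {"step": 2, "action": "call_api", "tool": "/v1/orders", "ok": False,
     "note": "user asked about refunds, not orders"},
    {"step": 3, "action": "respond", "ok": False},
]

def judge_view(trace):
    # Output-level signal: one scalar for the whole run.
    return sum(s["ok"] for s in trace) / len(trace)

def trace_view(trace):
    # Step-level signal: the first failing step, with its context.
    bad = next(s for s in trace if not s["ok"])
    return f"step {bad['step']}: {bad['action']} -> {bad.get('note', 'failed')}"
```

`judge_view` tells you the run scored poorly; `trace_view` tells you which step to fix, which is the raw material a reusable skill can be distilled from.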

Comparison

| Dimension            | Kayba                                           | ZeroEval                                                  |
|----------------------|-------------------------------------------------|-----------------------------------------------------------|
| Open source          | Yes, MIT license                                | No, closed-source                                         |
| Core approach        | Trace analysis and skill extraction             | Calibrated LLM judges and prompt optimization             |
| What improves        | Agent behavior (via Skillbook and prompts)      | Eval quality (via calibrated judges) and prompts (via Autotune) |
| Research backing     | 3 published papers (ACE, RLM, Dynamic Cheatsheet) | No published research                                   |
| Human review         | Built-in -- approve, edit, or reject skills before deployment | Not documented                              |
| Self-hosting         | Yes, run entirely on your infrastructure        | No, managed service only                                  |
| Framework dependency | Framework-agnostic (any agent, any trace format) | Integration-dependent                                    |
| Pricing              | Free (OSS) / $29/month (hosted dashboard)       | Not publicly listed                                       |
| Maturity             | Production-ready, 2k+ GitHub stars, active community | Early-stage (YC S25)                                 |

Benchmarks

Kayba's trace-based approach is validated on public benchmarks:

  • t2-bench: pass@1 improvement of +27.4%, scaling to +100% at pass@4
  • Browser agents: Success rate from 30% to 100%, with 82% fewer steps and 65% lower costs

These results come from the published research papers and are reproducible with the open-source framework. ZeroEval has not published benchmark results at the time of writing.

When to Choose ZeroEval

ZeroEval may be a fit if:

  • Your primary bottleneck is eval quality -- you need more reliable scoring of agent outputs before you can optimize anything
  • You want a managed judge calibration service without building your own evaluation pipeline
  • Output-level scoring is sufficient for your optimization needs (you don't need step-level trace analysis)
  • You're comfortable with a closed-source platform and demo-led sales process

When to Choose Kayba

Kayba is the stronger choice if:

  • You need to understand exactly what went wrong in an agent's execution, not just that the output scored poorly
  • Auditability matters -- every improvement traces back to a specific execution, a specific failure, a specific skill
  • You want to own your learning data and run on your own infrastructure
  • Open source is important -- inspect the code, contribute, fork if needed
  • You want research-backed methods with published, reproducible results
  • You need framework-agnostic support across different agent architectures

Getting Started

Kayba is open-source and ready to use today:

`pip install ace-framework`
  • Documentation -- Setup guides and API reference
  • GitHub -- Source code and examples
  • Dashboard -- Hosted version with visual Skillbook management