The Short Answer
Opik is an LLM evaluation and observability platform that also offers automated prompt optimization via GEPA (Genetic Evolution for Prompt Adaptation). Kayba is a learning layer that analyzes agent traces, extracts reusable skills into a Skillbook, and generates improved prompts from what the agent actually experienced.
Opik optimizes prompts by searching for better variants. Kayba learns from what went wrong (and right) in real agent executions.
Both are open-source. Both improve agent performance. The difference is in how: evolutionary search vs reflective learning from experience.
What Each Tool Does
Opik
Opik, built by Comet (Series B, $50M funding), provides:
- Tracing: Log and visualize LLM calls, tool use, and agent steps
- Evaluation: Run automated evals with built-in and custom metrics
- GEPA: Genetic Evolution for Prompt Adaptation — an optimization algorithm that mutates prompt variants, evaluates them against a dataset, and selects the best-performing version
- Optimization algorithms: 4 optimization strategies beyond GEPA for different use cases
- Experiment tracking: Compare prompt versions, model configurations, and parameter sweeps
- Monitoring: Production dashboards for cost, latency, and quality metrics
It's a well-funded, feature-rich platform with strong evaluation tooling and a genuinely novel approach to prompt optimization.
Kayba
Kayba provides:
- Trace analysis: The Recursive Reflector (grounded in RLM, arXiv:2512.24601) programmatically analyzes agent execution traces via REPL-based code execution — identifying failure patterns, root causes, and successful strategies
- Skill extraction: Failures and successes are distilled into atomic, reusable skills with helpful/harmful counters and provenance tracking
- Skillbook: A persistent, transparent collection of everything the agent has learned — organized, auditable, and human-reviewable
- Prompt generation: Approved skills are compiled into optimized system prompts using the Dynamic Cheatsheet pattern (arXiv:2504.07952)
- Continuous learning: Delta updates refine the Skillbook incrementally as new traces arrive
Kayba synthesizes three research threads — ACE (arXiv:2510.04618), RLM, and Dynamic Cheatsheet — into a single open-source framework (MIT, 2k+ stars).
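The delta-update step above can be pictured as a small merge function. This is an illustrative sketch only: the delta format (`add` / `reinforce` / `penalize` keys) and the counter fields are assumptions for illustration, not Kayba's actual schema or API.

```python
def apply_delta(skillbook: dict, delta: dict) -> dict:
    """Merge one analysis cycle's findings into an existing skillbook.

    Hypothetical delta format:
      {"add": {name: skill}, "reinforce": [names], "penalize": [names]}
    """
    # New skills discovered in the latest traces (keep existing entries).
    for name, skill in delta.get("add", {}).items():
        skillbook.setdefault(name, skill)
    # Skills that helped again in this batch of traces.
    for name in delta.get("reinforce", []):
        if name in skillbook:
            skillbook[name]["helpful"] += 1
    # Skills that backfired in this batch.
    for name in delta.get("penalize", []):
        if name in skillbook:
            skillbook[name]["harmful"] += 1
    return skillbook
```

Because updates are incremental, the skillbook accumulates evidence across cycles rather than being rebuilt from scratch each time.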
The Key Difference: Optimization vs Learning
This is the core distinction, and it matters more than it might seem at first.
Opik's GEPA approach is evolutionary. It takes a prompt, generates mutations (rewording, restructuring, adding instructions), evaluates each variant against a dataset, and selects the fittest. It's search over prompt space — powerful, but the resulting prompt is a black box. You get a better prompt; you don't necessarily know why it's better or what specific behaviors changed.
Kayba's approach is reflective. It reads actual agent execution traces, identifies what went wrong and what went right, and extracts those findings as discrete, named skills. Each skill has a description, provenance (which trace it came from), and a helpful/harmful score. You can read, edit, approve, or reject any skill before it enters the agent's prompt.
| Aspect | Opik (GEPA) | Kayba (Recursive Reflector) |
|---|---|---|
| Method | Evolutionary search over prompt variants | Trace analysis and skill extraction |
| Input | A prompt + evaluation dataset | Real agent execution traces |
| Process | Mutate prompts, evaluate, select fittest | Analyze traces, extract skills, generate prompts |
| Output | An optimized prompt | A Skillbook of discrete skills + generated prompt |
| Transparency | Improved prompt (hard to diff meaningfully) | Each skill is named, described, and traceable to its source |
| Requires | Evaluation dataset with ground truth | Agent traces (production or test) |
| Learning signal | Eval metric scores | What actually happened in agent execution |
Think of it this way: GEPA is like A/B testing thousands of prompt variants at once. Kayba is like a senior engineer reviewing agent logs, writing down lessons learned, and updating the runbook.
Comparison
| Dimension | Opik | Kayba |
|---|---|---|
| Primary function | Evaluation platform + prompt optimization | Learning layer + prompt improvement |
| Optimization method | GEPA (evolutionary) + 4 other algorithms | Recursive Reflector + Skillbook + Dynamic Cheatsheet |
| Trace handling | Log, visualize, evaluate | Analyze, extract skills, learn |
| Transparency | Optimized prompt as output | Individual skills with provenance, human review step |
| Eval dataset required | Yes, for optimization | No — learns from execution traces directly |
| Framework dependency | Framework-agnostic | Framework-agnostic |
| Open source | Yes (Apache 2.0) | Yes (MIT, 2k+ stars) |
| Backing | Comet, Series B ($50M) | ETH AI Center, ETH Zurich, University of St. Gallen |
| Pricing | Free (OSS) / cloud tiers | Free (OSS) / $29/month (hosted dashboard) |
| Benchmark results | Not published for agent tasks | +27.4% (pass@1) to +100% (pass@4) on tau2-bench |
When Opik Is the Better Choice
Opik is stronger if:
- You already have well-defined evaluation datasets with ground truth and want to maximize a specific metric
- You need a full evaluation and monitoring platform, not just the learning step
- Your bottleneck is prompt wording rather than agent behavior patterns — GEPA excels at finding optimal phrasing
- You want multiple optimization strategies to choose from (GEPA plus 4 additional algorithms)
- You're already in the Comet ecosystem and want unified experiment tracking
When Kayba Is the Better Choice
Kayba is stronger if:
- You don't have a labeled evaluation dataset and want to learn directly from production traces
- You need to understand what your agent learned, not just that it improved — the Skillbook makes every behavioral change auditable
- Your agent operates in complex, multi-step workflows where failures are subtle and context-dependent
- You want a human-in-the-loop review step before changes reach production
- You care about compounding improvement — the Skillbook grows over time, and each new analysis cycle builds on previous learning
- You're running browser agents or multi-tool agents, where Kayba has demonstrated strong results (success rate up from 30% to 100%, 82% fewer steps, 65% lower costs in browser agent benchmarks)
Using Them Together
Opik and Kayba are genuinely complementary — they operate at different levels.
- Opik traces and evaluates your agent in production, giving you monitoring and quality metrics
- Kayba analyzes those traces (or traces from any source), extracts skills, and builds a Skillbook
- Review the extracted skills in Kayba's dashboard — approve, edit, or reject
- Generate an improved prompt from the approved Skillbook
- Opik evals verify the new prompt performs better before deployment
- Optionally, run GEPA on top of Kayba's generated prompt to further optimize phrasing
Opik tells you how well your agent performs and can optimize prompt wording. Kayba identifies what the agent should learn and makes that knowledge explicit. Together: evaluate, learn, optimize, verify.
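The combined workflow above can be summarized as one improvement cycle. Every callable here is a stand-in for the corresponding tool step (none of these are real Opik or Kayba function names), gating deployment on the eval score as described.

```python
def improvement_cycle(traces, extract_skills, human_review, compile_prompt,
                      evaluate, baseline_score):
    """One Kayba-then-Opik cycle: learn, review, generate, verify (sketch)."""
    candidate_skills = extract_skills(traces)          # Kayba: trace analysis
    approved = [s for s in candidate_skills if human_review(s)]
    new_prompt = compile_prompt(approved)              # Kayba: prompt generation
    score = evaluate(new_prompt)                       # Opik: eval before deploy
    # Only ship the new prompt if it beats the current baseline.
    return new_prompt if score > baseline_score else None
```

The eval gate at the end is what makes the loop safe to run continuously: a regression in the generated prompt simply fails to deploy.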
Getting Started
Kayba is open-source and works with traces from any source — including Opik exports.
```shell
pip install ace-framework
```
- Documentation -- Setup guides and API reference
- GitHub -- Source code and examples
- Dashboard -- Hosted version with visual Skillbook management