The Short Answer
Opik is an LLM evaluation and observability platform that also offers automated prompt optimization via GEPA (Genetic Evolution for Prompt Adaptation). Kayba is a learning layer that analyzes agent traces, extracts reusable skills into a Skillbook, and generates improved prompts from what the agent actually experienced.
Opik optimizes prompts by searching for better variants. Kayba learns from what went wrong (and right) in real agent executions.
Both are open-source. Both improve agent performance. The difference is in how: evolutionary search vs reflective learning from experience.
What Each Tool Does
Opik
Opik, built by Comet (Series B, $50M funding), provides:
- Tracing: Log and visualize LLM calls, tool use, and agent steps
- Evaluation: Run automated evals with built-in and custom metrics
- GEPA: Genetic Evolution for Prompt Adaptation — an optimization algorithm that mutates prompt variants, evaluates them against a dataset, and selects the best-performing version
- Optimization algorithms: 4 optimization strategies beyond GEPA for different use cases
- Experiment tracking: Compare prompt versions, model configurations, and parameter sweeps
- Monitoring: Production dashboards for cost, latency, and quality metrics
It's a well-funded, feature-rich platform with strong evaluation tooling and a genuinely novel approach to prompt optimization.
Kayba
Kayba provides:
- Trace analysis: The Recursive Reflector (grounded in RLM, arXiv:2512.24601) programmatically analyzes agent execution traces via REPL-based code execution — identifying failure patterns, root causes, and successful strategies
- Skill extraction: Failures and successes are distilled into atomic, reusable skills with helpful/harmful counters and provenance tracking
- Skillbook: A persistent, transparent collection of everything the agent has learned — organized, auditable, and human-reviewable
- Prompt generation: Approved skills are compiled into optimized system prompts using the Dynamic Cheatsheet pattern (arXiv:2504.07952)
- Continuous learning: Delta updates refine the Skillbook incrementally as new traces arrive
Kayba synthesizes three research threads — ACE (arXiv:2510.04618), RLM, and Dynamic Cheatsheet — into a single open-source framework (MIT, 2k+ stars).
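The delta-update step above can be pictured as a small merge function. This is an illustrative sketch only: the delta format (`add` / `reinforce` / `penalize` keys) and the counter fields are assumptions for illustration, not Kayba's actual schema or API.

```python
def apply_delta(skillbook: dict, delta: dict) -> dict:
    """Merge one analysis cycle's findings into an existing skillbook.

    Hypothetical delta format:
      {"add": {name: skill}, "reinforce": [names], "penalize": [names]}
    """
    # New skills discovered in the latest traces (keep existing entries).
    for name, skill in delta.get("add", {}).items():
        skillbook.setdefault(name, skill)
    # Skills that helped again in this batch of traces.
    for name in delta.get("reinforce", []):
        if name in skillbook:
            skillbook[name]["helpful"] += 1
    # Skills that backfired in this batch.
    for name in delta.get("penalize", []):
        if name in skillbook:
            skillbook[name]["harmful"] += 1
    return skillbook
```

Because updates are incremental, the skillbook accumulates evidence across cycles rather than being rebuilt from scratch each time.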
The Key Difference: Optimization vs Learning
This is the core distinction, and it matters more than it might seem at first.
Opik's GEPA approach is evolutionary. It takes a prompt, generates mutations (rewording, restructuring, adding instructions), evaluates each variant against a dataset, and selects the fittest. It's search over prompt space — powerful, but the resulting prompt is a black box. You get a better prompt; you don't necessarily know why it's better or what specific behaviors changed.
Kayba's approach is reflective. It reads actual agent execution traces, identifies what went wrong and what went right, and extracts those findings as discrete, named skills. Each skill has a description, provenance (which trace it came from), and a helpful/harmful score. You can read, edit, approve, or reject any skill before it enters the agent's prompt.
| Aspect | Opik (GEPA) | Kayba (Recursive Reflector) |
|---|---|---|
| Method | Evolutionary search over prompt variants | Trace analysis and skill extraction |
| Input | A prompt + evaluation dataset | Real agent execution traces |
| Process | Mutate prompts, evaluate, select fittest | Analyze traces, extract skills, generate prompts |
| Output | An optimized prompt | A Skillbook of discrete skills + generated prompt |
| Transparency | Improved prompt (hard to diff meaningfully) | Each skill is named, described, and traceable to its source |
| Requires | Evaluation dataset with ground truth | Agent traces (production or test) |
| Learning signal | Eval metric scores | What actually happened in agent execution |
Think of it this way: GEPA is like A/B testing thousands of prompt variants at once. Kayba is like a senior engineer reviewing agent logs, writing down lessons learned, and updating the runbook.
Comparison
| Dimension | Opik | Kayba |
|---|---|---|
| Primary function | Evaluation platform + prompt optimization | Learning layer + prompt improvement |
| Optimization method | GEPA (evolutionary) + 4 other algorithms | Recursive Reflector + Skillbook + Dynamic Cheatsheet |
| Trace handling | Log, visualize, evaluate | Analyze, extract skills, learn |
| Transparency | Optimized prompt as output | Individual skills with provenance, human review step |
| Eval dataset required | Yes, for optimization | No — learns from execution traces directly |
| Framework dependency | Framework-agnostic | Framework-agnostic |
| Open source | Yes (Apache 2.0) | Yes (MIT, 2k+ stars) |
| Backing | Comet, Series B ($50M) | ETH AI Center, ETH Zurich, University of St. Gallen |
| Pricing | Free (OSS) / cloud tiers | Free (OSS) / $29/month (hosted dashboard) |
| Benchmark results | Not published for agent tasks | +27.4% (pass@1) to +100% (pass@4) on tau2-bench |
When Opik Is the Better Choice
Opik is stronger if:
- You already have well-defined evaluation datasets with ground truth and want to maximize a specific metric
- You need a full evaluation and monitoring platform, not just the learning step
- Your bottleneck is prompt wording rather than agent behavior patterns — GEPA excels at finding optimal phrasing
- You want multiple optimization strategies to choose from (GEPA plus 4 additional algorithms)
- You're already in the Comet ecosystem and want unified experiment tracking
When Kayba Is the Better Choice
Kayba is stronger if:
- You don't have a labeled evaluation dataset and want to learn directly from production traces
- You need to understand what your agent learned, not just that it improved — the Skillbook makes every behavioral change auditable
- Your agent operates in complex, multi-step workflows where failures are subtle and context-dependent
- You want a human-in-the-loop review step before changes reach production
- You care about compounding improvement — the Skillbook grows over time, and each new analysis cycle builds on previous learning
- You're running browser agents or multi-tool agents, where Kayba has demonstrated strong results (success rate up from 30% to 100%, 82% fewer steps, 65% lower costs in browser agent benchmarks)
Using Them Together
Opik and Kayba are genuinely complementary — they operate at different levels.
- Opik traces and evaluates your agent in production, giving you monitoring and quality metrics
- Kayba analyzes those traces (or traces from any source), extracts skills, and builds a Skillbook
- Review the extracted skills in Kayba's dashboard — approve, edit, or reject
- Generate an improved prompt from the approved Skillbook
- Opik evals verify the new prompt performs better before deployment
- Optionally, run GEPA on top of Kayba's generated prompt to further optimize phrasing
Opik tells you how well your agent performs and can optimize prompt wording. Kayba identifies what the agent should learn and makes that knowledge explicit. Together: evaluate, learn, optimize, verify.
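The combined workflow above can be summarized as one improvement cycle. Every callable here is a stand-in for the corresponding tool step (none of these are real Opik or Kayba function names), gating deployment on the eval score as described.

```python
def improvement_cycle(traces, extract_skills, human_review, compile_prompt,
                      evaluate, baseline_score):
    """One Kayba-then-Opik cycle: learn, review, generate, verify (sketch)."""
    candidate_skills = extract_skills(traces)          # Kayba: trace analysis
    approved = [s for s in candidate_skills if human_review(s)]
    new_prompt = compile_prompt(approved)              # Kayba: prompt generation
    score = evaluate(new_prompt)                       # Opik: eval before deploy
    # Only ship the new prompt if it beats the current baseline.
    return new_prompt if score > baseline_score else None
```

The eval gate at the end is what makes the loop safe to run continuously: a regression in the generated prompt simply fails to deploy.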
Getting Started
Kayba is open-source and works with traces from any source — including Opik exports.
```shell
pip install ace-framework
```
- Documentation -- Setup guides and API reference
- GitHub -- Source code and examples
- Dashboard -- Hosted version with visual Skillbook management