## The Short Answer
Both Kayba and Lemma aim to automatically improve AI agent performance over time. The fundamental difference is in approach and transparency: Kayba is an open-source learning framework where every improvement is auditable through the Skillbook. Lemma is a closed-source prompt optimization service where changes happen inside a black box.
Kayba shows you exactly what it learned and why. Lemma optimizes your prompts behind closed doors.
## What Each Tool Does
### Lemma
Lemma (YC F25) provides continuous prompt optimization as a service:
- Drift detection: Monitors agent performance and flags when outputs degrade
- Prompt optimization: Automatically rewrites prompts to improve results
- Delivery via API or PR: Pushes optimized prompts through your existing workflow
- Managed service: Handles the optimization loop end-to-end
Lemma is solving a real problem — agents drift over time, and manual prompt tuning is tedious. Their approach is to handle it as a managed service with minimal setup.
### Kayba
Kayba is an open-source learning layer (MIT, 2k+ stars) that synthesizes three published research papers into a unified framework:
- Recursive Reflector: REPL-based trace analysis that programmatically examines agent execution — grounded in the ACE framework (arXiv:2510.04618) and Reflective LLM Methods (arXiv:2512.24601)
- Skill extraction: Failures and successes are distilled into atomic, reusable skills with helpful/harmful counters
- Skillbook: A persistent, transparent collection of everything the agent has learned — organized, auditable, with provenance tracking back to source traces. Inspired by the Dynamic Cheatsheet approach (arXiv:2504.07952)
- Prompt generation: Approved skills are compiled into optimized system prompts
- Continuous learning: Delta updates refine the Skillbook incrementally as new traces come in
The framework is agent-agnostic and requires no fine-tuning — it works by improving the context your agent receives, not by retraining weights.
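The loop above (trace → skill → Skillbook → prompt) can be sketched in plain Python. This is an illustrative model only, not the actual Kayba API — all names (`Skill`, `Skillbook`, `compile_prompt`) are assumptions made for the example:

```python
from dataclasses import dataclass

@dataclass
class Skill:
    """One atomic learned behavior (illustrative, not the real Kayba API)."""
    skill_id: str
    text: str                # the behavior, phrased as an instruction
    source_trace: str        # provenance: the trace that produced this skill
    helpful: int = 0         # times this skill improved an outcome
    harmful: int = 0         # times it made an outcome worse
    status: str = "pending"  # pending -> approved / rejected after human review

class Skillbook:
    """Persistent, auditable collection of everything the agent has learned."""

    def __init__(self) -> None:
        self.skills: dict[str, Skill] = {}

    def add(self, skill: Skill) -> None:
        self.skills[skill.skill_id] = skill

    def approve(self, skill_id: str) -> None:
        self.skills[skill_id].status = "approved"

    def compile_prompt(self, base: str) -> str:
        """Compile only approved skills into an optimized system prompt."""
        approved = [s.text for s in self.skills.values() if s.status == "approved"]
        return base + "\n\nLearned guidelines:\n" + "\n".join(f"- {t}" for t in approved)

# A skill distilled from a failure trace, reviewed, then compiled into the prompt.
book = Skillbook()
book.add(Skill("s1", "Confirm the user's order ID before issuing a refund",
               source_trace="trace-0042"))
book.approve("s1")
print(book.compile_prompt("You are a support agent."))
```

The point of the sketch is the shape of the data: every skill carries its counters and a pointer back to the trace that produced it, and only reviewed skills reach the generated prompt.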
## The Key Difference: Transparency
When Lemma optimizes your prompt, you get a new prompt. You don't see the reasoning, the intermediate analysis, or the specific failure patterns it detected. If something breaks, you're debugging a black box.
When Kayba generates an improved prompt, every step is traceable:
| Step | What you can inspect |
|---|---|
| Trace analysis | Which traces were analyzed, what the Recursive Reflector found |
| Skill extraction | Each skill links to the specific trace and failure pattern that produced it |
| Skillbook | Every learned behavior is visible — helpful count, harmful count, source, status |
| Review | You approve, edit, or reject skills before they affect prompts |
| Prompt generation | The generated prompt maps directly to approved Skillbook entries |
With Kayba, if an agent's behavior changes, you can trace exactly which skill caused it, which trace that skill came from, and whether the skill is actually helping. With Lemma, you get an optimized prompt and trust that it's better.
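That audit trail can be made concrete with a small helper. This is a hypothetical sketch (the function name `audit_skill` and the record fields are assumptions, not Kayba's API): given a skill record with its counters and provenance, it answers "where did this behavior come from, and is it actually helping?"

```python
# Hypothetical audit helper; field names are illustrative, not the real schema.
def audit_skill(skill: dict) -> dict:
    """Trace a skill back to its source and score its net benefit."""
    total = skill["helpful"] + skill["harmful"]
    return {
        "skill": skill["id"],
        "source_trace": skill["source_trace"],  # the trace that produced it
        "net_benefit": (skill["helpful"] - skill["harmful"]) / total if total else 0.0,
        "recommend": "keep" if skill["helpful"] > skill["harmful"] else "review",
    }

report = audit_skill({
    "id": "skill-17",
    "source_trace": "trace-0042",
    "helpful": 9,
    "harmful": 2,
})
print(report["recommend"])  # prints "keep"
```

This is exactly the question a black-box optimizer cannot answer: the inputs to the audit (counters, source trace, review status) only exist if the system records them in the first place.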
## Comparison
| Dimension | Kayba | Lemma |
|---|---|---|
| Open source | Yes, MIT license | No, closed-source |
| Transparency | Full — Skillbook shows every learned behavior with provenance | Black-box — optimized prompts delivered without visible reasoning |
| Research backing | 3 published papers (ACE, RLM, Dynamic Cheatsheet) | No published research |
| Approach | Trace analysis, skill extraction, Skillbook curation, prompt generation | Drift detection, prompt rewriting |
| Human review | Built-in — approve, edit, or reject skills before deployment | Limited — you receive optimized prompts |
| Self-hosting | Yes, run entirely on your infrastructure | No, managed service only |
| Framework dependency | Framework-agnostic (any agent, any trace format) | Integration-dependent |
| Fine-tuning required | No — improves context, not weights | No — prompt-level optimization |
| Pricing | Free (OSS) / $29/month (hosted dashboard) | Contact sales (demo-led) |
| Maturity | Production-ready, 2k+ GitHub stars, active community | Early-stage (YC F25) |
## Benchmarks
Kayba's approach is validated on public benchmarks:
- t2-bench: pass@1 improvement of +27.4%, scaling to +100% at pass@4
- Browser agents: Success rate improved from 30% to 100%, with 82% fewer steps and 65% lower cost
These results come from the published research papers and are reproducible with the open-source framework.
## When to Choose Lemma
Lemma may be a fit if:
- You want a fully managed service with zero infrastructure to maintain
- You're comfortable with a black-box approach and trust the output without needing to inspect the reasoning
- Prompt optimization is your primary concern, not building a persistent knowledge base of agent behaviors
- You prefer a demo-led sales process over self-serve
## When to Choose Kayba
Kayba is the stronger choice if:
- You need to understand exactly what changed in your agent's behavior and why
- Auditability matters — regulated industries, enterprise compliance, or teams that need to review changes before deployment
- You want to own your learning data, not send it to a third-party service
- Self-hosting is a requirement (data sovereignty, air-gapped environments)
- You value open-source — inspect the code, contribute, fork if needed
- You want research-backed methods rather than proprietary optimization
## Getting Started
Kayba is open-source and ready to use today:
```shell
pip install ace-framework
```
- Documentation — Setup guides and API reference
- GitHub — Source code and examples
- Dashboard — Hosted version with visual Skillbook management