The Problem with Customer Support Agents
Customer support is one of the highest-volume deployments for AI agents. Companies are using LLM-powered agents to handle returns, process refunds, answer billing questions, and route escalations. The agents work well on straightforward cases, but they fail in predictable, repeated ways:
- Policy violations — approving refunds outside the return window, applying discounts the customer isn't eligible for, or waiving fees that policy doesn't allow
- Missed escalations — failing to route high-severity issues (legal threats, safety concerns, VIP accounts) to human agents when the situation demands it
- Wrong refund calculations — issuing full refunds when only partial credit applies, or not accounting for restocking fees and promotional pricing
- Inconsistent responses — giving different answers to the same question depending on how the customer phrases it, eroding trust and creating audit risk
- Eligibility check failures — not verifying order status, warranty expiration, or account standing before offering a resolution
- Tone misjudgments — responding with scripted cheerfulness to a frustrated customer who has contacted support three times about the same issue
Each of these failures has a cost. A wrongly issued refund is direct revenue loss. A missed escalation becomes a complaint to leadership. Inconsistent responses lead to customers gaming the system by rephrasing their request until the agent says yes.
The worst part: the agent makes the same mistakes tomorrow because it has no memory of what went wrong today.
How Kayba Solves This
Kayba is an open-source learning layer that sits on top of your support agent. It doesn't replace your agent or modify its runtime behavior. Instead, it analyzes conversation traces after the fact, extracts reusable skills, and generates improved prompts so the agent handles the next interaction better.
The Pipeline for Support Agents
-
Trace collection: Your support agent produces conversation logs for each customer interaction — the customer's messages, the agent's responses, any tool calls (order lookups, refund processing, escalation triggers), and the final resolution.
-
Recursive analysis: Kayba's Recursive Reflector uses REPL-based code execution to programmatically analyze these traces. It doesn't just summarize conversations — it cross-references agent decisions against policy rules, checks whether eligibility was verified before actions were taken, and identifies patterns across hundreds of interactions.
-
Skill extraction: Failures and successes become atomic skills. For a support agent, these might look like:
- "Before approving any refund over $50, verify the order is within the 30-day return window by checking the
order_datefield" - "When a customer mentions 'lawyer', 'legal', or 'attorney', immediately escalate to a human agent regardless of issue severity"
- "Promotional bundle items can only be returned as a complete set — do not process partial returns on bundled orders"
- "If the customer has contacted support 3+ times about the same order, acknowledge the repeated contact and prioritize resolution"
- "Before approving any refund over $50, verify the order is within the 30-day return window by checking the
-
Skillbook curation: Skills accumulate in a transparent Skillbook with helpful and harmful counters. A skill that prevents policy violations across many interactions gets reinforced. A skill that turns out to be too restrictive (flagging too many false escalations, for example) gets flagged for review.
-
Prompt generation: Approved skills are compiled into an improved system prompt for the support agent. Delta updates mean only new or modified skills are added — the prompt doesn't bloat over time.
What the Skillbook Looks Like
After analyzing a few hundred support interactions, a typical Skillbook for a customer support agent might contain:
| Skill | Section | Helpful | Harmful |
|---|---|---|---|
| Verify return window (30 days from delivery, not purchase) before processing refunds | Return Policy | 23 | 0 |
| Escalate immediately when customer mentions legal action or regulatory complaints | Customer Escalation | 18 | 2 |
| Check loyalty tier before offering retention discounts — Gold+ only | Refund Processing | 14 | 1 |
| Partial refunds apply to orders with mixed returnable/non-returnable items | Order Management | 11 | 0 |
| Acknowledge repeated contacts before jumping to troubleshooting | Customer Escalation | 9 | 0 |
| Restocking fees (15%) apply to electronics opened past the seal | Return Policy | 7 | 1 |
Each skill links back to the specific traces that produced it. You can audit exactly which customer interaction taught the agent each behavior and adjust the skill if business rules change.
Results: Enterprise Support Benchmarks
Kayba's effectiveness on customer support tasks is validated by results on the t2-bench benchmark, which specifically tests enterprise customer service domains including order management, return processing, and policy compliance scenarios.
On t2-bench, agents using Kayba's learning pipeline showed:
- pass@1: +27.4% improvement — the agent gets it right on the first attempt significantly more often
- pass@4: +100% improvement — with multiple attempts, the ceiling doubles
These aren't gains on generic chatbot benchmarks. t2-bench evaluates exactly the kinds of multi-step, policy-constrained decisions that support agents face daily: checking eligibility, applying the right refund type, routing escalations correctly.
The improvement comes from the Skillbook encoding domain knowledge that the base model doesn't have — your specific return windows, your escalation criteria, your refund tiers.
Real-World Failure Patterns
Pattern: Policy Violations on Edge Cases
A support agent approves a full refund for a customer returning an opened laptop 45 days after purchase. The return policy allows 30 days, with a 15% restocking fee for opened electronics. The agent misses both rules.
Without Kayba: The refund goes through. Finance catches it during reconciliation. A manager updates the system prompt with more detailed policy language, but the prompt is already 3,000 tokens long and the agent still misses edge cases buried in the middle.
With Kayba: The Recursive Reflector identifies a pattern across multiple traces where the agent failed to check the return window before issuing refunds. Two skills are extracted: one for the 30-day window check and one for the restocking fee on opened electronics. These skills are concise and actionable — the agent's next prompt includes them as specific rules rather than buried paragraphs.
Pattern: Escalation Handling Failures
A customer writes in frustrated after three failed delivery attempts. They mention they'll "take this to consumer protection" if not resolved. The agent responds with standard troubleshooting steps and offers to reschedule delivery.
Without Kayba: The interaction eventually resolves after the customer calls in and reaches a human agent. No learning happens. The next time a customer mentions a regulatory body, the AI agent again misses the escalation signal.
With Kayba: Analysis across escalation-related traces reveals that the agent consistently fails to detect indirect legal or regulatory language. A skill is extracted: "When a customer references consumer protection agencies, the Better Business Bureau, or regulatory bodies, treat this as an escalation trigger equivalent to direct legal threats." The helpful counter on this skill climbs as it prevents repeated escalation failures.
Pattern: Eligibility Check Gaps
A customer requests a price match on an item purchased during a promotional event. The agent checks current pricing and issues a partial refund for the difference. However, the promotion terms explicitly excluded price matching for 60 days post-purchase.
Without Kayba: The incorrect adjustment processes. When the policy team audits, they find dozens of similar cases — the agent never checks promotional terms before applying price adjustments.
With Kayba: The Skillbook builds a skill: "Before processing price adjustments, check whether the original order was placed during a promotional event. Promotional purchases have a 60-day price-match exclusion." The skill links back to the traces where this failure occurred, making it easy for the policy team to verify the rule is correctly captured.
Integration
Kayba is framework-agnostic. It works with any support agent that produces conversation logs:
- Intercom / Zendesk AI agents — Export conversation transcripts and feed them into Kayba for analysis
- Custom LangChain / CrewAI agents — Pipe traces directly from your support agent pipeline
- OpenAI Agents SDK — Works with agents built on the Assistants API or Agents SDK
- Any framework — If your agent produces logs with customer messages, agent responses, and tool calls, Kayba can analyze them
pip install ace-framework
No changes to your agent's runtime are required. Kayba analyzes traces offline and generates improved prompts. Your agent keeps running as-is — it just starts each new interaction with better context.
Getting Started
- Documentation — Setup guides and API reference
- GitHub — Source code and examples
- Dashboard — Hosted dashboard for visual Skillbook management and prompt generation
- Book a Demo — See Kayba applied to your support agent's traces