Context engineering is the delicate art and science of filling the context window with just the right information for the next step. - Andrej Karpathy
The Problem
LLMs have an attention budget. Every token depletes it.
- O(n²) attention pairs → longer context = thinner, noisier attention
- ChromaDB study: 11/12 models dropped below 50% performance at 32K tokens
- Microsoft study: accuracy fell from 90% → 51% in longer conversations
More context ≠ better outcomes. After a threshold, performance degrades (context rot).
Why Context Fails
The parallel to human cognition is striking. When humans face information overload, the dorsolateral prefrontal cortex "gives up" and decision quality deteriorates. But humans have anxiety as a warning signal. LLMs have no such mechanism. They silently degrade without self-awareness.
Research reveals counterintuitive findings:
- Distractors: Even ONE irrelevant element reduces performance
- Structure Paradox: Logically organized contexts can perform worse than shuffled ones
- Position Effects: Information at start/end is retrieved better than middle
The implication: careful curation beats comprehensive context every time.
Types of Agent Memory
Not all memory is equal:
| Type | What it stores | Example | Limitation |
|---|---|---|---|
| Semantic | Facts about things | "Python uses indentation" | Doesn't teach how |
| Episodic | Events that happened | "Build failed at 3pm" | Context-specific, doesn't generalize |
| Procedural | How to do things | "Always check schema before migration" | Transfers across tasks |
RAG gives you semantic memory. Chat history gives you episodic memory. The challenge is building procedural memory: patterns of how to succeed that transfer to new situations.
Approaches to Context Management
Static Context
What most teams start with:
CLAUDE.md/CURSOR_RULESfiles with project rules- Examples folders
- Manual PRPs (Product Requirements Prompts)
Trade-offs:
- ✅ Simple to implement
- ✅ Predictable behavior
- ❌ Goes stale fast
- ❌ Manual maintenance overhead
- ❌ Token bloat (loads everything every time)
Long-Horizon Techniques
For tasks that exceed the context window:
Compaction
- Summarize history → restart with high-fidelity summary
- Keep architectural decisions, discard redundant tool outputs
- Best for: conversational tasks with extensive back-and-forth
Structured Note-Taking
- Agent writes persistent notes outside context (e.g.,
NOTES.md) - Pull back into context as needed
- Best for: iterative development with clear milestones
Sub-Agent Architectures
- Coordinator plans; specialized sub-agents do deep dives
- Return condensed summaries (≈1-2k tokens)
- Best for: complex research where parallel exploration pays off
Dynamic Context / Learning Systems
Systems where context evolves through execution:
- Reflect on what worked/failed
- Curate strategies into persistent memory
- Inject learned patterns on future runs
This addresses the maintenance problem of static context. The system learns instead of requiring manual updates.
The Stanford ACE framework formalizes this as a feedback loop between execution and curation. Our open-source implementation of the framework (agentic-context-engine) has shown promising results: 30% → 100% success rate on browser automation with 82% fewer steps.
Key Principles
1. Smallest Possible High-Signal Tokens
Good context engineering = finding the minimum tokens that maximize desired outcome.
Techniques:
- Compression formats (reduce token overhead)
- Citation-based tracking (reference, don't repeat)
- Active pruning (remove what doesn't help)
2. Just-In-Time Context
Don't preload everything. Fetch what's needed during execution.
- Keep lightweight references (file paths, queries)
- Load data at runtime using tools
- Mirrors human cognition: we don't memorize databases, we know how to look things up
3. Right Altitude
System prompts should be clear but not over-specified:
- Too specific → fragility, high maintenance
- Too vague → bad output, false assumptions
Find the level of abstraction that guides without constraining.
4. Tool Design
Fewer, well-scoped tools beat many overlapping ones. If a human can't pick the right tool from your set, the model won't either.