Your AI agent needs a context budget, not a bigger model
Longer context windows help, but serious agent work needs explicit rules for what to include, cache, compress, route, delegate or drop.
TL;DR
The next useful agent feature is not just a larger context window.
It is a context budget: a clear operating policy for what the agent should keep in the prompt, what it should cache, what it should send to a tool, what it should delegate to a cheaper or local model, and what it should leave out.
Bigger windows are useful. Unbounded context is still not a strategy.
What changed
The infrastructure around context is becoming visible.
Anthropic now documents context windows as a core design constraint, not just a model spec. The same documentation set includes prompt caching and token-efficient tool use, which are both signals that context handling is moving from prompt craft into systems design.
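As a rough sketch of what that looks like in practice, the stable prefix (policy text, tool specs) can be marked cacheable so repeated agent turns reuse it; this assumes the `anthropic` Python SDK, and the model name and prompt text are illustrative:

```python
# Sketch: mark the stable prefix (policy text, tool specs) as cacheable so
# repeated agent turns reuse it instead of paying for it again.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

STABLE_POLICY = "You are an operations agent. Follow the runbook exactly. ..."  # long, stable text

response = client.messages.create(
    model="claude-sonnet-4-20250514",  # illustrative model name
    max_tokens=512,
    system=[
        {
            "type": "text",
            "text": STABLE_POLICY,
            "cache_control": {"type": "ephemeral"},  # cacheable stable prefix
        }
    ],
    messages=[{"role": "user", "content": "Summarise the failed deploy in three lines."}],
)

# Usage reports how much of the prompt was written to, or read from, the cache.
print(response.usage.cache_creation_input_tokens, response.usage.cache_read_input_tokens)
```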
OpenAI has its own prompt caching guide, and its API pricing separates input, cached input and output economics in ways that make context decisions financial decisions as well as technical ones.
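The same decision is visible on the OpenAI side by reading the cached-token split out of the usage block. A minimal sketch, assuming the official `openai` Python SDK and an illustrative model name:

```python
# Sketch: check how much of a repeated prompt was served from cache. Caching
# applies to long, stable prefixes; the usage block separates cached input from
# fresh input, which is what turns context layout into a cost decision.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

STABLE_INSTRUCTIONS = "Agent runbook: ..."  # stands in for a long, stable instruction block

response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model name
    messages=[
        {"role": "system", "content": STABLE_INSTRUCTIONS},  # identical prefix every call
        {"role": "user", "content": "Triage the latest ticket."},
    ],
)

usage = response.usage
cached = usage.prompt_tokens_details.cached_tokens if usage.prompt_tokens_details else 0
print(f"input={usage.prompt_tokens} cached_input={cached} output={usage.completion_tokens}")
```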
Around the model layer, tools such as LiteLLM routing and LiteLLM virtual keys and budgets show the same pattern from another angle: model calls are no longer just calls. They are routed, metered, retried, budgeted and governed.
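A rough sketch of that pattern with LiteLLM's Router: a named deployment pointed at a local model, with a fallback to a hosted one. The model names, endpoint and policy are illustrative, and the budget and virtual-key controls live in the LiteLLM proxy rather than in this snippet:

```python
# Sketch: route bounded work to a cheap local deployment, fall back to a hosted
# model on failure. Names, endpoint and policy are illustrative.
from litellm import Router

router = Router(
    model_list=[
        {
            "model_name": "bounded-worker",
            "litellm_params": {"model": "ollama/llama3.1", "api_base": "http://localhost:11434"},
        },
        {
            "model_name": "premium-judgement",
            "litellm_params": {"model": "gpt-4o"},
        },
    ],
    fallbacks=[{"bounded-worker": ["premium-judgement"]}],  # retry against the hosted model
)

result = router.completion(
    model="bounded-worker",
    messages=[{"role": "user", "content": "Classify this log line: 'disk 92% full'."}],
)
print(result.choices[0].message.content)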
That is the boring layer. Which means it matters.
Why it matters
Agent systems are context-hungry by default.
They want instructions, memories, files, tool schemas, recent chat, logs, browser state, source documents, examples, previous failures, user preferences and half the working directory. Then they want to call tools and paste the results back in.
That works until it does not.
A large context window can hide bad context hygiene for a while. It does not remove the tradeoff. More context can mean more cost, more latency, more irrelevant material, more privacy exposure and more room for the model to attend to the wrong thing.
The practical question is not “how much can we fit?”
It is “what deserves to be here?”
Embedded context budget
| Context decision | Use when | Default rule |
|---|---|---|
| Keep in prompt | The information is directly needed for the current judgement or action | Include it, but trim to the smallest useful excerpt |
| Cache | Stable instructions, tool specs, policy text or repeated source material will be reused | Cache the stable prefix where provider support makes that economical |
| Retrieve | The material is too large but searchable | Pull only the relevant passage, not the whole archive |
| Route to a tool | A specialised system can answer better than the model guessing | Use the tool and return compact results |
| Delegate to local/cheap model | The task is bounded: extraction, classification, cleanup, first-pass summarisation | Offload it, then reserve expensive judgement for the final call |
| Drop | The information is stale, duplicative, weakly related or merely comforting | Leave it out and preserve the budget |
This is not a prompt trick. It is an operating policy.
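One way to make that concrete is to write the table down as a decision function rather than a habit. A minimal sketch; every field name and threshold here is hypothetical:

```python
# Sketch: the context budget as an explicit decision. All names and thresholds
# are hypothetical illustrations of the table above.
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    KEEP = "keep in prompt"
    CACHE = "cache stable prefix"
    RETRIEVE = "retrieve excerpt"
    TOOL = "route to tool"
    DELEGATE = "delegate to cheap/local model"
    DROP = "drop"


@dataclass
class ContextItem:
    name: str
    tokens: int
    directly_needed: bool   # needed for the current judgement or action
    stable: bool            # reused verbatim across calls (instructions, tool specs)
    searchable: bool        # lives in a store we can query for excerpts
    tool_answerable: bool   # a specialised system can answer it directly
    bounded_task: bool      # extraction, classification, cleanup, first-pass summary


def budget(item: ContextItem, max_inline_tokens: int = 2_000) -> Decision:
    if item.stable:
        return Decision.CACHE
    if item.tool_answerable:
        return Decision.TOOL
    if item.bounded_task:
        return Decision.DELEGATE
    if not item.directly_needed:
        return Decision.DROP
    if item.tokens > max_inline_tokens and item.searchable:
        return Decision.RETRIEVE
    return Decision.KEEP


# e.g. a 40k-token source document that is searchable: pull the excerpt, not the archive
print(budget(ContextItem("source document", 40_000, True, False, True, False, False)))
```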
The RTK-shaped signal
RTK-style token-saving wrappers are interesting because they point at the right layer.
The claim is not that any one wrapper solves context management. The claim is that agent infrastructure is starting to treat token output, tool-call overhead and context bloat as engineering targets.
That includes the boring craft of making agents speak more like operations staff and less like brand interns. Terse language matters. Army-speak, checklists, status codes, routing labels, proxy summaries and compressed handoff notes are not aesthetic quirks; they are ways of reducing token waste and ambiguity.
A proxy summary is often better than a transcript. A compact tool result is often better than a log dump. A clear blocked / done / needs-human status line is often better than five paragraphs of reassuring sludge.
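A sketch of what that compression can look like as a handoff record; the field names are assumptions, only the blocked / done / needs-human vocabulary comes from the pattern above:

```python
# Sketch: a terse, machine-checkable handoff instead of paragraphs of reassurance.
# Field names are hypothetical; the status vocabulary mirrors the text above.
from dataclasses import dataclass
from typing import Literal

Status = Literal["done", "blocked", "needs-human"]


@dataclass
class Handoff:
    status: Status
    summary: str                  # one-line proxy summary, not a transcript
    refs: list[str]               # where the full detail lives, if anyone needs it
    next_action: str | None = None

    def line(self) -> str:
        tail = f" next={self.next_action}" if self.next_action else ""
        return f"[{self.status.upper()}] {self.summary} refs={','.join(self.refs)}{tail}"


note = Handoff("blocked", "deploy halted: migration step failed", ["ci-run-log"], "escalate to on-call")
print(note.line())  # [BLOCKED] deploy halted: migration step failed refs=ci-run-log next=escalate to on-call
```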
That is the direction to watch: less theatre around the assistant’s personality, more work on the machinery that keeps the run small, auditable and affordable.
A good agent runtime should know when to say: this does not belong in the main model call.
Routing beats dumping
Routing is the other half of the signal.
LlamaIndex documents routers that choose between query engines or tools instead of sending every question to one giant context bundle. LiteLLM documents provider and model routing and fallbacks, while its virtual-key system adds budget and access controls.
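A minimal sketch of the LlamaIndex side of that, following its documented router pattern; the directory path, tool descriptions and question are illustrative, and it assumes default LLM and embedding settings are already configured:

```python
# Sketch: route questions to the engine that fits them, instead of sending every
# question to one giant context bundle. Path and descriptions are illustrative.
from llama_index.core import SimpleDirectoryReader, SummaryIndex, VectorStoreIndex
from llama_index.core.query_engine import RouterQueryEngine
from llama_index.core.selectors import LLMSingleSelector
from llama_index.core.tools import QueryEngineTool

docs = SimpleDirectoryReader("./runbooks").load_data()

tools = [
    QueryEngineTool.from_defaults(
        query_engine=VectorStoreIndex.from_documents(docs).as_query_engine(),
        description="Answer narrow questions by retrieving specific passages.",
    ),
    QueryEngineTool.from_defaults(
        query_engine=SummaryIndex.from_documents(docs).as_query_engine(),
        description="Answer broad questions that need a whole-document summary.",
    ),
]

router = RouterQueryEngine(selector=LLMSingleSelector.from_defaults(), query_engine_tools=tools)
print(router.query("Which runbook covers failed database migrations?"))
```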
Ollama’s OpenAI-compatible API work shows why this matters locally too. If a local model can sit behind a familiar interface, then the agent can offload bounded work without redesigning the whole stack.
That is the same operating pattern covered in the lab note "Local models are useful when the job is bounded": do the cheap, bounded, checkable work locally where possible; save expensive hosted judgement for the parts that actually need it.
Local does not mean magic. It still needs hardware, logging discipline, model selection and quality checks. But for bounded internal tasks, it can be exactly the right place to spend cheap tokens instead of expensive ones.
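A minimal sketch of that offload, assuming a local Ollama daemon exposing its OpenAI-compatible endpoint on the default port; the model name and task are illustrative:

```python
# Sketch: send bounded, checkable work to a local model behind Ollama's
# OpenAI-compatible endpoint, keeping the hosted model out of the loop.
from openai import OpenAI

local = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")  # key is required but unused locally

resp = local.chat.completions.create(
    model="llama3.1",  # illustrative local model
    messages=[
        {"role": "system", "content": 'Return only a JSON object: {"category": ..., "severity": ...}.'},
        {"role": "user", "content": "Log line: 'payment webhook retry exhausted after 5 attempts'"},
    ],
)
print(resp.choices[0].message.content)
```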
Practical implications
If you are building or buying agent systems, ask these questions before asking for a bigger model:
- What information is always loaded into the agent?
- Which parts are cached, and which are resent every time?
- Can the system retrieve excerpts instead of stuffing whole documents into context?
- Are tool results compact, or do they dump noisy logs back into the model?
- Can bounded work route to a local or cheaper model automatically?
- Does the agent have a terse status language for handoffs, blockers and tool results?
- Are premium models protected by explicit budget rules?
- Can a human inspect why a model or context path was chosen?
If the answer is “we just give it everything”, the system is not mature. It is lucky.
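One way to make those questions answerable is to record an explicit context manifest per run. A hypothetical sketch of what that record might hold:

```python
# Sketch: a per-run context manifest, so a human can inspect what was loaded,
# cached, retrieved, delegated or dropped, and why. The shape is hypothetical.
from dataclasses import dataclass, field


@dataclass
class ManifestEntry:
    item: str        # e.g. "tool schemas", "incident log excerpt"
    decision: str    # keep / cache / retrieve / tool / delegate / drop
    tokens: int
    reason: str      # why this path was chosen, in one short line


@dataclass
class ContextManifest:
    run_id: str
    model: str
    entries: list[ManifestEntry] = field(default_factory=list)

    def prompt_tokens(self) -> int:
        # only items actually sent inline count against the prompt
        return sum(e.tokens for e in self.entries if e.decision in {"keep", "retrieve"})
```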
Rob’s take
The context window arms race is useful, but it is not the whole game.
Real agent work needs a context accountant.
Not because tokens are precious in some abstract optimisation sense. Because context is attention, money, latency, privacy and operational risk all packed into one prompt.
The better agent systems will not just remember more. They will know what not to carry.
Watch next
- Provider-level prompt caching becoming normal, not exotic.
- Tool-call formats getting smaller and more structured.
- Agent runtimes adding explicit context manifests.
- Budget policies tied to model routing, not pasted into a README.
- Local models used as bounded workers rather than fake frontier replacements.
Sources
- Anthropic: Context windows
- Anthropic: Prompt caching
- Anthropic: Token-efficient tool use
- OpenAI: Prompt caching
- OpenAI: API pricing
- LiteLLM: Routing, load balancing and fallbacks
- LiteLLM: Virtual keys and budgets
- LlamaIndex: Routers
- Ollama: OpenAI compatibility