What "context" really costs you
Context is the most expensive thing about AI, and almost nobody budgets for it.
By context we mean what the model sees, when it sees it, and what happens to that information afterward. Models are cheap to call. The work of giving them the right context at the right moment is not — and that is where the real bill comes from.
If you have not yet read "AI vocabulary without the vendor speak", start there for the definitions behind the words we use here.
Context is not free, and it compounds
Every time you send a request to a model, the text that goes with it — instructions, question, reference material, conversation history — is called the context. You pay for every word of it. The model's attention is also paid out of that budget: the more you stuff in, the less of it the model actually uses well.
"Context" is three costs in a trench coat:
- The bill. What you pay your provider per request.
- The quality loss. What the model misses when it has too much to wade through.
- The governance cost. The overhead of knowing what went into the prompt, where it came from, who can see it, and how to unwind it.
Most teams track the first one, sometimes. Most ignore the other two until something breaks.
1. The bill
Token pricing is straightforward once you sit with it for five minutes. (A token is roughly three-quarters of a word.)
Some rough sizes:
- A page of a PDF: around 600 tokens.
- An email thread: around 2,000.
- A long contract: 20,000 to 40,000.
- A one-hour meeting transcript: around 10,000.
Per-token pricing moves every few months. The shape of the math does not: cost per request × number of requests = your monthly AI bill. That is trivial arithmetic, and people keep failing at it because the per-request cost is so small that nobody checks it.
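A back-of-envelope sketch in Python makes the compounding visible. The prices below are placeholders, not anyone's current rates, and the tokenizer encoding is one common open-source choice, so treat the numbers as illustrative, not quoted.

```python
import tiktoken  # open-source tokenizer; the encoding choice is illustrative

# Placeholder prices -- check your provider's current rate card.
INPUT_PRICE_PER_MTOK = 3.00    # dollars per million input tokens
OUTPUT_PRICE_PER_MTOK = 15.00  # dollars per million output tokens

def count_tokens(text: str) -> int:
    """Rough token count using a common open-source encoding."""
    return len(tiktoken.get_encoding("cl100k_base").encode(text))

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one model call."""
    return (input_tokens * INPUT_PRICE_PER_MTOK
            + output_tokens * OUTPUT_PRICE_PER_MTOK) / 1_000_000

# Example: resending a 40,000-token handbook on every query,
# 300-token answers, 500 queries a day.
per_request = request_cost(input_tokens=40_000, output_tokens=300)
print(f"${per_request:.4f} per request -> ${per_request * 500 * 30:,.2f} per month")
```

At these placeholder rates that is roughly $0.12 per request and about $1,900 a month, almost all of it the handbook resent fifteen thousand times.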
Where the bill gets away from people:
- Sending the whole document on every question. If your assistant loads the 40,000-token employee handbook on every query, you are paying to resend it thousands of times a month. Retrieve only the relevant sections.
- Fat instructions with examples that never change. Major providers offer prompt caching that cuts the cost of static context roughly tenfold, yet most teams never turn it on. A sketch of enabling it follows this list.
- "Memory" built as full transcript replay. Passing the entire conversation back on every turn is the default in most starter examples. It will bankrupt a long-running assistant. Summarize, compact, or store and retrieve — do not stack forever.
- Output bloat. If the model writes 2,000-word answers when you needed 300, your output bill is roughly seven times what it needs to be. Output is usually priced higher than input. Bound it.
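Here is a minimal sketch of the caching and output-bounding fixes, using Anthropic's Messages API as one concrete example. The model name and handbook file are placeholders, and other providers expose caching differently (sometimes automatically), so check your provider's documentation before leaning on this.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

STATIC_INSTRUCTIONS = "You answer HR questions using only the handbook below."
HANDBOOK = open("employee_handbook.txt").read()  # large, rarely changes

response = client.messages.create(
    model="claude-sonnet-4-20250514",  # placeholder model name
    max_tokens=400,  # bound the output instead of paying for 2,000-word answers
    system=[
        {
            "type": "text",
            "text": STATIC_INSTRUCTIONS + "\n\n" + HANDBOOK,
            "cache_control": {"type": "ephemeral"},  # cache this static prefix
        }
    ],
    messages=[{"role": "user", "content": "How many vacation days do I get?"}],
)
print(response.content[0].text)
```

The design point: everything stable lives in the cached prefix, and only the user's question rides in the uncached tail.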
If you cannot put the token cost of one run of a workflow on an index card, you are not managing it. You are absorbing it.
2. The quality loss
The marketing message lately is "bigger context windows." Models can read a million tokens in one shot now. That is impressive. It is also misleading.
What the research consistently shows — and what every operator who has tried it has felt — is that as the context grows, the model's ability to use it gets worse. It pays close attention to what is at the start and end of a long prompt, and glazes over what is in the middle, a failure the research literature calls being "lost in the middle."
Practical consequences:
- Relevance beats volume. A short prompt with the three right paragraphs will outperform a massive prompt that includes everything you own. Choosing what to retrieve is not an optimization — it is the whole game.
- Order matters. Put the most important instructions at the start, the most relevant reference material just before the question, and the question itself last. A sketch of this assembly follows the list.
- Cheap context is not free context. Even as token prices drop, the model's attention is a limited resource. Diluted context does not fix itself because it became affordable.
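That ordering rule is mechanical enough to write down. In the sketch below, retrieve() is a hypothetical stand-in for whatever retrieval layer you use; the shape of the assembled prompt is the point.

```python
def build_prompt(instructions: str, question: str, retrieve) -> str:
    """Assemble a prompt in the order long-context models handle best."""
    passages = retrieve(question, k=3)  # the three right paragraphs, not everything
    references = "\n\n".join(f"[source: {p.source}]\n{p.text}" for p in passages)
    return (
        f"{instructions}\n\n"                       # most important instructions first
        f"Reference material:\n\n{references}\n\n"  # relevant material just before the question
        f"Question: {question}"                     # the question itself last
    )
```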
When an AI system is underperforming, the first question is rarely "are we on the best model?" It is: what is in the prompt, in what order, and why?
3. The governance cost
This is the cost that shows up in a risk register two years in.
Every word you put into a prompt is a word you have chosen to expose to the model. That text may be used for training, stored for monitoring, logged for your own records, or retrieved later by someone else's query — depending on how your provider is configured. Every answer the model produces is an answer your organization has effectively issued, grounded or not, reviewed or not.
Four practical consequences:
- The prompt is the real perimeter. If you do not want customer PII in your provider's infrastructure, the control is a filter on the context, not a checkbox on the sales order.
- Retrieval breaks access control if you let it. When a system pulls documents from a database into a prompt, whatever permissions existed on the original data need to be re-enforced at retrieval time. The single most common security miss we see in deployed AI systems is a missing filter, not a compromised model. A sketch of such a filter follows this list.
- Accountability starts with the prompt. If you need to explain why a model said something — to a regulator, a customer, or yourself — the minimum record is the exact prompt the model saw, the exact output, the model version, and the retrieval sources. "We use ChatGPT" is not an answer.
- Deletion extends to the prompts and the embeddings. If a customer asks you to delete their data, your obligation includes the prompts and vector stores you have built, not just the source database. Most teams forget the vector store exists.
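A minimal sketch of the retrieval-time filter from the second point above. vector_store.search, user.can_read, and source_acl are hypothetical names; what matters is that the permission check runs before anything enters the prompt, not after the answer is generated.

```python
def retrieve_for(user, query: str, vector_store, k: int = 3):
    """Retrieve chunks for a prompt, re-enforcing source permissions."""
    candidates = vector_store.search(query, k=k * 5)  # over-fetch, then filter
    allowed = [c for c in candidates if user.can_read(c.source_acl)]
    return allowed[:k]  # only permission-checked chunks reach the context
```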
The real work has a name
Two years ago, the fashionable phrase was "prompt engineering." The more honest name — now used by the people actually building production systems — is context engineering.
It means designing what the model sees, when it sees it, and what happens to it afterward:
- Selection. Which documents, examples, and state get pulled in for this call.
- Ordering. Where each piece sits in the prompt.
- Compaction. How long conversations and large documents get summarized so you are not paying to move the world on every turn. A sketch follows this list.
- Caching. What is stable enough to be cached and paid for once.
- Provenance. Which source each retrieved chunk came from, captured well enough that an answer can be traced back.
- Governance. Who is allowed to see what, before the context is built, and what happens after.
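To make the compaction item concrete, here is a minimal sketch. summarize() stands in for a cheap model call that folds a turn into a running summary; count_tokens() is a tokenizer helper like the one sketched in the billing section. Both are hypothetical names, and the budget is illustrative.

```python
MAX_HISTORY_TOKENS = 4_000  # illustrative budget; tune per workload

def compact(history: list[dict], summary: str, summarize, count_tokens):
    """Fold the oldest turns into a running summary once history outgrows its budget."""
    def total() -> int:
        return sum(count_tokens(turn["content"]) for turn in history)
    while len(history) > 2 and total() > MAX_HISTORY_TOKENS:
        oldest = history.pop(0)               # drop the oldest verbatim turn...
        summary = summarize(summary, oldest)  # ...and fold it into the summary
    return history, summary
```

Each turn then sends the summary plus the remaining verbatim turns, instead of replaying the full transcript.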
This is not exciting. It is where roughly 80% of the quality, safety, and cost of a production AI system is actually determined. The model is a commodity. The context is the craft.
Four principles that survive contact with production
None of these are original to us. They are the ones that keep earning their keep in serious systems work.
- Retrieve narrow before you generate wide. Three right paragraphs beat a 200-page handbook.
- Separate durable from volatile. Stable instructions go in a cache-friendly layer; today's query and today's data go in the hot path.
- Make the context inspectable. If an engineer cannot pull the exact prompt that produced a given output in under sixty seconds, your system is not debuggable. It is not accountable either.
- Measure before you optimize. Pick ten to fifty real examples, write down what "good" looks like, and run them. Your intuition about which model or prompt is better is almost always wrong in specific, repeatable ways.
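A minimal sketch of that measurement loop. run_workflow() is the system under test and judge() is whatever per-example check you wrote down for "good"; both are hypothetical names, and the JSONL file is one convenient format, not a requirement.

```python
import json

def run_evals(path: str, run_workflow, judge) -> float:
    """Run every saved example and report the pass rate."""
    results = []
    with open(path) as f:  # one JSON object per line: {"input": ..., "expected": ...}
        for line in f:
            example = json.loads(line)
            output = run_workflow(example["input"])
            results.append(judge(output, example["expected"]))  # True or False
    rate = sum(results) / len(results)
    print(f"{sum(results)}/{len(results)} passed ({rate:.0%})")
    return rate
```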
How we help
When you pay a consultant, an agency, or an internal team to "do AI" for you, the work they are actually doing — whether the deck says it or not — is context engineering. The model choice is twenty minutes of work. The prompts are a week. The retrieval, the evaluations, the cost discipline, the access control, the provenance, the operational runbook — that is the real work, and that is what separates a demo from a durable system.
We do not do that work for you. We build the discipline into your team. If that is what you are trying to build, two sentences on where you are stuck is enough to start a real conversation.
Going deeper
For the business-side version of the same idea — context at the organization level rather than inside a prompt — read "Organization context before models".
For the research and documentation underneath:
- Jake Van Clief, "Interpretable Context Methodology" — the paper that sharpened our view of context as architecture.
- Anthropic's documentation on prompt caching — how to pay once for context that does not change.
- OpenAI's platform documentation — the official reference for their caching, retrieval, and evaluation patterns.
For ongoing practitioner writing worth following:
- Simon Willison's blog — running commentary on what is actually working inside production LLM systems, including where the cost and quality cliffs live. Our default reference for keeping up with the field without the noise.
- Andrej Karpathy, "Intro to Large Language Models" — a clean one-hour explanation of how these systems actually work, useful before you start making architectural decisions about context.
- Hamel Husain, "A Field Guide to Rapidly Improving AI Products" and "Your AI Product Needs Evals" — the closest practitioner mirror of our own methodology. If you are going to read one consultant's writing on measuring AI systems, read his.
- Chip Huyen, "AI Engineering" — the reference book for building real applications on top of foundation models. The technical companion to this essay.
- Eugene Yan, "Patterns for Building LLM-based Systems & Products" — the canonical practitioner-level reference for evals, RAG, caching, guardrails, and defensive UX. If you are building in-house, this is a bookmark, not a one-time read.
If something in here maps to a problem you are sitting on
Two sentences on what you are trying to do is enough to start. We reply personally — no sequences, no SDR handoff.