Context Engineering
Context Engineering is a buzzword that has been going around for some time. Ever since LLMs arrived, prompt engineering has been the gold standard for getting the best results. The focus was on writing clear, effective prompts so that the LLM does not go off the rails or hallucinate. But recently, the models have gotten very good. You don’t need to preface your prompt with “You are a really good Software Engineer” anymore and you most certainly don’t need to end every prompt with “MAKE NO MISTAKES”. In fact, with these stronger models, we are moving toward more autonomous flows that usually involve multiple loops of inference, tool calls and longer time horizons.
These autonomous flows generate far more data than a single prompt ever did (full message history, MCPs, tools, skills etc). Passing all of it on to the next loop can be disastrous for both cost and the quality of the model output. Context Engineering, in my opinion, is the natural progression of prompt engineering: it focuses on curating all the information we have and passing only the most relevant parts into the limited context window.
Why not just pass in all the information?
LLMs cannot make good use of an unlimited amount of information. Context rot is the drop in output quality as more of the limited context window fills up: evidence suggests that model performance degrades, sometimes sharply, as usage approaches the end of the window. It is important to be prudent about what the model focuses on, given the limited “attention” it has. So you might ask: why not just have a larger context window? That way the model can take in everything.
A useful analogy: context is like a giant whiteboard. A bigger whiteboard lets you write more, but a reader still cannot hold every tiny detail in mind, along with where each detail lives and how they all connect. Attention is still limited, so a bigger window does not mean everything in it gets used.
- Context window — how much text the model is allowed to see
- Attention budget — how well the model can use and focus on the text it sees
There is also a computational cost. Transformers compare every token against every other token — 10 tokens means 90 pairwise comparisons, but 1,000,000 tokens means ~1 trillion. The n² problem makes large contexts enormously expensive.
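To make the arithmetic above concrete, here is a minimal sketch that counts ordered pairs of distinct tokens as n × (n − 1). The function name `pairwise_comparisons` is made up for illustration, and real attention implementations differ in detail (causal masking, for instance, roughly halves the count), so treat the numbers as an order-of-magnitude guide only.

```python
def pairwise_comparisons(n_tokens: int) -> int:
    """Count ordered pairs of distinct tokens: n * (n - 1)."""
    return n_tokens * (n_tokens - 1)


# Growing the input 100,000x grows the work by roughly 10,000,000,000x.
for n in (10, 1_000, 1_000_000):
    print(f"{n:>9,} tokens -> {pairwise_comparisons(n):,} comparisons")
```

The jump from 10 tokens to 1,000,000 tokens is where the quadratic growth bites: the input grows by five orders of magnitude, the comparisons by ten.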
What makes a good context?
This is a resource constraint problem. We want the smallest possible set of high-signal tokens that maximise the likelihood of the desired outcome. Two key components:
1. System Prompt. This is more art than engineering. You cannot hardcode too many instructions and force certain actions, but vague high-level guidance is equally useless. My approach: start with the minimal set of information needed to do the job, test on that, then slowly add instructions to refine. You mould the best version rather than expecting to nail it in one shot.
2. Tools. Tools allow agents to interact with the environment and retrieve context as needed. They need to be lean, clear, and self-contained. Ambiguous tools that do too much make it hard for the agent to use them well.
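As a rough illustration of "lean, clear, and self-contained", the sketch below contrasts two tool descriptions. `ToolSpec`, `read_file` and `manage_project` are all hypothetical names invented for this example; they do not follow any particular framework's schema.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ToolSpec:
    """Hypothetical tool description, not any specific framework's schema."""
    name: str
    description: str
    parameters: dict[str, str] = field(default_factory=dict)

    def render(self) -> str:
        """Format the tool roughly as it might be shown to the model."""
        params = ", ".join(f"{k}: {v}" for k, v in self.parameters.items())
        return f"{self.name}({params})\n  {self.description}"


# Lean: one job, one parameter, and a description that says what comes back.
read_file = ToolSpec(
    name="read_file",
    description="Return the full text of a single file.",
    parameters={"path": "Path relative to the repository root"},
)

# Ambiguous: several jobs behind one name, with a vague catch-all parameter.
manage_project = ToolSpec(
    name="manage_project",
    description="Reads, writes, searches or deploys depending on the arguments.",
    parameters={"action": "What to do", "target": "Thing to act on", "options": "Misc"},
)

print(read_file.render())
print(manage_project.render())
```

Reading the two rendered descriptions side by side, it is much easier to predict when and how an agent should call the first one.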
Context Retrieval
There is a shift in how retrieval is done. The old pattern was pre-inference retrieval: retrieve everything before the model starts.
User asks question
↓
System searches vector database / embeddings
↓
Relevant chunks inserted into prompt
↓
LLM answers using those chunks
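The flow above can be sketched in a few lines. This is a toy under stated assumptions: `embed` uses plain word counts as a stand-in for a real embedding model, an in-memory list stands in for the vector database, and `build_prompt` only assembles the text that would be sent to the LLM. All function names are invented for the example.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts (a real system would use a learned model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(question: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the question."""
    q = embed(question)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


def build_prompt(question: str, chunks: list[str], k: int = 2) -> str:
    """Insert the retrieved chunks into the prompt before the model ever runs."""
    context = "\n".join(f"- {c}" for c in retrieve(question, chunks, k))
    return f"Context:\n{context}\n\nQuestion: {question}"


CHUNKS = [
    "Login tokens are verified in the auth middleware.",
    "The frontend uses a shared button component.",
    "Database migrations are run with the prisma CLI.",
]
print(build_prompt("Where are login tokens verified?", CHUNKS, k=1))
```

Note that everything is decided before inference starts: if the retrieval step picks the wrong chunks, the model has no way to go and look for better ones.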
The newer pattern is just-in-time context. The model has references to where information lives, and fetches only what it needs during execution. If you ask a backend question, frontend files never enter the context window. This is also called progressive disclosure — the agent builds understanding step by step.
A good example of debugging a codebase with just-in-time retrieval:
ls
# → src/ tests/ prisma/ middleware.ts package.json
find src -iname "*auth*"
# → src/lib/auth.ts src/api/login/route.ts src/middleware/authGuard.ts
cat src/lib/auth.ts src/api/login/route.ts
# → only the relevant files, not the whole repo
Each step reveals a bit more context and informs the next step. Agents use not just file contents but metadata and structure to make decisions.
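The same three steps can be sketched as tool functions an agent might call. This is an illustration only: the helper names and the throwaway repository built in `demo` are invented, and unlike `find -iname` the search here matches files only, not directories.

```python
import tempfile
from pathlib import Path


def list_top_level(root: Path) -> list[str]:
    """Step 1: look around without opening anything (like `ls`)."""
    return sorted(p.name for p in root.iterdir())


def find_by_name(root: Path, needle: str) -> list[Path]:
    """Step 2: targeted, case-insensitive search on file names."""
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and needle.lower() in p.name.lower()
    )


def demo() -> dict[str, str]:
    """Build a tiny fake repo, then pull only auth-related files into 'context'."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        files = {
            "package.json": "{}",
            "src/lib/auth.ts": "export function verify() {}",
            "src/middleware/authGuard.ts": "import { verify } from '../lib/auth'",
            "src/ui/Button.tsx": "export const Button = () => null",
        }
        for rel, body in files.items():
            path = root / rel
            path.parent.mkdir(parents=True, exist_ok=True)
            path.write_text(body)

        print(list_top_level(root))
        hits = find_by_name(root / "src", "auth")
        # Step 3: read only the matching files (like `cat` on the hits).
        return {p.relative_to(root).as_posix(): p.read_text() for p in hits}


context = demo()
print(sorted(context))
```

The frontend file exists in the repository but never enters `context`, which is the whole point of fetching just in time.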
The clear drawback is latency: fetching context just in time is slower than working from pre-computed data. Anthropic’s workaround in Claude Code is a hybrid approach: certain files (CLAUDE.md) are loaded into context at the start, while tools like grep and glob handle just-in-time retrieval for everything else.
Long-Horizon Tasks
Long-horizon tasks require agents to stay coherent and keep the right context even when the total token count inevitably exceeds the context window. Three techniques help:
1. Compaction. Near the limit, the agent summarises the conversation and continues with a compressed context. Critical details — architectural decisions, unresolved bugs, implementation state — are preserved; lower-signal content is discarded. This is genuinely hard to get right. My personal approach when nearing the window: ask the agent to write the critical information to a file before compacting.
2. Agent Memory. The agent writes notes that persist across context resets. When the window clears, the memory survives. This is an area I will continue to follow — once agents can tap into memory reliably, they become substantially more capable.
3. Sub-agents. Each sub-agent has its own clean context window. A main agent delegates work, then aggregates and synthesises the results. It is clean separation of concerns, applied to context management.
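The first two techniques can be sketched together: compact when the history grows past a limit, and write the critical details to a notes file first so they survive a reset. Everything here is an assumption made for illustration: `count_tokens` counts words rather than real tokens, `summarise` is a placeholder for an LLM call that simply keeps lines tagged `DECISION:`, `BUG:` or `STATE:`, and the tags themselves are invented. Sub-agents are left out to keep the sketch short.

```python
import tempfile
from pathlib import Path


def count_tokens(text: str) -> int:
    """Crude stand-in: whitespace-separated words, not real tokens."""
    return len(text.split())


def summarise(messages: list[str]) -> str:
    """Placeholder for an LLM summary: keep only lines tagged as critical."""
    keep = [m for m in messages if m.startswith(("DECISION:", "BUG:", "STATE:"))]
    return "SUMMARY: " + " | ".join(keep)


def compact(messages: list[str], limit: int, notes_file: Path, keep_recent: int = 2) -> list[str]:
    """If over the limit, save a summary to disk and replace the old turns with it."""
    if sum(count_tokens(m) for m in messages) <= limit:
        return messages
    split = max(len(messages) - keep_recent, 0)
    old, recent = messages[:split], messages[split:]
    summary = summarise(old)
    notes_file.write_text(summary)  # memory that outlives the context window
    return [summary] + recent


def demo() -> tuple[list[str], str]:
    history = [
        "DECISION: use JWT for sessions",
        "chatter about naming things that goes on for a while",
        "BUG: login route returns 500 on empty password",
        "more low-signal discussion about formatting",
        "STATE: auth middleware half implemented",
        "latest user message",
    ]
    with tempfile.TemporaryDirectory() as tmp:
        notes = Path(tmp) / "notes.md"
        compacted = compact(history, limit=20, notes_file=notes)
        return compacted, notes.read_text()


compacted, saved_notes = demo()
print(compacted)
```

Writing the notes before discarding the old turns mirrors the manual habit described in the compaction item: the file is the safety net if the summary turns out to have dropped something important.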
Conclusion
Despite models getting larger and more capable, I think it remains important to treat context as a finite resource. That said, I would not be surprised if Context Engineering evolves into something else entirely as models continue to improve.
One thing that stuck with me after reading up on this topic: agents are modelled closely on how humans actually solve problems. The debugging example above (exploring file names, doing targeted searches, opening only what is relevant) is exactly what a good engineer does. If you are ever trying to optimise an agent, step back and think about how a human would solve the same problem. The answer often lies there.