Context Engineering Replaced Prompt Engineering — Here's Why It Matters

The Thoughtworks Technology Radar, Volume 33 (November 2025), promoted "context engineering" from a niche concern to an adopted technique. Easy to miss among the usual Radar noise. Worth paying attention to.

The distinction is simple. Prompt engineering is what you do inside a model's context window -- phrasing, few-shot examples, system instructions. Context engineering is what you do to decide what enters that window in the first place. One is copywriting. The other is plumbing, and it turns out the plumbing is harder.

Why the rename is not just semantics

I run production agentic systems at a retail-data platform serving tier-1 grocers globally. LangGraph orchestration graphs, Pydantic-validated tool schemas, pgvector for retrieval, LoRA-tuned models for domain-specific tasks. When something goes wrong in these systems, I almost never trace the failure back to a badly worded prompt. The context was wrong. The model saw the wrong things, or too many things, or the right things without enough structure to reason over them.

A prompt tells the model what to do. Context tells it what's true right now. Getting that second part right -- at inference time, for the specific decision being made, with traceable provenance -- is the actual engineering problem. Thoughtworks naming it is useful because naming things lets teams budget for them.

What this looks like in production

In practice, context engineering breaks into four concerns that rarely show up in conference talks.

Retrieval beyond similarity search

Standard RAG -- embed documents, cosine similarity, stuff the top-k chunks into the prompt -- works for Q&A chatbots. It falls over for anything that requires reasoning about relationships between entities. If an agent needs to check whether a supplier contract clause conflicts with a procurement policy, it needs the relevant subgraph from a knowledge base, not five paragraphs that happen to score well on embedding distance. We run structured retrieval over Neo4j alongside pgvector. The vector store finds candidate nodes. The graph query retrieves the neighbourhood that makes those nodes meaningful.
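The vector-then-graph pattern can be sketched in a few lines. This is an illustrative shape, not our production code: the two callables stand in for a pgvector similarity query and a Neo4j neighbourhood query, and the names and signatures are assumptions.

```python
from typing import Callable

def retrieve_subgraph(
    question: str,
    vector_search: Callable[[str, int], list[str]],  # stand-in for a pgvector query; returns candidate node ids
    graph_expand: Callable[[str], list[dict]],       # stand-in for a Neo4j neighbourhood query
    top_k: int = 5,
) -> list[dict]:
    """Find candidate nodes by embedding distance, then pull the graph
    neighbourhood that makes each candidate meaningful."""
    seen: set[str] = set()
    context: list[dict] = []
    for node_id in vector_search(question, top_k):
        for record in graph_expand(node_id):
            if record["id"] not in seen:  # neighbourhoods overlap; de-duplicate
                seen.add(record["id"])
                context.append(record)
    return context
```

The point of the shape: embedding distance only nominates entry points, and the graph query decides what context actually ships.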

Provenance is not optional

Every piece of context that enters a model window should be traceable back to its source, version, and the retrieval path that selected it. This is not compliance theatre. Under the EU AI Act and FCA expectations for algorithmic decision making, you need to reconstruct why a system made a specific recommendation six months after it made it. If you can't answer "what did the model see when it produced that output?", you don't have explainability. You have a chatbot with a disclaimer page.
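A minimal sketch of what provenance-tagged context can look like, assuming each chunk carries its source, version, and retrieval path. The field names and fingerprint scheme are illustrative, not a standard schema.

```python
from dataclasses import dataclass
import hashlib

@dataclass(frozen=True)
class ContextChunk:
    text: str
    source: str          # e.g. "contracts/supplier-042.pdf" (hypothetical path)
    version: str         # source document version or commit
    retrieval_path: str  # e.g. "pgvector:cosine -> neo4j:2-hop"

    def fingerprint(self) -> str:
        """Stable hash so an audit six months later can confirm exactly
        what the model saw, and how it got there."""
        payload = f"{self.source}@{self.version}|{self.retrieval_path}|{self.text}"
        return hashlib.sha256(payload.encode()).hexdigest()[:16]
```

Log the fingerprints alongside each model call and "what did the model see?" becomes a query, not an archaeology project.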

Context windows are a budget

A 128k-token window is not an invitation to dump everything in. Every token carries a compute cost and a marginal information value, and those two curves cross sooner than most teams expect. We track context usage the same way we track cloud spend: what went in, what did it cost per query, did the extra context actually change the output? In most of our workflows, a carefully assembled 8k-token context outperforms a naively stuffed 40k-token one. Less noise. Faster inference. Lower bill. Better answers.
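Treating the window as a budget can be as simple as ranking candidate chunks by estimated value per token and packing greedily until the budget is spent. A sketch, with an assumed relevance score and a crude whitespace token count standing in for a real tokenizer:

```python
def pack_context(chunks: list[tuple[str, float]], budget_tokens: int) -> list[str]:
    """chunks: (text, relevance_score) pairs. Greedy value-per-token packing."""
    def cost(text: str) -> int:
        # Crude stand-in for a real tokenizer: count whitespace-separated words.
        return len(text.split())

    # Highest marginal value per token first.
    ranked = sorted(chunks, key=lambda c: c[1] / max(cost(c[0]), 1), reverse=True)
    packed, spent = [], 0
    for text, _score in ranked:
        if spent + cost(text) <= budget_tokens:
            packed.append(text)
            spent += cost(text)
    return packed
```

The useful part is the discipline, not the algorithm: every chunk has to justify its token cost against an explicit budget instead of riding in by default.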

Tool-call accuracy at depth

This is where agents actually break. A tool call with 95% accuracy per step sounds fine. Compound it over a 20-step workflow and the probability that every step is correct drops to 36%. By step 40: 13%. You cannot prompt your way out of compound error rates. It's a systems design problem. We use planner-executor architectures -- a planning node decomposes the task, executor nodes handle individual steps, and validation gates sit between them. Each executor gets the narrowest context slice that still lets it do its job. Fail fast, roll back, log everything.
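The compounding arithmetic above is just an exponent, which is exactly why it is so unforgiving:

```python
def chain_success(p: float, steps: int) -> float:
    """Probability that every step of a chain succeeds, given per-step accuracy p."""
    return p ** steps

print(round(chain_success(0.95, 20), 2))  # 0.36
print(round(chain_success(0.95, 40), 2))  # 0.13
```

Pushing per-step accuracy from 95% to 99% lifts the 20-step figure from 36% to about 82%, which is why validation gates between steps buy more than another round of prompt polishing.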

The vocabulary gap

Hiring interviews and vendor calls have taught me that a tiered vocabulary is forming around AI in 2026. It's a surprisingly honest signal.

Table stakes: RAG, guardrails, human-in-the-loop, multi-agent, MCP. These tell me you've followed the conversation. That's fine. Necessary, not sufficient.

Differentiating: context engineering, the A2A vs MCP distinction (agent-to-agent negotiation versus agent-to-tool invocation -- they're different protocols solving different problems), autonomy budgets, blast radius, decision rights, orchestration graphs, topology-aware scheduling. These tell me you've shipped something and dealt with the consequences.

The tell is specific. Anyone who talks about agents without mentioning failure modes, rollback, observability, or cost is presenting a demo. Anyone who says "tool use, context engineering, autonomy budgets, blast radius, and we measure tool-call accuracy at depth" has run something in production. You can usually tell which group has been paged at 2am about it.

What this means for engineering leaders

If you're making AI platform decisions in 2026, "should we do AI?" is no longer the question. That conversation ended about eighteen months ago. The question is whether you're treating the context layer as infrastructure -- with its own tests, its own observability, its own cost tracking, its own team -- or whether it's a prompt template in a Git repo that someone wrote during a hackathon.

Infrastructure looks like: a context assembly pipeline, a retrieval layer that understands your domain's data topology (not just a vector store bolted on), provenance metadata flowing from source through retrieval through inference to output, and autonomy boundaries that are configured and enforced rather than assumed.
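What "a context assembly pipeline" means mechanically: ordered stages (retrieve, filter, budget, attach provenance), each a plain function you can test and observe on its own. A deliberately small sketch; the stage names are illustrative.

```python
from typing import Callable

Stage = Callable[[dict], dict]

def run_pipeline(query: str, stages: list[Stage]) -> dict:
    """Run assembly stages in order, recording which stages ran --
    provenance for the assembly process itself, not just the documents."""
    state = {"query": query, "chunks": [], "trace": []}
    for stage in stages:
        state = stage(state)
        state["trace"].append(stage.__name__)
    return state
```

Because each stage is an ordinary function, it gets ordinary infrastructure treatment: unit tests, metrics per stage, and a trace that shows exactly how a given context was assembled.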

The models will keep getting better on their own. The context layer won't. That's your problem to solve, and it's also the part of the stack that's genuinely yours -- nobody can copy it from a tutorial.