A Concrete Example of Context Engineering — and How a Suite of Tools Makes It Tractable
A step-by-step context engineering example for a LangGraph customer support agent using GPT-4o, Pinecone, text-embedding-ada-002, and tiktoken — showing how Memory Visualizer, Token Inspector, RAG Chunk Inspector, and Agent Trace Inspector work together.
Abstract definitions of context engineering are easy to find. What is harder to find is a concrete walkthrough: here is the agent, here is the context window, here is exactly what went wrong and what was done to fix it. This post works through one.
The agent
A customer support agent built with LangGraph. It runs on GPT-4o (128k token context window). When a user sends a message, the agent:
- Retrieves relevant documentation chunks from a Pinecone index (embedded with text-embedding-ada-002)
- Looks up the user's account history from a CRM
- Calls GPT-4o with the assembled context
- Optionally calls a lookup_order tool if the query involves a specific order
The initial implementation worked well for 20–30 turns. After that, responses became repetitive, the agent started ignoring retrieved documentation, and occasionally the API returned a context length error.
The context budget, before any engineering
Here is what the context window looked like in the initial implementation at turn 30, measured with tiktoken (o200k_base):
| Layer | Allocated | Actual at turn 30 |
|---|---|---|
| System prompt | 1,800 tokens | 1,800 tokens |
| Pinecone retrieved chunks (top-8, 600 tokens each) | 4,800 tokens | 4,800 tokens |
| CRM account history injection | 2,000 tokens | 1,400 tokens |
| Conversation history (all turns) | unlimited | 52,000 tokens |
| Tool results | 1,000 tokens | 600 tokens |
| Total | — | 60,600 tokens |
The conversation history had no truncation. By turn 30 it consumed 52,000 tokens of the 128k window, an average of roughly 1,730 tokens per turn. At that growth rate the total context would pass 128k at around turn 70. The retrieved chunks were always top-8 at 600 tokens each: 4,800 tokens of retrieval regardless of whether all 8 chunks were relevant. The CRM injection was capped, but its actual token count was never verified.
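The table can be turned into a small audit that flags uncapped layers and sums the window usage. This is a minimal sketch: the `Layer` class and `audit` function are illustrative names, not part of any tool mentioned in this post, and the numbers are copied from the table above.

```python
from dataclasses import dataclass
from typing import Optional

CONTEXT_WINDOW = 128_000  # GPT-4o context window, in tokens


@dataclass(frozen=True)
class Layer:
    name: str
    allocated: Optional[int]  # None means no cap was configured
    actual: int               # tokens measured at turn 30


# Values copied from the table above.
LAYERS = [
    Layer("system_prompt", 1_800, 1_800),
    Layer("retrieved_chunks", 4_800, 4_800),
    Layer("crm_history", 2_000, 1_400),
    Layer("conversation_history", None, 52_000),
    Layer("tool_results", 1_000, 600),
]


def audit(layers, window=CONTEXT_WINDOW):
    """Summarise window usage and list layers with no cap or over their cap."""
    total = sum(layer.actual for layer in layers)
    return {
        "total": total,
        "share_of_window": total / window,
        "unbounded": [l.name for l in layers if l.allocated is None],
        "over_budget": [
            l.name for l in layers
            if l.allocated is not None and l.actual > l.allocated
        ],
    }


report = audit(LAYERS)
print(report)
```

Running an audit like this once per release makes an uncapped layer visible before it shows up as a context length error.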
The context engineering decisions
1. Token budget design. Before touching code, the budget was redesigned:
| Layer | New budget |
|---|---|
| System prompt | 1,800 tokens |
| Pinecone retrieved chunks | 2,400 tokens (top-4, 600 tokens each) |
| CRM injection | 1,200 tokens (trimmed) |
| Conversation history | 8,000 tokens (sliding window, last N turns) |
| Summarization buffer | 2,000 tokens (compressed older turns) |
| Tool results | 800 tokens |
| Reserved for response | 4,000 tokens |
| Total | 20,200 tokens |
The window went from filling unpredictably to a stable 20,200 tokens per call — about 16% of the 128k window, leaving substantial headroom for burst turns and longer tool results.
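One way to keep a budget like this enforceable is to hold it as data and check measured counts against it. The sketch below assumes hypothetical names (`BUDGET`, `check_budget`); only the numbers come from the table.

```python
CONTEXT_WINDOW = 128_000  # GPT-4o context window, in tokens

# Per-layer budget, copied from the table above.
BUDGET = {
    "system_prompt": 1_800,
    "retrieved_chunks": 2_400,      # top-4 x 600 tokens
    "crm_injection": 1_200,
    "conversation_history": 8_000,
    "summarization_buffer": 2_000,
    "tool_results": 800,
    "reserved_for_response": 4_000,
}


def check_budget(measured, budget=BUDGET):
    """Return the names of layers whose measured token count exceeds budget."""
    return [name for name, tokens in measured.items() if tokens > budget[name]]


total = sum(BUDGET.values())
share = total / CONTEXT_WINDOW
print(f"{total} tokens, {share:.1%} of the window")

# Example: a tool result that came back larger than planned.
print(check_budget({"tool_results": 900, "crm_injection": 800}))
```

Keeping the budget in one place also means the total and the share of the window are recomputed whenever a layer changes, instead of being maintained by hand.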
2. Retrieval precision. Top-k was reduced from 8 to 4, and the 600-token chunks were checked against meaningful section boundaries. The RAG Chunk Inspector showed that several chunks were split mid-sentence by the token boundary, losing context. Switching to sentence-boundary chunking at ~500 tokens fixed this without increasing the chunk count.
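Sentence-boundary chunking can be sketched as a greedy packer that never splits inside a sentence. This is an illustration under assumptions, not the chunker used in the project: the regex sentence splitter is deliberately naive, and `count_tokens` is a whitespace placeholder that would be replaced by a real tokenizer count in practice.

```python
import re


def count_tokens(text):
    # Placeholder: whitespace word count standing in for a real tokenizer.
    return len(text.split())


def chunk_by_sentence(text, max_tokens=500, count=count_tokens):
    """Greedily pack whole sentences into chunks of at most max_tokens.

    A single sentence longer than max_tokens becomes its own oversized
    chunk rather than being cut mid-sentence.
    """
    # Naive split: break after ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current, current_tokens = [], [], 0
    for sentence in sentences:
        n = count(sentence)
        if current and current_tokens + n > max_tokens:
            chunks.append(" ".join(current))
            current, current_tokens = [], 0
        current.append(sentence)
        current_tokens += n
    if current:
        chunks.append(" ".join(current))
    return chunks


doc = "One two three. Four five six. Seven eight nine."
print(chunk_by_sentence(doc, max_tokens=6))
```

Because the chunk size is a ceiling rather than a fixed length, chunks come out slightly uneven, which is the trade-off for keeping every sentence intact.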
3. History truncation + summarization. The sliding window keeps the last 6 turns verbatim (roughly 1,800 tokens at 300 tokens/turn average). Older turns are compressed into a summarization buffer by a second GPT-4o call that produces a 150-token summary of each group of 10 older turns.
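The split between verbatim turns and summarized groups can be sketched as below. The function names are hypothetical, `stub_summarize` stands in for the second GPT-4o call, and summarizing a final partial group (fewer than 10 turns) is an assumption the post does not spell out.

```python
def build_history(turns, summarize, keep_last=6, group_size=10):
    """Keep the last keep_last turns verbatim and summarize older ones.

    Older turns are grouped in batches of group_size; the last batch may
    be smaller (an assumption made for this sketch).
    """
    split = max(len(turns) - keep_last, 0)
    older, recent = turns[:split], turns[split:]
    groups = [older[i:i + group_size] for i in range(0, len(older), group_size)]
    return {"summaries": [summarize(g) for g in groups], "recent": recent}


def stub_summarize(group):
    # Stand-in for the model call that would produce a ~150-token summary.
    return f"[summary of {len(group)} turns]"


turns = [f"turn {i}" for i in range(1, 31)]
history = build_history(turns, stub_summarize)
print(history["summaries"])
print(history["recent"])
```

In a real deployment the summaries of completed groups would normally be cached, so that each call only pays for summarizing turns that have newly left the window.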
4. CRM injection trimming. The raw CRM output was 2,000 tokens because it included fields never referenced in responses. Selecting only the 8 relevant fields reduced it to 800 tokens consistently.
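Field selection of this kind is usually an allowlist applied before serialization. The sketch below uses hypothetical field names, since the post does not list the 8 fields that were kept.

```python
import json

# Hypothetical field names; the post does not list the 8 fields it kept.
RELEVANT_FIELDS = (
    "customer_id", "plan", "account_status", "signup_date",
    "open_tickets", "last_order_id", "last_contact", "region",
)


def trim_crm_record(record, fields=RELEVANT_FIELDS):
    """Serialize only the allowlisted fields, in a compact JSON form."""
    trimmed = {key: record[key] for key in fields if key in record}
    return json.dumps(trimmed, separators=(",", ":"))


raw = {
    "customer_id": "C-1001",
    "plan": "pro",
    "account_status": "active",
    "marketing_opt_in": True,       # never referenced in responses
    "internal_notes": "n/a",        # never referenced in responses
}
print(trim_crm_record(raw))
```

An allowlist fails safe: a new field added to the CRM later stays out of the context until someone decides it belongs there.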
Verifying the design with tooling
Before deploying the revised architecture, three things were checked:
Memory Visualizer — The revised layer configuration was modelled in ContextIQ's Memory Architecture Visualizer with a GPT-4o 128k context limit. The turn-by-turn simulation confirmed the window would stay under 25,000 tokens indefinitely, with the summarization buffer activating after turn 6 and keeping history stable.
Token Inspector — The Token Inspector was used to measure the actual token count of the system prompt, the CRM injection, and a representative Pinecone chunk using tiktoken o200k_base — the same tokenizer GPT-4o uses. The measured values matched the budget within 5%.
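The same within-5% check can be scripted. In this sketch `within_tolerance` and `verify_layers` are illustrative names, and the `count` argument is whatever token counter is in use; with tiktoken installed, that could be `lambda s: len(tiktoken.get_encoding("o200k_base").encode(s))`. The whitespace counter here is only a stand-in so the example runs without dependencies.

```python
def within_tolerance(measured, budgeted, tolerance=0.05):
    """True if measured is within +/- tolerance (as a fraction) of budgeted."""
    return abs(measured - budgeted) <= tolerance * budgeted


def verify_layers(texts, budgets, count, tolerance=0.05):
    """Measure each layer's text and report (token_count, ok) per layer."""
    report = {}
    for name, text in texts.items():
        measured = count(text)
        report[name] = (measured, within_tolerance(measured, budgets[name], tolerance))
    return report


def word_count(text):
    # Stand-in counter; swap in a real tokenizer for meaningful numbers.
    return len(text.split())


# Toy inputs sized to make the check visible; not real measurements.
texts = {"system_prompt": "word " * 98, "chunk": "word " * 80}
budgets = {"system_prompt": 100, "chunk": 60}
print(verify_layers(texts, budgets, word_count))
```

A symmetric tolerance also catches layers that come in well under budget, which is a hint that the budget itself could be tightened.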
Agent Trace Inspector — After deploying to a staging environment, OTLP traces were exported from LangSmith and loaded into the Agent Trace Inspector. The gen_ai.usage.input_tokens values on the ResearchAgent spans confirmed that actual input token counts matched the designed budget. The lookup_order tool spans showed a consistent 200–400 token result payload, well within the 800-token tool result budget.
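The same check can be scripted against an exported trace. This sketch assumes the export is in OTLP/JSON form (`resourceSpans` > `scopeSpans` > `spans`, with attributes as key/value pairs); the sample payload and the 20,200-token limit used in the comparison are illustrative, not real trace data.

```python
def input_tokens_by_span(otlp, span_name):
    """Collect gen_ai.usage.input_tokens from spans with the given name."""
    counts = []
    for resource_spans in otlp.get("resourceSpans", []):
        for scope_spans in resource_spans.get("scopeSpans", []):
            for span in scope_spans.get("spans", []):
                if span.get("name") != span_name:
                    continue
                for attr in span.get("attributes", []):
                    if attr.get("key") != "gen_ai.usage.input_tokens":
                        continue
                    value = attr.get("value", {})
                    # OTLP/JSON encodes 64-bit ints as strings; accept both.
                    raw = value.get("intValue", value.get("doubleValue"))
                    if raw is not None:
                        counts.append(int(raw))
    return counts


# Hypothetical sample in OTLP/JSON shape.
sample = {"resourceSpans": [{"scopeSpans": [{"spans": [
    {"name": "ResearchAgent", "attributes": [
        {"key": "gen_ai.usage.input_tokens", "value": {"intValue": "18200"}}]},
    {"name": "lookup_order", "attributes": []},
    {"name": "ResearchAgent", "attributes": [
        {"key": "gen_ai.usage.input_tokens", "value": {"intValue": 18900}}]},
]}]}]}

counts = input_tokens_by_span(sample, "ResearchAgent")
print(counts, max(counts) <= 20_200)
```

Checking the maximum rather than the mean is the stricter test, since a budget is only useful if the worst call stays inside it.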
The outcome
Average input tokens per call dropped from ~28,000 (measured across 100 production calls before the change) to ~18,500, a 34% reduction. At 15,000 calls/day on GPT-4o at $2.50/1M input tokens, the 9,500 tokens saved per call come to about $356/day, or roughly $10,700 over a 30-day month. The context length errors stopped. Response quality at turn 30+ improved because the agent was no longer working with a context window dominated by old, low-relevance conversation turns.
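The savings figure depends on four inputs stated above, so it is easy to re-derive when any of them changes. A minimal sketch, assuming a 30-day month (the post does not state the month length); the function name is illustrative.

```python
def monthly_input_savings(tokens_before, tokens_after, calls_per_day,
                          usd_per_million, days=30):
    """Dollar savings on input tokens over `days` days."""
    saved_per_call = tokens_before - tokens_after
    saved_tokens = saved_per_call * calls_per_day * days
    return saved_tokens / 1_000_000 * usd_per_million


reduction = (28_000 - 18_500) / 28_000
savings = monthly_input_savings(28_000, 18_500, 15_000, 2.50)
print(f"{reduction:.0%} fewer input tokens, ${savings:,.2f} per 30-day month")
```

This covers input tokens only; output tokens and the extra summarization calls would need their own line items in a full cost comparison.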
The underlying prompt did not change. The model did not change. The improvement came entirely from context engineering: designing the budget, validating the design, and measuring the outcome in production traces.