AI Agent Memory Architecture: How LangGraph, CrewAI, AutoGen, and OpenAI Agents SDK Manage Context

Memory is the hardest part of building production AI agents — not because the concepts are complicated, but because the tradeoffs are invisible until your agent starts hallucinating facts it should have known, or burning your entire token budget on stale context.

Most agent frameworks give you the primitives to build memory. What they don't give you is a clear model for which kind of memory to use for each part of your agent's state, or how those memory stores interact at runtime. This post covers both.

The Four Memory Layers

Agent memory can be decomposed into four functionally distinct layers. The names are borrowed from cognitive psychology's taxonomy of human memory, and the analogy holds up because the same tradeoffs (speed vs. capacity, precision vs. fuzziness, volatile vs. persistent) apply to both biological and artificial systems.

Layer	What it holds	Storage	Persistence
Working memory	Active context — current messages, tool outputs, scratchpad	Context window	Ephemeral (per-turn)
Episodic memory	Past interactions and events retrieved by similarity	Vector database	Persistent
Semantic memory	World knowledge — documents, KB, product data	Vector database / BM25	Persistent
Procedural memory	How to act — tool definitions, system prompt, few-shot examples	System prompt / tool registry	Static per-deployment

Each layer has a different token cost, retrieval latency, and staleness profile. The job of a memory architecture is to compose these four layers so the agent has exactly what it needs in the context window — no more, no less.

Working Memory: The Context Window Budget

Working memory is everything currently in the context window: the system prompt, conversation history, tool call results, and any retrieved content. It is the only memory the model can directly read. Everything else has to be retrieved into working memory before the model can use it.

The hard constraint is the context window limit. GPT-4o supports 128k tokens; Claude 3.5 Sonnet supports 200k; Gemini 1.5 Pro supports 2M. These are ceilings, not targets — filling the context window with every message in a long conversation is the most common cause of degraded agent quality and ballooning costs.

Token budget allocation is the core design decision for working memory. A typical production allocation for a task-oriented agent:

System prompt + tool definitions: 1,000–3,000 tokens
Retrieved episodic context: 500–2,000 tokens
Retrieved semantic context (RAG): 1,000–4,000 tokens
Conversation history (sliding window): 2,000–8,000 tokens
Current turn + scratchpad: 500–2,000 tokens

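The total working memory footprint determines your per-request cost. At 10,000 requests/day, the difference between a 4,000-token and an 8,000-token context is about 1.2 billion extra input tokens per month. That works out to roughly $70–$300/month at the cheapest input prices (about $0.06–$0.25 per million tokens) and to several thousand dollars per month at prices of a few dollars per million tokens.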
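The arithmetic behind that kind of estimate can be sketched in a few lines. The per-million-token prices below are placeholders chosen for illustration, not quotes for any particular model; substitute your provider's current input price.

```python
def monthly_context_cost_delta(
    small_ctx_tokens: int,
    large_ctx_tokens: int,
    requests_per_day: int,
    usd_per_million_input_tokens: float,
    days_per_month: int = 30,
) -> float:
    """Extra monthly input-token spend caused by using the larger context."""
    extra_per_request = large_ctx_tokens - small_ctx_tokens
    extra_per_month = extra_per_request * requests_per_day * days_per_month
    return extra_per_month / 1_000_000 * usd_per_million_input_tokens


# 4,000 extra tokens x 10,000 requests/day x 30 days = 1.2 billion tokens/month.
for price in (0.10, 0.25, 2.50):  # hypothetical USD prices per 1M input tokens
    delta = monthly_context_cost_delta(4_000, 8_000, 10_000, price)
    print(f"${price:.2f} per 1M tokens -> ${delta:,.0f} per month")
```

Because the extra volume is 1,200 million tokens per month, every $0.10 per million tokens of input price adds about $120 to the monthly bill. Output tokens are priced separately and are not included here.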

Episodic Memory: Remembering Past Interactions

Episodic memory stores what happened — past conversations, decisions, and outcomes — in a form that can be retrieved by semantic similarity. The canonical implementation is a vector database (Pinecone, Weaviate, pgvector) where each past turn or summarized interaction is stored as an embedding.

At query time, the agent embeds the current user message, retrieves the top-k most similar past interactions, and injects them into working memory as context. This gives the agent continuity across sessions without having to replay the entire conversation history.

The retrieval quality problem is the main failure mode here. Cosine similarity retrieval is good at finding semantically related content, but it has no notion of recency, user identity, or task relevance. A user asking "what was the plan we discussed?" might retrieve a planning conversation from three months ago while missing a directly relevant one from last week that happened to use different vocabulary.

Production episodic memory typically combines dense retrieval (embeddings) with metadata filters (user ID, session ID, recency window) to constrain the search space before ranking by similarity.
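A minimal, in-memory sketch of that filter-then-rank pattern is shown below. The Episode record, the retrieve_episodes function, and the two-dimensional embeddings are all hypothetical stand-ins; in a real deployment the filter would normally be pushed down to the vector database rather than applied in Python.

```python
import math
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Episode:
    user_id: str
    created_at: datetime
    text: str
    embedding: list[float]


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_episodes(store, query_embedding, user_id, now, max_age_days=30, k=3):
    """Filter by user and recency first, then rank the survivors by similarity."""
    cutoff = now - timedelta(days=max_age_days)
    candidates = [e for e in store if e.user_id == user_id and e.created_at >= cutoff]
    candidates.sort(key=lambda e: cosine(query_embedding, e.embedding), reverse=True)
    return candidates[:k]


now = datetime(2025, 1, 31)
store = [
    # Best raw similarity, but three months old: excluded by the recency window.
    Episode("u1", datetime(2024, 10, 30), "Q3 launch plan", [1.0, 0.0]),
    # Different vocabulary (lower similarity), but recent and from the same user.
    Episode("u1", datetime(2025, 1, 24), "Revised rollout schedule", [0.8, 0.6]),
    # Another user's conversation: excluded by the user filter.
    Episode("u2", datetime(2025, 1, 28), "Other user's plan", [1.0, 0.0]),
]
hits = retrieve_episodes(store, [1.0, 0.0], user_id="u1", now=now)
print([e.text for e in hits])
```

Filtering before ranking is what keeps the three-month-old planning conversation out of the results even though it has the highest raw similarity score.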

Semantic Memory: The Knowledge Base

Semantic memory is the agent's long-term knowledge — product documentation, internal wikis, domain knowledge, or any structured or unstructured content the agent needs to answer questions correctly.

This is what most engineers mean when they say "RAG." The retrieval pattern is the same as episodic memory (embed, search, inject), but the content is static or slowly changing reference material rather than interaction history.

Chunking strategy matters more than most teams realize. A document chunked at 512 tokens with 50-token overlap behaves very differently from the same document chunked at sentence boundaries with metadata preserved. For technical documentation, semantic chunking (splitting on headings and natural section breaks) tends to give better retrieval precision than fixed-size chunking, because each chunk stays on a single topic instead of straddling two sections.
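To make the difference concrete, here is a rough sketch of both strategies. It assumes Markdown-style "#" headings in the source documents and counts whitespace-separated words as a stand-in for tokens; both assumptions are for illustration only, and a real pipeline would use the model's tokenizer.

```python
import re


def chunk_fixed(text: str, size: int = 512, overlap: int = 50) -> list[str]:
    """Fixed-size chunks measured in words (a rough proxy for tokens)."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def chunk_by_heading(text: str) -> list[dict]:
    """Split on Markdown-style headings, keeping each heading as metadata."""
    chunks = []
    heading, body = "(preamble)", []

    def flush():
        content = "\n".join(body).strip()
        if content:
            chunks.append({"heading": heading, "text": content})

    for line in text.splitlines():
        match = re.match(r"^#{1,6}\s+(.*)", line)
        if match:
            flush()
            heading, body = match.group(1).strip(), []
        else:
            body.append(line)
    flush()
    return chunks


doc = "# Install\npip install foo\n\n# Usage\nimport foo\nfoo.run()\n"
print(chunk_by_heading(doc))
print(chunk_fixed("w0 w1 w2 w3 w4 w5 w6 w7 w8 w9", size=4, overlap=1))
```

The heading-based chunks carry their section title with them, which can be stored as metadata and used for filtering or shown to the model alongside the text.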

Weaviate, Pinecone, Qdrant, and pgvector are the most common backends. For teams already on Supabase, pgvector + IVFFlat index gives good enough recall for corpora under ~1M chunks without adding a separate service.

Procedural Memory: The Agent's Skillset

Procedural memory encodes how the agent should act — its personality, constraints, available tools, and worked examples. Unlike the other layers, procedural memory is typically static per deployment and lives in the system prompt and tool registry.

Tool definitions are part of procedural memory. Every tool you register adds tokens to every request, whether the tool is used or not. A function-calling agent with 20 registered tools is paying for those definitions on every turn. For agents with large tool sets, dynamic tool selection — retrieving the relevant tool subset based on the current task before building the context window — can significantly reduce per-request token costs.
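Dynamic tool selection can be sketched as a small ranking step that runs before the request is assembled. The example below scores tools by keyword overlap with the task, purely as a dependency-free stand-in for embedding similarity; the Tool record, the registry contents, and select_tools are hypothetical.

```python
import re
from dataclasses import dataclass

STOPWORDS = {"a", "an", "the", "for", "to", "by", "of", "up"}


@dataclass(frozen=True)
class Tool:
    name: str
    description: str


def _words(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower())) - STOPWORDS


def select_tools(task: str, registry: list[Tool], k: int = 3) -> list[Tool]:
    """Return up to k tools whose name or description overlaps the task."""
    task_words = _words(task)
    scored = [
        (len(task_words & _words(f"{tool.name} {tool.description}")), tool)
        for tool in registry
    ]
    relevant = [(score, tool) for score, tool in scored if score > 0]
    relevant.sort(key=lambda pair: pair[0], reverse=True)
    return [tool for _, tool in relevant[:k]]


registry = [
    Tool("search_orders", "Look up a customer's order status by order ID"),
    Tool("issue_refund", "Issue a refund for an order"),
    Tool("get_weather", "Get the weather forecast for a city"),
    Tool("send_email", "Send an email to a customer"),
]
selected = select_tools("Customer wants a refund for order 1234", registry, k=2)
print([tool.name for tool in selected])
```

Only the selected subset would then be serialized into the request, so the unrelated weather tool never costs any tokens on this turn. The tradeoff is a new failure mode: if the selector misses a tool, the model cannot call it.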

Few-shot examples are the highest-leverage form of procedural memory for steering output format and reasoning style. A system prompt with three well-chosen examples of correct tool use often steers behavior more reliably than elaborate prose instructions.

How Frameworks Implement These Layers

LangGraph

LangGraph models agent state as a typed graph where each node can read from and write to a shared state object. Memory is explicit: you define what lives in state (working memory), what gets checkpointed between runs (episodic), and what's injected from external stores (semantic).

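MemorySaver is the built-in in-memory checkpointer: it persists conversation state between invocations on the same thread_id, but only for the lifetime of the process. For state that survives restarts, LangGraph provides database-backed checkpointers such as SqliteSaver and PostgresSaver. For semantic memory, most LangGraph applications use a retriever node that calls out to a vector store before the main reasoning node. The explicit state graph makes it easy to reason about what the model sees at each step, which is a significant advantage over implicit memory approaches.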
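The shape of this design (explicit shared state, a retriever node ahead of the reasoning node, and a checkpointer keyed by thread) can be illustrated without any framework. The sketch below is deliberately not LangGraph's API: AgentState, InMemoryCheckpointer, invoke, and the KNOWLEDGE dictionary are hypothetical names used only to show the data flow.

```python
from typing import TypedDict


class AgentState(TypedDict):
    messages: list[str]   # working memory: the conversation so far
    retrieved: list[str]  # semantic memory injected for the current turn


KNOWLEDGE = {"refund": "Refunds are processed within 5 business days."}  # stand-in store


def retrieve(state: AgentState) -> dict:
    """Retriever node: look up facts relevant to the latest message."""
    question = state["messages"][-1].lower()
    return {"retrieved": [fact for key, fact in KNOWLEDGE.items() if key in question]}


def reason(state: AgentState) -> dict:
    """Reasoning node: a stub that answers from whatever was retrieved."""
    context = " ".join(state["retrieved"]) or "No relevant documents found."
    return {"messages": state["messages"] + [f"assistant: {context}"]}


class InMemoryCheckpointer:
    """Keeps one state per thread ID; contents are lost when the process exits."""

    def __init__(self) -> None:
        self._threads: dict[str, AgentState] = {}

    def load(self, thread_id: str) -> AgentState:
        return self._threads.get(thread_id, {"messages": [], "retrieved": []})

    def save(self, thread_id: str, state: AgentState) -> None:
        self._threads[thread_id] = state


def invoke(user_message: str, thread_id: str, checkpointer, nodes=(retrieve, reason)):
    state = checkpointer.load(thread_id)
    state = {**state, "messages": state["messages"] + [f"user: {user_message}"]}
    for node in nodes:
        state = {**state, **node(state)}  # each node returns a partial update
    checkpointer.save(thread_id, state)
    return state


checkpointer = InMemoryCheckpointer()
first = invoke("How long does a refund take?", "thread-1", checkpointer)
second = invoke("Thanks!", "thread-1", checkpointer)
print(second["messages"])
```

Because every node reads and writes the same typed state, the exact content the reasoning step sees on any turn can be inspected by printing the state between nodes.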

CrewAI

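CrewAI provides a higher-level abstraction where memory is switched on for the crew and exposed through ShortTermMemory, LongTermMemory, and EntityMemory types. Short-term memory holds recent interactions from the current execution; long-term memory carries learnings across executions; entity memory tracks the entities encountered along the way. All three live in stores outside the context window (per the CrewAI documentation, embedding-backed retrieval for short-term and entity memory and a SQLite database for long-term memory), and relevant items are retrieved into the prompt when a task runs.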

The EntityMemory abstraction is particularly useful for multi-agent workflows — it extracts and stores named entities (people, companies, concepts) encountered during task execution and makes them available to other agents in the crew. This gives crews a shared semantic layer without requiring explicit retrieval calls in every agent.

AutoGen

AutoGen's ConversableAgent maintains conversation history internally. In the newer AgentChat API (v0.4 and later), AssistantAgent also accepts a memory parameter that can be pointed at a custom memory backend. AutoGen's GroupChat pattern, in which multiple agents coordinate through a shared chat manager, requires careful memory design because each agent keeps its own copy of the message history, so what an agent sees depends on which messages were broadcast to it.

For long-running AutoGen workflows, a common pattern is to attach a RetrieveUserProxyAgent as a dedicated memory retrieval agent that fields content lookups from the task agents, keeping the retrieval concern separate from the reasoning concern.

OpenAI Agents SDK

The OpenAI Agents SDK takes a minimal approach: agents have an instructions string (procedural memory) and access to tools (also procedural). Working memory is managed by the Runner, which maintains the thread and handles the turn loop. There is no built-in episodic or semantic memory; these are expected to be provided via tool calls (search_knowledge_base, get_user_history, etc.).

This is the most explicit approach: memory is just a tool, and the agent decides when to call it. The tradeoff is that good memory usage requires the model to reason correctly about when retrieval is needed — which adds a failure mode that frameworks like LangGraph and CrewAI handle more structurally.
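The "memory is just a tool" idea can be sketched independently of any SDK. Below, the two tool names from the paragraph above are implemented as plain functions over hypothetical in-memory data, and run_tool_call is an assumed dispatcher that executes whatever call the model requests. None of this is the Agents SDK's own API.

```python
import json

# Hypothetical stand-ins for a real knowledge base and interaction log.
KNOWLEDGE_BASE = {
    "returns": "Items can be returned within 30 days of delivery.",
    "shipping": "Standard shipping takes 3 to 5 business days.",
}
USER_HISTORY = {
    "user-42": ["Asked about shipping times", "Ordered a standing desk"],
}


def search_knowledge_base(query: str) -> str:
    """Semantic memory as a tool (naive keyword match instead of vector search)."""
    hits = [text for topic, text in KNOWLEDGE_BASE.items() if topic in query.lower()]
    return "\n".join(hits) or "No matching articles."


def get_user_history(user_id: str, limit: int = 3) -> str:
    """Episodic memory as a tool: the most recent events for one user."""
    events = USER_HISTORY.get(user_id, [])
    recent = events[-limit:] if limit > 0 else []
    return "\n".join(recent) or "No history for this user."


MEMORY_TOOLS = {fn.__name__: fn for fn in (search_knowledge_base, get_user_history)}


def run_tool_call(name: str, arguments_json: str) -> str:
    """Execute a model-requested tool call; errors go back to the model as text."""
    if name not in MEMORY_TOOLS:
        return f"error: unknown tool {name!r}"
    try:
        return MEMORY_TOOLS[name](**json.loads(arguments_json))
    except (json.JSONDecodeError, TypeError) as exc:
        return f"error: bad arguments for {name}: {exc}"


print(run_tool_call("search_knowledge_base", '{"query": "What is your returns policy?"}'))
print(run_tool_call("get_user_history", '{"user_id": "user-42", "limit": 1}'))
```

Nothing is retrieved unless the model emits a call, which is exactly where the extra failure mode comes from: a model that does not realize it needs history will simply answer without it.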

Designing a Memory Architecture

The four-layer model is a useful starting point, but production architectures almost always involve context-specific decisions about what to retrieve, when, and at what cost.

A few principles that hold across frameworks:

Retrieve late, not early. Build the working memory context as close to the model call as possible, using the current turn's intent to guide retrieval. Retrieving at the start of a session and carrying the results through multiple turns means your retrieved context goes stale.

Summarize, don't truncate. When conversation history grows past your window budget, summarize the oldest turns into a compact representation rather than dropping them. LangChain's trim_messages utility handles the trimming half; pairing it with a summarization step that condenses the turns it would drop is a common pattern in LangGraph applications.

Measure what's in the context. Token counts for each memory layer should be observable in your traces. If you don't know your p95 working memory size at runtime, you're flying blind on cost and context quality.
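The last two principles can be sketched together. The code below folds the oldest turns into a summary once the verbatim history exceeds its budget, then reports token counts per memory layer. It assumes a crude four-characters-per-token estimate and a caller-supplied summarize function (in practice an LLM call); compact_history and layer_token_report are hypothetical helper names.

```python
from typing import Callable


def count_tokens(text: str) -> int:
    """Crude estimate (about 4 characters per token); swap in a real tokenizer."""
    return max(1, len(text) // 4) if text else 0


def compact_history(
    messages: list[str],
    budget_tokens: int,
    summarize: Callable[[list[str]], str],
) -> list[str]:
    """Keep the newest turns verbatim within budget; summarize the rest.

    The budget covers the verbatim turns only, so reserve room for the
    summary separately when allocating the window.
    """
    kept = list(messages)
    dropped = []
    while len(kept) > 1 and sum(count_tokens(m) for m in kept) > budget_tokens:
        dropped.append(kept.pop(0))  # oldest turn first
    if not dropped:
        return kept
    return [f"Summary of earlier turns: {summarize(dropped)}"] + kept


def layer_token_report(layers: dict[str, list[str]]) -> dict[str, int]:
    """Token count per memory layer plus a total, suitable for trace attributes."""
    report = {name: sum(count_tokens(t) for t in texts) for name, texts in layers.items()}
    report["total"] = sum(report.values())
    return report


history = ["a" * 40, "b" * 40, "c" * 40]  # three turns of roughly 10 tokens each
compacted = compact_history(history, 20, lambda turns: f"{len(turns)} earlier turn(s)")
print(compacted[0])
print(layer_token_report({"system": ["x" * 8], "history": compacted}))
```

Emitting the report on every request is what makes a p95 working memory size measurable in the first place.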

You can visualize and design your agent's memory architecture — map the four layers, assign token budgets, and trace data flow between components — using the ContextIQ Memory Architecture Visualizer.