The Window Is Not the Memory
Retrieval was fixed — then the chatbot got dumber the longer a technician talked to it, and forgot everything by morning. The two things people conflate: the context window and memory.
A while back I wrote about what actually broke when I shipped an agentic RAG system for a cleaning-robot company — the retrieval, the routing, the evals. The punchline was that the agent was the easy part.
This is the sequel. It’s about the next two failures, the ones that only showed up once retrieval was good and real technicians were having real, multi-turn conversations with the thing:
- The bot got noticeably worse the longer a single conversation ran. Sharp at turn three, confidently wrong by turn forty.
- It remembered nothing. A tech could solve a problem with it on Monday and start from a blank slate on Tuesday.
For a while I treated these as one problem — “the model needs more context” — and kept making the window bigger. That fixed neither. The thing I’d missed is embarrassing in hindsight: the context window is not memory.
The window is the model’s working set for a single turn: volatile, bounded, rebuilt from scratch every time. Memory is the system you build around the model to decide what gets loaded back into that window. Two different problems. Two different fixes.
Part one — the window is a budget, not a bucket
My instinct was a bigger bucket: more context, more knowledge. But a context window isn’t a bucket you fill — it’s a budget you spend. Every token competes for the model’s attention, and that attention doesn’t stretch evenly as the window grows.
The cleanest name for the symptom is “lost in the middle”: a fact the model uses perfectly when it sits at the top or bottom of the context gets quietly ignored when it’s buried in between. Not missing: present, and ignored.
▸ buried in the middle, the model does worse than if you had given it no documents at all
The dip isn’t subtle. In the original study (Liu et al.), GPT-3.5-Turbo’s mid-context accuracy on multi-document question answering fell below its closed-book score, the one it gets with no documents at all. You’d have been better off not retrieving. And a bigger window doesn’t save you: a fact stranded deep in a long context is still in the dead zone.
That’s the static version. In an agentic loop it gets worse, because the window doesn’t hold one carefully-placed fact — it holds everything, and it grows every turn. System prompt and tool schemas. The running conversation. This turn’s retrieved chunks. And the big one: tool outputs, which in a RAG agent are often verbose dumps that pile up on every iteration. They all compete for the same fixed budget.
▸ fits the window, but it is past the point where quality starts to rot
Two things to notice:
- “Fits” and “reliable” are different lines. You blow past the budget where quality holds up long before you hit the advertised cap.
- Most of the bloat is re-fetchable junk — old tool outputs you could drop and re-retrieve later if you ever needed them.
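To make the two lines concrete, here is a minimal sketch of a budget report that tracks "fits" and "reliable" as separate limits. Everything in it is an assumption for illustration: the `Segment` type, the `budget_report` helper, and especially the numbers, which are made up rather than measured from the project. The reliable limit in particular is something you would have to find empirically for your own model and task.

```python
from dataclasses import dataclass
from typing import Dict, List, Union


@dataclass
class Segment:
    name: str          # e.g. "system", "dialogue", "retrieved", "tool_output"
    tokens: int        # token count from whatever tokenizer you use
    refetchable: bool  # True if it can be dropped and fetched again later


def budget_report(
    segments: List[Segment], hard_cap: int, reliable_limit: int
) -> Dict[str, Union[int, str]]:
    """Compare the current window against two separate limits."""
    total = sum(s.tokens for s in segments)
    reclaimable = sum(s.tokens for s in segments if s.refetchable)
    if total > hard_cap:
        status = "overflow"            # does not fit at all
    elif total > reliable_limit:
        status = "fits_but_degraded"   # fits, but past the point quality holds
    else:
        status = "ok"
    return {"total": total, "reclaimable": reclaimable, "status": status}


# Illustrative numbers only (hypothetical, not measurements).
segments = [
    Segment("system", 3_000, refetchable=False),
    Segment("dialogue", 12_000, refetchable=False),
    Segment("retrieved", 6_000, refetchable=True),
    Segment("tool_output", 45_000, refetchable=True),
]
report = budget_report(segments, hard_cap=128_000, reliable_limit=32_000)
print(report)
```

With these made-up numbers the window fits comfortably under the cap yet sits well past the reliable limit, and most of the total is reclaimable because it is re-fetchable.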
So the fix is triage, not a bigger window. Three moves, cheapest first:
- Clear before you summarise. Re-fetchable tool results can be dropped outright and replaced with a stub. No model call, lossless if you can fetch them again.
- Compact only when you must. Genuinely-needed dialogue can be summarised — but it costs a model call and it’s lossy. I’ve been burned by exactly this: an exact part number compressed into “the part,” surfacing as a wrong answer ten turns on.
- Retrieve just-in-time. Keep a lightweight pointer (a doc id, a query) in context and pull the full payload only when the step actually needs it.
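The three moves above can be sketched as a single triage pass. This is a rough illustration under stated assumptions, not the code from the project: `Message`, `clear_tool_outputs`, `triage`, the `pointer` field, and the stub size are all hypothetical names and values, and `summarise` stands in for whatever model call you would use for compaction.

```python
from dataclasses import dataclass, replace
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Message:
    role: str                      # "system", "user", "assistant", or "tool"
    text: str
    tokens: int
    pointer: Optional[str] = None  # doc id or query that can re-fetch a tool result


STUB_TOKENS = 8  # assumed cost of one stub line


def total_tokens(messages: List[Message]) -> int:
    return sum(m.tokens for m in messages)


def clear_tool_outputs(messages: List[Message], keep_last: int = 1) -> List[Message]:
    """Replace re-fetchable tool results with stubs, keeping the newest ones."""
    refetchable = [i for i, m in enumerate(messages) if m.role == "tool" and m.pointer]
    stale = set(refetchable[:-keep_last] if keep_last > 0 else refetchable)
    return [
        replace(m, text=f"[cleared; re-fetch with {m.pointer}]", tokens=STUB_TOKENS)
        if i in stale else m
        for i, m in enumerate(messages)
    ]


def triage(
    messages: List[Message],
    reliable_limit: int,
    summarise: Callable[[List[Message]], List[Message]],
) -> List[Message]:
    """Cheapest move first: do nothing, then clear, then (lossy) compact."""
    if total_tokens(messages) <= reliable_limit:
        return messages
    cleared = clear_tool_outputs(messages)   # no model call
    if total_tokens(cleared) <= reliable_limit:
        return cleared
    return summarise(cleared)                # costs a model call, loses detail
```

The stub keeps the pointer in context, which is what makes just-in-time retrieval possible: the agent can re-run the fetch only if a later step needs the payload. Compaction sits last because it is the only step that can turn an exact part number into "the part".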
Part two — the window forgets; memory is what you build around it
Everything above lives inside one conversation. The second failure was across conversations, and it’s a different beast. When a session ends, the window is gone and the model keeps no state. If you want the bot to know on Tuesday what it learned on Monday, you have to build that yourself. Memory isn’t a model feature — it’s plumbing you write around the model.
It helps to split it the way the literature does. Short-term memory is the live thread — this conversation’s messages. Long-term memory is everything that has to outlive the thread, stored outside the window and pulled back on demand. Long-term splits again by what it holds: facts about the customer (semantic), what happened in past sessions (episodic), and rules for how to behave (procedural).
The naïve version of short-term memory is a sliding window: keep the last k messages, drop the rest. It’s fine — until an early message held a constraint you needed to keep.
▸ the one constraint that mattered got evicted, and the bot answered for the wrong robot
That’s the dropped-constraint bug in robot form. The technician named the robot model (“it’s a K2”) once, early, and a raw window evicted it. Summarising helps, but notice it’s lossy: the summary kept “robot down, docking issue” and dropped the exact model, so the best case is the bot asking again, not getting it right. The real fix is to promote durable facts out of the window and into long-term memory the moment they appear.
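Here is a minimal sketch of a sliding window that promotes a durable fact before it can be evicted. The `Thread` class and the regex are assumptions for illustration: a real extractor would more likely be a model call or a structured form field, and the promoted facts would be written to the long-term store rather than held in a dict on the thread.

```python
import re
from collections import deque
from typing import Deque, Dict, List, Tuple

# Hypothetical pattern: assumes robot models are written like "K1", "K2".
MODEL_PATTERN = re.compile(r"\b(K\d+)\b")


class Thread:
    def __init__(self, k: int) -> None:
        # Sliding window: once k messages are held, the oldest falls off.
        self.recent: Deque[Tuple[str, str]] = deque(maxlen=k)
        # Facts promoted out of the window; these are never evicted.
        self.durable: Dict[str, str] = {}

    def add(self, role: str, text: str) -> None:
        if role == "user":
            match = MODEL_PATTERN.search(text)
            if match:
                # Promote the moment the fact appears, not when it is about to drop.
                self.durable["robot_model"] = match.group(1)
        self.recent.append((role, text))

    def build_context(self) -> List[str]:
        facts = [f"{key}: {value}" for key, value in self.durable.items()]
        return facts + [f"{role}: {text}" for role, text in self.recent]


thread = Thread(k=2)
thread.add("user", "it's a K2, it won't dock")
thread.add("assistant", "Is the dock powered?")
thread.add("user", "yes, still failing")
thread.add("assistant", "Try reseating the charging contacts.")
print(thread.build_context())
```

The first user message has been evicted from the window by the end, but the model identifier survives because it was copied out on arrival.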
Which sounds easy and isn’t — because the hard part of long-term memory isn’t reading, it’s writing. Every fact you store is a fact you’ll later have to update, reconcile, or retire. “Customer runs K1” was true for a year and is now actively wrong. If you only ever append, the store fills with contradictions and the bot starts averaging them.
The systems that do this well — Mem0, Zep, Letta and friends — treat a new fact as a decision, not an insert: add it, update an existing one, delete a contradiction, or do nothing. The better ones are temporal: they don’t delete the old fact, they record when it stopped being true, so “what model do they run?” has a correct answer for any point in time.
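To show the shape of "a write is a decision", here is a small sketch of a temporal store. It is not the API of Mem0, Zep, or Letta. `Fact`, `TemporalStore`, and the "acme" customer are hypothetical, and conflicts are detected by exact key match, where a production system would more plausibly use a model or a graph to decide whether two facts conflict.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class Fact:
    subject: str
    attribute: str
    value: str
    valid_from: datetime
    valid_to: Optional[datetime] = None  # None means "still true"


class TemporalStore:
    def __init__(self) -> None:
        self.facts: List[Fact] = []

    def _current(self, subject: str, attribute: str) -> Optional[Fact]:
        for f in self.facts:
            if (f.subject, f.attribute) == (subject, attribute) and f.valid_to is None:
                return f
        return None

    def write(self, subject: str, attribute: str, value: str, now: datetime) -> str:
        """Decide what to do with an incoming fact; return the decision taken."""
        current = self._current(subject, attribute)
        if current is None:
            self.facts.append(Fact(subject, attribute, value, now))
            return "add"
        if current.value == value:
            return "noop"
        current.valid_to = now  # retire the old fact instead of deleting it
        self.facts.append(Fact(subject, attribute, value, now))
        return "update"

    def as_of(self, subject: str, attribute: str, when: datetime) -> Optional[str]:
        """Answer the question for any point in time."""
        for f in self.facts:
            if (f.subject, f.attribute) != (subject, attribute):
                continue
            if f.valid_from <= when and (f.valid_to is None or when < f.valid_to):
                return f.value
        return None


store = TemporalStore()
store.write("acme", "robot_model", "K1", datetime(2023, 1, 1))
store.write("acme", "robot_model", "K2", datetime(2024, 1, 1))
print(store.as_of("acme", "robot_model", datetime(2023, 9, 1)))
print(store.as_of("acme", "robot_model", datetime(2024, 3, 1)))
```

Because the old fact is closed rather than removed, the store never holds two facts that are both "still true" for the same key, which is what stops the bot from averaging contradictions.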
And then there’s recall, where people assume memory is just RAG again. It isn’t. Document retrieval ranks by similarity. Memory recall has to weigh recency and importance too — otherwise a stale-but-similar fact beats the fresh one.
▸ similarity alone recalls the stale fact; weighting recency surfaces what is true now
Similarity-only, the year-old “K1” memory wins: it’s a great match for the question, and completely wrong. Add a little recency weight and the recent “K2” upgrade surfaces. That’s the Generative Agents recall idea in miniature — score on relevance and recency and importance, not similarity alone.
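A toy version of that scoring, as a sketch under assumptions: the similarity values are hand-assigned rather than computed from embeddings, the half-life decay and the weights are illustrative choices of mine, and this is a simplified weighted sum in the spirit of the Generative Agents formula rather than a reproduction of it.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Memory:
    text: str
    similarity: float  # relevance to the query in [0, 1], e.g. cosine similarity
    age_days: float
    importance: float  # in [0, 1]


def recency(age_days: float, half_life_days: float = 30.0) -> float:
    """Exponential decay: 1.0 for a brand-new memory, 0.5 after one half-life."""
    return 0.5 ** (age_days / half_life_days)


def score(m: Memory, w_rel: float = 1.0, w_rec: float = 0.0, w_imp: float = 0.0) -> float:
    return w_rel * m.similarity + w_rec * recency(m.age_days) + w_imp * m.importance


def recall(memories: List[Memory], **weights: float) -> Memory:
    return max(memories, key=lambda m: score(m, **weights))


# Hypothetical memories for the question "what model do they run?"
old = Memory("Customer runs K1", similarity=0.90, age_days=365, importance=0.5)
new = Memory("Customer upgraded to K2", similarity=0.80, age_days=10, importance=0.5)

print(recall([old, new]).text)             # similarity only
print(recall([old, new], w_rec=0.5).text)  # similarity plus recency
```

With similarity alone the year-old memory wins on its 0.90 match. With a recency weight of 0.5 the ten-day-old memory gains roughly 0.4 while the old one gains almost nothing, so the K2 upgrade surfaces.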
If I started over tomorrow
- Decide per fact whether it’s window-scoped or memory-scoped before writing a line of retrieval.
- Budget the context window explicitly; treat “fits” and “reliable” as two different limits.
- Clear before you summarise; summarise before you stuff.
- Make every memory write a decision (add / update / delete), never a blind append.
- Stamp facts with time so the bot can tell “true once” from “true now.”
- Namespace memory per user and test the bleed case (one user’s facts surfacing in another user’s session) on day one.
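For the last item, a day-one bleed test can be very small. This sketch assumes a hypothetical `NamespacedMemory` class with keyword lookup; the point is only that every read is scoped by a user id, and that a test asserts nothing crosses that boundary.

```python
from collections import defaultdict
from typing import DefaultDict, List


class NamespacedMemory:
    def __init__(self) -> None:
        self._by_user: DefaultDict[str, List[str]] = defaultdict(list)

    def remember(self, user_id: str, fact: str) -> None:
        self._by_user[user_id].append(fact)

    def recall(self, user_id: str, keyword: str) -> List[str]:
        # Only this user's namespace is searched; .get avoids creating empty entries.
        own = self._by_user.get(user_id, [])
        return [fact for fact in own if keyword.lower() in fact.lower()]


def test_no_bleed_between_users() -> None:
    memory = NamespacedMemory()
    memory.remember("tech_a", "Site runs K2")
    assert memory.recall("tech_a", "k2") == ["Site runs K2"]
    assert memory.recall("tech_b", "k2") == []  # the bleed case


test_no_bleed_between_users()
```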
Closing
More context was never the answer. A bigger window just made the bloat more expensive and the forgetting no better. What helped was boring: spend the window like a budget, and build the memory the model doesn’t have. The window is for thinking. The memory is for remembering. Keeping those two jobs separate turned out to be most of the work.
Notes & sources
- Liu et al., Lost in the Middle — positional degradation in long contexts (TACL 2024).
- Chroma, Context Rot — performance vs. input length across 18 models.
- Anthropic, Effective context engineering for AI agents — attention budget, clearing vs. compaction.
- Packer et al., MemGPT — paging memory in and out of the window (now Letta).
- Mem0 and Zep / Graphiti — write-back, conflict resolution, temporal facts.
- Park et al., Generative Agents — the recency · importance · relevance recall formula.
This article was made with the help of AI, based on notes I kept throughout the project, and research I did before and during the project.