project brief

The Window Is Not the Memory

Retrieval was fixed — then the chatbot got dumber the longer a technician talked to it, and forgot everything by morning. The two things people conflate: the context window and memory.

April 7, 2026 · Context · Memory · Agentic RAG

A while back I wrote about what actually broke when I shipped an agentic RAG system for a cleaning-robot company — the retrieval, the routing, the evals. The punchline was that the agent was the easy part.

This is the sequel. It’s about the next two failures, the ones that only showed up once retrieval was good and real technicians were having real, multi-turn conversations with the thing:

  • The bot got noticeably worse the longer a single conversation ran. Sharp at turn three, confidently wrong by turn forty.
  • It remembered nothing. A tech could solve a problem with it on Monday and start from a blank slate on Tuesday.

For a while I treated these as one problem — “the model needs more context” — and kept making the window bigger. That fixed neither. The thing I’d missed is embarrassing in hindsight: the context window is not memory.

The window is the model’s working set for a single turn: volatile, bounded, rebuilt from scratch every time. Memory is the system you build around the model to decide what gets loaded back into that window. Two different problems. Two different fixes.

Part one — the window is a budget, not a bucket

My instinct was a bigger bucket: more context, more knowledge. But a context window isn’t a bucket you fill — it’s a budget you spend. Every token competes for the model’s attention, and that attention doesn’t stretch evenly as the window grows.

The cleanest name for the symptom is “lost in the middle”: a fact the model uses perfectly when it sits at the top or bottom of the context gets quietly ignored when it’s buried in between. Not missing — present, and ignored. Drag the fact around and watch:

[Interactive figure: lost in the middle]
Fact position: 8 / 15 · Answer accuracy: 50% · vs. no docs: -6 pts
Caption: buried in the middle, the model does worse than if you had given it no documents at all

The dip isn’t subtle. In the original “Lost in the Middle” study, GPT-3.5-Turbo’s mid-context accuracy fell below its closed-book score, the one it gets with no documents at all. You’d have been better off not retrieving. And a bigger window doesn’t save you: a fact stranded deep in a long context is still in the dead zone.

That’s the static version. In an agentic loop it gets worse, because the window doesn’t hold one carefully-placed fact — it holds everything, and it grows every turn. System prompt and tool schemas. The running conversation. This turn’s retrieved chunks. And the big one: tool outputs, which in a RAG agent are often verbose dumps that pile up on every iteration. They all compete for the same fixed budget.

[Interactive figure: context budget]
Context used: 131k / 200k · Input cost per turn: $0.39 · Prefill latency: 3.9s
Scale: 0 to the 200k cap, with the reliable budget marked at about 50k
Breakdown: system + tools 3.0k · history 20.0k · retrieved chunks 4.0k · tool outputs 96.0k · reply reserve 8.0k
Caption: fits the window, but it is past the point where quality starts to rot

Two things to notice:

  • “Fits” and “reliable” are different lines. You blow past the budget where quality holds up long before you hit the advertised cap.
  • Most of the bloat is re-fetchable junk — old tool outputs you could drop and re-retrieve later if you ever needed them.
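A minimal sketch of treating “fits” and “reliable” as two separate limits, using the token counts from the budget figure. The class and method names are hypothetical, and the 50k reliable budget is the figure’s illustrative value, not a constant that holds for every model.

```python
from dataclasses import dataclass


@dataclass
class ContextBudget:
    hard_cap: int   # advertised window size, in tokens
    reliable: int   # assumed point past which answer quality degrades

    def status(self, parts):
        """Classify a context assembled from named parts (name -> token count)."""
        used = sum(parts.values())
        if used > self.hard_cap:
            return "overflow"
        if used > self.reliable:
            return "fits-but-unreliable"
        return "ok"

    def biggest_offender(self, parts):
        """Name of the part spending the most tokens: the first thing to triage."""
        return max(parts, key=parts.get)


# Token counts taken from the context budget figure.
parts = {
    "system_and_tools": 3_000,
    "history": 20_000,
    "retrieved_chunks": 4_000,
    "tool_outputs": 96_000,
    "reply_reserve": 8_000,
}
budget = ContextBudget(hard_cap=200_000, reliable=50_000)

print(sum(parts.values()), budget.status(parts), budget.biggest_offender(parts))
```

Checking against the reliable limit on every turn, rather than waiting for an overflow error, is what turns the budget into something you can act on.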

So the fix is triage, not a bigger window. Three moves, cheapest first:

  • Clear before you summarise. Re-fetchable tool results can be dropped outright and replaced with a stub. No model call, lossless if you can fetch them again.
  • Compact only when you must. Genuinely-needed dialogue can be summarised — but it costs a model call and it’s lossy. I’ve been burned by exactly this: an exact part number compressed into “the part,” surfacing as a wrong answer ten turns on.
  • Retrieve just-in-time. Keep a lightweight pointer (a doc id, a query) in context and pull the full payload only when the step actually needs it.
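The first move, clearing, can be sketched in a few lines. This assumes each tool message carries a hypothetical `refetch_key` (a doc id or query) that lets you reproduce the result later; the `Message` type and `clear_tool_outputs` function are illustrative names, not any particular framework’s API.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Message:
    role: str                          # "system", "user", "assistant", or "tool"
    content: str
    refetch_key: Optional[str] = None  # hypothetical: how to fetch this result again


def clear_tool_outputs(messages, keep_last=1):
    """Replace older re-fetchable tool outputs with short stubs.

    Tool messages without a refetch_key are left alone, since dropping
    them would not be lossless.
    """
    refetchable = [i for i, m in enumerate(messages)
                   if m.role == "tool" and m.refetch_key]
    to_clear = set(refetchable[:-keep_last]) if keep_last else set(refetchable)
    cleared = []
    for i, m in enumerate(messages):
        if i in to_clear:
            stub = f"[cleared; re-fetch with {m.refetch_key!r}]"
            cleared.append(replace(m, content=stub))
        else:
            cleared.append(m)
    return cleared


history = [
    Message("user", "Robot won't dock after a firmware update."),
    Message("tool", "<long dump of the docking manual>", refetch_key="doc:dock-manual"),
    Message("assistant", "Let's check the dock sensors first."),
    Message("tool", "<long dump of the firmware changelog>", refetch_key="doc:fw-changelog"),
]
trimmed = clear_tool_outputs(history, keep_last=1)
```

The stub doubles as the lightweight pointer from the third move: the key stays in context, and the payload comes back only if a later step asks for it.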

Part two — the window forgets; memory is what you build around it

Everything above lives inside one conversation. The second failure was across conversations, and it’s a different beast. When a session ends, the window is gone and the model keeps no state. If you want the bot to know on Tuesday what it learned on Monday, you have to build that yourself. Memory isn’t a model feature — it’s plumbing you write around the model.

It helps to split it the way the literature does. Short-term memory is the live thread — this conversation’s messages. Long-term memory is everything that has to outlive the thread, stored outside the window and pulled back on demand. Long-term splits again by what it holds: facts about the customer (semantic), what happened in past sessions (episodic), and rules for how to behave (procedural).

The naïve version of short-term memory is a sliding window: keep the last k messages, drop the rest. It’s fine — until an early message held a constraint you needed to keep.

[Interactive figure: window eviction]
tech: Hey, got a robot down on site 12.
tech: It's a model K2, the newer chassis. [evicted]
bot: Got it. What's it doing?
tech: Won't dock after a firmware update.
bot: Let's check the dock sensors first.
tech: Sensors look clean, LEDs normal.
bot: Try a manual re-pair of the dock.
tech: Did that. Still no docking.
bot: Okay, let's look at the charging contacts.
tech: Contacts are fine. What's the charging-reset procedure?
bot answers: Here's the K1 charging-reset procedure… ✗ wrong model
Caption: the one constraint that mattered got evicted, and the bot answered for the wrong robot

That’s the dropped-constraint bug in robot form. The model statement (“it’s a K2”) was said once, early, and a raw window evicted it. Summarising helps — but notice it’s lossy: the summary kept “robot down, docking issue” and dropped the exact model, so the best case is the bot asking again, not getting it right. The real fix is to promote durable facts out of the window and into long-term memory the moment they appear.
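Here is a rough sketch of promotion next to a raw sliding window. The regex extractor is a deliberately naive stand-in for whatever really detects durable facts (in practice probably a model call), and `promote_facts`, `build_window`, and the `robot_model` key are hypothetical names.

```python
import re

# Hypothetical extractor: a regex stands in for real durable-fact detection.
MODEL_PATTERN = re.compile(r"\bmodel (K\d+)\b")


def promote_facts(text, facts):
    """Copy durable facts out of a message into a store that outlives the window."""
    match = MODEL_PATTERN.search(text)
    if match:
        facts["robot_model"] = match.group(1)


def build_window(messages, facts, k):
    """Return the last k messages, prefixed with the promoted facts."""
    recent = messages[-k:]
    if not facts:
        return recent
    known = "; ".join(f"{name}={value}" for name, value in sorted(facts.items()))
    return [("system", f"Known facts: {known}")] + recent


conversation = [
    ("tech", "Hey, got a robot down on site 12."),
    ("tech", "It's a model K2, the newer chassis."),
    ("bot", "Got it. What's it doing?"),
    ("tech", "Won't dock after a firmware update."),
    ("bot", "Let's check the dock sensors first."),
    ("tech", "What's the charging-reset procedure?"),
]

facts = {}
for role, text in conversation:
    if role == "tech":
        promote_facts(text, facts)   # runs as each message arrives, before eviction

raw_window = conversation[-4:]                   # the K2 message is gone
window = build_window(conversation, facts, k=4)  # the K2 fact rides along
```

The point of the design is timing: the fact is captured when it is said, so nothing depends on the message still being in the window when the question finally comes.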

Which sounds easy and isn’t — because the hard part of long-term memory isn’t reading, it’s writing. Every fact you store is a fact you’ll later have to update, reconcile, or retire. “Customer runs K1” was true for a year and is now actively wrong. If you only ever append, the store fills with contradictions and the bot starts averaging them.

The systems that do this well — Mem0, Zep, Letta and friends — treat a new fact as a decision, not an insert: add it, update an existing one, delete a contradiction, or do nothing. The better ones are temporal: they don’t delete the old fact, they record when it stopped being true, so “what model do they run?” has a correct answer for any point in time.
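To make the write-as-decision idea concrete, here is a toy temporal store. It is not the API of Mem0, Zep, or Letta; `Fact`, `TemporalStore`, and the field names are hypothetical, the decision is a plain equality check where a real system would use something smarter, and it assumes writes arrive in chronological order.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class Fact:
    subject: str
    attribute: str
    value: str
    valid_from: datetime
    valid_to: Optional[datetime] = None   # None means "still true"


class TemporalStore:
    def __init__(self):
        self.facts = []

    def _current(self, subject, attribute):
        for fact in self.facts:
            if (fact.subject, fact.attribute) == (subject, attribute) and fact.valid_to is None:
                return fact
        return None

    def write(self, subject, attribute, value, at):
        """Decide what a new fact means for the store; return the decision taken."""
        current = self._current(subject, attribute)
        if current is None:
            self.facts.append(Fact(subject, attribute, value, at))
            return "add"
        if current.value == value:
            return "noop"
        current.valid_to = at   # retire the old fact instead of deleting it
        self.facts.append(Fact(subject, attribute, value, at))
        return "update"

    def value_at(self, subject, attribute, when):
        """Answer 'what was true at this point in time?'"""
        for fact in self.facts:
            if ((fact.subject, fact.attribute) == (subject, attribute)
                    and fact.valid_from <= when
                    and (fact.valid_to is None or when < fact.valid_to)):
                return fact.value
        return None


store = TemporalStore()
decisions = [
    store.write("customer", "robot_model", "K1", datetime(2025, 8, 1)),
    store.write("customer", "robot_model", "K1", datetime(2025, 9, 1)),
    store.write("customer", "robot_model", "K2", datetime(2026, 4, 1)),
]
```

Because the K1 fact is closed rather than removed, a question about last December and a question about today both get a correct answer from the same store.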

And then there’s recall, where people assume memory is just RAG again. It isn’t. Document retrieval ranks by similarity. Memory recall has to weigh recency and importance too — otherwise a stale-but-similar fact beats the fresh one.

[Interactive figure: memory recall]
Query: which robot model does this customer run?
Each memory is shown with relevance (rel), recency (rec), and importance (imp) bars and a score:
  • “Customer runs model K1 across all sites.” (8 months ago), score 0.92, recalled
  • “Customer upgraded their fleet to model K2.” (6 days ago), score 0.86
  • “Site 12 has three charging docks.” (4 weeks ago), score 0.55
  • “Customer SLA is next-business-day.” (4 months ago), score 0.40
  • “Primary contact prefers email over phone.” (9 days ago), score 0.20
Bot answers: tells the tech it's a K1 ✗ (stale; they upgraded 6 days ago)
Caption: similarity alone recalls the stale fact; weighting recency surfaces what is true now

Similarity-only, the eight-month-old “K1” memory wins: it’s a great match for the question, and completely wrong. Add a little recency weight and the recent “K2” upgrade surfaces. That’s the Generative Agents recall idea in miniature: score on relevance and recency and importance, not similarity alone.
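A small sketch of that scoring, under stated assumptions: the relevance values reuse the scores from the recall figure, the importance values and the 30-day half-life are made up for illustration, and the weighted sum is one reasonable way to combine the three signals rather than the exact formula from any paper or library.

```python
def recall_score(relevance, age_days, importance,
                 w_rel=1.0, w_rec=1.0, w_imp=1.0, half_life_days=30.0):
    """Weighted sum of relevance, recency, and importance.

    Recency decays exponentially: a memory half_life_days old scores 0.5.
    """
    recency = 0.5 ** (age_days / half_life_days)
    return w_rel * relevance + w_rec * recency + w_imp * importance


def rank(memories, **weights):
    """Sort memories from best to worst recall score."""
    return sorted(
        memories,
        key=lambda m: recall_score(m["relevance"], m["age_days"], m["importance"], **weights),
        reverse=True,
    )


# Relevance from the figure; importance values are illustrative assumptions.
memories = [
    {"text": "Customer runs model K1 across all sites.",
     "relevance": 0.92, "age_days": 240, "importance": 0.5},
    {"text": "Customer upgraded their fleet to model K2.",
     "relevance": 0.86, "age_days": 6, "importance": 0.5},
    {"text": "Site 12 has three charging docks.",
     "relevance": 0.55, "age_days": 28, "importance": 0.3},
]

similarity_only = rank(memories, w_rec=0.0, w_imp=0.0)
weighted = rank(memories)
```

With the recency and importance weights zeroed out, the K1 memory ranks first; with the default weights, the K2 upgrade does. The weights and half-life are the knobs worth tuning against real recall failures.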

If I started over tomorrow

  • Decide per fact whether it’s window-scoped or memory-scoped before writing a line of retrieval.
  • Budget the context window explicitly; treat “fits” and “reliable” as two different limits.
  • Clear before you summarise; summarise before you stuff.
  • Make every memory write a decision (add / update / delete), never a blind append.
  • Stamp facts with time so the bot can tell “true once” from “true now.”
  • Namespace memory per user and test the bleed case on day one.

Closing

More context was never the answer. A bigger window just made the bloat more expensive and the forgetting no better. What helped was boring: spend the window like a budget, and build the memory the model doesn’t have. The window is for thinking. The memory is for remembering. Keeping those two jobs separate turned out to be most of the work.

Notes & sources

This article was made with the help of AI, based on notes I kept throughout the project and research I did before and during it.