What Actually Breaks Agentic RAG in Production

I built an agentic RAG system for a cleaning-robot company's field technicians. The agent was the easy part. Here's what actually broke in production, and the five fixes that mattered.

February 18, 2026 · RAG · Agents · Retrieval

This company runs cleaning robots in the field. When one breaks, a technician needs to diagnose it and get it back online — fast. Every hour of downtime is an hour the robot isn’t earning its keep.

The knowledge to fix it technically exists. It’s just scattered across four places that don’t talk to each other:

  • Past maintenance tickets — real fixes, written in messy human language.
  • Manuals and troubleshooting PDFs — thorough, but long and painful to search.
  • Product and spec docs — part numbers, model variants, tolerances.
  • Tribal knowledge — living in one senior tech’s head, available only when they pick up.

So the goal was never “build a chatbot.” It was: make repair knowledge searchable and trustworthy enough that technicians actually reach for it instead of guessing.

The plan I started with

I used to think RAG was basically four steps: chunk the docs, embed them, retrieve the top-k, and ask the model nicely. That gets you a demo. It does not get you something a technician will trust at 2am with a dead robot in front of them.

Here are the five things that actually broke when I pushed past the demo, and what I changed.

1. One big index is a soup

My first version threw tickets, manuals, and product docs into a single index behind a single retriever. It felt elegant. It was inconsistent.

Different sources behave differently. Tickets are grounded in reality but unstructured. Manuals are structured but easy to shred with bad chunking. Product docs often need an exact match — a part number, a model variant. Blend them into one soup and you get confident answers grounded in the wrong place.

The failure mode that taught me this (more than once): a technician asks about model K2, the retriever hands back a charging section for K1, and the model writes a clean, plausible checklist for the wrong robot. In a demo you skim right past it. In the field, someone burns twenty minutes.

The fix wasn’t a smarter agent. It was source-specific retrieval tools with pre-filtering, and an agent whose main job is just to route to the right one.

Example (one retriever → combined_index, wrong source): the technician query goes to the single retriever, which returns this chunk:

  Charging troubleshooting: Model K1
  If the robot does not charge, clean the charging contacts and reseat it on the dock. Check the K1 power-board fuse (F2)…

Wrong model. K1 ≠ K2, and it never mentions E42, but the answer reads confident.

The agent here isn’t clever. It picks a tool and a filter. That turned out to be most of the value.
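To make "picks a tool and a filter" concrete, here is a minimal sketch of that kind of router. It is not the production code: the tool names, the regexes, the part-number shape, and the `model` metadata field are all assumptions for illustration.

```python
import re
from dataclasses import dataclass
from typing import Optional

MODEL_RE = re.compile(r"\bK\d+\b", re.IGNORECASE)       # model variants such as K1, K2
ERROR_RE = re.compile(r"\bE\d{2,3}\b", re.IGNORECASE)   # error codes such as E42
PART_RE = re.compile(r"\b[A-Z]{2,}-\d{3,}\b")           # assumed part-number shape (hypothetical)


@dataclass(frozen=True)
class Route:
    tool: str                  # hypothetical tool name
    model: Optional[str]       # metadata pre-filter
    error_code: Optional[str]  # exact-match term to pass to the tool


def route(query: str) -> Route:
    """Pick one source-specific tool and the filters to apply before similarity search."""
    model = MODEL_RE.search(query)
    code = ERROR_RE.search(query)
    if PART_RE.search(query):
        tool = "product_docs"   # part numbers need exact lookup
    elif code:
        tool = "manuals"        # error codes point at troubleshooting procedures
    else:
        tool = "tickets"        # free-text symptoms: look at past fixes
    return Route(
        tool=tool,
        model=model.group(0).upper() if model else None,
        error_code=code.group(0).upper() if code else None,
    )


def prefilter(chunks: list, r: Route) -> list:
    """Drop chunks for other models before any similarity scoring runs."""
    if r.model is None:
        return chunks
    return [c for c in chunks if c.get("model") == r.model]


chunks = [
    {"model": "K1", "text": "Charging troubleshooting, Model K1"},
    {"model": "K2", "text": "Charging troubleshooting, Model K2"},
]
r = route("K2 shows E42 and won't charge")
print(r.tool, r.model, r.error_code)
print([c["model"] for c in prefilter(chunks, r)])
```

With the model filter applied first, the K1 charging section never reaches the ranking step, so it cannot end up as the grounding for a K2 answer.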

2. I kept making it “more agentic”

When the system felt weak, my instinct was to add agent steps — another tool call, another round of “thinking,” another intermediate reasoning pass. It felt like progress. It mostly added latency and new ways to fail.

Two things stack up fast. Sequential tool calls add real wall-clock time. And every extra step is another chance to hit a timeout, a rate limit, or a tool output you didn’t handle. A query that should be “retrieve one or two things and answer” turns into six tool calls — and by the time it finishes, the technician has already re-asked or walked away.

Chart readout at 3 sequential tool calls: latency 2.6s, p(failure) 12.9%. That is still inside the window where someone is actually waiting for the answer.
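The failure curve follows from simple compounding: if each sequential call fails independently, the chance that at least one fails grows with every step. A small sketch, assuming a per-call failure rate of 4.5%. That rate is my back-solved assumption (it reproduces the 12.9% figure at three tool calls), not a measured number.

```python
def chain_failure(per_call_failure: float, calls: int) -> float:
    """Probability that at least one of `calls` independent sequential calls fails."""
    return 1.0 - (1.0 - per_call_failure) ** calls


PER_CALL_FAILURE = 0.045  # assumption: chosen to match 12.9% at three calls

for calls in (1, 3, 6):
    print(calls, f"{chain_failure(PER_CALL_FAILURE, calls):.1%}")
```

Under the same assumption, six calls land at about 24%, roughly one query in four hitting at least one failed step.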

So I tightened the system instead of making it more agentic: a hard cap on tool calls, timeouts and retries, parallel retrieval where the calls are independent, and smaller context windows. Less ambitious, far more reliable.
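Those constraints can be sketched with plain `asyncio`. This is an illustration under assumed limits, not the actual implementation: the cap, timeout, and retry values are placeholders, and the two tools are stand-ins. The cap here bounds how many tools get dispatched; retries are bounded separately.

```python
import asyncio

MAX_TOOL_CALLS = 3   # assumed hard cap on tools dispatched per query
TIMEOUT_S = 2.0      # assumed per-call timeout
MAX_RETRIES = 1      # assumed: one retry, then give up


async def call_with_timeout(tool, query, timeout_s=TIMEOUT_S, retries=MAX_RETRIES):
    """Run one tool call with a timeout and a bounded number of retries."""
    for _ in range(retries + 1):
        try:
            return await asyncio.wait_for(tool(query), timeout=timeout_s)
        except (asyncio.TimeoutError, ConnectionError):
            continue  # retry until the attempts run out
    return None  # the caller treats None as "no result from this tool"


async def retrieve_parallel(tools, query, max_calls=MAX_TOOL_CALLS, timeout_s=TIMEOUT_S):
    """Dispatch at most max_calls independent tools concurrently."""
    selected = tools[:max_calls]
    results = await asyncio.gather(
        *(call_with_timeout(t, query, timeout_s) for t in selected)
    )
    return [r for r in results if r is not None]


# Stand-in tools for the demo; real ones would hit an index or an API.
async def manuals(query):
    return f"manuals:{query}"


async def tickets(query):
    await asyncio.sleep(0.2)  # slower than the demo timeout below
    return f"tickets:{query}"


print(asyncio.run(retrieve_parallel([manuals, tickets], "E42", timeout_s=0.05)))
```

Because independent calls run concurrently, wall-clock time is set by the slowest call (bounded by the timeout and retries) rather than by the sum of all calls, and a tool that keeps timing out degrades the answer instead of stalling it.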

3. You can’t prompt your way out of bad retrieval

When answers came back wrong, I reached for the prompt. Reword the instructions, add a “be careful” line, try again. It rarely helped — because most “wrong answers” were just wrong retrieval with a confident wrapper. The model was faithfully summarising the wrong chunks.

Error codes were the clearest case. E42 isn’t semantic. Embedding-similarity search happily returns sections that are about charging, or about errors in general, without ever surfacing the one chunk that actually mentions E42. Here is what embedding-only retrieval returns for one query (similarity score in parentheses):

query: “E42 - won’t charge”

  1. Battery care & charging best practices (0.82)
  2. Dock alignment troubleshooting (0.79)
  3. Power-LED status guide (0.77)
  4. E40–E49 error family - overview (0.71)

The top hits look related, but none of them actually contains the code.

What fixed it had nothing to do with prompting: hybrid retrieval with exact-match handling for codes and part numbers, chunking that keeps procedures intact instead of splitting a step sequence in half, and metadata-first filtering before similarity ever runs.
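A minimal sketch of the exact-match half of that idea: pull codes out of the query, then rank chunks that literally contain them ahead of everything else, using the similarity score only as a tiebreaker. This is a simple exact-match-first re-rank, not a full lexical-plus-vector fusion. The regex, the chunk shape, and the fifth chunk with its 0.55 score are assumptions added for illustration; the other four titles and scores are the ones listed above.

```python
import re

# Error codes like E42, plus an assumed part-number shape (hypothetical).
CODE_RE = re.compile(r"\b(?:E\d{2,3}|[A-Z]{2,}-\d{3,})\b")


def extract_exact_terms(query: str) -> list:
    """Return codes and part numbers that must match literally."""
    return CODE_RE.findall(query.upper())


def hybrid_rank(query: str, chunks: list, semantic_scores: dict, top_k: int = 3) -> list:
    """Sort by number of exact-term hits first, then by similarity score."""
    terms = extract_exact_terms(query)

    def key(chunk):
        text = chunk["text"].upper()
        exact_hits = sum(
            1 for t in terms if re.search(rf"\b{re.escape(t)}\b", text)
        )
        return (exact_hits, semantic_scores.get(chunk["id"], 0.0))

    return sorted(chunks, key=key, reverse=True)[:top_k]


chunks = [
    {"id": "battery", "text": "Battery care & charging best practices"},
    {"id": "dock", "text": "Dock alignment troubleshooting"},
    {"id": "led", "text": "Power-LED status guide"},
    {"id": "family", "text": "E40-E49 error family overview"},
    {"id": "e42-proc", "text": "E42: charge fault procedure"},  # hypothetical chunk
]
scores = {"battery": 0.82, "dock": 0.79, "led": 0.77, "family": 0.71, "e42-proc": 0.55}

print([c["id"] for c in hybrid_rank("E42 - won't charge", chunks, scores)])
```

The chunk that actually names E42 moves to the top even though its similarity score is the lowest of the five, which is exactly the case pure embedding search gets wrong.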

4. I delayed evals, and paid for it

For too long I “tested” with a handful of questions I already knew would work. RAG makes that trap easy to fall into — you end up optimising for your own demo prompts and calling it quality.

What actually moved things was boring: a small golden set of technician-style questions, run as a regression check every time I touched retrieval. Thirty questions was enough. The point wasn’t a score. It was being able to classify why something failed, because the fix is different each time.

Failure classes: retrieval miss, chunking, synthesis, pass.

Example (retrieval miss): no chunk actually contains “E42.” Fix: hybrid / exact-match retrieval for error codes.

Once failures had names — retrieval miss, chunking, synthesis — I stopped guessing and started fixing the right layer.
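One way such a regression check could be wired up, as a sketch. The labels are the ones named above; the classification heuristic, the case fields (`doc_id`, `must_contain`), and the `pipeline` callable are assumptions for illustration, and a real check would likely need a less literal notion of "the answer contains the fact".

```python
from collections import Counter


def classify(case: dict, retrieved: list, answer: str) -> str:
    """Label one golden-set result by the layer that failed (heuristic)."""
    from_expected = [c for c in retrieved if c["doc_id"] == case["doc_id"]]
    if not from_expected:
        return "retrieval miss"   # the right source never came back
    if not any(case["must_contain"] in c["text"] for c in from_expected):
        return "chunking"         # right source, but the needed text was split away
    if case["must_contain"] not in answer:
        return "synthesis"        # evidence was retrieved, the answer dropped it
    return "pass"


def run_golden_set(cases: list, pipeline) -> Counter:
    """pipeline(question) must return (retrieved_chunks, answer_text)."""
    results = Counter()
    for case in cases:
        retrieved, answer = pipeline(case["question"])
        results[classify(case, retrieved, answer)] += 1
    return results


def fake_pipeline(question):
    # Stand-in for the real system: returns a related chunk that never mentions E42.
    chunk = {"doc_id": "battery-care", "text": "Clean the charging contacts."}
    return [chunk], "Clean the charging contacts and reseat the robot."


cases = [
    {"question": "E42 - won't charge", "doc_id": "k2-error-codes", "must_contain": "E42"},
]
print(run_golden_set(cases, fake_pipeline))
```

Running this after every retrieval change gives a count per failure class, so a regression shows up as "two more chunking failures" rather than as a vague sense that answers got worse.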

5. Observability was the product, not a later task

I kept telling myself I’d harden the operational side later. But technicians don’t experience “the model.” They experience waiting, retries, and “is this thing stuck?” And when an answer is wrong, you need to replay the whole chain — tool calls, retrieved chunks, scores, latency, error codes — or you’re debugging by vibes.
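Replaying the chain only works if every step was recorded when it ran. A minimal sketch of a per-query trace, assuming chunks are dicts with `id` and `score` keys; the class and field names are hypothetical, and a real system would ship these records to a log store rather than print them.

```python
import json
import time
import uuid
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class ToolCall:
    tool: str
    filters: dict
    chunk_ids: list
    scores: list
    latency_ms: float
    error: Optional[str] = None


@dataclass
class Trace:
    query: str
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    calls: list = field(default_factory=list)

    def record(self, tool: str, filters: dict, fn) -> list:
        """Run fn(), timing it and capturing chunks, scores, and any error."""
        start = time.perf_counter()
        chunks, error = [], None
        try:
            chunks = fn()
        except Exception as exc:  # keep the failure in the trace instead of losing it
            error = repr(exc)
        latency_ms = (time.perf_counter() - start) * 1000
        self.calls.append(ToolCall(
            tool, filters,
            [c["id"] for c in chunks],
            [c["score"] for c in chunks],
            latency_ms, error,
        ))
        return chunks

    def to_json(self) -> str:
        return json.dumps(asdict(self))


trace = Trace("K2 shows E42 and won't charge")
trace.record("manuals", {"model": "K2"}, lambda: [{"id": "k2-e42", "score": 0.91}])
print(trace.to_json())
```

One JSON line per query, holding tool calls, filters, chunk ids, scores, latency, and errors, is enough to answer "what did it retrieve and why" after the fact.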

If I started over tomorrow

  • Design the taxonomy and metadata schema first — before any embedding.
  • Build the golden set early. Even thirty questions changes how you work.
  • Treat error codes and part numbers as their own retrieval case (hybrid / exact).
  • Keep the agent on a leash: routing and constraints, not free rein.
  • Ship observability earlier than feels necessary.

Closing

I’m still genuinely excited about agentic RAG. I just have a much more boring definition of it now. Not “autonomous AI.” More like: good routing, tight retrieval, real metadata, early evals, and a system that’s willing to say “I’m not sure” instead of guessing.

The agent was never the hard part. Everything underneath it was.

This article was made with the help of AI, based on notes I kept throughout the project.