project brief

Text-to-SQL Is Harder Than It Looks

Text-to-SQL demos beautifully and breaks quietly. The model nails the syntax — then trips on ambiguity, your schema, and answers that look right but aren't. A field guide to the gap between the demo and a number you'd actually trust.

May 30, 2026Text-to-SQLAnalyticsAgents

Same cleaning-robot company as before. Once the support side was working, the ops team asked the obvious next thing: could they just ask the data questions in English instead of waiting on me to build a dashboard for every one? “How many robots were down last month?” “What’s our average repair time?” “Which customers cost us the most in maintenance?”

Text-to-SQL looked like a solved problem. You paste the schema, the model writes the query, you run it. In a demo it’s genuinely magic. In front of people making decisions off the number, it was a different story — and the failures were quiet ones.

Here’s the warehouse behind every example below — a small, very normal analytics schema:

robots(robot_id, model, site_id, deployed_at, status)
sites(site_id, name, region, customer_id)
tickets(ticket_id, robot_id, opened_at, closed_at, severity, status)
maintenance(maint_id, robot_id, performed_at, cost, part_number)
customers(customer_id, name, tier)

1. The question is ambiguous; the SQL is not

Natural language is underspecified. SQL is exact. Text-to-SQL is the act of forcing one into the other, and every gap gets filled by a silent assumption. “How many robots were down last month?” sounds precise. It isn’t. What counts as “down” — the current status, or anything that threw a high-severity ticket, or anything that threw any ticket? Is “last month” the previous calendar month, or the trailing thirty days?

query ambiguity

question: “how many robots were down last month?”

“down” =

“last month” =

SELECT COUNT(*)
FROM robots
WHERE status = 'down';   -- "last month" never made it into the query

answer7robotstime window dropped

▸ each reading is valid SQL; the model just picks one and never tells you which

Every one of those readings is correct SQL. They just answer different questions, and the spread isn’t small — the same sentence ranges from 7 to 34. The model doesn’t know which you meant, so it commits to one and writes confident SQL for it. The fix isn’t a smarter model; it’s refusing to let it guess — pin the definitions, or make the agent ask before it answers.

2. The model doesn’t know your warehouse

Even with the question nailed down, the model knows column names, not what they mean. It has never seen how your business actually defines a metric. Ask for “average repair time” and it will write something perfectly reasonable and subtly off.

schema grounding

question: “what’s our average repair time?”

SELECT AVG(closed_at - opened_at) AS avg_repair_time
FROM tickets;

schema-blind

9.2h

governed metric

14.6h

▸ averages every ticket — trivial ones close fast and drag the number down by ~37%

The schema-blind version averages every ticket — including the trivial “how do I restart it” ones that close in minutes — and quietly understates the real repair time. The model had no way to know that “repair” excludes those, or that open tickets shouldn’t count. That knowledge lives in a semantic layer (a metrics layer, dbt metrics, a governed view — pick your flavour): you define repair_time once, and every query inherits the definition instead of re-deriving it. This is the single highest-leverage thing you can put between an LLM and a warehouse.

3. SQL that runs and lies

This is the one that kept me up. A query can be syntactically perfect, execute without a single error, return a clean table of numbers — and be wrong. There’s no exception to catch, no red squiggle. Someone just makes a decision off a bad number.

The classic is the fan-out: join a one-to-many relationship into an aggregate and your rows get silently multiplied.

silent wrongness

question: “total maintenance cost per customer”

SELECT c.name, SUM(m.cost) AS total_cost
FROM customers c
JOIN sites s        ON s.customer_id = c.customer_id
JOIN robots r       ON r.site_id = s.site_id
JOIN maintenance m  ON m.robot_id = r.robot_id
JOIN tickets t      ON t.robot_id = r.robot_id   -- ⚠ fans out every cost row
GROUP BY c.name;

customerreported cost

Acme$3,200

Globex$7,200

Initech$3,000

Acme has 1 repair ($800) but 4 tickets → the join repeats that cost row 4× = $3,200.

▸ the query runs, returns clean numbers, and is wrong — no error to catch it

Maintenance cost gets duplicated once per ticket, every total inflates, and the result looks completely plausible — costs are positive, the biggest customer is still the biggest. Nobody notices until finance does. This is exactly why the serious benchmarks measure execution accuracy — whether the query returns the right result, not whether it’s valid SQL. On BIRD, over real, messy databases, top systems sit in the low 80s on execution accuracy while humans hit ~93%. The gap isn’t syntax. It’s being right.

If I started over tomorrow

Treat the semantic layer as a prerequisite, not a nice-to-have — it fixes ambiguity and grounding at once.
Make the agent ask a clarifying question instead of guessing whenever a term is underspecified.
Measure execution accuracy against a golden set, never “does it produce valid SQL.”
Lint generated SQL for fan-out and missing filters before anyone sees the result.
Always surface the SQL and the assumptions next to the answer. Trust is the product.

Closing

The model writing SQL was never the hard part — that’s the part that demos well. The hard part is everything that makes a number trustworthy: a shared definition of what was asked, grounding in what the data actually means, and a way to catch the answers that are confidently wrong. Text-to-SQL isn’t a translation problem. It’s a trust problem wearing a translation problem’s clothes.

Notes & sources

Li et al., BIRD — text-to-SQL over real, messy databases; introduces execution accuracy and the large human-vs-model gap.
Yu et al., Spider — the cross-domain text-to-SQL benchmark that started this line of work.

This article was made with the help of AI, based on notes I kept throughout the project, and research I did before and during the project. The company and schema, and demo numbers are illustrative.