project brief
Text-to-SQL Is Harder Than It Looks
Text-to-SQL demos beautifully and breaks quietly. The model nails the syntax — then trips on ambiguity, your schema, and answers that look right but aren't. A field guide to the gap between the demo and a number you'd actually trust.
Same cleaning-robot company as before. Once the support side was working, the ops team asked the obvious next thing: could they just ask the data questions in English instead of waiting on me to build a dashboard for every one? “How many robots were down last month?” “What’s our average repair time?” “Which customers cost us the most in maintenance?”
Text-to-SQL looked like a solved problem. You paste the schema, the model writes the query, you run it. In a demo it’s genuinely magic. In front of people making decisions off the number, it was a different story — and the failures were quiet ones.
Here’s the warehouse behind every example below — a small, very normal analytics schema:
robots(robot_id, model, site_id, deployed_at, status) sites(site_id, name, region, customer_id) tickets(ticket_id, robot_id, opened_at, closed_at, severity, status) maintenance(maint_id, robot_id, performed_at, cost, part_number) customers(customer_id, name, tier)
1. The question is ambiguous; the SQL is not
Natural language is underspecified. SQL is exact. Text-to-SQL is the act of forcing one into the other, and every gap gets filled by a silent assumption. “How many robots were down last month?” sounds precise. It isn’t. What counts as “down” — the current status, or anything that threw a high-severity ticket, or anything that threw any ticket? Is “last month” the previous calendar month, or the trailing thirty days?
SELECT COUNT(*) FROM robots WHERE status = 'down'; -- "last month" never made it into the query
▸ each reading is valid SQL; the model just picks one and never tells you which
Every one of those readings is correct SQL. They just answer different questions, and the spread isn’t small — the same sentence ranges from 7 to 34. The model doesn’t know which you meant, so it commits to one and writes confident SQL for it. The fix isn’t a smarter model; it’s refusing to let it guess — pin the definitions, or make the agent ask before it answers.
2. The model doesn’t know your warehouse
Even with the question nailed down, the model knows column names, not what they mean. It has never seen how your business actually defines a metric. Ask for “average repair time” and it will write something perfectly reasonable and subtly off.
SELECT AVG(closed_at - opened_at) AS avg_repair_time FROM tickets;
▸ averages every ticket — trivial ones close fast and drag the number down by ~37%
The schema-blind version averages every ticket — including the trivial “how do I restart it” ones that close in minutes — and quietly understates the real repair time. The model had no way to know that “repair” excludes those, or that open tickets shouldn’t count. That knowledge lives in a semantic layer (a metrics layer, dbt metrics, a governed view — pick your flavour): you define repair_time once, and every query inherits the definition instead of re-deriving it. This is the single highest-leverage thing you can put between an LLM and a warehouse.
3. SQL that runs and lies
This is the one that kept me up. A query can be syntactically perfect, execute without a single error, return a clean table of numbers — and be wrong. There’s no exception to catch, no red squiggle. Someone just makes a decision off a bad number.
The classic is the fan-out: join a one-to-many relationship into an aggregate and your rows get silently multiplied.
SELECT c.name, SUM(m.cost) AS total_cost FROM customers c JOIN sites s ON s.customer_id = c.customer_id JOIN robots r ON r.site_id = s.site_id JOIN maintenance m ON m.robot_id = r.robot_id JOIN tickets t ON t.robot_id = r.robot_id -- ⚠ fans out every cost row GROUP BY c.name;
▸ the query runs, returns clean numbers, and is wrong — no error to catch it
Maintenance cost gets duplicated once per ticket, every total inflates, and the result looks completely plausible — costs are positive, the biggest customer is still the biggest. Nobody notices until finance does. This is exactly why the serious benchmarks measure execution accuracy — whether the query returns the right result, not whether it’s valid SQL. On BIRD, over real, messy databases, top systems sit in the low 80s on execution accuracy while humans hit ~93%. The gap isn’t syntax. It’s being right.
If I started over tomorrow
- Treat the semantic layer as a prerequisite, not a nice-to-have — it fixes ambiguity and grounding at once.
- Make the agent ask a clarifying question instead of guessing whenever a term is underspecified.
- Measure execution accuracy against a golden set, never “does it produce valid SQL.”
- Lint generated SQL for fan-out and missing filters before anyone sees the result.
- Always surface the SQL and the assumptions next to the answer. Trust is the product.
Closing
The model writing SQL was never the hard part — that’s the part that demos well. The hard part is everything that makes a number trustworthy: a shared definition of what was asked, grounding in what the data actually means, and a way to catch the answers that are confidently wrong. Text-to-SQL isn’t a translation problem. It’s a trust problem wearing a translation problem’s clothes.
Notes & sources
- Li et al., BIRD — text-to-SQL over real, messy databases; introduces execution accuracy and the large human-vs-model gap.
- Yu et al., Spider — the cross-domain text-to-SQL benchmark that started this line of work.
This article was made with the help of AI, based on notes I kept throughout the project, and research I did before and during the project. The company and schema, and demo numbers are illustrative.