What is execution accuracy?

Execution accuracy is the share of generated queries that produce the same result as the correct reference query when both run against the database. It is the standard metric because it accepts any SQL that returns the right answer, even if written differently.

What is the difference between Spider and BIRD?

Spider tests clean academic schemas. BIRD uses larger, messier real-world databases (12,751 question-SQL pairs across 95 databases and 37+ domains) whose questions often need external knowledge. Scores on BIRD are lower, and because its databases look more like production data, it is the better proxy for how text-to-SQL performs on your data.

Is 90% accuracy good enough for production?

Often not, for self-service. As one analysis argued, an end user with no validation layer who hits one hallucinated table or wrong filter loses trust in the whole system. The fix is guardrails: read-only access, visible SQL, clear definitions, and a human check on important queries.

How do I make text-to-SQL more accurate in practice?

Give it clear schema and column names, define business terms like active user, ask specific questions, and read the generated SQL before trusting the result. A tool that retrieves the right tables, validates queries, and shows you the SQL closes most of the gap.

How Accurate Is Text-to-SQL? Benchmarks | Sequel

Q: How accurate is text-to-SQL?

On the clean Spider benchmark, the best systems have passed 90% execution accuracy. On the harder, real-world BIRD benchmark, the top system reached about 82% in late 2025 and many popular models score 50-60%. Real-world accuracy sits between the two and depends heavily on schema quality and question complexity.

The headline you have seen is "text-to-SQL is over 90% accurate." It is true on one benchmark and misleading for your database. The real answer is more useful: very accurate on clean, simple questions, noticeably weaker on complex queries over messy data, and the gap is where every production headache lives.

This piece explains what the accuracy numbers mean, why a high score can still be useless, and how to get reliable results anyway. It is the evidence behind our text-to-SQL guide.

What "accuracy" actually measures

When researchers report text-to-SQL accuracy, they almost always mean execution accuracy. That is the share of generated queries that, when run, return the same result as the correct reference query.

Execution accuracy is the right metric because SQL has many correct forms. Two queries can look different and return identical results. Comparing the output, not the text, handles those equivalent formulations. Keep one thing in mind: "90% accurate" means 9 in 10 queries returned the right answer on a test set. It does not mean 9 in 10 will on yours.
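To make the definition concrete, here is a minimal sketch of an execution-accuracy check using Python's built-in sqlite3 module. The `users` table, the three query pairs, and the helper names are invented for illustration, and the official Spider and BIRD evaluators differ in their details (for example in how they treat duplicates and row order).

```python
import sqlite3
from collections import Counter


def run(conn, sql):
    """Run a query and return its rows as a multiset, so row order is ignored."""
    try:
        return Counter(conn.execute(sql).fetchall())
    except sqlite3.Error:
        return None  # a query that fails to execute counts as wrong


def execution_accuracy(conn, pairs):
    """pairs: list of (generated_sql, reference_sql). Returns the share that match."""
    hits = 0
    for generated, reference in pairs:
        expected = run(conn, reference)
        if expected is not None and run(conn, generated) == expected:
            hits += 1
    return hits / len(pairs)


# Hypothetical toy database, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, plan TEXT);
    INSERT INTO users (plan) VALUES ('free'), ('pro'), ('pro');
""")

pairs = [
    # Different text, same result: counted as correct.
    ("SELECT COUNT(*) FROM users WHERE plan = 'pro'",
     "SELECT SUM(plan = 'pro') FROM users"),
    # Wrong filter: counted as incorrect.
    ("SELECT COUNT(*) FROM users WHERE plan = 'free'",
     "SELECT COUNT(*) FROM users WHERE plan = 'pro'"),
    # Hallucinated table: fails to run, counted as incorrect.
    ("SELECT COUNT(*) FROM customers",
     "SELECT COUNT(*) FROM users"),
]
score = execution_accuracy(conn, pairs)
print(score)
```

Only the first pair matches, so the sketch scores one in three. The score says nothing about how the same generator would do on a different schema, which is the caveat above.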

Spider vs BIRD: the gap that tells the story

Two benchmarks dominate the field, and the distance between them is the most important fact about text-to-SQL accuracy.

Benchmark	What it tests	Where scores land
Spider	Clean, academic schemas	Over 90% in recent years
BIRD	95 large, messy real-world databases, 37+ domains	Top system ~82% (late 2025); many models 50-60%

Spider execution accuracy climbed past 90%, which is where the optimistic headline comes from. BIRD, introduced to be harder and more realistic, sits lower because its data is messy and its questions often require external knowledge or depend on specific values in the data. The top entry reached 81.67% on the test set in late 2025, still well short of the 92.96% human-expert baseline, and most models score lower still.

The BIRD-SQL leaderboard: the human-expert baseline (92.96%) still sits above the best systems (~82%).

Your production database looks like BIRD, not Spider. So treat the 50-80% range as the honest anchor for hard, multi-join analytics.

@ttunguz on the gap between benchmark math scores and real database queries; more in his Spider 2.0 write-up.

Why benchmarks overstate real-world accuracy

Even BIRD is kinder than your data, for two reasons.

First, benchmark schemas are curated. Real schemas have cryptic column names, undocumented tables, and overlapping definitions that confuse generation.

Second, the benchmarks have their own errors. Researchers who re-audited BIRD found annotation errors in 52.8% of the examples they checked, meaning some reported scores are partly measuring noise in the benchmark's own gold answers. The takeaway is not "text-to-SQL is unreliable." It is "trust it more on simple questions and verify it on complex ones."

Why 90% can still be "useless"

Here is the contrarian case worth taking seriously. In a sharp Towards Data Science essay, Gary Zavaleta argues that for self-service analytics, high-but-imperfect accuracy is the worst of both worlds:

"An accuracy of 80% or even 90% is, unfortunately, not enough. ... You cannot compromise on accuracy because it immediately erodes trust. And what happens when a system loses trust? It will not be used."

His point: an end user with no validation layer who hits a single hallucinated table or misread filter stops believing the tool entirely. "Achieving 90% accuracy might be academically interesting," he writes, "but in the enterprise, it is industrially useless." The lesson is not to abandon text-to-SQL. It is to design for the 10%, with guardrails and a human check where it counts.

Gary Zavaleta's Towards Data Science essay on why high-but-imperfect accuracy erodes trust in self-service analytics.

Where accuracy drops in practice

The failure modes are predictable, which makes them manageable. Data engineers describe them vividly. In a thread on whether AI-generated reporting can be trusted, one practitioner's experience:

"Anything with actual logic is super sketch with AI... it's always doing bad things with date inclusion ranges, making assumptions. Randomly inserting 500 fields."

u/FridayPush, r/dataengineering

An r/dataengineering discussion on the real-world failure modes of AI reporting.

The usual suspects:

Multi-table joins with ambiguous relationships
Date ranges and off-by-one boundary errors
Undefined business terms like "active" or "revenue"
Cryptic schemas with names the model cannot interpret
Missing filters, like forgetting to exclude test accounts

A simple aggregation rarely fails. A five-table join with a business-specific definition often needs a second look. Even Uber, after shipping QueryGPT, said that hallucinating tables and columns "remains an area that we are constantly working on."
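The date-range failure is easy to reproduce. The sketch below assumes a hypothetical `signups` table that stores timestamps as text in SQLite; other databases compare dates differently, but an inclusive upper bound on a timestamp column causes the same kind of silent undercount.

```python
import sqlite3

# Hypothetical table and rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE signups (id INTEGER PRIMARY KEY, created_at TEXT);
    INSERT INTO signups (created_at) VALUES
        ('2025-01-01 00:00:00'),
        ('2025-02-14 09:30:00'),
        ('2025-03-31 18:45:00'),
        ('2025-04-01 00:00:00');
""")

# Inclusive upper bound: the text '2025-03-31' sorts before any timestamp
# on that day, so the 18:45 signup on the last day of Q1 is dropped.
between_sql = """
    SELECT COUNT(*) FROM signups
    WHERE created_at BETWEEN '2025-01-01' AND '2025-03-31'
"""

# Half-open range: start inclusive, end exclusive.
half_open_sql = """
    SELECT COUNT(*) FROM signups
    WHERE created_at >= '2025-01-01' AND created_at < '2025-04-01'
"""

between_count = conn.execute(between_sql).fetchone()[0]
half_open_count = conn.execute(half_open_sql).fetchone()[0]
print(between_count, half_open_count)
```

Both queries run without error and both look reasonable at a glance, yet the first misses a Q1 signup. This is the kind of mistake that reading the generated SQL catches.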

How to get reliable results

The good news: most of the real-world gap closes with a few habits.

Give it clear schema. Readable table and column names lift accuracy more than any prompt trick. The model uses them as context, as we explain in how text-to-SQL works.
Define your terms. Tell the tool what "active user" means once, and stop it guessing.
Ask specific questions. "Weekly signups for Q1" beats "how is growth."
Read the SQL. A few seconds reading the query catches the wrong-join error. Knowing some SQL query optimization basics helps you spot trouble.
Keep it read-only. Accuracy is about correctness; safety is about access. A read-only user means a wrong query is a wrong answer, never a damaged table.

A tool that retrieves the right tables, validates queries, and shows you the SQL turns "I hope this is right" into "I can see it is right."

So, can you trust it?

For the everyday analytical questions that make up most of a team's data work, yes. For complex, high-stakes analytics over a messy schema, trust it as a fast first draft and verify the query.

That is exactly how engineering teams ship it. They pair generation with retrieval, validation, and human review of the hard queries, the pattern behind production systems like Uber's QueryGPT. Accuracy is not a number you wait for. It is something you design around.
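One way to picture the validation and review steps is a small triage function that runs before any generated query is executed. This is a hypothetical sketch, not how QueryGPT or any particular product works: the `triage` name, the join threshold, and the toy schema are all assumptions, and the text checks are deliberately crude.

```python
import re
import sqlite3


def triage(conn, sql, max_joins=2):
    """Return 'reject', 'review', or 'run' for a generated query.

    Hypothetical policy: reject anything that does not start with SELECT or
    WITH, or that the database cannot compile, and send join-heavy queries
    to a human reviewer.
    """
    statement = sql.strip().rstrip(";")
    # Crude text checks: one statement only, and it must look like a read.
    if ";" in statement or not re.match(r"(?i)(select|with)\b", statement):
        return "reject"
    try:
        # EXPLAIN compiles the query without running it, so unknown tables
        # and columns surface here as errors.
        conn.execute("EXPLAIN " + statement)
    except sqlite3.Error:
        return "reject"
    joins = len(re.findall(r"(?i)\bjoin\b", statement))
    return "review" if joins > max_joins else "run"


# Hypothetical schema, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, plan TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, total REAL);
    CREATE TABLE items (id INTEGER PRIMARY KEY, order_id INTEGER, sku TEXT);
    CREATE TABLE skus (sku TEXT PRIMARY KEY, category TEXT);
""")

simple = "SELECT plan, COUNT(*) FROM users GROUP BY plan"
hallucinated = "SELECT COUNT(*) FROM customers"
heavy = """
    SELECT s.category, SUM(o.total)
    FROM users u
    JOIN orders o ON o.user_id = u.id
    JOIN items i ON i.order_id = o.id
    JOIN skus s ON s.sku = i.sku
    GROUP BY s.category
"""
queries = (simple, hallucinated, heavy, "DROP TABLE users")
results = [triage(conn, q) for q in queries]
print(results)
```

The simple aggregation runs, the hallucinated table and the write attempt are rejected before execution, and the four-table join is routed to a person. A real system would likely use a proper SQL parser and a richer notion of "hard query" than a join count.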

Want text-to-SQL that retrieves the right tables, shows its work, and runs read-only by default? Get started free or compare the best text-to-SQL tools.

