How does a model turn English into SQL?

It treats SQL generation as a grounded translation task. Given your question plus your database schema (and often example queries), the model predicts the SQL that answers it. The schema grounds the output in your real tables rather than a generic guess.

Why does schema context matter so much?

Without your schema, the model does not know your table or column names, so it cannot write a query that runs. Production systems go further and retrieve the most relevant tables and example queries before generation, because real databases have far too many tables to send all at once.

How do systems pick the right tables?

Retrieval. Pinterest builds a vector index of table summaries and past queries and retrieves the top candidates. LinkedIn uses embedding-based retrieval plus a knowledge graph and an LLM re-ranker to narrow 20 candidates to 7. Uber uses an intent step to pick a business domain first.

What is query validation in text-to-SQL?

Validation checks the generated SQL before it runs: that it parses, references real columns, and is a read operation. Many systems add a correction loop. LinkedIn reported 80% of sessions use its 'Fix with AI' feature to repair errors automatically.

How does it handle different SQL dialects?

The model generates SQL for the target database, so Postgres, BigQuery, and Snowflake each get their own dialect. A tool that knows which database it is connected to produces dialect-correct SQL instead of generic SQL that fails on syntax.

How Text-to-SQL Works: LLMs and Schema | Sequel

Type "top 10 customers by revenue this year," and a correct SQL query appears. The model did not memorize your database. It translated your question using your schema as a map, after a retrieval step found the right tables. Understanding that machinery is how you get better answers out of any text-to-SQL tool.

This is the technical companion to our text-to-SQL guide. We follow the path from sentence to query, using the architectures Uber, Pinterest, and LinkedIn published, because they agree on what actually matters.

Text-to-SQL is a grounded translation task

At its core, text-to-SQL is translation. The source language is English. The target is SQL. A large language model is good at this because it has seen enormous amounts of both.

The catch is grounding. A translation of "revenue by region" is meaningless without knowing your tables are orders and regions and how they join. So the model never works from the question alone.

Schema context is the foundation

Before generation, the tool gathers context about your database and feeds it to the model alongside your question. That context typically includes:

Table names and what they hold
Column names and their data types
Primary and foreign keys that define joins
Sometimes a few sample values per column

This is why naming matters. A column called created_at tells the model far more than dt. Clear schemas produce clear SQL. Cryptic schemas produce confident, wrong SQL.

Retrieval: finding the right tables first

Here is the step that separates a demo from production. You cannot send a model thousands of tables. You have to find the handful that matter, and that is a retrieval problem.

Pinterest's system builds a vector index of table summaries and historical queries, embeds the user's question, retrieves the top N candidates, then uses an LLM to pick the top K. It returns those to the user for confirmation before generating SQL. Pinterest also found documentation quality decisive: search hit rate climbed from 40% to 90% as table documentation weight increased.

Pinterest's engineers describe a retrieval-first design in How we built Text-to-SQL at Pinterest.

LinkedIn's SQL Bot combines embedding-based retrieval with a knowledge graph of schemas, field descriptions, access patterns, and certified queries, then runs an LLM re-ranker to cut 20 candidate tables to 7. Uber's QueryGPT starts with an Intent Agent that routes the question to a business domain, then a Column Prune Agent that trims the schema to fit the context window.

Different designs, same insight: get the right tables in front of the model and most of the accuracy follows.

Generation: from question to query

With the question and the selected schema in hand, the model generates SQL. Say you ask: "What were the top 3 plans by revenue last month?"

Given a subscriptions table with plan, amount, and created_at, the model produces:

SELECT plan, SUM(amount) AS revenue
FROM subscriptions
WHERE created_at >= date_trunc('month', current_date - interval '1 month')
  AND created_at <  date_trunc('month', current_date)
GROUP BY plan
ORDER BY revenue DESC
LIMIT 3;

Notice what the model inferred: the right table, the date math for "last month," the grouping, and the limit. Uber improves this step with few-shot prompting, feeding the model a small set of similar example queries so the output matches house join patterns and style. We show how this fits the broader analyst pipeline in how AI data analysts work.

Uber's QueryGPT pairs intent routing with few-shot example queries to shape generation, detailed in QueryGPT – Natural Language to SQL Using Generative AI.

Validation and self-correction

A generated query is a draft, and drafts have bugs. Before the query touches your database, a careful tool validates it.

Check	What it confirms
Parse	The SQL is syntactically valid
Schema match	Referenced tables and columns exist
Operation type	It is a SELECT, not a write
Dialect	Syntax matches the target database

Production systems add a correction loop on top. LinkedIn reported that 80% of SQL Bot sessions use its "Fix with AI" feature, which feeds an execution error back to the model to repair the query. Generation plus self-correction beats generation alone.

LinkedIn's SQL Bot leans on a "Fix with AI" correction loop, described in Practical text-to-SQL for data analytics.

Execution, read-only

Validation passes, so the query runs. The most important control here is the connection itself. A read-only database user can run SELECT and nothing else, so even a flawed query cannot mutate data. Set this up with our read-only PostgreSQL user guide.

Results return for that question. They do not need to be stored anywhere permanent. The model saw your schema to write the query; it does not need to keep your rows afterward.

Handling dialects

SQL is not one language. date_trunc in Postgres, Snowflake's own date functions, and BigQuery's syntax all differ. A tool that knows which database it is connected to generates the right dialect.

This matters more than it sounds. Generic SQL that ignores dialect fails on syntax. Dialect-aware generation is part of why a connected tool beats pasting your schema into a general chatbot, and it is what lets a tool join across several systems in one question.

Why context engineering is the whole game

The pattern across Uber, Pinterest, and LinkedIn is the same one practitioners repeat on Hacker News. As one founder put it in the "Text-to-SQL is dead, long live text-to-SQL" thread, the promise "is not that it can deal with any arbitrarily complex enterprise setup, but rather that you expose it with enough guidance on a controlled and sufficiently good data model."

The model is the easy part now. Retrieval, schema quality, examples, and validation are where real systems win or lose. That said, generation is still imperfect, especially on complex joins over messy schemas. The numbers are in how accurate text-to-SQL is, and reading the generated SQL remains your best safeguard.

Want to see the schema-to-query path on your own database? Get started free or explore the features.

How Text-to-SQL Works: Schema Context, Retrieval, and Validation

Text-to-SQL is a grounded translation task

Schema context is the foundation

Retrieval: finding the right tables first

Generation: from question to query

Validation and self-correction

Execution, read-only

Handling dialects

Why context engineering is the whole game

Meet your always-on data analyst.

How to Turn Hermes Into Your Marketing Analyst

How to Make Your OpenClaw a Marketing Specialist

AI Agents for Marketing Analytics