# Text-to-SQL

## Overview

Send a natural language question and AutoDB returns a validated, executed SQL query with results formatted as a markdown table. The pipeline uses pgvector similarity search to retrieve relevant schema and document context, generates SQL via Claude Sonnet on AWS Bedrock, validates it with pglast and schema reference checks, and executes it in a read-only transaction against your database.
## Pipeline

The text-to-SQL endpoint runs a six-step pipeline:

1. **RAG context retrieval** — pgvector cosine similarity search pulls the most relevant schema chunks and business-context document chunks for your question.
2. **Prompt construction** — builds a structured prompt with the user question, retrieved schema chunks (with similarity scores), detailed column metadata and foreign keys, business context documents, and any previous failed attempts for self-correction.
3. **LLM generation** — calls Claude Sonnet via AWS Bedrock with structured output, producing SQL, a confidence score, reasoning, referenced tables, and warnings.
4. **Syntax validation** — parses the generated SQL with pglast. Syntax errors are terminal (no retry).
5. **Schema hallucination check** — walks the SQL AST to extract table and qualified column references, then compares them against the retrieved schema context. If the LLM invented tables or columns, it gets one retry with corrective feedback appended to the prompt.
6. **Execution** — runs the validated SELECT against your database in a rolled-back transaction with a 10-second statement timeout and a 500-row cap.
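The control flow above, including the single hallucination retry, can be sketched as follows. This is an illustrative reconstruction, not AutoDB's actual implementation; every function name here is a hypothetical stand-in passed in by the caller.

```python
# Hypothetical sketch of the six-step pipeline's control flow. The injected
# callables (retrieve_context, build_prompt, ...) are illustrative stand-ins.

def run_pipeline(question, retrieve_context, build_prompt, generate_sql,
                 parse_syntax, find_hallucinations, execute_readonly):
    context = retrieve_context(question)                      # step 1
    feedback = None
    for _attempt in range(2):                                 # at most one retry
        prompt = build_prompt(question, context, feedback)    # step 2
        result = generate_sql(prompt)                         # step 3
        if not parse_syntax(result["sql"]):                   # step 4: terminal
            return {"sql_valid": False, "error": "SQL_GENERATION_FAILED"}
        bad = find_hallucinations(result["sql"], context)     # step 5
        if not bad:
            return execute_readonly(result["sql"])            # step 6
        feedback = f"These references do not exist: {bad}"    # corrective feedback
    return {"sql_valid": False, "error": "SCHEMA_VALIDATION_FAILED"}
```

Note that syntax errors exit immediately, while only schema hallucinations loop back with feedback appended to the prompt, matching the retry policy described under Safety Guardrails.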
## Request

```bash
curl -X POST https://api.autodb.app/api/v1/connections/550e8400-e29b-41d4-a716-446655440000/queries/generate \
  -H "Content-Type: application/json" \
  -H "X-API-Key: YOUR_API_KEY" \
  -d '{
    "query": "Show me the top 10 customers by total order value in the last 30 days",
    "max_schema_chunks": 5,
    "max_doc_chunks": 3
  }'
```

| Field | Type | Default | Description |
|---|---|---|---|
| `query` | string | (required) | Natural language description of the query to generate. |
| `max_schema_chunks` | integer | 5 | Maximum schema chunks to retrieve via vector similarity search (1–20). |
| `max_doc_chunks` | integer | 3 | Maximum document chunks to include as business context (0–20). |
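For callers who prefer Python over curl, a minimal client sketch using only the standard library might look like this. The connection ID and API key are placeholders; substitute your own values.

```python
# Minimal stdlib-only client sketch for the generate endpoint.
import json
import urllib.request

BASE_URL = "https://api.autodb.app/api/v1"

def build_request(connection_id, question, api_key,
                  max_schema_chunks=5, max_doc_chunks=3):
    """Construct the POST request for /queries/generate."""
    body = json.dumps({
        "query": question,
        "max_schema_chunks": max_schema_chunks,
        "max_doc_chunks": max_doc_chunks,
    }).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/connections/{connection_id}/queries/generate",
        data=body,
        headers={"Content-Type": "application/json", "X-API-Key": api_key},
        method="POST",
    )

def generate_sql(connection_id, question, api_key):
    """Send the request and return the parsed JSON response body."""
    req = build_request(connection_id, question, api_key)
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```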
## Response

```json
{
  "success": true,
  "data": {
    "connection_id": "550e8400-e29b-41d4-a716-446655440000",
    "natural_language_query": "Show me the top 10 customers by total order value in the last 30 days",
    "sql": "SELECT c.id, c.name, SUM(o.total) AS total_value FROM customers c JOIN orders o ON o.customer_id = c.id WHERE o.created_at >= NOW() - INTERVAL '30 days' GROUP BY c.id, c.name ORDER BY total_value DESC LIMIT 10",
    "confidence": 0.92,
    "reasoning": "Found customers and orders tables with the required columns. Used a JOIN on customer_id FK. Filtered by created_at and grouped by customer for aggregation.",
    "referenced_tables": ["customers", "orders"],
    "warnings": [],
    "sql_valid": true,
    "hallucinated_tables": [],
    "hallucinated_columns": [],
    "schema_chunks_used": 2,
    "doc_chunks_used": 0,
    "execution_error": null,
    "markdown_output": "| id | name | total_value |\n| --- | --- | --- |\n| 42 | Acme Corp | 15230.00 |\n| 17 | Globex Inc | 12100.50 |\n\n_2 rows returned_"
  },
  "error": null
}
```

### Response Fields

| Field | Description |
|---|---|
| `sql` | Generated PostgreSQL SELECT statement. |
| `confidence` | LLM confidence score (0.0–1.0) based on how well the schema covers the intent. |
| `reasoning` | Explanation of which tables and columns were selected and why. |
| `referenced_tables` | Tables referenced in the generated SQL. |
| `warnings` | Warnings about schema gaps, ambiguities, or failures. |
| `sql_valid` | Whether the SQL passed pglast syntax validation and schema reference checks. |
| `hallucinated_tables` | Table names in the SQL not found in the retrieved schema context. |
| `hallucinated_columns` | Qualified column references (table.column) not found in the schema context. |
| `schema_chunks_used` | Number of schema chunks included in the LLM prompt. |
| `doc_chunks_used` | Number of document chunks included as business context. |
| `execution_error` | Error message if SQL execution failed. Null on success. |
| `markdown_output` | Query results as a markdown table on success, or an error summary on failure. |
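A consumer of these fields might render the result along these lines. This is a sketch, not prescribed usage; it assumes the failure body keeps the same `data.markdown_output` shape described under Error Handling, and the note wording is illustrative.

```python
# Sketch of consuming a parsed response body (`resp` is the decoded JSON dict).

def render_result(resp):
    """Return user-facing markdown for one text-to-SQL response."""
    if not resp.get("success"):
        # On failure, markdown_output carries a plain-English error summary.
        return resp["data"]["markdown_output"] if resp.get("data") else "Query failed."
    data = resp["data"]
    notes = []
    if data["warnings"]:
        notes.append("Warnings: " + "; ".join(data["warnings"]))
    if data["confidence"] < 0.5:
        notes.append("Low confidence: the schema may not fully cover this question.")
    return "\n\n".join([data["markdown_output"], *notes])
```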
## Confidence Score

The `confidence` field reflects how completely the retrieved schema covers the user's intent:

- **1.0** — all required tables and columns were found in the schema context
- **0.5–0.9** — most of the intent is covered, but some assumptions were made
- **Below 0.5** — the schema is insufficient to fully answer the question; check `warnings` for details
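These ranges translate directly into a small helper for client-side triage. The thresholds mirror the documented ranges; the advice strings are illustrative, not part of the API.

```python
# Maps the documented confidence ranges to illustrative guidance strings.

def confidence_advice(confidence):
    if confidence >= 1.0:
        return "full schema coverage"
    if confidence >= 0.5:
        return "mostly covered; review assumptions"
    return "schema insufficient; check warnings"
```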
## Error Handling

When the pipeline fails, the response includes `success: false` with a structured error code:

| Error Code | Meaning |
|---|---|
| `SCHEMA_VALIDATION_FAILED` | The generated SQL references tables or columns not in the retrieved schema, even after one retry. |
| `SQL_GENERATION_FAILED` | The LLM produced invalid or empty SQL. |
| `EXECUTION_FAILED` | The SQL passed validation but failed when executed against the database. |
On failure, `markdown_output` contains a plain-English error summary suitable for displaying to end users.
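A client might map these codes to its own guidance when it wants messaging beyond the built-in summary. The hint strings below are illustrative assumptions, not API-provided text.

```python
# Illustrative mapping from documented error codes to user-facing hints.

ERROR_HINTS = {
    "SCHEMA_VALIDATION_FAILED": (
        "The model referenced tables or columns that don't exist. "
        "Try rephrasing the question, or refresh the schema snapshot."
    ),
    "SQL_GENERATION_FAILED": "The model could not produce valid SQL for this question.",
    "EXECUTION_FAILED": "The SQL was valid but failed to run; see execution_error.",
}

def error_hint(code):
    return ERROR_HINTS.get(code, "Unknown error.")
```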
## Safety Guardrails

- **Read-only execution** — the query runs inside a transaction that is always rolled back. No writes can persist.
- **Statement timeout** — queries are capped at 10 seconds to prevent runaway execution.
- **Row cap** — a maximum of 500 rows is returned per query.
- **Hallucination detection** — a pglast AST visitor validates that every table and qualified column reference exists in the retrieved schema. CTE names are excluded from table validation to avoid false positives.
- **Retry policy** — only schema hallucinations get one retry with corrective feedback. Syntax errors and execution errors are terminal.
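The rollback and row-cap guardrails can be sketched as below. AutoDB executes against PostgreSQL (where the 10-second limit would be a statement timeout on the session); sqlite3 is used here only so the sketch runs without a database server, and the function name is illustrative.

```python
# Sketch of read-only execution with a row cap, using sqlite3 for portability.
import sqlite3

ROW_CAP = 500

def execute_readonly(conn, sql):
    """Run sql in a transaction that is always rolled back; cap returned rows."""
    cur = conn.cursor()
    try:
        return cur.execute(sql).fetchmany(ROW_CAP)
    finally:
        conn.rollback()  # even a write statement never persists
```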
## Context-Aware Generation

Text-to-SQL uses the same RAG pipeline as other agents. When you have uploaded business documents and run schema introspection, the pipeline retrieves the most relevant chunks to include in the LLM prompt. This helps the model understand table relationships, business terminology, and query patterns specific to your database.

The `schema_chunks_used` and `doc_chunks_used` fields tell you how much context was available for generation. For best results, ensure your connection has a recent schema snapshot and relevant business documentation uploaded.