
Let SQL Be the Judge

By Raja Patnaik

I wanted a DSPy + GEPA example that doesn’t rely on human labels and instead lets a system judge model outputs. In this post I walk through a small project that translates natural‑language questions into SQL for SQLite and then executes the generated SQL to score it. GEPA uses those scores to evolve the program.

Because many people run older GEPA builds, I wired the optimizer to a scalar metric wrapper (sql_metric_scalar) that satisfies the 5‑argument signature GEPA expects. You’ll still see rich feedback in the console (format errors, SQLite errors, result diffs), but GEPA optimizes on the numeric score only.

The full script lives on GitHub.


What problem are we solving?

  • Task: (schema, question) → SQL for a tiny bookstore database.
  • Judge: SQLite itself. If the SQL is safe, runs, and returns the expected result, the score is high.
  • Optimization: Use GEPA to mutate the DSPy program’s instructions until the average score improves.
  • Compatibility: I purposefully use a scalar metric so the code runs on older GEPA versions (no feedback_metric).

How to run

pip install dspy-ai gepa
export OPENAI_API_KEY=...    # if you’re using OpenAI backends
python dspy-gepa-sql-generator/nl2sql_gepa.py

By default we use openai/gpt-4o-mini as the student and openai/gpt-4o as the reflection LM. You can override with env vars:

export DSPY_STUDENT_MODEL=...
export DSPY_REFLECT_MODEL=...
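Under the hood this likely amounts to a couple of os.getenv calls; a minimal sketch (STUDENT_MODEL and REFLECT_MODEL are the names used in the snippets below):

import os

STUDENT_MODEL = os.getenv("DSPY_STUDENT_MODEL", "openai/gpt-4o-mini")
REFLECT_MODEL = os.getenv("DSPY_REFLECT_MODEL", "openai/gpt-4o")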

Architecture at a glance

  1. Models & GEPA import. I configure a student LM (the model that runs the program being optimized) and a reflection LM (used by GEPA internally to propose instruction edits). I also import GEPA from either the new or legacy DSPy path.
student_lm = dspy.LM(STUDENT_MODEL, temperature=0.2, max_tokens=800)
reflection_lm = dspy.LM(REFLECT_MODEL, temperature=0.8, max_tokens=2000)
dspy.configure(lm=student_lm)

try:
    from dspy import GEPA
except Exception:
    from dspy.teleprompt import GEPA
  2. SQLite world. I create authors, books, and sales in memory and insert a handful of rows per table. This keeps everything deterministic and cheap.
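A minimal sketch of that setup (the exact columns and sample rows here are illustrative, not the script's):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT, country TEXT);
CREATE TABLE books   (id INTEGER PRIMARY KEY, title TEXT, author_id INTEGER, price REAL);
CREATE TABLE sales   (id INTEGER PRIMARY KEY, book_id INTEGER, quantity INTEGER, sale_date TEXT);
INSERT INTO authors VALUES (1, 'Ann Author', 'USA'), (2, 'Bob Writer', 'UK');
INSERT INTO books   VALUES (1, 'First Book', 1, 12.99), (2, 'Second Book', 2, 9.99);
INSERT INTO sales   VALUES (1, 1, 3, '2024-01-10'), (2, 2, 5, '2024-02-02');
""")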
  3. Schema prompt. I turn the DB into a compact, LM‑friendly string: TABLE … (columns) plus 1–2 EXAMPLE_ROWS per table. That’s what the program sees at inference time.
def describe_schema(conn, sample_rows=2) -> str:
    # PRAGMA table_info + a couple of rows per table → one compact string
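Expanded, that helper might look roughly like this (a sketch matching the description above; the script's exact formatting may differ):

def describe_schema(conn, sample_rows=2) -> str:
    cur = conn.cursor()
    tables = [r[0] for r in cur.execute(
        "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
    lines = []
    for t in tables:
        cols = [row[1] for row in cur.execute(f"PRAGMA table_info({t})")]
        lines.append(f"TABLE {t} ({', '.join(cols)})")
        for row in cur.execute(f"SELECT * FROM {t} LIMIT {sample_rows}"):
            lines.append(f"  EXAMPLE_ROW {row}")
    return "\n".join(lines)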
  4. DSPy program. A Signature forces the model to output exactly one SELECT/WITH statement. The Module uses Chain‑of‑Thought internally but only surfaces the SQL string.
class NL2SQL(dspy.Signature):
    schema = dspy.InputField(desc="...")
    question = dspy.InputField(desc="...")
    sql = dspy.OutputField(desc="Only ONE statement... Return only the SQL.")

class NL2SQLProgram(dspy.Module):
    def __init__(self):
        super().__init__()
        self.generate = dspy.ChainOfThought(NL2SQL)

    def forward(self, schema, question):
        return self.generate(schema=schema, question=question)
  5. Tasks with gold SQL (headroom matters). I include straightforward questions and a few “harder” ones that stress ordering, aliases, top‑K, and result shape. For gold tasks, I precompute expected rows/columns by running the gold SQL.
def build_examples(conn, schema_text):
    # For each (question, optional_gold_sql):
    #   expected = {"columns": cols, "rows": rows, "ordered": bool("ORDER BY" in gold)}
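Concretely, the gist is something like this (the task list here is illustrative; the script includes more questions, some without gold SQL):

def build_examples(conn, schema_text):
    tasks = [
        ("How many books are there?", "SELECT COUNT(*) FROM books;"),
        ("List the distinct author countries.", None),  # no gold -> partial credit only
    ]
    examples = []
    for question, gold_sql in tasks:
        expected = None
        if gold_sql:
            cur = conn.execute(gold_sql)
            cols = [d[0] for d in cur.description]
            expected = {"columns": cols, "rows": cur.fetchall(),
                        "ordered": "ORDER BY" in gold_sql.upper()}
        ex = dspy.Example(schema=schema_text, question=question,
                          gold_sql=gold_sql, expected=expected)
        examples.append(ex.with_inputs("schema", "question"))
    return examples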

The metric: format → execute → correctness (+ gentle penalties)

I designed a single metric that returns both a score (0..1) and feedback text for humans to read:

  • Format & safety (+0.4). Must be a single SELECT/WITH, no DDL/DML/PRAGMA.
  • Execution (+0.3). Query runs without SQLite errors.
  • Correctness (+0.3). If we have gold, exact rows + columns must match. Otherwise I give partial credit.

To create optimization headroom, I add small penalties (−0.05 each) if the question implies a behavior but the SQL misses it:

  • Question asks for ordering → require ORDER BY (and correct ASC/DESC).
  • “Top K” → require LIMIT K.
  • “Distinct” → require DISTINCT.
  • “Single number” → exactly one row & one column.

Two small helpers keep outputs clean:

FORBIDDEN = re.compile(r"\b(INSERT|UPDATE|DELETE|DROP|ALTER|PRAGMA|...)\b", re.I)

def _clean_sql(text: str) -> str:
    # Strip code fences, keep only the first statement, ensure a trailing ';'.
    text = text.replace("```sql", "").replace("```", "").strip()
    first = text.split(";")[0].strip()
    return first + ";"
Why penalties? They turn vague natural‑language requests into explicit constraints the model must satisfy, and they keep baseline scores below the ceiling so GEPA has something to optimize.
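For concreteness, here is a condensed sketch of how the metric can be laid out. It assumes the module‑level conn from the setup sketch above and the expected dict from build_examples, and it shows only two of the penalty checks; the script's sql_metric is more thorough:

def sql_metric(gold, pred, trace=None, pred_name=None, pred_trace=None):
    sql = _clean_sql(pred.sql)
    score, notes = 0.0, []
    # Format & safety (+0.4): a single SELECT/WITH, nothing on the forbidden list.
    if not sql.upper().startswith(("SELECT", "WITH")) or FORBIDDEN.search(sql):
        return {"score": 0.0, "feedback": "Not a single safe SELECT/WITH statement."}
    score += 0.4
    # Execution (+0.3): the query must run against the in-memory database.
    try:
        cur = conn.execute(sql)
        rows, cols = cur.fetchall(), [d[0] for d in cur.description]
        score += 0.3
    except sqlite3.Error as e:
        return {"score": score, "feedback": f"SQLite error: {e}"}
    # Correctness (+0.3): exact rows + columns when gold exists, else partial credit.
    expected = getattr(gold, "expected", None)
    if expected is None:
        score += 0.15
    elif cols == expected["columns"] and rows == expected["rows"]:
        score += 0.3
    else:
        notes.append("result mismatch vs. gold")
    # Gentle penalties (-0.05 each) for missed, implied constraints.
    question = gold.question.lower()
    if "top" in question and "LIMIT" not in sql.upper():
        score -= 0.05
    if "distinct" in question and "DISTINCT" not in sql.upper():
        score -= 0.05
    return {"score": max(0.0, min(1.0, score)), "feedback": "; ".join(notes) or "ok"}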


Older‑build compatibility: sql_metric_scalar

My metric returns a dict ({"score": float, "feedback": str}), which older GEPA builds can’t consume directly. The fix is a 5‑argument scalar wrapper:

def sql_metric_scalar(gold, pred, trace=None, pred_name=None, pred_trace=None):
    result = sql_metric(gold, pred, trace=trace, pred_name=pred_name, pred_trace=pred_trace)
    return float(result["score"])

I pass metric=sql_metric_scalar to GEPA. If your local GEPA exposes feedback_metric/feedback_producer, the script will attach the rich sql_metric automatically; otherwise it prints a warning and proceeds in scalar‑only mode.
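The capability check itself is nothing fancy. A sketch of how you could detect it, assuming the newer API accepts a feedback_metric keyword (older builds simply won't have it):

import inspect

if "feedback_metric" in inspect.signature(GEPA.__init__).parameters:
    gepa = GEPA(metric=sql_metric_scalar, feedback_metric=sql_metric,
                auto="medium", reflection_lm=reflection_lm)
else:
    print("⚠️ GEPA build does not expose 'feedback_metric'/'feedback_producer'. "
          "Optimization will use the scalar metric only")
    gepa = GEPA(metric=sql_metric_scalar, auto="medium", reflection_lm=reflection_lm)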


The training loop I run

  • Split: first ~8 examples for train (GEPA search), the rest for dev (reporting).
  • Baseline: run the program over dev, print SQL, score, and feedback.
  • GEPA: call compile() with auto="medium" and the scalar metric.
  • Post‑GEPA: re‑run over dev and print the same diagnostics.
  • Before/After sample: I dump one example’s SQL before and after to make changes obvious.

Pseudocode view (actual code is in the script):

program = NL2SQLProgram()

# Baseline
for ex in devset:
    pred = program(schema=ex.schema, question=ex.question)
    print(_clean_sql(pred.sql), sql_metric(ex, pred))

# GEPA (scalar metric)
gepa = GEPA(metric=sql_metric_scalar, auto="medium", reflection_lm=reflection_lm, ...)
optimized_program = gepa.compile(program, trainset=trainset, valset=devset)

# Post‑GEPA
for ex in devset:
    pred = optimized_program(schema=ex.schema, question=ex.question)
    print(_clean_sql(pred.sql), sql_metric(ex, pred))

What I typically observe

  • Baseline mean score around 0.90–0.95 with the stricter checks.
  • On older GEPA builds, you’ll see:
⚠️ GEPA build does not expose 'feedback_metric'/'feedback_producer'. Optimization will use the scalar metric only

That’s expected and harmless: the optimizer still runs against the numeric score.

  • Post‑GEPA movement depends on headroom, budget, and the student model. If the baseline is near the ceiling, try:
      • Moving one or two hard tasks into the trainset and keeping the rest in the devset.
      • Increasing the budget to auto="heavy".
      • Using a smaller student model while keeping a stronger reflection LM.

Extending the idea

  • Other tools: swap SQLite for a JSON Schema validator, linter, REST API, or simulator. Same metric pattern.
  • Multi‑objective: combine correctness with token/latency cost (e.g., penalize long queries) to explore trade‑offs; a sketch follows this list.
  • Richer shape checks: enforce exact column names/aliases, or require specific numeric precision.
  • Harder schemas: add publishers/inventory, window functions, or date math to stress generalization.
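As an example, the multi‑objective idea is just a thin wrapper around the existing metric; a sketch with an arbitrary length penalty:

def sql_metric_with_cost(gold, pred, trace=None, pred_name=None, pred_trace=None):
    base = sql_metric_scalar(gold, pred, trace=trace, pred_name=pred_name, pred_trace=pred_trace)
    length_penalty = min(0.1, len(_clean_sql(pred.sql)) / 5000)  # cap the penalty at 0.1
    return max(0.0, base - length_penalty)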

Key takeaways

  • Letting systems judge model outputs is a powerful alternative to static labels.
  • A strict but fair metric (format → execute → correctness + small penalties) gives GEPA useful signal.
  • On older GEPA, a 5‑arg scalar wrapper keeps everything running smoothly, while you still print rich feedback for humans.

If you want to explore feedback‑aware GEPA later, the script already includes a sql_metric_dual helper you can switch to after upgrading. For now, the scalar path keeps the demo portable and reproducible.