Skip to content

dbt PR Review — real-world issue corpus

A living, sourced corpus of real dbt/SQL pitfalls practitioners report (dbt Discourse/Slack, Reddit, StackOverflow, engineering blogs), each turned into a reproducible test. Every entry is a problem someone actually hit — not a synthetic example — and the reviewer is held to catching it with zero false positives on the correct version.

Reproduce: bun run --conditions=browser script/review-realworld-eval.ts (detection + precision) and script/review-dialect-coverage.ts (10-dialect coverage). Verified end-to-end on a real third-party repo (jaffle_shop, DuckDB).

Latest measured results

  • Caught: 15/15 sourced pitfalls
  • False positives on the correct version: 0/5
  • Dialect coverage (10 dialects): function precision 100%, recall 95.7%, structural/type 100%
  • Real repo (jaffle_shop PR): every finding tied to the change; no noise about pre-existing code (diff-scoped)

The corpus

# Pitfall Where practitioners report it What the reviewer emits
1 LEFT JOIN silently becomes INNER — a filter on the right table in WHERE drops the unmatched rows sqlbenjamin.wordpress.com "LEFT JOINs and WHERE clauses", SQLShack, Toad forum join_risk (critical) — move the predicate to the ON clause
2 Fan-out — one-to-many join inflates SUM/COUNT dbt join-logic docs, Holistics fan-out docs fanout / join_risk — aggregate before joining, or count(distinct …)
3 NOT IN (subquery) with NULLs returns no rows classic SQL gotcha (StackOverflow) sql_correctness — use NOT EXISTS
4 Incremental model with no is_incremental() guard → full reprocess / dupes dbt incremental-models docs materialization
5 Dedup row_number() with no ORDER BY → which row survives flaps between runs dbt "remove partial duplicates" blog dedup (warning)
6 Clock (current_timestamp/getdate) baked into a transform → non-idempotent dbt Slack / idempotency guidance idempotency
7 SELECT * in a mart → breaks downstream on upstream schema change; scan cost dbt style guides warehouse_cost / sql_quality
8 = NULL instead of IS NULL → always false classic SQL gotcha sql_correctness
9 Division with no zero-guard → divide-by-zero failures dbt Slack (safe_divide/nullif) sql_correctness (core L032)
10 Non-portable function for the project's dialect (e.g. NVL on BigQuery) cross-warehouse migration pain (dbt Discourse, SQLGlot) sql_quality (core L033, dialect-aware)
11 Comma / implicit cross join SQL joins tutorials, cartesian-product warnings join_risk
12 Unguarded COUNT(DISTINCT) at scale → cost BigQuery/Snowflake cost threads (approx_count_distinct) warehouse_cost
13 BETWEEN on a timestamp drops the last day's afternoon (inclusive upper bound → 00:00:00) SO "Exclude rows with certain time of day" sql_correctness — use half-open >= / <
14 String \|\| concat NULL-propagation — any NULL operand → whole result NULL Baeldung "Concatenate with NULL Values in SQL" sql_correctness — use concat_ws/coalesce
15 Hand-rolled surrogate key over raw concat — NULL field nulls the key / NULL-vs-'' collisions dbt-utils #488, dbt Discourse surrogate-key sql_correctness / dedup

Plus the incremental unique_key with NULL components → duplicate rows class (dbt Discourse #17298, dbt-core #7597) is covered by the incremental + surrogate-key detectors.

Precision — the correct versions stay silent

The corpus pairs each pitfall with its fix and asserts no finding: a LEFT JOIN anti-join (WHERE right.key IS NULL), nullif-guarded division, NOT EXISTS, row_number() with an ORDER BY, and a dialect's native function (e.g. NVL on Snowflake). Several real false positives were found and removed via this corpus (blanket SAFE_CAST nag, ARRAY_AGG-without-WITHIN GROUP on the wrong dialect, ORDER BY DESC without NULLS LAST, and OLTP/index rules like correlated-subquery / function-on-filter that don't apply to columnar warehouses).

The bar: a first-time user opening a PR should see findings that map exactly to what they changed, are correct for their warehouse's dialect, and never nag about code they didn't touch.