Data Parity (Table Diff)

Validate that two tables — or two query results — are identical across databases, or diagnose exactly how they differ. Use for migration validation, ETL regression, and query refactor verification.

altimate-code ships a dedicated data_diff tool and a data-parity skill that orchestrates the full workflow: plan, inspect schema, confirm keys, profile, then diff.

Supported warehouse pairs

Works across any combination of:

PostgreSQL
Snowflake
BigQuery
Databricks (SQL Warehouses)
ClickHouse
MySQL
Redshift
SQL Server
Microsoft Fabric
DuckDB
SQLite
Oracle

Same-dialect comparisons use a fast FULL OUTER JOIN. Cross-database comparisons use a bisection hashing algorithm that streams checksums rather than raw rows — so you can diff a 100M-row Postgres table against its Snowflake replica without pulling the data out.

Quick start

altimate

In the TUI, just describe what you want to compare:

Compare orders in postgres_prod with orders in snowflake_dw using id as the primary key.

The agent will:

List your warehouse connections.
Inspect both schemas, propose primary keys, and flag audit/timestamp columns to exclude.
Confirm your choices.
Run a column profile first (cheap — no row scan).
Run the row-level diff only on columns that diverged.

Algorithms

Algorithm	When to use	Cost
`auto`	Default. Picks JoinDiff for same-dialect, HashDiff for cross-database.	Cheapest valid choice
`joindiff`	Same-database comparison. Fast.	One FULL OUTER JOIN
`hashdiff`	Cross-database. Works at any scale.	Bisection over checksums
`profile`	Compliance-safe. Column stats only — no row values leave the database.	Cheapest
`cascade`	Profile first, then HashDiff on columns that diverged. Balanced default for exploratory diffs.	Column stats + targeted row diff

Partitioning large tables

For tables beyond ~10M rows, partition the diff into independent batches:

Compare orders between postgres and snowflake, partitioned by order_date month.

Three partition modes:

Mode	How to trigger	Example
Date	Set `partition_column` + `partition_granularity`	`l_shipdate` + `month`
Numeric	Set `partition_column` + `partition_bucket_size`	`l_orderkey` + `100000`
Categorical	Set `partition_column` alone (no granularity/bucket)	`region`, `status`, `country`

Each partition is diffed independently. Results are aggregated with a per-partition breakdown so you can see which groups have differences.

SQL Server and Microsoft Fabric

Both sqlserver and fabric are supported. For Azure AD / Entra ID authentication, altimate-code recognizes all of the major flows through tedious:

`authentication`	Config fields	Use case
`azure-active-directory-password`	`azure_client_id`, `azure_tenant_id`, `user`, `password`	User credentials
`azure-active-directory-access-token` (or `access-token`)	`access_token`	Pre-fetched token
`service-principal-secret` (`service-principal`)	`azure_tenant_id`, `azure_client_id`, `azure_client_secret`	Service principals
`azure-active-directory-msi-vm` (`msi`)	`azure_client_id` (optional)	Azure VM managed identity
`azure-active-directory-msi-app-service`	`azure_client_id` (optional)	App Service managed identity
`azure-active-directory-default` (`default` / `CLI`)	—	DefaultAzureCredential chain (CLI, env, MSI)

All Azure AD connections force TLS encryption.

Compliance and sensitive data

PII / PHI / PCI data

data_diff prints up to 5 sample diff rows in tool output. Those rows become part of the conversation and are sent to your LLM provider.

When comparing tables that might contain regulated data:

Start with algorithm: "profile" — column-level statistics only, no row values leave the database.
If a row-level diff is genuinely required, scope it with a where_clause that excludes sensitive customers / accounts.
The data-parity skill asks for confirmation before sending sample rows to the LLM when the table name matches common regulated patterns (customers, patients, orders, payments, accounts, users).

Column auto-discovery and audit exclusion

When you omit extra_columns and the source is a plain table name, altimate-code:

Queries information_schema (or the dialect-specific equivalent) on both sides.
Excludes audit/timestamp columns by name pattern (updated_at, created_at, _fivetran_synced, _airbyte_emitted_at, etc.).
Queries column defaults and excludes anything with an auto-generating timestamp default (NOW(), CURRENT_TIMESTAMP, GETDATE(), SYSDATE, SYSTIMESTAMP).
Reports excluded columns so you can override if the timestamps are part of what you're validating.

When the source is a SQL query, only the key columns are compared unless you explicitly list extra_columns. Always provide extra_columns for query-mode comparisons.

The `data_diff` tool

Direct tool invocation (if you prefer not to use the skill):

data_diff(
  source = "orders",
  target = "orders",
  source_warehouse = "postgres_prod",
  target_warehouse = "snowflake_dw",
  key_columns = ["id"],
  algorithm = "auto",
)

See the tool reference for the full parameter list.