Change Data Capture (CDC)

CDC is a pattern for watching a database for changes and streaming those changes to downstream systems in real-time — without polling, without batch jobs, without touching the source tables after the initial setup.

The Core Idea

Traditional migration: “give me everything and I’ll figure out what’s different.” CDC: “tell me when something changes.”

Traditional polling:
  App → "anything new?" → DB → "here's everything" → App filters for changes
  Problem: hits DB every few seconds even when nothing changed

CDC:
  DB → "something changed" → CDC → "here's what changed" → downstream
  Problem: none — you only process actual changes

How It Works

CDC tools sit between the primary database and the outside world, reading the database’s transaction log and publishing change events.

Log-Based CDC (Preferred)

Databases write every change to a transaction log (WAL in Postgres, binlog in MySQL). CDC reads this log — it never touches the actual data tables.

User executes: UPDATE users SET phone = '555-0100' WHERE id = 42

DB writes to WAL:
  operation: UPDATE
  table: users
  before: {id: 42, phone: NULL}
  after:  {id: 42, phone: '555-0100'}
  timestamp: 1709294021

CDC reads WAL → transforms → publishes event → downstream

This is the gold standard. Zero performance overhead on source tables.

Trigger-Based CDC (Fallback)

For databases without accessible transaction logs, CDC adds triggers to every table that write change records to a shadow table, then reads the shadow table.

Users table trigger:
  ON UPDATE → INSERT INTO _users_changes (id, before, after, ts) VALUES (...)

Trade-off: adds write overhead to every table, and triggers can be disabled or missed in edge cases.

The CDC Event Shape

Every change event has the same structure regardless of the database:

// INSERT into users
{
  "op": "c",                          // create
  "table": "users",
  "ts_ms": 1709294021000,
  "after": { "id": 42, "name": "Darshan", "email": "d@example.com" }
}
 
// UPDATE users SET phone = '555-0100' WHERE id = 42
{
  "op": "u",                          // update
  "table": "users",
  "ts_ms": 1709294021000,
  "before": { "id": 42, "phone": null },
  "after":  { "id": 42, "phone": "555-0100" }
}
 
// DELETE FROM users WHERE id = 42
{
  "op": "d",                          // delete
  "table": "users",
  "ts_ms": 1709294021000,
  "before": { "id": 42, "name": "Darshan", "email": "d@example.com" }
}

op values: c (create), u (update), d (delete), r (read/snapshot).

Tooling

ToolDatabaseDeliveryNotes
DebeziumPostgres, MySQL, MongoDB, SQL Server, OracleKafka, WebhookOpen source, Apache license
AWS DMS20+ sourcesS3, Kafka, Redshift, etc.Managed, no-code setup
Oracle GoldenGateOracle, SQL Server, etc.ProprietaryEnterprise, expensive
Fivetran100+ connectorsData warehouseSaaS, pricing per row
MaxwellMySQL binlogKafkaLightweight, open source

Debezium Example

DebeziumEngine engine = DebeziumEngine.create(ChangeEvent.createUsingWebSocket())
    .using(new PostgresConnectorConfig.ScrollPositionDao(source ->
        new OffsetQuery().execute(source)
    ))
    .notifier(notification -> {
        notification.forEach(change -> {
            String table   = change.getHeader("table");
            String op      = change.getHeader("operation");
            Struct payload = change.getPayload();
 
            // route to downstream system
            if ("users".equals(table)) {
                routeUserChange(op, payload);
            }
        });
    })
    .using(config -> {
        config.set("database.hostname", "postgres-primary");
        config.set("database.port", "5432");
        config.set("database.dbname", "mydb");
        config.set("plugin.name", "pgoutput");  // Postgres WAL plugin
        config.set("table.include.list", "orders,users,products");
    });

Common Architectures

Zero-Downtime Database Migration

CDC is the bridge between “we need all historical data” and “we can’t afford downtime”:

Phase 1: Bulk snapshot
  Legacy DB → DMS/Debezium → New DB
  (all existing data copied in batches)

Phase 2: CDC sync (runs during and after snapshot)
  Legacy DB → CDC → Kafka → Consumer → New DB

Phase 3: Catch-up
  After snapshot completes, replay CDC backlog
  Monitor lag until< 1 second

Phase 4: Cutover
  Reads → New DB
  Writes → New DB (Legacy DB becomes read-only backup)

Audit Trail / Event Sourcing

Any DB write → CDC → Kafka → Audit consumer
                              → Search index (Elasticsearch)
                              → Cache invalidation (Redis)
                              → Notification service
                              → Compliance log (S3)

Multi-System Synchronization

Orders DB (Postgres) → CDC → Kafka → Consumer 1 → Analytics DB (ClickHouse)
                                          → Consumer 2 → Search (Elasticsearch)
                                          → Consumer 3 → Cache (Redis)

Delivery Guarantees

CDC guarantees at-least-once delivery — every change is sent, but network issues can cause duplicates.

Downstream consumers must be idempotent:

# Idempotent upsert: re-running produces the same result
def process_user_event(event):
    if event['op'] == 'c' or event['op'] == 'u':
        upsert_user(event['after']) # INSERT ... ON CONFLICT UPDATE
    elif event['op'] == 'd':
        delete_user(event['before']['id'])
 
# Running this twice = same final state (safe for CDC retries)

Exactly-once requires distributed transactions or a dedicated coordination layer (Kafka transactions, outbox pattern), which adds significant complexity.

The Hard Parts

Schema Evolution

The database adds a column. CDC has to handle old events (pre-column) and new events (with column) simultaneously.

Before: {id, name}
After:  {id, name, phone}

CDC event from before schema change:
  after: {id: 42, name: "Darshan"} # no phone field

CDC event from after schema change:
  after: {id: 42, name: "Darshan", phone: "555-0100"}

Consumer must tolerate missing fields (optional, not required).

WAL Retention

Postgres recycles WAL segments if CDC falls behind. If your CDC connector stops for more than wal_keep_size window, you lose the log history and must re-snapshot.

wal_keep_size = 1GB     # ~30 minutes of WAL for busy DB
wal_keep_size = 10GB   # ~5 hours of WAL

Set this before deploying CDC, not after you’ve fallen behind.

Transaction Grouping

Multiple writes in a single transaction should be grouped together. Without this, you can get partial transaction events:

BEGIN;
  UPDATE accounts SET balance = balance - 100 WHERE user_id = 1;
  UPDATE accounts SET balance = balance + 100 WHERE user_id = 2;
COMMIT;

CDC must emit these two changes as one atomic unit. Debezium groups by transaction_id from the WAL.

Initial Snapshot

Large tables (billions of rows) take hours to snapshot. During this time, CDC continues capturing changes. When the snapshot finishes, those accumulated changes are replayed.

Total migration time = snapshot_duration + catch-up_duration

Estimate: snapshot at 10,000 rows/second for a 100M row table = 2.7 hours, then catch-up on top.

CDC vs. Other Patterns

PatternApproachLatencyData Scope
CDCEvent-driven (log)SecondsChanges only
PollingQuery-based (batch)Minutes to hoursFull table or incremental column
Dual-writeApplication-levelSynchronousEvery write explicitly sent
Trigger-to-tableDB triggersNear-real-timeChanges only

CDC sits between polling (slow, batch) and dual-write (fast but application-level耦合). Log-based CDC is the best of all worlds — real-time, no application code changes, no performance hit — but requires database support for transaction log access.