← All posts

Designing Backfill Jobs That Do Not Take Production Down

Open office with desks and computers

Backfills are deceptively dangerous. The code is often simple: read old rows, compute a missing value, write it back. The danger is scale. A job that behaves perfectly on ten thousand rows can overload a database, fill a queue, or starve production traffic when it runs across hundreds of millions of records.

A production-safe backfill is designed like a service: observable, resumable, throttled, and boring to stop.

Make Progress Durable

Do not rely on an in-memory cursor for long backfills. Store progress in a table so the job can resume after deploys, crashes, or manual pauses.

type BackfillCheckpoint struct {
    JobName   string
    LastID    int64
    UpdatedAt time.Time
}

func (b *Backfill) Run(ctx context.Context) error {
    checkpoint, err := b.store.LoadCheckpoint(ctx, "campaign-currency")
    if err != nil {
        return err
    }
    return b.processFrom(ctx, checkpoint.LastID)
}

Chunk Everything

Large transactions are the enemy. Process small batches, commit often, and measure how long each batch takes. The batch size should be configurable because the right value in staging is rarely the right value in production.

  • Start with a small batch size.
  • Add a sleep between batches.
  • Increase slowly while watching database latency.
  • Pause automatically when production error rate increases.

Design for Idempotency

A backfill will be interrupted. Assume it will rerun the same row. The write path should be safe when applied multiple times. If the computation depends on external state, store enough information to avoid changing the result unexpectedly between runs.

UPDATE campaigns
SET budget_currency = ?
WHERE id = ?
  AND budget_currency IS NULL;

That `IS NULL` condition looks small, but it prevents the backfill from overwriting values that were fixed manually or written by new code.

Expose Controls

Every serious backfill should have a pause switch, progress metrics, and logs with job name, tenant, batch size, row count, and duration. When a backfill misbehaves, the fastest fix should be to pause it, not to SSH into a box and kill a process.

The best backfills are uneventful. They run slowly, steadily, and visibly. Nobody celebrates them, which is exactly the point.

Comments