Designing Backfill Jobs That Do Not Take Production Down
Backfills are deceptively dangerous. The code is often simple: read old rows, compute a missing value, write it back. The danger is scale. A job that behaves perfectly on ten thousand rows can overload a database, fill a queue, or starve production traffic when it runs across hundreds of millions of records.

A production-safe backfill is designed like a service: observable, resumable, throttled, and boring to stop.

Make Progress Durable

Do not rely on an in-memory cursor for long backfills. Store progress in a table so the job can resume after deploys, crashes, or manual pauses.

    // BackfillCheckpoint is the persisted progress marker for one backfill job.
    type BackfillCheckpoint struct {
        JobName   string
        LastID    int64
        UpdatedAt time.Time
    }

    // Run loads the stored checkpoint and resumes from the last recorded ID
    // instead of starting over.
    func (b *Backfill) Run(ctx context.Context) error {
        checkpoint, err := b.store.LoadCheckpoint(ctx, "campaign-currency")
        if err != nil {
            return err
        }
        return b.processFrom(ctx, checkpoint.LastID)
    }

Chunk Everything

Large transactions are the enemy. Process small batches, co...