Event-Driven Architecture on AWS: SQS, EventBridge and Idempotency


Event-driven architecture sounds clean in diagrams: one service publishes an event, another service reacts, and the system becomes nicely decoupled. In production it is messier. Events arrive late, arrive twice, arrive out of order, or fail halfway through a workflow. AWS gives you strong building blocks, but the architecture still depends on how you handle those realities.

Use EventBridge for Routing, SQS for Work

The pattern I like is EventBridge for routing and SQS for durable work queues. EventBridge is good at publishing domain events and letting consumers subscribe without tight coupling. SQS is good at giving workers a queue they can drain, retry, and monitor.

{
  "source": "ads.campaigns",
  "detail-type": "CampaignBudgetChanged",
  "detail": {
    "companyId": 5000,
    "profileId": 50000100,
    "campaignId": 123456789,
    "oldBudget": 50.00,
    "newBudget": 75.00
  }
}

A budgeting service can publish that event once. A reporting sync, audit logger, and notification worker can each consume it independently.
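On the consuming side, each worker usually starts by checking whether a message is one it cares about and then decoding the payload. The following is a minimal sketch of that step, assuming the message body carries at least the three top-level fields shown in the sample event. The `Envelope` and `CampaignBudgetChanged` Go types and the `DecodeBudgetChange` helper are hypothetical names chosen for illustration, and the string comparison is only a simplified stand-in for a real routing rule.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Envelope mirrors the three top-level fields of the sample event.
// Detail stays raw so each consumer decodes only the event types it handles.
type Envelope struct {
	Source     string          `json:"source"`
	DetailType string          `json:"detail-type"`
	Detail     json.RawMessage `json:"detail"`
}

// CampaignBudgetChanged is a hypothetical Go shape for the sample "detail" object.
type CampaignBudgetChanged struct {
	CompanyID  int64   `json:"companyId"`
	ProfileID  int64   `json:"profileId"`
	CampaignID int64   `json:"campaignId"`
	OldBudget  float64 `json:"oldBudget"`
	NewBudget  float64 `json:"newBudget"`
}

// DecodeBudgetChange reports ok=false for events this consumer does not handle,
// so unrelated events can be acknowledged and skipped instead of failing.
func DecodeBudgetChange(raw []byte) (CampaignBudgetChanged, bool, error) {
	var evt CampaignBudgetChanged
	var env Envelope
	if err := json.Unmarshal(raw, &env); err != nil {
		return evt, false, fmt.Errorf("decode envelope: %w", err)
	}
	if env.Source != "ads.campaigns" || env.DetailType != "CampaignBudgetChanged" {
		return evt, false, nil
	}
	if err := json.Unmarshal(env.Detail, &evt); err != nil {
		return evt, false, fmt.Errorf("decode detail: %w", err)
	}
	return evt, true, nil
}

// sampleEvent is the event from the article, verbatim.
const sampleEvent = `{
  "source": "ads.campaigns",
  "detail-type": "CampaignBudgetChanged",
  "detail": {
    "companyId": 5000,
    "profileId": 50000100,
    "campaignId": 123456789,
    "oldBudget": 50.00,
    "newBudget": 75.00
  }
}`

func main() {
	evt, ok, err := DecodeBudgetChange([]byte(sampleEvent))
	if err != nil {
		panic(err)
	}
	fmt.Println(ok, evt.CampaignID, evt.NewBudget)
}
```

Keeping `Detail` as raw JSON means a reporting sync, an audit logger, and a notification worker can share the envelope type while each owns its own payload structs.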

Every Consumer Needs Idempotency

The most important rule: every consumer must assume duplicate delivery. SQS standard queues deliver at least once. EventBridge retries failed deliveries to a target. Your worker can crash after writing to the database but before deleting the message, in which case SQS redelivers it once the visibility timeout expires. If the handler cannot safely process the same event twice, it is not production-ready.

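import (
    "context"
    "errors"
    "fmt"
)

func (h *Handler) Handle(ctx context.Context, event CampaignBudgetChanged) error {
    // One key per campaign and event version, so a redelivery maps to the same key.
    key := fmt.Sprintf("budget-change:%d:%d", event.CampaignID, event.Version)

    // TryStart claims the key; false means an earlier delivery already claimed it.
    acquired, err := h.idempotency.TryStart(ctx, key)
    if err != nil {
        return err
    }
    if !acquired {
        return nil // duplicate delivery: acknowledge without redoing the work
    }

    if err := h.applyBudgetChange(ctx, event); err != nil {
        // Release is an assumed method on the store. Without dropping the claim,
        // the retry would be skipped as a duplicate and the change would be lost.
        return errors.Join(err, h.idempotency.Release(ctx, key))
    }
    return h.idempotency.MarkDone(ctx, key)
}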
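The handler leaves the idempotency store undefined. To make the contract concrete, here is a minimal in-memory sketch of a store with `TryStart` and `MarkDone`, plus a simulated duplicate delivery. Everything in it is an assumption for illustration: `MemoryStore`, the `deliverTwice` helper, and the extra `Release` method (for dropping a claim when the work fails) are hypothetical names. A production store would need to be durable and shared across workers, and would probably expire stale in-progress claims.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

type state int

const (
	inProgress state = iota + 1
	done
)

// MemoryStore is an in-memory stand-in for a durable idempotency table.
type MemoryStore struct {
	mu   sync.Mutex
	keys map[string]state
}

func NewMemoryStore() *MemoryStore {
	return &MemoryStore{keys: make(map[string]state)}
}

// TryStart claims key and reports whether the caller should do the work.
// The check and the write happen under one lock, so only one caller wins.
func (s *MemoryStore) TryStart(_ context.Context, key string) (bool, error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, exists := s.keys[key]; exists {
		return false, nil
	}
	s.keys[key] = inProgress
	return true, nil
}

// MarkDone records that the work for key completed.
func (s *MemoryStore) MarkDone(_ context.Context, key string) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.keys[key] = done
	return nil
}

// Release drops an unfinished claim so a later redelivery can retry the work.
// Completed keys are left alone.
func (s *MemoryStore) Release(_ context.Context, key string) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.keys[key] == inProgress {
		delete(s.keys, key)
	}
	return nil
}

// deliverTwice simulates the same message arriving twice and returns how
// many times the side effect actually ran.
func deliverTwice(store *MemoryStore, key string) (int, error) {
	ctx := context.Background()
	applied := 0
	for i := 0; i < 2; i++ {
		acquired, err := store.TryStart(ctx, key)
		if err != nil {
			return applied, err
		}
		if !acquired {
			continue // duplicate: skip the side effect
		}
		applied++ // stands in for applyBudgetChange
		if err := store.MarkDone(ctx, key); err != nil {
			return applied, err
		}
	}
	return applied, nil
}

func main() {
	applied, err := deliverTwice(NewMemoryStore(), "budget-change:123456789:7")
	if err != nil {
		panic(err)
	}
	fmt.Println("side effect ran", applied, "time(s)")
}
```

The property that matters is that the claim is a single atomic check-and-write. In a real database that usually means a conditional insert or a unique constraint on the key, not a read followed by a separate write.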

Schema Evolution Matters

Events are contracts. Once multiple services consume an event, changing its shape becomes a migration. I prefer additive changes: add a field, keep old fields, update consumers gradually, then remove only when you are sure nothing depends on the old version.

Include event type and version.
Include tenant identifiers in every event.
Keep event payloads small; store large documents elsewhere and reference them.
Avoid publishing database rows as events. Publish business facts.
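An additive change can be sketched in Go as follows. The `version` and `currency` fields, the `BudgetChangedV1` and `BudgetChangedV2` types, and the "USD" fallback are all hypothetical, chosen only to show the mechanics: an old consumer ignores a field it does not know, and a new consumer supplies a default when an old producer omits the field.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// BudgetChangedV1 is the shape an older consumer was written against.
type BudgetChangedV1 struct {
	Version    int     `json:"version"`
	CompanyID  int64   `json:"companyId"`
	CampaignID int64   `json:"campaignId"`
	NewBudget  float64 `json:"newBudget"`
}

// BudgetChangedV2 adds one optional field and keeps every V1 field.
// The pointer lets the consumer tell "absent" apart from "empty string".
type BudgetChangedV2 struct {
	BudgetChangedV1
	Currency *string `json:"currency,omitempty"`
}

// CurrencyOrDefault falls back to an assumed default for V1 payloads.
func (e BudgetChangedV2) CurrencyOrDefault() string {
	if e.Currency == nil {
		return "USD"
	}
	return *e.Currency
}

// decodeV1 is the old consumer. encoding/json ignores unknown fields by
// default, so a V2 payload still decodes.
func decodeV1(raw []byte) (BudgetChangedV1, error) {
	var e BudgetChangedV1
	err := json.Unmarshal(raw, &e)
	return e, err
}

// decodeV2 is the new consumer. A V1 payload leaves Currency nil.
func decodeV2(raw []byte) (BudgetChangedV2, error) {
	var e BudgetChangedV2
	err := json.Unmarshal(raw, &e)
	return e, err
}

const (
	v1Payload = `{"version":1,"companyId":5000,"campaignId":123456789,"newBudget":75.0}`
	v2Payload = `{"version":2,"companyId":5000,"campaignId":123456789,"newBudget":75.0,"currency":"EUR"}`
)

func main() {
	old, err := decodeV1([]byte(v2Payload))
	if err != nil {
		panic(err)
	}
	fmt.Println("old consumer, new payload:", old.CampaignID, old.NewBudget)

	upgraded, err := decodeV2([]byte(v1Payload))
	if err != nil {
		panic(err)
	}
	fmt.Println("new consumer, old payload:", upgraded.CurrencyOrDefault())
}
```

Removing or renaming a field does not have this property: an old consumer would silently read a zero value, which is why removal waits until nothing depends on the old shape.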

Monitoring the Event System

Queue depth alone is not enough. You also need the age of the oldest message (SQS reports it to CloudWatch as ApproximateAgeOfOldestMessage), handler error rate, retry count, dead-letter queue depth, and consumer throughput. During incidents, age is often the most honest metric because it tells you how stale the product is becoming.
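To illustrate why age says more than depth, here is a small sketch that evaluates two queue snapshots against an age limit. `QueueSnapshot`, its fields, the `demoAlerts` helper, and the five-minute limit are hypothetical. In practice the readings would come from your metrics system, not from hand-built structs.

```go
package main

import (
	"fmt"
	"time"
)

// QueueSnapshot holds hypothetical readings taken from a queue at one instant.
type QueueSnapshot struct {
	Depth        int       // messages currently waiting
	OldestSentAt time.Time // send time of the oldest waiting message
	DLQDepth     int       // messages parked in the dead-letter queue
}

// OldestAge reports how long the oldest waiting message has been queued.
func (q QueueSnapshot) OldestAge(now time.Time) time.Duration {
	if q.Depth == 0 {
		return 0
	}
	return now.Sub(q.OldestSentAt)
}

// Alerts returns one line per breached condition. Depth is deliberately not
// a condition: only staleness and dead-lettered messages are checked.
func (q QueueSnapshot) Alerts(now time.Time, maxAge time.Duration) []string {
	var alerts []string
	if age := q.OldestAge(now); age > maxAge {
		alerts = append(alerts, fmt.Sprintf("oldest message is %s old (limit %s)", age, maxAge))
	}
	if q.DLQDepth > 0 {
		alerts = append(alerts, fmt.Sprintf("%d message(s) in the dead-letter queue", q.DLQDepth))
	}
	return alerts
}

// demoAlerts compares a shallow but stuck queue with a deep but healthy one.
func demoAlerts() (stale, fresh []string) {
	now := time.Date(2024, 1, 1, 12, 0, 0, 0, time.UTC)
	limit := 5 * time.Minute
	shallowButStale := QueueSnapshot{Depth: 3, OldestSentAt: now.Add(-20 * time.Minute)}
	deepButFresh := QueueSnapshot{Depth: 10000, OldestSentAt: now.Add(-5 * time.Second)}
	return shallowButStale.Alerts(now, limit), deepButFresh.Alerts(now, limit)
}

func main() {
	stale, fresh := demoAlerts()
	fmt.Println("shallow queue alerts:", len(stale))
	fmt.Println("deep queue alerts:", len(fresh))
}
```

A queue holding three messages can be the unhealthy one if those three have been stuck for twenty minutes, while a queue of ten thousand that drains within seconds is fine.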

A good event-driven system feels boring when it works. That boringness comes from explicit contracts, idempotent consumers, and dashboards that show lag before users notice it.
