SQS Dead-Letter Queues and Visibility Timeout: What I Got Wrong

I spent an embarrassing amount of time debugging an SQS consumer that was processing messages twice and sometimes three times. The root cause turned out to be a fundamental misunderstanding of how visibility timeout, dead-letter queues, and consumer concurrency interact. Here is everything I wish I had known from the start.

SQS Delivery Guarantees: At-Least-Once, Not Exactly-Once

Before we dive into configuration, the most important thing to internalise: SQS guarantees at-least-once delivery. The same message may be delivered multiple times, even under normal operating conditions — not just during failures. This is not a bug; it is a deliberate design decision that enables SQS to provide high availability and scalability. Your consumer must be idempotent.

Visibility Timeout: The Most Misunderstood Setting

When you call ReceiveMessage, SQS makes each returned message invisible to other consumers for the visibility timeout duration. If you do not delete the message before the timeout expires, SQS assumes the consumer failed and makes the message visible again for another consumer to process.

The mistake I made was simple: my processing time was ~45 seconds but I had left the visibility timeout at the default 30 seconds. Messages were being processed twice — the first consumer was still working when the message became visible again and another consumer picked it up. The first consumer then successfully deleted the message, but the second had already started (and sometimes finished) processing it.

The rule of thumb: set visibility timeout to 6× your median processing time. For a service that takes 10 seconds to process a message, use 60 seconds. This gives you headroom for slow days and network latency.

// Configure queue with appropriate visibility timeout
_, err = sqsClient.SetQueueAttributes(ctx, &sqs.SetQueueAttributesInput{
    QueueUrl: &queueURL,
    Attributes: map[string]string{
        "VisibilityTimeout": "120", // 2 minutes for a ~20s consumer
    },
})

Extending Visibility Timeout for Long-Running Messages

For variable-length processing — where most messages take 5 seconds but some take 3 minutes — a fixed visibility timeout will either be too short (causing duplicates) or wasteful. The solution: extend the timeout dynamically while processing:

func (c *Consumer) processWithExtension(ctx context.Context, msg sqstypes.Message) error {
    // Extend visibility every 20s for a 30s initial timeout
    ticker := time.NewTicker(20 * time.Second)
    extCtx, cancelExt := context.WithCancel(ctx)
    defer func() {
        cancelExt()
        ticker.Stop()
    }()

    go func() {
        for {
            select {
            case <-extCtx.Done():
                return
            case <-ticker.C:
                _, err := c.sqs.ChangeMessageVisibility(extCtx, &sqs.ChangeMessageVisibilityInput{
                    QueueUrl:          &c.queueURL,
                    ReceiptHandle:     msg.ReceiptHandle,
                    VisibilityTimeout: 30,
                })
                if err != nil {
                    slog.Warn("failed to extend visibility", "error", err)
                }
            }
        }
    }()

    return c.doActualWork(ctx, msg)
}

Dead-Letter Queues: Your Safety Net

A dead-letter queue (DLQ) automatically receives messages that fail processing more than maxReceiveCount times. Configure a DLQ for every production queue — without it, a consistently failing message will recycle forever, consuming processing capacity and masking the bug.

// First, create the DLQ
dlq, _ := sqsClient.CreateQueue(ctx, &sqs.CreateQueueInput{
    QueueName: aws.String("my-queue-dlq"),
})

// Get the DLQ ARN
dlqAttrs, _ := sqsClient.GetQueueAttributes(ctx, &sqs.GetQueueAttributesInput{
    QueueUrl:       dlq.QueueUrl,
    AttributeNames: []sqstypes.QueueAttributeName{"QueueArn"},
})
dlqArn := dlqAttrs.Attributes["QueueArn"]

// Configure the main queue to use the DLQ
sqsClient.SetQueueAttributes(ctx, &sqs.SetQueueAttributesInput{
    QueueUrl: &mainQueueURL,
    Attributes: map[string]string{
        "RedrivePolicy": fmt.Sprintf(`{
            "maxReceiveCount": "5",
            "deadLetterTargetArn": "%s"
        }`, dlqArn),
    },
})

Set maxReceiveCount to 5 for most use cases. After 5 failed attempts, the message goes to the DLQ where you can inspect it, fix the bug, and either manually re-drive the messages or replay them through the main queue.

CloudWatch Alarm on DLQ Depth

A DLQ is useless if nobody notices when messages land in it. Set up a CloudWatch alarm that pages on-call when DLQ depth exceeds 1:

{
  "AlarmName": "my-queue-dlq-not-empty",
  "MetricName": "ApproximateNumberOfMessagesVisible",
  "Namespace": "AWS/SQS",
  "Dimensions": [{"Name": "QueueName", "Value": "my-queue-dlq"}],
  "Statistic": "Sum",
  "Period": 300,
  "EvaluationPeriods": 1,
  "Threshold": 0,
  "ComparisonOperator": "GreaterThanThreshold",
  "TreatMissingData": "notBreaching"
}

Batch Processing: The Right Concurrency Model

SQS can return up to 10 messages per receive call. There are two ways to process them: sequentially (simple but slow) or concurrently (faster but needs careful error handling). I use a bounded goroutine pool:

func (c *Consumer) processBatch(ctx context.Context, msgs []sqstypes.Message) {
    var wg sync.WaitGroup
    sem := make(chan struct{}, 5) // max 5 concurrent processors

    for _, msg := range msgs {
        msg := msg // capture
        sem <- struct{}{}
        wg.Add(1)
        go func() {
            defer func() {
                <-sem
                wg.Done()
            }()
            if err := c.process(ctx, msg); err != nil {
                slog.Error("processing failed", "msgID", *msg.MessageId, "error", err)
                // Do NOT delete — let it retry and eventually hit the DLQ
                return
            }
            c.delete(ctx, msg)
        }(msg)
    }
    wg.Wait()
}

The critical rule: delete messages individually only on success. Never use the batch delete API at the end of processing — if one message fails, you would delete the others before confirming their success.

Long Polling: Reducing Costs and Latency

Without long polling, SQS returns an empty response immediately if there are no messages. Your consumer will loop continuously, burning API calls (and money). Enable long polling by setting WaitTimeSeconds to 20:

output, err := sqsClient.ReceiveMessage(ctx, &sqs.ReceiveMessageInput{
    QueueUrl:            &c.queueURL,
    MaxNumberOfMessages: 10,
    WaitTimeSeconds:     20, // wait up to 20s for messages
    MessageAttributeNames: []string{"All"},
})

Long polling also reduces false-empty responses. SQS distributes messages across multiple hosts; a short poll might hit a host with no messages even if messages are queued elsewhere. Long polling queries all hosts before returning empty.

Message Attributes for Tracing and Priority

SQS message attributes allow you to carry metadata without polluting the message body. I use them for distributed tracing context propagation:

// Sender: inject trace context
carrier := propagation.MapCarrier{}
otel.GetTextMapPropagator().Inject(ctx, carrier)

attrs := map[string]sqstypes.MessageAttributeValue{}
for k, v := range carrier {
    attrs[k] = sqstypes.MessageAttributeValue{
        DataType:    aws.String("String"),
        StringValue: aws.String(v),
    }
}

sqsClient.SendMessage(ctx, &sqs.SendMessageInput{
    QueueUrl:          &queueURL,
    MessageBody:       aws.String(body),
    MessageAttributes: attrs,
})

// Receiver: extract trace context
carrier = propagation.MapCarrier{}
for k, v := range msg.MessageAttributes {
    if v.StringValue != nil { carrier[k] = *v.StringValue }
}
ctx = otel.GetTextMapPropagator().Extract(ctx, carrier)

What I Have Learned from Production

After running SQS-based services in production for two years, the lessons that matter:

Always design for idempotency — duplicate processing is normal, not exceptional
Set visibility timeout to 6× median processing time, use dynamic extension for outliers
Every queue needs a DLQ and a CloudWatch alarm on that DLQ
Process messages individually, delete individually
Use long polling always
Instrument receive count — a message approaching maxReceiveCount should emit a warning log so you can catch persistent failures before they hit the DLQ

SQS is one of the most reliable message queues available. Once you understand its at-least-once guarantee and tune the visibility timeout correctly, it will run for years without incident.

Search This Blog