SQS Dead-Letter Queues and Visibility Timeout: What I Got Wrong
I spent an embarrassing amount of time debugging an SQS consumer that was processing messages twice and sometimes three times. The root cause turned out to be a fundamental misunderstanding of how visibility timeout, dead-letter queues, and consumer concurrency interact. Here is everything I wish I had known from the start.
SQS Delivery Guarantees: At-Least-Once, Not Exactly-Once
Before we dive into configuration, the most important thing to internalise: SQS guarantees at-least-once delivery. The same message may be delivered multiple times, even under normal operating conditions — not just during failures. This is not a bug; it is a deliberate design decision that enables SQS to provide high availability and scalability. Your consumer must be idempotent.
Visibility Timeout: The Most Misunderstood Setting
When you call ReceiveMessage, SQS makes each returned message invisible to other consumers for the visibility timeout duration. If you do not delete the message before the timeout expires, SQS assumes the consumer failed and makes the message visible again for another consumer to process.
The mistake I made was simple: my processing time was ~45 seconds but I had left the visibility timeout at the default 30 seconds. Messages were being processed twice — the first consumer was still working when the message became visible again and another consumer picked it up. The first consumer then successfully deleted the message, but the second had already started (and sometimes finished) processing it.
The rule of thumb: set visibility timeout to 6× your median processing time. For a service that takes 10 seconds to process a message, use 60 seconds. This gives you headroom for slow days and network latency.
// Configure queue with appropriate visibility timeout
_, err = sqsClient.SetQueueAttributes(ctx, &sqs.SetQueueAttributesInput{
QueueUrl: &queueURL,
Attributes: map[string]string{
"VisibilityTimeout": "120", // 2 minutes for a ~20s consumer
},
})
Extending Visibility Timeout for Long-Running Messages
For variable-length processing — where most messages take 5 seconds but some take 3 minutes — a fixed visibility timeout will either be too short (causing duplicates) or wasteful. The solution: extend the timeout dynamically while processing:
func (c *Consumer) processWithExtension(ctx context.Context, msg sqstypes.Message) error {
// Extend visibility every 20s for a 30s initial timeout
ticker := time.NewTicker(20 * time.Second)
extCtx, cancelExt := context.WithCancel(ctx)
defer func() {
cancelExt()
ticker.Stop()
}()
go func() {
for {
select {
case <-extCtx.Done():
return
case <-ticker.C:
_, err := c.sqs.ChangeMessageVisibility(extCtx, &sqs.ChangeMessageVisibilityInput{
QueueUrl: &c.queueURL,
ReceiptHandle: msg.ReceiptHandle,
VisibilityTimeout: 30,
})
if err != nil {
slog.Warn("failed to extend visibility", "error", err)
}
}
}
}()
return c.doActualWork(ctx, msg)
}
Dead-Letter Queues: Your Safety Net
A dead-letter queue (DLQ) automatically receives messages that fail processing more than maxReceiveCount times. Configure a DLQ for every production queue — without it, a consistently failing message will recycle forever, consuming processing capacity and masking the bug.
// First, create the DLQ
dlq, _ := sqsClient.CreateQueue(ctx, &sqs.CreateQueueInput{
QueueName: aws.String("my-queue-dlq"),
})
// Get the DLQ ARN
dlqAttrs, _ := sqsClient.GetQueueAttributes(ctx, &sqs.GetQueueAttributesInput{
QueueUrl: dlq.QueueUrl,
AttributeNames: []sqstypes.QueueAttributeName{"QueueArn"},
})
dlqArn := dlqAttrs.Attributes["QueueArn"]
// Configure the main queue to use the DLQ
sqsClient.SetQueueAttributes(ctx, &sqs.SetQueueAttributesInput{
QueueUrl: &mainQueueURL,
Attributes: map[string]string{
"RedrivePolicy": fmt.Sprintf(`{
"maxReceiveCount": "5",
"deadLetterTargetArn": "%s"
}`, dlqArn),
},
})
Set maxReceiveCount to 5 for most use cases. After 5 failed attempts, the message goes to the DLQ where you can inspect it, fix the bug, and either manually re-drive the messages or replay them through the main queue.
CloudWatch Alarm on DLQ Depth
A DLQ is useless if nobody notices when messages land in it. Set up a CloudWatch alarm that pages on-call when DLQ depth exceeds 1:
{
"AlarmName": "my-queue-dlq-not-empty",
"MetricName": "ApproximateNumberOfMessagesVisible",
"Namespace": "AWS/SQS",
"Dimensions": [{"Name": "QueueName", "Value": "my-queue-dlq"}],
"Statistic": "Sum",
"Period": 300,
"EvaluationPeriods": 1,
"Threshold": 0,
"ComparisonOperator": "GreaterThanThreshold",
"TreatMissingData": "notBreaching"
}
Batch Processing: The Right Concurrency Model
SQS can return up to 10 messages per receive call. There are two ways to process them: sequentially (simple but slow) or concurrently (faster but needs careful error handling). I use a bounded goroutine pool:
func (c *Consumer) processBatch(ctx context.Context, msgs []sqstypes.Message) {
var wg sync.WaitGroup
sem := make(chan struct{}, 5) // max 5 concurrent processors
for _, msg := range msgs {
msg := msg // capture
sem <- struct{}{}
wg.Add(1)
go func() {
defer func() {
<-sem
wg.Done()
}()
if err := c.process(ctx, msg); err != nil {
slog.Error("processing failed", "msgID", *msg.MessageId, "error", err)
// Do NOT delete — let it retry and eventually hit the DLQ
return
}
c.delete(ctx, msg)
}(msg)
}
wg.Wait()
}
The critical rule: delete messages individually only on success. Never use the batch delete API at the end of processing — if one message fails, you would delete the others before confirming their success.
Long Polling: Reducing Costs and Latency
Without long polling, SQS returns an empty response immediately if there are no messages. Your consumer will loop continuously, burning API calls (and money). Enable long polling by setting WaitTimeSeconds to 20:
output, err := sqsClient.ReceiveMessage(ctx, &sqs.ReceiveMessageInput{
QueueUrl: &c.queueURL,
MaxNumberOfMessages: 10,
WaitTimeSeconds: 20, // wait up to 20s for messages
MessageAttributeNames: []string{"All"},
})
Long polling also reduces false-empty responses. SQS distributes messages across multiple hosts; a short poll might hit a host with no messages even if messages are queued elsewhere. Long polling queries all hosts before returning empty.
Message Attributes for Tracing and Priority
SQS message attributes allow you to carry metadata without polluting the message body. I use them for distributed tracing context propagation:
// Sender: inject trace context
carrier := propagation.MapCarrier{}
otel.GetTextMapPropagator().Inject(ctx, carrier)
attrs := map[string]sqstypes.MessageAttributeValue{}
for k, v := range carrier {
attrs[k] = sqstypes.MessageAttributeValue{
DataType: aws.String("String"),
StringValue: aws.String(v),
}
}
sqsClient.SendMessage(ctx, &sqs.SendMessageInput{
QueueUrl: &queueURL,
MessageBody: aws.String(body),
MessageAttributes: attrs,
})
// Receiver: extract trace context
carrier = propagation.MapCarrier{}
for k, v := range msg.MessageAttributes {
if v.StringValue != nil { carrier[k] = *v.StringValue }
}
ctx = otel.GetTextMapPropagator().Extract(ctx, carrier)
What I Have Learned from Production
After running SQS-based services in production for two years, the lessons that matter:
- Always design for idempotency — duplicate processing is normal, not exceptional
- Set visibility timeout to 6× median processing time, use dynamic extension for outliers
- Every queue needs a DLQ and a CloudWatch alarm on that DLQ
- Process messages individually, delete individually
- Use long polling always
- Instrument receive count — a message approaching
maxReceiveCountshould emit a warning log so you can catch persistent failures before they hit the DLQ
SQS is one of the most reliable message queues available. Once you understand its at-least-once guarantee and tune the visibility timeout correctly, it will run for years without incident.
Comments
Post a Comment