Trace Context Propagation Across Go Workers and AWS Queues
Distributed tracing is straightforward for HTTP calls. A request comes in, middleware starts a span, headers propagate to the next service, and the trace forms a nice chain. Queues break that chain unless you explicitly carry trace context through the message.
For systems built with Go workers, SQS, EventBridge, and background jobs, trace context propagation is the difference between seeing a complete workflow and seeing disconnected islands.
Put Trace Context in Message Attributes
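Do not hide trace metadata inside business payloads. Use message attributes when the transport supports them. For SQS, the producer can store the W3C `traceparent` header as a message attribute and the consumer can extract it. The helper below assumes a propagator has been registered with `otel.SetTextMapPropagator(propagation.TraceContext{})`; the default global propagator is a no-op and injects nothing.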
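// addTraceAttributes injects the trace context carried by ctx into the SQS
// message attributes of an outgoing message. attrs must be a non-nil map;
// types is github.com/aws/aws-sdk-go-v2/service/sqs/types.
func addTraceAttributes(ctx context.Context, attrs map[string]types.MessageAttributeValue) {
	carrier := propagation.MapCarrier{}
	otel.GetTextMapPropagator().Inject(ctx, carrier)
	for k, v := range carrier {
		attrs[k] = types.MessageAttributeValue{
			DataType:    aws.String("String"),
			StringValue: aws.String(v),
		}
	}
}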
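For reference, the value that ends up under the `traceparent` attribute has a fixed four-field shape: version, trace ID, parent span ID, and flags, separated by dashes. The following is a minimal standard-library sketch that validates that shape, which can be handy when inspecting messages by hand or in tests. The names `TraceParent` and `parseTraceParent` are hypothetical, and the sketch only accepts version `00`; in production code the OpenTelemetry propagator does this parsing for you.

```go
package main

import (
	"errors"
	"strconv"
	"strings"
)

// TraceParent holds the four fields of a version-00 W3C traceparent value.
// The type is illustrative and not part of any SDK.
type TraceParent struct {
	Version string // "00"
	TraceID string // 32 lowercase hex characters
	SpanID  string // 16 lowercase hex characters (the parent span)
	Flags   string // 2 hex characters; bit 0 is the sampled flag
}

var errBadTraceParent = errors.New("malformed traceparent")

// isLowerHex reports whether s is non-empty lowercase hexadecimal.
func isLowerHex(s string) bool {
	for _, r := range s {
		if (r < '0' || r > '9') && (r < 'a' || r > 'f') {
			return false
		}
	}
	return s != ""
}

// parseTraceParent validates a version-00 traceparent string and splits it
// into its fields. Other versions are rejected in this sketch.
func parseTraceParent(v string) (TraceParent, error) {
	parts := strings.Split(v, "-")
	if len(parts) != 4 {
		return TraceParent{}, errBadTraceParent
	}
	tp := TraceParent{Version: parts[0], TraceID: parts[1], SpanID: parts[2], Flags: parts[3]}
	if tp.Version != "00" || len(tp.TraceID) != 32 || len(tp.SpanID) != 16 || len(tp.Flags) != 2 {
		return TraceParent{}, errBadTraceParent
	}
	if !isLowerHex(tp.TraceID) || !isLowerHex(tp.SpanID) || !isLowerHex(tp.Flags) {
		return TraceParent{}, errBadTraceParent
	}
	// All-zero IDs are explicitly invalid in the specification.
	if strings.Trim(tp.TraceID, "0") == "" || strings.Trim(tp.SpanID, "0") == "" {
		return TraceParent{}, errBadTraceParent
	}
	return tp, nil
}

// Sampled reports whether the sampled bit is set in the flags field.
func (tp TraceParent) Sampled() bool {
	n, err := strconv.ParseUint(tp.Flags, 16, 8)
	return err == nil && n&1 == 1
}
```

A message whose `traceparent` attribute fails this check will not join the producer's trace, so counting such messages is a cheap way to spot a producer that forgot to inject context.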
Extract Before Starting Work
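On the worker side, extract the context before creating the processing span. That makes the worker span a child of the original request span instead of the root of a new trace. SQS returns message attributes only when the `ReceiveMessage` call requests them through `MessageAttributeNames`, so ask for `traceparent` (or `All`); otherwise the carrier stays empty and every message starts a fresh trace.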
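// contextFromMessage rebuilds the producer's trace context from the string
// message attributes of a received SQS message. If no trace context is
// present, the returned context is equivalent to ctx.
func contextFromMessage(ctx context.Context, msg types.Message) context.Context {
	carrier := propagation.MapCarrier{}
	for k, v := range msg.MessageAttributes {
		if v.StringValue != nil {
			carrier[k] = *v.StringValue
		}
	}
	return otel.GetTextMapPropagator().Extract(ctx, carrier)
}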
Name Spans Around Business Work
Queue spans should be named for the business operation, not the infrastructure. `sqs.receive` is less useful than `campaign.sync_report`. Infrastructure and tenant details belong in span attributes, for example:
- messaging.system = aws.sqs
- messaging.destination = report-sync-queue
- tenant.company_id = 5000
- ads.profile_id = 50000100
- job.type = campaign-report-sync
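The naming rule above can be sketched as two small helpers: one that maps a job type to a business-level span name, and one that collects the details as attributes. This is an illustrative standard-library sketch; `SyncJob`, `spanNameFor`, `spanAttributesFor`, the lookup table, and the `job.process` fallback are all hypothetical names, and the attribute keys simply mirror the list above. The string map would be converted to your tracing library's attribute type before being attached to a span.

```go
package main

import "strconv"

// SyncJob is a hypothetical job descriptor; the fields are illustrative.
type SyncJob struct {
	Queue     string // e.g. "report-sync-queue"
	JobType   string // e.g. "campaign-report-sync"
	CompanyID int64
	ProfileID int64
}

// spanNames maps job types to business-level span names (hypothetical table).
var spanNames = map[string]string{
	"campaign-report-sync": "campaign.sync_report",
}

// spanNameFor returns the business operation name for a job type.
func spanNameFor(jobType string) string {
	if name, ok := spanNames[jobType]; ok {
		return name
	}
	// Generic fallback that still avoids an infrastructure name.
	return "job.process"
}

// spanAttributesFor returns the infrastructure and tenant details that
// belong on the span as attributes rather than in its name.
func spanAttributesFor(j SyncJob) map[string]string {
	return map[string]string{
		"messaging.system":      "aws.sqs",
		"messaging.destination": j.Queue,
		"tenant.company_id":     strconv.FormatInt(j.CompanyID, 10),
		"ads.profile_id":        strconv.FormatInt(j.ProfileID, 10),
		"job.type":              j.JobType,
	}
}
```

Keeping the span name to a small, fixed set of values also keeps it low-cardinality, while high-cardinality values such as tenant and profile IDs stay in attributes where they can be filtered on.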
Handle Retries Clearly
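Retries should be visible in traces. Record the receive count (SQS exposes it as the `ApproximateReceiveCount` system attribute), the attempt number, and the idempotency key as span attributes, and mark failed attempts with an error status. When a message lands in the dead-letter queue, the trace should make it obvious which dependency failed and how many times.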
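One way to surface retry state is to read the receive count from the message's system attributes and turn it into span attributes. The following is a minimal standard-library sketch, assuming the system attributes arrive as a `map[string]string` and that `ApproximateReceiveCount` was requested in the receive call. `RetryInfo`, `retryInfoFrom`, `retryAttributes`, and the `job.*` attribute keys are hypothetical names, and `maxReceiveCount` is assumed to match the queue's redrive policy.

```go
package main

import "strconv"

// RetryInfo summarises delivery attempts for one message (illustrative type).
type RetryInfo struct {
	ReceiveCount int  // how many times SQS has delivered this message
	FinalAttempt bool // true when another failure sends the message to the DLQ
}

// retryInfoFrom reads the SQS system attribute ApproximateReceiveCount.
// It returns false when the attribute is missing or not a positive integer.
func retryInfoFrom(sysAttrs map[string]string, maxReceiveCount int) (RetryInfo, bool) {
	raw, ok := sysAttrs["ApproximateReceiveCount"]
	if !ok {
		return RetryInfo{}, false
	}
	n, err := strconv.Atoi(raw)
	if err != nil || n < 1 {
		return RetryInfo{}, false
	}
	// SQS moves a message to the dead-letter queue once its receive count
	// exceeds maxReceiveCount, so delivery number maxReceiveCount is the last.
	return RetryInfo{ReceiveCount: n, FinalAttempt: n >= maxReceiveCount}, true
}

// retryAttributes builds span attributes describing the attempt.
// The attribute keys are hypothetical, not an established convention.
func retryAttributes(info RetryInfo, idempotencyKey string) map[string]string {
	return map[string]string{
		"job.attempt":         strconv.Itoa(info.ReceiveCount),
		"job.final_attempt":   strconv.FormatBool(info.FinalAttempt),
		"job.idempotency_key": idempotencyKey,
	}
}
```

With these attributes on every processing span, filtering traces by the final-attempt flag together with an error status shows the attempts that immediately preceded a move to the dead-letter queue.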
Once trace context flows through queues, production debugging changes. Instead of starting from logs and reconstructing causality, you can open one trace and see the HTTP request, event publication, queue delay, worker processing, database calls, and downstream API calls in order.