Distributed Tracing in Go with OpenTelemetry
When a request takes 800ms instead of the expected 50ms, distributed tracing tells you exactly which service, which database call, and which line of code is responsible. Without it, debugging latency regressions in a microservices system means reading logs across five services, correlating timestamps by hand, and guessing at causality. I implemented OpenTelemetry across our Go services at the platform and it has changed how we debug production issues.
Why OpenTelemetry?
OpenTelemetry (OTel) is the CNCF standard for observability instrumentation. The key advantage over vendor-specific SDKs (Datadog tracer, X-Ray SDK, etc.) is portability: you write the instrumentation once and can send it to any compatible backend (Jaeger, Zipkin, Honeycomb, Datadog, Grafana Tempo) by changing an exporter configuration. We started with Jaeger and migrated to Grafana Tempo without touching application code.
Setting Up the Tracer Provider
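import (
	"context"
	"fmt"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp"
	"go.opentelemetry.io/otel/propagation"
	"go.opentelemetry.io/otel/sdk/resource"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
	semconv "go.opentelemetry.io/otel/semconv/v1.21.0" // assumed version; use the one matching your SDK
)

// TracingConfig carries the settings InitTracing reads (fields inferred from usage).
type TracingConfig struct {
	ServiceName       string
	Version           string
	Environment       string
	CollectorEndpoint string  // host:port of the OTLP/HTTP receiver
	SampleRate        float64 // fraction of root traces to sample, 0.0 to 1.0
}

func InitTracing(ctx context.Context, cfg TracingConfig) (*sdktrace.TracerProvider, error) {
	// OTLP HTTP exporter: works with Jaeger, Grafana, most backends
	exporter, err := otlptracehttp.New(ctx,
		otlptracehttp.WithEndpoint(cfg.CollectorEndpoint), // host:port, e.g. "otel-collector:4318"
		otlptracehttp.WithInsecure(),                      // plain HTTP; drop this when the collector uses TLS
	)
	if err != nil {
		return nil, fmt.Errorf("create exporter: %w", err)
	}

	// Resource identifies this service in traces
	res, err := resource.Merge(
		resource.Default(),
		resource.NewWithAttributes(
			semconv.SchemaURL,
			semconv.ServiceNameKey.String(cfg.ServiceName),
			semconv.ServiceVersionKey.String(cfg.Version),
			semconv.DeploymentEnvironmentKey.String(cfg.Environment),
		),
	)
	if err != nil {
		return nil, fmt.Errorf("merge resource: %w", err)
	}

	tp := sdktrace.NewTracerProvider(
		sdktrace.WithBatcher(exporter),
		sdktrace.WithResource(res),
		// Sample 10% in production, 100% in development
		sdktrace.WithSampler(sdktrace.ParentBased(
			sdktrace.TraceIDRatioBased(cfg.SampleRate),
		)),
	)
	otel.SetTracerProvider(tp)

	// W3C TraceContext propagation: works across any OTel-instrumented service
	otel.SetTextMapPropagator(propagation.NewCompositeTextMapPropagator(
		propagation.TraceContext{},
		propagation.Baggage{},
	))
	return tp, nil
}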
// In main():
tp, err := InitTracing(ctx, TracingConfig{
	ServiceName:       "campaign-service",
	CollectorEndpoint: "otel-collector:4318",
	SampleRate:        0.10, // 10% in production
	Environment:       os.Getenv("ENV"),
})
if err != nil {
	log.Fatal(err)
}
defer tp.Shutdown(context.Background())
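The comment in InitTracing mentions 10% sampling in production and 100% in development, while the main() snippet hardcodes 0.10. One way to derive the rate from the environment is sketched below. The helper name sampleRateFor, the "production" environment value, and the idea of an override string (for instance read from a hypothetical TRACE_SAMPLE_RATE variable) are all assumptions for illustration, not part of the original setup.

```go
package main

import "strconv"

// sampleRateFor picks a head-sampling ratio for an environment name.
// override is an optional string (hypothetical: e.g. the value of a
// TRACE_SAMPLE_RATE variable); it wins only when it parses to a value in [0, 1].
func sampleRateFor(env, override string) float64 {
	if override != "" {
		if v, err := strconv.ParseFloat(override, 64); err == nil && v >= 0 && v <= 1 {
			return v
		}
	}
	if env == "production" { // assumed environment name
		return 0.10
	}
	return 1.0 // development and anything else: trace everything
}
```

The result could then be passed as SampleRate, for example sampleRateFor(os.Getenv("ENV"), os.Getenv("TRACE_SAMPLE_RATE")).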
HTTP Server Instrumentation
The otelhttp package wraps your HTTP handler and automatically creates spans for each request, extracts incoming trace context from headers, and records HTTP status codes:
import (
	"net/http"

	"go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
)

func NewServer(handler http.Handler) *http.Server {
	// Wrap the handler: all requests get automatic spans
	traced := otelhttp.NewHandler(handler, "campaign-service",
		otelhttp.WithFilter(func(r *http.Request) bool {
			// Return true to trace the request; health checks and metrics scrapes are noise
			return r.URL.Path != "/health" && r.URL.Path != "/metrics"
		}),
	)
	return &http.Server{
		Handler: traced,
		Addr:    ":8080",
	}
}

// HTTP client instrumentation: propagates trace context to downstream services
func NewHTTPClient() *http.Client {
	return &http.Client{
		Transport: otelhttp.NewTransport(http.DefaultTransport),
	}
}
Database Instrumentation
Database calls are usually the biggest source of latency. Instrument them with the otelsql driver wrapper:
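import "github.com/XSAM/otelsql"

// Wrap the driver. The underlying "postgres" driver must already be
// registered (imported) for otelsql to wrap it.
driverName, err := otelsql.Register("postgres",
	otelsql.WithAttributes(semconv.DBSystemPostgreSQL),
	otelsql.WithTracerProvider(otel.GetTracerProvider()),
)
if err != nil {
	log.Fatal(err)
}
db, err := sql.Open(driverName, dsn)
if err != nil {
	log.Fatal(err)
}
// All queries through this db now create spans automatically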
For DynamoDB and other AWS services, instrument the AWS SDK v2 with the OTel middleware:
import "github.com/aws/aws-sdk-go-v2/config"
import "github.com/aws/aws-sdk-go-v2/service/dynamodb"
import "go.opentelemetry.io/contrib/instrumentation/github.com/aws/aws-sdk-go-v2/otelaws"
cfg, err := config.LoadDefaultConfig(ctx)
if err != nil {
	log.Fatal(err)
}
otelaws.AppendMiddlewares(&cfg.APIOptions) // adds OTel spans to all AWS SDK calls
dynamoClient := dynamodb.NewFromConfig(cfg) // all DynamoDB calls are traced
Manual Spans for Business Logic
Automatic instrumentation covers infrastructure calls. For important business operations, add manual spans with relevant attributes:
func (s *BidService) UpdateBids(ctx context.Context, campaignID string, bids []BidUpdate) error {
	ctx, span := otel.Tracer("bid-service").Start(ctx, "BidService.UpdateBids",
		trace.WithAttributes(
			attribute.String("campaign.id", campaignID),
			attribute.Int("bids.count", len(bids)),
		),
	)
	defer span.End()

	// All calls within this function will be children of this span
	valid, err := s.validateBids(ctx, bids)
	if err != nil {
		span.RecordError(err)
		span.SetStatus(codes.Error, "bid validation failed")
		return err
	}
	span.SetAttributes(attribute.Int("bids.valid", len(valid)))

	if err := s.adsClient.UpdateKeywordBids(ctx, campaignID, valid); err != nil {
		span.RecordError(err)
		span.SetStatus(codes.Error, "amazon ads api error")
		return err
	}
	span.SetStatus(codes.Ok, "")
	return nil
}
Context Propagation Across Service Boundaries
OTel uses HTTP headers to propagate trace context between services. With otelhttp on both client and server, this happens automatically. For message queues, inject and extract manually:
// SQS: inject trace context into message attributes before sending
func sendWithTrace(ctx context.Context, client *sqs.Client, queueURL, body string) error {
	carrier := propagation.MapCarrier{}
	otel.GetTextMapPropagator().Inject(ctx, carrier)

	attrs := map[string]sqstypes.MessageAttributeValue{}
	for k, v := range carrier {
		attrs[k] = sqstypes.MessageAttributeValue{
			DataType:    aws.String("String"),
			StringValue: aws.String(v),
		}
	}
	_, err := client.SendMessage(ctx, &sqs.SendMessageInput{
		QueueUrl:          &queueURL,
		MessageBody:       &body,
		MessageAttributes: attrs,
	})
	return err
}

// SQS: extract trace context when receiving.
// The ReceiveMessage call must request message attributes
// (MessageAttributeNames), otherwise msg.MessageAttributes is empty.
func processWithTrace(ctx context.Context, msg sqstypes.Message) {
	carrier := propagation.MapCarrier{}
	for k, v := range msg.MessageAttributes {
		if v.StringValue != nil {
			carrier[k] = *v.StringValue
		}
	}
	ctx = otel.GetTextMapPropagator().Extract(ctx, carrier)

	// Traces from this point are connected to the original publisher's trace
	ctx, span := otel.Tracer("consumer").Start(ctx, "processMessage")
	defer span.End()
	// ... process message
}
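It helps to know what actually travels in those headers and message attributes. With the TraceContext propagator, the main key is traceparent, whose value has four dash-separated hex fields: version, trace ID, parent span ID, and flags (the lowest flag bit means "sampled"). The parser below is a minimal standard-library sketch for inspecting such values while debugging. It handles only the version 00 layout, and the names traceParent and parseTraceParent are illustrative. In application code the OTel propagator does this work.

```go
package main

import (
	"encoding/hex"
	"fmt"
	"strings"
)

// traceParent holds the four fields of a W3C traceparent value.
type traceParent struct {
	Version string
	TraceID string // 32 hex characters
	SpanID  string // 16 hex characters
	Sampled bool   // lowest bit of the flags byte
}

// parseTraceParent splits and validates a version-00 traceparent value.
func parseTraceParent(h string) (traceParent, error) {
	parts := strings.Split(h, "-")
	if len(parts) != 4 {
		return traceParent{}, fmt.Errorf("traceparent: want 4 fields, got %d", len(parts))
	}
	wantLen := [4]int{2, 32, 16, 2}
	for i, p := range parts {
		// Fields are fixed-width lowercase hex.
		if len(p) != wantLen[i] || p != strings.ToLower(p) {
			return traceParent{}, fmt.Errorf("traceparent: field %d malformed", i)
		}
		if _, err := hex.DecodeString(p); err != nil {
			return traceParent{}, fmt.Errorf("traceparent: field %d: %w", i, err)
		}
	}
	flags, _ := hex.DecodeString(parts[3]) // validated in the loop above
	return traceParent{
		Version: parts[0],
		TraceID: parts[1],
		SpanID:  parts[2],
		Sampled: flags[0]&0x01 == 0x01,
	}, nil
}
```

Dumping the traceparent attribute of a stuck SQS message and comparing its trace ID with the publisher's span is a quick way to confirm that propagation works end to end.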
Sampling Strategy
Do not trace 100% of requests in production. At our traffic levels, 100% sampling would generate gigabytes of trace data per day and significantly impact latency. Our strategy:
- 10% of normal traffic: enough to catch latency regressions and get statistical signal
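- 100% of errors: the goal is to keep every error trace. The ParentBased sampler preserves sampling decisions from upstream services, so once a trace is sampled, every downstream service samples its spans as well. That decision is made when the root span starts, before any error has occurred, so guaranteeing that every error trace is kept generally requires tail-based sampling, where the decision is made after the spans have finished.
- 100% of slow requests: the aim is to force sampling when latency exceeds a threshold. A sampler runs at span start, before the request's duration is known, so the custom sampler here only wraps the base sampler and honors upstream decisions; the latency cutoff itself has to be applied after the spans finish.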
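type ThresholdSampler struct {
	base      sdktrace.Sampler
	threshold time.Duration // not usable in ShouldSample: duration is unknown at span start
}

func (s ThresholdSampler) ShouldSample(p sdktrace.SamplingParameters) sdktrace.SamplingResult {
	// Check if this is a continuation of an already-sampled remote trace
	psc := trace.SpanContextFromContext(p.ParentContext)
	if psc.IsRemote() && psc.IsSampled() {
		return sdktrace.AlwaysSample().ShouldSample(p)
	}
	return s.base.ShouldSample(p)
}

// Description is required by the sdktrace.Sampler interface
func (s ThresholdSampler) Description() string {
	return "ThresholdSampler{" + s.base.Description() + "}"
}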
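A ratio-based sampler can give a stable 10% because the decision is derived from the trace ID itself, not from a random draw per service, so every service looking at the same trace ID reaches the same answer. The function below is a simplified standard-library sketch of that idea: treat part of the trace ID as a number and compare it with a bound proportional to the ratio. The name sampledByRatio is illustrative, and the SDK's own implementation may differ in detail.

```go
package main

import "encoding/binary"

// sampledByRatio reports whether a 16-byte trace ID falls inside the
// sampled fraction. The same ID and ratio always give the same answer.
func sampledByRatio(traceID [16]byte, ratio float64) bool {
	if ratio >= 1 {
		return true
	}
	if ratio <= 0 {
		return false
	}
	// Upper bound of the sampled region within a 63-bit range.
	bound := uint64(ratio * (1 << 63))
	// Use the low 8 bytes of the ID as a 63-bit number.
	x := binary.BigEndian.Uint64(traceID[8:16]) >> 1
	return x < bound
}
```

Because trace IDs are generated randomly, roughly the requested fraction of traces lands below the bound, and no coordination between services is needed.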
What We Learned
After running OTel in production for 6 months, the things that provided the most value:
- Database query spans were the biggest win — immediately surfaced N+1 queries we did not know existed
- Span attributes on business operations (campaign ID, advertiser profile) made it easy to find traces for specific advertisers during incident investigations
- SQS message propagation allowed tracing an event from the Amazon SNS notification through our entire processing pipeline to the final database write
- Sampling at 10% was more than enough signal for latency analysis; 100% sampling for errors meant we never missed an incident trace
The upfront investment to instrument the codebase was about two days. We have recovered that cost many times over in reduced mean time to diagnosis during production incidents.