Posts

personal blog

Things I find interesting

Notes on backend architecture, distributed systems, AWS, Go, and building software that scales.


Designing Backfill Jobs That Do Not Take Production Down

- May 27, 2026

Backfills are deceptively dangerous. The code is often simple: read old rows, compute a missing value, write it back. The danger is scale. A job that behaves perfectly on ten thousand rows can overload a database, fill a queue, or starve production traffic when it runs across hundreds of millions of records. A production-safe backfill is designed like a service: observable, resumable, throttled, and boring to stop.

Make Progress Durable

Do not rely on an in-memory cursor for long backfills. Store progress in a table so the job can resume after deploys, crashes, or manual pauses.

    type BackfillCheckpoint struct {
        JobName   string
        LastID    int64
        UpdatedAt time.Time
    }

    func (b *Backfill) Run(ctx context.Context) error {
        checkpoint, err := b.store.LoadCheckpoint(ctx, "campaign-currency")
        if err != nil {
            return err
        }
        return b.processFrom(ctx, checkpoint.LastID)
    }

Chunk Everything

Large transactions are the enemy. Process small batches, co...

Amazon Ads Bulk Operations: Designing for Partial Failure

- May 20, 2026

Bulk operations are where clean API abstractions go to suffer. Updating one campaign budget is simple. Updating ten thousand bids across hundreds of advertiser profiles is a different system. Some updates succeed, some fail validation, some hit rate limits, some time out, and the product still needs to tell the user exactly what happened.

The main design principle is to treat partial failure as the normal case. If the code assumes all-or-nothing success, the first real advertiser account will break the workflow.

Represent Work Explicitly

A bulk operation should become a durable job with child items. Each item has its own status, request payload, response payload, retry count, and error message. This makes the operation resumable and auditable.

    type BulkItemStatus string

    const (
        ItemPending   BulkItemStatus = "PENDING"
        ItemRunning   BulkItemStatus = "RUNNING"
        ItemSucceeded BulkItemStatus = "SUCCEEDED"
        ItemFailed    BulkItemStatus = ...

Amazon Ads API at Scale: Rate Limiting, Pagination and Bulk Operations in Go

- May 14, 2026

After three years of building and maintaining a platform that manages Amazon advertising campaigns for thousands of advertisers, I have made every mistake possible with the Amazon Ads API. This post is a practical guide to operating the API at scale: how to stay within rate limits across thousands of advertiser profiles, how to paginate correctly, and how to bulk-process operations without hammering the API into returning 429s.

The Scale Problem

When you have one advertiser, the Amazon Ads API is straightforward. When you have 2,000 advertisers, each with dozens of campaigns, hundreds of ad groups, and thousands of keywords, the same operations become an engineering challenge. A nightly sync that takes 3 seconds per advertiser profile takes 100 minutes across the fleet if the profiles are processed one at a time. Any operation that requires multiple API calls per entity (reading, computing, then writing) multiplies that cost.

The constraints you need to design around:

- Rate limits are per profile (per ...

From Logs to Alerts: SLOs for Go APIs on AWS

- April 24, 2026

Logs are useful after something breaks. SLOs are useful before users start sending screenshots. The shift from log-based debugging to service-level objectives is one of the biggest maturity jumps a backend team can make.

For Go APIs on AWS, I like starting with a small set of SLOs that match user pain: availability, latency, and freshness. Everything else can grow from there.

Define What Good Means

A service-level indicator is the measurement. A service-level objective is the target. For an API, the indicators are usually request success rate and latency. For a data pipeline, freshness matters too.

- 99.9% of API requests should return non-5xx responses over 30 days.
- 95% of dashboard requests should complete under 500ms.
- 95% of reporting data should be less than 15 minutes stale.

Instrument at the Edge

Measure user-visible behavior at the edge of the service. Handler middleware is a good place for request count, status, and duration. Do not build an SLO from internal function timi...
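To show how an indicator relates to its objective, here is a small worked sketch for the availability SLO. `Availability` and `ErrorBudgetRemaining` are hypothetical helper names, and the request counts in `main` are made-up numbers for illustration only.

```go
package main

import "fmt"

// Availability is the SLI: the fraction of requests that did not return a 5xx.
func Availability(total, serverErrors int64) float64 {
	if total == 0 {
		return 1 // no traffic means nothing failed
	}
	return float64(total-serverErrors) / float64(total)
}

// ErrorBudgetRemaining returns the fraction of the error budget still unspent
// for a target such as 0.999. It goes negative once the budget is overspent.
func ErrorBudgetRemaining(total, serverErrors int64, target float64) float64 {
	allowed := (1 - target) * float64(total) // failures the SLO tolerates
	if allowed <= 0 {
		return 0 // a 100% target, or no traffic, leaves no budget to spend
	}
	return 1 - float64(serverErrors)/allowed
}

func main() {
	// Made-up window: one million requests, 400 of them 5xx, 99.9% target.
	total, failed := int64(1_000_000), int64(400)
	fmt.Printf("availability: %.4f\n", Availability(total, failed))
	fmt.Printf("budget remaining: %.2f\n", ErrorBudgetRemaining(total, failed, 0.999))
}
```

With a 99.9% target, one million requests tolerate about 1,000 failures, so 400 failures leave roughly 60% of the budget. Alerting on how fast that budget is burning is usually more useful than alerting on individual errors.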

Cost-Aware LLM Routing: Reducing AI API Bills by 60%

- April 09, 2026

Three months after shipping the AI feature, our Anthropic API bill had grown faster than the revenue it was generating. The naive solution was to reduce usage. The right solution was to use the right model for each task. A cost-aware router that directs simple tasks to cheaper models and complex reasoning to powerful ones reduced our monthly AI spend by 60% while maintaining, and in some cases improving, output quality.

The Insight: Not All Tasks Are Equal

We were using Claude Opus for everything. Extracting a number from a JSON field does not need the same model as synthesising a 500-word campaign performance narrative. Classifying a keyword into one of five categories does not need the same model as generating a multi-step bid adjustment strategy. Using Opus for classification is like using a Ferrari to go grocery shopping.

The Anthropic model family maps naturally to task complexity:

- Claude Haiku: fast, cheap (~50× cheaper than Opus per token), excellent for structured ex...
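The core of such a router can be very small. The sketch below is an assumption about its shape rather than the production implementation: `TaskKind`, `ModelTier`, and `Route` are hypothetical names, and the tiers are generic labels instead of concrete model identifiers.

```go
package main

import "fmt"

// TaskKind labels the kind of work being sent to the model (hypothetical type).
type TaskKind string

const (
	TaskExtraction     TaskKind = "extraction"
	TaskClassification TaskKind = "classification"
	TaskNarrative      TaskKind = "narrative"
	TaskStrategy       TaskKind = "strategy"
)

// ModelTier is a generic cost tier, deliberately not a concrete model ID.
type ModelTier string

const (
	TierSmall ModelTier = "small"
	TierLarge ModelTier = "large"
)

// Route picks the cheapest tier assumed to be good enough for the task.
func Route(kind TaskKind) ModelTier {
	switch kind {
	case TaskExtraction, TaskClassification:
		return TierSmall
	default:
		// Unknown or complex tasks fall back to the larger tier,
		// trading cost for quality until they are measured.
		return TierLarge
	}
}

func main() {
	for _, k := range []TaskKind{TaskExtraction, TaskClassification, TaskNarrative, TaskStrategy} {
		fmt.Printf("%s -> %s\n", k, Route(k))
	}
}
```

Defaulting unknown tasks to the larger tier keeps quality safe when a new task type ships before anyone has evaluated it on a cheaper model.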

Trace Context Propagation Across Go Workers and AWS Queues

- March 26, 2026

Distributed tracing is straightforward for HTTP calls. A request comes in, middleware starts a span, headers propagate to the next service, and the trace forms a nice chain. Queues break that chain unless you explicitly carry trace context through the message.

For systems built with Go workers, SQS, EventBridge, and background jobs, trace context propagation is the difference between seeing a complete workflow and seeing disconnected islands.

Put Trace Context in Message Attributes

Do not hide trace metadata inside business payloads. Use message attributes when the transport supports them. For SQS, the W3C `traceparent` header can be stored as an attribute and extracted by the consumer.

    func addTraceAttributes(ctx context.Context, attrs map[string]types.MessageAttributeValue) {
        carrier := propagation.MapCarrier{}
        otel.GetTextMapPropagator().Inject(ctx, carrier)
        for k, v := range carrier {
            attrs[k] = types.MessageAttributeValue{
                DataType: aws.Str...
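To see what actually travels in that attribute, here is a standard-library-only sketch that writes and reads a W3C `traceparent` value (`version-traceid-spanid-flags`) in a plain string map standing in for message attributes. `TraceContext`, `Inject`, and `Extract` are hypothetical names for this illustration; production code would rely on the OpenTelemetry propagator rather than parsing by hand.

```go
package main

import (
	"encoding/hex"
	"errors"
	"fmt"
	"strings"
)

const traceparentKey = "traceparent"

// TraceContext is a simplified stand-in for a span context (hypothetical type).
type TraceContext struct {
	TraceID string // 32 hex characters
	SpanID  string // 16 hex characters
	Sampled bool
}

// Inject writes the context into the attributes as a version-00 traceparent.
func Inject(tc TraceContext, attrs map[string]string) {
	flags := "00"
	if tc.Sampled {
		flags = "01"
	}
	attrs[traceparentKey] = fmt.Sprintf("00-%s-%s-%s", tc.TraceID, tc.SpanID, flags)
}

// Extract reads the traceparent attribute back on the consumer side.
// It checks field lengths and hex encoding only; it does not enforce
// lowercase or reject all-zero IDs as a full implementation should.
func Extract(attrs map[string]string) (TraceContext, error) {
	parts := strings.Split(attrs[traceparentKey], "-")
	if len(parts) != 4 || parts[0] != "00" {
		return TraceContext{}, errors.New("missing or unsupported traceparent")
	}
	if !isHex(parts[1], 32) || !isHex(parts[2], 16) || !isHex(parts[3], 2) {
		return TraceContext{}, errors.New("malformed traceparent fields")
	}
	flags, _ := hex.DecodeString(parts[3]) // already validated by isHex
	return TraceContext{
		TraceID: parts[1],
		SpanID:  parts[2],
		Sampled: flags[0]&0x01 == 0x01,
	}, nil
}

// isHex reports whether s is exactly n characters of valid hex.
func isHex(s string, n int) bool {
	if len(s) != n {
		return false
	}
	_, err := hex.DecodeString(s)
	return err == nil
}

func main() {
	attrs := map[string]string{}
	Inject(TraceContext{
		TraceID: "4bf92f3577b34da6a3ce929d0e0e4736",
		SpanID:  "00f067aa0ba902b7",
		Sampled: true,
	}, attrs)
	fmt.Println(attrs[traceparentKey])

	tc, err := Extract(attrs)
	fmt.Println(tc.TraceID, tc.SpanID, tc.Sampled, err)
}
```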

Distributed Tracing in Go with OpenTelemetry

- March 05, 2026

When a request takes 800ms instead of the expected 50ms, distributed tracing tells you exactly which service, which database call, and which line of code is responsible. Without it, debugging latency regressions in a microservices system means reading logs across five services, correlating timestamps by hand, and guessing at causality. I implemented OpenTelemetry across the Go services on our platform, and it has changed how we debug production issues.

Why OpenTelemetry?

OpenTelemetry (OTel) is the CNCF standard for observability instrumentation. The key advantage over vendor-specific SDKs (Datadog tracer, X-Ray SDK, etc.) is portability: you write the instrumentation once and can send it to any compatible backend (Jaeger, Zipkin, Honeycomb, Datadog, Grafana Tempo) by changing an exporter configuration. We started with Jaeger and migrated to Grafana Tempo without touching application code.

Setting Up the Tracer Provider

    func InitTracing(ctx context.Context, cfg TracingConfig) (...