Posts

Showing posts from August, 2025

ECS Worker Autoscaling with Queue Depth and Lag Metrics

CPU-based autoscaling works well for web services. It works poorly for queue workers. A worker can be at 20% CPU and still be dangerously behind because the queue is receiving messages faster than it can process them. For SQS workers on ECS, the better scaling signals are backlog per task and message age.

The Metric That Matters

The metric I start with is backlog per running task. If there are 20,000 visible messages and 20 ECS tasks, each task effectively owns 1,000 messages. If the processing rate is known, that number can be translated into expected drain time.

    backlog_per_task = visible_messages / max(running_tasks, 1)

For workloads with variable processing time, combine it with the approximate age of the oldest message. Queue depth tells you how much work exists. Age tells you whether users are waiting too long.

Scaling Policy Shape

A simple target tracking policy can work, but I prefer step scaling for important worker pools because it lets you react aggressively when lag is high an...
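
To make the backlog-per-task arithmetic concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the post's actual policy: the `QueueSnapshot` type, the step thresholds in `STEPS`, the `MAX_AGE_SECONDS` limit, and the per-task processing rate are all hypothetical values chosen for the example. In practice the inputs would likely come from the SQS CloudWatch metrics `ApproximateNumberOfMessagesVisible` and `ApproximateAgeOfOldestMessage` plus the ECS running task count.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueueSnapshot:
    """One observation of the queue and the worker pool (hypothetical type)."""
    visible_messages: int      # e.g. SQS ApproximateNumberOfMessagesVisible
    oldest_age_seconds: float  # e.g. SQS ApproximateAgeOfOldestMessage
    running_tasks: int         # ECS tasks currently running


def backlog_per_task(snapshot: QueueSnapshot) -> float:
    """Visible messages divided by running tasks, guarding against zero tasks."""
    return snapshot.visible_messages / max(snapshot.running_tasks, 1)


def drain_time_seconds(snapshot: QueueSnapshot, msgs_per_task_per_second: float) -> float:
    """Expected time to clear the current backlog, ignoring new arrivals."""
    if msgs_per_task_per_second <= 0:
        raise ValueError("processing rate must be positive")
    return backlog_per_task(snapshot) / msgs_per_task_per_second


# Hypothetical step table: (backlog-per-task lower bound, tasks to add).
# Ordered from most to least aggressive; the numbers are illustrative only.
STEPS = [(2000, 10), (1000, 4), (500, 1)]

# Hypothetical limit on message age before scaling out at the largest step.
MAX_AGE_SECONDS = 300


def tasks_to_add(snapshot: QueueSnapshot) -> int:
    """Pick a scale-out step from backlog per task, with message age as an override."""
    if snapshot.oldest_age_seconds > MAX_AGE_SECONDS:
        return STEPS[0][1]
    backlog = backlog_per_task(snapshot)
    for lower_bound, add in STEPS:
        if backlog >= lower_bound:
            return add
    return 0
```

With the numbers from the text, `backlog_per_task(QueueSnapshot(20_000, 60, 20))` is 1,000 messages per task. At an assumed rate of 5 messages per second per task, `drain_time_seconds` gives 200 seconds. The age override reflects the idea that depth and age answer different questions: a shallow queue with very old messages still means someone is waiting too long.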