From Logs to Alerts: SLOs for Go APIs on AWS
Logs are useful after something breaks. SLOs are useful before users start sending screenshots. The shift from log-based debugging to service-level objectives is one of the biggest maturity jumps a backend team can make.
For Go APIs on AWS, I like starting with a small set of SLOs that match user pain: availability, latency, and freshness. Everything else can grow from there.
Define What Good Means
A service-level indicator is the measurement. A service-level objective is the target. For an API, the indicators are usually request success rate and latency. For a data pipeline, freshness matters too.
- 99.9% of API requests should return non-5xx responses over 30 days.
- 95% of dashboard requests should complete under 500ms.
- 95% of reporting data should be less than 15 minutes stale.
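Each objective above implies an error budget: the share of events that may be bad before the objective is missed. The arithmetic can be sketched as follows. The function names and the figure of 10 million requests per 30 days are assumptions for illustration, not numbers from a real service.

```go
package main

import (
	"fmt"
	"math"
)

// errorBudget returns the fraction of events that may be "bad" under a
// target such as 0.999 (availability) or 0.95 (latency, freshness).
func errorBudget(target float64) float64 {
	return 1 - target
}

// allowedBadEvents converts the budget into a count for an expected volume.
func allowedBadEvents(target float64, totalEvents int64) int64 {
	return int64(math.Round(errorBudget(target) * float64(totalEvents)))
}

// budgetMinutes expresses the budget as minutes of total outage in the window.
func budgetMinutes(target float64, windowDays int) float64 {
	return errorBudget(target) * float64(windowDays) * 24 * 60
}

func main() {
	// Hypothetical volume: 10 million API requests in 30 days.
	const monthlyRequests = 10_000_000

	fmt.Printf("availability: %d failed requests allowed\n", allowedBadEvents(0.999, monthlyRequests))
	fmt.Printf("availability: %.1f minutes of full outage allowed\n", budgetMinutes(0.999, 30))
	fmt.Printf("latency: %d slow dashboard requests allowed\n", allowedBadEvents(0.95, monthlyRequests))
}
```

Seeing the budget as a count or as minutes makes the target easier to argue about than a bare percentage.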
Instrument at the Edge
Measure user-visible behavior at the edge of the service. Handler middleware is a good place for request count, status, and duration. Do not build an SLO from internal function timings unless users experience that boundary.
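	// MetricsMiddleware records request count and duration by method, route, and status.
	// Requires net/http, strconv, and time. statusRecorder, routeName, requestDuration,
	// and requestTotal are defined elsewhere; statusRecorder must override WriteHeader
	// to capture the status code, otherwise every request is recorded as 200.
	func MetricsMiddleware(next http.Handler) http.Handler {
		return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			start := time.Now()
			// Default to 200, which net/http sends when a handler never calls WriteHeader.
			rw := statusRecorder{ResponseWriter: w, status: http.StatusOK}
			next.ServeHTTP(&rw, r)
			duration := time.Since(start)
			// Build the label values once so both metrics use the same series key.
			method, route, status := r.Method, routeName(r), strconv.Itoa(rw.status)
			requestDuration.WithLabelValues(method, route, status).Observe(duration.Seconds())
			requestTotal.WithLabelValues(method, route, status).Inc()
		})
	}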
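Once status and duration are recorded per request, each SLI is the ratio of good events to total events. In practice that ratio usually comes from a query against the metrics backend. The sketch below does the same arithmetic in plain Go over a hypothetical `observation` type and fixture, to show what "good" means for each objective.

```go
package main

import (
	"fmt"
	"time"
)

// observation is a hypothetical record of one request as seen at the edge.
type observation struct {
	status   int
	duration time.Duration
}

// latencyThreshold mirrors the 500ms dashboard objective.
const latencyThreshold = 500 * time.Millisecond

// availabilitySLI is the fraction of requests that did not return a 5xx.
func availabilitySLI(obs []observation) float64 {
	if len(obs) == 0 {
		return 1 // no traffic means no failures
	}
	good := 0
	for _, o := range obs {
		if o.status < 500 {
			good++
		}
	}
	return float64(good) / float64(len(obs))
}

// latencySLI is the fraction of requests that finished under the threshold.
func latencySLI(obs []observation, threshold time.Duration) float64 {
	if len(obs) == 0 {
		return 1
	}
	fast := 0
	for _, o := range obs {
		if o.duration < threshold {
			fast++
		}
	}
	return float64(fast) / float64(len(obs))
}

// sampleObservations returns a small made-up fixture for illustration.
func sampleObservations() []observation {
	return []observation{
		{status: 200, duration: 120 * time.Millisecond},
		{status: 200, duration: 480 * time.Millisecond},
		{status: 404, duration: 30 * time.Millisecond},
		{status: 200, duration: 650 * time.Millisecond},
		{status: 503, duration: 900 * time.Millisecond},
	}
}

func main() {
	obs := sampleObservations()
	fmt.Printf("availability SLI: %.2f\n", availabilitySLI(obs))
	fmt.Printf("latency SLI: %.2f\n", latencySLI(obs, latencyThreshold))
}
```

Note that a 404 counts as good for availability here, matching the non-5xx wording of the objective. Whether failed requests should count toward the latency SLI is a design choice this sketch does not settle.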
Alert on Burn Rate
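Alerting directly on error rate is noisy. Burn-rate alerts are better because they ask whether the service is consuming its error budget too quickly. Burn rate is the observed error rate divided by the error rate the SLO allows, so a burn rate of 1 spends the budget exactly over the SLO window. A five-minute spike might be worth a page if it burns the budget aggressively; a small blip can go to Slack.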
The practical setup is two windows: a short window for fast detection and a longer window for confidence. If both are burning too quickly, page someone.
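The two-window check can be sketched as below. The type and function names are hypothetical. The 14.4 threshold is an assumption borrowed from common multi-window guidance, where it corresponds to spending 2% of a 30-day budget in one hour. Tune it to your own windows and tolerance.

```go
package main

import "fmt"

// windowCounts holds request totals for one lookback window.
type windowCounts struct {
	errors float64
	total  float64
}

// errorRatio is the share of failed requests in the window.
func (w windowCounts) errorRatio() float64 {
	if w.total == 0 {
		return 0 // no traffic, nothing burned
	}
	return w.errors / w.total
}

// burnRate divides the observed error ratio by the ratio the SLO allows.
// A value of 1 spends the budget exactly over the SLO window.
// target must be below 1, otherwise there is no budget to burn.
func burnRate(errorRatio, target float64) float64 {
	return errorRatio / (1 - target)
}

// shouldPage fires only when both windows exceed the threshold: the short
// window gives fast detection, the long window confirms it is not a blip.
func shouldPage(short, long windowCounts, target, threshold float64) bool {
	return burnRate(short.errorRatio(), target) >= threshold &&
		burnRate(long.errorRatio(), target) >= threshold
}

func main() {
	const target = 0.999   // 99.9% availability objective
	const threshold = 14.4 // assumed paging threshold, see lead-in

	short := windowCounts{errors: 20, total: 1000}  // e.g. last 5 minutes
	long := windowCounts{errors: 150, total: 10000} // e.g. last hour

	fmt.Printf("short burn rate: %.1f\n", burnRate(short.errorRatio(), target))
	fmt.Printf("long burn rate: %.1f\n", burnRate(long.errorRatio(), target))
	fmt.Println("page:", shouldPage(short, long, target, threshold))
}
```

A lower threshold over longer windows can route to Slack instead of a pager, which keeps the slow leaks visible without waking anyone.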
Logs Still Matter
SLOs tell you that users are hurting. Logs and traces tell you why. The three should connect through labels: route, tenant, dependency, status class, and trace ID. During an incident, the alert should lead to the dashboard, the dashboard should lead to traces, and traces should lead to logs.
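One way to keep those labels consistent is to build the log attributes in a single place. The sketch below uses the standard library's `log/slog` package (Go 1.21 and later). The helper names, attribute keys, and sample values are assumptions for illustration.

```go
package main

import (
	"context"
	"log/slog"
	"os"
)

// statusClass collapses a status code into a low-cardinality class.
func statusClass(code int) string {
	switch {
	case code >= 500:
		return "5xx"
	case code >= 400:
		return "4xx"
	case code >= 300:
		return "3xx"
	case code >= 200:
		return "2xx"
	default:
		return "1xx"
	}
}

// requestAttrs builds the shared fields that let an alert, a dashboard,
// a trace, and a log line be joined during an incident.
func requestAttrs(route, tenant, dependency string, status int, traceID string) []slog.Attr {
	return []slog.Attr{
		slog.String("route", route),
		slog.String("tenant", tenant),
		slog.String("dependency", dependency),
		slog.String("status_class", statusClass(status)),
		slog.String("trace_id", traceID),
	}
}

func main() {
	logger := slog.New(slog.NewJSONHandler(os.Stdout, nil))

	// Hypothetical values; in a real handler these come from the request
	// context and the tracing library in use.
	attrs := requestAttrs("/v1/reports", "tenant-42", "postgres", 503, "4bf92f3577b34da6a3ce929d0e0e4736")
	logger.LogAttrs(context.Background(), slog.LevelError, "request failed", attrs...)
}
```

Tenant and trace ID are high-cardinality values, so they generally belong in logs and traces rather than as metric labels. Route and status class are safe to share across all three.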
The end state is calmer operations. You stop paging on every weird metric and start paging on user-visible reliability. That is the kind of alerting people can trust.