Graceful Shutdown in Go: The Right Way

Graceful shutdown is one of those things every backend service needs but few teams implement correctly — until something breaks in production. I learned this the hard way when a deployment left in-flight database writes half-committed, causing data inconsistencies that took a full day to resolve. Here is the complete pattern I have settled on for Go services, including the edge cases most tutorials skip.

What Happens Without Graceful Shutdown

When a container orchestrator (ECS, Kubernetes) wants to stop a task, it sends SIGTERM to the process and, if the process is still running after a grace period, follows up with SIGKILL. A Go program that has not registered a handler for SIGTERM exits as soon as the signal arrives, without using any of that grace period. Any in-flight HTTP requests, open database transactions, or messages being processed from a queue are abandoned mid-operation. On a busy service something is almost always in flight at that moment, so every deployment risks dropping work.

The real damage depends on what your service does:

  • HTTP APIs: clients receive connection reset errors instead of responses
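  • Database writes: open transactions are rolled back, which is usually harmless, but multi-step writes that are not wrapped in a single transaction can be left half-applied
  • SQS consumers: messages become visible again after the visibility timeout and are delivered a second time, so partly finished work is processed again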
  • External API calls: the call may have gone out but you never received or processed the response

The Basic Signal Handler

Go makes signal handling straightforward with the os/signal package. The pattern is to start the server in a goroutine, block the main goroutine on a signal channel, and then call Shutdown when the signal arrives:

package main

import (
    "context"
    "errors"
    "log/slog"
    "net/http"
    "os"
    "os/signal"
    "syscall"
    "time"
)

func main() {
    srv := &http.Server{
        Addr:         ":8080",
        Handler:      buildRouter(), // buildRouter is defined elsewhere in the service
        ReadTimeout:  10 * time.Second,
        WriteTimeout: 30 * time.Second,
        IdleTimeout:  120 * time.Second,
    }

    // Start the server in the background
    serverErr := make(chan error, 1)
    go func() {
        slog.Info("server starting", "addr", srv.Addr)
        // ErrServerClosed is the normal result after Shutdown, not a failure
        if err := srv.ListenAndServe(); err != nil && !errors.Is(err, http.ErrServerClosed) {
            serverErr <- err
        }
    }()

    // Wait for SIGTERM or SIGINT (Ctrl+C in development)
    quit := make(chan os.Signal, 1)
    signal.Notify(quit, syscall.SIGTERM, syscall.SIGINT)

    select {
    case err := <-serverErr:
        slog.Error("server failed to start", "error", err)
        os.Exit(1)
    case sig := <-quit:
        slog.Info("shutdown signal received", "signal", sig)
    }

    // Give existing requests up to 30 seconds to finish
    ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
    defer cancel()

    if err := srv.Shutdown(ctx); err != nil {
        slog.Error("forced shutdown", "error", err)
        os.Exit(1)
    }

    slog.Info("server stopped cleanly")
}

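The http.Server.Shutdown call closes the listeners, closes idle connections, and waits for active requests to complete. It respects the context timeout: if requests do not finish within 30 seconds, it returns the context's error, and the remaining connections stay open until the process exits or you call srv.Close. You can then decide whether to force-exit or wait longer. Note that Shutdown does not wait for hijacked connections such as WebSockets; those need their own shutdown notification (see RegisterOnShutdown).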
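
One way to handle the timeout case is to fall back to Close once the drain deadline has passed. Here is a minimal sketch of that idea. The helper shutdownOrClose and the demo function are names I made up for illustration, and the millisecond values are arbitrary:

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net"
	"net/http"
	"time"
)

// shutdownOrClose is a hypothetical helper: it attempts a graceful Shutdown
// and, if the drain deadline passes, falls back to Close, which drops the
// connections that are still active. It reports whether the drain finished.
func shutdownOrClose(srv *http.Server, timeout time.Duration) (bool, error) {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	err := srv.Shutdown(ctx)
	if err == nil {
		return true, nil
	}
	if errors.Is(err, context.DeadlineExceeded) {
		return false, srv.Close()
	}
	return false, err
}

// demo serves one request whose handler sleeps for handlerMs milliseconds,
// then shuts the server down with a drain timeout of drainMs milliseconds.
func demo(handlerMs, drainMs int) (bool, error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return false, err
	}

	started := make(chan struct{}, 1)
	srv := &http.Server{Handler: http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		started <- struct{}{} // tell demo the request is in flight
		time.Sleep(time.Duration(handlerMs) * time.Millisecond)
		fmt.Fprintln(w, "ok")
	})}
	go srv.Serve(ln) // returns http.ErrServerClosed after Shutdown or Close

	go func() {
		resp, err := http.Get("http://" + ln.Addr().String())
		if err == nil {
			resp.Body.Close()
		}
	}()

	<-started
	return shutdownOrClose(srv, time.Duration(drainMs)*time.Millisecond)
}

func main() {
	graceful, err := demo(50, 2000)
	fmt.Println("fast handler: graceful =", graceful, "err =", err)

	graceful, err = demo(2000, 100)
	fmt.Println("slow handler: graceful =", graceful, "err =", err)
}
```

Whether dropping the stragglers is acceptable depends on the service. For idempotent reads it usually is; for writes you may prefer a longer drain window instead.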

Graceful Shutdown for Long-Running Background Jobs

An HTTP server is the easy case because net/http tracks in-flight requests for you. Background workers are harder: you need to track active work yourself with a sync.WaitGroup and tell the worker to stop by cancelling a context.

type Worker struct {
    wg     sync.WaitGroup
    cancel context.CancelFunc
    done   chan struct{}
}

func NewWorker() *Worker {
    return &Worker{done: make(chan struct{})}
}

func (w *Worker) Start(parentCtx context.Context) {
    ctx, cancel := context.WithCancel(parentCtx)
    w.cancel = cancel

    go w.run(ctx)
}

func (w *Worker) run(ctx context.Context) {
    defer close(w.done)

    ticker := time.NewTicker(5 * time.Second)
    defer ticker.Stop()

    for {
        select {
        case <-ctx.Done():
            slog.Info("worker stopping, draining remaining work")
            w.wg.Wait() // finish any in-progress tasks
            return
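        case <-ticker.C:
            w.wg.Add(1)
            go func() {
                defer w.wg.Done()
                // ctx is cancelled on shutdown, so processOneBatch must either
                // wrap up quickly once ctx.Done() fires or be handed
                // context.WithoutCancel(ctx) if a started batch must always complete.
                w.processOneBatch(ctx)
            }()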
        }
    }
}

func (w *Worker) Shutdown(timeout time.Duration) error {
    w.cancel() // signal the worker to stop accepting new work

    select {
    case <-w.done:
        return nil // clean shutdown
    case <-time.After(timeout):
        return fmt.Errorf("worker did not shut down within %s", timeout)
    }
}
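
To watch the drain behave, here is a small runnable variant of the same pattern. It is a sketch under a few assumptions: drainingWorker, its task method, and runDemo are hypothetical names, the task is simulated with a timer, and the detach flag switches between passing the cancellable context to each task and passing context.WithoutCancel (Go 1.21+), which lets a task that has already started run to completion.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// drainingWorker is a hypothetical, trimmed-down relative of Worker that
// counts the tasks it starts and the tasks that run to completion.
type drainingWorker struct {
	wg        sync.WaitGroup
	cancel    context.CancelFunc
	done      chan struct{}
	started   atomic.Int64
	completed atomic.Int64
}

// start launches a task every interval. When detach is true each task gets a
// context that ignores cancellation, so a task that has begun always finishes.
func (w *drainingWorker) start(interval, taskTime time.Duration, detach bool) {
	ctx, cancel := context.WithCancel(context.Background())
	w.cancel = cancel
	w.done = make(chan struct{})

	go func() {
		defer close(w.done)
		ticker := time.NewTicker(interval)
		defer ticker.Stop()
		for {
			select {
			case <-ctx.Done():
				w.wg.Wait() // drain tasks that are still running
				return
			case <-ticker.C:
				taskCtx := ctx
				if detach {
					taskCtx = context.WithoutCancel(ctx) // Go 1.21+
				}
				w.wg.Add(1)
				w.started.Add(1)
				go func() {
					defer w.wg.Done()
					w.task(taskCtx, taskTime)
				}()
			}
		}
	}()
}

// task simulates a batch that honours its context: it gives up if the
// context is cancelled before the work is finished.
func (w *drainingWorker) task(ctx context.Context, d time.Duration) {
	select {
	case <-time.After(d):
		w.completed.Add(1)
	case <-ctx.Done():
	}
}

// shutdown cancels the loop and waits up to timeout for the drain to finish.
func (w *drainingWorker) shutdown(timeout time.Duration) error {
	w.cancel()
	select {
	case <-w.done:
		return nil
	case <-time.After(timeout):
		return fmt.Errorf("worker did not shut down within %s", timeout)
	}
}

// runDemo starts a worker, lets it run briefly, shuts it down, and returns
// how many tasks started and how many completed.
func runDemo(detach bool) (started, completed int64, err error) {
	w := &drainingWorker{}
	w.start(10*time.Millisecond, 50*time.Millisecond, detach)
	time.Sleep(100 * time.Millisecond)
	err = w.shutdown(time.Second)
	return w.started.Load(), w.completed.Load(), err
}

func main() {
	for _, detach := range []bool{false, true} {
		s, c, err := runDemo(detach)
		fmt.Printf("detach=%v started=%d completed=%d err=%v\n", detach, s, c, err)
	}
}
```

With detach set to true, every started task is counted as completed by the time shutdown returns. With detach set to false, tasks that were mid-flight at cancellation give up early, which is the right choice only if partial batches are safe to abandon.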

SQS Consumers: The Special Case

SQS consumers need extra care because in-flight messages have a visibility timeout. If your consumer receives a message, starts processing it, and then gets killed, the message will reappear in the queue after the visibility timeout expires. This means duplicate processing — which is why idempotency is essential for SQS consumers.
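
To make the idempotency point concrete, here is a rough sketch of a handler wrapper that skips messages it has already handled. Everything in it is an assumption for illustration: processedSet and handleOnce are hypothetical names, and the record is kept in memory, whereas a real consumer would need a durable store shared by all instances.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errTransient stands in for any retryable processing failure.
var errTransient = errors.New("transient failure")

// processedSet is a hypothetical in-memory record of message IDs that have
// been handled successfully.
type processedSet struct {
	mu   sync.Mutex
	done map[string]struct{}
}

func newProcessedSet() *processedSet {
	return &processedSet{done: make(map[string]struct{})}
}

// handleOnce runs fn unless id has already been handled successfully.
// It reports whether fn ran. A failed attempt is not recorded, so a
// redelivery of the same message is processed again.
// Two concurrent deliveries of the same id can still both run fn; closing
// that window needs a conditional write in the backing store.
func (p *processedSet) handleOnce(id string, fn func() error) (bool, error) {
	p.mu.Lock()
	_, seen := p.done[id]
	p.mu.Unlock()
	if seen {
		return false, nil // duplicate delivery: skip the work
	}

	if err := fn(); err != nil {
		return true, err
	}

	p.mu.Lock()
	p.done[id] = struct{}{}
	p.mu.Unlock()
	return true, nil
}

func main() {
	set := newProcessedSet()
	charges := 0
	charge := func() error { charges++; return nil }

	// msg-1 is delivered twice, as it would be after an unclean shutdown.
	for _, id := range []string{"msg-1", "msg-1", "msg-2"} {
		ran, err := set.handleOnce(id, charge)
		fmt.Printf("id=%s ran=%v err=%v\n", id, ran, err)
	}

	// A failed attempt is not recorded, so the retry runs.
	ran, err := set.handleOnce("msg-3", func() error { return errTransient })
	fmt.Printf("id=msg-3 ran=%v err=%v\n", ran, err)

	fmt.Println("charges applied:", charges)
}
```

The second delivery of msg-1 is skipped, so only two charges are applied. In a consumer, a skipped duplicate can be deleted from the queue just like a successful message.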

For graceful shutdown of an SQS consumer specifically:

type SQSConsumer struct {
    sqs      *sqs.Client
    queueURL string
    wg       sync.WaitGroup
    stop     chan struct{}
}

func (c *SQSConsumer) Run(ctx context.Context) error {
    defer close(c.stop) // signal that the consumer has stopped

    for {
        // Check if shutdown was requested
        select {
        case <-ctx.Done():
            slog.Info("SQS consumer: waiting for in-flight messages to complete")
            c.wg.Wait()
            slog.Info("SQS consumer: stopped cleanly")
            return nil
        default:
        }

        output, err := c.sqs.ReceiveMessage(ctx, &sqs.ReceiveMessageInput{
            QueueUrl:            &c.queueURL,
            MaxNumberOfMessages: 10,
            WaitTimeSeconds:     5, // long-poll for up to 5s; cancelling ctx aborts the call early
        })
        if err != nil {
            if ctx.Err() != nil {
                // Cancelled during the long-poll: loop back so the drain branch
                // above waits for in-flight messages instead of returning early.
                continue
            }
            slog.Error("receive error", "error", err)
            time.Sleep(time.Second)
            continue
        }

        for _, msg := range output.Messages {
            c.wg.Add(1)
            go func(m sqstypes.Message) {
                defer c.wg.Done()
                if err := c.processAndDelete(context.Background(), m); err != nil {
                    slog.Error("message processing failed", "error", err, "id", *m.MessageId)
                }
            }(msg)
        }
    }
}

func (c *SQSConsumer) processAndDelete(ctx context.Context, msg sqstypes.Message) error {
    // Process the message
    if err := c.handle(ctx, msg); err != nil {
        return err
    }
    // Delete only on success
    _, err := c.sqs.DeleteMessage(ctx, &sqs.DeleteMessageInput{
        QueueUrl:      &c.queueURL,
        ReceiptHandle: msg.ReceiptHandle,
    })
    return err
}

// Shutdown waits for Run to finish draining after its context has been cancelled.
// It does not cancel anything itself; the caller cancels the context passed to Run.
func (c *SQSConsumer) Shutdown(timeout time.Duration) error {
    select {
    case <-c.stop:
        return nil // Run has drained and returned
    case <-time.After(timeout):
        return fmt.Errorf("SQS consumer did not stop within %s", timeout)
    }
}

Coordinating Multiple Servers and Workers

A real service might have an HTTP server, an SQS consumer, and background maintenance goroutines all running simultaneously. You need to shut them all down in the right order: stop accepting new work first, finish existing work, then exit.

func main() {
    ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGTERM, syscall.SIGINT)
    defer stop()

    httpServer := newHTTPServer()
    sqsConsumer := newSQSConsumer()

    // gctx is cancelled by a shutdown signal or by the first component that returns an error
    g, gctx := errgroup.WithContext(ctx)

    // Run should return nil, not http.ErrServerClosed, when it stops because of Shutdown
    g.Go(func() error { return httpServer.Run() })
    g.Go(func() error { return sqsConsumer.Run(gctx) })

    // Wait for a shutdown signal or a component failure
    <-gctx.Done()
    stop() // stop intercepting signals so a second Ctrl+C exits immediately
    slog.Info("shutting down")

    shutdownCtx, cancel := context.WithTimeout(context.Background(), 45*time.Second)
    defer cancel()

    // Shutdown in order: HTTP first (stop new work), then consumer.
    // Worst case this takes 45s + 30s, which must fit inside the orchestrator's stop timeout.
    if err := httpServer.Shutdown(shutdownCtx); err != nil {
        slog.Error("http shutdown error", "error", err)
    }
    if err := sqsConsumer.Shutdown(30 * time.Second); err != nil {
        slog.Error("consumer shutdown error", "error", err)
    }

    if err := g.Wait(); err != nil {
        slog.Error("service error", "error", err)
        os.Exit(1)
    }
}
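
If you would rather bound the total drain time than add up per-component timeouts, one option is to run every shutdown step against a single shared deadline. The sketch below illustrates that idea only. The names step, shutdownInOrder, fakeStop, and demoOrder are hypothetical, the components are simulated with timers, and it assumes each component's shutdown function accepts a context (the consumer's Shutdown above takes a duration, so it would need adapting).

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// step is a hypothetical pairing of a component name with its shutdown function.
type step struct {
	name string
	stop func(ctx context.Context) error
}

// shutdownInOrder runs the steps one after another against one shared
// deadline. Provided every step honours ctx, the total drain time stays
// within budget. Errors are collected rather than aborting the sequence, so
// later components still get a chance to stop.
func shutdownInOrder(budget time.Duration, steps ...step) error {
	ctx, cancel := context.WithTimeout(context.Background(), budget)
	defer cancel()

	var errs []error
	for _, s := range steps {
		if err := s.stop(ctx); err != nil {
			errs = append(errs, fmt.Errorf("%s: %w", s.name, err))
		}
	}
	return errors.Join(errs...) // nil when no step failed
}

// fakeStop simulates a component that needs d to drain but gives up when
// ctx expires. It appends its name to order when it is invoked.
func fakeStop(name string, d time.Duration, order *[]string) func(context.Context) error {
	return func(ctx context.Context) error {
		*order = append(*order, name)
		select {
		case <-time.After(d):
			return nil
		case <-ctx.Done():
			return ctx.Err()
		}
	}
}

// demoOrder shuts down two simulated components and returns the order in
// which they were stopped, plus any error. All arguments are milliseconds.
func demoOrder(budgetMs, httpMs, consumerMs int) ([]string, error) {
	ms := func(n int) time.Duration { return time.Duration(n) * time.Millisecond }
	var order []string
	err := shutdownInOrder(ms(budgetMs),
		step{name: "http", stop: fakeStop("http", ms(httpMs), &order)},
		step{name: "consumer", stop: fakeStop("consumer", ms(consumerMs), &order)},
	)
	return order, err
}

func main() {
	order, err := demoOrder(200, 20, 20)
	fmt.Println("within budget:", order, err)

	order, err = demoOrder(50, 20, 500)
	fmt.Println("over budget:", order, err)
}
```

The trade-off is that a slow first component eats into the time left for the ones after it, so put the component with the most predictable drain time first.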

Infrastructure Configuration

Graceful shutdown only works if the orchestrator gives you enough time. Configure your infrastructure to match your application's drain time:

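ECS: set stopTimeout in the task definition to cover your worst-case drain time. The default is 30 seconds, which is too short for services with long-running requests or large SQS message batches; 60 seconds is a reasonable starting point. The coordinated example above can take up to 75 seconds in the worst case (45 for HTTP plus 30 for the consumer), so either tighten those timeouts or raise stopTimeout to match. On Fargate the maximum is 120 seconds: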

{
  "containerDefinitions": [{
    "stopTimeout": 60
  }]
}

Application Load Balancer: set deregistration_delay.timeout_seconds on the target group (the default is 300 seconds). This gives the load balancer time to stop sending new connections before ECS terminates the task. Set it to match your stopTimeout minus 5 seconds.

Testing Graceful Shutdown

The simplest test: run your service locally, send it some slow requests (use sleep in a handler), then send SIGTERM while the requests are in flight. They should complete cleanly. A proper integration test would:

  1. Start the service in a goroutine
  2. Send several concurrent requests with artificial delays
  3. Send SIGTERM to the process, for example by calling Signal(syscall.SIGTERM) on the *os.Process returned by os.FindProcess(os.Getpid())
  4. Assert all in-flight requests completed with 200
  5. Assert new connections after shutdown are refused
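
The five steps above can be sketched roughly as follows. This is an illustration rather than a drop-in test: runShutdownCheck is a hypothetical name, the handler and its 200 ms delay stand in for your real service, and sending a signal to the current process this way assumes a Unix-like system.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"net/http"
	"os"
	"os/signal"
	"sync"
	"syscall"
	"time"
)

// runShutdownCheck walks through the five steps with n slow requests. It
// returns the status code seen by each client (0 if the request failed) and
// whether a connection attempt made after shutdown was refused.
func runShutdownCheck(n int) ([]int, bool, error) {
	// Register for SIGTERM before sending it, otherwise the process would exit.
	ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGTERM)
	defer stop()

	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return nil, false, err
	}
	addr := ln.Addr().String()

	var inFlight sync.WaitGroup
	inFlight.Add(n)
	srv := &http.Server{Handler: http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		inFlight.Done() // this request has reached the handler
		time.Sleep(200 * time.Millisecond)
		w.WriteHeader(http.StatusOK)
	})}
	go srv.Serve(ln) // step 1: run the service in a goroutine

	// Step 2: send n concurrent slow requests.
	statuses := make([]int, n)
	var clients sync.WaitGroup
	for i := 0; i < n; i++ {
		clients.Add(1)
		go func(i int) {
			defer clients.Done()
			resp, err := http.Get("http://" + addr)
			if err != nil {
				return
			}
			resp.Body.Close()
			statuses[i] = resp.StatusCode
		}(i)
	}
	inFlight.Wait() // every request is now inside the handler

	// Step 3: deliver SIGTERM to this process and wait for it to be observed.
	self, err := os.FindProcess(os.Getpid())
	if err != nil {
		return nil, false, err
	}
	if err := self.Signal(syscall.SIGTERM); err != nil {
		return nil, false, err
	}
	<-ctx.Done()

	shutdownCtx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	if err := srv.Shutdown(shutdownCtx); err != nil {
		return nil, false, err
	}
	clients.Wait() // all responses have been read; the caller checks them (step 4)

	// Step 5: a new connection should now be refused.
	conn, dialErr := net.DialTimeout("tcp", addr, time.Second)
	if dialErr == nil {
		conn.Close()
	}
	return statuses, dialErr != nil, nil
}

func main() {
	statuses, refused, err := runShutdownCheck(3)
	fmt.Println("statuses:", statuses, "refused after shutdown:", refused, "err:", err)
}
```

In a real test you would call this from a Test function and fail if any status is not 200 or if the post-shutdown dial succeeds.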

Summary

Graceful shutdown is a one-time engineering investment: set up the signal handler, implement the drain logic for each component, configure the infrastructure timeouts. You get clean deployments, no lost requests, and no data integrity issues from abruptly killed processes. Once you have it in place, you will never want to deploy without it.
