← All posts

Cost-Aware LLM Routing: Reducing AI API Bills by 60%

Three months after shipping the AI feature, our Anthropic API bill had grown faster than the revenue it was generating. The naive solution was to reduce usage. The right solution was to use the right model for each task. A cost-aware router that directs simple tasks to cheaper models and complex reasoning to powerful ones reduced our monthly AI spend by 60% while maintaining — and in some cases improving — output quality.

The Insight: Not All Tasks Are Equal

We were using Claude Opus for everything. Extracting a number from a JSON field does not need the same model as synthesising a 500-word campaign performance narrative. Classifying a keyword into one of five categories does not need the same model as generating a multi-step bid adjustment strategy. Using Opus for classification is like using a Ferrari to go grocery shopping.

The Anthropic model family maps naturally to task complexity:

  • Claude Haiku: fast, cheap (~50× cheaper than Opus per token), excellent for structured extraction, classification, summarisation, and data parsing
  • Claude Sonnet: balanced, good for moderate-complexity reasoning, free-form analysis, and most production use cases
  • Claude Opus: powerful, expensive, best for complex multi-step reasoning, nuanced analysis, and tasks where quality matters more than cost

Task Classification

The router needs to determine which tier a task belongs to before selecting a model. We classify along three dimensions:

type TaskDimensions struct {
    // How many tokens will the input be? (estimate before the call)
    InputTokenEstimate int

    // Does the task require multi-step reasoning or simple lookup?
    RequiresReasoning bool

    // Is the output structured (JSON schema) or free-form prose?
    OutputType OutputType // Structured | FreeForm

    // How important is high accuracy for this specific task?
    AccuracyRequirement AccuracyLevel // High | Standard | Best-effort
}

type TaskTier int
const (
    TierFast      TaskTier = iota // Haiku
    TierStandard                  // Sonnet
    TierPowerful                  // Opus
)

func classifyTask(d TaskDimensions) TaskTier {
    // Large context always needs a capable model
    if d.InputTokenEstimate > 50_000 { return TierPowerful }

    // Explicit reasoning requirement
    if d.RequiresReasoning && d.AccuracyRequirement == High { return TierPowerful }

    // Small structured tasks → fast model
    if d.OutputType == Structured &&
       d.InputTokenEstimate < 3_000 &&
       d.AccuracyRequirement != High {
        return TierFast
    }

    // Everything else → standard model
    return TierStandard
}

The Router Implementation

type LLMRouter struct {
    fast     *anthropic.Client // Haiku
    standard *anthropic.Client // Sonnet
    powerful *anthropic.Client // Opus
    metrics  *RouterMetrics
}

func (r *LLMRouter) Complete(ctx context.Context, task Task) (string, error) {
    tier   := classifyTask(task.Dimensions)
    client := r.clientForTier(tier)
    model  := r.modelForTier(tier)

    start := time.Now()
    resp, err := client.Messages.New(ctx, anthropic.MessageNewParams{
        Model:     anthropic.F(model),
        MaxTokens: anthropic.F(int64(task.MaxTokens)),
        System:    anthropic.F([]anthropic.TextBlockParam{anthropic.NewTextBlock(task.System)}),
        Messages:  anthropic.F(task.Messages),
    })
    elapsed := time.Since(start)

    if err != nil {
        r.metrics.RecordError(model, tier, err)
        // Fallback: if standard model fails, try powerful model
        if tier == TierStandard {
            return r.completeWithModel(ctx, TierPowerful, task)
        }
        return "", err
    }

    r.metrics.RecordCall(model, tier, elapsed,
        resp.Usage.InputTokens, resp.Usage.OutputTokens)

    return resp.Content[0].Text, nil
}

func (r *LLMRouter) clientForTier(t TaskTier) *anthropic.Client {
    switch t {
    case TierFast:     return r.fast
    case TierStandard: return r.standard
    default:           return r.powerful
    }
}

func (r *LLMRouter) modelForTier(t TaskTier) anthropic.Model {
    switch t {
    case TierFast:     return anthropic.ModelClaude3HaikuLatest
    case TierStandard: return anthropic.ModelClaude3_5SonnetLatest
    default:           return anthropic.ModelClaude3OpusLatest
    }
}

Task Definitions for our AI product

In our codebase, every AI-powered feature is a concrete task type with its dimensions defined upfront:

// Fast tier: structured extraction — is this keyword relevant to this product?
var KeywordRelevanceTask = TaskDimensions{
    InputTokenEstimate:  500,
    RequiresReasoning:   false,
    OutputType:          Structured,
    AccuracyRequirement: Standard,
}

// Standard tier: campaign performance summary — what happened this week?
var CampaignSummaryTask = TaskDimensions{
    InputTokenEstimate:  5_000,
    RequiresReasoning:   false,
    OutputType:          FreeForm,
    AccuracyRequirement: Standard,
}

// Powerful tier: strategy recommendation — how should we restructure this account?
var AccountStrategyTask = TaskDimensions{
    InputTokenEstimate:  20_000,
    RequiresReasoning:   true,
    OutputType:          FreeForm,
    AccuracyRequirement: High,
}

Cost Tracking

The router records cost for every call. Anthropic's pricing is per-token, and we track it to know which features are expensive and to detect anomalies (a spike in Opus usage usually means a routing bug):

var pricing = map[anthropic.Model]struct{ input, output float64 }{
    anthropic.ModelClaude3HaikuLatest:        {0.25 / 1e6, 1.25 / 1e6},
    anthropic.ModelClaude3_5SonnetLatest:     {3.00 / 1e6, 15.00 / 1e6},
    anthropic.ModelClaude3OpusLatest:         {15.00 / 1e6, 75.00 / 1e6},
}

func (m *RouterMetrics) RecordCall(model anthropic.Model, tier TaskTier,
    duration time.Duration, inputTokens, outputTokens int64) {

    p := pricing[model]
    cost := float64(inputTokens)*p.input + float64(outputTokens)*p.output

    m.callsTotal.WithLabelValues(string(model), tier.String()).Inc()
    m.latencySeconds.WithLabelValues(string(model)).Observe(duration.Seconds())
    m.costDollars.WithLabelValues(string(model)).Add(cost)
    m.inputTokensTotal.WithLabelValues(string(model)).Add(float64(inputTokens))
    m.outputTokensTotal.WithLabelValues(string(model)).Add(float64(outputTokens))
}

A/B Testing Model Quality

Before committing a task to a cheaper model, validate that the quality is acceptable. We ran shadow mode for two weeks: both the original model and the proposed cheaper model processed each request, and we compared outputs on a sample:

func (r *LLMRouter) shadowTest(ctx context.Context, task Task) {
    if rand.Float64() > 0.05 { return } // 5% shadow rate

    go func() {
        cheaperTier := task.Tier - 1
        if cheaperTier < TierFast { return }

        resp, err := r.completeWithModel(context.Background(), cheaperTier, task)
        if err != nil { return }

        // Store both responses for manual review
        r.shadowLog.Store(ShadowResult{
            TaskType:     task.Type,
            OriginalTier: task.Tier,
            CheaperTier:  cheaperTier,
            Original:     task.LastResponse,
            Cheaper:      resp,
        })
    }()
}

Results in Production

After six weeks of running the router in production:

Task TypeOld ModelNew ModelQuality ChangeCost Change
Keyword relevanceOpusHaikuNo change-98%
Campaign summaryOpusSonnet 3.5No change-80%
Account strategyOpusOpus0%
Data extractionSonnetHaikuNo change-92%

Overall: 60% reduction in monthly API spend. Haiku handles ~70% of our requests by volume (classification, extraction, structured tasks). Sonnet handles ~25%. Opus handles ~5% (complex account strategy, sensitive high-stakes recommendations).

The lesson: matching the model to the task is not a compromise — for simple structured tasks, Haiku often outperforms Opus because the simpler model follows concise instructions more faithfully and returns cleaner JSON. Build the router early, before the bill grows large enough to hurt.

Comments