Cost-Aware LLM Routing: Reducing AI API Bills by 60%
Three months after shipping the AI feature, our Anthropic API bill was growing faster than the revenue the feature generated. The naive solution was to reduce usage. The right solution was to use the right model for each task. A cost-aware router that directs simple tasks to cheaper models and complex reasoning to powerful ones reduced our monthly AI spend by 60% while maintaining, and in some cases improving, output quality.
The Insight: Not All Tasks Are Equal
We were using Claude Opus for everything. Extracting a number from a JSON field does not need the same model as synthesising a 500-word campaign performance narrative. Classifying a keyword into one of five categories does not need the same model as generating a multi-step bid adjustment strategy. Using Opus for classification is like using a Ferrari to go grocery shopping.
The Anthropic model family maps naturally to task complexity:
- Claude Haiku: fast, cheap (~60× cheaper than Opus per token at the prices listed under Cost Tracking), excellent for structured extraction, classification, summarisation, and data parsing
- Claude Sonnet: balanced, good for moderate-complexity reasoning, free-form analysis, and most production use cases
- Claude Opus: powerful, expensive, best for complex multi-step reasoning, nuanced analysis, and tasks where quality matters more than cost
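To make the price gap between tiers concrete, here is a minimal sketch that computes the ratios from the per-million-token prices quoted in the Cost Tracking section of this post. The `perMillion` map, its string keys, and `costRatio` are illustrative names, and the figures are this post's, not a live price list.

```go
package main

import "fmt"

// price holds USD per million tokens. The figures are the ones used in this
// post's pricing table; check current list prices before relying on them.
type price struct{ input, output float64 }

var perMillion = map[string]price{
	"haiku":  {0.25, 1.25},
	"sonnet": {3.00, 15.00},
	"opus":   {15.00, 75.00},
}

// costRatio reports how many times more expensive model a is than model b,
// separately for input tokens and for output tokens.
func costRatio(a, b string) (input, output float64) {
	pa, pb := perMillion[a], perMillion[b]
	return pa.input / pb.input, pa.output / pb.output
}

func main() {
	for _, m := range []string{"haiku", "sonnet"} {
		in, out := costRatio("opus", m)
		fmt.Printf("opus vs %s: %.0fx input, %.0fx output\n", m, in, out)
	}
}
```

At these prices the ratio is the same for input and output tokens, so the saving from moving a task down a tier does not depend on how chatty the output is.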
Task Classification
The router needs to determine which tier a task belongs to before selecting a model. We classify along four dimensions:
type TaskDimensions struct {
	// How many tokens will the input be? (estimate before the call)
	InputTokenEstimate int
	// Does the task require multi-step reasoning or simple lookup?
	RequiresReasoning bool
	// Is the output structured (JSON schema) or free-form prose?
	OutputType OutputType // Structured | FreeForm
	// How important is high accuracy for this specific task?
	AccuracyRequirement AccuracyLevel // High | Standard | Best-effort
}

type TaskTier int

const (
	TierFast     TaskTier = iota // Haiku
	TierStandard                 // Sonnet
	TierPowerful                 // Opus
)

func classifyTask(d TaskDimensions) TaskTier {
	// Large context always needs a capable model
	if d.InputTokenEstimate > 50_000 {
		return TierPowerful
	}
	// Explicit reasoning requirement
	if d.RequiresReasoning && d.AccuracyRequirement == High {
		return TierPowerful
	}
	// Small structured tasks → fast model
	if d.OutputType == Structured &&
		d.InputTokenEstimate < 3_000 &&
		d.AccuracyRequirement != High {
		return TierFast
	}
	// Everything else → standard model
	return TierStandard
}
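The `InputTokenEstimate` field has to be filled in before the call, when no tokenizer output is available yet. One rough way to do that is sketched below: `estimateTokens` is a hypothetical helper (not part of the router above) that assumes about four characters per token, a common rule of thumb for English prose. It is not a tokenizer and can be noticeably off for code, JSON, or non-English text.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// estimateTokens approximates the token count of the given prompt parts as
// one token per four characters. This is a heuristic, not a real tokenizer.
func estimateTokens(parts ...string) int {
	chars := 0
	for _, p := range parts {
		chars += utf8.RuneCountInString(p)
	}
	// Round up so a short, non-empty prompt never estimates to zero.
	return (chars + 3) / 4
}

func main() {
	system := "You classify advertising keywords."
	user := strings.Repeat("keyword,clicks,cost\n", 200)
	fmt.Println("estimated input tokens:", estimateTokens(system, user))
}
```

Because the classifier compares the estimate against hard thresholds (3,000 and 50,000), an estimate that lands close to a boundary is worth treating conservatively, for example by rounding it up before classification.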
The Router Implementation
// This snippet assumes a version of github.com/anthropics/anthropic-sdk-go
// that exposes anthropic.F and the Model constants used below; verify both
// against the version you pin.
import (
	"context"
	"errors"
	"time"

	"github.com/anthropics/anthropic-sdk-go"
)

type LLMRouter struct {
	fast     *anthropic.Client // Haiku
	standard *anthropic.Client // Sonnet
	powerful *anthropic.Client // Opus
	metrics  *RouterMetrics
}

func (r *LLMRouter) Complete(ctx context.Context, task Task) (string, error) {
	tier := classifyTask(task.Dimensions)
	client := r.clientForTier(tier)
	model := r.modelForTier(tier)

	start := time.Now()
	resp, err := client.Messages.New(ctx, anthropic.MessageNewParams{
		Model:     anthropic.F(model),
		MaxTokens: anthropic.F(int64(task.MaxTokens)),
		System:    anthropic.F([]anthropic.TextBlockParam{anthropic.NewTextBlock(task.System)}),
		Messages:  anthropic.F(task.Messages),
	})
	elapsed := time.Since(start)
	if err != nil {
		r.metrics.RecordError(model, tier, err)
		// Fallback: if standard model fails, try powerful model.
		// completeWithModel (not shown) sends the same task to a fixed tier.
		if tier == TierStandard {
			return r.completeWithModel(ctx, TierPowerful, task)
		}
		return "", err
	}

	r.metrics.RecordCall(model, tier, elapsed,
		resp.Usage.InputTokens, resp.Usage.OutputTokens)
	// Guard against an empty content slice before indexing into it.
	if len(resp.Content) == 0 {
		return "", errors.New("llm router: response contained no content blocks")
	}
	return resp.Content[0].Text, nil
}

func (r *LLMRouter) clientForTier(t TaskTier) *anthropic.Client {
	switch t {
	case TierFast:
		return r.fast
	case TierStandard:
		return r.standard
	default:
		return r.powerful
	}
}

func (r *LLMRouter) modelForTier(t TaskTier) anthropic.Model {
	switch t {
	case TierFast:
		return anthropic.ModelClaude3HaikuLatest
	case TierStandard:
		return anthropic.ModelClaude3_5SonnetLatest
	default:
		return anthropic.ModelClaude3OpusLatest
	}
}
Task Definitions for Our AI Product
In our codebase, every AI-powered feature is a concrete task type with its dimensions defined upfront:
// Fast tier: structured extraction — is this keyword relevant to this product?
var KeywordRelevanceTask = TaskDimensions{
	InputTokenEstimate:  500,
	RequiresReasoning:   false,
	OutputType:          Structured,
	AccuracyRequirement: Standard,
}

// Standard tier: campaign performance summary — what happened this week?
var CampaignSummaryTask = TaskDimensions{
	InputTokenEstimate:  5_000,
	RequiresReasoning:   false,
	OutputType:          FreeForm,
	AccuracyRequirement: Standard,
}

// Powerful tier: strategy recommendation — how should we restructure this account?
var AccountStrategyTask = TaskDimensions{
	InputTokenEstimate:  20_000,
	RequiresReasoning:   true,
	OutputType:          FreeForm,
	AccuracyRequirement: High,
}
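It is worth checking that each definition really lands on the tier its comment promises. The following self-contained sketch repeats the types and `classifyTask` from earlier in this post and runs the three definitions through it. The enum declarations for `OutputType` and `AccuracyLevel`, the name `BestEffort`, and the `String` method are assumptions, since the post does not show them.

```go
package main

import "fmt"

type OutputType int

const (
	Structured OutputType = iota
	FreeForm
)

type AccuracyLevel int

const (
	BestEffort AccuracyLevel = iota // assumed name for "Best-effort"
	Standard
	High
)

type TaskDimensions struct {
	InputTokenEstimate  int
	RequiresReasoning   bool
	OutputType          OutputType
	AccuracyRequirement AccuracyLevel
}

type TaskTier int

const (
	TierFast TaskTier = iota
	TierStandard
	TierPowerful
)

// String gives each tier a readable label for logs and metric labels.
func (t TaskTier) String() string {
	switch t {
	case TierFast:
		return "fast"
	case TierStandard:
		return "standard"
	default:
		return "powerful"
	}
}

// classifyTask is copied from the Task Classification section.
func classifyTask(d TaskDimensions) TaskTier {
	if d.InputTokenEstimate > 50_000 {
		return TierPowerful
	}
	if d.RequiresReasoning && d.AccuracyRequirement == High {
		return TierPowerful
	}
	if d.OutputType == Structured && d.InputTokenEstimate < 3_000 && d.AccuracyRequirement != High {
		return TierFast
	}
	return TierStandard
}

func main() {
	tasks := []struct {
		name string
		dims TaskDimensions
	}{
		{"keyword relevance", TaskDimensions{500, false, Structured, Standard}},
		{"campaign summary", TaskDimensions{5_000, false, FreeForm, Standard}},
		{"account strategy", TaskDimensions{20_000, true, FreeForm, High}},
	}
	for _, t := range tasks {
		fmt.Printf("%-18s -> %s\n", t.name, classifyTask(t.dims))
	}
}
```

A table like the one in `main` translates directly into a unit test, which catches the case where someone edits a threshold and quietly moves a high-volume task onto a more expensive tier.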
Cost Tracking
The router records cost for every call. Anthropic's pricing is per-token, and we track it to know which features are expensive and to detect anomalies (a spike in Opus usage usually means a routing bug):
// Prices in USD per token (list prices are quoted per million tokens).
var pricing = map[anthropic.Model]struct{ input, output float64 }{
	anthropic.ModelClaude3HaikuLatest:    {0.25 / 1e6, 1.25 / 1e6},
	anthropic.ModelClaude3_5SonnetLatest: {3.00 / 1e6, 15.00 / 1e6},
	anthropic.ModelClaude3OpusLatest:     {15.00 / 1e6, 75.00 / 1e6},
}

func (m *RouterMetrics) RecordCall(model anthropic.Model, tier TaskTier,
	duration time.Duration, inputTokens, outputTokens int64) {
	// A model missing from pricing yields the zero value, so its cost records as $0.
	p := pricing[model]
	cost := float64(inputTokens)*p.input + float64(outputTokens)*p.output

	// tier.String() requires a String method on TaskTier.
	m.callsTotal.WithLabelValues(string(model), tier.String()).Inc()
	m.latencySeconds.WithLabelValues(string(model)).Observe(duration.Seconds())
	m.costDollars.WithLabelValues(string(model)).Add(cost)
	m.inputTokensTotal.WithLabelValues(string(model)).Add(float64(inputTokens))
	m.outputTokensTotal.WithLabelValues(string(model)).Add(float64(outputTokens))
}
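As a worked example of the cost formula in `RecordCall`, the sketch below prices a single call on each tier. The token counts (500 in, 50 out) are hypothetical sizes for a small classification request, and the map is keyed by plain strings rather than SDK constants so the example runs on its own. `callCost` is an illustrative helper; unlike the plain `pricing[model]` lookup, it reports when a model has no price entry instead of silently returning zero.

```go
package main

import "fmt"

// price is USD per token, using the figures from this post's pricing map.
type price struct{ input, output float64 }

var pricing = map[string]price{
	"haiku":  {0.25 / 1e6, 1.25 / 1e6},
	"sonnet": {3.00 / 1e6, 15.00 / 1e6},
	"opus":   {15.00 / 1e6, 75.00 / 1e6},
}

// callCost returns the USD cost of one call. ok is false when the model has
// no entry in pricing, so the caller can alert rather than record $0.
func callCost(model string, inputTokens, outputTokens int64) (cost float64, ok bool) {
	p, ok := pricing[model]
	if !ok {
		return 0, false
	}
	return float64(inputTokens)*p.input + float64(outputTokens)*p.output, true
}

func main() {
	// Hypothetical request size: 500 input tokens, 50 output tokens.
	for _, m := range []string{"haiku", "sonnet", "opus"} {
		c, _ := callCost(m, 500, 50)
		fmt.Printf("%-6s $%.7f per call, $%.2f per million calls\n", m, c, c*1e6)
	}
	if _, ok := callCost("unknown-model", 500, 50); !ok {
		fmt.Println("unknown-model: no price entry, cost not recorded")
	}
}
```

The unknown-model branch is the case most likely to bite in practice: when a model constant is upgraded but the pricing map is not, the dashboard shows spend dropping to zero for that tier while the real bill keeps growing.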
A/B Testing Model Quality
Before committing a task to a cheaper model, validate that the quality is acceptable. We ran shadow mode for two weeks: for a 5% sample of requests, the proposed cheaper model processed the same request as the original model, and we reviewed the paired outputs manually:
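func (r *LLMRouter) shadowTest(ctx context.Context, task Task) {
	if rand.Float64() > 0.05 { // 5% shadow rate
		return
	}
	cheaperTier := task.Tier - 1
	if cheaperTier < TierFast {
		return // already on the cheapest tier, nothing to compare
	}

	go func() {
		// Detached from the request context so the shadow call can finish after
		// the response is sent. The 30s timeout is an arbitrary bound that keeps
		// stuck calls from accumulating goroutines.
		shadowCtx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
		defer cancel()

		resp, err := r.completeWithModel(shadowCtx, cheaperTier, task)
		if err != nil {
			return
		}
		// Store both responses for manual review
		r.shadowLog.Store(ShadowResult{
			TaskType:     task.Type,
			OriginalTier: task.Tier,
			CheaperTier:  cheaperTier,
			Original:     task.LastResponse,
			Cheaper:      resp,
		})
	}()
}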
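For structured tasks, a first pass over the shadow log can be automated before any manual review. The sketch below is an assumption about how that could look, not what we ran: it treats two outputs as matching when they decode to the same JSON value, and reports an agreement rate per task type. `ShadowResult` is trimmed to the fields needed here, and the sample pairs in `main` are made up for illustration. Free-form outputs still need a human (or a separate grading step).

```go
package main

import (
	"encoding/json"
	"fmt"
	"reflect"
)

// ShadowResult is a reduced version of the struct stored by shadowTest.
type ShadowResult struct {
	TaskType string
	Original string
	Cheaper  string
}

// jsonEqual reports whether a and b decode to the same JSON value, ignoring
// key order and whitespace. Invalid JSON on either side counts as a mismatch.
func jsonEqual(a, b string) bool {
	var va, vb any
	if json.Unmarshal([]byte(a), &va) != nil || json.Unmarshal([]byte(b), &vb) != nil {
		return false
	}
	return reflect.DeepEqual(va, vb)
}

// agreementRate returns, per task type, the fraction of shadow pairs whose
// outputs are JSON-equal.
func agreementRate(results []ShadowResult) map[string]float64 {
	match, total := map[string]int{}, map[string]int{}
	for _, r := range results {
		total[r.TaskType]++
		if jsonEqual(r.Original, r.Cheaper) {
			match[r.TaskType]++
		}
	}
	rates := make(map[string]float64, len(total))
	for taskType, n := range total {
		rates[taskType] = float64(match[taskType]) / float64(n)
	}
	return rates
}

func main() {
	// Made-up sample pairs, for illustration only.
	results := []ShadowResult{
		{"keyword_relevance", `{"relevant": true}`, `{ "relevant": true }`},
		{"keyword_relevance", `{"relevant": true}`, `{"relevant": false}`},
	}
	fmt.Println(agreementRate(results))
}
```

Exact JSON equality is a strict criterion. It suits classification and extraction, where the answer is a label or a number, and it makes the disagreeing pairs a natural shortlist for manual review.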
Results in Production
After six weeks of running the router in production:
| Task Type | Old Model | New Model | Quality Change | Cost Change |
|---|---|---|---|---|
| Keyword relevance | Opus | Haiku | No change | -98% |
| Campaign summary | Opus | Sonnet 3.5 | No change | -80% |
| Account strategy | Opus | Opus | — | 0% |
| Data extraction | Sonnet | Haiku | No change | -92% |
Overall: 60% reduction in monthly API spend. Haiku handles ~70% of our requests by volume (classification, extraction, structured tasks). Sonnet handles ~25%. Opus handles ~5% (complex account strategy, sensitive high-stakes recommendations).
The lesson: matching the model to the task is not a compromise. For simple structured tasks in our workload, Haiku often outperformed Opus, because the simpler model followed concise instructions more faithfully and returned cleaner JSON. Build the router early, before the bill grows large enough to hurt.