Posts

Showing posts from February, 2026

LLM Evaluation Harnesses in Go: Shipping AI Features Safely

The first version of an AI feature is usually judged by vibes. You run twenty examples, the output looks good, and everyone gets excited. The problem is that vibes do not survive production. Prompts change, models change, input data changes, and suddenly the feature starts producing recommendations that are plausible but wrong.

An evaluation harness turns AI quality into something you can test before every deploy. It will never be perfect, but it is much better than clicking around manually and hoping the model still behaves.

Build A Golden Dataset

Start with real inputs from the product, anonymized and reduced to the fields the model actually needs. For each input, store the expected properties of a good answer. Not always the exact output, but the constraints that matter.

    type EvalCase struct {
        Name         string
        Input        RecommendationInput
        MustInclude  []string
        MustAvoid    []string
        MaxCostCents int
    }

For campaign recommendations, a case might require the...
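
To make the constraint idea concrete, here is a minimal sketch of how one case could be checked against a model output. It is an illustration under assumptions, not the post's harness: `CheckCase`, the case-insensitive substring matching, and the `CampaignID` field on `RecommendationInput` are all hypothetical, since the excerpt does not show how the fields are evaluated.

```go
package main

import (
	"fmt"
	"strings"
)

// RecommendationInput is a hypothetical stand-in; the excerpt does not show its fields.
type RecommendationInput struct {
	CampaignID string
}

// EvalCase mirrors the struct from the post.
type EvalCase struct {
	Name         string
	Input        RecommendationInput
	MustInclude  []string
	MustAvoid    []string
	MaxCostCents int
}

// CheckCase returns one message per violated constraint (hypothetical helper).
// Matching is a case-insensitive substring test, which is an assumption.
func CheckCase(c EvalCase, output string, costCents int) []string {
	var failures []string
	lower := strings.ToLower(output)
	for _, s := range c.MustInclude {
		if !strings.Contains(lower, strings.ToLower(s)) {
			failures = append(failures, fmt.Sprintf("%s: missing %q", c.Name, s))
		}
	}
	for _, s := range c.MustAvoid {
		if strings.Contains(lower, strings.ToLower(s)) {
			failures = append(failures, fmt.Sprintf("%s: contains %q", c.Name, s))
		}
	}
	if costCents > c.MaxCostCents {
		failures = append(failures,
			fmt.Sprintf("%s: cost %d exceeds limit %d", c.Name, costCents, c.MaxCostCents))
	}
	return failures
}

func main() {
	c := EvalCase{
		Name:         "low-budget-campaign",
		Input:        RecommendationInput{CampaignID: "example-1"},
		MustInclude:  []string{"budget"},
		MustAvoid:    []string{"guaranteed"},
		MaxCostCents: 5,
	}
	// An empty result means every constraint held for this output.
	for _, f := range CheckCase(c, "Lower the daily budget by 10%.", 3) {
		fmt.Println(f)
	}
}
```

Returning a list of failures rather than a single boolean keeps the report useful when a prompt change breaks several constraints at once.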

Structured Outputs with Claude API: Production Patterns in Go

The difference between a demo LLM integration and a production one often comes down to structured outputs. In a demo, free-form text is fine because you are showing a human-readable result. In production, you need to reliably parse the response into typed data structures, validate it, handle failures gracefully, and integrate it into downstream systems that expect specific types. This post covers the patterns that have worked in our Go services at the platform.

Why Free-Form Text Fails in Production

LLMs are probabilistic. Even with a fixed system prompt, the same input can produce slightly different output formats across calls. "Return the ACOS as a number" might sometimes produce 23.5, sometimes 23.5%, sometimes "ACOS: 23.5%". Any of these can happen, and your production system must handle all of them or crash.

Structured outputs, combined with JSON schema validation, eliminate this class of problem. Instead of parsing the LLM response as free text, y...
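
As a rough illustration of the typed-parsing side of this, the sketch below decodes a model response into a Go struct and rejects anything that does not match. It uses only the standard library and makes no API call. The `ACOSResult` type, the `acos_percent` field name, and `ParseACOS` are hypothetical names chosen for the example, not taken from the post.

```go
package main

import (
	"bytes"
	"encoding/json"
	"errors"
	"fmt"
)

// ACOSResult is a hypothetical response shape; the field name is an assumption.
// A pointer lets us tell "field missing" apart from a legitimate 0.
type ACOSResult struct {
	ACOSPercent *float64 `json:"acos_percent"`
}

// ParseACOS decodes raw model output strictly and validates the value.
func ParseACOS(raw []byte) (float64, error) {
	dec := json.NewDecoder(bytes.NewReader(raw))
	dec.DisallowUnknownFields() // reject fields the struct does not declare
	var r ACOSResult
	if err := dec.Decode(&r); err != nil {
		return 0, fmt.Errorf("decode model output: %w", err)
	}
	if r.ACOSPercent == nil {
		return 0, errors.New("missing acos_percent")
	}
	if *r.ACOSPercent < 0 {
		return 0, fmt.Errorf("acos_percent is negative: %v", *r.ACOSPercent)
	}
	return *r.ACOSPercent, nil
}

func main() {
	inputs := []string{
		`{"acos_percent": 23.5}`,    // well-formed
		`{"acos_percent": "23.5%"}`, // wrong type: string instead of number
		`ACOS: 23.5%`,               // free text, not JSON
	}
	for _, in := range inputs {
		v, err := ParseACOS([]byte(in))
		if err != nil {
			fmt.Println("rejected:", err)
			continue
		}
		fmt.Println("accepted:", v)
	}
}
```

With this shape, the format drift described above surfaces as an explicit error the caller can retry or log, rather than as a bad value flowing into downstream systems.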