LLM Evaluation Harnesses in Go: Shipping AI Features Safely
The first version of an AI feature is usually judged by vibes. You run twenty examples, the output looks good, and everyone gets excited. The problem is that vibes do not survive production. Prompts change, models change, input data changes, and suddenly the feature starts producing recommendations that are plausible but wrong.

An evaluation harness turns AI quality into something you can test before every deploy. It will never be perfect, but it is much better than clicking around manually and hoping the model still behaves.

Build A Golden Dataset

Start with real inputs from the product, anonymized and reduced to the fields the model actually needs. For each input, store the expected properties of a good answer. Not always the exact output, but the constraints that matter.

    type EvalCase struct {
        Name         string
        Input        RecommendationInput
        MustInclude  []string
        MustAvoid    []string
        MaxCostCents int
    }

For campaign recommendations, a case might require the...
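
To make the "constraints, not exact output" idea concrete, here is a minimal sketch of how the MustInclude and MustAvoid fields of an EvalCase could be checked against a model's text output. Everything beyond the EvalCase struct itself is an assumption made for illustration: the CheckOutput function, the Goal field on RecommendationInput, the case-insensitive substring matching, and the sample case and output are all hypothetical. MaxCostCents is left unchecked because this section does not say how cost is measured.

```go
package main

import (
	"fmt"
	"strings"
)

// RecommendationInput is a hypothetical stand-in; the article does not
// show its fields here.
type RecommendationInput struct {
	Goal string
}

// EvalCase mirrors the struct from the article.
type EvalCase struct {
	Name         string
	Input        RecommendationInput
	MustInclude  []string
	MustAvoid    []string
	MaxCostCents int
}

// CheckOutput returns one message per violated constraint.
// Assumption: each constraint is a case-insensitive substring match
// against the model output. Real harnesses may need something stricter.
func CheckOutput(c EvalCase, output string) []string {
	var violations []string
	lower := strings.ToLower(output)
	for _, want := range c.MustInclude {
		if !strings.Contains(lower, strings.ToLower(want)) {
			violations = append(violations, fmt.Sprintf("%s: missing %q", c.Name, want))
		}
	}
	for _, bad := range c.MustAvoid {
		if strings.Contains(lower, strings.ToLower(bad)) {
			violations = append(violations, fmt.Sprintf("%s: contains forbidden %q", c.Name, bad))
		}
	}
	return violations
}

func main() {
	// Made-up case and output, for illustration only.
	c := EvalCase{
		Name:        "low-budget-campaign",
		Input:       RecommendationInput{Goal: "increase signups"},
		MustInclude: []string{"budget"},
		MustAvoid:   []string{"guaranteed"},
	}
	output := "Shift budget to search ads for guaranteed growth."
	for _, v := range CheckOutput(c, output) {
		fmt.Println(v)
	}
}
```

With this sample, the output satisfies MustInclude but trips MustAvoid, so the check reports a single violation. Returning a list of messages rather than a bool keeps failures readable when the same function is called from a table-driven test.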