LLM Evaluation Harnesses in Go: Shipping AI Features Safely


The first version of an AI feature is usually judged by vibes. You run twenty examples, the output looks good, and everyone gets excited. The problem is that vibes do not survive production. Prompts change, models change, input data changes, and suddenly the feature starts producing recommendations that are plausible but wrong.

An evaluation harness turns AI quality into something you can test before every deploy. It will never be perfect, but it is much better than clicking around manually and hoping the model still behaves.

Build A Golden Dataset

Start with real inputs from the product, anonymized and reduced to the fields the model actually needs. For each input, store the expected properties of a good answer: not always the exact output, but the constraints that matter.

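// EvalCase is one golden-dataset entry: an input plus the properties a good answer must have.
type EvalCase struct {
    Name         string              // short, stable identifier shown in reports
    Input        RecommendationInput // anonymized product input
    MustInclude  []string            // points a good answer has to mention
    MustAvoid    []string            // suggestions that must not appear
    MaxCostCents int                 // cost ceiling for this case, in cents
}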

For campaign recommendations, a case might require the model to mention wasted spend, avoid suggesting budget increases on paused campaigns, and stay under a cost ceiling.
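As a rough sketch, the campaign case above could be written down and checked like this. The `Input` field is simplified to a string so the example stands alone, and `CheckOutput` with its case-insensitive substring matching is an assumption for illustration, not a prescribed design.

```go
package main

import (
    "fmt"
    "strings"
)

// EvalCase follows the struct in the text; Input is simplified to a string
// here (an assumption) so the sketch is self-contained.
type EvalCase struct {
    Name         string
    Input        string
    MustInclude  []string
    MustAvoid    []string
    MaxCostCents int
}

// CheckOutput is a hypothetical helper. It returns one readable failure per
// violated constraint, using crude case-insensitive substring matching.
func CheckOutput(c EvalCase, output string, costCents int) []string {
    var failures []string
    lower := strings.ToLower(output)
    for _, want := range c.MustInclude {
        if !strings.Contains(lower, strings.ToLower(want)) {
            failures = append(failures, fmt.Sprintf("missing required phrase %q", want))
        }
    }
    for _, bad := range c.MustAvoid {
        if strings.Contains(lower, strings.ToLower(bad)) {
            failures = append(failures, fmt.Sprintf("contains forbidden phrase %q", bad))
        }
    }
    if costCents > c.MaxCostCents {
        failures = append(failures, fmt.Sprintf("cost %d cents exceeds ceiling %d", costCents, c.MaxCostCents))
    }
    return failures
}

func main() {
    c := EvalCase{
        Name:         "paused-campaign-wasted-spend",
        Input:        "campaign A: paused, spend 120.00, conversions 0",
        MustInclude:  []string{"wasted spend"},
        MustAvoid:    []string{"increase budget"},
        MaxCostCents: 5,
    }
    out := "Campaign A shows wasted spend; keep it paused and review targeting."
    fmt.Println(c.Name, CheckOutput(c, out, 3))
}
```

Substring matching is brittle against paraphrase, so constraints like these work best for short, unambiguous phrases. Fuzzier properties are a better fit for a judge model.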

Test Structure First

If the model returns JSON, validate JSON before judging quality. A beautiful recommendation is useless if the backend cannot parse it. This is where Go's type system helps: decode the output into a typed struct, then check the constraints that types alone cannot express.

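// validateRecommendation checks the fields the backend depends on.
// It needs the standard library "errors" package.
func validateRecommendation(r Recommendation) error {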
    if r.Title == "" {
        return errors.New("missing title")
    }
    if r.Action.Type == "" {
        return errors.New("missing action type")
    }
    if r.Confidence < 0 || r.Confidence > 1 {
        return errors.New("confidence out of range")
    }
    return nil
}
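To show where that validator could sit, here is a minimal sketch that decodes raw model output into a typed struct and then validates it. The `Action` and `Recommendation` shapes and their JSON tags are assumptions inferred from the fields the validator touches, and `parseRecommendation` is a hypothetical helper name.

```go
package main

import (
    "encoding/json"
    "errors"
    "fmt"
    "strings"
)

// Action and Recommendation are assumed shapes; the JSON tags are hypothetical.
type Action struct {
    Type string `json:"type"`
}

type Recommendation struct {
    Title      string  `json:"title"`
    Action     Action  `json:"action"`
    Confidence float64 `json:"confidence"`
}

func validateRecommendation(r Recommendation) error {
    if r.Title == "" {
        return errors.New("missing title")
    }
    if r.Action.Type == "" {
        return errors.New("missing action type")
    }
    if r.Confidence < 0 || r.Confidence > 1 {
        return errors.New("confidence out of range")
    }
    return nil
}

// parseRecommendation decodes strictly, then validates. Unknown fields are
// rejected so schema drift in the model output shows up as a parse failure.
func parseRecommendation(raw string) (Recommendation, error) {
    var r Recommendation
    dec := json.NewDecoder(strings.NewReader(raw))
    dec.DisallowUnknownFields()
    if err := dec.Decode(&r); err != nil {
        return r, fmt.Errorf("parse: %w", err)
    }
    if err := validateRecommendation(r); err != nil {
        return r, fmt.Errorf("validate: %w", err)
    }
    return r, nil
}

func main() {
    good := `{"title":"Pause campaign A","action":{"type":"pause"},"confidence":0.8}`
    bad := `{"title":"Pause campaign A","action":{"type":"pause"},"confidence":1.5}`
    for _, raw := range []string{good, bad} {
        if _, err := parseRecommendation(raw); err != nil {
            fmt.Println("rejected:", err)
            continue
        }
        fmt.Println("accepted")
    }
}
```

Prefixing errors with `parse:` or `validate:` keeps the two failure classes separate, which makes it easy to count them as distinct metrics later.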

Use Multiple Evaluators

I like combining deterministic checks with model-based judging. Deterministic checks catch structure, forbidden actions, missing fields, and cost limits. A judge model can assess reasoning quality, relevance, and whether the answer follows the prompt. The key is to make judge prompts boring and consistent, so that a change in score reflects a change in the feature and not a change in the judge.

Across a run, a handful of metrics are worth tracking:

Parse success rate.
Schema validation errors.
Forbidden recommendation count.
Average cost per case.
Judge score by category.
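The metrics listed above can be rolled up from per-case results with very little code. The sketch below assumes a hypothetical `CaseResult` record per evaluated case; the field names are illustrative only.

```go
package main

import "fmt"

// CaseResult is a hypothetical per-case record produced by a harness run.
type CaseResult struct {
    Parsed         bool
    SchemaErrors   int
    ForbiddenCount int
    CostCents      float64
    JudgeScores    map[string]float64 // category -> score in 0..1
}

// Summary holds the run-level metrics.
type Summary struct {
    Cases            int
    ParseSuccessRate float64
    SchemaErrors     int
    ForbiddenCount   int
    AvgCostCents     float64
    JudgeByCategory  map[string]float64 // mean score per category
}

// Summarize aggregates per-case results. An empty run yields zero values.
func Summarize(results []CaseResult) Summary {
    s := Summary{Cases: len(results), JudgeByCategory: map[string]float64{}}
    if len(results) == 0 {
        return s
    }
    parsed := 0
    totalCost := 0.0
    sums := map[string]float64{}
    counts := map[string]int{}
    for _, r := range results {
        if r.Parsed {
            parsed++
        }
        s.SchemaErrors += r.SchemaErrors
        s.ForbiddenCount += r.ForbiddenCount
        totalCost += r.CostCents
        for cat, score := range r.JudgeScores {
            sums[cat] += score
            counts[cat]++
        }
    }
    n := float64(len(results))
    s.ParseSuccessRate = float64(parsed) / n
    s.AvgCostCents = totalCost / n
    for cat, sum := range sums {
        s.JudgeByCategory[cat] = sum / float64(counts[cat])
    }
    return s
}

func main() {
    run := []CaseResult{
        {Parsed: true, CostCents: 2, JudgeScores: map[string]float64{"relevance": 1.0}},
        {Parsed: false, SchemaErrors: 1, CostCents: 4, JudgeScores: map[string]float64{"relevance": 0.5}},
    }
    fmt.Printf("%+v\n", Summarize(run))
}
```

Judge scores are averaged only over the cases that actually report a given category, so a category that is skipped for some cases does not drag the mean toward zero.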

Run Evals In CI And Before Prompt Changes

Prompts are code. They should go through review and tests. The evaluation harness should run on a representative subset in CI, and the full suite before major prompt or model changes. Keep historical scores so regressions are visible.
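Keeping historical scores pays off once something compares them automatically. The sketch below is a minimal regression gate. The metric names, the tolerance, and the assumption that higher is better for every metric are all illustrative choices, and `Regressions` is a hypothetical helper.

```go
package main

import (
    "fmt"
    "sort"
)

// Regressions lists metrics whose current value fell below the baseline by
// more than tolerance. It assumes higher is better for every metric, and it
// treats a metric missing from the current run as a regression.
func Regressions(baseline, current map[string]float64, tolerance float64) []string {
    names := make([]string, 0, len(baseline))
    for name := range baseline {
        names = append(names, name)
    }
    sort.Strings(names) // stable report order
    var out []string
    for _, name := range names {
        cur, ok := current[name]
        if !ok {
            out = append(out, fmt.Sprintf("%s: missing from current run", name))
            continue
        }
        if baseline[name]-cur > tolerance {
            out = append(out, fmt.Sprintf("%s: %.3f -> %.3f", name, baseline[name], cur))
        }
    }
    return out
}

func main() {
    // Illustrative numbers only, not measured results.
    baseline := map[string]float64{"parse_success_rate": 1.00, "judge_relevance": 0.82}
    current := map[string]float64{"parse_success_rate": 1.00, "judge_relevance": 0.74}
    regs := Regressions(baseline, current, 0.02)
    for _, r := range regs {
        fmt.Println("regression:", r)
    }
    if len(regs) == 0 {
        fmt.Println("no regressions")
    }
    // A CI entry point would exit non-zero here when regs is non-empty.
}
```

The tolerance matters because model output varies from run to run. Without one, ordinary noise would fail builds and people would learn to ignore the gate.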

The goal is not to pretend LLMs are deterministic services. They are not. The goal is to build enough measurement around them that shipping changes feels like engineering instead of fortune telling.
