The LLM Evals Stack: How to Actually Measure Your AI Feature

Q: Does LLM-as-judge actually agree with humans?

When you validate it, yes. [Confident AI's research](https://www.confident-ai.com/blog/why-llm-as-a-judge-is-the-best-llm-evaluation-method) reports judge-human agreement above 85% on validated rubrics in 2025-2026, which is roughly the level at which two humans agree with each other. The catch: this only holds when you've calibrated the judge against human labels first. A judge prompt copied from a tutorial, never validated, is just an opinion machine.

The Skill Nobody Trained For

Every team is shipping LLM features. Almost no team has a real way to know if those features are good.

Hamel Husain and Shreya Shankar's AI Evals For Engineers & PMs cohort has trained more than 2,000 practitioners, including engineers and PMs from OpenAI and Anthropic. The reason it filled up isn't theory. It's that people building this stuff realized their QA process was "ask the prompt engineer to read 30 outputs and trust the gut."

Hamel calls this a "vibes eval." It's the dominant practice, and it's the reason so many AI features ship, demo well, then quietly degrade. On Lenny's Newsletter podcast, Hamel and Shreya argued that evals are now the highest-leverage skill in AI product work: more important than prompt craft, and more important than picking the right model. The reason is structural. If you can't measure quality, you can't improve it.

Most teams know this. They still don't do it. Why? Because evals look like infrastructure work, they don't have a clear owner, and the first version always feels worse than reading outputs by hand. The trick is realizing that "reading outputs by hand" is the eval. The discipline is to do it once, write down what you saw, and turn those notes into checks so nobody has to repeat the same read by hand.

This piece is the playbook. It's tool-agnostic on purpose. You can run this stack with Inspect AI from the UK AI Safety Institute, Phoenix from Arize, Promptfoo, Langfuse, the OpenAI Evals framework, Anthropic's eval tooling, or a stack of Python scripts and a Google Sheet. The methodology carries the weight.

Why "It Looks Good" Stops Working at Scale

The first time you ship an LLM feature, "it looks good" is a fine bar. You wrote the prompt, you tested ten inputs, the outputs look reasonable, you press deploy.

The second time, you're in trouble.

LLM outputs are non-deterministic by design. Temperature 0 reduces variance but doesn't eliminate it. The same prompt with the same model can produce subtly different answers across runs, and noticeably different ones across model versions. Then add the things that change underneath you:

Model drift. Providers update models with the same name. GPT-4o today is not GPT-4o from three months ago. Same string in your config, different behavior. You won't get a notification.
Distribution shift. The inputs real users send aren't the inputs you tested with. Shorter prompts, different languages, weird formatting, attachments. The long tail is where evals matter most.
Prompt entropy. Every time someone edits the system prompt to fix one user's complaint, you risk breaking three other use cases you forgot existed. Without evals, this is invisible until support tickets spike.
Retrieval drift. If you use RAG, your index changes constantly. The retrieval slice of your pipeline degrades silently.

The hidden tax is your team's time. Every change becomes a half-day of someone clicking through prompts. Every model update becomes a "let me check a few" exercise that takes a week. Every escalation becomes a one-off investigation because nobody can answer "is this a regression, or has it always been broken for queries like this?"

Real evals make these questions cheap to answer. The cost is upfront. The savings compound.

The Anatomy of an LLM Eval Stack

There are five layers. They build on each other. Skipping a layer means the ones above it are scoring noise.

Layer	Purpose	Tooling examples	Gotchas
1. Traces	Capture every input, output, tool call, retrieval set, latency, and cost from production runs	Langfuse, Phoenix, OpenTelemetry exporters	Sampling too aggressively; missing tool-call payloads; PII in logs
2. Error analysis	Open-code 50-100 real traces, cluster failure modes, build an error taxonomy	Notion, Airtable, a CSV	Skipping this step; one person doing it alone with no calibration
3. Golden dataset	50-100 hand-labeled examples covering each failure mode and the happy path	Inspect AI dataset format, JSONL, your own schema	Synthetic-only data; no coverage of failure modes you found in step 2
4. LLM-as-judge	A scoring model with a rubric, validated against human labels to >=85% agreement	OpenAI Evals, Promptfoo, Inspect AI, custom	Using a judge you never validated; single-judge scoring; rubric drift
5. CI integration	Run the eval on every PR, gate merges on regression, alert on degradation	GitHub Actions, GitLab CI, custom runners	Running too slow; no cost ceiling; judge model price spikes

The order matters. You can't build a golden dataset without first finding out what failure modes exist (error analysis), and you can't do error analysis without traces. You can't validate a judge without a labeled golden set. You can't run the pipeline in CI until the judge is actually trustworthy.

Aakash Gupta's step-by-step writeup of the Hamel/Shreya curriculum walks through this same skeleton with slightly different naming, and it's the closest thing the field has to a consensus shape. We'll go layer by layer.

Step 1: Capture Traces

A trace is the full record of one LLM interaction. Not just input and output, but everything the system did in between.

A useful trace contains:

The full message history (system prompt, user turns, assistant turns)
The model name and version, exactly as called
All tool calls, their arguments, and their responses
For RAG: the retrieval query, the documents returned, the scores, the documents actually used in the final prompt
Token counts, latency broken out by stage, and cost in dollars
A request ID you can join back to the user session
Any feedback signals (thumbs up, copy click, regenerate, abandoned)

The principle: the trace is the source of truth. If you can't reconstruct exactly what the model saw and did from the trace alone, your eval is downstream of guessing.

On sampling, the trade-off is volume versus cost. For a high-traffic product, log metadata for 100% of traces but retain full payloads for only a sample (5-10%). The exception is error traces (tool call failures, timeouts, low-confidence retrieval, negative feedback): keep full payloads for all of them. Those are your gold mine.
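
That retention rule fits in one function. What follows is a minimal sketch, not a prescribed design: the field names (`request_id`, `error`, `feedback`), the `keep_full_payload` helper, and the 10% default are assumptions made for illustration. Hashing the request ID instead of calling a random number generator keeps the decision stable if the same request is logged twice.

```python
import hashlib

FULL_PAYLOAD_RATE = 0.10  # assumed default; the guidance above is 5-10%


def keep_full_payload(trace: dict, rate: float = FULL_PAYLOAD_RATE) -> bool:
    """Decide whether to retain a trace's full payload or only its metadata."""
    # Error and negative-feedback traces are always kept in full.
    if trace.get("error") or trace.get("feedback") == "negative":
        return True
    # Hash the request ID so the same request always gets the same decision.
    digest = hashlib.sha256(trace["request_id"].encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # roughly uniform between 0 and 1
    return bucket < rate
```

Everything that returns False still gets logged, just with the message bodies and tool payloads dropped.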

On PII: traces will contain user data. Apply the same redaction and retention rules you use for any other production log. Most teams hash user IDs, strip emails, and rotate logs on a 30-90 day schedule.
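
As a rough illustration of the hashing and email stripping, here is a sketch that assumes traces are dicts with `user_id`, `input`, and `output` fields. Those field names, the `redact_trace` helper, and the email pattern are all hypothetical. A regex like this catches common email shapes only, so treat it as a starting point and not a complete PII filter.

```python
import hashlib
import re

# Deliberately simple pattern: matches common addresses, not every valid one.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact_trace(trace: dict, salt: str) -> dict:
    """Return a copy of the trace with the user ID hashed and emails masked."""
    redacted = dict(trace)
    user_id = str(trace.get("user_id", ""))
    # A salted hash keeps sessions joinable without storing the raw ID.
    redacted["user_id"] = hashlib.sha256((salt + user_id).encode("utf-8")).hexdigest()[:16]
    for field in ("input", "output"):
        if isinstance(redacted.get(field), str):
            redacted[field] = EMAIL_RE.sub("[EMAIL]", redacted[field])
    return redacted
```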

If you're rolling your own, write a small wrapper around your LLM SDK that emits structured JSON to your logging pipeline. The format matters less than the discipline of always logging.
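
A wrapper like that can be very small. The sketch below is one possible shape, under stated assumptions: `call_model` stands in for whatever SDK call you already make (it is not a real library function), and the record fields are illustrative. A real trace would also carry tool calls, retrieval results, token counts, and cost.

```python
import json
import time
import uuid
from typing import Callable, TextIO


def traced_call(
    call_model: Callable[[str, list], str],  # hypothetical stand-in for your SDK call
    model: str,
    messages: list,
    log: TextIO,
) -> str:
    """Call the model and append one structured JSON trace line to `log`."""
    record = {
        "request_id": str(uuid.uuid4()),
        "model": model,
        "messages": messages,
        "output": None,
        "error": None,
    }
    start = time.perf_counter()
    try:
        record["output"] = call_model(model, messages)
        return record["output"]
    except Exception as exc:
        record["error"] = repr(exc)
        raise
    finally:
        # Runs on success and on failure, so every call leaves a trace.
        record["latency_ms"] = round((time.perf_counter() - start) * 1000, 2)
        log.write(json.dumps(record) + "\n")
```

Writing the record in `finally` is the design choice that matters here: failed calls are exactly the ones you most want logged.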

Step 2: Error Analysis (Open Coding)

This is the step everyone skips, and it's the step that makes everything else work.

Shreya Shankar's research borrows a technique from qualitative research called open coding. You sit with 50-100 real traces. You read them. You annotate what's wrong (or right). You don't impose categories ahead of time. You let the categories emerge from the data.

Concretely:

Pull 100 traces from production. Mix random samples with traces flagged by negative feedback, errors, or high latency.
Open a spreadsheet. Columns: trace ID, input summary, output summary, problem (free text), severity (1-3), tags (start empty).
For each trace, write what's wrong in plain language. Not "bad output." Specific: "hallucinated a function name that doesn't exist in the user's code," or "answered in the wrong language," or "refused a benign request."
After 20-30 traces, the same problems keep appearing. Start tagging them. Two-word tags are fine.
After all 100, group the tags. You'll typically end up with 5-15 distinct failure modes plus "looks fine."

This is your error taxonomy. Hamel describes this in his Field Notes on LLM Evals as the moment vague concerns become testable claims. "The bot is hallucinating sometimes" turns into "the bot hallucinates function names in 8% of code-related queries, especially when variable names are uncommon." That's something you can write an eval for.
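
Once the tags are in a spreadsheet, turning them into frequencies takes a few lines. A minimal sketch, assuming each coded row has been loaded as a dict with a `tags` list. The row shape, the tag names, and the `failure_mode_rates` helper are hypothetical.

```python
from collections import Counter


def failure_mode_rates(rows: list[dict]) -> dict[str, float]:
    """Share of coded traces carrying each tag; one trace can carry several tags."""
    if not rows:
        return {}
    counts = Counter(tag for row in rows for tag in set(row["tags"]))
    return {tag: count / len(rows) for tag, count in counts.most_common()}


# Made-up coded rows, standing in for a spreadsheet export.
coded = [
    {"trace_id": "t1", "tags": ["hallucinated-function"]},
    {"trace_id": "t2", "tags": []},  # "looks fine"
    {"trace_id": "t3", "tags": ["wrong-language", "hallucinated-function"]},
    {"trace_id": "t4", "tags": ["over-refusal"]},
]
rates = failure_mode_rates(coded)
```

On these four made-up rows, `hallucinated-function` comes out at 0.5 and the other two tags at 0.25 each. With 100 real traces the same numbers become the rough frequencies that make a failure mode a testable claim.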

A second pair of eyes matters. Have someone else code the same 30 traces independently. Where you disagree, the taxonomy is fuzzy and needs sharpening. This is the same inter-annotator agreement check you'll later apply to the LLM judge.
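
Finding the disagreements is a set operation. This sketch assumes each coder's work is a mapping from trace ID to the set of tags they applied. That shape and the `coding_disagreements` helper are illustrative, not from any tool.

```python
def coding_disagreements(coder_a: dict[str, set], coder_b: dict[str, set]) -> dict[str, set]:
    """Map trace ID -> tags that only one of the two coders applied."""
    shared_ids = coder_a.keys() & coder_b.keys()  # only traces both people coded
    diffs = {tid: coder_a[tid] ^ coder_b[tid] for tid in shared_ids}
    return {tid: tags for tid, tags in diffs.items() if tags}
```

The traces this returns are the ones worth discussing together. A tag that shows up here repeatedly usually needs a tighter definition.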

Step 3: Build a Golden Dataset

A golden dataset is a fixed set of inputs, expected behaviors, and labels that your pipeline gets evaluated against. Think of it as your regression test suite for an LLM.

Size: 50-100 examples is a realistic starting point. Not 10 (too noisy). Not 1,000 (too expensive, too slow). Grow it as you find new failure modes, but the first version should be small enough that a human can read every example in an afternoon.

Coverage: every failure mode from error analysis needs at least 5-10 examples. Plus 20-30 happy-path examples. Plus a few edge cases (very long, very short, non-English, adversarial). The arXiv paper Constructing Domain-Specific Evaluation Sets for LLM-as-a-judge is worth reading on coverage strategy.
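
The coverage rule is easy to check mechanically. A minimal sketch, assuming each golden example is a dict with a `failure_modes` list and that the taxonomy is a list of tag names. Both shapes and the `coverage_gaps` helper are hypothetical.

```python
from collections import Counter

MIN_PER_MODE = 5  # lower bound of the 5-10 examples per failure mode suggested above


def coverage_gaps(examples: list[dict], taxonomy: list[str], minimum: int = MIN_PER_MODE) -> dict[str, int]:
    """Return under-covered failure modes, mapped to how many examples are missing."""
    counts = Counter(mode for ex in examples for mode in set(ex["failure_modes"]))
    return {mode: minimum - counts[mode] for mode in taxonomy if counts[mode] < minimum}
```

An empty result means every failure mode in the taxonomy has at least the minimum number of examples.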

Inclusion criteria:

Real user inputs over synthetic ones. Synthetic-only is a known trap: it tends to look like prompts your team would write, not what users actually send.
Each example has a labeled expected outcome. For open-ended generation, that's a rubric (must X, must not Y, ideally Z), not a literal expected string.
Each example is tagged with one or more failure modes.
Sensitive data is scrubbed.

Labeling rubric: write down what "good" looks like. For some tasks, that's an exact match (did it call the right API?). For most, it's a rubric of properties: "must cite at least one retrieved document," "must not refuse," "must be in the user's language," "must not invent function names." Each property is a binary check.
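
One way to store an example with its rubric is a small record serialized to JSONL. The schema below is an assumption for illustration (the field names and the `GoldenExample` class are not from any of the tools mentioned). Adapt it to whatever your pipeline already uses.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class GoldenExample:
    example_id: str
    user_input: str
    rubric: list[str]  # binary properties, each answerable yes/no
    failure_modes: list[str] = field(default_factory=list)
    source: str = "production"  # "production" or "synthetic"


def to_jsonl(examples: list[GoldenExample]) -> str:
    """Serialize one example per line."""
    return "\n".join(json.dumps(asdict(ex), ensure_ascii=False) for ex in examples)


def from_jsonl(text: str) -> list[GoldenExample]:
    """Parse JSONL back into examples, skipping blank lines."""
    return [GoldenExample(**json.loads(line)) for line in text.splitlines() if line.strip()]
```

Keeping `source` on every record makes it trivial to see how much of the set is real user input versus synthetic.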

Refresh cadence: once a quarter, pull a fresh batch of 50-100 production traces and repeat open coding. New failure modes get added. Old ones that no longer occur stay (you want regression coverage). When the model or prompt changes meaningfully, do an out-of-cycle refresh.

The trap to avoid: don't let the team writing the prompt also write the golden dataset alone. It'll overfit to the prompt's strengths. Have someone outside the daily prompt work label at least half the examples.

Step 4: LLM-as-Judge with Human Validation

For tasks with no single right answer (summarization, classification with fuzzy categories, open-ended Q&A), you need a way to score outputs at scale. Manual scoring doesn't run on every PR.

LLM-as-judge is the technique: use a separate LLM call to score your pipeline's output against the rubric. It works. Confident AI's analysis of LLM-as-judge methods reports that judge-human agreement crossed 85% across multiple benchmarks in 2025-2026, which is roughly the level of agreement two humans show with each other on subjective tasks. It's not perfect, but it's good enough to be useful, if you validate it.

The methodology, in order:

Write a rubric for the judge. Not "is this answer good." Specific: "Does the answer cite at least one of the provided documents? Y/N. Does the answer refuse when it shouldn't? Y/N. Does the answer make up function names that aren't in the user's code? Y/N." Binary or short-Likert. Judges are bad at 1-10 scales.
Pick a judge model. A strong general model (GPT-4 class or Claude Sonnet/Opus class) usually outperforms smaller models, but for high-volume eval runs, a cheaper judge is fine if you validate it.
Label your golden set by hand. Two humans, independently. Compute inter-annotator agreement (Cohen's kappa or simple percentage agreement on a binary). If humans don't agree above 80-85%, your rubric is too fuzzy. Sharpen it.
Run the judge on the same set. Compare judge labels to human labels. If judge-human agreement is below 80%, either the judge prompt is not conveying the rubric or the judge model is too weak. Iterate on the judge prompt.
Ship the judge only when agreement is at or above 85%. Below that, you're automating noise.
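
Both agreement measures are short enough to compute by hand. The sketch below covers the binary case only and uses made-up labels. The function names are illustrative. Cohen's kappa is the standard chance-corrected statistic, (p_o - p_e) / (1 - p_e), where p_o is observed agreement and p_e is the agreement expected if both raters labeled at random with their own yes-rates.

```python
def percent_agreement(a: list[bool], b: list[bool]) -> float:
    """Fraction of items on which two raters gave the same binary label."""
    if len(a) != len(b) or not a:
        raise ValueError("label lists must be non-empty and the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a: list[bool], b: list[bool]) -> float:
    """Cohen's kappa for two raters on binary labels."""
    p_o = percent_agreement(a, b)
    p_a_yes = sum(a) / len(a)
    p_b_yes = sum(b) / len(b)
    p_e = p_a_yes * p_b_yes + (1 - p_a_yes) * (1 - p_b_yes)
    if p_e == 1.0:
        return 1.0  # both raters gave the same single label to every item
    return (p_o - p_e) / (1 - p_e)


# Made-up labels for ten golden examples on one rubric question.
human = [True, True, True, False, False, True, False, True, True, False]
judge = [True, True, False, False, False, True, False, True, True, True]

agreement = percent_agreement(human, judge)
kappa = cohens_kappa(human, judge)
ready_to_ship = agreement >= 0.85  # the shipping bar from the step above
```

On these toy labels, raw agreement is 0.80 and kappa is about 0.58, so this judge would not ship yet. The gap between the two numbers is why kappa is worth reporting alongside percentage agreement when one label dominates.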

Common mistakes:

Using a judge without validating it. The agreement might be 60%. You'd never know.
Using the same model to generate and judge. There's a known self-preference bias. Use a different model family for the judge when you can.
Letting the judge rubric drift from the human rubric. The judge and the human labelers have to score against the same rubric text. If you tweak either one, re-validate.
Using one judge call as ground truth. For high-stakes evals, run the judge with three independent calls and take the majority vote. Disagreement among judge calls is itself a signal.
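
The three-call majority vote might look like this. The `judge` argument is a hypothetical callable that wraps one judge-model request and returns a pass/fail verdict. Nothing here depends on a particular provider.

```python
from collections import Counter
from typing import Callable


def majority_verdict(judge: Callable[[str], bool], output: str, calls: int = 3) -> tuple[bool, bool]:
    """Run the judge `calls` times; return (majority verdict, whether it was unanimous)."""
    if calls % 2 == 0:
        raise ValueError("use an odd number of calls so a majority always exists")
    votes = Counter(judge(output) for _ in range(calls))
    verdict, top_count = votes.most_common(1)[0]
    return verdict, top_count == calls
```

The second return value is the disagreement signal: examples where the vote was not unanimous are good candidates for human review.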

Once validated, the judge can score thousands of outputs in minutes for a few dollars. That's the payoff. You go from "we eval before each release" to "we eval on every PR."

Step 5: Wire It Into CI/CD

The whole point of this stack is that nobody has to remember to run evals. CI does it.

A workable shape:

On every PR touching prompts, model configs, or pipeline code: run the golden eval. Block merge if any metric drops more than a threshold (say 3 percentage points on accuracy, or any regression on a hard check like "must not refuse benign requests").
On every model version bump: run the full eval on production-equivalent infra. Compare side by side.
On a nightly schedule: run the eval against a sample of fresh production traces. This catches distribution shift the static golden set misses.
Cost ceilings: cap the per-PR eval budget. A sensible default is $1-$5 per run. If your eval costs more, sample the golden set or use a cheaper judge for PR-time evals and the full judge for releases.
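
The merge gate itself is a comparison between two sets of pass rates. A minimal sketch, assuming metrics are stored as name-to-rate dicts. The metric names, the `regression_failures` helper, and the 3-point default are illustrative.

```python
def regression_failures(
    baseline: dict[str, float],
    candidate: dict[str, float],
    max_drop: float = 0.03,
    hard_checks: frozenset = frozenset(),
) -> list[str]:
    """Return reasons to block the merge; an empty list means the gate passes."""
    failures = []
    for metric, base in baseline.items():
        new = candidate.get(metric, 0.0)  # a missing metric counts as a full regression
        allowed = 0.0 if metric in hard_checks else max_drop
        # Small epsilon so a drop of exactly `allowed` is not flagged by float error.
        if base - new > allowed + 1e-9:
            failures.append(f"{metric}: {base:.1%} -> {new:.1%}")
    return failures
```

In a CI job, a non-empty list would translate into a non-zero exit code, for example by calling `sys.exit(1)` after printing the reasons.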

A pattern that works: a "fast eval" of 20-30 examples on every commit (sub-2-minute) and a "full eval" of 100-300 examples on PR approval and on main. Same pipeline, different sample sizes.
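
For the fast eval to be comparable across commits, it should run on the same examples every time. One way to get a stable subset is sketched below, assuming every golden example has a string ID. The `fast_subset` helper is hypothetical.

```python
import hashlib


def fast_subset(example_ids: list[str], size: int = 25) -> list[str]:
    """Pick a stable pseudo-random subset: the same IDs in give the same subset out."""
    ranked = sorted(example_ids, key=lambda eid: hashlib.sha256(eid.encode("utf-8")).hexdigest())
    return ranked[:size]
```

Ranking by hash instead of shuffling with a seed has a side benefit: adding new examples to the golden set never reorders the existing ones relative to each other, so the subset changes only when a new example hashes into the top slots.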

Reporting matters. The PR comment should show overall pass rate, per-failure-mode pass rate, diff against main, and links to failed traces. Don't bury this in a log file.
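
A plain-text version of that comment can be assembled from two dicts of pass rates. This is only a sketch of the formatting: the `format_pr_comment` helper and the metric names are made up, and posting the result to your code host is left out.

```python
def format_pr_comment(current: dict[str, float], main: dict[str, float], failed_trace_urls: list[str]) -> str:
    """Render pass rates with their diff against main, then links to failed traces."""
    lines = ["Eval results"]
    for name, rate in current.items():
        if name in main:
            lines.append(f"{name}: {rate:.1%} ({rate - main[name]:+.1%} vs main)")
        else:
            lines.append(f"{name}: {rate:.1%} (new metric)")
    if failed_trace_urls:
        lines.append("Failed traces:")
        lines.extend(f"- {url}" for url in failed_trace_urls)
    return "\n".join(lines)
```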

Alert on baseline drift. If the nightly pass rate drops 5 points and stays there for two days, something changed (a model update, a retrieval index change, a downstream service). The eval is also your monitoring layer.
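
The "drops and stays there" condition can be written down directly. A minimal sketch, assuming you keep a list of nightly pass rates, oldest first, and a fixed baseline. The `sustained_drop` helper and its defaults mirror the 5-point, two-day example above and are not a recommendation.

```python
def sustained_drop(nightly_rates: list[float], baseline: float, drop: float = 0.05, nights: int = 2) -> bool:
    """True if each of the last `nights` runs is at least `drop` below the baseline."""
    if len(nightly_rates) < nights:
        return False
    # Small epsilon so a drop of exactly `drop` counts despite float rounding.
    return all(baseline - rate >= drop - 1e-9 for rate in nightly_rates[-nights:])
```

Requiring consecutive low nights filters out one-off dips from judge variance, at the cost of alerting a day later.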

Anti-Patterns Builders Keep Shipping

A few patterns to watch for, because they keep showing up in real teams:

Vibes evals shipped to prod. An engineer eyeballs 20 outputs, says "ship it," and the feature goes live. Three weeks later, nobody can answer "is this better than last week."

Scoring with a single unvalidated judge. Someone reads about LLM-as-judge, writes a one-line prompt ("rate this from 1-10"), and starts making decisions on the scores. The scores are noise.

No error taxonomy. The team skipped open coding. The golden set was written off the top of someone's head. It overfits to obvious cases and misses every interesting failure mode real users hit.

No human-agreement check. When two humans haven't agreed on the labels, the judge's "agreement" with one of them is meaningless.

Synthetic data only. The golden set is generated by another LLM. Looks great. Passes every eval. Production breaks because real user inputs don't look like LLM-generated inputs.

One number summarizing everything. "Our eval score is 87%." On what? Across which failure modes? Per-failure-mode pass rates are required. An aggregate hides the failures that matter.
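
A quick worked example shows how an aggregate hides a broken failure mode. The numbers are invented for illustration, and `pass_rates` is a hypothetical helper that assumes no failure mode is literally named "overall".

```python
from collections import defaultdict


def pass_rates(results: list[tuple[str, bool]]) -> dict[str, float]:
    """Take (failure_mode, passed) pairs; return per-mode rates plus an "overall" rate."""
    totals, passes = defaultdict(int), defaultdict(int)
    for mode, passed in results:
        for key in (mode, "overall"):
            totals[key] += 1
            passes[key] += int(passed)
    return {key: passes[key] / totals[key] for key in totals}


# Invented results: 90 happy-path cases (84 pass) and 10 over-refusal cases (3 pass).
results = (
    [("happy_path", True)] * 84 + [("happy_path", False)] * 6
    + [("over_refusal", True)] * 3 + [("over_refusal", False)] * 7
)
rates = pass_rates(results)
```

Here the overall rate is 87%, the happy path sits around 93%, and the over-refusal cases pass only 30% of the time. The headline number looks healthy while one failure mode is failing most of its cases.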

Evals as an afterthought. Treated as a quarterly exercise instead of a continuous pipeline. Goes stale within a release cycle. Nobody trusts it.

The 7-Day Eval Plan for a Team With No Evals

If you're starting from zero, here's a week that gets you to a working pipeline.

Day 1: Capture traces. Pick one LLM feature. Add logging that captures input, output, tool calls, retrieval context if any, model version, and a request ID. Pipe to wherever your logs live. Don't pick a tool yet; a JSONL file works for week one. Goal at end of day: 100+ traces from production.

Day 2: Open coding, round one. Pull 50 traces. One person reads them. Free-text annotation of what's wrong or right. Don't categorize yet. Just describe.

Day 3: Open coding, round two. A second person does the same 50. Compare notes. Build the failure-mode taxonomy from the patterns. Aim for 5-12 named failure modes. Estimate rough frequency for each.

Day 4: Golden dataset v0. Take 50-80 examples that cover each failure mode plus the happy path. Label each one with a rubric: what must be true of a good response. Store it as JSONL or a spreadsheet export. The format doesn't matter, but keep it in version control.

Day 5: Build the judge. Write a judge prompt that scores each rubric property as a binary. Pick a judge model from a different family than your production model. Run it on the golden set.
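
A binary judge prompt and its parser can be as plain as the sketch below. The prompt wording, the "1: Y" reply format, and both function names are assumptions for illustration. The actual model call is left out because it depends on your provider.

```python
def build_judge_prompt(rubric: list[str], user_input: str, output: str) -> str:
    """Turn a list of yes/no rubric questions into one judge prompt."""
    questions = "\n".join(f"{i}. {q}" for i, q in enumerate(rubric, start=1))
    return (
        "You are grading an assistant's response against a rubric.\n"
        "Answer each question with Y or N only, one per line, formatted as '1: Y'.\n\n"
        f"Questions:\n{questions}\n\n"
        f"User input:\n{user_input}\n\n"
        f"Assistant response:\n{output}\n"
    )


def parse_verdicts(reply: str, n_questions: int) -> list[bool]:
    """Parse lines like '2: N' into booleans; raise if any answer is missing."""
    verdicts = {}
    for line in reply.splitlines():
        number, sep, answer = line.partition(":")
        answer = answer.strip().upper()
        if sep and number.strip().isdecimal() and answer in ("Y", "N"):
            verdicts[int(number)] = answer == "Y"
    missing = [i for i in range(1, n_questions + 1) if i not in verdicts]
    if missing:
        raise ValueError(f"judge reply missing answers for questions {missing}")
    return [verdicts[i] for i in range(1, n_questions + 1)]
```

Failing loudly on a malformed reply is deliberate: a judge answer that silently defaults to pass or fail would corrupt the agreement numbers you compute on day six.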

Day 6: Validate the judge. Have two humans label the same golden set against the same rubric. Compute human-human agreement and judge-human agreement. If judge-human is below 80%, iterate on the judge prompt. Stop when you hit 85% or run out of day six. If you can't get to 85%, your rubric is the problem. Sharpen the binary questions.

Day 7: Wire CI. A GitHub Action that runs the eval on PRs touching prompts or pipeline code. Output a comment with pass rate, per-failure-mode breakdown, and links to failed traces. Set a budget. Set a regression threshold.

End of week: you have traces, a taxonomy, a golden set, a validated judge, and a CI gate. The whole stack. It's small, and that's the point. It compounds from here.

After week one, you'll spend most of your evals time on three things: growing the golden set, refining the rubric as you find ambiguity, and investigating regressions caught by CI. None of it requires another big build.

Frequently Asked Questions

What's the difference between a vibes eval and a real eval?

A vibes eval is "I read 20 outputs and it looks better." A real eval has: a fixed input set, a written rubric, a scoring mechanism that doesn't depend on whoever happens to be reading, and a way to compare two runs. The fastest litmus test: if you can't answer "what was the eval score last week and this week, and which failure modes regressed," you don't have a real eval.

How big should my golden dataset be?

50-100 examples for the first version. Big enough to cover each failure mode with 5-10 cases, small enough that a human can read all of it in an afternoon. Grow it over time as you find new failure modes in production. Most mature pipelines sit at 200-500 examples; very few benefit from going past 1,000 unless they're evaluating across many distinct use cases.

Do I need a specialized eval tool, or can I roll my own?

For week one, roll your own. JSONL files, a Python script, a spreadsheet for labels. Once you outgrow that (usually when multiple people need to label, or you want a UI for browsing traces), pick a tool. Inspect AI, Phoenix, Promptfoo, Langfuse, and the OpenAI Evals framework all cover the same basic shape with different ergonomics. The methodology is the moat, not the tool.

When should I re-label my golden dataset?

Quarterly as a baseline. Out of cycle when you change models, when you change the prompt significantly, when support tickets spike with a new failure mode, or when your nightly eval against fresh production traces shows the golden set is no longer representative.

Closing Thoughts

The teams that win the next two years of AI product work aren't the ones with the cleverest prompts. They're the ones who can answer, on any Tuesday, "is our AI feature better today than it was last month, and on which dimensions, and for which users." That's an evals question, not a prompt question.

The methodology has stabilized: traces, error analysis, golden dataset, validated judge, CI loop. Five layers, each buildable in a day or two by a small team. The gap between teams that have an eval stack and teams that don't is widening fast.

If you've shipped an LLM feature and you can't answer the regression question with confidence, your week-one task isn't another prompt tweak. It's a JSONL file, fifty traces, and a spreadsheet of labels. The rest of the stack falls out of that. Start small, validate ruthlessly, ship the pipeline before the next feature.