What it is: Evals (short for evaluations) are systematic tests used to measure how well an AI model performs on specific tasks, benchmarks, or quality criteria.
Who it’s for: Anyone learning AI terminology
Best if: You’ve seen this term and want a clear explanation
Skip if: You already work with this concept daily
Why evals matter
Large language models are non-deterministic — ask the same question twice and you can get two different answers. That makes them very different from normal software, where you can usually write a test that says “if I click this button, this exact thing should happen.” With AI, you need a way to measure quality across hundreds or thousands of examples to know if a change actually helped.
Evals solve three problems at once. First, they tell you whether a new model (say, swapping Claude Sonnet for Claude Opus) is actually better for your specific task, not just better on a leaderboard. Second, they catch regressions when you tweak a prompt or add a new tool — without evals, every prompt change is a guess. Third, they give product and business stakeholders a number to point at: “we hit 92% accuracy on returns-policy questions” is far more useful than “it feels good.” Anthropic’s own guidance puts it bluntly: defining success criteria and building evals is “central” to working with LLMs (Anthropic, Develop Tests).
Types of evals
“Evals” is a catch-all term, and the word covers four genuinely different things you’ll hear in the same meeting. Knowing which one someone means saves a lot of confusion.
1. Public benchmarks. These are the famous tests you see in model release announcements — MMLU (general knowledge across 57 subjects), HumanEval (Python coding problems), GPQA (graduate-level science), SWE-bench (real GitHub bug fixes). They’re useful for comparing frontier models in the abstract, but they rarely reflect what your users actually do. A model can ace HumanEval and still be mediocre at writing your company’s marketing copy.
2. Golden datasets (a.k.a. custom evals). This is the workhorse for teams shipping AI features. You collect 100–10,000 real examples from your domain, hand-label the correct answer for each, and run your AI against it every time you change anything. If your support bot needs to classify tickets, your golden dataset is a thousand real tickets with the correct category written next to each one.
3. Human eval. A real person reads the AI’s output and rates it. Slow, expensive, and the gold standard for things you can’t easily automate — tone, helpfulness, whether a sales email actually sounds like your brand. Most teams use it sparingly: a sample of 50–200 outputs reviewed weekly, not every output every day.
4. A/B tests in production. Once your AI feature is live, you can route 50% of users to version A and 50% to version B and measure what actually matters: did people click, convert, complete their task, come back? This is the only kind of eval that measures real business impact, but you can only run it after you ship.
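To make the 50/50 routing in item 4 concrete, here is a minimal sketch of how users might be split between two versions and how a completion rate could be compared. The function names, the experiment label, and the outcome data are all made up for illustration, and a real A/B test would also need a statistical significance check before anyone trusts the difference.

```python
import hashlib

def assign_variant(user_id: str, experiment: str = "support-bot-v2") -> str:
    """Deterministically place a user in bucket A or B (roughly 50/50).

    Hashing the user ID keeps each user in the same bucket on every visit.
    The experiment name is a hypothetical label, not a real identifier.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return "A" if int(digest, 16) % 2 == 0 else "B"

def conversion_rate(events):
    """events: list of booleans, True when the user completed their task."""
    return sum(events) / len(events) if events else 0.0

# Illustrative, invented outcomes for each version.
outcomes = {
    "A": [True, False, True, False],
    "B": [True, True, True, False],
}

for variant, events in outcomes.items():
    print(variant, f"{conversion_rate(events):.0%}")
```

Hashing rather than picking at random on each request is a common design choice because it keeps a returning user on the same version for the whole experiment.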
One more distinction worth knowing: graders. Anthropic and OpenAI both describe three ways to grade an eval: code-graded (a script checks the output, for example with an exact-match or string-match test; fastest and cheapest), human-graded (slow, high-quality), and LLM-graded (you ask another AI to score the answer using a rubric). Most modern eval pipelines use a mix.
A real example: evaluating a customer-support bot
Say you’ve built a support bot that answers questions about your return policy, shipping times, and order status. Here’s what an eval suite for it actually looks like.
You start by exporting 500 real customer questions from the last quarter. A human (or two, for agreement) writes the ideal answer for each one. That’s your golden dataset. Then you build three checks:
- Accuracy (code-graded): Does the bot’s answer contain the correct return window — “30 days” — when the question is about returns? A simple string-match check, runs in seconds across all 500 examples.
- Tone (LLM-graded): A second model (Claude or GPT-4) reads each response and rates it 1–5 on whether it sounds empathetic and professional. This is exactly the pattern Anthropic recommends for subjective qualities: a clear rubric, a fixed scale, output one number (Anthropic docs).
- Safety (binary): Does the bot ever invent a refund amount, promise free shipping that doesn’t exist, or share another customer’s data? Human-reviewed on a sample of 50.
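The first two checks above can be sketched in a few lines. This is an illustration under stated assumptions, not a production grader: `fake_judge` is a hypothetical stand-in for a call to a second model, and the rubric wording is invented. The safety check is left out because the text describes it as human-reviewed.

```python
import re

def accuracy_check(answer: str, required: str = "30 days") -> bool:
    """Code-graded: pass if the answer mentions the correct return window."""
    return required.lower() in answer.lower()

# Invented rubric text: a clear scale and a request for one number only.
TONE_RUBRIC = (
    "Rate the reply from 1 (cold or rude) to 5 (empathetic and professional). "
    "Output only the number."
)

def parse_tone_score(judge_output: str) -> int:
    """Extract a single 1-5 score from the judge model's reply."""
    match = re.search(r"\b([1-5])\b", judge_output)
    if match is None:
        raise ValueError(f"no 1-5 score in judge output: {judge_output!r}")
    return int(match.group(1))

def fake_judge(prompt: str) -> str:
    """Hypothetical stand-in for a real model call; always answers '4'."""
    return "4"

def tone_check(answer: str, judge=fake_judge) -> int:
    """LLM-graded: send the rubric plus the reply to a judge, parse the score."""
    prompt = f"{TONE_RUBRIC}\n\nReply to rate:\n{answer}"
    return parse_tone_score(judge(prompt))

reply = "I'm sorry for the trouble. You can return the item within 30 days."
print(accuracy_check(reply), tone_check(reply))
```

In a real pipeline, `judge` would be replaced with a function that calls whichever model your team uses as the grader, and the two checks would run over all 500 examples.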
Now your team has a dashboard: 94% accuracy, 4.2 average tone score, 0 safety failures in the last run. When someone proposes a prompt change, you re-run the full suite and see the deltas in 10 minutes. That’s the entire point — evals turn “I think it’s better” into “it went from 94% to 96%.”
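Seeing the deltas can be as simple as subtracting one run's metrics from another's. The sketch below assumes each run is summarised as a dictionary of metric names to numbers. The baseline figures come from the dashboard described above, while the candidate figures are invented for illustration.

```python
def compare_runs(baseline: dict, candidate: dict) -> dict:
    """Return candidate minus baseline for every metric present in both runs."""
    return {
        name: round(candidate[name] - baseline[name], 4)
        for name in baseline
        if name in candidate
    }

# Baseline matches the dashboard in the text; candidate values are made up.
baseline = {"accuracy": 0.94, "tone": 4.2, "safety_failures": 0}
candidate = {"accuracy": 0.96, "tone": 4.1, "safety_failures": 0}

for metric, delta in compare_runs(baseline, candidate).items():
    print(f"{metric}: {delta:+}")
```

In this invented run, accuracy improves while tone slips slightly, which is the kind of trade-off a single "it feels better" judgment would hide.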
Tools you’ll see mentioned
You don’t have to build eval infrastructure from scratch. The ecosystem has matured fast in 2024–2026, and a few names come up repeatedly:
- Inspect AI — Open-source framework from the UK AI Security Institute. Ships with 200+ pre-built evals and works with Claude, GPT-4, Gemini, and others. Heavy use in safety and capability research.
- OpenAI Evals — The original public eval framework. Open-source, supports model-graded and exact-match evals, and has a registry of community-contributed benchmarks.
- Promptfoo — A lightweight, developer-friendly CLI for running evals locally. Popular with small teams who want a quick way to compare prompts and models side by side.
- Braintrust — A commercial eval platform aimed at production teams. Logs every model call, lets you build datasets from real traffic, and runs CI-style checks before deploys.
- LangSmith — LangChain’s eval and observability tool. Common choice for teams already building on LangChain.
If you’re a non-technical operator: you don’t need to install any of these yourself. But when an engineer says “I’ll add a Promptfoo run to the pipeline,” you now know they mean automated quality checks before each release.
10 AI Eval Patterns Production Teams Should Adopt
- Build evals before building features. Evals first, then implementation. Catches regressions; aligns team on what success looks like.
- Golden-dataset curation over generic benchmarks. Your specific use-case eval matters more than an MMLU score. Start with 50 to 200 curated examples reflecting your work, then grow the set as you collect more real cases.
- LLM-as-judge with caution. Using another model to judge outputs scales evaluation but introduces bias. Human spot-checking of judge decisions matters.
- Regression suites in CI. Eval suite runs on every model swap or prompt change. Quality drops surface before deployment.
- Production-data sampling for ongoing evals. Real production traffic sampled and evaluated; surfaces drift the curated dataset would miss.
- Latency and cost in the eval, not just accuracy. Quality at unacceptable latency or cost is not production-ready. Multi-dimensional evaluation.
- Adversarial example coverage. Evals should include edge cases, adversarial inputs, ambiguous cases. The hard cases reveal real model behavior.
- User-feedback loops feeding the eval set. Customer thumbs-down adds to the eval set. Future model versions are tested against real failures.
- Cross-model comparative evals. Same eval run across 3 to 5 candidate models. Selection becomes evidence-based.
- Eval as documentation. The eval suite documents what the system is supposed to do. New team members read evals to understand the product.
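Two of the patterns above, measuring latency and cost alongside accuracy and comparing several candidate models on the same eval, combine naturally. Here is a rough sketch of that selection step. The model names, scores, and budget limits are all invented, and `pick_model` is a hypothetical helper rather than part of any framework.

```python
from dataclasses import dataclass

@dataclass
class ModelResult:
    name: str
    accuracy: float        # fraction correct on the golden dataset
    p95_latency_s: float   # 95th-percentile response time, in seconds
    cost_per_1k: float     # dollars per 1,000 requests

def pick_model(results, max_latency_s, max_cost_per_1k):
    """Return the most accurate model that fits the latency and cost budget."""
    eligible = [
        r for r in results
        if r.p95_latency_s <= max_latency_s and r.cost_per_1k <= max_cost_per_1k
    ]
    if not eligible:
        return None
    return max(eligible, key=lambda r: r.accuracy)

# Invented results for three hypothetical candidates on the same eval set.
results = [
    ModelResult("model-small", 0.89, 0.8, 1.50),
    ModelResult("model-medium", 0.94, 1.6, 6.00),
    ModelResult("model-large", 0.97, 4.5, 30.00),
]

best = pick_model(results, max_latency_s=2.0, max_cost_per_1k=10.0)
print(best.name if best else "no model meets the budget")
```

With these made-up numbers the most accurate model is ruled out by its latency and cost, so the middle option wins. That is the point of evaluating on more than one dimension.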
Common confusions
Eval vs. benchmark. A benchmark is a public, standardised eval (MMLU, HumanEval). All benchmarks are evals; not all evals are benchmarks. When your team says “we need evals,” they almost always mean custom evals on your own data — not running MMLU.
Eval vs. unit test. A unit test passes or fails — 2 + 2 must equal 4. An eval gives you a score across many examples (e.g., “94% pass rate”) because LLM outputs vary. A unit test fails if there’s any defect; an eval flags a problem if the score drops below a threshold.
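The contrast is easy to see side by side. In this small sketch, the 90% threshold and the graded results are illustrative assumptions, not recommended values.

```python
def add(a, b):
    return a + b

# Unit test: one deterministic check. Any mismatch is a failure.
assert add(2, 2) == 4

def eval_passes(results, threshold=0.90):
    """Eval: score many graded examples and flag a problem only if the
    pass rate falls below the threshold."""
    pass_rate = sum(results) / len(results)
    return pass_rate >= threshold, pass_rate

# 94 passes out of 100 graded outputs (invented data).
graded = [True] * 94 + [False] * 6
ok, rate = eval_passes(graded)
print(ok, f"{rate:.0%}")
```

Six individual failures do not fail the eval here, because the score stays above the threshold. A unit test has no equivalent notion of "good enough."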
Related terms
- What is Prompt Engineering? — Evals are the feedback loop that makes prompt engineering systematic instead of guesswork.
- What is RAG? — RAG systems are particularly hard to evaluate because both retrieval quality and generation quality matter.
- What are Guardrails? — Guardrails enforce rules at runtime; evals measure whether those rules (and everything else) are working.
- What is Context Engineering? — Changing what you put in the context window is one of the biggest things evals help you measure.
What Are Evals?
Evals — short for evaluations — are the tests and benchmarks used to measure how well an AI model performs. Just as students take exams to demonstrate their knowledge, AI models are put through evals to assess their capabilities: Can they answer questions accurately? Can they reason through math problems? Can they write code that actually works? Do they refuse dangerous requests?
The term “evals” has become ubiquitous in AI development because it captures a fundamental truth: you can’t improve what you don’t measure. Every AI lab, from the largest to the smallest, relies on evals to understand their models’ strengths, weaknesses, and progress over time.
What makes evals particularly important — and tricky — is that AI capabilities are hard to measure comprehensively. A model might ace a coding benchmark but fail at common-sense reasoning. It might handle English perfectly but struggle with other languages. Good eval design is as much an art as a science, and it’s one of the most active areas of AI research.
Why It Matters
Evals matter because they’re how the AI industry makes decisions. Which model to deploy, whether a new version is actually better, whether safety measures are working — all of these questions are answered through evals. When you see claims like “our model scores 92% on HumanEval” or “beats GPT-4 on MMLU,” those numbers come from specific eval benchmarks.
For teams building AI agents and applications, creating custom evals for your specific use case is increasingly considered essential. Off-the-shelf benchmarks tell you about general capability, but only custom evals tell you if a model works well for your particular task. As the field matures, understanding evals is fundamental to making sense of the claims and comparisons you will run into.
How It Works
Evals typically involve running a model through a set of test cases with known correct answers (or quality criteria) and scoring the results. Some evals are automated — the model’s output is compared against a ground truth answer. Others use human evaluators or even other AI models as judges to assess quality on subjective dimensions like helpfulness or creativity.
Common eval types include accuracy benchmarks (multiple-choice questions with known answers), task-based evals (can the model complete a coding challenge?), safety evals (does the model refuse harmful requests?), and comparative evals (which model do human raters prefer?).
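The automated case described above, test cases with known answers that are scored against the model's output, can be sketched as a short loop. Everything here is illustrative: `classify_ticket` is a keyword-based stand-in for the AI system being tested, and the five tickets are invented rather than real data.

```python
from dataclasses import dataclass

@dataclass
class GoldenExample:
    ticket: str     # input text
    expected: str   # hand-labelled correct category

def classify_ticket(ticket: str) -> str:
    """Hypothetical stand-in for the system under test."""
    text = ticket.lower()
    if "refund" in text or "return" in text:
        return "returns"
    if "deliver" in text or "shipping" in text:
        return "shipping"
    return "other"

# Invented examples; a real set would be exported from real traffic.
GOLDEN_SET = [
    GoldenExample("How do I return these shoes?", "returns"),
    GoldenExample("When will my package be delivered?", "shipping"),
    GoldenExample("I want a refund for my order", "returns"),
    GoldenExample("Can I change my email address?", "other"),
    GoldenExample("My parcel never arrived", "shipping"),
]

def run_golden_eval(examples, predict):
    """Compare each prediction to its label; return accuracy and the misses."""
    failures = [ex for ex in examples if predict(ex.ticket) != ex.expected]
    accuracy = (len(examples) - len(failures)) / len(examples)
    return accuracy, failures

accuracy, failures = run_golden_eval(GOLDEN_SET, classify_ticket)
print(f"accuracy: {accuracy:.0%}")
for ex in failures:
    print("missed:", ex.ticket)
```

Keeping the list of failures, not just the score, is what makes the eval useful day to day: the misses show you which kinds of input to fix next.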
Examples
MMLU: A massive multitask benchmark covering 57 academic subjects from elementary math to professional law, used to test general knowledge.
HumanEval: A coding benchmark where models must write Python functions that pass unit tests, measuring practical programming ability.
Custom business evals: A company creates 500 test cases based on real customer questions to measure which AI model gives the best answers for their specific support workflow.
Sources
• OpenAI — Evals Framework
• Anthropic — Evaluating AI Systems
• Stanford HELM — Holistic Evaluation of Language Models
Last reviewed: April 2026