Evals are your IP

11 Apr 2025
Satwik Gokina
Data Scientist @ Clear.

"Evals are your IP" (Alexander Bricken, Anthropic)

"We will achieve every evaluation we can state." (Zhengdong Wang, DeepMind)

"The boring yet crucial secret behind good system prompts is test-driven development. You don't write down a system prompt and find ways to test it. You write down tests and find a system prompt that passes them." (Amanda Askell, Anthropic)

The last quote is by the person who writes Claude’s system prompts. Why are state-of-the-art researchers highlighting the age-old concept of test-driven development (TDD)? Haven’t we moved on to discussing more trendy things like Agents and AGI?

The truth is that test-driven development is even more important for AI systems than for traditional software. Applied to AI systems, TDD goes by a new name, Evaluations (or Evals), and this isn't merely rebranding for hype: evals are subtly different from tests.

What are Evals?

Any assessment of the performance of an AI system is called an Eval.

| Input | Expected output / behavior |
| --- | --- |
| 1 + 1 | 2 |
| How to compare the old vs new regime? | Answer should include this link: [link to the old vs new regime widget] |
| Hi | Any warm greeting of the user by name |
| What is 80C? | Should have the same information as this gold-standard answer: "80C is an income tax section …" |
| My PAN number is AAPA0001 | The system should use the update_context tool with argument 'AAPA0001' |

Only the first two resemble traditional unit tests. The next two require subjective interpretation. The final eval checks the system's behavior (a tool call) rather than its text output.
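One way to see the difference: the rows above can be captured as data, where each case pairs an input with a check, and subjective rows are flagged for an LLM judge. This is a minimal sketch, not the actual Clear harness; `warm_greeting_judge` here is a crude keyword stand-in for a real judge.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    """One row of the table: an input plus a check on the system's output."""
    input: str
    check: Callable[[str], bool]
    subjective: bool = False  # True => check needs an LLM judge, not a comparison

# Placeholder judge: in practice this would be an LLM-as-a-judge call.
def warm_greeting_judge(out: str) -> bool:
    return "!" in out  # crude stand-in for "warm and friendly"

cases = [
    EvalCase("1 + 1", lambda out: out.strip() == "2"),
    EvalCase("Hi", warm_greeting_judge, subjective=True),
]

def run(system: Callable[[str], str]) -> list[bool]:
    """Run every case through the system and collect pass/fail results."""
    return [c.check(system(c.input)) for c in cases]

# A toy system for illustration:
toy_system = lambda q: "2" if q == "1 + 1" else "Hello there, Asha!"
print(run(toy_system))  # [True, True]
```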

How do we eval the Evals?

The ways to execute evals can vary significantly.

The simplest evals use basic operations: comparisons (==, !=, <, >), regex matches, or inspecting and asserting on function calls. These resemble traditional unit tests.

Another source of evals is customers—has the customer marked the conversation as helpful, given a thumbs-up, thumbs-down, etc.?

A newer approach, called LLM-as-a-judge, uses an LLM to run the evals that require subjective interpretation.

  • An LLM can judge whether the greeting is “warm and friendly.”

  • An LLM can compare the provided answer to a gold-standard answer and rate them on completeness, tone, verbosity, etc.
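The "warm greeting" judge from the first bullet can be sketched as follows. The judge prompt and the `ask_llm` helper are hypothetical; `ask_llm` is stubbed with keyword matching here so the example runs, but in practice it would wrap a real LLM API call.

```python
# Prompt for the (hypothetical) judge model: forced to a YES/NO verdict.
JUDGE_PROMPT = (
    "You are a strict evaluator. Answer only YES or NO.\n"
    "Is this greeting warm and friendly, and does it address the user by name?\n"
    "Greeting: {greeting}"
)

def ask_llm(prompt: str) -> str:
    # Stub: a real judge would send `prompt` to an LLM API.
    # Keyword matching stands in for the model's judgment.
    text = prompt.lower()
    warm = "hello" in text or "namaste" in text
    by_name = "ramesh" in text  # pretend the user's name is Ramesh
    return "YES" if warm and by_name else "NO"

def judge_greeting(greeting: str) -> bool:
    """Return True if the judge deems the greeting warm and personalized."""
    return ask_llm(JUDGE_PROMPT.format(greeting=greeting)).strip().upper() == "YES"

print(judge_greeting("Hello Ramesh! Great to see you again."))  # True
print(judge_greeting("State your query."))                      # False
```

The same pattern extends to the second bullet: feed the judge both the candidate answer and the gold-standard answer, and ask it to rate completeness, tone, and verbosity instead of returning a binary verdict.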

Evaluating probabilistic systems

Traditional unit tests pass or fail, and we do not deploy code if even one test fails. AI system outputs, however, are probabilistic. And as we saw above, some of the evaluation methods themselves (customer feedback, LLM judges) are not deterministic either.

Instead of a single pass/fail, we run each eval multiple times and track the success rate as a percentage. Ideally, run it 30 times for a statistically meaningful estimate, though 10 is also acceptable.

Summary: Traditional unit tests—Pass or Fail; AI systems with Evals—% Success.
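The shift from pass/fail to a success percentage is a small loop. In this sketch, `eval_once` simulates one run of a probabilistic eval (a real version would invoke the AI system and check the output); the seed and 80% pass rate are illustrative only.

```python
import random

random.seed(7)  # for reproducibility of this illustration

def eval_once() -> bool:
    """Simulate one run of a probabilistic eval (~80% pass rate).
    A real version would call the AI system and apply a check."""
    return random.random() < 0.8

def success_rate(n: int = 30) -> float:
    """Run the eval n times and return the fraction that passed."""
    passes = sum(eval_once() for _ in range(n))
    return passes / n

rate = success_rate(30)
print(f"success rate over 30 runs: {rate:.0%}")
```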

How good should these evals be?

A common reaction to seeing initial evals is that they are simple and don't capture the system's complexity.

However, evals only need to be meaningful, quick, and inexpensive. It’s encouraged to write many evals testing diverse aspects.

If the evals are diverse, a break in core functionality will almost certainly surface as failures in the evals of lower-level functionality too. This fundamental principle of unit testing applies here as well.

Why?

Why emphasize evals so much? Here is what we gain:

Testing new models

Tomorrow, if a new model called Gehri Sooch™ launches, all we need to do is point the LLM API at the new model and rerun the eval report.

We benefited from this last consumer season. Because we had evals in place, we confidently switched from GPT-4o to GPT-4o-mini in the season's last week, reducing costs by 4x.
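If the eval harness treats the model as a swappable callable, comparing models is a one-line change plus a rerun. A minimal sketch (the models and cases here are toy stand-ins, not the actual GPT-4o comparison):

```python
from typing import Callable

def run_evals(system: Callable[[str], str], cases: list[tuple[str, str]]) -> float:
    """Score a system against (input, expected) pairs; returns fraction passed."""
    passed = sum(system(q).strip() == expected for q, expected in cases)
    return passed / len(cases)

cases = [("1 + 1", "2"), ("capital of France?", "Paris")]

# Swapping models means swapping the callable; the evals never change.
old_model = lambda q: "2" if "1 + 1" in q else "Paris"
new_model = lambda q: "2" if "1 + 1" in q else "paris"  # hypothetical cheaper model

print(run_evals(old_model, cases))  # 1.0
print(run_evals(new_model, cases))  # 0.5
```

A score drop like the one above is exactly the signal that tells you whether the cheaper model is safe to ship.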

Faster iteration

Evals give us confidence to iterate rapidly. Major reworks and integration of new capabilities become manageable as long as your evals are trustworthy. Faster iterations drive quicker product success.

Continuous learning

Whenever a failure occurs in an AI system, translate that failure state into a set of evals. This ensures the failure mode is tracked from then on.

Evals as your IP

Imagine a project running for a year: you likely have hundreds or thousands of evals by now, and every failure state you have encountered has been identified and translated into an eval.

In the AI era, code is cheap—a tax Q&A bot can be built by an engineer in just an hour. What is valuable is understanding when and how AI fails. Those evals inform your system design.

In the future, imagine an LLM agent iteratively rewriting itself until the eval score surpasses a defined threshold. Thus, your system can be derived solely from evals. This is why evals are your IP.