How to Evaluate LLMs

Dennis Meier
4 min read · Jan 1, 2024

Evaluating LLMs is hard. During training, LLMs are optimized for how well they predict the next word, and that’s what they are in fact good at. Unfortunately, in the real world, you don’t care (only) about the “correctness” of the next word, but about more complex, higher-order objectives. Such objectives are much harder to evaluate.

Machine learning algorithms can be tested by running them against a test set and computing metrics. While classical predictive models can be evaluated with more or less simple metrics (e.g. Mean Squared Error), what would be similarly useful metrics for LLMs?

Ideally we would evaluate with metrics like “quality of the answer on a scale from 1–10” or “logical correctness of the answer in %”. But no quick, easily calculated formula exists for such metrics. So what gives?

The main idea behind efficient evaluation, unit tests and automated metrics is to get more feedback, quicker. Of course, you can always employ real human beings to rate the results. But that’s slow and expensive. We want to be able to test hundreds or thousands of model versions quickly and compare them against each other.

So how do we get there? Unfortunately, there’s no silver bullet (yet) for LLMs, but there are a few “levels” to beat on the way:

Level 0: Do it yourself

Do a lot of evaluation yourself, a.k.a. “look at data, inputs & outputs, until your eyes bleed”. Start observing common behaviors (good and bad) that you could somehow measure automatically.

Start building a first benchmark of such cases, covering both bad behaviors (e.g. common failure modes to use in Level 1) and good behaviors (which you can add to your fine-tuning dataset or use as references in Level 1.5).

Evaluate each feature separately and, within each feature, break your product down into different “scenarios”. Generate inputs for the different scenarios synthetically (see the sketch below) as well as collecting real examples. This applies to the other levels as well.
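For example, a handful of scenario templates can already produce a useful synthetic input set. A minimal sketch, where the scenario names, templates and parameters are purely illustrative and would need to be replaced with your own product’s features:

```python
import itertools
import json

# Hypothetical scenarios and prompt templates (assumptions, not from any real product).
SCENARIOS = {
    "refund_request": "I bought {item} {days} days ago and want my money back.",
    "order_status": "Where is my order for {item}? It has been {days} days.",
}

ITEMS = ["a laptop", "running shoes", "a coffee machine"]
DAYS = [2, 14, 45]

def generate_inputs():
    """Expand every scenario template with every parameter combination."""
    cases = []
    for scenario, template in SCENARIOS.items():
        for item, days in itertools.product(ITEMS, DAYS):
            cases.append({"scenario": scenario,
                          "input": template.format(item=item, days=days)})
    return cases

if __name__ == "__main__":
    print(json.dumps(generate_inputs()[:3], indent=2))
```

Mixing these synthetic cases with collected real examples gives you coverage per scenario rather than one undifferentiated pile of prompts.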

Level 1: Catch common, simple failure modes

Test on narrow-domain evaluation functions, e.g. elementary mathematics, history, computer science, law (a small example follows this list):

  • Automatically generated simple math problems
  • MBPP for Python Programming
  • Existing multiple-choice tests that are not part of the training set
  • Compare perplexity scores on a set of pre-written expert answers.
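As an illustration of the first point, simple math problems can be generated and scored in a few lines. `call_model` is a placeholder for however you query your LLM, and the lenient substring check is just one possible scoring choice:

```python
import random

def make_math_case(rng: random.Random) -> dict:
    """Generate a simple arithmetic question with a known correct answer."""
    a, b = rng.randint(2, 99), rng.randint(2, 99)
    return {"question": f"What is {a} + {b}?", "answer": str(a + b)}

def score_math(model_answer: str, expected: str) -> bool:
    """Count the answer as correct if the expected number appears in the output."""
    return expected in model_answer

if __name__ == "__main__":
    rng = random.Random(0)
    cases = [make_math_case(rng) for _ in range(100)]
    # Hypothetical usage once you have a model client:
    # accuracy = sum(score_math(call_model(c["question"]), c["answer"])
    #                for c in cases) / len(cases)
    print(cases[0])
```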

Start to automatically evaluate simple failure modes, e.g.:

  • invalid JSON outputs
  • outputting User IDs
  • using “forbidden words”
  • using the right tools at the right moment (calculator, search)
  • etc.

Think in terms of things that are cheap to evaluate.
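A minimal sketch of such cheap, deterministic checks might look like the following; the forbidden-word list and the user ID pattern are made-up examples and would need to match your own conventions:

```python
import json
import re

FORBIDDEN_WORDS = {"guarantee", "lawsuit"}          # assumption: your own blocklist
USER_ID_PATTERN = re.compile(r"\buser_\d+\b")       # assumption: internal ID format

def check_output(text: str) -> dict:
    """Run cheap, deterministic checks on a single model output."""
    checks = {}
    try:
        json.loads(text)
        checks["valid_json"] = True
    except ValueError:
        checks["valid_json"] = False
    checks["leaks_user_id"] = bool(USER_ID_PATTERN.search(text))
    checks["uses_forbidden_word"] = any(w in text.lower() for w in FORBIDDEN_WORDS)
    return checks

if __name__ == "__main__":
    print(check_output('{"answer": "Your refund is on its way, user_4711."}'))
```

Checks like these can run on every output in production at essentially no cost.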

Level 2: Human & synthetic evaluation, correlated

Ask a human: Is this output helpful? (binary). Evaluate only the end result (end-to-end). Or collect comparison labels, where the human evaluator simply has to pick the better of two answers (Arena-style Elo rating). Consider letting human evaluators edit the final output and save that into the fine-tuning dataset.
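For the pairwise comparisons, a standard Elo update is enough to turn “A beat B” votes into a ranking. This is a generic sketch of the Elo formula, not tied to any particular arena implementation:

```python
def elo_update(r_a: float, r_b: float, a_wins: bool, k: float = 32.0):
    """Update two model ratings after a human picks the better of two answers."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    score_a = 1.0 if a_wins else 0.0
    r_a_new = r_a + k * (score_a - expected_a)
    r_b_new = r_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return r_a_new, r_b_new

# Example: model A (rated 1000) beats model B (rated 1020) in one comparison.
print(elo_update(1000, 1020, a_wins=True))
```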

Once you have collected evaluations from human experts, you can let LLM judges do the same thing and then calculate an “agreement rating” between the human expert and the synthetic evaluation (e.g. the LLM judge).

LLM judges could be, e.g., GPT-4 grading weaker models via pairwise comparison, single-answer grading or reference-guided grading.
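The agreement rating itself can be as simple as the fraction of examples where the judge and the human expert give the same label. A minimal sketch, assuming binary “helpful” labels (the example lists are made up):

```python
def agreement_rate(human_labels, judge_labels):
    """Fraction of examples where the LLM judge agrees with the human expert."""
    assert len(human_labels) == len(judge_labels)
    matches = sum(h == j for h, j in zip(human_labels, judge_labels))
    return matches / len(human_labels)

# Example with binary "helpful" labels; replace with your own eval results.
human = [1, 1, 0, 1, 0, 0, 1, 1]
judge = [1, 0, 0, 1, 0, 1, 1, 1]
print(f"agreement: {agreement_rate(human, judge):.0%}")  # 75%
```

If the agreement is high enough, the LLM judge can stand in for the human experts between calibration rounds.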

In general, try to simplify the evaluation process for the human evaluators as much as possible. This applies also to the next level.

Basic fine-tuning takes around 1,000 examples, advanced fine-tuning around 10,000. If you regularly evaluate against those examples with human users, you can track how well your automatic metrics correlate with the human feedback.
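One simple way to track that is a Pearson correlation between an automatic metric and the human ratings on the same outputs. The numbers below are invented for illustration:

```python
from statistics import correlation  # Python 3.10+

# Hypothetical paired scores: an automatic metric vs. human 1-10 ratings
# collected on the same outputs.
auto_metric = [0.91, 0.45, 0.78, 0.30, 0.66, 0.88]
human_score = [9, 4, 7, 3, 6, 8]

# Pearson correlation: the closer to 1.0, the better the automatic metric
# tracks what your human evaluators actually think.
print(f"correlation: {correlation(auto_metric, human_score):.2f}")
```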

Level 3: Use implicit and explicit user feedback

Ultimately, you want to get a sense of the quality of your model’s answers from the day-to-day output it produces and how users perceive it.

Evaluation feedback can be very simple, e.g. a “like” or a binary label (“good answer” vs. “bad answer”). On the other end of the spectrum, user feedback can get very complex, up to the user writing the “correct” answer themselves on the spot, which can then be added as a data point to the fine-tuning dataset.
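A sketch of what collecting such feedback could look like, using an in-memory SQLite table as a stand-in for your real storage; the schema and field names are assumptions:

```python
import datetime as dt
import sqlite3

# Minimal feedback store: one row per answer, with an optional binary rating
# and an optional user-provided correction (a future fine-tuning example).
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE feedback (
    ts TEXT, answer_id TEXT, rating INTEGER, correction TEXT)""")

def record_feedback(answer_id, rating=None, correction=None):
    conn.execute("INSERT INTO feedback VALUES (?, ?, ?, ?)",
                 (dt.datetime.now(dt.timezone.utc).isoformat(),
                  answer_id, rating, correction))

record_feedback("a1", rating=1)
record_feedback("a2", rating=0, correction="The correct delivery time is 3 days.")

like_rate = conn.execute(
    "SELECT AVG(rating) FROM feedback WHERE rating IS NOT NULL").fetchone()[0]
print(f"like rate: {like_rate:.0%}")
```

The corrections collected this way feed straight back into the fine-tuning dataset from Level 2.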

Final boss: Continuous evaluation

In the end, you should run your evaluations on a regular schedule throughout your project, e.g.

  • Level 0: during development and testing, but especially before you release new features
  • Level 1: in production, all the time
  • Level 2: collect human expert feedback intensively before you release new features and regularly afterwards (e.g. every month). Use this to calibrate your synthetic evaluation
  • Level 3: in production, all the time

You can bring it all together on a dashboard to visualize the results of your continuous evaluation efforts. This allows you to keep quality high across the board, even when introducing new features or testing completely new variants (e.g. switching from GPT-4 to an open-source model for a test).
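Feeding such a dashboard can start as simply as pivoting evaluation runs into one row per model version; the version names and metrics below are invented for illustration:

```python
from collections import defaultdict

# Hypothetical evaluation runs: (model_version, metric, value).
runs = [
    ("gpt-4", "valid_json_rate", 0.99),
    ("gpt-4", "judge_agreement", 0.82),
    ("open-source-v1", "valid_json_rate", 0.93),
    ("open-source-v1", "judge_agreement", 0.74),
]

# Pivot into one row per model version so different variants can be
# compared side by side on a dashboard.
table = defaultdict(dict)
for version, metric, value in runs:
    table[version][metric] = value

for version, metrics in sorted(table.items()):
    print(version, metrics)
```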

Inspirational Example from LLMon: https://www.giskard.ai/llmon
