Joe mentioned Petri, the new alignment tool from Anthropic. Reading through the GitHub repo, I was surprised by how simple the code is, and digging deeper I realized it's actually a plugin for Inspect AI, an open-source Python framework from the UK AI Security Institute.

Inspect AI provides the scaffolding for evaluating LLMs, and I was pleasantly surprised to see just how many advanced features are already in place.

All I need to supply are three ingredients, sketched in code after the list:

  1. Datasets — CSV files with id, input, and target columns.
  2. Solvers — chains of Python functions that take the input and produce the model’s output.
  3. Scorers — metrics that turn those outputs into scores. Regular expressions, F1, statistics like std or stderr, and even LLM-as-Judge are ready to go.

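To make those three ingredients concrete, here is a minimal sketch of an Inspect AI task that wires a CSV dataset to a solver chain and a scorer. The task name `trivia_qa` and the file `questions.csv` are placeholders of mine, not anything from the repo; the imports and constructs (`Task`, `csv_dataset`, `chain_of_thought`, `generate`, `match`, `model_graded_qa`) come from the `inspect_ai` package.

```python
# pip install inspect-ai
from inspect_ai import Task, task
from inspect_ai.dataset import csv_dataset
from inspect_ai.solver import chain_of_thought, generate
from inspect_ai.scorer import match


@task
def trivia_qa():
    return Task(
        # Dataset: a CSV with id, input, and target columns (hypothetical file name)
        dataset=csv_dataset("questions.csv"),
        # Solver: a chain of steps that turns each input into a model output
        solver=[chain_of_thought(), generate()],
        # Scorer: a simple string match against the target;
        # swap in model_graded_qa() for an LLM-as-judge setup
        scorer=match(),
    )
```

Running it is then a single CLI call, for example `inspect eval trivia_qa.py --model openai/gpt-4o`, and Inspect reports metrics such as accuracy and stderr over the dataset.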
The tool is well documented, strongly typed, and actively maintained. Next up I want to use it to rewrite some of the benchmarks end-to-end and see how easily I can extend it to support more complex evaluation workflows.