Joe mentioned Petri, the new alignment tool from Anthropic. Reading through the GitHub repo, I was surprised by how simple the code is, and digging deeper I realized it's actually a plugin for Inspect AI, an open-source Python framework from the UK AI Security Institute.

Inspect AI provides the scaffolding for evaluating LLMs, and I was pleasantly surprised to see just how many advanced features are already in place.

All I need to supply are three ingredients, sketched in code after the list:

  1. Datasets — CSV files with id, input, and target columns.
  2. Solvers — chains of Python functions that take the input and produce the model’s output.
  3. Scorers — metrics that turn those outputs into scores. Regular expressions, F1, statistics like std or stderr, and even LLM-as-Judge are ready to go.

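To make those three ingredients concrete, here is a minimal sketch of an Inspect AI task that wires a CSV dataset to a solver chain and a scorer. The task name `trivia_qa` and the file `questions.csv` are placeholders of mine, not anything from the repo; the imports and constructs (`Task`, `csv_dataset`, `chain_of_thought`, `generate`, `match`, `model_graded_qa`) come from the `inspect_ai` package.

```python
# pip install inspect-ai
from inspect_ai import Task, task
from inspect_ai.dataset import csv_dataset
from inspect_ai.solver import chain_of_thought, generate
from inspect_ai.scorer import match


@task
def trivia_qa():
    return Task(
        # Dataset: a CSV with id, input, and target columns (hypothetical file name)
        dataset=csv_dataset("questions.csv"),
        # Solver: a chain of steps that turns each input into a model output
        solver=[chain_of_thought(), generate()],
        # Scorer: a simple string match against the target;
        # swap in model_graded_qa() for an LLM-as-judge setup
        scorer=match(),
    )
```

Running it is then a single CLI call, for example `inspect eval trivia_qa.py --model openai/gpt-4o`, and Inspect reports metrics such as accuracy and stderr over the dataset.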
The tool is well documented, strongly typed, and actively maintained. Next up I want to use it to rewrite some of the benchmarks end-to-end and see how easily I can extend it to support more complex evaluation workflows.