Inspect AI
Joe mentioned the new Petri alignment tool from Anthropic. Reading through the GitHub repo, I was surprised by how simple the code is; digging deeper, I realized it’s actually a plugin for Inspect AI, an open-source Python project from the UK AI Security Institute.
Inspect AI provides scaffolding for evaluating LLMs. I was pleasantly surprised to see just how many advanced features are already in place:
- Execution: provider caching; parallelism across models, benchmarks, and requests; and sandboxing via Docker, EC2, or Proxmox
- Debugging: tracing and tidy log viewers
- Output: built-in structured-output validation via Pydantic and a typed store
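As a minimal sketch of the execution side, this is roughly what running one task against several models in parallel looks like from Python (the task file path and model names are my own placeholders, not from Inspect’s docs):

```python
from inspect_ai import eval

# Run one task file against several models at once; Inspect parallelizes
# across models and across samples automatically.
logs = eval(
    "benchmarks/qa.py",  # any Inspect task file or Task object
    model=[
        "openai/gpt-4o-mini",
        "anthropic/claude-3-5-haiku-latest",
    ],
    max_connections=20,  # cap on concurrent requests per model
)
```

The equivalent CLI run is `inspect eval benchmarks/qa.py --model openai/gpt-4o-mini`, and `inspect view` opens the log viewer over the results.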
All I need to supply are three ingredients:
- Datasets: CSV files with `id`, `input`, and `target` columns
- Solvers: chains of Python functions that take the input and produce the model’s output
- Scorers: metrics that turn those outputs into scores; regular expressions, F1, statistics like `std` or `stderr`, and even LLM-as-Judge are ready to go
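To make the three ingredients concrete, here is a minimal sketch of a complete task, assuming a `qa.csv` file with those columns (the task name and system prompt are hypothetical; `csv_dataset`, `system_message`, `generate`, and `match` are stock Inspect AI components):

```python
from inspect_ai import Task, task
from inspect_ai.dataset import csv_dataset
from inspect_ai.scorer import match
from inspect_ai.solver import generate, system_message

@task
def qa_benchmark():
    return Task(
        # Dataset: reads id/input/target columns into Samples
        dataset=csv_dataset("qa.csv"),
        # Solver: a chain that sets a system prompt, then calls the model,
        # with provider caching enabled for repeat runs
        solver=[
            system_message("Answer with a single word."),
            generate(cache=True),
        ],
        # Scorer: checks the model output against each sample's target
        scorer=match(),
    )
```

Swapping `match()` for something like `model_graded_qa()` turns the same task into an LLM-as-Judge evaluation without touching the dataset or solver.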
The tool is well documented, strongly typed, and actively maintained. Next up I want to use it to rewrite some of the benchmarks end-to-end and see how easily I can extend it to support more complex evaluation workflows.