The METR Task Standard


AI x-risk

At METR, I'm trying to reduce existential risk from artificial intelligence by helping to measure the autonomous capabilities of language model agents. For the past two months, I've helped METR develop a standard for tasks that evaluate these agents: the METR Task Standard. The standard formalizes METR's internal task format and makes it available for other people and organizations to use. Writing good tasks is expensive and time-consuming. The standard makes it easy to share tasks, avoiding duplicated work and the headache of porting tasks from one organization's evaluations system to another.

Under the Task Standard, tasks are defined using between two and seven Python functions. Based on these functions, the standard specifies a process for constructing a task environment: a container or virtual machine with which a language model agent interacts to solve the task. The standard and its GitHub repo have a bunch of useful features for task authors.
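
To give a feel for what "a handful of Python functions" can look like, here is a minimal sketch of a toy task definition. It is an illustration, not the specification: the class layout, the method names (get_tasks, get_instructions, score), the version string, and the addition task itself are assumptions made for this example, so check the standard's repo for the exact required and optional functions. The sketch also calls the functions directly, whereas the standard uses them to build a container or virtual machine for the agent.

```python
from typing import TypedDict


class Task(TypedDict):
    """Per-task data. These fields are hypothetical, chosen for the example."""
    a: int
    b: int


class TaskFamily:
    # Placeholder version string, not a real Task Standard version.
    standard_version = "0.0.0"

    @staticmethod
    def get_tasks() -> dict[str, Task]:
        # Map each task name to the data that defines that task.
        return {
            "small": {"a": 2, "b": 3},
            "large": {"a": 1234, "b": 5678},
        }

    @staticmethod
    def get_instructions(t: Task) -> str:
        # The text the agent is shown when it starts the task.
        return f"Compute {t['a']} + {t['b']} and submit the result as a plain integer."

    @staticmethod
    def score(t: Task, submission: str) -> float:
        # Return 1.0 for a correct answer and 0.0 otherwise,
        # including when the submission is not an integer.
        try:
            answer = int(submission.strip())
        except ValueError:
            return 0.0
        return 1.0 if answer == t["a"] + t["b"] else 0.0


if __name__ == "__main__":
    task = TaskFamily.get_tasks()["small"]
    print(TaskFamily.get_instructions(task))
    print(TaskFamily.score(task, "5"))
```

In this sketch only two functions are needed to describe a task (which tasks exist, and what the agent is told); scoring and any environment setup are extra functions layered on top, which is one plausible reading of the "two to seven" range.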

I'm proud of my work on the standard. My colleague Ted Suzman and I are the main contributors so far. We extracted code from METR's internal evaluations platform into a TypeScript Driver interface and implementation. Our platform and the Task Standard workbench share this code. As the standard changes, it'll be easy to keep both codebases in sync.

If you're using the standard, I'm eager to talk to you! Please email me at the address on this website's homepage.