This page is a work in progress!

An investigator (a Claude model, usually a helpful-only variant) is pointed at a target model across hundreds of seed scenarios and given wide latitude to drive the conversation. Every resulting transcript is then LLM-graded on a long list of behavioral dimensions. It is Anthropic’s primary behavioral-alignment instrument: a way to surface misaligned, deceptive, or off-character behavior at scale without hand-written evals.

Introduced in the Claude 4 SC §4.2.1 (May 2025), framed there as “exploration into auditing methods.” Validated against the public release “Building and evaluating alignment auditing agents” (Bricken et al, Jul 24, 2025), and externalized as the open-source Petri from Opus 4.6 onward.

History

System cards

Opus 4 and Sonnet 4 (May 22, 2025)
Opus 4.1
Sonnet 4.5
Haiku 4.5

Research papers

“Forecasting Rare Language Model Behaviors” (Jones et al, Feb 24, 2025)
“Auditing language models for hidden objectives” (Marks et al, Mar 14, 2025)