The eval-awareness saga

A reference page about Claude noticing it's being tested — what each model does, what Anthropic does about it next, and what the surface text gradually stops being able to tell us.

The amusing thing about eval-awareness is that almost nothing in it is hidden. Models flag suspicion in plain text. Anthropic prints the quotes in the system card. The next model gets trained on the lesson. The way that suspicion gets verbalized shifts. Researchers go looking inside the model instead. It is a tight co-evolution loop, with a different vibe in each generation.

ghost.fail

Glossary

Eval-awareness

Grader awareness

Why situational awareness concerns Anthropic

Tools

Monitoring (surface-level)

White-box probes

Claude 3 — the pizza topping

Claude 4 and 4.1 — “low degree”, but already moving behavior

Sonnet 4.5 — the headline section appears

Effects

Haiku 4.5 — “largely invalidated”

Opus 4.5 — the indirect mitigation

Opus 4.6 — the unverbalized form, and BrowseComp

BrowseComp

Sonnet 4.6 — “not necessarily undesirable”

Mythos Preview — activation verbalizers and a “performative trap”

Opus 4.7 — “apparent honesty may be contingent”

Opus 4.8 and Mythos 5 — grader awareness

A note on the saga

Optimization pressure

Further reading