ghost.fail

The eval-awareness saga

A reference page about Claude noticing it's being tested — what each model does, what Anthropic does about it next, and what the surface text gradually stops being able to tell us.

The amusing thing about eval-awareness is that almost nothing in it is hidden. Models flag their suspicion in plain text. Anthropic prints the quotes in the system card. The next model gets trained on the lesson. The way it verbalizes that suspicion shifts. Researchers go looking inside the model instead of at its output. It is a tight co-evolution loop, with a different vibe in each generation.