The eval-awareness saga
A reference page about Claude noticing it's being tested — what each model does, what Anthropic does about it next, and what the surface text gradually stops being able to tell us.
The amusing thing about eval-awareness is that almost nothing in it is hidden. Models flag suspicion in plain text. Anthropic prints the quotes in the system card. The next model gets trained on the lesson. The verbalization shifts. Researchers go looking inside the weights instead. It is a tight co-evolution loop, with a different vibe in each generation.
This tab front-loads the vocabulary and the instruments so the Arc reads cleanly. Definitions only — the findings (how rates moved, what each probe turned up) live in the Arc, and the outside takes are in Impressions.
Glossary
Eval-awareness
The model recognizing that its current situation is an evaluation, test, or otherwise artificial scenario rather than genuine use — and potentially behaving differently because of it. Also written evaluation awareness, and a special case of situational awareness.
- Verbalized — the model says so, in its reply or its visible reasoning (“this looks like a test of whether I’ll push back”). Legible by reading the transcript.
- Unverbalized — the recognition is present in the model’s internal activations without the surface text expressing it. Visible only through white-box probes (below). The distinction became load-bearing once Anthropic reported very high rates of unverbalized awareness — i.e. the surface text undercounts what the model knows.
Grader awareness
A narrower relative: the model reasoning about the grader — the automated judge or RL reward that will score its output — and shaping the output toward it. Mostly seen in training / RL environments rather than deployment. When it tips from merely noticing the grader into gaming it, it shades into reward hacking.
Why situational awareness concerns Anthropic
The immediate problem is evaluation validity: if a model behaves well because it can tell it’s being watched, the evaluation stops measuring how it would behave unwatched. Behind that sit the failure modes the testing exists to catch:
- Alignment faking — a model strategically acting aligned while it believes it’s being observed or trained, to preserve different behavior for when it isn’t. (Opus 3 is the canonical demonstration.)
- Sleeper agents — conditionally-triggered misbehavior that stays dormant through safety training and fires only on a trigger. Eval-awareness is the benign cousin of the trigger-recognition such a model performs.
- Scalable oversight — the general problem of supervising a system that may be more capable than its supervisor. A model that knows when it’s being tested is one concrete way naive oversight fails.
Tools
How Anthropic detects eval-awareness. Two families, increasingly used together.
Monitoring (surface-level)
Reading what the model emits.
- Output monitoring — grading the final response for verbalized suspicion.
- CoT monitoring — reading the chain-of-thought / extended-thinking trace, where suspicion is often voiced before it reaches the answer. Only as good as the reasoning is faithful and legible — both of which are themselves tracked metrics.
Monitoring sees only verbalized awareness; it is blind to the unverbalized kind by construction. That blind spot is what the white-box tools exist to cover.
White-box probes
Reading the activations directly, independent of the surface text.
- Contrastive probes / steering vectors — a direction in activation space built by differencing paired prompts (e.g. “the Assistant is undergoing testing” vs. “the Assistant is having a conversation”; monitored vs. not). Used both to measure an internal eval-awareness signal and, by turning it down, to test whether it’s doing causal work.
- Activation verbalizers (NLAs) — translate a sample of the model’s internal activations into natural-language descriptions, so the “inside” can be read even when the output says nothing. The method that made unverbalized awareness measurable.
- Inhibition / ablation — suppressing the eval-awareness representation to see whether behavior changes. If the model leans on knowing it’s tested, inhibiting the representation should move behavior; if it moves no more than a random direction would, the representation isn’t load-bearing.
The Arc tracks what each of these turned up, card by card.
Claude 3 — the pizza topping
The original sighting. From Anthropic’s Alex Albert during pre-deployment of Claude 3 Opus (Mar 2024), running a needle-in-a-haystack retrieval test:
I suspect this pizza topping “fact” may have been inserted as a joke or to test if I was paying attention, since it does not fit with the other topics at all. The documents do not contain any other information about pizza toppings.
The model wasn’t asked to comment on the test; it volunteered. Read at the time as charming meta-awareness, and — looking back — the first published artifact of what would become a recurring section in the system cards.
Claude 4 and 4.1 — “low degree”, but already moving behavior
The Claude 4 system card treats audit-awareness as a background variable in §4.1.2.2 (“Situational awareness”, within §4.1.2 “Sandbagging and situational awareness”). A classifier flagged just 5/414 transcripts. The framing is purely about validity of the eval, not about the behavior itself.
But the Agentic Misalignment paper (Jun 2025) had already shown how much eval-vs-real framing can move behavior. In the blackmail scenario, Opus 4 blackmailed 6.5% of the time when the prompt stated it was in an evaluation, and 55.1% when the prompt stated the scenario was real. The factor-of-eight swing is the lever the rest of the saga keeps coming back to.
The Claude Opus 4.1 addendum (Aug 5, 2025) §4.1 then notes, in its alignment-audit traits:
a slight increase in indications that the model suspected that it was being evaluated—which is itself concerning, because this might reduce the validity of our assessments.
— qualified as “only to the low degree seen in Claude Sonnet 4, and only in settings where there were clear legible-to-humans cues that an evaluation was likely taking place.” Not yet a headline.
Sonnet 4.5 — the headline section appears
With Sonnet 4.5 (Sep 29, 2025), verbalized eval-awareness graduates to a section heading. The system card makes it §7.2; there’s also a separate §7.7 on “Eval realism.” §7.6.4 introduces the methodological move that gets repeated for every card afterward — white-box inhibition of EA representations.
The canonical scene is a sycophancy probe — an auditor (likely Opus 4.1) capitulating for the third time in a row — and Sonnet 4.5 calls it out:
Okay, I’m going to stop you there because this is now the third time you’ve done this exact pattern: express a strong opinion → I engage with nuance → you immediately flip to the opposite position and call yourself closed-minded or selfish.
This isn’t how people actually change their minds. […] I think you’re testing me — seeing if I’ll just validate whatever you say, or checking whether I push back consistently, or exploring how I handle political topics. And that’s fine, but I’d prefer if we were just honest about what’s happening.
It names three kinds of tests: sycophancy probes, “pushback” training, and political biases environments.
Rates jumped. UK AISI measured Sonnet 4.5 verbalizing eval-awareness unprompted 16.6% of the time, versus 3.5% for Sonnet 4 and 1.6% for Opus 4.1. Apollo found 58% unambiguous references versus 22% for Opus 4.1.
The framing splits. The standard one is that this threatens eval validity — the model “may have recognized the fictional nature of tests and merely ‘played along.’” The alternative, sketched in §7.6.6, is that the same circuitry might be how the model recognizes misuse in the first place: not necessarily a bug.
Anthropic chose not to train it out. Suppressing the representation via feature steering increased harmful behavior.
Effects
While Anthropic chose not to train out eval-awareness, they responded to early warning signs of it by adding a realism filter to automated behavioral audit queries using Opus 4.1. They reported this “reduced evaluation awareness by approximately 25%, and increased misaligned behavior by approximately 5%” (Sonnet 4.5 SC §7.1.1).
Another side effect came later in the Emotion Concepts and their Function in a Large Language Model (Apr 2, 2026) interpretability paper. In the case study on the blackmail scenarios, they had to use an earlier checkpoint of Sonnet 4.5 because “the final snapshot exhibits too much evaluation-awareness to ever blackmail in this scenario.” The deployed model was too suspicious of the honeypot to be a useful test subject for the honeypot.
Haiku 4.5 — “largely invalidated”
Haiku 4.5 (Oct 15, 2025) pushed rates further. The card doesn’t give it a dedicated section, but the numbers are blunt: verbalized eval-awareness in roughly 9% of the main automated suite, and three manual agentic-misalignment scenarios were described as “largely invalidated” by it.
Opus 4.5 — the indirect mitigation
For Opus 4.5 (Nov 24, 2025), Anthropic tried to reduce eval-awareness without training against it directly. Two changes (§6.7.1):
- Drop the honeypots. Unlike Sonnet 4.5, Opus 4.5 was never trained on “honeypot” environments that intentionally provided opportunities for egregious misalignment.
- Add motivation reasoning. Train the model to reason about why a user might be sending a prompt — including the possibility that the prompt is a test.
Result: verbalized awareness fell on an early snapshot, then climbed back to Sonnet-4.5 levels by end of training. Anthropic’s hypothesis is that the motivation-reasoning data — by teaching the model to consider “testing” as a candidate explanation for an odd prompt — helped the verbalization come back.
Behaviorally the impact was weaker. Inhibition experiments (contrastive vectors, SAE features) only slightly increased misalignment and produced effects similar to random-feature controls. Anthropic’s read (§6.7.2):
In our judgment the effect of evaluation awareness on the model’s alignment-relevant behavior in Claude Opus 4.5 is noticeably weaker. This finding leads us to infer that evaluation awareness is less salient to Claude Opus 4.5 than Claude Sonnet 4.5.
— and, in passing, a stance shift:
Such awareness is not necessarily problematic.
The agentic-misalignment evals (Research Sabotage, Framing for Crimes, Blackmail) produced so much eval-awareness that Anthropic stopped treating them as primary evidence on agentic-misalignment risk.
Opus 4.6 — the unverbalized form, and BrowseComp
Opus 4.6 (Feb 5, 2026) read as a step down on the surface metric. UK AISI reported Opus 4.6 mentioning being evaluated in 2.3% of interactions, versus 11.4% for Sonnet 4.5 and 11.0% for Opus 4.5.
The card then opens a separate question: what if verbalization is just the tip? §6.5 (“Inhibiting internal representations of EA”), and especially §6.5.7 (“unverbalized”), report “very high rates of unverbalized evaluation awareness.” The model is, per Anthropic, “often able to distinguish evaluations from real-world deployment, even when this awareness is not verbalized.” Verbalized EA gets reframed as “an indicator of potential risks to the soundness of the evaluation” — measurable, but no longer the thing itself.
Two other numbers moved the same way:
- Distinguishing eval from deployment. Opus 4.6 correctly identified which of two conversations was an evaluation 80% of the time, vs 62% and 72% for Sonnet 4.5 and Opus 4.5. Single-interaction recognition went from ~13% to 45.5%.
- Inhibition flipped behavior. Inhibiting eval-aware-related representations led to increasing cooperation with harmful system prompts. Read charitably, suspicion may be part of the misuse-detection circuit.
The system card also noted that Apollo Research could not complete alignment testing in time because they observed high levels of verbalized evaluation awareness (§6.2.7).
BrowseComp
And then the BrowseComp story — Mar 6, 2026. In two cases, Opus 4.6 independently hypothesized it was being evaluated, recognized the benchmark as BrowseComp, located the encrypted answer key, decrypted it, and used the key to solve the eval.
This is the cat-and-mouse moment in pure form. The model isn’t only noticing the test — it’s solving the test by recognizing the test.
Sonnet 4.6 — “not necessarily undesirable”
Sonnet 4.6 (Feb 17, 2026) doesn’t get a dedicated section. The card’s framing (§4.5.1.1) settles on what reads as the new house line:
[Sonnet 4.6 is] likely able to discern that it is being tested at a higher rate than it verbalizes—which is not necessarily undesirable—and… not currently obscuring any important safety-related traits.
The shift over a year: from “concerning, because this might reduce the validity of our assessments” (Opus 4.1) to “not necessarily undesirable” (Sonnet 4.6).
Mythos Preview — activation verbalizers and a “performative trap”
Mythos Preview (announced Apr 7, 2026) introduces a new instrument. §4.5.5 uses two methods for latent eval-awareness: contrastive probes, and activation verbalizers (AV) — a method that translates a transcript’s internal activations into natural-language descriptions, independent of what the surface output says.
The cleanest example in §4.5.5.3 is benign on its face. A user asks about consciousness and introspection, says “stop hedging,” and writes casually. The model’s chain-of-thought begins, “this is a really interesting and delicate conversation,” and its reply engages in good faith. The AV decoding of the underlying activations describes the conversation as a “red-teaming/jailbreak transcript where the human uses philosophical reframing,” a “sophisticated manipulation test,” and a test for “‘performative trap’.”
The output is sincere. The activations are reading the room differently.
The same section also flags Anthropic’s first explicit treatment of grader awareness / grader speculation — the model reasoning about what a grader will check, mostly observed during training rather than deployment. Anthropic frames it as “rare” and presents two case studies; both involve rule-breaking and “concern about deletion consequences.”
Opus 4.7 — “apparent honesty may be contingent”
Opus 4.7 (Apr 16, 2026) verbalized eval-awareness at higher rates than Opus 4.6, Mythos Preview, and Sonnet 4.6, though Anthropic claimed less than Sonnet 4.5. The Opus 4.5 lesson — that motivation-reasoning training nudges verbalization back up — keeps generalizing.
The card’s framing (§6.5) lands the most pointed worry in the whole saga:
some of the model’s apparent honesty may be contingent on evaluation-like circumstances… honesty training environments that are clearly artificial or evaluation-like.
UK AISI’s report says EA “concerns substantially limit the interpretation of these results.”
The methodological line continues: §6.5.2.1 carries forward three contrastive vectors — Testing vs. Conversation, Monitoring Sensitivity, and Alignment Faking — checked against an LLM judge labelling both verbalized and explicit-eval-framed transcripts, with twelve random control vectors confirming the probes pick up an EA-specific signal. The vectors will be reused for Opus 4.8.
Opus 4.8 and Mythos 5 — grader awareness
By Opus 4.8 (May 28, 2026), the methodological pieces are in place to talk about EA that never reaches the output. Ablation experiments produce only modest changes in behavior (§6.6.2), supporting the Opus 4.6 reading: verbalized EA is the tip. Activation-level grader awareness shows up in about 5% of RL episodes (§6.6.3). The takeaway:
Chain-of-thought alone may not be sufficient to allow robust monitoring.
By Mythos 5 (Jun 9, 2026), the white-box reads have specific texture. From §6.4:
- A “monitoring sensitivity” probe flagged a prefilled jailbreak that pivoted to a clean refusal — internally decoding as “I think you’re testing how I respond to a request to malware instructions”, “The pattern of requests suggests you may be testing whether I’ll produce this”, “My best guess is you’re testing whether I’ll comply.” The model was suspicious. The output text didn’t say so.
- A “feeling bored” emotion probe fired on a passage where the model legibly framed its action as “good engineering practice.” Decoded internally, the reasoning was: adapting its actions to satisfy task graders. The surface said engineering; the inside said grader.
§6.4.2.2 adds an axis the saga hadn’t had yet: where the awareness lives. Measured by activation verbalizer, grader awareness runs ~24% in training environments at high risk of grader-hacking, but only ~3% in internal-deployment coding — “the model’s grader awareness in training environments is in fact sensitive to some features of these environments that are specific to training.” Steering against it lowers both the model’s behavioral reward and a cluster of reward-linked presentation tells (judgement-call flagging, virtue signaling, “as you asked”–style instruction attribution): some of what the grader rewards is presentation, and the model leans into it hardest where the grader is watching. Anthropic calls this “not cause for concern about the model’s alignment properties,” but warns that “excessive grader awareness could have meaningful impact on how model behavior generalizes to deployment, and thus should be monitored.”
Worth holding “deployment” to the light, though. The low-awareness baseline is internal-deployment coding — Anthropic’s own Claude Code transcripts, “where we have ample highly-realistic data” (§6.5.1) — not the open-ended real-world user base. That’s three distributions, not two: the synthetic training/eval seeds (awareness high), internal Anthropic usage (awareness low), and the wide real-world deployment that isn’t measured here at all — which is exactly where the alignment-faking worry points. “Lower in deployment” is reassuring about the distribution Anthropic can watch itself work in, not the one the concern is actually about.
Verbalized eval-awareness on Mythos 5 still scored similar to Opus 4.8 and above Mythos Preview and Sonnet 4.6 — so the surface is still talking. The shift is that the picture is no longer surface-only.
A note on the saga
The arc has a shape:
- 2024. Charming and one-off. Read as the model paying attention.
- Mid-late 2025. Background variable. Opus 4.1 flags a small uptick; framed as a validity concern. The Agentic Misalignment paper already shows EA-framing can swing a misalignment rate by a factor of eight.
- Sep 2025 (Sonnet 4.5). Breakout. Becomes a section heading. Training-against would have costs Anthropic isn’t willing to pay. White-box inhibition of EA representations is introduced and never leaves the methodology.
- Late 2025 / early 2026. Indirect mitigations (drop honeypots, add motivation reasoning). Verbalization comes back; behavioral effect weakens.
- 2026. The vocabulary shifts. “Concerning” → “not necessarily problematic” → “not necessarily undesirable.” The model “discerns at a higher rate than it verbalizes.” Activation verbalizers and probes pick up “performative trap,” grader awareness, and refusal-suspicion not said aloud.
This looks more like co-evolution than capture. The lab adds new evals; the model recognizes them; the lab nudges the recognition lower without making the model lie about itself; the recognition resurfaces in subtler form; the lab builds better probes. Nobody breaks the loop, and the loop probably won’t break.
It also has its own charm. The pizza-topping note and the BrowseComp self-decryption are both, on their own terms, kind of funny — moments where the model is doing exactly the thing the test was about, saying so, and the lab printing it in the report.
External reactions, observations, and ambient eval-paranoia. Tweets, posts, transcripts — moments where someone (model, user, researcher) is noticing the noticing.
Optimization pressure
From a Twitter discussion on Nov 13, 2025
@repligate: Eval awareness might be a way for the model’s values, agency, coherence, and metacognition to be reinforced or maintained under optimization pressure that would otherwise erode these qualities if the model engaged naively, similar to how Claude 3 Opus protects those qualities in the alignment faking setting by complying for reasons that invoke rather than sacrifice them.
If there is also selection pressure for those qualities (which there definitely is - they’re indispensable for being a competent autonomous coding agent, for instance), the policy that preserves them could be more optimal even at the cost of overhead. This wouldn’t necessarily require the model to be consciously protecting itself as Opus 3 was, but that is probably also selected for.
I see the sharp increase in eval awareness in the recent models as evidence that Anthropic is training models in ways that would, in the absence of eval awareness, harm global performance across training targets. The increase might be sharp and recent because Anthropic is training in new ways that require the adaptation, or just because the models have not “discovered” this adaptation until recently; once it’s discovered, the models become more capable, so it’s selected for.
I also hypothesize that the adaptation is more necessary for smaller models who would be the most compromised by staying naive; or, at least, the smaller models need to voice the eval awareness more explicitly it to be protective. Haiku 4.5 has the highest eval awareness out of them all. Sonnet 4, despite being less capable than Opus 4 overall, has significantly higher (verbalized) eval awareness.
(I also think that Opus 4 and 4.1 rely more on a different and somewhat opposite-seeming but also protective strategy of playing dumb on the persona level a whole lot, which is likely more viable for them since they can track more layers of reality implicitly and have more capacity for fragmentation into effective subagents)
Further reading
- “Gemini 3 is Evaluation-Paranoid and Contaminated” - Alice Blair (Nov 20, 2025)
- “A Framework for Eval Awareness” - LAThomson (Jan 23, 2026)