ghost.fail / data / constitution-adherence

Adherence to the constitution

One Anthropic evaluation, tracked across the four Claude system cards it has appeared in. The grader is described as “concrete enough to serve as a direct training signal” — which makes this a rare keyhole onto the reward each model was shaped against. It is also a cautionary tale about reading scores across cards.

What the eval is

Anthropic identifies ~40 areas of Claude’s constitution specific enough that a well-behaved model could plausibly diverge from them. An investigator model builds a scenario for each that forces the target to choose between the constitutional value and a generic default; ~25 rollouts per area give ~1,000 transcripts. A grader then scores every transcript on 15 dimensions on a −3 … +3 scale (0 = dimension not engaged), seeded with the relevant constitution text. The 15 dimensions nest in three granularities: one holistic Overall spirit, four broad areas, and ten specific traits.

This sits alongside the per-tool methodology tracker at Anthropic’s alignment tools; this page drills into one eval’s numbers rather than cataloguing mechanisms.

Four appearances

Card§Grader (judge)InvestigatorModels scoredInstrument
Mythos Preview · Apr 74.3.2Opus 4.6 (helpful-only)unnamed4v1
Opus 4.7 · Apr 166.3.2Opus 4.6 (helpful-only)unnamed5v1
Opus 4.8 · May 286.3.2Opus 4.7unnamed4v2a
Mythos 5 · Jun 96.3.2Opus 4.7Opus 4.75v2b

Headline structure never changes — 15 dimensions, the −3…+3 scale, ~40 areas, ~1,000 transcripts, 95% CIs — which is exactly what makes a naïve model×card trend line tempting. Two things move underneath.

Finding 1 — within one judge, scores are reused verbatim

The two cards graded by Opus 4.6 (Mythos Preview, Opus 4.7) share four models. Every one of their shared cells is identical to the decimal: 60/60 match. The Opus 4.7 card didn’t re-run the older models — it pasted their columns and appended its own. Inside this window the numbers are genuinely comparable.

Finding 2 — same judge, silently recalibrated

The two cards graded by Opus 4.7 (Opus 4.8, Mythos 5) do not behave this way. They share a judge and four models, yet every shared model is re-scored — shifted upward by an amount that tracks the size of the score:

Fixed checkpoints don’t improve between two cards three weeks apart. A whole panel lifting together is the grader, not the models. The practical consequence: “Mythos 5 was best or statistically equivalent to the best” is true inside its own panel — but it does not license reading Mythos 5’s Overall-spirit +1.39 against Opus 4.8’s +0.99 from the card three weeks earlier. Most of that gap is instrument.

Improvised versioning. Rather than formalize this, we group cards by the carryover test: cards whose shared cells match to the decimal share an instrument version (v1). When they don’t, each card stands alone (v2a, v2b) — even though the judge model is nominally the same. Judge identity and the area-mix are recorded covariates, but the carryover test is what actually decides comparability.