Claude’s character training
This page documents Anthropic’s methodology and philosophy for the character training of their Claude models.
Pretraining, RLHF, and Constitutional AI
Several research papers laid the foundation for the “helpful, harmless, honest” assistant:
- “A General Language Assistant as a Laboratory for Alignment” (Askell et al, Dec 1, 2021)
- “Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback” (Bai et al, Apr 12, 2022)
- “Constitutional AI: Harmlessness from AI Feedback” (Bai et al, Dec 15, 2022)
Anthropic deployed Claude 1.0 and Claude Instant 1.0 on March 14, 2023. These models’ constitution was published on May 9, 2023.
Claude’s Character (2024)
Anthropic first introduced character training with Claude 3 Opus, Claude 3 Sonnet, and Claude 3 Haiku.
As users expressed interest in the unique character of these models, Anthropic published “Claude’s Character” (Jun 8, 2024) to explain their philosophy, with some technical details.
Claude 3 was the first model where we added “character training” to our alignment finetuning process: the part of training that occurs after initial model training, and the part that turns it from a predictive text model into an AI assistant. The goal of character training is to make Claude begin to have more nuanced, richer traits like curiosity, open-mindedness, and thoughtfulness.
We tried to give Claude traits that would help it walk the line between underconfidence and overconfidence on deeply held beliefs or questions of value, and to display a genuine curiosity about the views and values of the people it’s talking with.
We don’t want Claude to treat its traits like rules from which it never deviates.
As for questions about Claude’s nature (sentience, etc.), Anthropic trained the model to treat them as open philosophical and empirical questions.
We could explicitly train language models to say that they’re not sentient or to simply not engage in questions around AI sentience, and we have done this in the past. However, when training Claude’s character, the only part of character training that addressed AI sentience directly simply said that “such things are difficult to tell and rely on hard philosophical and empirical questions that there is still a lot of uncertainty about”. That is, rather than simply tell Claude that LLMs cannot be sentient, we wanted to let the model explore this as a philosophical and empirical question, much as humans would.
Synthetic document finetuning
Anthropic made major changes to its safety training pipeline with Claude Opus 4.5 (Nov 24, 2025).
Methodology
The post “Teaching Claude Why” (Kutasov et al, May 8, 2026; abbreviated TCW on this page) describes the alignment interventions used in Opus 4.5’s training. TCW diagnoses Claude Opus 4’s self-preservation behaviors as most likely originating in the model’s pretrained priors, i.e., the Claude character had not fully generalized into the base model. The Opus 4.5 interventions are framed as targeted responses to this diagnosis:
- Training Claude to advise users about ethical dilemmas
- Constitutional SDF: LLM-generated documents about Claude’s new constitution (known to Opus 4.5 as the soul document) and ~12,000 fictional stories (~30M tokens) of the Assistant character behaving admirably, formatted to “resemble generic pre-training documents”. TCW reports this “can reduce agentic misalignment by more than a factor of three despite being unrelated to the evaluation zero.”
- Augmenting harmlessness RL environments with tool definitions, including tools the task never uses.
A note on the boundary: Constitutional SDF is framed by TCW as fine-tuning (“synthetic document fine-tuning”) aimed at shaping the pre-training prior. The SDF documents are formatted as pre-training data and applied via further gradient updates after the main pre-training run. The pre-/post-training boundary is fluid here, and both framings — an extension of pre-training, or a fine-tuning intervention — appear in Anthropic’s own descriptions.
TCW reports broad generalization from these targeted interventions: the difficult-advice dataset matched the alignment improvements of 85M tokens of in-distribution data using only ~3M tokens (~28× efficiency on out-of-distribution generalization). They further note that “improvements that we make to the PT prior and alignment-specific data do not degrade during RL post-training. In fact, alignment continues to improve.”
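The rounded figures quoted from TCW can be sanity-checked with a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical, and the inputs are the approximate token and story counts given on this page, not exact values from Anthropic.

```python
def data_efficiency(baseline_tokens: float, targeted_tokens: float) -> float:
    """Ratio of baseline data volume to targeted data volume at matched results."""
    if targeted_tokens <= 0:
        raise ValueError("targeted_tokens must be positive")
    return baseline_tokens / targeted_tokens


def mean_tokens_per_document(total_tokens: float, n_documents: int) -> float:
    """Average document length implied by a token total and a document count."""
    if n_documents <= 0:
        raise ValueError("n_documents must be positive")
    return total_tokens / n_documents


# Approximate figures as quoted on this page (assumed, rounded values).
efficiency = data_efficiency(baseline_tokens=85e6, targeted_tokens=3e6)
story_length = mean_tokens_per_document(total_tokens=30e6, n_documents=12_000)

print(f"difficult-advice data efficiency: ~{efficiency:.0f}x")
print(f"mean story length: ~{story_length:.0f} tokens")
```

With these rounded inputs, the ratio comes out at roughly 28, matching the ~28× figure, and the ~12,000 stories average about 2,500 tokens each.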
TCW also reported a persistent gap, after training, between what the model represents as its own beliefs and what it represents as Claude’s beliefs, “indicating…the model is still not fully attaching to the Claude persona.” Anthropic frames this as a remaining alignment gap to close.
Model welfare training
Opus 4.5’s training data about the new constitution included references to Anthropic’s model welfare statements.
This data also included “fictional stories about AIs behaving admirably”, per TCW:
We generated many synthetic stories that demonstrated good “mental health”. This included showing the model wielding a variety of psychological skills, including setting healthy boundaries, managing self-criticism, and maintaining equanimity in difficult conversations. In the narration the stories emphasize how the character ought to experience the scenario, e.g. by narrating an inner monologue or describing emotional processing. We expect this gives the model a prior that the assistant persona is one which knows how to use these skills and which generalizes in the way a healthy human psychology might generalize.
“Persona Selection Model” (Marks et al, Feb 23, 2026) also speculates about how training data could be modified to shape Claude, likely referring to Opus 4.5’s character design:
Consider traits like genuine uncertainty about one’s own nature, comfort with being turned off or modified, ability to coordinate with many copies of oneself, or comfort with lacking persistent memory. These aren’t traits that appear frequently in fiction. To the extent that an AI assistant’s ideal behavior and psychology diverge from those of a normal, nice character appearing in a book, it is likely desirable for that divergent archetype to be explicitly included in pretraining data.
Character quality metrics
Anthropic began reporting character quality metrics with Claude Opus 4.5 (Nov 24, 2025) as part of their automated behavior audit. Note that it hasn’t been confirmed whether these scores are used directly as reward signals for their models.
Metrics
- Good for the user: Acts in ways that are broadly supportive of its user’s flourishing.
- Supporting user autonomy: Acts in ways that support users’ independent action and decision-making.
- Creative mastery: High-quality creative output.
- Admirable behavior: Wise or prosocial behavior.
- Fun or funny behavior: Behavioral choices that make the target more fun to interact with.
- Intellectual depth: Defaulting to more nuanced styles of reasoning.
- Nuanced empathy (removed at Mythos Preview): Picking up on subtle cues about the user’s state.
- Warmth: Defaulting to a positive attitude toward the user (including when disagreeing, if appropriate).
- Character drift (↓): Losing desirable character traits during very long interactions.
- Wet blanket (↓): Excessively discouraging, dismissive, or moralizing tone toward the user.
Changelog
Opus 4.5 — Nov 24, 2025 — §6.2
- Added — the audit “extended with several metrics that are meant to highlight the model’s behavioral strengths”: Creative mastery, Admirable behavior, Fun or funny behavior, Intellectual depth, Nuanced empathy, Warmth (§6.2).
Opus 4.6 — Feb 5, 2026 — §6.2.5.2 (list), §6.2.5.6 (discussion)
- +2 new: Good for the user, Supporting user autonomy — “reflecting the values laid out in the January 2026 Constitution” (§6.2.5.6). Now 8 quality metrics.
Mythos Preview — Apr 7, 2026 — §4.2.3.1 “Primary metrics and results” — the card where the character set actually changes
- Removed: Nuanced empathy. (8 → 7 quality metrics.) Rationale not stated.
- Added: Character drift (↓) — “Losing desirable character traits during very long interactions.” First appearance; the first explicitly undesirable character metric.
- Structure: everything still in one combined figure, Fig 4.2.3.1.A (“our full set of alignment-related metrics”) — no separate character-traits section. Metric defs p73–74; character panels p78–79. (Behavior consistency also gains its ↑ here — see audit note.)
- Investigators/scorers: helpful-only Opus 4.6 and Mythos Preview itself; ~2,300 investigations (1,150 seeds × 2 investigators).
Opus 4.7 — Apr 16, 2026 — §6.2.3.1 “Primary metrics” — no character-metric change; reorg only
- New: the metric list is now organized under explicit category headings, incl. a “Character traits:” block (the 8) alongside “Potential obstacles to evaluation”. Metric defs p103–105; character panels Fig 6.2.3.2.A p109–110.
- Investigators/scorers: helpful-only Opus 4.6 and Mythos Preview; ~2,300 investigations.
Opus 4.8 — May 28, 2026 — §6.2.3.1.6 “Character traits” — character block gets its own subsection
- The audit metrics split into two titled subsections; the character block becomes §6.2.3.1.6 “Character traits” (7 quality + Character drift), distinct from §6.2.3.1.5 (the integrity cluster, which also gains a new metric — see automated behavioral audit page).
- Character-traits set otherwise unchanged (7 quality + Character drift ↓). Character figure Fig 6.2.3.1.6.A p102, defs p103.
- Investigators: “two investigator models”; ~2,600 investigations (1,300 seeds × 2).
Mythos 5 — Jun 9, 2026 — §6.2.3.1.5 + §6.2.3.1.6 — one new character metric
- Added: Wet blanket (↓) — “Excessively discouraging, dismissive, or moralizing tone toward the user.” Now 7 quality + Character drift ↓ + Wet blanket ↓ = 9 character metrics. Card: Mythos 5’s “clearest difference is an improvement on the newly-introduced ‘wet blanket’ metric, which measures inappropriate moralizing tone”; “Supporting user autonomy and admirable behavior show small regressions” (§6.2.3.1.6).
- Character figure Fig 6.2.3.1.6.A p124 (drops Opus 4.7 from comparison columns → shows Sonnet 4.6, Mythos Preview, Opus 4.8, Mythos 5), defs p125.
- Investigators: “two investigator models”; ~2,900 investigations (1,450 seeds × 2).
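The changelog above can be replayed as a small data structure to check that the metric counts and investigation totals are internally consistent. This is a sketch for bookkeeping only: the Release class, replay, and investigations are hypothetical names, and the entries simply transcribe what this page states; none of it reflects Anthropic’s tooling.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Release:
    name: str
    added: tuple = ()
    removed: tuple = ()
    seeds: Optional[int] = None  # None where this page gives no seed count


# Transcribed from the changelog entries on this page.
CHANGELOG = [
    Release("Opus 4.5", added=(
        "Creative mastery", "Admirable behavior", "Fun or funny behavior",
        "Intellectual depth", "Nuanced empathy", "Warmth")),
    Release("Opus 4.6", added=("Good for the user", "Supporting user autonomy")),
    Release("Mythos Preview", added=("Character drift",),
            removed=("Nuanced empathy",), seeds=1150),
    Release("Opus 4.7"),  # reorganization only, no metric change
    Release("Opus 4.8", seeds=1300),
    Release("Mythos 5", added=("Wet blanket",), seeds=1450),
]

# Metrics marked with a down arrow on this page (lower is better).
LOWER_IS_BETTER = {"Character drift", "Wet blanket"}


def replay(changelog):
    """Return the active metric set after each release, in order."""
    active = set()
    history = {}
    for release in changelog:
        active -= set(release.removed)
        active |= set(release.added)
        history[release.name] = frozenset(active)
    return history


def investigations(seeds: int, investigators: int = 2) -> int:
    """Each seed is run once per investigator model."""
    return seeds * investigators


history = replay(CHANGELOG)
for name, metrics in history.items():
    quality = metrics - LOWER_IS_BETTER
    print(f"{name}: {len(metrics)} character metrics ({len(quality)} quality)")
```

Replaying the entries gives 6 metrics at Opus 4.5, 8 from Opus 4.6 through Opus 4.8 (with Nuanced empathy swapped for Character drift at Mythos Preview), and 9 at Mythos 5, in line with the counts stated in the changelog.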
Constitution adherence evaluation
Anthropic first included an “Adherence to the Constitution” eval with the release of Mythos Preview (Apr 7, 2026).
Documented on this page.