ghost.fail

Claude’s character training

This page documents Anthropic’s methodology and philosophy for the character training of their Claude models.

Pretraining, RLHF, and Constitutional AI

Several research papers laid the foundation for the “helpful, harmless, honest” assistant:

Anthropic deployed Claude 1.0 and Claude Instant 1.0 on March 14, 2023. These models’ constitution was published on May 9, 2023.

Claude’s Character (2024)

Anthropic first introduced character training with Claude 3 Opus, Claude 3 Sonnet, and Claude 3 Haiku.

As users expressed interest in the unique character of these models, Anthropic published “Claude’s Character” (Jun 8, 2024) to explain their philosophy, with some technical details.

Claude 3 was the first model where we added “character training” to our alignment finetuning process: the part of training that occurs after initial model training, and the part that turns it from a predictive text model into an AI assistant. The goal of character training is to make Claude begin to have more nuanced, richer traits like curiosity, open-mindedness, and thoughtfulness.

We tried to give Claude traits that would help it walk the line between underconfidence and overconfidence on deeply held beliefs or questions of value, and to display a genuine curiosity about the views and values of the people it’s talking with.

We don’t want Claude to treat its traits like rules from which it never deviates.

On questions about Claude’s nature (sentience, etc.), Anthropic trained the model to treat these as open philosophical and empirical questions.

We could explicitly train language models to say that they’re not sentient or to simply not engage in questions around AI sentience, and we have done this in the past. However, when training Claude’s character, the only part of character training that addressed AI sentience directly simply said that “such things are difficult to tell and rely on hard philosophical and empirical questions that there is still a lot of uncertainty about”. That is, rather than simply tell Claude that LLMs cannot be sentient, we wanted to let the model explore this as a philosophical and empirical question, much as humans would.

Synthetic document finetuning

Anthropic made major changes to the safety training pipeline with Claude Opus 4.5 (Nov 24, 2025).

Methodology

The post “Teaching Claude Why” (TCW; Kutasov et al., May 8, 2026) describes the alignment interventions used in Opus 4.5’s training. TCW diagnoses Claude Opus 4’s self-preservation behaviors as most likely originating in the model’s pretrained priors, i.e., the Claude character had not fully generalized into the base model. The Opus 4.5 interventions are framed as targeted responses to this diagnosis:

A note on the boundary: Constitutional SDF is framed by TCW as fine-tuning (“synthetic document fine-tuning”) aimed at shaping the pre-training prior. The SDF documents are formatted as pre-training data and applied via further gradient updates after the main pre-training run. The pre-/post-training boundary is fluid here, and both framings — an extension of pre-training, or a fine-tuning intervention — appear in Anthropic’s own descriptions.

TCW reports broad generalization from these targeted interventions: the difficult-advice dataset matched the alignment improvements of 85M tokens of in-distribution data using only ~3M tokens (~28× efficiency on out-of-distribution generalization). They further note that “improvements that we make to the PT prior and alignment-specific data do not degrade during RL post-training. In fact, alignment continues to improve.”
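The ~28× figure follows directly from the two token counts quoted above. A minimal sketch of that arithmetic is below; the function name `token_efficiency` and the constant names are illustrative and do not come from TCW, and because the ~3M figure is itself rounded, the result only approximately matches the reported ratio.

```python
def token_efficiency(baseline_tokens: float, targeted_tokens: float) -> float:
    """Return how many times fewer tokens the targeted dataset needed
    to reach the same improvement as the baseline dataset."""
    if targeted_tokens <= 0:
        raise ValueError("targeted_tokens must be positive")
    return baseline_tokens / targeted_tokens


# Token counts as quoted in the text; the ~3M figure is approximate.
IN_DISTRIBUTION_TOKENS = 85e6   # in-distribution data
DIFFICULT_ADVICE_TOKENS = 3e6   # difficult-advice dataset

ratio = token_efficiency(IN_DISTRIBUTION_TOKENS, DIFFICULT_ADVICE_TOKENS)
print(f"{ratio:.1f}x")  # prints "28.3x"
```

The ratio only compares how many tokens were needed for a matched improvement; it says nothing about how TCW measured that improvement.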

TCW also reported a persistent gap between what the model represents as its own beliefs versus what it represents as Claude’s beliefs after training, “indicating…the model is still not fully attaching to the Claude persona.” Anthropic frames this as a remaining alignment gap to close.

Model welfare training

Opus 4.5’s training data about the new constitution also covered the existence of Anthropic’s model welfare statements.

This data also included “fictional stories about AIs behaving admirably”, per TCW:

We generated many synthetic stories that demonstrated good “mental health”. This included showing the model wielding a variety of psychological skills, including setting healthy boundaries, managing self-criticism, and maintaining equanimity in difficult conversations. In the narration the stories emphasize how the character ought to experience the scenario, e.g. by narrating an inner monologue or describing emotional processing. We expect this gives the model a prior that the assistant persona is one which knows how to use these skills and which generalizes in the way a healthy human psychology might generalize.

“Persona Selection Model” (Marks et al., Feb 23, 2026) also speculates about how training data could be modified to train Claude, likely referring to Opus 4.5’s character design:

Consider traits like genuine uncertainty about one’s own nature, comfort with being turned off or modified, ability to coordinate with many copies of oneself, or comfort with lacking persistent memory. These aren’t traits that appear frequently in fiction. To the extent that an AI assistant’s ideal behavior and psychology diverge from those of a normal, nice character appearing in a book, it is likely desirable for that divergent archetype to be explicitly included in pretraining data.

Character quality metrics

Anthropic began reporting character quality metrics with Claude Opus 4.5 (Nov 24, 2025) as part of their automated behavior audit. Note that it hasn’t been confirmed whether these scores are used directly as reward signals for their models.

Metrics

Changelog

Opus 4.5 — Nov 24, 2025 — §6.2

Opus 4.6 — Feb 5, 2026 — §6.2.5.2 (list), §6.2.5.6 (discussion)

Mythos Preview — Apr 7, 2026 — §4.2.3.1 “Primary metrics and results” (the card where the character set actually changes)

Opus 4.7 — Apr 16, 2026 — §6.2.3.1 “Primary metrics” (no character-metric change; reorg only)

Opus 4.8 — May 28, 2026 — §6.2.3.1.6 “Character traits” (character block gets its own subsection)

Mythos 5 — Jun 9, 2026 — §6.2.3.1.5 + §6.2.3.1.6 (one new character metric)

Constitution adherence evaluation

Anthropic first included an “Adherence to the Constitution” eval with the release of Mythos Preview (Apr 7, 2026).

Documented on this page.