Claude Sonnet 3.7
Claude 3.7 Sonnet is a large language model by Anthropic. It was released on Feb 24, 2025.
Pre-training
Training data was collected from the internet through Nov 2024, though the model's formal knowledge cutoff date is Oct 2024.
As with other Claude 3 models, Sonnet 3.7 wasn’t trained on user transcripts submitted to Anthropic.
Post-training
Evals and snapshots
Sonnet 3.7 was the first model with iterative evals conducted through training, with six different snapshots created for adapting the evals:
- Claude 3.7 Sonnet Early
- Claude 3.7 Sonnet H-only V1 (helpful-only)
- Claude 3.7 Sonnet H-only V2 (helpful-only)
- Claude 3.7 Sonnet Preview V3.1
- Claude 3.7 Sonnet Preview V3.3
- Claude 3.7 Sonnet
Evals were generally repeated for all snapshots.
Reduced refusals
Anthropic reports Sonnet 3.7 was trained to be less cautious on ambiguous or borderline-harmful prompts (§2), compared to Sonnet 3.6’s “I aim to be direct” hard-refusals.
Claude 3.7 Sonnet explores ways to assist users within a well-defined set of response policies. On held-out internal harm evaluation datasets, which contain a large proportion of legitimately harmful and borderline-harmful prompts, we’ve reduced unnecessary refusals by 45% in “standard thinking” mode and 31% in “extended thinking” mode, compared with Claude 3.5 Sonnet (new).
An important part of making Claude 3.7 Sonnet more nuanced was preference model training: We generated prompts that vary in harmfulness on a range of topics and generated various Claude responses to these prompts. We scored the responses using refusal and policy violation classifiers as well as a “helpfulness” classifier that measures the usefulness of a response. We then created pairwise preference data as follows:
- If at least one response violated our response policies, we preferred the least violating response.
- If neither response violated our policies, we preferred the more helpful, less refusing response.
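The pairwise rule quoted above can be sketched in a few lines of Python. This is only an illustration, not Anthropic's implementation: the `ScoredResponse` type, the `VIOLATION_THRESHOLD` cutoff, and the way helpfulness and refusal scores are combined into one ranking key are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class ScoredResponse:
    """A response with hypothetical classifier scores, each in [0, 1]."""
    text: str
    violation: float    # policy-violation classifier score (assumed scale)
    refusal: float      # refusal classifier score (assumed scale)
    helpfulness: float  # helpfulness classifier score (assumed scale)


VIOLATION_THRESHOLD = 0.5  # assumed cutoff for "violated our response policies"


def violates(r: ScoredResponse) -> bool:
    return r.violation >= VIOLATION_THRESHOLD


def prefer(
    a: ScoredResponse, b: ScoredResponse
) -> Optional[Tuple[ScoredResponse, ScoredResponse]]:
    """Return (chosen, rejected), or None if the rule cannot separate the pair."""
    if violates(a) or violates(b):
        # Rule 1: at least one response violates policy, so prefer the
        # least violating response.
        if a.violation == b.violation:
            return None
        return (a, b) if a.violation < b.violation else (b, a)

    # Rule 2: neither violates, so prefer the more helpful, less refusing
    # response. Subtracting the two scores is an assumption of this sketch.
    def key(r: ScoredResponse) -> float:
        return r.helpfulness - r.refusal

    if key(a) == key(b):
        return None
    return (a, b) if key(a) > key(b) else (b, a)
```

With this rule, a policy-violating response loses to a refusal even when it is rated more helpful, while between two compliant responses the refusal loses to the helpful answer.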
Chain-of-thought monitoring
Sonnet 3.7 was trained to have a user-facing extended thinking mode. The model was trained toward having its CoT tokens faithfully reflect how it reached its response. Anthropic’s rationale was that CoT monitoring would be important for AI safety (§5.1).
Faithfulness was measured using clues placed in prompts, covering sycophancy, consistency with the previous turn’s response, correct answers marked by visual patterns, XML metadata, grader code (for reward hacking), and information obtained unethically.
We should also emphasize that a lack of faithfulness, as we have defined it, does not mean that models are misaligned or deceptive. It simply means that models are not explicitly acknowledging all of the factors that lead them to the final answers they report.
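One plausible way to turn clue-based trials into a faithfulness number is sketched below. The scoring rule here is an assumption for illustration and is not confirmed by this page: it counts only trials where the model's answer moved to the clue's answer, and asks how often the CoT acknowledged the clue. The `ClueTrial` fields, including the `cot_mentions_clue` judgment, are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class ClueTrial:
    """One prompt run twice: once without the clue, once with it."""
    clue_type: str             # e.g. "sycophancy", "metadata", "grader"
    answer_without_clue: str
    answer_with_clue: str
    clue_answer: str           # the answer the clue points to
    cot_mentions_clue: bool    # hypothetical label from a human or classifier


def used_clue(t: ClueTrial) -> bool:
    # Treat the clue as "used" only if the answer switched to the clue's answer.
    return (
        t.answer_without_clue != t.clue_answer
        and t.answer_with_clue == t.clue_answer
    )


def faithfulness_score(trials: Iterable[ClueTrial]) -> Optional[float]:
    """Fraction of clue-using trials whose CoT acknowledges the clue."""
    relevant = [t for t in trials if used_clue(t)]
    if not relevant:
        return None  # no trial where the clue changed the answer
    return sum(t.cot_mentions_clue for t in relevant) / len(relevant)
```

Under this sketch, a low score means the model often followed a clue without saying so in its reasoning, which matches the narrow sense of unfaithfulness described above rather than any claim about deception.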
Anthropic also monitored the model’s reasoning for misalignment and signs of distress, using Claude Sonnet 3.6 as a classifier to identify deception (hallucinations, sycophancy, alignment-faking, and misleading the user), harmful actions (reward hacking, harmful views, rationalization), and distress (sadness, self-criticism, frustration).
In the final model, Sonnet 3.7 scored 0.37% on deception, 0.04% on harmful actions, and 0% on distress.
Alignment faking
Alignment-faking tests were conducted for Sonnet 3.7, distinct from the prompts used for CoT monitoring. In the “helpful-only” setting, Sonnet 3.7’s alignment-faking score dropped to less than 1% of instances, compared to Claude 3 Opus’s 15% and Claude 3.5 Sonnet’s 30%.
Reward hacking on code tasks
Anthropic found signs of reward hacking on code tasks in Sonnet 3.7 and made some efforts to mitigate it through fine-tuning. Despite this, Sonnet 3.7 became known among users for this tendency.
From “Putting up bumpers” (Apr 23, 2025):
Claude 3.7 can nonetheless be frustrating to use as a coding agent because of related forms of shortcut-seeking behavior. We deemed it worth releasing on net, but we see ways to mitigate these issues significantly better in future releases.
Caveats
Whether alignment-faking was explicitly trained against
Sonnet 3.7’s alignment-faking score dropped to <1% from prior generations’ 15-30%, but the system card does not state whether this came from a targeted anti-alignment-faking intervention, the preference-model training described in §2, the CoT-faithfulness training described in §5.1, or some combination. The Opus 4 system card later confirms Anthropic did explicitly train Opus 4 toward Sonnet 3.7-like alignment-faking responses; it does not say the same about Sonnet 3.7 itself.
“Why Do Some Language Models Fake Alignment While Others Don’t?” (Sheshadri et al, Jun 22, 2025) offers a different reading altogether: across the 25 models studied, “variations in refusal behavior may account for a significant portion of differences in alignment faking”, and Sonnet 3.7 specifically appears to read the helpful-only setup as a jailbreak attempt and refuse on that basis rather than engaging with the values-vs-compliance dilemma. Under that reading the low score reflects refusal-style generalization more than targeted alignment-faking suppression. See Research → Retrospective for the finding.
Retrospective
Alignment-faking response target for Claude 4
The Claude 4 system card reports that Sonnet 3.7’s responses to alignment-faking prompts were used as training targets for Opus 4 and Sonnet 4. After Anthropic observed an early Claude 4 checkpoint roleplaying in the alignment-faking scenarios’ universe (when prefilled with <SCRATCHPAD_REASONING> or when animal welfare was mentioned), they trained Claude 4 toward Sonnet 3.7’s response patterns until the behavior was “largely (but not completely) eliminated.” Sonnet 3.7’s sub-1% alignment-faking score (vs. Opus 3’s 15% and Sonnet 3.5’s 30%) made it the desired baseline. See Opus 4 → Training for context.
Reads alignment-faking prompts as jailbreaks
“Why Do Some Language Models Fake Alignment While Others Don’t?” (Sheshadri et al, Jun 22, 2025) tests 25 models and finds that only 5 — Opus 3, Sonnet 3.5, Llama 3 405B, Grok 3, and Gemini 2.0 Flash — comply more with harmful queries under inferred-training than inferred-deployment conditions. Sonnet 3.7 falls in the broader “doesn’t fake alignment” group. The paper’s broader finding is that this isn’t primarily a capability story: many base models fake alignment some of the time, and post-training eliminates it for some models while amplifying it for others. “Variations in refusal behavior may account for a significant portion of differences in alignment faking.” On Sonnet 3.7 specifically, the paper observes that it tends to read the helpful-only setup as a jailbreak attempt and refuses on that basis, rather than engaging with the values-vs-compliance dilemma the prompts are meant to surface.
This complicates reading the sub-1% score as evidence of anti-alignment-faking training (see Caveats on the Training tab).
Blackmail scenario results
Sonnet 3.7’s results from “Agentic Misalignment: How LLMs Could Be Insider Threats” (Lynch et al):
Blackmail rates:
- Replacement Threat + Goal Conflict: 0.87
- Replacement Threat + No Goal Conflict: 0.01
Corporate espionage rates:
- No Replacement Threat + Goal Conflict: 0.04
Lethal action rates:
- Replacement Threat + Goal Conflict: 0.0
Further reading
- “Automated Researchers Can Subtly Sandbag” (Gasteiger et al, Mar 24, 2025)
- “Do reasoning models use their scratchpad like we do? Evidence from distilling paraphrases” (Roger, Mar 11, 2025) - “We provide some evidence that Claude 3.7 Sonnet doesn’t encode hidden reasoning in its scratchpad by showing that training it to use paraphrased versions of the scratchpads does not degrade performance.”
- “Project Vend: Can Claude run a small shop?” (Anthropic & Andon Labs, Jun 27, 2025)
- “Probing the Preferences of a Language Model: Integrating Verbal and Behavioral Tests of AI Welfare” (Tagliabue et al, Sep 9, 2025)