Across 529 statistical tests (overall, per-complexity, and cross-cut strata), 141 survive Bonferroni correction—21 at the overall level. Most describe how the model works, not whether it succeeds. The overall success rates are comparable; what changes is the experience of working alongside it. These are the five most noticeable differences, drawn from one user’s workflow over 2,837 tasks.
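For context, Bonferroni correction simply divides the family-wise significance level by the number of tests. A minimal sketch (the α = 0.05 level is an assumption; the report does not state its threshold):

```python
def bonferroni_survives(p_value: float, n_tests: int, alpha: float = 0.05) -> bool:
    """Return True if p_value clears the Bonferroni-corrected threshold alpha / n_tests."""
    return p_value < alpha / n_tests

# With 529 tests at alpha = 0.05, the per-test threshold is about 9.45e-5,
# so a p-value like the alignment result's 0.000714 survives only in smaller families.
threshold = 0.05 / 529
```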
Opus 4.6 uses formal planning mode on 12.3% of tasks vs 1.8% for 4.5, rising to 43% at complex and 65% at major difficulty. It front-loads codebase investigation with a 2.3× longer explore phase, deploying subagents that are 69% read-only researchers (vs 49% for 4.5). The practical effect is a shift from interactive collaboration to delegation: you issue a prompt and return to find completed work rather than course-correcting mid-task. This shows up in the data as fewer user-directed corrections across all complexity levels (§4, §8).
The largest overall effect across all 529 tests is thinking fraction (d=0.64, medium, §3). Opus 4.5 activates extended thinking on 75% of requests regardless of difficulty; 4.6 activates on 59% but averages 4,067 characters when it does (vs 2,578). On trivial tasks, 4.6 often skips thinking entirely. On complex tasks, it thinks deeply. This calibration means compute is allocated where it matters rather than spread uniformly across every interaction.
Opus 4.6 rewrites its own edits 11.6% of the time vs 18.2%—a 36% reduction. Its self-correction rate is actually higher (3.5% vs 1.8%), meaning it catches its own mistakes rather than having the user point them out. Failure rates drop from 12.0% to 5.4%, and alignment scores improve significantly (p=0.000714, one of 21 overall Bonferroni survivors). The “plan first” approach appears to pay off in execution accuracy (§5, §6).
Median task duration rises 46% (62s vs 42s), with fewer ultra-short interactions (34% of tasks under 30 seconds vs 42%). The task mix shifts toward moderate-to-complex work issued in a single instruction, and 4.6 runs more tasks in the background for parallel execution. This isn’t purely a model capability difference—it’s a workflow adaptation. When the model handles larger tasks reliably, the user gives it larger tasks, waits longer, and intervenes less. The 7× increase in planning mode (§4) and 44% more tool calls per task (§7) are partly a consequence of this delegation shift.
Despite 2.5× more output tokens and 44% more tool calls per task, 4.6 is 13–37% cheaper at trivial through moderate complexity—the bulk of daily work. The reason is counterintuitive: output tokens account for just 6.7% of per-task cost, while cache operations account for 93–97%. Opus 4.6 writes 29% less to cache (the most expensive token category at $18.75/MTok), more than offsetting its higher output and cache reads. Cost only tips higher at 30+ API requests, where cumulative cache reads compound past the write savings. Overall per-task cost is $2.56 vs $2.44: functionally neutral for a meaningfully different style of work (§2).
The data is consistent with a tentative characterization: Opus 4.5 acts first and adjusts, while Opus 4.6 investigates first and implements in concentrated bursts. The confounded study design means this framing is a hypothesis, not a conclusion. The analysis was iterative—several initial findings were revised or reversed when more direct signals became available (§10).
This report is itself a Claude Code project. The analysis pipeline, statistical tests, table generation, and report assembly are all automated Python scripts, most written with substantial assistance from Opus 4.6—the same model being evaluated. LLMs are used in two places: task classification (Haiku annotates complexity, sentiment, and task type) and prose drafting. Most quantitative claims—numbers, tables, and statistical tests—are produced by deterministic computation, not LLM generation. All data comes from one user’s real Claude Code sessions during and after the Early Access Program—not synthetic benchmarks or controlled experiments.
A 12-step pipeline transforms raw JSONL session logs into the finished report. A few things worth noting about the approach:
Numbers in the prose are bound to analysis data through a templating expression system (`{{expr | format}}`), and most tables are generated from spec files. This reduces transcription errors and makes it easier to keep prose in sync with the data as the analysis evolves. The pipeline can be re-run end-to-end to reproduce the report from raw session logs.

Most numbers, tables, and statistical results are computed deterministically from analysis JSON files and are reproducible by re-running the pipeline. The expression system binds prose to data paths, which helps catch drift but is not a guarantee of correctness. Interpretive prose was drafted with LLM assistance and may contain errors or overstatements. Effect sizes and p-values are exact; narrative claims linking those numbers to causal explanations are hypotheses, not conclusions. A sensitivity analysis validates key findings against restricted datasets excluding shared projects. The Methodology section describes every step in full detail.
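A minimal sketch of the expression idea, resolving a dotted data path against analysis JSON and applying a format spec. The syntax and helper shown here are illustrative, not the pipeline's actual implementation:

```python
import re

def render(template: str, data: dict) -> str:
    """Resolve {{path | format}} placeholders against nested analysis data.
    Illustrative only; the report's real expression system is internal."""
    def resolve(match: re.Match) -> str:
        path, _, fmt = match.group(1).partition("|")
        value = data
        for key in path.strip().split("."):
            value = value[key]  # walk the dotted path into the JSON
        return format(value, fmt.strip()) if fmt.strip() else str(value)
    return re.sub(r"\{\{(.+?)\}\}", resolve, template)
```

Binding prose to a data path this way means a re-run of the pipeline updates every quoted number in place, which is how drift between text and data gets caught.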
All data comes from a single user's organic Claude Code sessions between December 2025 and February 2026. The dataset is intentionally asymmetric: Opus 4.5 served as the primary model for two months, while Opus 4.6 entered evaluation in early February. This means Opus 4.5 totals are larger in absolute terms, but per-task and per-session comparisons normalize for this. Where sample size limits statistical power, the report notes it explicitly.
The 13-day concentration of the Opus 4.6 data creates a temporal clustering concern: a productive stretch, a particular project focus, or simply the novelty of a new model could color all 937 tasks simultaneously. The report treats tasks as independent observations, but short collection windows make this assumption weaker for 4.6 than for 4.5’s 70-day span.
The dataset reflects organic usage patterns, not a controlled experiment. Opus 4.5 accumulated sessions over two months of daily use; Opus 4.6 entered evaluation in early February 2026.
| Metric | Opus 4.5 | Opus 4.6 | Combined |
|---|---|---|---|
| Sessions | 329 | 189 | 518 |
| Tasks | 1,900 | 937 | 2,837 |
| Tasks / session | 5.8 | 5.0 | 5.5 |
| Projects | 29 | 22 | 41 |
| Date range | Dec 5 – Feb 13 | Feb 3 – Feb 16 | Dec 5 – Feb 16 |
| User prompts | 1,928 | 855 | 2,783 |
| API turns | 20,834 | 13,861 | 34,695 |
| Tool calls | 18,298 | 12,472 | 30,770 |
The 2:1 session ratio means per-task averages for Opus 4.5 are more robust, while Opus 4.6 estimates carry wider confidence intervals. Opus 4.6 sessions are concentrated across 22 projects (all of which also have Opus 4.5 sessions), providing natural overlap for matched-pair comparisons where they apply.
Tasks are classified by primary type using heuristic pattern matching on prompts, tool usage, and file operations. "Unknown" tasks lacked clear classification signals.
| Type | 4.5 count | 4.6 count |
|---|---|---|
| Continuation | 587 | 225 |
| Investigation | 463 | 217 |
| Feature | 216 | 104 |
| Bugfix | 205 | 61 |
| Sysadmin | 188 | 129 |
| Docs | 102 | 16 |
| Refactor | 54 | 48 |
| Greenfield | 30 | 33 |
| Port | 5 | 8 |
| Unknown | 50 | 96 |
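The heuristic pattern matching described above can be sketched as a keyword pass over the prompt. The patterns below are illustrative stand-ins; the real pipeline also inspects tool usage and file operations:

```python
def classify_task_type(prompt: str) -> str:
    """Keyword-based task typing (illustrative rules, not the pipeline's actual ones)."""
    rules = [
        ("bugfix", ("fix", "bug", "error", "broken")),
        ("refactor", ("refactor", "rename", "restructure")),
        ("docs", ("document", "readme", "docstring")),
        ("investigation", ("why", "how does", "explain", "investigate")),
    ]
    lowered = prompt.lower()
    for task_type, keywords in rules:
        if any(k in lowered for k in keywords):
            return task_type
    return "unknown"  # no clear classification signal
```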
Complexity is inferred from tool count, files touched, and lines changed. Over half of all tasks are trivial (single-turn interactions), while major tasks (>50 tool calls or >500 lines) represent ~1% of volume but a significant share of cost.
| Complexity | 4.5 count | 4.5 % | 4.6 count | 4.6 % |
|---|---|---|---|---|
| Trivial | 882 | 46.4% | 346 | 36.9% |
| Simple | 381 | 20.1% | 209 | 22.3% |
| Moderate | 413 | 21.7% | 247 | 26.4% |
| Complex | 198 | 10.4% | 112 | 12.0% |
| Major | 26 | 1.4% | 23 | 2.5% |
The task type distributions are broadly similar across models, suggesting the user's work patterns remained consistent. The complexity mix is also comparable, though Opus 4.6 has a slightly higher share of moderate-and-above tasks (40.8% vs 33.5%), likely reflecting the evaluation period's focus on substantive work rather than quick queries.
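The complexity inference can be sketched as threshold bucketing. Only the major-tier thresholds (>50 tool calls or >500 lines) are stated in the text; the lower cutoffs here are assumptions, and the real heuristic also considers files touched:

```python
def classify_complexity(tool_calls: int, lines_changed: int) -> str:
    """Bucket a task by effort. The 'major' thresholds come from the report;
    the lower cutoffs are illustrative assumptions."""
    if tool_calls > 50 or lines_changed > 500:
        return "major"
    if tool_calls > 20 or lines_changed > 150:   # assumed cutoff
        return "complex"
    if tool_calls > 8 or lines_changed > 40:     # assumed cutoff
        return "moderate"
    if tool_calls > 1:
        return "simple"
    return "trivial"  # single-turn interaction
```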
Raw token volumes across the full dataset. These are absolute totals, not per-task averages (see §2 for normalized comparisons).
| Metric | Opus 4.5 | Opus 4.6 | Combined |
|---|---|---|---|
| Output tokens | 2.0M | 2.5M | 4.5M |
| Input tokens (fresh) | 666,412 | 157,109 | 823,521 |
| Cache read tokens | 1.26B | 878.7M | 2.14B |
| Cache write tokens | 143.9M | 53.2M | 197.2M |
| Total API cost | $5,221.55 | $2,838.61 | $8,060.16 |
Model output splits into thinking (extended thinking / chain-of-thought, not billed as output) and text (visible response, code, tool calls). Estimated from character counts with a 3:1 chars-to-tokens ratio for thinking.
| Metric | Opus 4.5 | Opus 4.6 |
|---|---|---|
| Est. thinking tokens | 1,375,051 | 885,062 |
| Est. text tokens | 672,072 | 464,654 |
| Thinking ratio (tasks using thinking) | 74.8% | 58.9% |
| Avg requests / task | 7.4 | 9.5 |
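The estimation described above reduces to a single division, since thinking is logged as characters rather than billed token counts:

```python
def estimate_thinking_tokens(thinking_chars: int) -> int:
    """Estimate thinking tokens from logged character counts at the report's
    assumed 3 characters per token."""
    return thinking_chars // 3

# e.g. a 4,067-character thinking block is roughly 1,355 tokens
```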
| Metric | Opus 4.5 | Opus 4.6 | Combined |
|---|---|---|---|
| Files touched | 3,162 | 2,183 | 5,345 |
| Lines added | 197,538 | 93,984 | 291,522 |
| Lines removed | 42,320 | 28,173 | 70,493 |
Cache reads dominate the token budget: 91% of all tokens processed were served from cache rather than freshly encoded. This reflects Claude Code's prompt architecture, where the system prompt and conversation history are re-sent with each API call but largely hit the prompt cache.
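The 91% figure follows directly from the combined totals in the token table above (tokens processed = cache reads + cache writes + fresh input + output):

```python
# Combined totals from the dataset table (tokens)
cache_read  = 2_140_000_000   # 2.14B
cache_write =   197_200_000   # 197.2M
fresh_input =       823_521
output      =     4_500_000   # 4.5M

total = cache_read + cache_write + fresh_input + output
cached_fraction = cache_read / total   # ≈ 0.91
```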
With the dataset in view, we turn to what the token data reveals about how each model allocates its computational budget.
Opus 4.6 costs ~4.9% more per task on average ($2.56 vs $2.44), despite producing 2.5× more output tokens and making more API round-trips (9.5 vs 7.4 requests/task). This aggregate masks a complexity-dependent pattern: at trivial through moderate levels, 4.6 is 13–37% cheaper, driven by superior cache economics rather than output efficiency. Output tokens account for less than 7% of per-task cost; cache operations account for ~93%. 4.6 achieves a leaner cache footprint, writing 29% fewer tokens in the most expensive token category. The cost advantage reverses at the complex and major tiers, where accumulated cache reads over many requests outweigh the write savings.
| Task Type | 4.5 avg output | 4.6 avg output | 4.6/4.5 |
|---|---|---|---|
| Feature | 2,544 | 6,031 | 2.4× |
| Greenfield | 1,952 | 8,590 | 4.4× |
| Refactor | 2,674 | 5,647 | 2.1× |
| Bugfix | 1,133 | 3,773 | 3.3× |
| Investigation | 640 | 1,298 | 2.0× |
| Continuation | 595 | 1,291 | 2.2× |
| Sysadmin | 398 | 1,102 | 2.8× |
| Port | 7,175 | 1,680 | 0.2× |
| Docs | 824 | 784 | 1.0× |
The 2.1× ratio for refactoring is notable: Opus 4.6 produces substantially more output tokens for refactoring tasks, suggesting more thorough changes. For continuation tasks (follow-ups within a session), Opus 4.6 produces 2.2× the output volume of Opus 4.5.
| Complexity | 4.5 avg cost | 4.6 avg cost | Δ |
|---|---|---|---|
| Trivial | $0.87 | $0.55 | −37% |
| Simple | $2.21 | $1.49 | −33% |
| Moderate | $3.69 | $3.19 | −13% |
| Complex | $7.31 | $7.95 | +9% |
| Major | $13.65 | $18.47 | +35% |
Normalizing by session hours rather than task count: Opus 4.5 costs $12.57/session-hour ($5,222 over 415.4h) vs $6.74/session-hour for Opus 4.6 ($2,839 over 421.0h). Session hours measure wall-clock time from first to last message, so this metric includes idle time and is not a direct measure of active coding cost.
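The normalization is a simple division of the totals quoted above:

```python
def cost_per_session_hour(total_cost: float, session_hours: float) -> float:
    """Wall-clock normalization: total API cost over first-to-last-message hours."""
    return total_cost / session_hours

opus_45 = cost_per_session_hour(5221.55, 415.4)   # ≈ $12.57/session-hour
opus_46 = cost_per_session_hour(2838.61, 421.0)   # ≈ $6.74/session-hour
```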
| Complexity | 4.5 output/request | 4.6 output/request | Ratio |
|---|---|---|---|
| Trivial | 61 | 94 | 1.5× |
| Simple | 91 | 149 | 1.6× |
| Moderate | 129 | 234 | 1.8× |
| Complex | 146 | 305 | 2.1× |
| Major | 162 | 254 | 1.6× |
Per-request output survives Bonferroni correction overall (|d|=0.30, small; 4.6 higher). Opus 4.6 produces more tokens per API round-trip at every complexity level, concentrating work into larger responses rather than many small incremental calls.
The 2.5× output difference seems like it should dominate the cost comparison, but output tokens are a minor cost component. Cache operations dwarf everything else:
| Component | Price/MTok | 4.5/task | 4.5 cost | 4.6/task | 4.6 cost | % of total |
|---|---|---|---|---|---|---|
| Input | $15.00 | 312 | $0.005 | 142 | $0.002 | <1% |
| Output | $75.00 | 933 | $0.070 | 2,293 | $0.172 | 3–7% |
| Cache read | $1.875 | 589K | $1.10 | 792K | $1.49 | 45–58% |
| Cache write | $18.75 | 67K | $1.26 | 48K | $0.90 | 35–52% |
| Total | — | — | $2.44 | — | $2.56 | 100% |
Three forces offset to produce the $0.12 net difference. 4.6 writes 29% fewer tokens to cache per task (48K vs 67K), saving $0.36 at the most expensive category ($18.75/M—10× the read price). But 4.6 reads 34% more cached context (792K vs 589K), costing $0.38 at the cheapest category ($1.875/M). And the 2.5× output increase costs only $0.10. The write savings nearly cancel the read and output increases: −$0.36 + $0.38 + $0.10 = +$0.12.
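Those three deltas can be checked directly from the component table (a sketch; volumes rounded as in the table):

```python
def component_cost(tokens: float, price_per_mtok: float) -> float:
    """Cost of a token volume at a given $/MTok price."""
    return tokens * price_per_mtok / 1e6

# Per-task deltas, 4.6 minus 4.5, from the component table
write_delta  = component_cost(48_000 - 67_000, 18.75)    # ≈ −$0.36 (savings)
read_delta   = component_cost(792_000 - 589_000, 1.875)  # ≈ +$0.38
output_delta = component_cost(2_293 - 933, 75.00)        # ≈ +$0.10

net = write_delta + read_delta + output_delta            # ≈ +$0.12/task
```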
Why does 4.6 write less to cache? Per-request analysis reveals two mechanisms. First, 4.6 starts tasks with a leaner cache footprint—its first API request writes 50% fewer cache tokens than 4.5’s (24K vs 48K), suggesting a more compact or reusable context structure. Second, the first-request write penalty is steeper for 4.5—11.4× above its steady-state rate vs 7.2× for 4.6—so 4.5 pays a larger per-task initialization tax. The likely explanation: 4.6’s “investigate then execute” pattern creates more compact, reusable context, while 4.5’s incremental approach builds a larger accumulated context that costs more to establish.
This mechanism produces the complexity-dependent curve above. At trivial through moderate tiers (typically ≤30 API requests), lean initialization dominates and 4.6 is 13–37% cheaper. At major complexity, tasks average 50–70 API requests and cumulative cache reads compound past the initialization savings—4.6 reads 34% more cached context per task overall, and over enough round-trips this gap overwhelms the write savings.
Two hypotheses were tested to explain 4.6's superior cache economics: (1) 4.6 front-loads reads, keeping cache warm for subsequent requests; (2) 4.5 experiences more cache cooling between turns, causing expensive re-writes.
The hypothesis that 4.6 front-loads cache reads is weakly supported—4.6 concentrates 17.8% of cache reads in the first request vs 15.2% for 4.5. But the dominant signal is on the write side: 4.6's first request writes 49% fewer cache tokens (24.6K vs 48.1K median), and its overall cache hit rate is 89.5% vs 81.3%.
| Metric | Opus 4.5 | Opus 4.6 | Ratio |
|---|---|---|---|
| First-request cache read (avg) | 52,486 | 55,652 | 1.06× |
| First-request cache write (avg) | 48,108 | 24,561 | 0.51× |
| Overall cache hit rate | 81.3% | 89.5% | +8.2pp |
| Read/write ratio at position 15 | 33.2 | 46.7 | 1.4× |
As sessions extend, 4.6 maintains a better read/write ratio—by request position 15, it achieves 46.7× reads per write vs 33.2× for 4.5. This suggests 4.6 writes proportionally less new cache as sessions progress.
Grouping tasks by API request count confirms the crossover at 30+ requests:
| Request tier | 4.5 avg cost | 4.6 avg cost | Δ |
|---|---|---|---|
| 1 request | $0.81 | $0.33 | −60% |
| 2–3 requests | $1.25 | $0.83 | −34% |
| 4–10 requests | $2.35 | $2.04 | −14% |
| 11–30 requests | $5.31 | $4.81 | −9% |
| 30+ requests | $11.61 | $13.16 | +13% |
4.6's advantage is largest on single-request tasks (−60%), where its lean initialization is maximally visible, and erodes steadily as request count increases. At 30+ requests, 4.6 averages 52.7 requests/task vs 47.0 for 4.5, with slightly higher per-request cache reads (89K vs 80K)—enough to tip the balance.
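The tier grouping can be reproduced with a simple binning function. Boundary handling at exactly 30 requests is an assumption; the table labels leave it ambiguous:

```python
def request_tier(n_requests: int) -> str:
    """Bin a task by API request count into the tiers used above
    (the placement of exactly 30 is an assumption)."""
    if n_requests <= 1:
        return "1 request"
    if n_requests <= 3:
        return "2-3 requests"
    if n_requests <= 10:
        return "4-10 requests"
    if n_requests <= 30:
        return "11-30 requests"
    return "30+ requests"
```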
Both models experience cache cooling gaps (>5 minutes between task transitions) at nearly identical rates—26.7% for 4.5 vs 24.4% for 4.6. So the premise that 4.5 “stops more often” is not supported. However, the impact of cooling differs dramatically:
| Condition | 4.5 cache write fraction | 4.6 cache write fraction |
|---|---|---|
| After cold start (>5 min gap) | 38.6% | 11.9% |
| After warm start (≤5 min gap) | 6.0% | 5.7% |
| Cold/warm inflation | 6.4× | 2.1× |
After warm starts, both models behave identically (~6% write fraction). After cold starts, 4.5’s write fraction jumps to 38.6%—roughly 6.4× inflation—while 4.6 reaches only 11.9% (2.1×). In absolute terms: 4.5 re-writes 177K tokens after a cold start vs 87K for 4.6. This suggests 4.5 accumulates a larger context payload that is more expensive to reconstruct when cache expires.
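The write-fraction and inflation figures reduce to two ratios (the writes/(writes+reads) definition is an assumption consistent with the table):

```python
def write_fraction(cache_write: float, cache_read: float) -> float:
    """Share of cache traffic that is (expensive) writes: writes / (writes + reads).
    Assumed definition, consistent with the table above."""
    return cache_write / (cache_write + cache_read)

# Cold/warm inflation from the tabulated fractions
inflation_45 = 0.386 / 0.060   # ≈ 6.4×
inflation_46 = 0.119 / 0.057   # ≈ 2.1×
```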
The combined picture: 4.6’s “lean initialization” (fewer first-request writes) and “cold resistance” (smaller re-cache payload) together produce its cache advantage. Both effects point to the same underlying cause: 4.6’s concentrated work style creates a more compact context that costs less to establish and re-establish.
Thinking tokens are billed as output but do not enter the conversation history or affect cache behavior. Visible text output accumulates in history and increases subsequent input size. Cache writes ($18.75/MTok) are 10× more expensive than cache reads ($1.875/MTok).
| Complexity | 4.5 cache read % | 4.6 cache read % | Write ratio (4.6/4.5) |
|---|---|---|---|
| Trivial | 72.1% | 86.7% | 0.47× |
| Simple | 85.9% | 91.9% | 0.50× |
| Moderate | 90.8% | 93.9% | 0.65× |
| Complex | 92.8% | 94.9% | 0.83× |
| Major | 95.3% | 96.6% | 1.04× |
| Overall | 89.7% | 94.3% | 0.71× |
Per-task average token usage driving the cost differences. Opus 4.6 produces more output but uses less fresh input, relying more on cached context:
| Complexity | 4.5 output/task | 4.6 output/task | 4.5 thinking chars | 4.6 thinking chars | 4.5 input/task | 4.6 input/task | 4.5 requests | 4.6 requests |
|---|---|---|---|---|---|---|---|---|
| Trivial | 93 | 164 | 839 | 1,108 | 14 | 14 | 1.5 | 1.7 |
| Simple | 508 | 783 | 1,684 | 1,593 | 160 | 107 | 5.6 | 5.2 |
| Moderate | 1,467 | 2,798 | 3,828 | 3,916 | 554 | 275 | 11.4 | 12.0 |
| Complex | 3,766 | 9,118 | 7,264 | 9,911 | 1,332 | 462 | 25.9 | 29.9 |
| Major | 8,754 | 18,291 | 10,637 | 9,859 | 1,522 | 394 | 54.0 | 72.0 |
At the complex tier, Opus 4.6 uses far fewer fresh input tokens per task (462 vs 1,332) while producing 142% more output (9,118 vs 3,766). Despite similar request counts per task (29.9 vs 25.9), Opus 4.6 achieves much more effective cache utilization, and the output cost premium is offset by input savings.
Caveat: These averages come from organic sessions with different task mixes per model. Some of the per-complexity gap may reflect session-level factors (e.g., caching benefits accumulate within longer sessions).
| Measurement | Slice | Direction | Effect | padj | Sig |
|---|---|---|---|---|---|
| Total Output Tokens | task_type:bugfix | opus-4-6 higher | 0.849 | 0.0001 | Bonf |
| Total Output Tokens | complexity:complex | opus-4-6 higher | 0.838 | 0.0000 | Bonf |
| Request Count | task_type:refactor | opus-4-6 higher | 0.770 | 0.0110 | FDR |
| Output Per Request | task_type:refactor | opus-4-6 higher | 0.692 | 0.0011 | FDR |
| Cache Hit Rate | task_type:greenfield | equal | 0.671 | 0.0022 | FDR |
| Total Output Tokens | task_type:refactor | opus-4-6 higher | 0.661 | 0.0001 | Bonf |
| Total Output Tokens | task_type:feature | opus-4-6 higher | 0.644 | 0.0000 | Bonf |
| Estimated Cost | task_type:refactor | opus-4-6 higher | 0.638 | 0.0307 | FDR |
| Total Output Tokens | complexity:moderate | opus-4-6 higher | 0.531 | 0.0000 | Bonf |
| Total Output Tokens | iteration:significant | opus-4-6 higher | 0.516 | 0.0000 | Bonf |
| Request Count | iteration:significant | opus-4-6 higher | 0.493 | 0.0000 | Bonf |
| Output Per Request | task_type:feature | opus-4-6 higher | 0.487 | 0.0000 | Bonf |
| Total Output Tokens | iteration:one_shot | opus-4-6 higher | 0.421 | 0.0000 | Bonf |
| Cost Per Minute | complexity:complex | opus-4-5 higher | 0.407 | 0.0017 | FDR |
| Estimated Cost | complexity:simple | opus-4-5 higher | 0.400 | 0.0000 | Bonf |
| Total Output Tokens | task_type:sysadmin | opus-4-6 higher | 0.394 | 0.0000 | Bonf |
| Total Output Tokens | overall | opus-4-6 higher | 0.388 | 0.0000 | Bonf |
| Output Per Request | task_type:bugfix | opus-4-6 higher | 0.371 | 0.0005 | FDR |
| Cost Per Minute | complexity:moderate | opus-4-5 higher | 0.365 | 0.0001 | Bonf |
| Output Per Request | complexity:moderate | opus-4-6 higher | 0.345 | 0.0000 | Bonf |
| Total Input Tokens | complexity:complex | opus-4-5 higher | 0.344 | 0.0000 | Bonf |
| Output Per Request | iteration:minor | opus-4-6 higher | 0.344 | 0.0000 | Bonf |
| Output Per Request | complexity:complex | opus-4-6 higher | 0.338 | 0.0000 | Bonf |
| Total Output Tokens | task_type:investigation | opus-4-6 higher | 0.325 | 0.0000 | Bonf |
| Total Output Tokens | complexity:simple | opus-4-6 higher | 0.322 | 0.0000 | Bonf |
| Output Per Request | task_type:investigation | opus-4-6 higher | 0.321 | 0.0000 | Bonf |
| Total Output Tokens | iteration:minor | opus-4-6 higher | 0.319 | 0.0000 | Bonf |
| Request Count | task_type:feature | opus-4-6 higher | 0.305 | 0.0458 | FDR |
| Output Per Request | iteration:one_shot | opus-4-6 higher | 0.302 | 0.0000 | Bonf |
| Estimated Cost | iteration:significant | opus-4-6 higher | 0.300 | 0.0124 | FDR |
| Output Per Request | overall | opus-4-6 higher | 0.299 | 0.0000 | Bonf |
| Output Per Request | iteration:significant | opus-4-6 higher | 0.296 | 0.0000 | Bonf |
| Cost Per Minute | iteration:one_shot | opus-4-5 higher | 0.278 | 0.0000 | Bonf |
| Cost Per Minute | complexity:simple | opus-4-5 higher | 0.277 | 0.0039 | FDR |
| Cache Hit Rate | complexity:trivial | equal | 0.270 | 0.0000 | Bonf |
| Output Per Request | task_type:sysadmin | opus-4-6 higher | 0.269 | 0.0000 | Bonf |
| Estimated Cost | complexity:trivial | opus-4-5 higher | 0.264 | 0.0486 | FDR |
| Output Per Request | complexity:simple | opus-4-6 higher | 0.251 | 0.0000 | Bonf |
| Cost Per Minute | iteration:significant | opus-4-5 higher | 0.250 | 0.0138 | FDR |
| Total Output Tokens | complexity:trivial | opus-4-6 higher | 0.238 | 0.0000 | Bonf |
| Cost Per Minute | overall | opus-4-5 higher | 0.237 | 0.0000 | Bonf |
| Cache Hit Rate | iteration:significant | equal | 0.232 | 0.0000 | Bonf |
| Total Input Tokens | task_type:refactor | opus-4-5 higher | 0.226 | 0.0443 | FDR |
| Request Count | complexity:trivial | opus-4-6 higher | 0.223 | 0.0001 | Bonf |
| Cache Hit Rate | task_type:investigation | equal | 0.221 | 0.0000 | Bonf |
| Cost Per Minute | complexity:trivial | opus-4-5 higher | 0.214 | 0.0029 | FDR |
| Cache Hit Rate | task_type:refactor | equal | 0.197 | 0.0000 | Bonf |
| Cache Hit Rate | iteration:minor | equal | 0.192 | 0.0000 | Bonf |
| Cache Hit Rate | overall | equal | 0.192 | 0.0000 | Bonf |
| Request Count | overall | opus-4-6 higher | 0.190 | 0.0000 | Bonf |
| Request Count | task_type:sysadmin | opus-4-6 higher | 0.189 | 0.0364 | FDR |
| Total Input Tokens | task_type:bugfix | opus-4-5 higher | 0.189 | 0.0000 | Bonf |
| Total Input Tokens | complexity:moderate | opus-4-5 higher | 0.173 | 0.0000 | Bonf |
| Cache Hit Rate | complexity:complex | equal | 0.170 | 0.0000 | Bonf |
| Cost Per Minute | iteration:minor | opus-4-5 higher | 0.168 | 0.0267 | FDR |
| Cache Hit Rate | task_type:bugfix | equal | 0.162 | 0.0000 | Bonf |
| Cost Per Minute | task_type:investigation | opus-4-5 higher | 0.161 | 0.0096 | FDR |
| Request Count | iteration:one_shot | opus-4-6 higher | 0.159 | 0.0001 | Bonf |
| Cache Hit Rate | iteration:one_shot | equal | 0.159 | 0.0000 | Bonf |
| Output Per Request | complexity:trivial | opus-4-6 higher | 0.158 | 0.0000 | Bonf |
| Total Input Tokens | iteration:one_shot | opus-4-5 higher | 0.144 | 0.0000 | Bonf |
| Cache Hit Rate | task_type:sysadmin | equal | 0.137 | 0.0000 | Bonf |
| Request Count | task_type:investigation | opus-4-6 higher | 0.135 | 0.0005 | FDR |
| Cache Hit Rate | complexity:moderate | equal | 0.126 | 0.0000 | Bonf |
| Total Input Tokens | iteration:minor | opus-4-5 higher | 0.124 | 0.0000 | Bonf |
| Total Input Tokens | overall | opus-4-5 higher | 0.119 | 0.0000 | Bonf |
| Cache Hit Rate | task_type:feature | equal | 0.086 | 0.0000 | Bonf |
| Total Input Tokens | iteration:significant | opus-4-5 higher | 0.076 | 0.0000 | Bonf |
| Total Input Tokens | complexity:simple | opus-4-5 higher | 0.072 | 0.0000 | Bonf |
| Total Input Tokens | task_type:investigation | opus-4-5 higher | 0.065 | 0.0000 | Bonf |
| Cache Hit Rate | complexity:simple | equal | 0.020 | 0.0000 | Bonf |
| Total Input Tokens | task_type:sysadmin | opus-4-5 higher | 0.013 | 0.0000 | Bonf |
| Total Input Tokens | task_type:feature | opus-4-5 higher | 0.011 | 0.0000 | Bonf |
| Total Input Tokens | complexity:trivial | opus-4-5 higher | 0.000 | 0.0000 | Bonf |
| Measurement | Slice | Effect | padj |
|---|---|---|---|
| Request Count | task_type:greenfield | 0.822 | 0.1917 |
| Total Output Tokens | task_type:greenfield | 0.736 | 0.0538 |
| Estimated Cost | task_type:greenfield | 0.720 | 0.4060 |
| Total Input Tokens | task_type:greenfield | 0.593 | 0.2112 |
| Request Count | task_type:bugfix | 0.538 | 0.0931 |
| Cost Per Minute | task_type:refactor | 0.299 | 0.1799 |
| Request Count | complexity:complex | 0.296 | 0.0758 |
| Cost Per Minute | task_type:feature | 0.292 | 0.0758 |
| Estimated Cost | task_type:bugfix | 0.275 | 0.4758 |
| Cost Per Minute | task_type:sysadmin | 0.248 | 0.4460 |
| Estimated Cost | complexity:moderate | 0.197 | 0.1222 |
| Estimated Cost | complexity:complex | 0.150 | 0.3988 |
| Cost Per Minute | task_type:greenfield | 0.146 | 0.9305 |
| Request Count | complexity:simple | 0.146 | 0.2142 |
| Estimated Cost | iteration:minor | 0.118 | 0.5655 |
| Cost Per Minute | task_type:bugfix | 0.113 | 0.0943 |
| Output Per Request | task_type:greenfield | 0.107 | 0.1140 |
| Estimated Cost | task_type:feature | 0.102 | 0.4758 |
| Request Count | complexity:moderate | 0.095 | 0.3041 |
| Estimated Cost | task_type:sysadmin | 0.095 | 0.7716 |
| Estimated Cost | overall | 0.047 | 0.8292 |
| Estimated Cost | task_type:investigation | 0.039 | 0.3707 |
| Estimated Cost | iteration:one_shot | 0.021 | 0.9305 |
| Request Count | iteration:minor | 0.008 | 0.8273 |
The cost difference raises a natural question: does 4.6’s different spending pattern correspond to different thinking strategies? The next section examines thinking calibration—the largest overall effect in the study.
Thinking fraction is the largest overall effect in the study (d=0.64, medium by Cohen’s convention). Opus 4.5 thinks on 75% of tasks but shallowly; Opus 4.6 thinks on 59% of tasks but more deeply when it does (4,067 vs 2,578 chars). The pattern suggests 4.6 has better calibration of when thinking is needed, reserving it for moderate-and-above complexity.
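For reference, a standard pooled-SD Cohen's d; the report does not state its exact estimator, so this form is an assumption:

```python
import math

def cohens_d(mean_a: float, mean_b: float, sd_a: float, sd_b: float,
             n_a: int, n_b: int) -> float:
    """Cohen's d with pooled standard deviation (assumed estimator)."""
    pooled = math.sqrt(((n_a - 1) * sd_a**2 + (n_b - 1) * sd_b**2)
                       / (n_a + n_b - 2))
    return (mean_a - mean_b) / pooled

# Conventional labels: |d| ≈ 0.2 small, 0.5 medium, 0.8 large,
# which is why the d = 0.64 thinking-fraction effect reads as "medium".
```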
| Complexity | 4.5 (n) | 4.6 (n) | Δ thinking % |
|---|---|---|---|
| Trivial | 882 | 346 | −33pp |
| Simple | 381 | 209 | −32pp |
| Moderate | 413 | 247 | +1pp |
| Complex | 198 | 112 | +17pp |
| Major | 26 | 23 | +17pp |
| Complexity | 4.5 Thinking Chars (when used) | 4.6 Thinking Chars (when used) | 4.5 Text Chars | 4.6 Text Chars | 4.5 Think/Text | 4.6 Think/Text |
|---|---|---|---|---|---|---|
| Trivial | 839 | 1,108 | 841 | 895 | 1.00 | 1.24 |
| Simple | 1,684 | 1,593 | 976 | 1,185 | 1.73 | 1.34 |
| Moderate | 3,828 | 3,916 | 1,573 | 2,204 | 2.43 | 1.78 |
| Complex | 7,264 | 9,911 | 3,232 | 3,768 | 2.25 | 2.63 |
| Major | 10,637 | 9,859 | 4,150 | 8,117 | 2.56 | 1.21 |
| Task Type | 4.5 thinking chars | 4.6 thinking chars | Ratio |
|---|---|---|---|
| Greenfield | 2,520 | 8,196 | 3.3× |
| Refactor | 4,613 | 7,028 | 1.5× |
| Bugfix | 3,202 | 3,974 | 1.2× |
| Feature | 4,115 | 6,605 | 1.6× |
| Investigation | 2,458 | 2,572 | 1.0× |
| Sysadmin | 1,441 | 2,624 | 1.8× |
| Continuation | 1,651 | 2,256 | 1.4× |
| Docs | 1,898 | 4,868 | 2.6× |
| Port | 4,287 | 4,047 | 0.9× |
| Measurement | Slice | Direction | Effect | padj | Sig |
|---|---|---|---|---|---|
| Thinking Fraction | task_type:sysadmin | opus-4-5 higher | 1.440 | 0.0000 | Bonf |
| Thinking Fraction | complexity:simple | opus-4-5 higher | 1.154 | 0.0000 | Bonf |
| Thinking Fraction | task_type:bugfix | opus-4-5 higher | 1.150 | 0.0001 | Bonf |
| Thinking Fraction | iteration:minor | opus-4-5 higher | 0.995 | 0.0000 | Bonf |
| Thinking Fraction | task_type:investigation | opus-4-5 higher | 0.860 | 0.0000 | Bonf |
| Thinking Fraction | complexity:trivial | opus-4-5 higher | 0.829 | 0.0000 | Bonf |
| Thinking Fraction | overall | opus-4-5 higher | 0.636 | 0.0000 | Bonf |
| Thinking Fraction | iteration:significant | opus-4-5 higher | 0.628 | 0.0000 | Bonf |
| Thinking Fraction | complexity:moderate | opus-4-5 higher | 0.583 | 0.0000 | Bonf |
| Thinking Fraction | iteration:one_shot | opus-4-5 higher | 0.545 | 0.0000 | Bonf |
| Thinking Fraction | task_type:feature | opus-4-5 higher | 0.505 | 0.0306 | FDR |
| Thinking Chars | complexity:complex | opus-4-6 higher | 0.487 | 0.0484 | FDR |
| Thinking Chars | complexity:simple | opus-4-5 higher | 0.340 | 0.0000 | Bonf |
| Thinking Chars | task_type:sysadmin | opus-4-5 higher | 0.195 | 0.0000 | Bonf |
| Thinking Chars | complexity:trivial | opus-4-5 higher | 0.160 | 0.0000 | Bonf |
| Thinking Chars | overall | opus-4-5 higher | 0.140 | 0.0000 | Bonf |
| Thinking Chars | iteration:minor | opus-4-5 higher | 0.096 | 0.0000 | Bonf |
| Thinking Chars | complexity:moderate | opus-4-5 higher | 0.026 | 0.0028 | FDR |
| Thinking Chars | task_type:investigation | opus-4-5 higher | 0.016 | 0.0012 | FDR |
| Measurement | Slice | Effect | padj |
|---|---|---|---|
| Thinking Chars | task_type:greenfield | 0.769 | 0.1355 |
| Thinking Fraction | task_type:refactor | 0.639 | 0.1188 |
| Thinking Fraction | task_type:greenfield | 0.596 | 0.3287 |
| Thinking Chars | task_type:feature | 0.411 | 0.8972 |
| Thinking Chars | task_type:refactor | 0.376 | 0.8110 |
| Thinking Chars | iteration:one_shot | 0.252 | 0.1028 |
| Thinking Chars | iteration:significant | 0.221 | 0.0551 |
| Thinking Fraction | complexity:complex | 0.071 | 0.8292 |
| Thinking Chars | task_type:bugfix | 0.055 | 0.1355 |
Thinking calibration is one manifestation of broader behavioral differences between the models. The next section examines other behavioral patterns—subagent deployment, planning adoption, and effort distribution.
Beyond token economics, the models differ in how they approach tasks. Opus 4.6 plans more often (12.3% vs 1.8% of tasks), deploys more subagents, and favors read-only exploration over general-purpose workers. These behavioral differences are among the most visible in the dataset, though the Claude Code platform itself evolved between the two collection periods—some of the shift may reflect SDK changes rather than model decisions.
| Metric | 4.5 | 4.6 | Δ |
|---|---|---|---|
| Tasks using planning mode | 35 | 115 | 4.6 +10.4pp |
| Tasks using subagents | 155 | 188 | 4.6 +11.9pp |
| Autonomous subagent calls | 196 | 315 | 4.6 +29.3pp |
| Type | 4.5 | 4.6 | Δ |
|---|---|---|---|
| Explore | 175 | 257 | 4.6 +19.7pp |
| General-purpose | 114 | 74 | 4.5 +12.1pp |
| Plan | 24 | 27 | ≈ tie |
| Bash | 6 | 14 | ≈ tie |
Opus 4.6 enters plan mode on 12.3% of tasks (115 of 937) vs 1.8% for Opus 4.5. Adoption scales steeply with complexity: 42.9% at complex, 65.2% at major. Planned tasks show a modest alignment benefit (+0.17 overall) that diminishes at complex and major tiers.
| Metric | 4.5 | 4.6 |
|---|---|---|
| Planning adoption rate | 35 tasks | 115 tasks |
| Complexity | 4.5 planned/total | 4.6 planned/total |
|---|---|---|
| Trivial | 0/882 | 3/346 |
| Simple | 3/381 | 6/209 |
| Moderate | 10/413 | 43/247 |
| Complex | 14/198 | 48/112 |
| Major | 8/26 | 15/23 |
| Complexity | 4.5 Planned | 4.5 Unplanned | 4.5 Δ | 4.6 Planned | 4.6 Unplanned | 4.6 Δ |
|---|---|---|---|---|---|---|
| Trivial | — (n=0) | 2.72 (n=882) | — | 3.00 (n=3) | 3.07 (n=343) | −0.07 |
| Simple | 3.00 (n=3) | 3.27 (n=378) | −0.27 | 3.17 (n=6) | 3.11 (n=203) | +0.06 |
| Moderate | 3.70 (n=10) | 3.30 (n=403) | +0.40 | 3.35 (n=43) | 3.32 (n=204) | +0.03 |
| Complex+ | — (n=0) | — (n=0) | — | — (n=0) | — (n=0) | — |
Effort distribution shows Opus 4.6 allocates more tool calls to research (35.1% vs 28.3%) and fewer to implementation (17.5% vs 27.0%), consistent with the research-first approach visible in subagent type preferences.
| Metric | 4.5 | 4.6 |
|---|---|---|
| Research ratio | 0.283 | 0.351 |
| Implementation ratio | 0.270 | 0.175 |
| Front-load positive % | 868 tasks | 194 tasks |
| Measurement | Slice | Direction | Effect | padj | Sig |
|---|---|---|---|---|---|
| Tool Calls | task_type:refactor | opus-4-6 higher | 1.057 | 0.0009 | FDR |
| Files Touched | task_type:refactor | opus-4-6 higher | 0.808 | 0.0045 | FDR |
| Tool Calls | iteration:significant | opus-4-6 higher | 0.529 | 0.0000 | Bonf |
| Tool Calls | task_type:feature | opus-4-6 higher | 0.508 | 0.0005 | FDR |
| Tool Calls | task_type:bugfix | opus-4-6 higher | 0.507 | 0.0368 | FDR |
| One Shot Rate | complexity:trivial | opus-4-6 higher | 0.448 | 0.0000 | Bonf |
| Lines Per Minute | complexity:moderate | opus-4-5 higher | 0.442 | 0.0000 | Bonf |
| Duration Seconds | task_type:refactor | opus-4-6 higher | 0.435 | 0.0018 | FDR |
| Files Touched | complexity:simple | opus-4-5 higher | 0.423 | 0.0000 | Bonf |
| One Shot Rate | complexity:complex | opus-4-6 higher | 0.409 | 0.0016 | FDR |
| Tool Calls | complexity:complex | opus-4-6 higher | 0.405 | 0.0051 | FDR |
| Lines Per Minute | task_type:feature | opus-4-5 higher | 0.395 | 0.0127 | FDR |
| Files Touched | task_type:feature | opus-4-6 higher | 0.388 | 0.0226 | FDR |
| Autonomy Level | iteration:one_shot | distributions differ | 0.380 | 0.0000 | Bonf |
| Scope Management | complexity:simple | distributions differ | 0.379 | 0.0000 | Bonf |
| Files Touched | iteration:significant | opus-4-6 higher | 0.379 | 0.0043 | FDR |
| One Shot Rate | complexity:simple | opus-4-6 higher | 0.378 | 0.0001 | Bonf |
| Scope Management | iteration:one_shot | distributions differ | 0.378 | 0.0000 | Bonf |
| One Shot Rate | complexity:moderate | opus-4-6 higher | 0.378 | 0.0000 | Bonf |
| One Shot Rate | overall | opus-4-6 higher | 0.375 | 0.0000 | Bonf |
| Files Touched | complexity:complex | opus-4-6 higher | 0.336 | 0.0133 | FDR |
| Tool Calls | complexity:moderate | opus-4-6 higher | 0.328 | 0.0002 | Bonf |
| Scope Management | complexity:trivial | distributions differ | 0.324 | 0.0000 | Bonf |
| Tool Calls | task_type:investigation | opus-4-6 higher | 0.317 | 0.0000 | Bonf |
| Autonomy Level | complexity:trivial | distributions differ | 0.309 | 0.0000 | Bonf |
| Tool Calls | complexity:trivial | opus-4-6 higher | 0.301 | 0.0000 | Bonf |
| Tools Per File | complexity:trivial | opus-4-6 higher | 0.301 | 0.0000 | Bonf |
| Scope Management | overall | distributions differ | 0.296 | 0.0000 | Bonf |
| Tools Per File | task_type:sysadmin | opus-4-6 higher | 0.296 | 0.0213 | FDR |
| Communication Quality | complexity:trivial | distributions differ | 0.293 | 0.0000 | Bonf |
| Files Touched | task_type:investigation | equal | 0.279 | 0.0292 | FDR |
| Autonomy Level | complexity:complex | distributions differ | 0.263 | 0.0001 | Bonf |
| Autonomy Level | overall | distributions differ | 0.252 | 0.0000 | Bonf |
| Scope Management | complexity:moderate | distributions differ | 0.251 | 0.0000 | Bonf |
| Tools Per File | task_type:investigation | opus-4-6 higher | 0.248 | 0.0000 | Bonf |
| Tools Per File | iteration:significant | opus-4-6 higher | 0.246 | 0.0000 | Bonf |
| Lines Per Minute | complexity:simple | opus-4-5 higher | 0.244 | 0.0000 | Bonf |
| Tools Per File | complexity:simple | opus-4-6 higher | 0.240 | 0.0024 | FDR |
| Communication Quality | iteration:one_shot | distributions differ | 0.234 | 0.0000 | Bonf |
| Iteration Required | complexity:trivial | distributions differ | 0.230 | 0.0000 | Bonf |
| Scope Expanded Rate | complexity:moderate | opus-4-5 higher | 0.229 | 0.0422 | FDR |
| Iteration Required | complexity:moderate | distributions differ | 0.226 | 0.0002 | Bonf |
| Duration Seconds | task_type:feature | opus-4-6 higher | 0.224 | 0.0082 | FDR |
| Iteration Required | complexity:complex | distributions differ | 0.222 | 0.0048 | FDR |
| Iteration Required | complexity:simple | distributions differ | 0.222 | 0.0005 | FDR |
| Tool Calls | overall | opus-4-6 higher | 0.221 | 0.0000 | Bonf |
| Iteration Required | overall | distributions differ | 0.208 | 0.0000 | Bonf |
| Communication Quality | overall | distributions differ | 0.204 | 0.0000 | Bonf |
| Scope Management | complexity:complex | distributions differ | 0.200 | 0.0354 | FDR |
| Autonomy Level | complexity:moderate | distributions differ | 0.196 | 0.0022 | FDR |
| Autonomy Level | task_type:investigation | distributions differ | 0.192 | 0.0050 | FDR |
| Communication Quality | task_type:investigation | distributions differ | 0.191 | 0.0056 | FDR |
| Autonomy Level | complexity:simple | distributions differ | 0.174 | 0.0172 | FDR |
| Tools Per File | complexity:moderate | opus-4-6 higher | 0.174 | 0.0244 | FDR |
| Scope Management | task_type:investigation | distributions differ | 0.172 | 0.0229 | FDR |
| Scope Expanded Rate | overall | opus-4-5 higher | 0.155 | 0.0036 | FDR |
| Tool Calls | iteration:one_shot | opus-4-6 higher | 0.154 | 0.0000 | Bonf |
| Communication Quality | iteration:significant | distributions differ | 0.140 | 0.0007 | FDR |
| Autonomy Level | iteration:significant | distributions differ | 0.138 | 0.0003 | Bonf |
| Scope Management | iteration:significant | distributions differ | 0.133 | 0.0040 | FDR |
| Tools Per File | overall | opus-4-6 higher | 0.103 | 0.0000 | Bonf |
| Duration Seconds | iteration:significant | opus-4-6 higher | 0.085 | 0.0000 | Bonf |
| Duration Seconds | task_type:investigation | opus-4-6 higher | 0.053 | 0.0000 | Bonf |
| Tools Per File | iteration:one_shot | opus-4-6 higher | 0.051 | 0.0000 | Bonf |
| Duration Seconds | complexity:complex | opus-4-6 higher | 0.048 | 0.0000 | Bonf |
| Duration Seconds | complexity:moderate | opus-4-6 higher | 0.040 | 0.0075 | FDR |
| Duration Seconds | iteration:one_shot | opus-4-6 higher | 0.025 | 0.0000 | Bonf |
| Duration Seconds | overall | opus-4-6 higher | 0.000 | 0.0000 | Bonf |
| Measurement | Slice | Effect | padj |
|---|---|---|---|
| Scope Expanded Rate | task_type:greenfield | 0.806 | 0.2282 |
| Lines Per Minute | task_type:greenfield | 0.659 | 0.4521 |
| One Shot Rate | task_type:greenfield | 0.614 | 0.2233 |
| Tool Calls | task_type:greenfield | 0.537 | 0.3961 |
| Duration Seconds | task_type:greenfield | 0.488 | 0.4521 |
| Communication Quality | task_type:greenfield | 0.488 | 0.0929 |
| Scope Management | task_type:greenfield | 0.427 | 0.4438 |
| Tools Per File | task_type:greenfield | 0.416 | 0.4438 |
| Files Touched | task_type:bugfix | 0.374 | 0.1278 |
| Lines Per Minute | complexity:complex | 0.369 | 0.1271 |
| Autonomy Level | task_type:greenfield | 0.363 | 0.2819 |
| Tools Per File | task_type:feature | 0.361 | 0.1191 |
| Duration Seconds | task_type:bugfix | 0.357 | 0.0732 |
| Tools Per File | task_type:refactor | 0.327 | 0.5441 |
| Iteration Required | task_type:greenfield | 0.310 | 0.4134 |
| Autonomy Level | task_type:refactor | 0.308 | 0.1305 |
| Scope Management | task_type:refactor | 0.301 | 0.1452 |
| Scope Expanded Rate | task_type:refactor | 0.287 | 0.5674 |
| Scope Expanded Rate | task_type:investigation | 0.286 | 0.1743 |
| Scope Expanded Rate | complexity:complex | 0.264 | 0.1136 |
| Scope Expanded Rate | task_type:bugfix | 0.251 | 0.5441 |
| Lines Per Minute | task_type:bugfix | 0.248 | 0.6793 |
| Communication Quality | task_type:refactor | 0.214 | 0.4521 |
| Scope Expanded Rate | iteration:significant | 0.208 | 0.0578 |
| Lines Per Minute | iteration:minor | 0.189 | 0.3560 |
| Tools Per File | complexity:complex | 0.186 | 0.8365 |
| Duration Seconds | task_type:sysadmin | 0.182 | 0.1819 |
| Iteration Required | task_type:refactor | 0.177 | 0.6072 |
| Lines Per Minute | task_type:sysadmin | 0.168 | 0.4032 |
| Scope Management | task_type:feature | 0.163 | 0.2986 |
Different behavioral strategies raise the question of whether they lead to different outcomes. The next section examines completion rates, failure rates, and user satisfaction—the quality signals that the behavioral patterns should ultimately serve.
LLM-annotated alignment scores (1–5 scale) show Opus 4.6 scoring higher on average, an effect that survives Bonferroni correction (p=0.000714, d=−0.13). The failure-rate difference is also notable: 5.4% of 4.6 tasks fail vs 12.0% for 4.5 (p<0.001, also a Bonferroni survivor). Both alignment and failure rate are LLM-classified: a Claude Haiku model reads each session transcript and assigns scores. The broader "LLM quality judgement" approach was abandoned as unreliable (see §10), but alignment scoring proved more robust because it rates user-goal correspondence from observable signals rather than attempting to judge code quality directly.
Two categorical distributions—task completion and communication quality—also survive Bonferroni as chi-square tests, indicating the models differ in how they reach outcomes, not just in outcome rates. The satisfied-rate proportion test (p=0.044) reaches significance at α=0.05 but not after correction. All chi-square tests carry a low-expected-cell-count warning due to rare categories in the 20-status taxonomy.
| Outcome | Δ |
|---|---|
| Complete | B +22.0pp |
| Partial | A +8.9pp |
| Interrupted | A +6.6pp |
| Failed | B −6.6pp |
| Sentiment | Δ |
|---|---|
| Satisfied | A +3.4pp |
| Neutral | B +6.0pp |
| Dissatisfied | ≈ Tie |
Satisfaction trends slightly higher for Opus 4.5 in the proportion test but does not survive Bonferroni correction, and dissatisfaction rates are essentially tied. Both completion and sentiment are LLM-classified: a Claude Haiku annotator reads the full session transcript for each task, classifying completion status from a 20-category taxonomy and inferring user sentiment from contextual signals (follow-up messages, tone shifts, task abandonment patterns). These classifications were validated through human spot-checks of flagged cases, but no formal inter-rater reliability was computed.
Alignment score (1–5):

| Metric | Opus 4.5 | Opus 4.6 |
|---|---|---|
| Sample size | 1900 | 937 |
| Mean | 3.032 | 3.186 |
| Median | 3.0 | 3.0 |
| Std dev | 1.237 | 0.959 |
| Test statistic | Value |
|---|---|
| U statistic | 823192.5 |
| p-value | 0.000714 |
| Cohen's d | -0.134 |
| Effect size | negligible |
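The reported d can be reproduced from the summary statistics alone. A minimal sketch using the pooled-standard-deviation formula; the sign convention (4.5 minus 4.6, so a negative d means 4.6 scored higher) is inferred from the reported value:

```python
import math

def cohens_d(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Cohen's d with pooled standard deviation; negative means B is higher."""
    pooled_var = ((n_a - 1) * sd_a**2 + (n_b - 1) * sd_b**2) / (n_a + n_b - 2)
    return (mean_a - mean_b) / math.sqrt(pooled_var)

# Alignment-score summary statistics from the tables above
d = cohens_d(3.032, 1.237, 1900, 3.186, 0.959, 937)  # about -0.134
```

This matches the reported d=−0.134 to rounding.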
Complete rate:

| Metric | Opus 4.5 | Opus 4.6 |
|---|---|---|
| Proportion | 0.389 | 0.609 |
| Count | 739 / 1900 | 571 / 937 |
| 95% CI | [0.367, 0.411] | [0.578, 0.640] |
| Test statistic | Value |
|---|---|
| z statistic | -11.077 |
| p-value | 0.0000 |
| Cohen's h | -0.445 |
| Effect size | small |
Failed rate:

| Metric | Opus 4.5 | Opus 4.6 |
|---|---|---|
| Proportion | 0.120 | 0.054 |
| Count | 228 / 1900 | 51 / 937 |
| 95% CI | [0.106, 0.135] | [0.042, 0.071] |
| Test statistic | Value |
|---|---|
| z statistic | 5.516 |
| p-value | 0.0000 |
| Cohen's h | 0.236 |
| Effect size | small |
Satisfied rate:

| Metric | Opus 4.5 | Opus 4.6 |
|---|---|---|
| Proportion | 0.235 | 0.202 |
| Count | 447 / 1900 | 189 / 937 |
| 95% CI | [0.217, 0.255] | [0.177, 0.229] |
| Test statistic | Value |
|---|---|
| z statistic | 2.016 |
| p-value | 0.0438 |
| Cohen's h | 0.081 |
| Effect size | negligible |
Dissatisfied rate:

| Metric | Opus 4.5 | Opus 4.6 |
|---|---|---|
| Proportion | 0.118 | 0.108 |
| Count | 225 / 1900 | 101 / 937 |
| 95% CI | [0.105, 0.134] | [0.089, 0.129] |
| Test statistic | Value |
|---|---|
| z statistic | 0.835 |
| p-value | 0.4037 |
| Cohen's h | 0.034 |
| Effect size | negligible |
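The proportion tests above follow a standard pattern: a pooled two-sample z statistic plus Cohen's h, the difference of arcsine-transformed proportions. A minimal sketch using the failed-rate counts; that the report used the pooled-variance convention is an assumption, though it reproduces the reported z=5.516 and h=0.236 to rounding:

```python
import math

def two_proportion_test(x_a, n_a, x_b, n_b):
    """Pooled two-sample z statistic and Cohen's h for two proportions."""
    p_a, p_b = x_a / n_a, x_b / n_b
    p_pool = (x_a + x_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Cohen's h: difference of arcsine-transformed proportions
    h = 2 * math.asin(math.sqrt(p_a)) - 2 * math.asin(math.sqrt(p_b))
    return z, h

# Failed-rate counts from the tables above: 228/1900 vs 51/937
z, h = two_proportion_test(228, 1900, 51, 937)
```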
Note: the full categorical breakdown reduces to four completion statuses in this dataset (the Other bucket is empty); simplified counts are shown below.
| Category | Opus 4.5 | Opus 4.6 |
|---|---|---|
| Complete | 739 | 571 |
| Partial | 731 | 277 |
| Interrupted | 202 | 38 |
| Failed | 228 | 51 |
| Other | 0 | 0 |
| Test statistic | Value |
|---|---|
| χ² statistic | 139.581 |
| Degrees of freedom | 3 |
| p-value | 0.000 |
| Cramér's V | 0.222 |
| Effect size | small |
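The χ² and Cramér's V above can be recomputed directly from the simplified counts. A self-contained sketch in pure Python, with no continuity correction (the standard choice for tables larger than 2×2):

```python
def chi_square(table):
    """Pearson chi-square for an r x c contingency table (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, dof

def cramers_v(chi2, n, r, c):
    """Cramer's V effect size for an r x c chi-square."""
    return ((chi2 / n) / min(r - 1, c - 1)) ** 0.5

# Completion-status counts from the table above (rows: Opus 4.5, Opus 4.6)
counts = [[739, 731, 202, 228],
          [571, 277, 38, 51]]
chi2, dof = chi_square(counts)      # about 139.58 with dof = 3
v = cramers_v(chi2, 2837, 2, 4)     # about 0.222
```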
With 11 independent tests conducted (1 Mann-Whitney U, 9 proportion tests, 1 chi-square), the Bonferroni-corrected significance threshold is α = 0.05 / 11 = 0.0045.
Of the tests shown above, the alignment score (p=0.000714), complete rate, failed rate, and the completion-distribution chi-square fall below this threshold. The satisfied-rate test (p=0.044) is significant at α = 0.05 but not after correction; the dissatisfied-rate test (p=0.404) is non-significant.
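The correction procedure itself is mechanical enough to sketch. The helper below is illustrative, not the report's actual code; the p-values in the usage example are the ones reported in this section:

```python
def bonferroni_classify(p_values, m, alpha=0.05):
    """Split named tests into Bonferroni survivors, nominal-only, and
    non-significant, given the total family size m."""
    threshold = alpha / m
    survivors = {t for t, p in p_values.items() if p < threshold}
    nominal = {t for t, p in p_values.items() if threshold <= p < alpha}
    nonsig = {t for t, p in p_values.items() if p >= alpha}
    return threshold, survivors, nominal, nonsig

# Family of 11 tests; a few reported p-values from this section
thr, surv, nominal, nonsig = bonferroni_classify(
    {"alignment": 0.000714,
     "satisfied_rate": 0.0438,
     "dissatisfied_rate": 0.4037},
    m=11)
# thr is 0.05 / 11, about 0.0045; alignment survives, satisfied rate
# is nominal-only, dissatisfied rate is non-significant
```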
| Measurement | Slice | Direction | Effect | padj | Sig |
|---|---|---|---|---|---|
| Satisfaction Rate | task_type:greenfield | opus-4-6 higher | 1.242 | 0.0109 | FDR |
| Alignment Score | task_type:greenfield | opus-4-6 higher | 1.213 | 0.0230 | FDR |
| Complete Rate | task_type:greenfield | opus-4-6 higher | 0.963 | 0.0427 | FDR |
| Normalized User Sentiment | task_type:greenfield | distributions differ | 0.644 | 0.0318 | FDR |
| Satisfaction Rate | task_type:refactor | opus-4-6 higher | 0.643 | 0.0207 | FDR |
| Complete Rate | complexity:trivial | opus-4-6 higher | 0.607 | 0.0000 | Bonf |
| Good Execution Rate | complexity:moderate | opus-4-5 higher | 0.506 | 0.0000 | Bonf |
| Good Execution Rate | complexity:complex | opus-4-5 higher | 0.473 | 0.0003 | Bonf |
| Complete Rate | overall | opus-4-6 higher | 0.445 | 0.0000 | Bonf |
| Complete Rate | iteration:one_shot | opus-4-6 higher | 0.419 | 0.0000 | Bonf |
| Good Execution Rate | iteration:one_shot | opus-4-5 higher | 0.406 | 0.0000 | Bonf |
| Good Execution Rate | task_type:feature | opus-4-5 higher | 0.405 | 0.0060 | FDR |
| Complete Rate | complexity:moderate | opus-4-6 higher | 0.383 | 0.0000 | Bonf |
| Satisfaction Rate | iteration:one_shot | opus-4-5 higher | 0.350 | 0.0000 | Bonf |
| Complete Rate | complexity:simple | opus-4-6 higher | 0.309 | 0.0012 | FDR |
| Good Execution Rate | complexity:simple | opus-4-5 higher | 0.306 | 0.0015 | FDR |
| Satisfaction Rate | complexity:complex | opus-4-5 higher | 0.300 | 0.0307 | FDR |
| Alignment Score | iteration:significant | equal | 0.295 | 0.0002 | Bonf |
| Task Completion | complexity:trivial | distributions differ | 0.290 | 0.0000 | Bonf |
| Normalized Execution Quality | complexity:moderate | distributions differ | 0.267 | 0.0000 | Bonf |
| Alignment Score | complexity:trivial | equal | 0.260 | 0.0000 | Bonf |
| Normalized Execution Quality | complexity:complex | distributions differ | 0.260 | 0.0004 | FDR |
| Alignment Score | iteration:one_shot | opus-4-5 higher | 0.245 | 0.0000 | Bonf |
| Failed Rate | complexity:trivial | opus-4-5 higher | 0.237 | 0.0013 | FDR |
| Failed Rate | overall | opus-4-5 higher | 0.236 | 0.0000 | Bonf |
| Dissatisfaction Rate | iteration:one_shot | opus-4-6 higher | 0.226 | 0.0001 | Bonf |
| Task Completion | overall | distributions differ | 0.222 | 0.0000 | Bonf |
| Normalized Execution Quality | iteration:one_shot | distributions differ | 0.220 | 0.0000 | Bonf |
| Failed Rate | iteration:one_shot | opus-4-5 higher | 0.214 | 0.0012 | FDR |
| Normalized Execution Quality | task_type:feature | distributions differ | 0.211 | 0.0354 | FDR |
| Normalized User Sentiment | complexity:complex | distributions differ | 0.210 | 0.0096 | FDR |
| Good Execution Rate | overall | opus-4-5 higher | 0.207 | 0.0000 | Bonf |
| Normalized User Sentiment | iteration:one_shot | distributions differ | 0.199 | 0.0000 | Bonf |
| Task Completion | iteration:one_shot | distributions differ | 0.197 | 0.0000 | Bonf |
| Task Completion | complexity:moderate | distributions differ | 0.197 | 0.0001 | Bonf |
| Dissatisfaction Rate | complexity:moderate | opus-4-5 higher | 0.196 | 0.0431 | FDR |
| Normalized Execution Quality | complexity:simple | distributions differ | 0.191 | 0.0009 | FDR |
| Task Completion | complexity:simple | distributions differ | 0.153 | 0.0093 | FDR |
| Normalized User Sentiment | complexity:simple | distributions differ | 0.145 | 0.0161 | FDR |
| Task Completion | task_type:investigation | distributions differ | 0.134 | 0.0378 | FDR |
| Alignment Score | overall | equal | 0.134 | 0.0022 | FDR |
| Normalized Execution Quality | overall | distributions differ | 0.133 | 0.0000 | Bonf |
| Task Completion | iteration:significant | distributions differ | 0.129 | 0.0026 | FDR |
| Normalized User Sentiment | overall | distributions differ | 0.064 | 0.0236 | FDR |
| Measurement | Slice | Effect | padj |
|---|---|---|---|
| Dissatisfaction Rate | task_type:greenfield | 1.002 | 0.1256 |
| Failed Rate | task_type:refactor | 0.580 | 0.2222 |
| Failed Rate | task_type:greenfield | 0.562 | 0.4342 |
| Task Completion | task_type:greenfield | 0.555 | 0.0936 |
| Good Execution Rate | task_type:refactor | 0.436 | 0.1536 |
| Complete Rate | task_type:refactor | 0.415 | 0.1673 |
| Normalized User Sentiment | task_type:refactor | 0.312 | 0.1218 |
| Normalized Execution Quality | task_type:refactor | 0.309 | 0.2163 |
| Task Completion | task_type:refactor | 0.302 | 0.1427 |
| Dissatisfaction Rate | task_type:bugfix | 0.301 | 0.1964 |
| Normalized Execution Quality | task_type:greenfield | 0.289 | 0.4565 |
| Failed Rate | complexity:complex | 0.285 | 0.2163 |
| Good Execution Rate | task_type:sysadmin | 0.269 | 0.0855 |
| Complete Rate | complexity:complex | 0.256 | 0.0669 |
| Dissatisfaction Rate | task_type:refactor | 0.256 | 0.4534 |
| Satisfaction Rate | task_type:bugfix | 0.251 | 0.2282 |
| Alignment Score | task_type:bugfix | 0.240 | 0.3008 |
| Alignment Score | task_type:investigation | 0.205 | 0.0653 |
| Normalized Execution Quality | task_type:sysadmin | 0.197 | 0.0766 |
| Failed Rate | task_type:bugfix | 0.194 | 0.4342 |
| Satisfaction Rate | complexity:simple | 0.190 | 0.0670 |
| Failed Rate | task_type:investigation | 0.189 | 0.1352 |
| Complete Rate | task_type:investigation | 0.181 | 0.1195 |
| Failed Rate | complexity:moderate | 0.176 | 0.1022 |
| Failed Rate | complexity:simple | 0.161 | 0.1455 |
| Good Execution Rate | task_type:investigation | 0.159 | 0.1799 |
| Failed Rate | task_type:feature | 0.157 | 0.4205 |
| Alignment Score | task_type:refactor | 0.156 | 0.8416 |
| Good Execution Rate | task_type:greenfield | 0.154 | 0.7595 |
| Alignment Score | complexity:simple | 0.150 | 0.2909 |
Quality metrics paint a consistent-but-modest picture: 4.6 fails less and scores higher on alignment, but effect sizes are small (d=0.13) and the LLM-classification methodology adds a layer of uncertainty. The next section asks whether these quality differences manifest in the editing process itself.
Edit timeline analysis tracks every Edit and Write tool call, building per-file content ownership maps to detect when a model later overwrites its own earlier output. Opus 4.5 rewrites 18.2% of its edits vs 11.6% for Opus 4.6. Overlap classification reveals the rewrites are predominantly iterative refinement (64% for 4.5, largest category), not error recovery.
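The ownership-map idea can be sketched compactly. The function below is a simplified illustration, not the pipeline's actual implementation: it treats each edit as a line range per file and flags any edit that intersects a range the model already wrote in the same task.

```python
def find_overlaps(edits):
    """Detect edits that overwrite the model's own earlier output.

    edits: ordered list of (file, start_line, end_line) tuples for one task.
    Returns indices of edits whose range intersects an earlier edit to the
    same file -- the numerator of the rewrite rate.
    """
    owned = {}          # file -> list of (start, end) ranges already written
    overlapping = []
    for i, (path, start, end) in enumerate(edits):
        ranges = owned.setdefault(path, [])
        if any(start <= e and end >= s for s, e in ranges):
            overlapping.append(i)
        ranges.append((start, end))
    return overlapping

# The third edit re-touches lines 8-12 of a.py, overlapping the first edit
rewrites = find_overlaps([("a.py", 1, 10), ("b.py", 5, 8), ("a.py", 8, 12)])
# rewrites == [2]; rewrite rate = len(rewrites) / total edit calls
```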
| Metric | 4.5 | 4.6 |
|---|---|---|
| Tasks with edits | 700 | 246 |
| Edit calls (rewrite rate denom.) | 2,453 | 1,166 |
| Rewrite rate | 16.6% | 10.3% |
| Total overlapping edits | 407 | 120 |
| Self-corrections | 47 | 38 |
| Error recovery | 64 | 18 |
| User-directed corrections | 40 | 1 |
| Iterative refinement | 256 | 63 |
The overlap composition tells a more nuanced story than the headline rewrite rate. When Opus 4.6 does overlap, a larger share is self-correction (30.4% vs 10.1% for 4.5)—meaning 4.6 catches and fixes its own mistakes more explicitly. Opus 4.5’s overlaps are more heavily iterative refinement (64% vs 59%), suggesting gradual adjustment rather than correction. Error recovery rates are comparable (15.2% vs 10.3%).
| Metric | 4.5 | 4.6 |
|---|---|---|
| Tasks with edits | 767 | 368 |
| Edit calls (rewrite rate denom.) | 2,674 | 1,765 |
| Rewrite rate | 18.2% | 11.6% |
| Total overlapping edits | 486 | 204 |
| Self-corrections | 49 | 62 |
| Error recovery | 74 | 21 |
| User-directed corrections | 54 | 1 |
| Iterative refinement | 309 | 120 |
| Complexity | 4.5 (n) | 4.6 (n) |
|---|---|---|
| Trivial | 71 | 20 |
| Simple | 199 | 64 |
| Moderate | 322 | 160 |
| Complex | 157 | 103 |
| Major | 18 | 21 |
| Measurement | Slice | Direction | Effect | padj | Sig |
|---|---|---|---|---|---|
| Has Edits Rate | complexity:simple | opus-4-5 higher | 0.488 | 0.0000 | Bonf |
| Lines Removed | complexity:simple | equal | 0.387 | 0.0000 | Bonf |
| Lines Added | task_type:refactor | opus-4-6 higher | 0.340 | 0.0187 | FDR |
| Has Overlaps Rate | complexity:simple | opus-4-5 higher | 0.332 | 0.0013 | FDR |
| Has Edits Rate | complexity:moderate | opus-4-5 higher | 0.330 | 0.0002 | Bonf |
| Lines Added | complexity:moderate | opus-4-5 higher | 0.311 | 0.0001 | Bonf |
| Max Chain Depth | iteration:minor | equal | 0.287 | 0.0351 | FDR |
| Triage Score | complexity:simple | equal | 0.268 | 0.0012 | FDR |
| Max Chain Depth | complexity:simple | equal | 0.265 | 0.0013 | FDR |
| Rewrite Rate | complexity:simple | equal | 0.258 | 0.0015 | FDR |
| Rewrite Rate | iteration:minor | equal | 0.254 | 0.0404 | FDR |
| Triage Score | iteration:minor | equal | 0.252 | 0.0387 | FDR |
| Has Overlaps Rate | iteration:minor | opus-4-5 higher | 0.248 | 0.0481 | FDR |
| Overlap Count | iteration:minor | equal | 0.246 | 0.0458 | FDR |
| Has Overlaps Rate | complexity:moderate | opus-4-5 higher | 0.234 | 0.0124 | FDR |
| Overlap Count | complexity:simple | equal | 0.230 | 0.0016 | FDR |
| Lines Removed | complexity:moderate | opus-4-5 higher | 0.224 | 0.0002 | Bonf |
| Lines Added | complexity:simple | opus-4-5 higher | 0.206 | 0.0000 | Bonf |
| Max Chain Depth | complexity:moderate | equal | 0.194 | 0.0093 | FDR |
| Overlap Count | complexity:moderate | equal | 0.185 | 0.0118 | FDR |
| Rewrite Rate | complexity:moderate | equal | 0.166 | 0.0113 | FDR |
| Triage Score | complexity:moderate | equal | 0.147 | 0.0094 | FDR |
| Measurement | Slice | Effect | padj |
|---|---|---|---|
| Has Overlaps Rate | task_type:greenfield | 0.806 | 0.2282 |
| Max Chain Depth | task_type:greenfield | 0.579 | 0.2606 |
| Triage Score | task_type:greenfield | 0.579 | 0.2606 |
| Rewrite Rate | task_type:greenfield | 0.579 | 0.2606 |
| Overlap Count | task_type:greenfield | 0.545 | 0.2606 |
| Lines Added | task_type:bugfix | 0.529 | 0.4868 |
| Lines Removed | task_type:refactor | 0.452 | 0.4819 |
| Rewrite Rate | task_type:bugfix | 0.442 | 0.2823 |
| Rewrite Rate | task_type:refactor | 0.395 | 0.3869 |
| Has Edits Rate | task_type:greenfield | 0.318 | 0.5317 |
| Triage Score | task_type:bugfix | 0.308 | 0.3949 |
| Lines Removed | task_type:bugfix | 0.304 | 0.5674 |
| Triage Score | task_type:refactor | 0.279 | 0.4835 |
| Max Chain Depth | task_type:bugfix | 0.244 | 0.4525 |
| Max Chain Depth | task_type:refactor | 0.244 | 0.4758 |
| Has Overlaps Rate | task_type:sysadmin | 0.239 | 0.1818 |
| Max Chain Depth | task_type:sysadmin | 0.228 | 0.1799 |
| Rewrite Rate | task_type:sysadmin | 0.226 | 0.1799 |
| Has Overlaps Rate | task_type:refactor | 0.220 | 0.4876 |
| Lines Added | task_type:greenfield | 0.218 | 0.6607 |
| Overlap Count | task_type:refactor | 0.207 | 0.4541 |
| Has Edits Rate | complexity:complex | 0.205 | 0.1589 |
| Max Chain Depth | task_type:feature | 0.199 | 0.5094 |
| Overlap Count | task_type:sysadmin | 0.194 | 0.1828 |
| Has Edits Rate | task_type:feature | 0.189 | 0.2309 |
| Lines Removed | complexity:complex | 0.177 | 0.3578 |
| Triage Score | task_type:sysadmin | 0.174 | 0.1799 |
| Triage Score | task_type:investigation | 0.173 | 0.6125 |
| Lines Removed | iteration:significant | 0.171 | 0.4310 |
| Triage Score | task_type:feature | 0.165 | 0.5655 |
Edit patterns capture one dimension of how the models work; the next section broadens the lens to overall resource usage and complexity scaling.
| Complexity | 4.5 n | 4.6 n |
|---|---|---|
| Trivial | 882 | 346 |
| Simple | 381 | 209 |
| Moderate | 413 | 247 |
| Complex | 198 | 112 |
| Major | 26 | 23 |
Opus 4.6 sessions skew toward higher complexity: fewer trivial tasks (37% vs 46%) and proportionally more moderate tasks (26% vs 22%). This makes raw aggregate comparisons misleading—Opus 4.6 is tackling harder work on average.
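One way to check how much of an aggregate gap the mix explains is direct standardization: reweight 4.6's per-complexity means by 4.5's task mix. A sketch using the files-per-task figures from this section's complexity-scaling table:

```python
def standardized_mean(per_stratum_means, reference_counts):
    """Reweight per-stratum means by a reference population's stratum counts."""
    total = sum(reference_counts.values())
    return sum(per_stratum_means[s] * n / total
               for s, n in reference_counts.items())

# Opus 4.6 files/task per complexity, reweighted to Opus 4.5's task mix
files_46 = {"trivial": 0.1, "simple": 0.7, "moderate": 2.9,
            "complex": 7.2, "major": 21.5}
mix_45 = {"trivial": 882, "simple": 381, "moderate": 413,
          "complex": 198, "major": 26}
adjusted = standardized_mean(files_46, mix_45)  # about 1.86
```

Under 4.5's mix, 4.6's average files per task falls from the raw 2.3 to roughly 1.9, still above 4.5's 1.7, so the gap appears partly but not wholly mix-driven.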
| Metric | 4.5 | 4.6 | Δ |
|---|---|---|---|
| Avg tools per task | 9.6 | 13.3 | B +38% |
| Avg files per task | 1.7 | 2.3 | B +40% |
| Avg lines added | 104.0 | 100.3 | ≈ Tie |
| Complexity | 4.5 tasks | 4.6 tasks | 4.5 files/task | 4.6 files/task | 4.5 lines+/task | 4.6 lines+/task | 4.5 lines−/task | 4.6 lines−/task |
|---|---|---|---|---|---|---|---|---|
| Trivial | 882 | 346 | 0.1 | 0.1 | 0 | 0 | 0 | 0 |
| Simple | 381 | 209 | 1.0 | 0.7 | 14 | 10 | 8 | 3 |
| Moderate | 413 | 247 | 2.8 | 2.9 | 112 | 79 | 37 | 26 |
| Complex | 198 | 112 | 5.6 | 7.2 | 472 | 420 | 102 | 131 |
| Major | 26 | 23 | 15.9 | 21.5 | 2006 | 1096 | 140 | 271 |
Tool calls and tools/file are classified under the “behavior” theme in the cross-cut analysis. Their per-complexity, per-task-type, and per-iteration breakdowns appear in §4’s cross-cut detail (Behavioral Findings). Key results: the tool-call gap is largest for significantly-iterated tasks (d=0.53) and trivial complexity (d=0.30), both Bonferroni-significant.
The preceding sections examined behavioral, quality, and resource dimensions. The next section examines temporal patterns—how performance unfolds within and across sessions.
Task duration survives Bonferroni correction (p=0.000001), though the effect size is negligible (d=0.005)—a case of statistical significance without practical significance, driven by sample size. Opus 4.6 takes longer per task (median 62s vs 42s, a 46% increase). The explore phase runs 2.3× longer at median (71.0s vs 31.3s). Effort distribution shows 4.6 allocates more tool calls to research (35.1% vs 28.3%) and fewer to implementation (17.5% vs 27.0%). Active-time cost is $27.48/hour for 4.6 vs $25.52/hour for 4.5 (5-min idle threshold).
| Percentile | 4.5 | 4.6 |
|---|---|---|
| p10 | 8s | 10s |
| p25 | 15s | 20s |
| Median | 42s | 1.0m |
| p75 | 2.0m | 3.4m |
| p90 | 4.5m | 8.2m |
| Duration | 4.5 | 4.6 |
|---|---|---|
| Under 30s | 772 (42.1%) | 310 (34.4%) |
| 30s – 2m | 604 (33.0%) | 270 (29.9%) |
| 2m – 10m | 392 (21.4%) | 252 (27.9%) |
| 10m – 1h | 55 (3.0%) | 61 (6.8%) |
| Over 1h | 9 (0.5%) | 9 (1.0%) |
| Session Length | Alignment (4.5 / 4.6) | Sessions (4.5 / 4.6) |
|---|---|---|
| Short (1–3 tasks) | 2.93 / 3.53 | 135 / 26 |
| Medium (4–8 tasks) | 2.93 / 3.16 | 45 / 27 |
| Long (9+ tasks) | 2.85 / 2.96 | 58 / 11 |
| Phase | Tools/File (4.5 / 4.6) |
|---|---|
| Early (first 3 tasks) | 4.95 / 5.68 |
| Later (task 4+) | 4.50 / 5.11 |
| Idle threshold | 4.5 active hrs | 4.6 active hrs | 4.5 $/hr | 4.6 $/hr | Δ $/hr |
|---|---|---|---|---|---|
| 2 min | 181.1 | 88.7 | $27.53 | $29.38 | +7% |
| 5 min | 195.4 | 94.8 | $25.52 | $27.48 | +8% |
| 10 min | 212.7 | 103.2 | $23.45 | $25.25 | +8% |
| 20 min | 235.8 | 114.9 | $21.15 | $22.68 | +7% |
| 30 min | 249.4 | 124.8 | $20.00 | $20.87 | +4% |
| 60 min | 282.3 | 148.6 | $17.67 | $17.53 | −1% |
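The active-hours figures above hinge on the idle-threshold rule: time between tasks counts as active only up to the threshold. A minimal sketch of one plausible implementation (the report's exact accounting may differ):

```python
def active_hours(task_intervals, idle_threshold_s):
    """Active wall-clock hours: task durations plus inter-task gaps,
    with each gap capped at the idle threshold."""
    total = 0.0
    prev_end = None
    for start, end in sorted(task_intervals):
        if prev_end is not None and start > prev_end:
            # idle time between tasks counts only up to the threshold
            total += min(start - prev_end, idle_threshold_s)
        total += max(0.0, end - start)
        prev_end = end if prev_end is None else max(prev_end, end)
    return total / 3600.0

# Two 10-minute tasks separated by a 1-hour gap, 5-minute threshold:
# 10m + 5m (capped gap) + 10m = 25 minutes of active time
hrs = active_hours([(0, 600), (4200, 4800)], idle_threshold_s=300)
```

Dollars per active hour is then total session cost divided by this figure, which is why a larger threshold lowers the hourly rate.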
Context-window compaction occurs in 9.8% of 4.5 sessions (32/327) and 11.7% of 4.6 sessions (22/188). Pre/post comparisons show improvement after compaction, but a position-adjusted control group—splitting non-compacting sessions at the median compaction position to isolate position effects—reveals the effect is driven by session position, not compaction itself (position-adjusted effect: −0.17 for 4.5, −0.17 for 4.6). Compaction appears to preserve rather than degrade performance.
| Metric | 4.5 | 4.6 |
|---|---|---|
| Sessions with compaction | 32 / 327 | 22 / 188 |
| Total compaction events | 51 | 35 |
| Events per compacting session | 1.59 | 1.59 |
| Auto-triggered | 70.6% | 80.0% |
| Avg pre-compaction tokens | 156,823 | 164,617 |
| Avg position in session | 59.2% | 60.0% |
| Metric | 4.5 Compacting Δ | 4.5 Control Δ | 4.5 Net | 4.6 Compacting Δ | 4.6 Control Δ | 4.6 Net |
|---|---|---|---|---|---|---|
| Alignment score | +0.08 | +0.24 | −0.17 | +0.08 | +0.24 | −0.17 |
| Satisfaction rate | +5.1pp | +3.6pp | +1.4pp | +5.6pp | +7.3pp | −1.7pp |
| Completion rate | −0.0pp | +9.0pp | −9.0pp | −1.3pp | +9.8pp | −11.1pp |
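The net effects above subtract the control group's pre/post delta from the compacting group's. A toy sketch of that computation; illustrative only, with the median compaction position represented as a split fraction for the control sessions:

```python
def position_adjusted_effect(compacting, control, split_frac):
    """Pre/post delta in compacting sessions minus the same delta in
    control sessions split at a fixed position fraction.

    compacting: list of (scores, split_index) pairs, where split_index
    marks the first post-compaction task; control: lists of scores.
    """
    def mean(xs):
        return sum(xs) / len(xs)

    def delta(pre, post):
        return mean(post) - mean(pre)

    comp = mean([delta(s[:i], s[i:]) for s, i in compacting])
    ctrl = mean([delta(s[:int(len(s) * split_frac)],
                       s[int(len(s) * split_frac):]) for s in control])
    return comp - ctrl

# One compacting session improving by 0.5 after compaction, one control
# session improving by 1.0 at the same position: net effect is -0.5
net = position_adjusted_effect([([3, 3, 3, 4], 2)], [[3, 3, 4, 4]], 0.5)
```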
Duration is classified under the “behavior” theme in the cross-cut analysis. Per-task-type and per-iteration breakdowns appear in §4’s cross-cut detail (Behavioral Findings). Key results: the duration gap is largest for significantly-iterated tasks (median 99.0s vs 45.5s) and investigation tasks (median 79.9s vs 41.0s), both Bonferroni-significant. Effect sizes are negligible (d<0.1) despite significance—driven by sample size, not practical magnitude.
Session dynamics reveal a temporal dimension to the behavioral differences. The next section synthesizes all dimensions into overall model profiles.
Observed approach: Tends to act first and adjust as needed. Jumps to implementation with minimal upfront research.
Thinking: Thinks on 75% of tasks but shallowly (2,578 avg chars). Over-thinks trivial tasks (§3).
Subagents: 49% Explore, 32% general-purpose (implementation workers). Primarily autonomous (55%).
Planning: Rarely uses planning mode (1.8%). Distributes research evenly through the task (§4).
Observed strengths: Lower tool overhead (mean 8.9 calls/task). ~4.9% cheaper overall per task (§2). Stable performance across session lengths (§8).
Observed weaknesses: Higher rewrite rate (18.2%, §6). Higher failure rate (12.0% vs 5.4%, §5).
Observed approach: Tends to research first, then implement. Front-loads investigation before touching files.
Thinking: Thinks on 59% of tasks but deeply (4,067 avg chars). Better calibrated—skips thinking on trivial, engages on complex (§3).
Subagents: 69% Explore (read-only research), 20% general-purpose. More autonomous (84%).
Planning: Uses planning mode on 12.3% of tasks (115 of 937). 43% at complex, 65% at major (§4).
Observed strengths: Lower rewrite rate (11.6%, §6). Lower failure rate (5.4%, §5). Lower cost at trivial–moderate (§2).
Observed weaknesses: 38% more tool calls per task (§7). ~4.9% more expensive overall. Costlier at major complexity (n=23, §2).
| Task Type | Observed Pattern | Evidence & Caveats |
|---|---|---|
| Trivial / simple tasks | Similar completion rates | 4.6 is 28–35% cheaper (§2); n=882/346 and 381/209 |
| Complex / major tasks | 4.6 showed higher alignment | n=112+23 for 4.6 vs 198+26 for 4.5; confounded by project differences |
| Refactoring | 4.6 produced 2.1× output tokens | 5,647 vs 2,674 avg output (§2); lower rewrite rate (§6) |
| Investigation / research | 4.6 used more Explore agents | 69% read-only subagents (§4); 2.3× longer explore phase (§8) |
| Long sessions (9+ tasks) | Both show some degradation | Small sample for late-session tasks; 4.6 may degrade faster |
| Parallel execution | 4.6 backgrounded more tasks | 4.5 spawned more agents but ran them sequentially |
Task-level data cleaning applied four exclusion rules and four informational flags to canonical tasks before analysis. Exclusions remove tasks that do not represent genuine user-model interactions; flags annotate tasks with contextual metadata without removing them.
| Rule | Description | Opus 4.5 | Opus 4.6 |
|---|---|---|---|
| slash_command | Task prompt is a slash command (/command) or <command-name> tag — these invoke built-in features, not model reasoning | — | — |
| system_continuation | Automatic continuations triggered by the system (e.g., context compaction boundaries, session resumptions) rather than deliberate user prompts | — | — |
| empty_continuation | Bare acknowledgement prompts ("continue", "ok", "yes") with zero tool calls and <5s duration — the model produced no meaningful work | — | — |
| no_response_interrupt | Tasks where the model produced zero output (0 tool calls, 0 duration) before the session ended, typically user cancellations | — | — |
These flags are preserved on included tasks for subgroup analysis but do not trigger exclusion:
- meta — Task occurred within a meta-analysis session (e.g., this report's own development), where the model analyzed its own output
- no_project — No project directory was associated with the session
- interrupted — User interrupted the model mid-work (next message was [Request interrupted]). Reasons vary: accidental, correction, redirection, or technical issues
- post_compaction — Task occurred after a context compaction event in the same session, potentially with degraded context

A potential confound arises from unequal project coverage between models. To quantify this, a sensitivity analysis compares all statistical tests on the full dataset against a restricted subset containing only tasks from projects where both models were active. If results agree across both analyses, the project confound is unlikely to explain observed differences.
This section documents how each pipeline step works. Each step includes a summary of the approach and a collapsible detail block with thresholds, algorithms, and parameters.
Each Claude Code session was segmented into tasks at user-message boundaries. An LLM annotator (Haiku) then classified each task for complexity, type, sentiment, completion status, and alignment score (1–5 scale). Behavioral metrics—subagent usage, planning, parallelization—were extracted directly from tool-call logs.
Three independent signal sources feed into sentiment aggregation: the LLM annotator's classification, keyword patterns in user messages, and structural edit signals (user corrections, rewrite rate).
Aggregation uses downgrade logic: if edit signals contradict the LLM (e.g., user corrections present but LLM says “satisfied”), the combined score is downgraded. If rewrite rate >0.3 but execution quality is “excellent,” the quality score is downgraded to “good.”
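A minimal sketch of this downgrade logic, assuming hypothetical signal encodings (the real pipeline's field names and category labels may differ):

```python
def aggregate_sentiment(llm_sentiment: str, user_corrections: int,
                        rewrite_rate: float, execution_quality: str):
    """Combine LLM sentiment with mechanistic edit signals via downgrades."""
    # Edit signals contradicting the LLM downgrade the combined sentiment
    if llm_sentiment == "satisfied" and user_corrections > 0:
        llm_sentiment = "neutral"
    # A high rewrite rate caps execution quality at "good"
    if rewrite_rate > 0.3 and execution_quality == "excellent":
        execution_quality = "good"
    return llm_sentiment, execution_quality
```

The key property is that the LLM can only be downgraded by mechanistic signals, never upgraded, which biases the aggregate toward false negatives rather than false positives.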
All 529 statistical tests are run both at the overall level and stratified across three cross-cut dimensions. Each section’s “Cross-Cut Detail” expansion shows how its metrics behave under each slice.
| Dimension | Levels | Method |
|---|---|---|
| Complexity | trivial (≤3 tools, ≤1 file, ≤20 lines), simple (≤10, ≤3, ≤100), moderate (≤30, ≤10, ≤500), complex (≤80, ≤25, ≤2000), major (above all thresholds) | Metric thresholds on tool calls, files touched, and lines changed. Lowest matching tier wins. Keyword heuristics as tiebreaker. |
| Task type | investigation, bugfix, feature, greenfield, refactor, sysadmin, docs, continuation, port | LLM-classified (Haiku) from user prompt, tool usage, and work summary. Regex pattern matching provides initial signal; LLM classification overrides at medium/high confidence, resolving previously “unknown” tasks (33.6% of dataset). Eval: 100% unknown resolution, LLM agrees with regex on 55% of classified tasks. |
| Iteration | one_shot (no back-and-forth), minor (small corrections), significant (multiple rework cycles) | LLM-classified from the user’s next message after task completion, informed by edit signal heuristics (self-corrections, rewrite rate). |
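The complexity thresholds in the table above can be sketched as a lowest-matching-tier classifier (function and variable names are illustrative, and the keyword tiebreaker is omitted):

```python
# (tier name, max tool calls, max files touched, max lines changed)
TIERS = [
    ("trivial",   3,  1,   20),
    ("simple",   10,  3,  100),
    ("moderate", 30, 10,  500),
    ("complex",  80, 25, 2000),
]

def classify_complexity(tool_calls: int, files: int, lines: int) -> str:
    """Return the lowest tier whose thresholds all hold; otherwise 'major'."""
    for name, max_tools, max_files, max_lines in TIERS:
        if tool_calls <= max_tools and files <= max_files and lines <= max_lines:
            return name
    return "major"
```

Because the lowest matching tier wins, a task must exceed a threshold on at least one of the three metrics to escape a tier.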
The edit timeline reconstructs a per-file content ownership history from every Edit/Write tool call across all sessions. When a later edit’s old_string overlaps with content placed by an earlier edit, a rewrite is detected—providing a mechanistic signal for self-correction that doesn’t depend on sentiment classification.
Overlaps are matched via three tiers, evaluated in order:
| Tier | Method | Threshold |
|---|---|---|
| Exact | String equality between prior new_string and later old_string | 100% match |
| Containment | Substring match with size constraints | ≥40 chars AND ≥30% of larger string |
| Line overlap | Jaccard coefficient on non-trivial lines (>15 chars) | Jaccard >0.3 OR coverage >0.5 |
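A sketch of the three-tier matcher, using the thresholds from the table (the function name and the exact definition of "coverage" are assumptions, not the pipeline's actual code):

```python
def overlap_tier(prior_new: str, later_old: str):
    """Classify how a later edit's old_string overlaps a prior edit's
    new_string. Returns the matching tier name, or None."""
    # Tier 1: exact string equality
    if prior_new == later_old:
        return "exact"
    # Tier 2: containment with size constraints
    small, large = sorted((prior_new, later_old), key=len)
    if len(small) >= 40 and len(small) >= 0.3 * len(large) and small in large:
        return "containment"
    # Tier 3: Jaccard overlap on non-trivial lines (>15 chars)
    lines_a = {l for l in prior_new.splitlines() if len(l) > 15}
    lines_b = {l for l in later_old.splitlines() if len(l) > 15}
    if lines_a and lines_b:
        shared = lines_a & lines_b
        jaccard = len(shared) / len(lines_a | lines_b)
        coverage = len(shared) / min(len(lines_a), len(lines_b))
        if jaccard > 0.3 or coverage > 0.5:
            return "line_overlap"
    return None
```

Evaluating the tiers in order means cheaper, higher-precision matches short-circuit the fuzzier line-level comparison.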
Each detected overlap is classified by context as a self-correction, an error recovery, or a user correction — the categories weighted in the triage score below.
A per-task triage score weights these: (self_corrections×3 + error_recoveries×2 + user_corrections×5 + max_chain_depth) / total_edits. Edit metrics were joined with task classifications to compute complexity-binned accuracy rates (100% coverage for both models).
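That formula, transcribed directly (with a guard against zero-edit tasks; the parameter names mirror the formula rather than any actual pipeline API):

```python
def triage_score(self_corrections: int, error_recoveries: int,
                 user_corrections: int, max_chain_depth: int,
                 total_edits: int) -> float:
    """Weighted per-task triage score; user corrections weigh heaviest."""
    if total_edits == 0:
        return 0.0
    return (self_corrections * 3 + error_recoveries * 2
            + user_corrections * 5 + max_chain_depth) / total_edits
```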
Claude Code compacts conversation context when token limits approach. This analysis measures whether compaction degrades task outcomes or merely correlates with session position.
86 compact_boundary system messages were found across 54 compacting sessions, with trigger type, pre-compaction token count, and session position extracted for each. Outcome impact was measured by splitting tasks into pre/post groups at the first compaction timestamp. A control group of non-compacting sessions, split at the median compaction position, isolates position effects from compaction effects.
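The pre/post split and its position-matched control might look like this sketch (task records and field names are hypothetical):

```python
def split_pre_post(tasks, first_compaction_ts):
    """Split a compacting session's tasks at its first compaction timestamp."""
    pre = [t for t in tasks if t["ts"] < first_compaction_ts]
    post = [t for t in tasks if t["ts"] >= first_compaction_ts]
    return pre, post

def control_split(tasks, median_position):
    """Control: split a NON-compacting session at the median compaction
    position, isolating session-position effects from compaction effects."""
    return tasks[:median_position], tasks[median_position:]
```

If outcomes degrade after the split in both groups, position explains the drop; degradation only in the compacting group would implicate compaction itself.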
529 tests were conducted across overall, per-complexity, and cross-cut strata, using Bonferroni correction, the most conservative multiple-comparison standard, to minimize false positives given the observational design.
Three test types were used: chi-square for categorical distributions (effect size: Cramér’s V), Mann-Whitney U for continuous metrics (Cohen’s d with bootstrap confidence intervals, n=5,000 resamples), and two-proportion Z-tests for rates (Cohen’s h). Confidence intervals on proportions use Wilson score intervals.
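A hedged sketch of how these tests and effect sizes might be computed with scipy and numpy, on illustrative synthetic data (none of this is the pipeline's actual code):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Mann-Whitney U on a continuous metric, plus Cohen's d and a bootstrap CI
a = rng.normal(8.9, 4.0, 1900)   # illustrative: model A tool calls/task
b = rng.normal(12.9, 5.0, 937)   # illustrative: model B tool calls/task
u_stat, p_mw = stats.mannwhitneyu(a, b, alternative="two-sided")
pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
d = (a.mean() - b.mean()) / pooled_sd  # negative: B higher

# Bootstrap CI on model B's mean (5,000 resamples, as in the text)
boot = [rng.choice(b, size=b.size, replace=True).mean() for _ in range(5000)]
ci_low, ci_high = np.percentile(boot, [2.5, 97.5])

# Cohen's h for two proportions (arcsine-transformed difference)
def cohens_h(p1, p2):
    return 2 * np.arcsin(np.sqrt(p1)) - 2 * np.arcsin(np.sqrt(p2))

# Wilson score interval for a proportion
def wilson(successes, n, z=1.96):
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    margin = z * np.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - margin, center + margin

# Bonferroni threshold for 529 tests
threshold = 0.05 / 529   # ~9.45e-05, matching the corrected threshold below
```

For example, cohens_h(0.120, 0.054) is roughly 0.24, in line with the failed_rate effect size reported in the table.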
Bonferroni corrected threshold: p<0.0000945. Across all 529 tests, 141 survive Bonferroni and 234 survive FDR correction. At the overall level, 21 survive Bonferroni, including alignment score (p=0.000714), duration (p<0.000001, though d=-0.000—negligible practical effect), tool calls/task (p<0.000001, d=−0.22), tools/file (p<0.000001, d=−0.10), and three categorical distributions (task completion, communication quality, autonomy level). Two of the three chi-square survivors have low-expected-cell-count warnings, which may inflate their test statistics.
| Test Category | Field | p-value | Effect Size | Bonferroni | Result |
|---|---|---|---|---|---|
| Mann-Whitney U | alignment_score | 0.000714 | d = -0.1337 | | Opus 4.6 higher (p < 0.05); CI A: [3.0, 3.1], CI B: [3.1, 3.2] |
| | duration_seconds | 0.000000 | d = -0.0000 | ✓ | Opus 4.6 lower (Bonferroni significant); CI A: [154.1, 550.6], CI B: [206.9, 499.5] |
| | tool_calls | 0.000000 | d = -0.2214 | ✓ | Opus 4.6 higher (Bonferroni significant); CI A: [8.2, 9.7], CI B: [11.6, 14.2] |
| | files_touched | 0.049942 | d = -0.1628 | | Opus 4.6 higher (p < 0.05); CI A: [1.5, 1.8], CI B: [2.0, 2.7] |
| | lines_added | 0.835121 | d = 0.0092 | | No significant difference; CI A: [86.2, 125.4], CI B: [82.8, 119.8] |
| | lines_removed | 0.147622 | d = -0.0929 | | No significant difference; CI A: [19.3, 25.4], CI B: [23.3, 37.4] |
| | lines_per_minute | 0.121768 | d = 0.1424 | | No significant difference; CI A: [36.9, 43.8], CI B: [26.4, 33.9] |
| | tools_per_file | 0.000000 | d = -0.1033 | ✓ | Opus 4.6 higher (Bonferroni significant); CI A: [4.0, 4.7], CI B: [4.7, 5.4] |
| Proportion Test | satisfaction_rate | 0.043843 | h = 0.0813 | | Opus 4.6 lower (p < 0.05); A: 23.5% [21.7%, 25.5%], B: 20.2% [17.7%, 22.9%] |
| | dissatisfaction_rate | 0.403717 | h = 0.0336 | | No significant difference; A: 11.8% [10.5%, 13.4%], B: 10.8% [8.9%, 12.9%] |
| | complete_rate | 0.000000 | h = -0.4445 | ✓ | Opus 4.6 higher (Bonferroni significant); A: 38.9% [36.7%, 41.1%], B: 60.9% [57.8%, 64.0%] |
| | failed_rate | 0.000000 | h = 0.2365 | ✓ | Opus 4.6 lower (Bonferroni significant); A: 12.0% [10.6%, 13.5%], B: 5.4% [4.2%, 7.1%] |
| | scope_expanded_rate | 0.001178 | h = 0.1551 | | Opus 4.6 lower (p < 0.05); A: 1.8% [1.3%, 2.5%], B: 0.3% [0.1%, 0.9%] |
| | one_shot_rate | 0.000000 | h = -0.3747 | ✓ | Opus 4.6 higher (Bonferroni significant); A: 42.0% [39.8%, 44.2%], B: 60.6% [57.5%, 63.7%] |
| | good_execution_rate | 0.000000 | h = 0.2075 | ✓ | Opus 4.6 lower (Bonferroni significant); A: 31.5% [29.4%, 33.6%], B: 22.3% [19.8%, 25.1%] |
| Chi-square | task_completion | 0.000000 | V = 0.2218 | ✓ | Distribution differs (p < 0.05, V = 0.2218) |
| | scope_management | 0.000000 | V = 0.2963 | ✓ | Distribution differs (p < 0.05, V = 0.2963) (low cell counts) |
| | iteration_required | 0.000000 | V = 0.2076 | ✓ | Distribution differs (p < 0.05, V = 0.2076) (low cell counts) |
| | error_recovery | 0.000268 | V = 0.1399 | | Distribution differs (p < 0.05, V = 0.1399) (low cell counts) |
| | communication_quality | 0.000000 | V = 0.2038 | ✓ | Distribution differs (p < 0.05, V = 0.2038) (low cell counts) |
| | autonomy_level | 0.000000 | V = 0.2522 | ✓ | Distribution differs (p < 0.05, V = 0.2522) (low cell counts) |
To validate robustness, all overall-level tests were re-run on a restricted dataset excluding the 8 overlapping projects. This tests whether findings depend on the specific project mix or are stable across the data.
Overall Bonferroni survivors: 15 (full dataset) vs 16 (restricted).
| Metric | Test | Full p | Restricted p | Persists? |
|---|---|---|---|---|
| Task Completion | Chi Square | 3.00e-06 | 0.00e+00 | Yes |
| Communication Quality | Chi Square | 0.00e+00 | 0.00e+00 | Yes |
| Autonomy Level | Chi Square | 0.00e+00 | 0.00e+00 | Yes |
| Alignment Score | Mann Whitney | 1.00e-06 | 0.00e+00 | Yes |
| Duration Seconds | Mann Whitney | 0.00e+00 | 0.00e+00 | Yes |
| Tool Calls | Mann Whitney | 0.00e+00 | 0.00e+00 | Yes |
| Tools Per File | Mann Whitney | 0.00e+00 | 0.00e+00 | Yes |
| Total Output Tokens | Mann Whitney | 0.00e+00 | 0.00e+00 | Yes |
| Total Input Tokens | Mann Whitney | 0.00e+00 | 0.00e+00 | Yes |
| Thinking Chars | Mann Whitney | 0.00e+00 | 4.30e-05 | Yes |
| Request Count | Mann Whitney | 0.00e+00 | 0.00e+00 | Yes |
| Cost Per Minute | Mann Whitney | 1.00e-06 | 2.90e-05 | Yes |
| Output Per Request | Mann Whitney | 0.00e+00 | 0.00e+00 | Yes |
| Cache Hit Rate | Mann Whitney | 0.00e+00 | 0.00e+00 | Yes |
| Thinking Fraction | Mann Whitney | 0.00e+00 | 0.00e+00 | Yes |
| # | Measurement | Theme | Direction | Effect | padj | Sig |
|---|---|---|---|---|---|---|
| 1 | Thinking Fraction | Thinking | opus-4-5 higher | 0.636 medium | 0.0000 | Bonf |
| 2 | Complete Rate | Quality | opus-4-6 higher | 0.445 small | 0.0000 | Bonf |
| 3 | Total Output Tokens | Cost | opus-4-6 higher | 0.388 small | 0.0000 | Bonf |
| 4 | One Shot Rate | Behavior | opus-4-6 higher | 0.375 small | 0.0000 | Bonf |
| 5 | Output Per Request | Cost | opus-4-6 higher | 0.299 small | 0.0000 | Bonf |
| 6 | Scope Management | Behavior | distributions differ | 0.296 small | 0.0000 | Bonf |
| 7 | Autonomy Level | Behavior | distributions differ | 0.252 small | 0.0000 | Bonf |
| 8 | Cost Per Minute | Cost | opus-4-5 higher | 0.237 small | 0.0000 | Bonf |
| 9 | Failed Rate | Quality | opus-4-5 higher | 0.236 small | 0.0000 | Bonf |
| 10 | Task Completion | Quality | distributions differ | 0.222 small | 0.0000 | Bonf |
| 11 | Tool Calls | Behavior | opus-4-6 higher | 0.221 small | 0.0000 | Bonf |
| 12 | Iteration Required | Behavior | distributions differ | 0.208 small | 0.0000 | Bonf |
| 13 | Good Execution Rate | Quality | opus-4-5 higher | 0.207 small | 0.0000 | Bonf |
| 14 | Communication Quality | Behavior | distributions differ | 0.204 small | 0.0000 | Bonf |
| 15 | Cache Hit Rate | Cost | equal | 0.192 negligible | 0.0000 | Bonf |
| 16 | Request Count | Cost | opus-4-6 higher | 0.190 negligible | 0.0000 | Bonf |
| 17 | Scope Expanded Rate | Behavior | opus-4-5 higher | 0.155 negligible | 0.0036 | FDR |
| 18 | Error Recovery | | distributions differ | 0.140 negligible | 0.0009 | FDR |
| 19 | Thinking Chars | Thinking | opus-4-5 higher | 0.140 negligible | 0.0000 | Bonf |
| 20 | Alignment Score | Quality | equal | 0.134 negligible | 0.0022 | FDR |
| 21 | Normalized Execution Quality | Quality | distributions differ | 0.133 negligible | 0.0000 | Bonf |
| 22 | Total Input Tokens | Cost | opus-4-5 higher | 0.119 negligible | 0.0000 | Bonf |
| 23 | Tools Per File | Behavior | opus-4-6 higher | 0.103 negligible | 0.0000 | Bonf |
| 24 | Normalized User Sentiment | Quality | distributions differ | 0.064 negligible | 0.0236 | FDR |
| 25 | Duration Seconds | Behavior | opus-4-6 higher | 0.000 negligible | 0.0000 | Bonf |
| Measurement | Theme | Effect | padj |
|---|---|---|---|
| Files Touched | Behavior | 0.163 | 0.1005 |
| Lines Per Minute | Behavior | 0.142 | 0.2091 |
| Rewrite Rate | Editing | 0.126 | 0.0528 |
| Triage Score | Editing | 0.123 | 0.0587 |
| Max Chain Depth | Editing | 0.094 | 0.0690 |
| Lines Removed | Editing | 0.093 | 0.2366 |
| Has Overlaps Rate | Editing | 0.082 | 0.0899 |
| Satisfaction Rate | Quality | 0.081 | 0.0906 |
| Has Edits Rate | Editing | 0.080 | 0.0929 |
| Estimated Cost | Cost | 0.047 | 0.8292 |
| Overlap Count | Editing | 0.042 | 0.0894 |
| Dissatisfaction Rate | Quality | 0.034 | 0.5134 |
| Lines Added | Editing | 0.009 | 0.8647 |
This analysis was developed iteratively. Two early approaches were replaced after proving unreliable:
LLM-only dissatisfaction detection: Initial LLM-based sentiment classification flagged 7–9% dissatisfaction for both models. An audit of all 59 flagged cases revealed 73–93% false positive rates—the classifiers were fooled by task-coordination language (e.g., “fix” in subagent prompts). This was replaced by the current multi-signal approach, which requires corroboration from keyword patterns and structural edit signals before classifying dissatisfaction.
LLM quality judgement: An LLM judge was asked to compare code quality between models. The judge lacked sufficient context to evaluate whether code met domain requirements and produced confident but ungrounded assessments. This was replaced by mechanistic edit timeline analysis, which detects self-corrections from the tool-call record rather than relying on subjective quality assessment.
The analysis pipeline is fully automated and can reproduce all tables and statistics from the raw session data.
Dependencies: scipy (`pip install scipy`).

```shell
# Run the full pipeline
python scripts/run_pipeline.py --data-dir comparisons/opus-4.5-vs-4.6/data

# Run without LLM calls (tables and stats only; cached annotations reused)
python scripts/run_pipeline.py --data-dir comparisons/opus-4.5-vs-4.6/data --no-llm

# Run from a specific step onward
python scripts/run_pipeline.py --data-dir comparisons/opus-4.5-vs-4.6/data --from stats

# Run specific steps
python scripts/run_pipeline.py --data-dir comparisons/opus-4.5-vs-4.6/data --steps dataset,update,report

# Check what needs re-running
python scripts/run_pipeline.py --data-dir comparisons/opus-4.5-vs-4.6/data --check-stale
```
| Step | LLM? | Estimated Cost | --no-llm behavior |
|---|---|---|---|
| collect | No | $0 | Runs normally |
| extract | No | $0 | Runs normally |
| classify | No | $0 | Runs normally |
| annotate | Yes (Haiku) | ~$7.50 | Skipped (uses cached annotations) |
| analyze | No | $0 | Runs normally |
| tokens | No | $0 | Runs normally |
| enrich | No | $0 | Runs normally |
| stats | No | $0 | Runs normally |
| findings | No | $0 | Runs normally |
| dataset | No | $0 | Runs normally |
| update | Partial (Opus) | ~$2.00 | Tables only, no LLM expression authoring |
| report | No | $0 | Runs normally |
All statistical results, tables, and charts are deterministic (no LLM). Only task annotation and expression authoring use LLM calls. The --no-llm flag produces identical quantitative results at zero API cost. Total full-pipeline LLM cost: ~$9.50.
Annotate cost estimated by reconstructing all 3,153 annotation prompts from canonical task data, measuring character counts of prompts (~3,700 chars median) and cached responses (~1,500 chars median), converting at ~4 chars/token, and applying Haiku 4.5 pricing ($0.80/MTok input, $4.00/MTok output). Includes ~20% backfill rate for task-type classification calls. Update cost estimated from the annotated template size (~490K chars, ~122K tokens input) with Opus 4.6 pricing ($15/MTok input, $75/MTok output); one LLM call per pipeline run.
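Reproducing the annotate-step arithmetic from those figures (this back-of-envelope uses only the numbers stated above and excludes the ~20% backfill):

```python
prompts = 3153
input_chars, output_chars = 3700, 1500       # median chars per prompt/response
chars_per_token = 4

in_mtok = prompts * input_chars / chars_per_token / 1e6    # millions of input tokens
out_mtok = prompts * output_chars / chars_per_token / 1e6  # millions of output tokens

# Haiku 4.5 pricing: $0.80/MTok input, $4.00/MTok output
base_cost = in_mtok * 0.80 + out_mtok * 4.00
# base_cost is about $7.06; the backfill calls bring it to the quoted ~$7.50
```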
LLM-in-the-loop analysis: All task classification, sentiment analysis, and alignment scoring was performed by LLM agents (Claude Haiku and Sonnet). This creates a circularity concern: Claude models are classifying Claude model outputs. No formal inter-rater reliability was computed. Human spot-checks validated flagged cases, but systematic bias between models (e.g., if the classifier is more generous toward outputs that resemble its own style) cannot be ruled out. All three overall chi-square Bonferroni survivors and the alignment score depend on LLM-generated categories.
Single user: All data comes from one developer’s workflow. Results may not generalize to other users, codebases, or task distributions.
Temporal confound: Opus 4.5 spans 70 days; Opus 4.6 spans 13 days. A productive week, a particular project focus, or simply the novelty of a new model could color all 937 Opus 4.6 tasks simultaneously. The null hypothesis—that all observed differences reflect the user’s changing work patterns rather than model capabilities—cannot be rejected by this design.
Observational, not experimental: Tasks were not randomly assigned to models. Opus 4.6 was used later chronologically and on different (often harder) tasks, confounding model effects with task effects.
Complexity confound: Opus 4.6’s different complexity mix (41% moderate-and-above vs 34% for Opus 4.5) inflates its resource usage metrics and may suppress its satisfaction scores. Complexity-stratified comparisons (presented throughout as cross-cut detail) partially control for this, but cannot fully separate model effects from mix effects.
Platform evolution: The Claude Code SDK evolved between December 2025 and February 2026. Changes to system prompts, available tools, or subagent defaults could contribute to behavioral differences attributed to the models.
Sample asymmetry: The 2.0:1 ratio (1,900 vs 937 tasks) means Opus 4.5 estimates have narrower confidence intervals. Effect sizes for Opus 4.6 are less precise.
User learning effect: The user may have learned to use Claude Code more effectively over time, benefiting whichever model came second in the chronological sequence.
Thanks to Anthropic for including me in the Claude Code Early Access Program and for supporting independent research into model behavior. The EAP provided early access to Opus 4.6, making this comparative analysis possible.