Claude Code EAP Report

Opus 4.5 → 4.6: What Changed in Practice?

A single-user behavioral study across 2,837 Claude Code tasks
Samuel H. Christie V · February 2026 · Claude Code Early Access Program

Executive Summary

- Per-Task Cost: $2.44 vs $2.56; 4.6 saves 13–37% at trivial–moderate, +35% at major (§2)
- Output Token Ratio: 2.5×; 4.6 averages 2,293 output tokens per task vs 933 for 4.5
- Tool Calls / Task: +44%; per-task mean 12.9 vs 8.9 (incl. subagents: 13.3 vs 9.6) (§7)
- Bonferroni Survivors: 21 / 529 at the overall level; 141 including per-complexity and cross-cut strata
Before you read: Most numbers, tables, and statistical results in this report are computed from analysis data by deterministic Python scripts and bound to the prose via template expressions—a tool used to reduce transcription errors and keep claims grounded in the data. LLMs assisted with prose drafting and with converting literal numbers to data-bound expressions. All data comes from a single user’s workflow over a limited time period—treat all findings as anecdotal observations, not generalizable conclusions.

What Changes When You Switch

Across 529 statistical tests (overall, per-complexity, and cross-cut strata), 141 survive Bonferroni correction—21 at the overall level. Most describe how the model works, not whether it succeeds. The overall success rates are comparable; what changes is the experience of working alongside it. These are the five most noticeable differences, drawn from one user’s workflow over 2,837 tasks.
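The multiple-comparison arithmetic behind the survivor counts can be sketched in a few lines. The α = 0.05 family-wise level is an assumption; the report does not state which level was used.

```python
# Sketch of the Bonferroni arithmetic behind the survivor counts.
# alpha = 0.05 is an assumption; the report does not state the level.

def bonferroni_threshold(alpha: float, m: int) -> float:
    """Per-test significance threshold after correcting for m comparisons."""
    return alpha / m

def survivors(p_values, alpha=0.05):
    """p-values still significant after Bonferroni correction."""
    cutoff = bonferroni_threshold(alpha, len(p_values))
    return [p for p in p_values if p < cutoff]

# With 529 tests, the per-test bar drops to roughly 9.45e-5.
threshold = bonferroni_threshold(0.05, 529)
```

Under this assumption, a raw p-value must fall below ~0.0000945 to survive, which is a far stricter bar than the nominal 0.05.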

1. The model plans before it acts—and you stop steering it

Opus 4.6 uses formal planning mode on 12.3% of tasks vs 1.8% for 4.5, rising to 43% at complex and 65% at major difficulty. It front-loads codebase investigation with a 2.3× longer explore phase, deploying subagents that are 69% read-only researchers (vs 49% for 4.5). The practical effect is a shift from interactive collaboration to delegation: you issue a prompt and return to find completed work rather than course-correcting mid-task. This shows up in the data as fewer user-directed corrections across all complexity levels (§4, §8).

2. It thinks less often, but more carefully

The largest overall effect across all 529 tests is thinking fraction (d=0.64, medium, §3). Opus 4.5 activates extended thinking on 75% of requests regardless of difficulty; 4.6 activates on 59% but averages 4,067 characters when it does (vs 2,578). On trivial tasks, 4.6 often skips thinking entirely. On complex tasks, it thinks deeply. This calibration means compute is allocated where it matters rather than spread uniformly across every interaction.

3. Fewer rewrites, better first-attempt accuracy

Opus 4.6 rewrites its own edits 11.6% of the time vs 18.2%—a 36% reduction. Its self-correction rate is actually higher (3.5% vs 1.8%), meaning it catches its own mistakes rather than having the user point them out. Failure rates drop from 12.0% to 5.4%, and alignment scores improve significantly (p=0.000714, one of 21 overall Bonferroni survivors). The “plan first” approach appears to pay off in execution accuracy (§5, §6).

4. Sessions get longer—you trust it with more

Median task duration rises 46% (62s vs 42s), with fewer ultra-short interactions (34% of tasks under 30 seconds vs 42%). The task mix shifts toward moderate-to-complex work issued in a single instruction, and 4.6 runs more tasks in the background for parallel execution. This isn’t purely a model capability difference—it’s a workflow adaptation. When the model handles larger tasks reliably, the user gives it larger tasks, waits longer, and intervenes less. The 7× increase in planning mode (§4) and 44% more tool calls per task (§7) are partly a consequence of this delegation shift.

5. Cost stays flat where it counts

Despite 2.5× more output tokens and 44% more tool calls per task, 4.6 is 13–37% cheaper at trivial through moderate complexity—the bulk of daily work. The reason is counterintuitive: output tokens account for just 6.7% of per-task cost, while cache operations account for 93–97%. Opus 4.6 writes 29% less to cache (the most expensive token category at $18.75/MTok), more than offsetting its higher output and cache reads. Cost only tips higher at 30+ API requests, where cumulative cache reads compound past the write savings. Overall per-task cost is $2.56 vs $2.44: functionally neutral for a meaningfully different style of work (§2).

The data is consistent with a tentative characterization: Opus 4.5 acts first and adjusts, while Opus 4.6 investigates first and implements in concentrated bursts. The confounded study design means this framing is a hypothesis, not a conclusion. The analysis was iterative—several initial findings were revised or reversed when more direct signals became available (§10).

How This Report Works

This report is itself a Claude Code project. The analysis pipeline, statistical tests, table generation, and report assembly are all automated Python scripts, most written with substantial assistance from Opus 4.6—the same model being evaluated. LLMs are used in two places: task classification (Haiku annotates complexity, sentiment, and task type) and prose drafting. Most quantitative claims—numbers, tables, and statistical tests—are produced by deterministic computation, not LLM generation. All data comes from one user’s real Claude Code sessions during and after the Early Access Program—not synthetic benchmarks or controlled experiments.

Session Logs (518 sessions) → Task Extraction (2,837 tasks) → LLM Classification (complexity, sentiment) → Behavioral Analysis (edits, planning, subagents) → Token Extraction (cost, verbosity) → Statistical Tests (529 comparisons) → Report Build (terms, expansions)

A 12-step pipeline transforms raw JSONL session logs into the finished report. A few things worth noting about the approach:

What to Trust

Most numbers, tables, and statistical results are computed deterministically from analysis JSON files and are reproducible by re-running the pipeline. The expression system binds prose to data paths, which helps catch drift but is not a guarantee of correctness. Interpretive prose was drafted with LLM assistance and may contain errors or overstatements. Effect sizes and p-values are exact; narrative claims linking those numbers to causal explanations are hypotheses, not conclusions. A sensitivity analysis validates key findings against restricted datasets excluding shared projects. The Methodology section describes every step in full detail.
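The expression-binding idea can be illustrated with a minimal sketch. The report does not show its actual expression syntax; the `{dotted.path}` form and the `resolve`/`render` helpers below are hypothetical stand-ins for the real pipeline.

```python
# Minimal sketch of prose-to-data binding. The {dotted.path} syntax
# and these helper names are hypothetical, not the pipeline's real API.
import re

def resolve(path: str, data: dict):
    """Walk a dotted path (e.g. 'cost.per_task') through nested dicts."""
    node = data
    for key in path.split("."):
        node = node[key]
    return node

def render(template: str, data: dict) -> str:
    """Replace each {dotted.path} with the value from the analysis data."""
    return re.sub(r"\{([\w.]+)\}",
                  lambda m: str(resolve(m.group(1), data)), template)

analysis = {"cost": {"per_task": {"opus45": 2.44, "opus46": 2.56}}}
line = render("Per-task cost: ${cost.per_task.opus45} vs ${cost.per_task.opus46}",
              analysis)
```

Rebuilding the analysis JSON then re-rendering the prose is what makes a stale number show up as a diff rather than a silent transcription error.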

1. Dataset at a Glance

- Sessions Analyzed: 518 (329 for 4.5 + 189 for 4.6)
- Tasks Extracted: 2,837 (1,900 for 4.5 + 937 for 4.6)
- Projects Spanned: 41 (10 shared between both models)
- Total API Cost: $8,060 ($5,222 for 4.5 + $2,839 for 4.6)

All data comes from a single user's organic Claude Code sessions between December 2025 and February 2026. The dataset is intentionally asymmetric: Opus 4.5 served as the primary model for two months, while Opus 4.6 entered evaluation in early February. This means Opus 4.5 totals are larger in absolute terms, but per-task and per-session comparisons normalize for this. Where sample size limits statistical power, the report notes it explicitly.

The 13-day concentration of the Opus 4.6 data creates a temporal clustering concern: a productive stretch, a particular project focus, or simply the novelty of a new model could color all 937 tasks simultaneously. The report treats tasks as independent observations, but short collection windows make this assumption weaker for 4.6 than for 4.5’s 70-day span.

Per-model composition

The dataset reflects organic usage patterns, not a controlled experiment. Opus 4.5 accumulated sessions over two months of daily use; Opus 4.6 entered evaluation in early February 2026.

Metric Opus 4.5 Opus 4.6 Combined
Sessions 329 189 518
Tasks 1,900 937 2,837
Tasks / session 5.8 5.0 5.5
Projects 29 22 41
Date range Dec 5 – Feb 13 Feb 3 – Feb 16 Dec 5 – Feb 16
User prompts 1,928 855 2,783
API turns 20,834 13,861 34,695
Tool calls 18,298 12,472 30,770

The 2:1 session ratio means per-task averages for Opus 4.5 are more robust, while Opus 4.6 estimates carry wider confidence intervals. Opus 4.6 sessions are concentrated across 22 projects (all of which also have Opus 4.5 sessions), providing natural overlap for matched-pair comparisons where they apply.

Task type and complexity distribution

By task type

Tasks are classified by primary type using heuristic pattern matching on prompts, tool usage, and file operations. "Unknown" tasks lacked clear classification signals.

Type            4.5 count   4.6 count   4.5 %    4.6 %
Continuation    587         225         30.9%    24.0%
Investigation   463         217         24.4%    23.2%
Feature         216         104         11.4%    11.1%
Bugfix          205         61          10.8%    6.5%
Sysadmin        188         129         9.9%     13.8%
Docs            102         16          5.4%     1.7%
Refactor        54          48          2.8%     5.1%
Greenfield      30          33          1.6%     3.5%
Port            5           8           0.3%     0.9%
Unknown         50          96          2.6%     10.2%

By complexity

Complexity is inferred from tool count, files touched, and lines changed. Over half of all tasks are trivial (single-turn interactions), while major tasks (>50 tool calls or >500 lines) represent ~1% of volume but a significant share of cost.
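Only the "major" thresholds (>50 tool calls or >500 lines) are stated in the report; the lower cutoffs in this sketch are illustrative assumptions, not the pipeline's actual values.

```python
# Sketch of the complexity heuristic. Only the "major" thresholds
# (>50 tool calls or >500 lines changed) come from the report;
# the lower cutoffs below are assumed for illustration.

def classify_complexity(tool_calls: int, lines_changed: int) -> str:
    if tool_calls > 50 or lines_changed > 500:
        return "major"
    if tool_calls > 20 or lines_changed > 200:   # assumed cutoff
        return "complex"
    if tool_calls > 8 or lines_changed > 50:     # assumed cutoff
        return "moderate"
    if tool_calls > 2:                           # assumed cutoff
        return "simple"
    return "trivial"                             # single-turn interactions
```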

Complexity   4.5 count   4.5 %    4.6 count   4.6 %
Trivial      882         46.4%    346         36.9%
Simple       381         20.1%    209         22.3%
Moderate     413         21.7%    247         26.4%
Complex      198         10.4%    112         12.0%
Major        26          1.4%     23          2.5%

The task type distributions are broadly similar across models, suggesting the user's work patterns remained consistent. The complexity mix is also comparable, though Opus 4.6 has a slightly higher share of moderate-and-above tasks (40.8% vs 33.5%), likely reflecting the evaluation period's focus on substantive work rather than quick queries.

Token volumes and code output

Raw token volumes across the full dataset. These are absolute totals, not per-task averages (see §2 for normalized comparisons).

Metric Opus 4.5 Opus 4.6 Combined
Output tokens 2.0M 2.5M 4.5M
Input tokens (fresh) 666,412 157,109 823,521
Cache read tokens 1.26B 878.7M 2.14B
Cache write tokens 143.9M 53.2M 197.2M
Total API cost $5,221.55 $2,838.61 $8,060.16

Output composition

Model output splits into thinking (extended thinking / chain-of-thought, not billed as output) and text (visible response, code, tool calls). Estimated from character counts with a 3:1 chars-to-tokens ratio for thinking.

Metric Opus 4.5 Opus 4.6
Est. thinking tokens 1,375,051 885,062
Est. text tokens 672,072 464,654
Thinking ratio (tasks using thinking) 74.8% 58.9%
Avg requests / task 7.4 9.5
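The thinking estimates above can be reproduced from the stated 3:1 chars-to-tokens ratio; integer division is an assumption about how the pipeline rounds.

```python
# Thinking output is estimated from character counts at the stated
# 3:1 chars-to-tokens ratio. Integer division is an assumption.

def est_thinking_tokens(thinking_chars: int) -> int:
    return thinking_chars // 3

# e.g. 4.6's average deep-thinking burst of 4,067 chars (§3)
avg_burst_tokens = est_thinking_tokens(4067)
```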

Code output

Metric Opus 4.5 Opus 4.6 Combined
Files touched 3,162 2,183 5,345
Lines added 197,538 93,984 291,522
Lines removed 42,320 28,173 70,493

Cache reads dominate the token budget: 91% of all tokens processed were served from cache rather than freshly encoded. This reflects Claude Code's prompt architecture, where the system prompt and conversation history are re-sent with each API call but largely hit the prompt cache.
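The 91% figure follows directly from the dataset totals above, assuming "tokens processed" means fresh input plus cache reads, cache writes, and output.

```python
# Recomputing the 91% cache share from the combined totals above.
# "Tokens processed" is assumed to mean fresh input + cache reads
# + cache writes + output.
cache_reads  = 2_140_000_000   # 2.14B combined
cache_writes =   197_200_000   # 197.2M combined
fresh_input  =       823_521
output       =     4_500_000   # 4.5M combined

total = cache_reads + cache_writes + fresh_input + output
cache_share = cache_reads / total   # ~0.91
```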

With the dataset in view, we turn to what the token data reveals about how each model allocates its computational budget.

2. Token Economy & Cost

Opus 4.6 costs ~4.9% more per task on average ($2.56 vs $2.44), despite producing 2.5× more output tokens and making more API round-trips (9.5 vs 7.4 requests/task). But this aggregate masks a complexity-dependent pattern: at trivial through moderate levels, 4.6 is 13–37% cheaper, driven by superior cache economics—not output efficiency. Output tokens account for less than 7% of per-task cost; cache operations account for ~93%. 4.6 achieves a leaner cache footprint, writing 29% fewer tokens at the most expensive token category. The cost advantage reverses at complex and major tiers, where accumulated cache reads over many requests outweigh the write savings.

- Per-Task Cost: $2.44 vs $2.56; 4.6 is ~4.9% more expensive overall, cheaper at trivial–moderate
- Output Token Ratio: 2.5×; 4.6 produces 2,293 vs 933 avg tokens per task
- Per-Request Output: 1.9×; 241 vs 125 tokens/request

Output Verbosity by Task Type

Task type       4.5 avg   4.6 avg   4.6/4.5
Feature         2,544     6,031     2.4×
Greenfield      1,952     8,590     4.4×
Refactor        2,674     5,647     2.1×
Bugfix          1,133     3,773     3.3×
Investigation   640       1,298     2.0×
Continuation    595       1,291     2.2×
Sysadmin        398       1,102     2.8×
Port            7,175     1,680     0.2×
Docs            824       784       1.0×

The 2.1× ratio for refactoring suggests more thorough changes per task, and greenfield shows the largest gap (4.4×). For continuation tasks (follow-ups within a session), Opus 4.6 produces 2.2× the output volume of Opus 4.5. The Port reversal (0.2×) rests on only 13 tasks and should not be over-read.

Cost by Complexity

Complexity   4.5 avg cost   4.6 avg cost   Δ
Trivial      $0.87          $0.55          −37%
Simple       $2.21          $1.49          −33%
Moderate     $3.69          $3.19          −13%
Complex      $7.31          $7.95          +9%
Major        $13.65         $18.47         +35%

Session-Hour Cost

Normalizing by session hours rather than task count: Opus 4.5 costs $12.57/session-hour ($5,222 over 415.4h) vs $6.74/session-hour for Opus 4.6 ($2,839 over 421.0h). Session hours measure wall-clock time from first to last message, so this metric includes idle time and is not a direct measure of active coding cost.
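The session-hour rates follow directly from the stated totals:

```python
# Reproducing the session-hour normalization from the stated totals.
cost_45, hours_45 = 5_221.55, 415.4
cost_46, hours_46 = 2_838.61, 421.0

rate_45 = cost_45 / hours_45   # ~$12.57 per session-hour
rate_46 = cost_46 / hours_46   # ~$6.74 per session-hour
```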

Cost pattern (with caveats): Despite producing 2.5× more output and costing ~4.9% more per task overall, Opus 4.6 is 13–37% cheaper at trivial through moderate complexity. The explanation is structural: output accounts for just 7% of cost while cache operations account for 93–97%, and 4.6 writes 29% less to cache (the most expensive token category at $18.75/MTok). At major tiers, accumulated cache reads over 30+ requests exceed the write savings (+35%), but this is based on 23 vs 26 tasks. These figures come from organic sessions where task mix and session structure differ between models.

Per-Request Output

Complexity   4.5 tokens/request   4.6 tokens/request   Ratio
Trivial      61                   94                   1.5×
Simple       91                   149                  1.6×
Moderate     129                  234                  1.8×
Complex      146                  305                  2.1×
Major        162                  254                  1.6×

Per-request output survives Bonferroni correction (d=-0.30, small). Opus 4.6 produces more tokens per API round-trip at every complexity level, concentrating work into larger responses rather than many small incremental calls.

Why Output Doesn’t Drive Cost

The 2.5× output difference seems like it should dominate the cost comparison, but output tokens are a minor cost component. Cache operations dwarf everything else:

Component     Price/MTok   4.5 tokens/task   4.5 cost   4.6 tokens/task   4.6 cost   % of total
Input         $15.00       312               $0.005     142               $0.002     <1%
Output        $75.00       933               $0.070     2,293             $0.172     3–7%
Cache read    $1.875       589K              $1.10      792K              $1.49      45–58%
Cache write   $18.75       67K               $1.26      48K               $0.90      35–52%
Total                                        $2.44                        $2.56

Three forces offset to produce the $0.12 net difference. 4.6 writes 29% fewer tokens to cache per task (48K vs 67K), saving $0.36 at the most expensive category ($18.75/M—10× the read price). But 4.6 reads 34% more cached context (792K vs 589K), costing $0.38 at the cheapest category ($1.875/M). And the 2.5× output increase costs only $0.10. The write savings nearly cancel the read and output increases: −$0.36 + $0.38 + $0.10 = +$0.12.
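The per-task totals can be reproduced from the component table's prices and token counts:

```python
# Reproducing the per-task cost totals from the component table above.
PRICE_PER_MTOK = {"input": 15.00, "output": 75.00,
                  "cache_read": 1.875, "cache_write": 18.75}

def task_cost(tokens: dict) -> float:
    """Dollar cost of one task from per-category token counts."""
    return sum(tokens[k] * PRICE_PER_MTOK[k] / 1e6 for k in tokens)

opus_45 = {"input": 312, "output": 933,
           "cache_read": 589_000, "cache_write": 67_000}
opus_46 = {"input": 142, "output": 2_293,
           "cache_read": 792_000, "cache_write": 48_000}

cost_45 = task_cost(opus_45)   # ~$2.44
cost_46 = task_cost(opus_46)   # ~$2.56
```

Note how the cache terms dominate both sums: even 2.5× more output moves the total by only about a dime.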

Why does 4.6 write less to cache? Per-request analysis reveals two mechanisms. First, 4.6 starts tasks with a leaner cache footprint—its first API request writes 50% fewer cache tokens than 4.5’s (24K vs 48K), suggesting a more compact or reusable context structure. Second, the first-request write penalty is steeper for 4.5—11.4× above its steady-state rate vs 7.2× for 4.6—so 4.5 pays a larger per-task initialization tax. The likely explanation: 4.6’s “investigate then execute” pattern creates more compact, reusable context, while 4.5’s incremental approach builds a larger accumulated context that costs more to establish.

This mechanism produces the complexity-dependent curve above. At trivial through moderate tiers (typically ≤30 API requests), lean initialization dominates and 4.6 is 13–37% cheaper. At major complexity, tasks average 50–70 API requests and cumulative cache reads compound past the initialization savings—4.6 reads 34% more cached context per task overall, and over enough round-trips this gap overwhelms the write savings.

Cache Behavior Analysis

Two hypotheses were tested to explain 4.6's superior cache economics: (1) 4.6 front-loads reads, keeping cache warm for subsequent requests; (2) 4.5 experiences more cache cooling between turns, causing expensive re-writes.

Hypothesis 1: Front-Loading (Partially Supported)

The hypothesis that 4.6 front-loads cache reads is weakly supported—4.6 concentrates 17.8% of cache reads in the first request vs 15.2% for 4.5. But the dominant signal is on the write side: 4.6's first request writes 49% fewer cache tokens (24.6K vs 48.1K median), and its overall cache hit rate is 89.5% vs 81.3%.

Metric                             Opus 4.5   Opus 4.6   Ratio
First-request cache read (avg)     52,486     55,652     1.06×
First-request cache write (avg)    48,108     24,561     0.51×
Overall cache hit rate             81.3%      89.5%      +8.2pp
Read/write ratio at position 15    33.2       46.7       1.4×

As sessions extend, 4.6 maintains a better read/write ratio—by request position 15, it achieves 46.7× reads per write vs 33.2× for 4.5. This suggests 4.6 writes proportionally less new cache as sessions progress.

Cost Crossover by Request Count

Grouping tasks by API request count confirms the crossover at 30+ requests:

Request tier     4.5 avg cost   4.6 avg cost   Δ
1 request        $0.81          $0.33          −60%
2–3 requests     $1.25          $0.83          −34%
4–10 requests    $2.35          $2.04          −14%
11–30 requests   $5.31          $4.81          −9%
30+ requests     $11.61         $13.16         +13%

4.6's advantage is largest on single-request tasks (−60%), where its lean initialization is maximally visible, and erodes steadily as request count increases. At 30+ requests, 4.6 averages 52.7 requests/task vs 47.0 for 4.5, with slightly higher per-request cache reads (89K vs 80K)—enough to tip the balance.

Hypothesis 2: Cache Cooling (Mechanism Supported, Premise Not)

Both models experience cache cooling gaps (>5 minutes between task transitions) at nearly identical rates—26.7% for 4.5 vs 24.4% for 4.6. So the premise that 4.5 “stops more often” is not supported. However, the impact of cooling differs dramatically:

Condition                       4.5 cache write fraction   4.6 cache write fraction
After cold start (>5 min gap)   38.6%                      11.9%
After warm start (≤5 min gap)   6.0%                       5.7%
Cold/warm inflation             6.4×                       2.1×

After warm starts, both models behave identically (~6% write fraction). After cold starts, 4.5’s write fraction jumps to 38.6%—roughly 6.4× inflation—while 4.6 reaches only 11.9% (2.1×). In absolute terms: 4.5 re-writes 177K tokens after a cold start vs 87K for 4.6. This suggests 4.5 accumulates a larger context payload that is more expensive to reconstruct when cache expires.
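The cold/warm split and the inflation ratios can be sketched directly from the table's definitions:

```python
# Sketch of the cold/warm classification and inflation ratios above.
# A task transition counts as "cold" when more than 5 minutes elapse.
COLD_GAP_SECONDS = 5 * 60

def is_cold_start(gap_seconds: float) -> bool:
    return gap_seconds > COLD_GAP_SECONDS

inflation_45 = 38.6 / 6.0   # cold vs warm cache-write fraction, ~6.4x
inflation_46 = 11.9 / 5.7   # ~2.1x
```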

The combined picture: 4.6’s “lean initialization” (fewer first-request writes) and “cold resistance” (smaller re-cache payload) together produce its cache advantage. Both effects point to the same underlying cause: 4.6’s concentrated work style creates a more compact context that costs less to establish and re-establish.

Cache Efficiency by Complexity

Thinking tokens are billed as output but do not enter the conversation history or affect cache behavior. Visible text output accumulates in history and increases subsequent input size. Cache writes ($18.75/MTok) are 10× more expensive than cache reads ($1.875/MTok).

Complexity   4.5 write/task   4.6 write/task   4.5 read %   4.6 read %   Write ratio (4.6/4.5)
Trivial      36,586           17,363           72.1%        86.7%        0.47×
Simple       72,027           35,710           85.9%        91.9%        0.50×
Moderate     96,046           62,833           90.8%        93.9%        0.65×
Complex      163,372          134,872          92.8%        94.9%        0.83×
Major        228,664          238,348          95.3%        96.6%        1.04×
Overall      67,294           47,994           89.7%        94.3%        0.71×
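The overall "read %" figures are consistent with the §1 dataset totals, assuming read % means cache reads divided by all tokens sent upward (reads + writes + fresh input); that formula is an inference, not stated in the report.

```python
# Recomputing the overall read % from the §1 totals. The formula
# read % = reads / (reads + writes + fresh input) is an assumption.

def cache_read_pct(reads: float, writes: float, fresh: float) -> float:
    return 100 * reads / (reads + writes + fresh)

pct_45 = cache_read_pct(1.26e9, 143.9e6, 666_412)   # ~89.7%
pct_46 = cache_read_pct(878.7e6, 53.2e6, 157_109)   # ~94.3%
```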
Token breakdown by complexity level

Per-task average token usage driving the cost differences. Opus 4.6 produces more output but uses less fresh input, relying more on cached context:

Complexity 4.5 output/task 4.6 output/task 4.5 thinking chars 4.6 thinking chars 4.5 input/task 4.6 input/task 4.5 requests 4.6 requests
Trivial 93 164 839 1,108 14 14 1.5 1.7
Simple 508 783 1,684 1,593 160 107 5.6 5.2
Moderate 1,467 2,798 3,828 3,916 554 275 11.4 12.0
Complex 3,766 9,118 7,264 9,911 1,332 462 25.9 29.9
Major 8,754 18,291 10,637 9,859 1,522 394 54.0 72.0

At the complex level, Opus 4.6 uses fewer fresh input tokens per task (462 vs 1,332) while producing 142% more output (9,118 vs 3,766). Despite similar request counts per task (29.9 vs 25.9), Opus 4.6 achieves much more effective cache utilization, and the output cost premium is offset by input savings.

Caveat: These averages come from organic sessions with different task mixes per model. Some of the per-complexity gap may reflect session-level factors (e.g., caching benefits accumulate within longer sessions).

Cross-Cut Detail

Cost Findings
Measurement           Slice                     Direction         Effect   p_adj    Sig
Total Output Tokens   task_type:bugfix          opus-4-6 higher   0.849    0.0001   Bonf
Total Output Tokens   complexity:complex        opus-4-6 higher   0.838    0.0000   Bonf
Request Count         task_type:refactor        opus-4-6 higher   0.770    0.0110   FDR
Output Per Request    task_type:refactor        opus-4-6 higher   0.692    0.0011   FDR
Cache Hit Rate        task_type:greenfield      equal             0.671    0.0022   FDR
Total Output Tokens   task_type:refactor        opus-4-6 higher   0.661    0.0001   Bonf
Total Output Tokens   task_type:feature         opus-4-6 higher   0.644    0.0000   Bonf
Estimated Cost        task_type:refactor        opus-4-6 higher   0.638    0.0307   FDR
Total Output Tokens   complexity:moderate       opus-4-6 higher   0.531    0.0000   Bonf
Total Output Tokens   iteration:significant     opus-4-6 higher   0.516    0.0000   Bonf
Request Count         iteration:significant     opus-4-6 higher   0.493    0.0000   Bonf
Output Per Request    task_type:feature         opus-4-6 higher   0.487    0.0000   Bonf
Total Output Tokens   iteration:one_shot        opus-4-6 higher   0.421    0.0000   Bonf
Cost Per Minute       complexity:complex        opus-4-5 higher   0.407    0.0017   FDR
Estimated Cost        complexity:simple         opus-4-5 higher   0.400    0.0000   Bonf
Total Output Tokens   task_type:sysadmin        opus-4-6 higher   0.394    0.0000   Bonf
Total Output Tokens   overall                   opus-4-6 higher   0.388    0.0000   Bonf
Output Per Request    task_type:bugfix          opus-4-6 higher   0.371    0.0005   FDR
Cost Per Minute       complexity:moderate       opus-4-5 higher   0.365    0.0001   Bonf
Output Per Request    complexity:moderate       opus-4-6 higher   0.345    0.0000   Bonf
Total Input Tokens    complexity:complex        opus-4-5 higher   0.344    0.0000   Bonf
Output Per Request    iteration:minor           opus-4-6 higher   0.344    0.0000   Bonf
Output Per Request    complexity:complex        opus-4-6 higher   0.338    0.0000   Bonf
Total Output Tokens   task_type:investigation   opus-4-6 higher   0.325    0.0000   Bonf
Total Output Tokens   complexity:simple         opus-4-6 higher   0.322    0.0000   Bonf
Output Per Request    task_type:investigation   opus-4-6 higher   0.321    0.0000   Bonf
Total Output Tokens   iteration:minor           opus-4-6 higher   0.319    0.0000   Bonf
Request Count         task_type:feature         opus-4-6 higher   0.305    0.0458   FDR
Output Per Request    iteration:one_shot        opus-4-6 higher   0.302    0.0000   Bonf
Estimated Cost        iteration:significant     opus-4-6 higher   0.300    0.0124   FDR
Output Per Request    overall                   opus-4-6 higher   0.299    0.0000   Bonf
Output Per Request    iteration:significant     opus-4-6 higher   0.296    0.0000   Bonf
Cost Per Minute       iteration:one_shot        opus-4-5 higher   0.278    0.0000   Bonf
Cost Per Minute       complexity:simple         opus-4-5 higher   0.277    0.0039   FDR
Cache Hit Rate        complexity:trivial        equal             0.270    0.0000   Bonf
Output Per Request    task_type:sysadmin        opus-4-6 higher   0.269    0.0000   Bonf
Estimated Cost        complexity:trivial        opus-4-5 higher   0.264    0.0486   FDR
Output Per Request    complexity:simple         opus-4-6 higher   0.251    0.0000   Bonf
Cost Per Minute       iteration:significant     opus-4-5 higher   0.250    0.0138   FDR
Total Output Tokens   complexity:trivial        opus-4-6 higher   0.238    0.0000   Bonf
Cost Per Minute       overall                   opus-4-5 higher   0.237    0.0000   Bonf
Cache Hit Rate        iteration:significant     equal             0.232    0.0000   Bonf
Total Input Tokens    task_type:refactor        opus-4-5 higher   0.226    0.0443   FDR
Request Count         complexity:trivial        opus-4-6 higher   0.223    0.0001   Bonf
Cache Hit Rate        task_type:investigation   equal             0.221    0.0000   Bonf
Cost Per Minute       complexity:trivial        opus-4-5 higher   0.214    0.0029   FDR
Cache Hit Rate        task_type:refactor        equal             0.197    0.0000   Bonf
Cache Hit Rate        iteration:minor           equal             0.192    0.0000   Bonf
Cache Hit Rate        overall                   equal             0.192    0.0000   Bonf
Request Count         overall                   opus-4-6 higher   0.190    0.0000   Bonf
Request Count         task_type:sysadmin        opus-4-6 higher   0.189    0.0364   FDR
Total Input Tokens    task_type:bugfix          opus-4-5 higher   0.189    0.0000   Bonf
Total Input Tokens    complexity:moderate       opus-4-5 higher   0.173    0.0000   Bonf
Cache Hit Rate        complexity:complex        equal             0.170    0.0000   Bonf
Cost Per Minute       iteration:minor           opus-4-5 higher   0.168    0.0267   FDR
Cache Hit Rate        task_type:bugfix          equal             0.162    0.0000   Bonf
Cost Per Minute       task_type:investigation   opus-4-5 higher   0.161    0.0096   FDR
Request Count         iteration:one_shot        opus-4-6 higher   0.159    0.0001   Bonf
Cache Hit Rate        iteration:one_shot        equal             0.159    0.0000   Bonf
Output Per Request    complexity:trivial        opus-4-6 higher   0.158    0.0000   Bonf
Total Input Tokens    iteration:one_shot        opus-4-5 higher   0.144    0.0000   Bonf
Cache Hit Rate        task_type:sysadmin        equal             0.137    0.0000   Bonf
Request Count         task_type:investigation   opus-4-6 higher   0.135    0.0005   FDR
Cache Hit Rate        complexity:moderate       equal             0.126    0.0000   Bonf
Total Input Tokens    iteration:minor           opus-4-5 higher   0.124    0.0000   Bonf
Total Input Tokens    overall                   opus-4-5 higher   0.119    0.0000   Bonf
Cache Hit Rate        task_type:feature         equal             0.086    0.0000   Bonf
Total Input Tokens    iteration:significant     opus-4-5 higher   0.076    0.0000   Bonf
Total Input Tokens    complexity:simple         opus-4-5 higher   0.072    0.0000   Bonf
Total Input Tokens    task_type:investigation   opus-4-5 higher   0.065    0.0000   Bonf
Cache Hit Rate        complexity:simple         equal             0.020    0.0000   Bonf
Total Input Tokens    task_type:sysadmin        opus-4-5 higher   0.013    0.0000   Bonf
Total Input Tokens    task_type:feature         opus-4-5 higher   0.011    0.0000   Bonf
Total Input Tokens    complexity:trivial        opus-4-5 higher   0.000    0.0000   Bonf
24 non-significant cost results
Measurement           Slice                     Effect   p_adj
Request Count         task_type:greenfield      0.822    0.1917
Total Output Tokens   task_type:greenfield      0.736    0.0538
Estimated Cost        task_type:greenfield      0.720    0.4060
Total Input Tokens    task_type:greenfield      0.593    0.2112
Request Count         task_type:bugfix          0.538    0.0931
Cost Per Minute       task_type:refactor        0.299    0.1799
Request Count         complexity:complex        0.296    0.0758
Cost Per Minute       task_type:feature         0.292    0.0758
Estimated Cost        task_type:bugfix          0.275    0.4758
Cost Per Minute       task_type:sysadmin        0.248    0.4460
Estimated Cost        complexity:moderate       0.197    0.1222
Estimated Cost        complexity:complex        0.150    0.3988
Cost Per Minute       task_type:greenfield      0.146    0.9305
Request Count         complexity:simple         0.146    0.2142
Estimated Cost        iteration:minor           0.118    0.5655
Cost Per Minute       task_type:bugfix          0.113    0.0943
Output Per Request    task_type:greenfield      0.107    0.1140
Estimated Cost        task_type:feature         0.102    0.4758
Request Count         complexity:moderate       0.095    0.3041
Estimated Cost        task_type:sysadmin        0.095    0.7716
Estimated Cost        overall                   0.047    0.8292
Estimated Cost        task_type:investigation   0.039    0.3707
Estimated Cost        iteration:one_shot        0.021    0.9305
Request Count         iteration:minor           0.008    0.8273

The cost difference raises a natural question: does 4.6’s different spending pattern correspond to different thinking strategies? The next section examines thinking calibration—the largest overall effect in the study.

3. Thinking & Calibration

- Thinking Fraction Effect: d=0.64; the #1 overall Bonferroni survivor (medium)
- Thinking Frequency: 75% vs 59%; 4.5 thinks more often but shallowly
- Thinking Depth: +58%; 4,067 vs 2,578 chars when thinking

Thinking fraction is the largest overall effect in the study (d=0.64, medium by Cohen’s convention). Opus 4.5 thinks on 75% of tasks but shallowly; Opus 4.6 thinks on 59% of tasks but more deeply when it does (4,067 vs 2,578 chars). The pattern suggests 4.6 has better calibration of when thinking is needed, reserving it for moderate-and-above complexity.
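The effect sizes in this section appear to follow Cohen's d with a pooled standard deviation, which can be sketched as follows (the pooled-SD form is the standard convention; the report does not spell out its exact estimator):

```python
# Sketch of Cohen's d with a pooled standard deviation, the usual
# convention for effect sizes like the d=0.64 reported here.
from statistics import mean, stdev

def cohens_d(a: list, b: list) -> float:
    na, nb = len(a), len(b)
    sa, sb = stdev(a), stdev(b)
    pooled = (((na - 1) * sa**2 + (nb - 1) * sb**2) / (na + nb - 2)) ** 0.5
    return (mean(a) - mean(b)) / pooled

# Equal-variance groups exactly one pooled SD apart give d = 1.
d = cohens_d([2.0, 3.0, 4.0], [1.0, 2.0, 3.0])
```

By Cohen's rough labels, 0.2 is small, 0.5 medium, and 0.8 large, which is why d=0.64 is described as a medium effect.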

Calibration by Complexity

Complexity   4.5 thinking %   4.6 thinking %   4.5 (n)   4.6 (n)   Δ
Trivial      76.0%            43.4%            882       346       −33pp
Simple       91.3%            59.8%            381       209       −32pp
Moderate     89.8%            91.1%            413       247       +1pp
Complex      76.8%            93.8%            198       112       +17pp
Major        65.4%            82.6%            26        23        +17pp
Thinking calibration: Opus 4.6 shows better calibration of when thinking is needed. It skips thinking for 57% of trivial tasks (vs 24% for Opus 4.5), but engages thinking for 90%+ of moderate and complex tasks. Opus 4.5 over-thinks easy problems; at moderate complexity, both converge.
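The "skips thinking" figures are complements of the thinking rates in the calibration table; a trivial helper makes the mapping explicit:

```python
# The "skips thinking" percentages are complements of the thinking
# rates in the calibration table above.
thinking_pct = {
    "opus-4-5": {"trivial": 76.0, "simple": 91.3, "moderate": 89.8},
    "opus-4-6": {"trivial": 43.4, "simple": 59.8, "moderate": 91.1},
}

def skip_pct(model: str, complexity: str) -> float:
    return round(100 - thinking_pct[model][complexity], 1)
```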
Thinking depth by complexity
Complexity 4.5 Thinking Chars (when used) 4.6 Thinking Chars (when used) 4.5 Text Chars 4.6 Text Chars 4.5 Think/Text 4.6 Think/Text
Trivial 839 1,108 841 895 1.00 1.24
Simple 1,684 1,593 976 1,185 1.73 1.34
Moderate 3,828 3,916 1,573 2,204 2.43 1.78
Complex 7,264 9,911 3,232 3,768 2.25 2.63
Major 10,637 9,859 4,150 8,117 2.56 1.21

Thinking Depth by Task Type

Task type       4.5 chars   4.6 chars   Ratio
Greenfield      2,520       8,196       3.3×
Refactor        4,613       7,028       1.5×
Bugfix          3,202       3,974       1.2×
Feature         4,115       6,605       1.6×
Investigation   2,458       2,572       1.0×
Sysadmin        1,441       2,624       1.8×
Continuation    1,651       2,256       1.4×
Docs            1,898       4,868       2.6×
Port            4,287       4,047       0.9×

Cross-Cut Detail

Thinking Findings
Measurement         Slice                     Direction         Effect   p_adj    Sig
Thinking Fraction   task_type:sysadmin        opus-4-5 higher   1.440    0.0000   Bonf
Thinking Fraction   complexity:simple         opus-4-5 higher   1.154    0.0000   Bonf
Thinking Fraction   task_type:bugfix          opus-4-5 higher   1.150    0.0001   Bonf
Thinking Fraction   iteration:minor           opus-4-5 higher   0.995    0.0000   Bonf
Thinking Fraction   task_type:investigation   opus-4-5 higher   0.860    0.0000   Bonf
Thinking Fraction   complexity:trivial        opus-4-5 higher   0.829    0.0000   Bonf
Thinking Fraction   overall                   opus-4-5 higher   0.636    0.0000   Bonf
Thinking Fraction   iteration:significant     opus-4-5 higher   0.628    0.0000   Bonf
Thinking Fraction   complexity:moderate       opus-4-5 higher   0.583    0.0000   Bonf
Thinking Fraction   iteration:one_shot        opus-4-5 higher   0.545    0.0000   Bonf
Thinking Fraction   task_type:feature         opus-4-5 higher   0.505    0.0306   FDR
Thinking Chars      complexity:complex        opus-4-6 higher   0.487    0.0484   FDR
Thinking Chars      complexity:simple         opus-4-5 higher   0.340    0.0000   Bonf
Thinking Chars      task_type:sysadmin        opus-4-5 higher   0.195    0.0000   Bonf
Thinking Chars      complexity:trivial        opus-4-5 higher   0.160    0.0000   Bonf
Thinking Chars      overall                   opus-4-5 higher   0.140    0.0000   Bonf
Thinking Chars      iteration:minor           opus-4-5 higher   0.096    0.0000   Bonf
Thinking Chars      complexity:moderate       opus-4-5 higher   0.026    0.0028   FDR
Thinking Chars      task_type:investigation   opus-4-5 higher   0.016    0.0012   FDR
9 non-significant thinking results
Measurement         Slice                   Effect   p_adj
Thinking Chars      task_type:greenfield    0.769    0.1355
Thinking Fraction   task_type:refactor      0.639    0.1188
Thinking Fraction   task_type:greenfield    0.596    0.3287
Thinking Chars      task_type:feature       0.411    0.8972
Thinking Chars      task_type:refactor      0.376    0.8110
Thinking Chars      iteration:one_shot      0.252    0.1028
Thinking Chars      iteration:significant   0.221    0.0551
Thinking Fraction   complexity:complex      0.071    0.8292
Thinking Chars      task_type:bugfix        0.055    0.1355
Significance: Thinking fraction survives Bonferroni correction across trivial, simple, and moderate complexity strata, and most cross-cut slices. The complex stratum is non-significant (p=0.75), likely due to smaller sample size. This is the most robust and widespread finding in the study.

Thinking calibration is one manifestation of broader behavioral differences between the models. The next section examines other behavioral patterns—subagent deployment, planning adoption, and effort distribution.

4. Behavioral Patterns

- Planning Adoption: 12.3% vs 1.8% of tasks
- Explore Subagents: 69% vs 49%; 4.6 favors read-only exploration
- Autonomous Subagents: 84% vs 55%; 4.6 self-initiates most subagent calls

Beyond token economics, the models differ in how they approach tasks. Opus 4.6 plans more often (12.3% vs 1.8% of tasks), deploys more subagents, and favors read-only exploration over general-purpose workers. These behavioral differences are among the most visible in the dataset, though the Claude Code platform itself evolved between the two collection periods—some of the shift may reflect SDK changes rather than model decisions.

Subagent & Planning Adoption

Metric  4.5  4.6  Δ (4.6 − 4.5)
Tasks using planning mode  1.8% (35)  12.3% (115)  +10.4pp
Tasks using subagents  8.2% (155)  20.1% (188)  +11.9pp
Autonomous subagent calls  54.9% (196)  84.2% (315)  +29.3pp

Subagent Type Distribution

Type  4.5  4.6  Δ (4.6 − 4.5)
Explore  49.0% (175)  68.7% (257)  +19.7pp
General-purpose  31.9% (114)  19.8% (74)  −12.1pp
Plan  6.7% (24)  7.2% (27)  ≈ Tie
Bash  1.7% (6)  3.7% (14)  ≈ Tie
Subagent strategies diverge: Both models deploy similar total subagents (357 vs 374) but they serve different purposes. For Opus 4.6, 69% are lightweight, read-only Explore agents that gather context before implementation begins, with 20% general-purpose. Opus 4.5 splits its subagents more evenly—49% Explore, 32% general-purpose implementation workers that visibly modify files. Both front-load research, but Opus 4.6 concentrates even more heavily on read-only exploration.
Significance: Autonomy level distribution (p=9.45×10⁻⁵, Cramér’s V=0.25) survives Bonferroni correction.

Planning Adoption

Opus 4.6 enters plan mode on 12.3% of tasks (115 of 937) vs 1.8% for Opus 4.5. Adoption scales steeply with complexity: 42.9% at complex, 65.2% at major. Planned tasks show a modest alignment benefit (+0.17 overall) that diminishes at complex and major tiers.

Metric  4.5  4.6
Planning adoption rate  1.8% (35 tasks)  12.3% (115 tasks)

Complexity  4.5  4.6
Trivial  0.0% (0/882)  0.9% (3/346)
Simple  0.8% (3/381)  2.9% (6/209)
Moderate  2.4% (10/413)  17.4% (43/247)
Complex  7.1% (14/198)  42.9% (48/112)
Major  30.8% (8/26)  65.2% (15/23)
Planning alignment by complexity bin

Complexity  4.5 planned  4.5 unplanned  Δ  4.6 planned  4.6 unplanned  Δ
Trivial  — (n=0)  2.72 (n=882)  —  3.00 (n=3)  3.07 (n=343)  −0.07
Simple  3.00 (n=3)  3.27 (n=378)  −0.27  3.17 (n=6)  3.11 (n=203)  +0.06
Moderate  3.70 (n=10)  3.30 (n=403)  +0.40  3.35 (n=43)  3.32 (n=204)  +0.03
Complex+  — (n=0)  — (n=0)  —  — (n=0)  — (n=0)  —

Effort Distribution

Effort distribution shows Opus 4.6 allocates more tool calls to research (35.1% vs 28.3%) and fewer to implementation (17.5% vs 27.0%), consistent with the research-first approach visible in subagent type preferences.

Metric  4.5  4.6
Research ratio  28.3%  35.1%
Implementation ratio  27.0%  17.5%
Front-load positive %  54.3% (868 tasks)  59.3% (194 tasks)

Cross-Cut Detail

Behavioral Findings
Measurement  Slice  Direction  Effect  p_adj  Sig
Tool Calls  task_type:refactor  opus-4-6 higher  1.057  0.0009  FDR
Files Touched  task_type:refactor  opus-4-6 higher  0.808  0.0045  FDR
Tool Calls  iteration:significant  opus-4-6 higher  0.529  0.0000  Bonf
Tool Calls  task_type:feature  opus-4-6 higher  0.508  0.0005  FDR
Tool Calls  task_type:bugfix  opus-4-6 higher  0.507  0.0368  FDR
One Shot Rate  complexity:trivial  opus-4-6 higher  0.448  0.0000  Bonf
Lines Per Minute  complexity:moderate  opus-4-5 higher  0.442  0.0000  Bonf
Duration Seconds  task_type:refactor  opus-4-6 higher  0.435  0.0018  FDR
Files Touched  complexity:simple  opus-4-5 higher  0.423  0.0000  Bonf
One Shot Rate  complexity:complex  opus-4-6 higher  0.409  0.0016  FDR
Tool Calls  complexity:complex  opus-4-6 higher  0.405  0.0051  FDR
Lines Per Minute  task_type:feature  opus-4-5 higher  0.395  0.0127  FDR
Files Touched  task_type:feature  opus-4-6 higher  0.388  0.0226  FDR
Autonomy Level  iteration:one_shot  distributions differ  0.380  0.0000  Bonf
Scope Management  complexity:simple  distributions differ  0.379  0.0000  Bonf
Files Touched  iteration:significant  opus-4-6 higher  0.379  0.0043  FDR
One Shot Rate  complexity:simple  opus-4-6 higher  0.378  0.0001  Bonf
Scope Management  iteration:one_shot  distributions differ  0.378  0.0000  Bonf
One Shot Rate  complexity:moderate  opus-4-6 higher  0.378  0.0000  Bonf
One Shot Rate  overall  opus-4-6 higher  0.375  0.0000  Bonf
Files Touched  complexity:complex  opus-4-6 higher  0.336  0.0133  FDR
Tool Calls  complexity:moderate  opus-4-6 higher  0.328  0.0002  Bonf
Scope Management  complexity:trivial  distributions differ  0.324  0.0000  Bonf
Tool Calls  task_type:investigation  opus-4-6 higher  0.317  0.0000  Bonf
Autonomy Level  complexity:trivial  distributions differ  0.309  0.0000  Bonf
Tool Calls  complexity:trivial  opus-4-6 higher  0.301  0.0000  Bonf
Tools Per File  complexity:trivial  opus-4-6 higher  0.301  0.0000  Bonf
Scope Management  overall  distributions differ  0.296  0.0000  Bonf
Tools Per File  task_type:sysadmin  opus-4-6 higher  0.296  0.0213  FDR
Communication Quality  complexity:trivial  distributions differ  0.293  0.0000  Bonf
Files Touched  task_type:investigation  equal  0.279  0.0292  FDR
Autonomy Level  complexity:complex  distributions differ  0.263  0.0001  Bonf
Autonomy Level  overall  distributions differ  0.252  0.0000  Bonf
Scope Management  complexity:moderate  distributions differ  0.251  0.0000  Bonf
Tools Per File  task_type:investigation  opus-4-6 higher  0.248  0.0000  Bonf
Tools Per File  iteration:significant  opus-4-6 higher  0.246  0.0000  Bonf
Lines Per Minute  complexity:simple  opus-4-5 higher  0.244  0.0000  Bonf
Tools Per File  complexity:simple  opus-4-6 higher  0.240  0.0024  FDR
Communication Quality  iteration:one_shot  distributions differ  0.234  0.0000  Bonf
Iteration Required  complexity:trivial  distributions differ  0.230  0.0000  Bonf
Scope Expanded Rate  complexity:moderate  opus-4-5 higher  0.229  0.0422  FDR
Iteration Required  complexity:moderate  distributions differ  0.226  0.0002  Bonf
Duration Seconds  task_type:feature  opus-4-6 higher  0.224  0.0082  FDR
Iteration Required  complexity:complex  distributions differ  0.222  0.0048  FDR
Iteration Required  complexity:simple  distributions differ  0.222  0.0005  FDR
Tool Calls  overall  opus-4-6 higher  0.221  0.0000  Bonf
Iteration Required  overall  distributions differ  0.208  0.0000  Bonf
Communication Quality  overall  distributions differ  0.204  0.0000  Bonf
Scope Management  complexity:complex  distributions differ  0.200  0.0354  FDR
Autonomy Level  complexity:moderate  distributions differ  0.196  0.0022  FDR
Autonomy Level  task_type:investigation  distributions differ  0.192  0.0050  FDR
Communication Quality  task_type:investigation  distributions differ  0.191  0.0056  FDR
Autonomy Level  complexity:simple  distributions differ  0.174  0.0172  FDR
Tools Per File  complexity:moderate  opus-4-6 higher  0.174  0.0244  FDR
Scope Management  task_type:investigation  distributions differ  0.172  0.0229  FDR
Scope Expanded Rate  overall  opus-4-5 higher  0.155  0.0036  FDR
Tool Calls  iteration:one_shot  opus-4-6 higher  0.154  0.0000  Bonf
Communication Quality  iteration:significant  distributions differ  0.140  0.0007  FDR
Autonomy Level  iteration:significant  distributions differ  0.138  0.0003  Bonf
Scope Management  iteration:significant  distributions differ  0.133  0.0040  FDR
Tools Per File  overall  opus-4-6 higher  0.103  0.0000  Bonf
Duration Seconds  iteration:significant  opus-4-6 higher  0.085  0.0000  Bonf
Duration Seconds  task_type:investigation  opus-4-6 higher  0.053  0.0000  Bonf
Tools Per File  iteration:one_shot  opus-4-6 higher  0.051  0.0000  Bonf
Duration Seconds  complexity:complex  opus-4-6 higher  0.048  0.0000  Bonf
Duration Seconds  complexity:moderate  opus-4-6 higher  0.040  0.0075  FDR
Duration Seconds  iteration:one_shot  opus-4-6 higher  0.025  0.0000  Bonf
Duration Seconds  overall  opus-4-6 higher  0.000  0.0000  Bonf
83 non-significant behavior results
Measurement  Slice  Effect  p_adj
Scope Expanded Rate  task_type:greenfield  0.806  0.2282
Lines Per Minute  task_type:greenfield  0.659  0.4521
One Shot Rate  task_type:greenfield  0.614  0.2233
Tool Calls  task_type:greenfield  0.537  0.3961
Duration Seconds  task_type:greenfield  0.488  0.4521
Communication Quality  task_type:greenfield  0.488  0.0929
Scope Management  task_type:greenfield  0.427  0.4438
Tools Per File  task_type:greenfield  0.416  0.4438
Files Touched  task_type:bugfix  0.374  0.1278
Lines Per Minute  complexity:complex  0.369  0.1271
Autonomy Level  task_type:greenfield  0.363  0.2819
Tools Per File  task_type:feature  0.361  0.1191
Duration Seconds  task_type:bugfix  0.357  0.0732
Tools Per File  task_type:refactor  0.327  0.5441
Iteration Required  task_type:greenfield  0.310  0.4134
Autonomy Level  task_type:refactor  0.308  0.1305
Scope Management  task_type:refactor  0.301  0.1452
Scope Expanded Rate  task_type:refactor  0.287  0.5674
Scope Expanded Rate  task_type:investigation  0.286  0.1743
Scope Expanded Rate  complexity:complex  0.264  0.1136
Scope Expanded Rate  task_type:bugfix  0.251  0.5441
Lines Per Minute  task_type:bugfix  0.248  0.6793
Communication Quality  task_type:refactor  0.214  0.4521
Scope Expanded Rate  iteration:significant  0.208  0.0578
Lines Per Minute  iteration:minor  0.189  0.3560
Tools Per File  complexity:complex  0.186  0.8365
Duration Seconds  task_type:sysadmin  0.182  0.1819
Iteration Required  task_type:refactor  0.177  0.6072
Lines Per Minute  task_type:sysadmin  0.168  0.4032
Scope Management  task_type:feature  0.163  0.2986

Different behavioral strategies raise the question of whether they lead to different outcomes. The next section examines completion rates, failure rates, and user satisfaction—the quality signals that the behavioral patterns should ultimately serve.

5. Quality & Satisfaction

Failed Rate
5.4% vs 12.0%
4.6 fails less often (p=0.000)
Alignment Score
d=-0.13
4.6 higher (p=0.000714, Bonferroni)
Completion Dist.
p=0.000000
Chi-square Bonferroni survivor (V=0.10)

LLM-annotated alignment scores (1–5 scale) show Opus 4.6 scoring higher on average, an effect that survives Bonferroni correction (p=0.000714, d=-0.13). The failed rate difference is also notable: 5.4% of 4.6 tasks fail vs 12.0% for 4.5, a difference that likewise survives correction (p<0.001). Both alignment and failure rate are LLM-classified—a Claude Haiku model reads each session transcript and assigns scores. The “LLM quality judgement” approach was abandoned as unreliable (see §10), but alignment scoring proved more robust because it rates user-goal correspondence from observable signals rather than attempting to judge code quality directly.

Two categorical distributions—task completion and communication quality—also survive Bonferroni as chi-square tests, indicating the models differ in how they reach outcomes, not just in outcome rates. Both the completion distribution test (p=0.000000) and the completion rate proportion test survive correction. All chi-square tests carry a low-expected-cell-count warning due to rare categories in the 20-status taxonomy.

Completion Distribution

Outcome  4.5  4.6  Δ (4.6 − 4.5)
Complete  38.9%  60.9%  +22.0pp
Partial  38.5%  29.6%  −8.9pp
Interrupted  10.6%  4.1%  −6.6pp
Failed  12.0%  5.4%  −6.6pp

Sentiment Distribution

Sentiment  4.5  4.6  Δ (4.6 − 4.5)
Satisfied  23.5%  20.2%  −3.4pp
Neutral  60.5%  66.5%  +6.0pp
Dissatisfied  11.8%  10.8%  ≈ Tie

Satisfaction trends higher for 4.6 but does not survive Bonferroni correction. Dissatisfaction rates are essentially tied. Both completion and sentiment are LLM-classified: a Claude Haiku annotator reads the full session transcript for each task, classifying completion status from a 20-category taxonomy and inferring user sentiment from contextual signals (follow-up messages, tone shifts, task abandonment patterns). These classifications were validated through human spot-checks of flagged cases, but no formal inter-rater reliability was computed.

Quality confound: Opus 4.6’s different complexity mix (41% moderate-and-above vs 34% for 4.5) means it tackles harder work on average. Despite this, it achieves higher alignment scores and a lower failure rate—suggesting genuine capability improvement, though task selection remains confounded. The cross-cut detail below shows complexity-stratified alignment scores, consistent with genuine improvement rather than a pure task-mix artifact.
Full statistical test details for satisfaction metrics

Mann-Whitney U Test: Alignment Score

Metric  Opus 4.5  Opus 4.6
Sample size  1,900  937
Mean  3.032  3.186
Median  3.0  3.0
Std dev  1.237  0.959

Test statistic  Value
U statistic  823,192.5
p-value  0.000714
Cohen's d  −0.134
Effect size  negligible
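The reported Cohen's d can be reproduced from the summary statistics alone, assuming the usual pooled-standard-deviation form (a sketch; it matches the table's value):

```python
import math

def cohens_d(m1, s1, n1, m2, s2, n2):
    """Cohen's d with pooled standard deviation across two groups."""
    pooled_var = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)
    return (m1 - m2) / math.sqrt(pooled_var)

# Alignment score summary stats from the table above (4.5 vs 4.6).
d = cohens_d(3.032, 1.237, 1900, 3.186, 0.959, 937)
print(round(d, 3))  # -0.134
```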

Proportion Tests: Task Outcomes

Complete Rate

Metric  Opus 4.5  Opus 4.6
Proportion  0.389  0.609
Count  739 / 1900  571 / 937
95% CI  [0.367, 0.411]  [0.578, 0.640]

Test statistic  Value
z statistic  −11.077
p-value  0.0000
Cohen's h  −0.445
Effect size  small

Failed Rate

Metric  Opus 4.5  Opus 4.6
Proportion  0.120  0.054
Count  228 / 1900  51 / 937
95% CI  [0.106, 0.135]  [0.042, 0.071]

Test statistic  Value
z statistic  5.516
p-value  0.0000
Cohen's h  0.236
Effect size  small

Satisfaction Rate

Metric  Opus 4.5  Opus 4.6
Proportion  0.235  0.202
Count  447 / 1900  189 / 937
95% CI  [0.217, 0.255]  [0.177, 0.229]

Test statistic  Value
z statistic  2.016
p-value  0.0438
Cohen's h  0.081
Effect size  negligible

Dissatisfaction Rate

Metric  Opus 4.5  Opus 4.6
Proportion  0.118  0.108
Count  225 / 1900  101 / 937
95% CI  [0.105, 0.134]  [0.089, 0.129]

Test statistic  Value
z statistic  0.835
p-value  0.4037
Cohen's h  0.034
Effect size  negligible
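The proportion tests above appear to be standard pooled two-proportion z-tests with Cohen's h (a difference of arcsine-transformed proportions) as the effect size. A sketch under that assumption, using the Complete Rate counts:

```python
import math

def two_prop_z(x1, n1, x2, n2):
    """Pooled two-proportion z statistic plus Cohen's h effect size."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Cohen's h: difference of arcsine-transformed proportions.
    h = 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))
    return z, h

z, h = two_prop_z(739, 1900, 571, 937)  # Complete Rate counts
print(z, h)  # matches the reported z = -11.077, h = -0.445 up to rounding
```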

Chi-Square Test: Task Completion Distribution

Note: The full categorical breakdown includes 4 unique completion statuses. For clarity, simplified counts are shown below.

Category  Opus 4.5  Opus 4.6
Complete  739  571
Partial  731  277
Interrupted  202  38
Failed  228  51
Other  0  0

Test statistic  Value
χ² statistic  139.581
Degrees of freedom  3
p-value  0.000
Cramér's V  0.222
Effect size  small
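The χ² statistic and Cramér's V can be reproduced from the simplified counts (a sketch; the zero-count "Other" row is dropped to avoid zero expected cells):

```python
import math

def chi_square_cramers_v(table):
    """table: rows of category counts, one column per model.
    Returns (chi-square statistic, Cramer's V)."""
    rows, cols = len(table), len(table[0])
    row_tot = [sum(r) for r in table]
    col_tot = [sum(r[j] for r in table) for j in range(cols)]
    n = sum(row_tot)
    chi2 = 0.0
    for i in range(rows):
        for j in range(cols):
            expected = row_tot[i] * col_tot[j] / n
            chi2 += (table[i][j] - expected) ** 2 / expected
    v = math.sqrt(chi2 / (n * (min(rows, cols) - 1)))
    return chi2, v

# Completion counts from the table above (rows: outcome; cols: 4.5, 4.6).
counts = [[739, 571], [731, 277], [202, 38], [228, 51]]
chi2, v = chi_square_cramers_v(counts)
print(round(chi2, 1), round(v, 3))  # 139.6 0.222
```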

Bonferroni Correction

With 11 independent tests conducted (1 Mann-Whitney U, 9 proportion tests, 1 chi-square), the Bonferroni-corrected significance threshold is α = 0.05 / 11 = 0.0045.
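The correction is a straight division of α by the number of tests in the family. Applied to a few of the p-values reported in this section (a minimal sketch):

```python
alpha = 0.05
m = 11                      # tests in this family
threshold = alpha / m       # 0.00454...

# p-values as reported above.
pvals = {
    "Complete Rate": 0.000000,
    "Alignment score": 0.000714,
    "Satisfaction Rate": 0.0438,
}
survivors = [name for name, p in pvals.items() if p < threshold]
print(survivors)  # ['Complete Rate', 'Alignment score']
```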

Tests surviving Bonferroni correction (p < 0.0045):

  • Complete Rate: p = 0.000000 (significant)
  • Failed Rate: p = 0.000000 (significant)
  • One Shot Rate: p = 0.000000 (significant)
  • Good Execution Rate: p = 0.000000 (significant)
  • Task completion distribution: p = 0.000000 (significant)
  • Alignment score: p = 0.0007 (significant)
  • Scope Expanded Rate: p = 0.0012 (significant)

Tests significant at α = 0.05 but not after correction:

  • Satisfaction Rate: p = 0.0438 (marginal)
  • Has Edits Rate: p = 0.0453 (marginal)
  • Has Overlaps Rate: p = 0.0433 (marginal)

Non-significant tests:

  • Dissatisfaction Rate: p = 0.404 (11.8% vs 10.8%)

Cross-Cut Detail

Quality Findings
Measurement  Slice  Direction  Effect  p_adj  Sig
Satisfaction Rate  task_type:greenfield  opus-4-6 higher  1.242  0.0109  FDR
Alignment Score  task_type:greenfield  opus-4-6 higher  1.213  0.0230  FDR
Complete Rate  task_type:greenfield  opus-4-6 higher  0.963  0.0427  FDR
Normalized User Sentiment  task_type:greenfield  distributions differ  0.644  0.0318  FDR
Satisfaction Rate  task_type:refactor  opus-4-6 higher  0.643  0.0207  FDR
Complete Rate  complexity:trivial  opus-4-6 higher  0.607  0.0000  Bonf
Good Execution Rate  complexity:moderate  opus-4-5 higher  0.506  0.0000  Bonf
Good Execution Rate  complexity:complex  opus-4-5 higher  0.473  0.0003  Bonf
Complete Rate  overall  opus-4-6 higher  0.445  0.0000  Bonf
Complete Rate  iteration:one_shot  opus-4-6 higher  0.419  0.0000  Bonf
Good Execution Rate  iteration:one_shot  opus-4-5 higher  0.406  0.0000  Bonf
Good Execution Rate  task_type:feature  opus-4-5 higher  0.405  0.0060  FDR
Complete Rate  complexity:moderate  opus-4-6 higher  0.383  0.0000  Bonf
Satisfaction Rate  iteration:one_shot  opus-4-5 higher  0.350  0.0000  Bonf
Complete Rate  complexity:simple  opus-4-6 higher  0.309  0.0012  FDR
Good Execution Rate  complexity:simple  opus-4-5 higher  0.306  0.0015  FDR
Satisfaction Rate  complexity:complex  opus-4-5 higher  0.300  0.0307  FDR
Alignment Score  iteration:significant  equal  0.295  0.0002  Bonf
Task Completion  complexity:trivial  distributions differ  0.290  0.0000  Bonf
Normalized Execution Quality  complexity:moderate  distributions differ  0.267  0.0000  Bonf
Alignment Score  complexity:trivial  equal  0.260  0.0000  Bonf
Normalized Execution Quality  complexity:complex  distributions differ  0.260  0.0004  FDR
Alignment Score  iteration:one_shot  opus-4-5 higher  0.245  0.0000  Bonf
Failed Rate  complexity:trivial  opus-4-5 higher  0.237  0.0013  FDR
Failed Rate  overall  opus-4-5 higher  0.236  0.0000  Bonf
Dissatisfaction Rate  iteration:one_shot  opus-4-6 higher  0.226  0.0001  Bonf
Task Completion  overall  distributions differ  0.222  0.0000  Bonf
Normalized Execution Quality  iteration:one_shot  distributions differ  0.220  0.0000  Bonf
Failed Rate  iteration:one_shot  opus-4-5 higher  0.214  0.0012  FDR
Normalized Execution Quality  task_type:feature  distributions differ  0.211  0.0354  FDR
Normalized User Sentiment  complexity:complex  distributions differ  0.210  0.0096  FDR
Good Execution Rate  overall  opus-4-5 higher  0.207  0.0000  Bonf
Normalized User Sentiment  iteration:one_shot  distributions differ  0.199  0.0000  Bonf
Task Completion  iteration:one_shot  distributions differ  0.197  0.0000  Bonf
Task Completion  complexity:moderate  distributions differ  0.197  0.0001  Bonf
Dissatisfaction Rate  complexity:moderate  opus-4-5 higher  0.196  0.0431  FDR
Normalized Execution Quality  complexity:simple  distributions differ  0.191  0.0009  FDR
Task Completion  complexity:simple  distributions differ  0.153  0.0093  FDR
Normalized User Sentiment  complexity:simple  distributions differ  0.145  0.0161  FDR
Task Completion  task_type:investigation  distributions differ  0.134  0.0378  FDR
Alignment Score  overall  equal  0.134  0.0022  FDR
Normalized Execution Quality  overall  distributions differ  0.133  0.0000  Bonf
Task Completion  iteration:significant  distributions differ  0.129  0.0026  FDR
Normalized User Sentiment  overall  distributions differ  0.064  0.0236  FDR
82 non-significant quality results
Measurement  Slice  Effect  p_adj
Dissatisfaction Rate  task_type:greenfield  1.002  0.1256
Failed Rate  task_type:refactor  0.580  0.2222
Failed Rate  task_type:greenfield  0.562  0.4342
Task Completion  task_type:greenfield  0.555  0.0936
Good Execution Rate  task_type:refactor  0.436  0.1536
Complete Rate  task_type:refactor  0.415  0.1673
Normalized User Sentiment  task_type:refactor  0.312  0.1218
Normalized Execution Quality  task_type:refactor  0.309  0.2163
Task Completion  task_type:refactor  0.302  0.1427
Dissatisfaction Rate  task_type:bugfix  0.301  0.1964
Normalized Execution Quality  task_type:greenfield  0.289  0.4565
Failed Rate  complexity:complex  0.285  0.2163
Good Execution Rate  task_type:sysadmin  0.269  0.0855
Complete Rate  complexity:complex  0.256  0.0669
Dissatisfaction Rate  task_type:refactor  0.256  0.4534
Satisfaction Rate  task_type:bugfix  0.251  0.2282
Alignment Score  task_type:bugfix  0.240  0.3008
Alignment Score  task_type:investigation  0.205  0.0653
Normalized Execution Quality  task_type:sysadmin  0.197  0.0766
Failed Rate  task_type:bugfix  0.194  0.4342
Satisfaction Rate  complexity:simple  0.190  0.0670
Failed Rate  task_type:investigation  0.189  0.1352
Complete Rate  task_type:investigation  0.181  0.1195
Failed Rate  complexity:moderate  0.176  0.1022
Failed Rate  complexity:simple  0.161  0.1455
Good Execution Rate  task_type:investigation  0.159  0.1799
Failed Rate  task_type:feature  0.157  0.4205
Alignment Score  task_type:refactor  0.156  0.8416
Good Execution Rate  task_type:greenfield  0.154  0.7595
Alignment Score  complexity:simple  0.150  0.2909

Quality metrics paint a consistent-but-modest picture: 4.6 fails less and scores higher on alignment, but effect sizes are small (d=0.13) and the LLM-classification methodology adds a layer of uncertainty. The next section asks whether these quality differences manifest in the editing process itself.

6. Edit Accuracy

Rewrite Rate
11.6% vs 18.2%
4.6 rewrites less of its own output
Editing Tasks
1,135
767 (4.5) + 368 (4.6)
Overlapping Edits
486 vs 204
Self-corrections, error recovery, user-directed, iterative

Edit timeline analysis tracks every Edit and Write tool call, building per-file content ownership maps to detect when a model later overwrites its own earlier output. Opus 4.5 rewrites 18.2% of its edits vs 11.6% for Opus 4.6. Overlap classification reveals the rewrites are predominantly iterative refinement (64% for 4.5, largest category), not error recovery.
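The ownership-map idea can be sketched minimally: track the strings each session has written per file, and count an Edit as a rewrite when the text it replaces was produced by an earlier call. This is an illustrative reconstruction with assumed event fields (`file`, `tool`, `old_text`, `new_text`), not the report's actual pipeline:

```python
def count_rewrites(events):
    """events: ordered tool calls, each a dict with keys
    'file', 'tool' ('Edit' or 'Write'), 'old_text', 'new_text'.
    An Edit counts as a rewrite when the text it replaces was
    written earlier in the same session."""
    written = {}        # file -> set of strings this session has written
    edit_calls = 0
    rewrites = 0
    for ev in events:
        texts = written.setdefault(ev["file"], set())
        if ev["tool"] == "Edit":
            edit_calls += 1
            if ev.get("old_text") in texts:
                rewrites += 1   # overwriting the model's own earlier output
        texts.add(ev["new_text"])
    return edit_calls, rewrites

events = [
    {"file": "a.py", "tool": "Write", "old_text": None, "new_text": "v1"},
    {"file": "a.py", "tool": "Edit", "old_text": "v1", "new_text": "v2"},   # rewrite
    {"file": "a.py", "tool": "Edit", "old_text": "human code", "new_text": "v3"},
]
print(count_rewrites(events))  # (2, 1)
```

The real analysis additionally classifies each overlap (self-correction, error recovery, user-directed, iterative refinement); that step is heuristic-based, as the interpretation note below the tables cautions.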

Overlap Breakdown

Metric  4.5  4.6
Tasks with edits  700  246
Edit calls (rewrite rate denom.)  2,453  1,166
Rewrite rate  16.6%  10.3%
Total overlapping edits  407  120
Self-corrections  11.5% (47)  31.7% (38)
Error recovery  15.7% (64)  15.0% (18)
User-directed corrections  9.8% (40)  0.8% (1)
Iterative refinement  62.9% (256)  52.5% (63)

The overlap composition tells a more nuanced story than the headline rewrite rate. When Opus 4.6 does overlap, a larger share is self-correction (30.4% vs 10.1% for 4.5)—meaning 4.6 catches and fixes its own mistakes more explicitly. Opus 4.5’s overlaps are more heavily iterative refinement (64% vs 59%), suggesting gradual adjustment rather than correction. Error recovery accounts for a modest share of both (15.2% for 4.5, 10.3% for 4.6).

Full edit overlap breakdown
Metric  4.5  4.6
Tasks with edits  767  368
Edit calls (rewrite rate denom.)  2,674  1,765
Rewrite rate  18.2%  11.6%
Total overlapping edits  486  204
Self-corrections  10.1% (49)  30.4% (62)
Error recovery  15.2% (74)  10.3% (21)
User-directed corrections  11.1% (54)  0.5% (1)
Iterative refinement  63.6% (309)  58.8% (120)
Self-correction rate by complexity

Complexity  4.5 rate (n)  4.6 rate (n)
Trivial  0.0% (71)  0.0% (20)
Simple  0.8% (199)  2.6% (64)
Moderate  1.8% (322)  2.0% (160)
Complex  1.2% (157)  3.2% (103)
Major  1.3% (18)  2.6% (21)

Cross-Cut Detail

Editing Findings
Measurement  Slice  Direction  Effect  p_adj  Sig
Has Edits Rate  complexity:simple  opus-4-5 higher  0.488  0.0000  Bonf
Lines Removed  complexity:simple  equal  0.387  0.0000  Bonf
Lines Added  task_type:refactor  opus-4-6 higher  0.340  0.0187  FDR
Has Overlaps Rate  complexity:simple  opus-4-5 higher  0.332  0.0013  FDR
Has Edits Rate  complexity:moderate  opus-4-5 higher  0.330  0.0002  Bonf
Lines Added  complexity:moderate  opus-4-5 higher  0.311  0.0001  Bonf
Max Chain Depth  iteration:minor  equal  0.287  0.0351  FDR
Triage Score  complexity:simple  equal  0.268  0.0012  FDR
Max Chain Depth  complexity:simple  equal  0.265  0.0013  FDR
Rewrite Rate  complexity:simple  equal  0.258  0.0015  FDR
Rewrite Rate  iteration:minor  equal  0.254  0.0404  FDR
Triage Score  iteration:minor  equal  0.252  0.0387  FDR
Has Overlaps Rate  iteration:minor  opus-4-5 higher  0.248  0.0481  FDR
Overlap Count  iteration:minor  equal  0.246  0.0458  FDR
Has Overlaps Rate  complexity:moderate  opus-4-5 higher  0.234  0.0124  FDR
Overlap Count  complexity:simple  equal  0.230  0.0016  FDR
Lines Removed  complexity:moderate  opus-4-5 higher  0.224  0.0002  Bonf
Lines Added  complexity:simple  opus-4-5 higher  0.206  0.0000  Bonf
Max Chain Depth  complexity:moderate  equal  0.194  0.0093  FDR
Overlap Count  complexity:moderate  equal  0.185  0.0118  FDR
Rewrite Rate  complexity:moderate  equal  0.166  0.0113  FDR
Triage Score  complexity:moderate  equal  0.147  0.0094  FDR
90 non-significant editing results
Measurement  Slice  Effect  p_adj
Has Overlaps Rate  task_type:greenfield  0.806  0.2282
Max Chain Depth  task_type:greenfield  0.579  0.2606
Triage Score  task_type:greenfield  0.579  0.2606
Rewrite Rate  task_type:greenfield  0.579  0.2606
Overlap Count  task_type:greenfield  0.545  0.2606
Lines Added  task_type:bugfix  0.529  0.4868
Lines Removed  task_type:refactor  0.452  0.4819
Rewrite Rate  task_type:bugfix  0.442  0.2823
Rewrite Rate  task_type:refactor  0.395  0.3869
Has Edits Rate  task_type:greenfield  0.318  0.5317
Triage Score  task_type:bugfix  0.308  0.3949
Lines Removed  task_type:bugfix  0.304  0.5674
Triage Score  task_type:refactor  0.279  0.4835
Max Chain Depth  task_type:bugfix  0.244  0.4525
Max Chain Depth  task_type:refactor  0.244  0.4758
Has Overlaps Rate  task_type:sysadmin  0.239  0.1818
Max Chain Depth  task_type:sysadmin  0.228  0.1799
Rewrite Rate  task_type:sysadmin  0.226  0.1799
Has Overlaps Rate  task_type:refactor  0.220  0.4876
Lines Added  task_type:greenfield  0.218  0.6607
Overlap Count  task_type:refactor  0.207  0.4541
Has Edits Rate  complexity:complex  0.205  0.1589
Max Chain Depth  task_type:feature  0.199  0.5094
Overlap Count  task_type:sysadmin  0.194  0.1828
Has Edits Rate  task_type:feature  0.189  0.2309
Lines Removed  complexity:complex  0.177  0.3578
Triage Score  task_type:sysadmin  0.174  0.1799
Triage Score  task_type:investigation  0.173  0.6125
Lines Removed  iteration:significant  0.171  0.4310
Triage Score  task_type:feature  0.165  0.5655
Interpretation: A lower rewrite rate is consistent with Opus 4.6’s research-first approach—investigating before editing reduces the need for later corrections. However, the distinction between “self-correction” and “iterative refinement” is heuristic-based, and Opus 4.6’s overlap sample is small (n=204), making per-category percentages volatile—the user-directed category at 0.5% represents a single edit.

Edit patterns capture one dimension of how the models work; the next section broadens the lens to overall resource usage and complexity scaling.

7. Complexity & Resource Usage

Tool Calls / Task
13.3 vs 9.6
Strongest behavioral signal (p<0.000001, d=-0.22)
Complexity Mix
41% vs 34%
4.6 has more moderate-and-above tasks
Lines Added
100 vs 104
Only −4% despite 38% more tool calls
Task Distribution by Complexity

Complexity  4.5  4.6
Trivial  46.4% (882)  36.9% (346)
Simple  20.1% (381)  22.3% (209)
Moderate  21.7% (413)  26.4% (247)
Complex  10.4% (198)  12.0% (112)
Major  1.4% (26)  2.5% (23)

Opus 4.6 sessions skew toward higher complexity: fewer trivial tasks (37% vs 46%) and proportionally more moderate tasks (26% vs 22%). This makes raw aggregate comparisons misleading—Opus 4.6 is tackling harder work on average.

Resource Usage

Metric  4.5  4.6  Δ
Avg tools per task  9.6  13.3  +38%
Avg files per task  1.7  2.3  +40%
Avg lines added  104.0  100.3  ≈ Tie
The exploration–output tradeoff: Tool calls per task is the strongest behavioral signal in the study (p<0.000001, d=0.29, Bonferroni); tools per file also survives correction (p<0.000001, d=0.10). Yet Opus 4.6 adds roughly 4% fewer lines of code per task despite 38% more tool calls. The extra activity is predominantly read-only research (69% Explore subagents, §4), not proportional output growth. An alternative reading: 4.6 is simply less efficient, doing more work for similar results. The subagent composition data from §4 supports the research interpretation, but the distinction matters.
Significance: Tool calls/task (p<0.000001, d=−0.29) and tools/file (p<0.000001, d=−0.10) survive Bonferroni correction. The d=−0.10 for tools/file is negligible in practical terms despite statistical significance, an artifact of large sample size. The tool call averages in the table above (9.6 vs 13.3) include subagent calls; the stat test was run on per-task attributed calls (mean 8.9 vs 12.9), which show the same directional effect.
Task Scope by Complexity
Complexity  4.5 tasks  4.6 tasks  4.5 files/task  4.6 files/task  4.5 lines+/task  4.6 lines+/task  4.5 lines−/task  4.6 lines−/task
Trivial 882 346 0.1 0.1 0 0 0 0
Simple 381 209 1.0 0.7 14 10 8 3
Moderate 413 247 2.8 2.9 112 79 37 26
Complex 198 112 5.6 7.2 472 420 102 131
Major 26 23 15.9 21.5 2006 1096 140 271
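One way to gauge how much the complexity mix drives an aggregate gap is to reweight 4.6's per-stratum statistics by 4.5's task mix. A back-of-envelope sketch using the rounded files/task values from the table above:

```python
# Per-complexity task counts for 4.5 and files/task for 4.6,
# from the Task Scope by Complexity table (trivial..major).
counts_45 = [882, 381, 413, 198, 26]
files_per_task_46 = [0.1, 0.7, 2.9, 7.2, 21.5]

total = sum(counts_45)
# 4.6's files/task, standardized to 4.5's complexity mix.
reweighted = sum(n * f for n, f in zip(counts_45, files_per_task_46)) / total
print(round(reweighted, 1))  # 1.9
```

Under 4.5's mix, 4.6's average files/task drops from the raw 2.3 to roughly 1.9, still above 4.5's 1.7, so part but not all of the raw gap reflects task selection.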

Cross-Cut Detail

Tool calls and tools/file are classified under the “behavior” theme in the cross-cut analysis. Their per-complexity, per-task-type, and per-iteration breakdowns appear in §4’s cross-cut detail (Behavioral Findings). Key results: the tool-call gap is largest for significantly-iterated tasks (d=0.53) and trivial complexity (d=0.30), both Bonferroni-significant.

The preceding sections examined behavioral, quality, and resource dimensions. The next section examines temporal patterns—how performance unfolds within and across sessions.

8. Session Dynamics

Median Task Duration
62s vs 42s
4.6 takes 46% longer per task
Explore Phase
2.3×
71.0s vs 31.3s median explore duration
Active-Time Cost
$27.48 vs $25.52/hr
5-min idle threshold

Task duration survives Bonferroni correction (p=0.000001), though the effect size is negligible (d=0.005)—a case of statistical significance without practical significance, driven by sample size. Opus 4.6 takes longer per task (median 62s vs 42s, a 46% increase). The explore phase runs 2.3× longer at median (71.0s vs 31.3s). Effort distribution shows 4.6 allocates more tool calls to research (35.1% vs 28.3%) and fewer to implementation (17.5% vs 27.0%). Active-time cost is $27.48/hour for 4.6 vs $25.52/hour for 4.5 (5-min idle threshold).

Task Duration

Percentile  4.5  4.6
p10  8s  10s
p25  15s  20s
Median  42s  1.0m
p75  2.0m  3.4m
p90  4.5m  8.2m

Duration buckets

Duration  4.5  4.6
Under 30s  772 (42.1%)  310 (34.4%)
30s – 2m  604 (33.0%)  270 (29.9%)
2m – 10m  392 (21.4%)  252 (27.9%)
10m – 1h  55 (3.0%)  61 (6.8%)
Over 1h  9 (0.5%)  9 (1.0%)

Session Length & Warmup

Session Length Effects
Session Length  Alignment (4.5 / 4.6)  Completion Rate (4.5 / 4.6)  Sessions (4.5 / 4.6)
Short (1–3 tasks)  2.93 / 3.53  33.3% / 41.7%  135 / 26
Medium (4–8 tasks)  2.93 / 3.16  30.3% / 37.4%  45 / 27
Long (9+ tasks)  2.85 / 2.96  27.0% / 28.6%  58 / 11
Warm-up Effects
Phase  Alignment (4.5 / 4.6)  Completion Rate (4.5 / 4.6)  Tools/File (4.5 / 4.6)
Early (first 3 tasks)  2.85 / 2.95  24.2% / 25.0%  4.95 / 5.68
Later (task 4+)  2.85 / 3.06  28.0% / 33.5%  4.50 / 5.11

Active-Time Cost

Idle threshold  4.5 active hrs  4.6 active hrs  4.5 $/hr  4.6 $/hr  Δ $/hr
2 min 181.1 88.7 $27.53 $29.38 +7%
5 min 195.4 94.8 $25.52 $27.48 +8%
10 min 212.7 103.2 $23.45 $25.25 +8%
20 min 235.8 114.9 $21.15 $22.68 +7%
30 min 249.4 124.8 $20.00 $20.87 +4%
60 min 282.3 148.6 $17.67 $17.53 −1%
Complication: Session overlap analysis (1,953 overlapping pairs, max concurrency 11) complicates per-session cost attribution. Active-time cost varies with idle threshold (see sensitivity table above), but the directional relationship is stable across all thresholds tested.
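The active-time figures are presumably derived by clipping inter-event gaps at the idle threshold, so that a gap longer than the threshold contributes only the threshold and the remainder counts as idle. A sketch under that assumption:

```python
def active_hours(timestamps, idle_threshold_s):
    """timestamps: sorted event times in seconds.
    Each inter-event gap contributes min(gap, idle_threshold_s),
    treating anything beyond the threshold as idle time."""
    total = 0.0
    for prev, cur in zip(timestamps, timestamps[1:]):
        total += min(cur - prev, idle_threshold_s)
    return total / 3600

ts = [0, 60, 400, 4000]       # gaps: 60s, 340s, 3600s
print(active_hours(ts, 300))  # 60 + 300 + 300 = 660 active seconds
```

This is why the $/hr figures fall as the threshold grows: more of each long gap is counted as active time, inflating the denominator.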

Context Compaction

Context-window compaction occurs in 9.8% of 4.5 sessions (32/327) and 11.7% of 4.6 sessions (22/188). Pre/post comparisons show improvement after compaction, but a position-adjusted control group—splitting non-compacting sessions at the median compaction position to isolate position effects—reveals the effect is driven by session position, not compaction itself (position-adjusted effect: −0.17 for 4.5, −0.17 for 4.6). Compaction appears to preserve rather than degrade performance.

Compaction Overview
Metric  4.5  4.6
Sessions with compaction  9.8% (32/327)  11.7% (22/188)
Total compaction events  51  35
Events per compacting session  1.59  1.59
Auto-triggered  70.6%  80.0%
Avg pre-compaction tokens  156,823  164,617
Avg position in session  59.2%  60.0%
Pre/post compaction outcome data
Metric  4.5 Compacting Δ  4.5 Control Δ  4.5 Net  4.6 Compacting Δ  4.6 Control Δ  4.6 Net
Alignment score  +0.08  +0.24  −0.17  +0.08  +0.24  −0.17
Satisfaction rate  +5.1pp  +3.6pp  +1.4pp  +5.6pp  +7.3pp  −1.7pp
Completion rate  −0.0pp  +9.0pp  −9.0pp  −1.3pp  +9.8pp  −11.1pp
Position-adjusted effect: The negative values mean compacting sessions improve less than position-matched controls, suggesting the apparent post-compaction improvement is driven by session position rather than compaction itself. Compaction neither helps nor substantially harms outcomes.
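The control-group construction described above can be sketched as follows: split each non-compacting session at the median compaction position, compute its pre/post delta, and subtract the control average from the compacting average. Session shapes and field layout here are illustrative assumptions, not the report's actual code:

```python
def position_adjusted_effect(compacting, controls, split_frac):
    """compacting: (pre_scores, post_scores) tuples, split at the
    actual compaction point. controls: ordered per-task score lists,
    split at split_frac (the median compaction position) to mimic
    session position. Returns compacting delta minus control delta."""
    def delta(pre, post):
        return sum(post) / len(post) - sum(pre) / len(pre)

    comp_delta = sum(delta(pre, post) for pre, post in compacting) / len(compacting)
    ctrl_deltas = []
    for scores in controls:
        k = max(1, min(len(scores) - 1, round(len(scores) * split_frac)))
        ctrl_deltas.append(delta(scores[:k], scores[k:]))
    ctrl_delta = sum(ctrl_deltas) / len(ctrl_deltas)
    # Negative: compacting sessions improve less than position-matched controls.
    return comp_delta - ctrl_delta

# Toy data: compacting session is flat; control improves after the split.
print(position_adjusted_effect([([3, 3], [3, 3])], [[2, 2, 3, 3]], 0.6))  # -1.0
```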

Cross-Cut Detail

Duration is classified under the “behavior” theme in the cross-cut analysis. Per-task-type and per-iteration breakdowns appear in §4’s cross-cut detail (Behavioral Findings). Key results: the duration gap is largest for significantly-iterated tasks (median 99.0s vs 45.5s) and investigation tasks (median 79.9s vs 41.0s), both Bonferroni-significant. Effect sizes are negligible (d<0.1) despite significance—driven by sample size, not practical magnitude.

Session dynamics reveal a temporal dimension to the behavioral differences. The next section synthesizes all dimensions into overall model profiles.

9. Model Profiles

Not routing recommendations: These profiles summarize observed behavioral patterns from a single user’s workflow. They describe tendencies in this dataset, not inherent model properties. Different users, tasks, or evaluation periods could produce different profiles.
Opus 4.5 — Observed Pattern

Observed approach: Tends to act first and adjust as needed. Jumps to implementation with minimal upfront research.

Thinking: Thinks on 75% of tasks but shallowly (2,578 avg chars). Over-thinks trivial tasks (§3).

Subagents: 49% Explore, 32% general-purpose (implementation workers). Primarily autonomous (55%).

Planning: Rarely uses planning mode (1.8%). Distributes research evenly through the task (§4).

Observed strengths: Lower tool overhead (mean 8.9 calls/task). ~4.9% cheaper overall per task (§2). Stable performance across session lengths (§8).

Observed weaknesses: Higher rewrite rate (18.2%, §6). Higher failure rate (12.0% vs 5.4%, §5).

Opus 4.6 — Observed Pattern

Observed approach: Tends to research first, then implement. Front-loads investigation before touching files.

Thinking: Thinks on 59% of tasks but deeply (4,067 avg chars). Better calibrated—skips thinking on trivial, engages on complex (§3).

Subagents: 69% Explore (read-only research), 20% general-purpose. More autonomous (84%).

Planning: Uses planning mode on 12.3% of tasks (115 of 937). 43% at complex, 65% at major (§4).

Observed strengths: Lower rewrite rate (11.6%, §6). Lower failure rate (5.4%, §5). Lower cost at trivial–moderate (§2).

Observed weaknesses: 38% more tool calls per task (§7). ~4.9% more expensive overall. Costlier at major complexity (n=23, §2).

Observed Patterns by Task Type

| Task Type | Observed Pattern | Evidence & Caveats |
|---|---|---|
| Trivial / simple tasks | Similar completion rates | 4.6 is 28–35% cheaper (§2); n=882/346 and 381/209 |
| Complex / major tasks | 4.6 showed higher alignment | n=112+23 for 4.6 vs 198+26 for 4.5; confounded by project differences |
| Refactoring | 4.6 produced 2.1× output tokens | 5,647 vs 2,674 avg output (§2); lower rewrite rate (§6) |
| Investigation / research | 4.6 used more Explore agents | 69% read-only subagents (§4); 2.3× longer explore phase (§8) |
| Long sessions (9+ tasks) | Both show some degradation | Small sample for late-session tasks; 4.6 may degrade faster |
| Parallel execution | 4.6 backgrounded more tasks | 4.5 spawned more agents but ran them sequentially |

10. Methodology

Data Cleaning Methodology

Task-level data cleaning applied four exclusion rules and four informational flags to canonical tasks before analysis. Exclusions remove tasks that do not represent genuine user-model interactions; flags annotate tasks with contextual metadata without removing them.

Exclusion Rules

| Rule | Description | Opus 4.5 | Opus 4.6 |
|---|---|---|---|
| slash_command | Task prompt is a slash command (/command) or <command-name> tag; these invoke built-in features, not model reasoning | | |
| system_continuation | Automatic continuations triggered by the system (e.g., context compaction boundaries, session resumptions) rather than deliberate user prompts | | |
| empty_continuation | Bare acknowledgement prompts ("continue", "ok", "yes") with zero tool calls and <5s duration; the model produced no meaningful work | | |
| no_response_interrupt | Tasks where the model produced zero output (0 tool calls, 0 duration) before the session ended, typically user cancellations | | |

Informational Flags

These flags are preserved on included tasks for subgroup analysis but do not trigger exclusion:

  • meta — Task occurred within a meta-analysis session (e.g., this report's own development), where the model analyzed its own output
  • no_project — No project directory was associated with the session
  • interrupted — User interrupted the model mid-work (next message was [Request interrupted]). Reasons vary: accidental, correction, redirection, or technical issues
  • post_compaction — Task occurred after a context compaction event in the same session, potentially with degraded context

Project Overlap & Sensitivity Analysis

A potential confound arises from unequal project coverage between models. To quantify this, a sensitivity analysis compares all statistical tests on the full dataset against a restricted subset containing only tasks from projects where both models were active. If results agree across both analyses, the project confound is unlikely to explain observed differences.

This section documents how each pipeline step works. Each step includes a summary of the approach and a collapsible detail block with thresholds, algorithms, and parameters.

Task Extraction & Classification

Each Claude Code session was segmented into tasks at user-message boundaries. An LLM annotator (Haiku) then classified each task for complexity, type, sentiment, completion status, and alignment score (1–5 scale). Behavioral metrics—subagent usage, planning, parallelization—were extracted directly from tool-call logs.
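Segmentation at user-message boundaries can be sketched as follows; the message schema (`{"role": ..., "content": ...}`) is an assumption:

```python
# A new task starts at each user message; everything until the next
# user message (assistant turns, tool results) belongs to that task.
def segment_tasks(messages: list[dict]) -> list[list[dict]]:
    tasks: list[list[dict]] = []
    for msg in messages:
        if msg["role"] == "user" or not tasks:
            tasks.append([])        # user message opens a new task
        tasks[-1].append(msg)
    return tasks

session = [
    {"role": "user", "content": "add a retry flag"},
    {"role": "assistant", "content": "Done."},
    {"role": "user", "content": "now document it"},
    {"role": "assistant", "content": "Added docs."},
]
print(len(segment_tasks(session)))  # 2 tasks
```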

Sentiment detection detail

Three independent signal sources feed into sentiment aggregation:

  1. Keyword patterns: Regex matching in user messages for positive signals (thanks, perfect, excellent, looks good), negative signals (wrong, incorrect, please fix, revert, undo), and continuation signals (now, next, also, can you). Confidence: low (0 hits), medium (1–2), high (≥3).
  2. Structural edit signals: Self-corrections (consecutive edits overlapping on same file), error recoveries (edits within 10 message indices of an error), user corrections (redirect patterns in the next user message), and rewrite rate (overlaps / total edits).
  3. LLM judgement: Haiku classifies the full task context. Free-text sentiment is normalized to satisfied/neutral/dissatisfied/ambiguous via pattern matching.

Aggregation uses downgrade logic: if edit signals contradict the LLM (e.g., user corrections present but LLM says “satisfied”), the combined score is downgraded. If rewrite rate >0.3 but execution quality is “excellent,” the quality score is downgraded to “good.”
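A minimal sketch of the confidence tiers and downgrade logic described above; the labels and thresholds come from the text, but the function structure is illustrative, not the pipeline's code:

```python
# Keyword-signal confidence: low (0 hits), medium (1-2), high (>=3).
def keyword_confidence(hits: int) -> str:
    if hits >= 3:
        return "high"
    return "medium" if hits >= 1 else "low"

# Downgrade logic: edit signals that contradict the LLM judgement pull
# the combined score down rather than being averaged away.
def combine(llm_sentiment: str, user_corrections: int,
            rewrite_rate: float, quality: str) -> tuple[str, str]:
    sentiment = llm_sentiment
    if user_corrections > 0 and llm_sentiment == "satisfied":
        sentiment = "neutral"          # corrections contradict "satisfied"
    if rewrite_rate > 0.3 and quality == "excellent":
        quality = "good"               # high rewrite rate caps quality
    return sentiment, quality

print(combine("satisfied", user_corrections=2, rewrite_rate=0.4,
              quality="excellent"))   # ('neutral', 'good')
```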

Cross-Cut Dimensions

All 529 statistical tests are run both at the overall level and stratified across three cross-cut dimensions. Each section’s “Cross-Cut Detail” expansion shows how its metrics behave under each slice.

Cross-cut dimension definitions
| Dimension | Levels | Method |
|---|---|---|
| Complexity | trivial (≤3 tools, ≤1 file, ≤20 lines), simple (≤10, ≤3, ≤100), moderate (≤30, ≤10, ≤500), complex (≤80, ≤25, ≤2000), major (above all thresholds) | Metric thresholds on tool calls, files touched, and lines changed. Lowest matching tier wins. Keyword heuristics as tiebreaker. |
| Task type | investigation, bugfix, feature, greenfield, refactor, sysadmin, docs, continuation, port | LLM-classified (Haiku) from user prompt, tool usage, and work summary. Regex pattern matching provides an initial signal; LLM classification overrides at medium/high confidence, resolving previously “unknown” tasks (33.6% of dataset). Eval: 100% unknown resolution; LLM agrees with regex on 55% of classified tasks. |
| Iteration | one_shot (no back-and-forth), minor (small corrections), significant (multiple rework cycles) | LLM-classified from the user’s next message after task completion, informed by edit-signal heuristics (self-corrections, rewrite rate). |
Minimum sample sizes: Cross-cut cells with fewer than 5–10 observations (depending on test type) are excluded from statistical testing. This primarily affects the “major” complexity tier (n=26/23) and rare task types.
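The complexity thresholds translate directly into a lowest-tier-wins classifier. This sketch omits the keyword-heuristic tiebreaker:

```python
# Tiers in ascending order: (name, max tool calls, max files, max lines).
# A task lands in the lowest tier whose three limits it satisfies
# simultaneously; anything above every threshold is "major".
TIERS = [
    ("trivial",   3,  1,   20),
    ("simple",   10,  3,  100),
    ("moderate", 30, 10,  500),
    ("complex",  80, 25, 2000),
]

def classify_complexity(tools: int, files: int, lines: int) -> str:
    for name, max_tools, max_files, max_lines in TIERS:
        if tools <= max_tools and files <= max_files and lines <= max_lines:
            return name
    return "major"

print(classify_complexity(tools=2, files=1, lines=10))      # trivial
# 12 tool calls exceeds the "simple" limit, so moderate wins:
print(classify_complexity(tools=12, files=2, lines=50))     # moderate
print(classify_complexity(tools=120, files=40, lines=5000)) # major
```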

Edit Timeline Analysis

The edit timeline reconstructs a per-file content ownership history from every Edit/Write tool call across all sessions. When a later edit’s old_string overlaps with content placed by an earlier edit, a rewrite is detected—providing a mechanistic signal for self-correction that doesn’t depend on sentiment classification.

Overlap detection tiers and classification

Overlaps are matched via three tiers, evaluated in order:

| Tier | Method | Threshold |
|---|---|---|
| Exact | String equality between prior new_string and later old_string | 100% match |
| Containment | Substring match with size constraints | ≥40 chars AND ≥30% of larger string |
| Line overlap | Jaccard coefficient on non-trivial lines (>15 chars) | Jaccard >0.3 OR coverage >0.5 |

Each detected overlap is classified by context:

  • Self-correction: Same task, no intervening user prompt or errors
  • Error recovery: Error detected between the two edits
  • User-directed: Dissatisfaction keyword in intervening user message
  • Iterative refinement: Chain depth >3, or none of the above

A per-task triage score weights these: (self_corrections×3 + error_recoveries×2 + user_corrections×5 + max_chain_depth) / total_edits. Edit metrics were joined with task classifications to compute complexity-binned accuracy rates (100% coverage for both models).
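A sketch of the three matching tiers and the triage score, using the thresholds stated above; the helper structure is illustrative, not the pipeline's implementation:

```python
# Lines shorter than 16 chars are considered trivial for tier 3.
def nontrivial_lines(s: str) -> set[str]:
    return {ln.strip() for ln in s.splitlines() if len(ln.strip()) > 15}

def overlaps(prior_new: str, later_old: str) -> bool:
    if prior_new == later_old:                       # tier 1: exact
        return True
    small, big = sorted((prior_new, later_old), key=len)
    if small in big and len(small) >= 40 and len(small) >= 0.3 * len(big):
        return True                                  # tier 2: containment
    a, b = nontrivial_lines(prior_new), nontrivial_lines(later_old)
    if a and b:                                      # tier 3: line overlap
        inter = len(a & b)
        jaccard = inter / len(a | b)
        coverage = inter / min(len(a), len(b))
        return jaccard > 0.3 or coverage > 0.5
    return False

# Triage score weighting from the text: user corrections weigh most.
def triage_score(self_corr: int, err_rec: int, user_corr: int,
                 max_chain_depth: int, total_edits: int) -> float:
    return (self_corr * 3 + err_rec * 2 + user_corr * 5
            + max_chain_depth) / total_edits

print(overlaps("def load_config(path):", "def load_config(path):"))  # True
print(triage_score(2, 1, 0, 3, 10))                                  # 1.1
```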

Compaction Analysis

Claude Code compacts conversation context when token limits approach. This analysis measures whether compaction degrades task outcomes or merely correlates with session position.

Compaction detection and outcome measurement

86 compact_boundary system messages were found across 54 compacting sessions, with trigger type, pre-compaction token count, and session position extracted for each. Outcome impact was measured by splitting tasks into pre/post groups at the first compaction timestamp. A control group of non-compacting sessions, split at the median compaction position, isolates position effects from compaction effects.
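The pre/post and control splits can be sketched as follows; the task and event schemas are assumptions:

```python
# Split a compacting session's tasks at the first compact_boundary
# timestamp: everything before is "pre", everything after is "post".
def split_at_first_compaction(tasks: list[dict], compactions: list[float]):
    t0 = min(compactions)
    pre = [t for t in tasks if t["ts"] < t0]
    post = [t for t in tasks if t["ts"] >= t0]
    return pre, post

# Control: split a non-compacting session at the median compaction
# position (fraction of session elapsed) to isolate position effects.
def control_split(tasks: list[dict], median_position: float):
    cut = int(len(tasks) * median_position)
    return tasks[:cut], tasks[cut:]

tasks = [{"ts": i} for i in range(10)]
pre, post = split_at_first_compaction(tasks, compactions=[6.5])
print(len(pre), len(post))                  # 7 3
early, late = control_split(tasks, median_position=0.6)
print(len(early), len(late))                # 6 4
```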

Statistical Testing

529 tests were conducted across overall, per-complexity, and cross-cut strata, using Bonferroni correction, the most conservative standard, to minimize false positives given the observational design.

Test types, effect sizes, and correction thresholds

Three test types were used: chi-square for categorical distributions (effect size: Cramér’s V), Mann-Whitney U for continuous metrics (Cohen’s d with bootstrap confidence intervals, n=5,000 resamples), and two-proportion Z-tests for rates (Cohen’s h). Confidence intervals on proportions use Wilson score intervals.

Bonferroni corrected threshold: p<0.0000945 (0.05/529). Across all 529 tests, 141 survive Bonferroni and 234 survive FDR correction. At the overall level, 21 survive Bonferroni, including duration (p<0.000001, though d=−0.000, a negligible practical effect), tool calls/task (p<0.000001, d=−0.22), tools/file (p<0.000001, d=−0.10), and three categorical distributions (task completion, communication quality, autonomy level); alignment score (p=0.000714) clears p<0.05 but falls short of the corrected threshold. Two of the three chi-square survivors have low-expected-cell-count warnings, which may inflate their test statistics.
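The test battery can be sketched on synthetic data using only scipy and numpy. The sample values below are fabricated for illustration, except the `failed_rate` proportions, which come from the results table:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
a = rng.normal(9, 4, 400)    # e.g. tool calls per task, model A (synthetic)
b = rng.normal(13, 5, 200)   # model B (synthetic)

# Mann-Whitney U for continuous metrics, with Cohen's d as effect size.
u, p = stats.mannwhitneyu(a, b, alternative="two-sided")
pooled_sd = np.sqrt(((len(a) - 1) * a.var(ddof=1)
                     + (len(b) - 1) * b.var(ddof=1))
                    / (len(a) + len(b) - 2))
d = (a.mean() - b.mean()) / pooled_sd

# Cohen's h for a rate difference (failed_rate, 4.5 vs 4.6).
p1, p2 = 0.120, 0.054
h = 2 * np.arcsin(np.sqrt(p1)) - 2 * np.arcsin(np.sqrt(p2))

# Bonferroni threshold for 529 tests: 0.05 / 529 ~= 9.45e-05.
alpha = 0.05 / 529
print(p < alpha, round(d, 2), round(h, 2))
```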

Complete statistical test results (529 tests)
| Test | Field | p-value | Effect size | Result (A = Opus 4.5, B = Opus 4.6) |
|---|---|---|---|---|
| Mann-Whitney U | alignment_score | 0.000714 | d = −0.1337 | Opus 4.6 higher (p < 0.05); CI A [3.0, 3.1], CI B [3.1, 3.2] |
| Mann-Whitney U | duration_seconds | 0.000000 | d = −0.0000 | Opus 4.6 lower (Bonferroni significant); CI A [154.1, 550.6], CI B [206.9, 499.5] |
| Mann-Whitney U | tool_calls | 0.000000 | d = −0.2214 | Opus 4.6 higher (Bonferroni significant); CI A [8.2, 9.7], CI B [11.6, 14.2] |
| Mann-Whitney U | files_touched | 0.049942 | d = −0.1628 | Opus 4.6 higher (p < 0.05); CI A [1.5, 1.8], CI B [2.0, 2.7] |
| Mann-Whitney U | lines_added | 0.835121 | d = 0.0092 | No significant difference; CI A [86.2, 125.4], CI B [82.8, 119.8] |
| Mann-Whitney U | lines_removed | 0.147622 | d = −0.0929 | No significant difference; CI A [19.3, 25.4], CI B [23.3, 37.4] |
| Mann-Whitney U | lines_per_minute | 0.121768 | d = 0.1424 | No significant difference; CI A [36.9, 43.8], CI B [26.4, 33.9] |
| Mann-Whitney U | tools_per_file | 0.000000 | d = −0.1033 | Opus 4.6 higher (Bonferroni significant); CI A [4.0, 4.7], CI B [4.7, 5.4] |
| Proportion test | satisfaction_rate | 0.043843 | h = 0.0813 | Opus 4.6 lower (p < 0.05); A 23.5% [21.7%, 25.5%], B 20.2% [17.7%, 22.9%] |
| Proportion test | dissatisfaction_rate | 0.403717 | h = 0.0336 | No significant difference; A 11.8% [10.5%, 13.4%], B 10.8% [8.9%, 12.9%] |
| Proportion test | complete_rate | 0.000000 | h = −0.4445 | Opus 4.6 higher (Bonferroni significant); A 38.9% [36.7%, 41.1%], B 60.9% [57.8%, 64.0%] |
| Proportion test | failed_rate | 0.000000 | h = 0.2365 | Opus 4.6 lower (Bonferroni significant); A 12.0% [10.6%, 13.5%], B 5.4% [4.2%, 7.1%] |
| Proportion test | scope_expanded_rate | 0.001178 | h = 0.1551 | Opus 4.6 lower (p < 0.05); A 1.8% [1.3%, 2.5%], B 0.3% [0.1%, 0.9%] |
| Proportion test | one_shot_rate | 0.000000 | h = −0.3747 | Opus 4.6 higher (Bonferroni significant); A 42.0% [39.8%, 44.2%], B 60.6% [57.5%, 63.7%] |
| Proportion test | good_execution_rate | 0.000000 | h = 0.2075 | Opus 4.6 lower (Bonferroni significant); A 31.5% [29.4%, 33.6%], B 22.3% [19.8%, 25.1%] |
| Chi-square | task_completion | 0.000000 | V = 0.2218 | Distribution differs (p < 0.05) |
| Chi-square | scope_management | 0.000000 | V = 0.2963 | Distribution differs (low cell counts) |
| Chi-square | iteration_required | 0.000000 | V = 0.2076 | Distribution differs (low cell counts) |
| Chi-square | error_recovery | 0.000268 | V = 0.1399 | Distribution differs (low cell counts) |
| Chi-square | communication_quality | 0.000000 | V = 0.2038 | Distribution differs (low cell counts) |
| Chi-square | autonomy_level | 0.000000 | V = 0.2522 | Distribution differs (low cell counts) |

Sensitivity Analysis

To validate robustness, all overall-level tests were re-run on a restricted dataset containing only tasks from projects where both models were active. This tests whether findings depend on the specific project mix or are stable across the data.

Sensitivity Analysis: Robustness Validation

The restricted dataset is limited to the 8 projects where both models were active, testing whether findings depend on the specific project mix.

Overall Bonferroni survivors: 15 (full dataset) vs 16 (restricted).

| Metric | Test | Full p | Restricted p | Persists? |
|---|---|---|---|---|
| Task Completion | Chi-square | 3.00e-06 | 0.00e+00 | Yes |
| Communication Quality | Chi-square | 0.00e+00 | 0.00e+00 | Yes |
| Autonomy Level | Chi-square | 0.00e+00 | 0.00e+00 | Yes |
| Alignment Score | Mann-Whitney | 1.00e-06 | 0.00e+00 | Yes |
| Duration Seconds | Mann-Whitney | 0.00e+00 | 0.00e+00 | Yes |
| Tool Calls | Mann-Whitney | 0.00e+00 | 0.00e+00 | Yes |
| Tools Per File | Mann-Whitney | 0.00e+00 | 0.00e+00 | Yes |
| Total Output Tokens | Mann-Whitney | 0.00e+00 | 0.00e+00 | Yes |
| Total Input Tokens | Mann-Whitney | 0.00e+00 | 0.00e+00 | Yes |
| Thinking Chars | Mann-Whitney | 0.00e+00 | 4.30e-05 | Yes |
| Request Count | Mann-Whitney | 0.00e+00 | 0.00e+00 | Yes |
| Cost Per Minute | Mann-Whitney | 1.00e-06 | 2.90e-05 | Yes |
| Output Per Request | Mann-Whitney | 0.00e+00 | 0.00e+00 | Yes |
| Cache Hit Rate | Mann-Whitney | 0.00e+00 | 0.00e+00 | Yes |
| Thinking Fraction | Mann-Whitney | 0.00e+00 | 0.00e+00 | Yes |

Ranked Findings

| # | Measurement | Theme | Direction | Effect | p_adj | Sig |
|---|---|---|---|---|---|---|
| 1 | Thinking Fraction | Thinking | opus-4-5 higher | 0.636 (medium) | 0.0000 | Bonf |
| 2 | Complete Rate | Quality | opus-4-6 higher | 0.445 (small) | 0.0000 | Bonf |
| 3 | Total Output Tokens | Cost | opus-4-6 higher | 0.388 (small) | 0.0000 | Bonf |
| 4 | One Shot Rate | Behavior | opus-4-6 higher | 0.375 (small) | 0.0000 | Bonf |
| 5 | Output Per Request | Cost | opus-4-6 higher | 0.299 (small) | 0.0000 | Bonf |
| 6 | Scope Management | Behavior | distributions differ | 0.296 (small) | 0.0000 | Bonf |
| 7 | Autonomy Level | Behavior | distributions differ | 0.252 (small) | 0.0000 | Bonf |
| 8 | Cost Per Minute | Cost | opus-4-5 higher | 0.237 (small) | 0.0000 | Bonf |
| 9 | Failed Rate | Quality | opus-4-5 higher | 0.236 (small) | 0.0000 | Bonf |
| 10 | Task Completion | Quality | distributions differ | 0.222 (small) | 0.0000 | Bonf |
| 11 | Tool Calls | Behavior | opus-4-6 higher | 0.221 (small) | 0.0000 | Bonf |
| 12 | Iteration Required | Behavior | distributions differ | 0.208 (small) | 0.0000 | Bonf |
| 13 | Good Execution Rate | Quality | opus-4-5 higher | 0.207 (small) | 0.0000 | Bonf |
| 14 | Communication Quality | Behavior | distributions differ | 0.204 (small) | 0.0000 | Bonf |
| 15 | Cache Hit Rate | Cost | equal | 0.192 (negligible) | 0.0000 | Bonf |
| 16 | Request Count | Cost | opus-4-6 higher | 0.190 (negligible) | 0.0000 | Bonf |
| 17 | Scope Expanded Rate | Behavior | opus-4-5 higher | 0.155 (negligible) | 0.0036 | FDR |
| 18 | Error Recovery | | distributions differ | 0.140 (negligible) | 0.0009 | FDR |
| 19 | Thinking Chars | Thinking | opus-4-5 higher | 0.140 (negligible) | 0.0000 | Bonf |
| 20 | Alignment Score | Quality | equal | 0.134 (negligible) | 0.0022 | FDR |
| 21 | Normalized Execution Quality | Quality | distributions differ | 0.133 (negligible) | 0.0000 | Bonf |
| 22 | Total Input Tokens | Cost | opus-4-5 higher | 0.119 (negligible) | 0.0000 | Bonf |
| 23 | Tools Per File | Behavior | opus-4-6 higher | 0.103 (negligible) | 0.0000 | Bonf |
| 24 | Normalized User Sentiment | Quality | distributions differ | 0.064 (negligible) | 0.0236 | FDR |
| 25 | Duration Seconds | Behavior | opus-4-6 higher | 0.000 (negligible) | 0.0000 | Bonf |
13 non-significant overall results
| Measurement | Theme | Effect | p_adj |
|---|---|---|---|
| Files Touched | Behavior | 0.163 | 0.1005 |
| Lines Per Minute | Behavior | 0.142 | 0.2091 |
| Rewrite Rate | Editing | 0.126 | 0.0528 |
| Triage Score | Editing | 0.123 | 0.0587 |
| Max Chain Depth | Editing | 0.094 | 0.0690 |
| Lines Removed | Editing | 0.093 | 0.2366 |
| Has Overlaps Rate | Editing | 0.082 | 0.0899 |
| Satisfaction Rate | Quality | 0.081 | 0.0906 |
| Has Edits Rate | Editing | 0.080 | 0.0929 |
| Estimated Cost | Cost | 0.047 | 0.8292 |
| Overlap Count | Editing | 0.042 | 0.0894 |
| Dissatisfaction Rate | Quality | 0.034 | 0.5134 |
| Lines Added | Editing | 0.009 | 0.8647 |

Development Process

This analysis was developed iteratively. Two early approaches were replaced after proving unreliable:

Abandoned approaches

LLM-only dissatisfaction detection: Initial LLM-based sentiment classification flagged 7–9% dissatisfaction for both models. An audit of all 59 flagged cases revealed 73–93% false positive rates—the classifiers were fooled by task-coordination language (e.g., “fix” in subagent prompts). This was replaced by the current multi-signal approach, which requires corroboration from keyword patterns and structural edit signals before classifying dissatisfaction.

LLM quality judgement: An LLM judge was asked to compare code quality between models. The judge lacked sufficient context to evaluate whether code met domain requirements and produced confident but ungrounded assessments. This was replaced by mechanistic edit timeline analysis, which detects self-corrections from the tool-call record rather than relying on subjective quality assessment.

Reproducing This Analysis

Reproduction guide

The analysis pipeline is fully automated and can reproduce all tables and statistics from the raw session data.

Requirements

  • Python 3.11+
  • scipy (pip install scipy)
  • Claude Code CLI (for LLM-dependent steps only)

Full pipeline

python scripts/run_pipeline.py --data-dir comparisons/opus-4.5-vs-4.6/data

Tables only (no LLM, no cost)

python scripts/run_pipeline.py --data-dir comparisons/opus-4.5-vs-4.6/data --no-llm

Individual steps

# Run from a specific step onward
python scripts/run_pipeline.py --data-dir comparisons/opus-4.5-vs-4.6/data --from stats

# Run specific steps
python scripts/run_pipeline.py --data-dir comparisons/opus-4.5-vs-4.6/data --steps dataset,update,report

# Check what needs re-running
python scripts/run_pipeline.py --data-dir comparisons/opus-4.5-vs-4.6/data --check-stale

Step cost breakdown

| Step | LLM? | Estimated Cost | --no-llm behavior |
|---|---|---|---|
| collect | No | $0 | Runs normally |
| extract | No | $0 | Runs normally |
| classify | No | $0 | Runs normally |
| annotate | Yes (Haiku) | ~$7.50 | Skipped (uses cached annotations) |
| analyze | No | $0 | Runs normally |
| tokens | No | $0 | Runs normally |
| enrich | No | $0 | Runs normally |
| stats | No | $0 | Runs normally |
| findings | No | $0 | Runs normally |
| dataset | No | $0 | Runs normally |
| update | Partial (Opus) | ~$2.00 | Tables only, no LLM expression authoring |
| report | No | $0 | Runs normally |

All statistical results, tables, and charts are deterministic (no LLM). Only task annotation and expression authoring use LLM calls. The --no-llm flag produces identical quantitative results at zero API cost. Total full-pipeline LLM cost: ~$9.50.

Cost methodology

Annotate cost estimated by reconstructing all 3,153 annotation prompts from canonical task data, measuring character counts of prompts (~3,700 chars median) and cached responses (~1,500 chars median), converting at ~4 chars/token, and applying Haiku 4.5 pricing ($0.80/MTok input, $4.00/MTok output). Includes ~20% backfill rate for task-type classification calls. Update cost estimated from the annotated template size (~490K chars, ~122K tokens input) with Opus 4.6 pricing ($15/MTok input, $75/MTok output); one LLM call per pipeline run.
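The annotate and update estimates can be roughly reconstructed from the quoted approximations; backfill calls and the update step's output side account for the remaining gap to the ~$7.50 and ~$2.00 figures:

```python
# Rough reconstruction of the per-step cost estimates. All inputs are
# the approximations quoted above, not exact pipeline accounting.
CHARS_PER_TOKEN = 4
prompts = 3153
in_tok = 3700 / CHARS_PER_TOKEN       # ~925 tokens per annotation prompt
out_tok = 1500 / CHARS_PER_TOKEN      # ~375 tokens per cached response

haiku_in, haiku_out = 0.80, 4.00      # Haiku 4.5 pricing, $/MTok
annotate = prompts * (in_tok * haiku_in + out_tok * haiku_out) / 1e6
print(f"annotate ~ ${annotate:.2f}")  # ~$7.06 before the ~20% backfill

opus_in = 15.0                        # Opus 4.6 input pricing, $/MTok
update = 122_000 * opus_in / 1e6      # input side only, one call per run
print(f"update ~ ${update:.2f}")      # ~$1.83 before output tokens
```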

Limitations

LLM-in-the-loop analysis: All task classification, sentiment analysis, and alignment scoring was performed by LLM agents (Claude Haiku and Sonnet). This creates a circularity concern: Claude models are classifying Claude model outputs. No formal inter-rater reliability was computed. Human spot-checks validated flagged cases, but systematic bias between models (e.g., if the classifier is more generous toward outputs that resemble its own style) cannot be ruled out. All three overall chi-square Bonferroni survivors and the alignment score depend on LLM-generated categories.

Single user: All data comes from one developer’s workflow. Results may not generalize to other users, codebases, or task distributions.

Temporal confound: Opus 4.5 spans 70 days; Opus 4.6 spans 13 days. A productive week, a particular project focus, or simply the novelty of a new model could color all 937 Opus 4.6 tasks simultaneously. The null hypothesis—that all observed differences reflect the user’s changing work patterns rather than model capabilities—cannot be rejected by this design.

Observational, not experimental: Tasks were not randomly assigned to models. Opus 4.6 was used later chronologically and on different (often harder) tasks, confounding model effects with task effects.

Complexity confound: Opus 4.6’s different complexity mix (41% moderate-and-above vs 34% for Opus 4.5) inflates its resource usage metrics and may suppress its satisfaction scores. Complexity-stratified comparisons (presented throughout as cross-cut detail) partially control for this, but cannot fully separate model effects from mix effects.

Platform evolution: The Claude Code SDK evolved between December 2025 and February 2026. Changes to system prompts, available tools, or subagent defaults could contribute to behavioral differences attributed to the models.

Sample asymmetry: The 2.0:1 ratio (1,900 vs 937 tasks) means Opus 4.5 estimates have narrower confidence intervals. Effect sizes for Opus 4.6 are less precise.

User learning effect: The user may have learned to use Claude Code more effectively over time, benefiting whichever model came second in the chronological sequence.

Thanks to Anthropic for including me in the Claude Code Early Access Program and for supporting independent research into model behavior. The EAP provided early access to Opus 4.6, making this comparative analysis possible.