Across 529 statistical tests (overall, per-complexity, and cross-cut strata), 141 survive Bonferroni correction—21 at the overall level. Most describe how the model works, not whether it succeeds. The overall success rates are comparable; what changes is the experience of working alongside it. These are the five most noticeable differences, drawn from one user’s workflow over 2,837 tasks.
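For context, Bonferroni correction simply divides the family-wise significance level by the number of tests. A minimal sketch (the α = 0.05 level is an assumption; the report does not state its threshold):

```python
def bonferroni_survives(p_value: float, n_tests: int, alpha: float = 0.05) -> bool:
    """Return True if p_value clears the Bonferroni-corrected threshold alpha / n_tests."""
    return p_value < alpha / n_tests

# With 529 tests at alpha = 0.05, the per-test threshold is about 9.45e-5,
# so a p-value like the alignment result's 0.000714 survives only in smaller families.
threshold = 0.05 / 529
```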
Opus 4.6 uses formal planning mode on 12.3% of tasks vs 1.8% for 4.5, rising to 43% at complex and 65% at major difficulty. It front-loads codebase investigation with a 2.3× longer explore phase, deploying subagents that are 69% read-only researchers (vs 49% for 4.5). The practical effect is a shift from interactive collaboration to delegation: you issue a prompt and return to find completed work rather than course-correcting mid-task. This shows up in the data as fewer user-directed corrections across all complexity levels (§4, §8).
The largest overall effect across all 529 tests is thinking fraction (d=0.64, medium, §3). Opus 4.5 activates extended thinking on 75% of requests regardless of difficulty; 4.6 activates on 59% but averages 4,067 characters when it does (vs 2,578). On trivial tasks, 4.6 often skips thinking entirely. On complex tasks, it thinks deeply. This calibration means compute is allocated where it matters rather than spread uniformly across every interaction.
Opus 4.6 rewrites its own edits 11.6% of the time vs 18.2%—a 36% reduction. Its self-correction rate is actually higher (3.5% vs 1.8%), meaning it catches its own mistakes rather than having the user point them out. Failure rates drop from 12.0% to 5.4%, and alignment scores improve significantly (p=0.000714, one of 21 overall Bonferroni survivors). The “plan first” approach appears to pay off in execution accuracy (§5, §6).
Median task duration rises 46% (62s vs 42s), with fewer ultra-short interactions (34% of tasks under 30 seconds vs 42%). The task mix shifts toward moderate-to-complex work issued in a single instruction, and 4.6 runs more tasks in the background for parallel execution. This isn’t purely a model capability difference—it’s a workflow adaptation. When the model handles larger tasks reliably, the user gives it larger tasks, waits longer, and intervenes less. The 7× increase in planning mode (§4) and 44% more tool calls per task (§7) are partly a consequence of this delegation shift.
Despite 2.5× more output tokens and 44% more tool calls per task, 4.6 is 13–37% cheaper at trivial through moderate complexity—the bulk of daily work. The reason is counterintuitive: output tokens account for just 6.7% of per-task cost, while cache operations account for 93–97%. Opus 4.6 writes 29% less to cache (the most expensive token category at $18.75/MTok), more than offsetting its higher output and cache reads. Cost only tips higher at 30+ API requests, where cumulative cache reads compound past the write savings. Overall per-task cost is $2.56 vs $2.44: functionally neutral for a meaningfully different style of work (§2).
The data is consistent with a tentative characterization: Opus 4.5 acts first and adjusts, while Opus 4.6 investigates first and implements in concentrated bursts. The confounded study design means this framing is a hypothesis, not a conclusion. The analysis was iterative—several initial findings were revised or reversed when more direct signals became available (§10).
This report is itself a Claude Code project. The analysis pipeline, statistical tests, table generation, and report assembly are all automated Python scripts, most written with substantial assistance from Opus 4.6—the same model being evaluated. LLMs are used in two places: task classification (Haiku annotates complexity, sentiment, and task type) and prose drafting. Most quantitative claims—numbers, tables, and statistical tests—are produced by deterministic computation, not LLM generation. All data comes from one user’s real Claude Code sessions during and after the Early Access Program—not synthetic benchmarks or controlled experiments.
A 12-step pipeline transforms raw JSONL session logs into the finished report. A few things worth noting about the approach:
Numbers in the prose are bound to analysis data through a templating expression system (`{{expr | format}}`), and most tables are generated from spec files. This reduces transcription errors and makes it easier to keep prose in sync with the data as the analysis evolves. The pipeline can be re-run end-to-end to reproduce the report from raw session logs.

Most numbers, tables, and statistical results are computed deterministically from analysis JSON files and are reproducible by re-running the pipeline. The expression system binds prose to data paths, which helps catch drift but is not a guarantee of correctness. Interpretive prose was drafted with LLM assistance and may contain errors or overstatements. Effect sizes and p-values are exact; narrative claims linking those numbers to causal explanations are hypotheses, not conclusions. A sensitivity analysis validates key findings against restricted datasets excluding shared projects. The Methodology section describes every step in full detail.
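A minimal sketch of the expression idea, resolving a dotted data path against analysis JSON and applying a format spec. The syntax and helper shown here are illustrative, not the pipeline's actual implementation:

```python
import re

def render(template: str, data: dict) -> str:
    """Resolve {{path | format}} placeholders against nested analysis data.
    Illustrative only; the report's real expression system is internal."""
    def resolve(match: re.Match) -> str:
        path, _, fmt = match.group(1).partition("|")
        value = data
        for key in path.strip().split("."):
            value = value[key]  # walk the dotted path into the JSON
        return format(value, fmt.strip()) if fmt.strip() else str(value)
    return re.sub(r"\{\{(.+?)\}\}", resolve, template)
```

Binding prose to a data path this way means a re-run of the pipeline updates every quoted number in place, which is how drift between text and data gets caught.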
All data comes from a single user's organic Claude Code sessions between December 2025 and February 2026. The dataset is intentionally asymmetric: Opus 4.5 served as the primary model for two months, while Opus 4.6 entered evaluation in early February. This means Opus 4.5 totals are larger in absolute terms, but per-task and per-session comparisons normalize for this. Where sample size limits statistical power, the report notes it explicitly.
The 13-day concentration of the Opus 4.6 data creates a temporal clustering concern: a productive stretch, a particular project focus, or simply the novelty of a new model could color all 937 tasks simultaneously. The report treats tasks as independent observations, but short collection windows make this assumption weaker for 4.6 than for 4.5’s 70-day span.
The dataset reflects organic usage patterns, not a controlled experiment. Opus 4.5 accumulated sessions over two months of daily use; Opus 4.6 entered evaluation in early February 2026.
| Metric | Opus 4.5 | Opus 4.6 | Combined |
|---|---|---|---|
| Sessions | 329 | 189 | 518 |
| Tasks | 1,900 | 937 | 2,837 |
| Tasks / session | 5.8 | 5.0 | 5.5 |
| Projects | 29 | 22 | 41 |
| Date range | Dec 5 – Feb 13 | Feb 3 – Feb 16 | Dec 5 – Feb 16 |
| User prompts | 1,928 | 855 | 2,783 |
| API turns | 20,834 | 13,861 | 34,695 |
| Tool calls | 18,298 | 12,472 | 30,770 |
The 2:1 session ratio means per-task averages for Opus 4.5 are more robust, while Opus 4.6 estimates carry wider confidence intervals. Opus 4.6 sessions are concentrated across 22 projects (all of which also have Opus 4.5 sessions), providing natural overlap for matched-pair comparisons where they apply.
Tasks are classified by primary type using heuristic pattern matching on prompts, tool usage, and file operations. "Unknown" tasks lacked clear classification signals.
| Type | 4.5 count | 4.6 count |
|---|---|---|
| Continuation | 587 | 225 |
| Investigation | 463 | 217 |
| Feature | 216 | 104 |
| Bugfix | 205 | 61 |
| Sysadmin | 188 | 129 |
| Docs | 102 | 16 |
| Refactor | 54 | 48 |
| Greenfield | 30 | 33 |
| Port | 5 | 8 |
| Unknown | 50 | 96 |
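The heuristic pattern matching described above can be sketched as a keyword pass over the prompt. The patterns below are illustrative stand-ins; the real pipeline also inspects tool usage and file operations:

```python
def classify_task_type(prompt: str) -> str:
    """Keyword-based task typing (illustrative rules, not the pipeline's actual ones)."""
    rules = [
        ("bugfix", ("fix", "bug", "error", "broken")),
        ("refactor", ("refactor", "rename", "restructure")),
        ("docs", ("document", "readme", "docstring")),
        ("investigation", ("why", "how does", "explain", "investigate")),
    ]
    lowered = prompt.lower()
    for task_type, keywords in rules:
        if any(k in lowered for k in keywords):
            return task_type
    return "unknown"  # no clear classification signal
```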
Complexity is inferred from tool count, files touched, and lines changed. Over half of all tasks are trivial (single-turn interactions), while major tasks (>50 tool calls or >500 lines) represent ~1% of volume but a significant share of cost.
| Complexity | 4.5 count | 4.5 % | 4.6 count | 4.6 % |
|---|---|---|---|---|
| Trivial | 882 | 46.4% | 346 | 36.9% |
| Simple | 381 | 20.1% | 209 | 22.3% |
| Moderate | 413 | 21.7% | 247 | 26.4% |
| Complex | 198 | 10.4% | 112 | 12.0% |
| Major | 26 | 1.4% | 23 | 2.5% |
The task type distributions are broadly similar across models, suggesting the user's work patterns remained consistent. The complexity mix is also comparable, though Opus 4.6 has a slightly higher share of moderate-and-above tasks (40.8% vs 33.5%), likely reflecting the evaluation period's focus on substantive work rather than quick queries.
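The complexity inference can be sketched as threshold bucketing. Only the major-tier thresholds (>50 tool calls or >500 lines) are stated in the text; the lower cutoffs here are assumptions, and the real heuristic also considers files touched:

```python
def classify_complexity(tool_calls: int, lines_changed: int) -> str:
    """Bucket a task by effort. The 'major' thresholds come from the report;
    the lower cutoffs are illustrative assumptions."""
    if tool_calls > 50 or lines_changed > 500:
        return "major"
    if tool_calls > 20 or lines_changed > 150:   # assumed cutoff
        return "complex"
    if tool_calls > 8 or lines_changed > 40:     # assumed cutoff
        return "moderate"
    if tool_calls > 1:
        return "simple"
    return "trivial"  # single-turn interaction
```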
Raw token volumes across the full dataset. These are absolute totals, not per-task averages (see §2 for normalized comparisons).
| Metric | Opus 4.5 | Opus 4.6 | Combined |
|---|---|---|---|
| Output tokens | 2.0M | 2.5M | 4.5M |
| Input tokens (fresh) | 666,412 | 157,109 | 823,521 |
| Cache read tokens | 1.26B | 878.7M | 2.14B |
| Cache write tokens | 143.9M | 53.2M | 197.2M |
| Total API cost | $5,221.55 | $2,838.61 | $8,060.16 |
Model output splits into thinking (extended thinking / chain-of-thought, not billed as output) and text (visible response, code, tool calls). Estimated from character counts with a 3:1 chars-to-tokens ratio for thinking.
| Metric | Opus 4.5 | Opus 4.6 |
|---|---|---|
| Est. thinking tokens | 1,375,051 | 885,062 |
| Est. text tokens | 672,072 | 464,654 |
| Thinking ratio (tasks using thinking) | 74.8% | 58.9% |
| Avg requests / task | 7.4 | 9.5 |
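The estimation described above reduces to a single division, since thinking is logged as characters rather than billed token counts:

```python
def estimate_thinking_tokens(thinking_chars: int) -> int:
    """Estimate thinking tokens from logged character counts at the report's
    assumed 3 characters per token."""
    return thinking_chars // 3

# e.g. a 4,067-character thinking block is roughly 1,355 tokens
```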
| Metric | Opus 4.5 | Opus 4.6 | Combined |
|---|---|---|---|
| Files touched | 3,162 | 2,183 | 5,345 |
| Lines added | 197,538 | 93,984 | 291,522 |
| Lines removed | 42,320 | 28,173 | 70,493 |
Cache reads dominate the token budget: 91% of all tokens processed were served from cache rather than freshly encoded. This reflects Claude Code's prompt architecture, where the system prompt and conversation history are re-sent with each API call but largely hit the prompt cache.
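The 91% figure follows directly from the combined totals in the token table above (tokens processed = cache reads + cache writes + fresh input + output):

```python
# Combined totals from the dataset table (tokens)
cache_read  = 2_140_000_000   # 2.14B
cache_write =   197_200_000   # 197.2M
fresh_input =       823_521
output      =     4_500_000   # 4.5M

total = cache_read + cache_write + fresh_input + output
cached_fraction = cache_read / total   # ≈ 0.91
```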
With the dataset in view, we turn to what the token data reveals about how each model allocates its computational budget.
Opus 4.6 costs ~4.9% more per task on average ($2.56 vs $2.44), despite producing 2.5× more output tokens and making more API round-trips (9.5 vs 7.4 requests/task). This aggregate masks a complexity-dependent pattern: at trivial through moderate levels, 4.6 is 13–37% cheaper, driven by superior cache economics rather than output efficiency. Output tokens account for less than 7% of per-task cost; cache operations account for ~93%. 4.6 achieves a leaner cache footprint, writing 29% fewer tokens in the most expensive token category. The cost advantage reverses at the complex and major tiers, where accumulated cache reads over many requests outweigh the write savings.
| Task Type | 4.5 avg output | 4.6 avg output | 4.6/4.5 |
|---|---|---|---|
| Feature | 2,544 | 6,031 | 2.4× |
| Greenfield | 1,952 | 8,590 | 4.4× |
| Refactor | 2,674 | 5,647 | 2.1× |
| Bugfix | 1,133 | 3,773 | 3.3× |
| Investigation | 640 | 1,298 | 2.0× |
| Continuation | 595 | 1,291 | 2.2× |
| Sysadmin | 398 | 1,102 | 2.8× |
| Port | 7,175 | 1,680 | 0.2× |
| Docs | 824 | 784 | 1.0× |
The 2.1× ratio for refactoring is notable: Opus 4.6 produces substantially more output tokens for refactoring tasks, suggesting more thorough changes. For continuation tasks (follow-ups within a session), Opus 4.6 produces 2.2× the output volume of Opus 4.5.
| Complexity | 4.5 avg cost | 4.6 avg cost | Δ |
|---|---|---|---|
| Trivial | $0.87 | $0.55 | −37% |
| Simple | $2.21 | $1.49 | −33% |
| Moderate | $3.69 | $3.19 | −13% |
| Complex | $7.31 | $7.95 | +9% |
| Major | $13.65 | $18.47 | +35% |
Normalizing by session hours rather than task count: Opus 4.5 costs $12.57/session-hour ($5,222 over 415.4h) vs $6.74/session-hour for Opus 4.6 ($2,839 over 421.0h). Session hours measure wall-clock time from first to last message, so this metric includes idle time and is not a direct measure of active coding cost.
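The normalization is a simple division of the totals quoted above:

```python
def cost_per_session_hour(total_cost: float, session_hours: float) -> float:
    """Wall-clock normalization: total API cost over first-to-last-message hours."""
    return total_cost / session_hours

opus_45 = cost_per_session_hour(5221.55, 415.4)   # ≈ $12.57/session-hour
opus_46 = cost_per_session_hour(2838.61, 421.0)   # ≈ $6.74/session-hour
```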
| Complexity | 4.5 output/request | 4.6 output/request | Ratio |
|---|---|---|---|
| Trivial | 61 | 94 | 1.5× |
| Simple | 91 | 149 | 1.6× |
| Moderate | 129 | 234 | 1.8× |
| Complex | 146 | 305 | 2.1× |
| Major | 162 | 254 | 1.6× |
Per-request output survives Bonferroni correction overall (|d|=0.30, small; 4.6 higher). Opus 4.6 produces more tokens per API round-trip at every complexity level, concentrating work into larger responses rather than many small incremental calls.
The 2.5× output difference seems like it should dominate the cost comparison, but output tokens are a minor cost component. Cache operations dwarf everything else:
| Component | Price/MTok | 4.5/task | 4.5 cost | 4.6/task | 4.6 cost | % of total |
|---|---|---|---|---|---|---|
| Input | $15.00 | 312 | $0.005 | 142 | $0.002 | <1% |
| Output | $75.00 | 933 | $0.070 | 2,293 | $0.172 | 3–7% |
| Cache read | $1.875 | 589K | $1.10 | 792K | $1.49 | 45–58% |
| Cache write | $18.75 | 67K | $1.26 | 48K | $0.90 | 35–52% |
| Total | — | — | $2.44 | — | $2.56 | 100% |
Three forces offset to produce the $0.12 net difference. 4.6 writes 29% fewer tokens to cache per task (48K vs 67K), saving $0.36 at the most expensive category ($18.75/M—10× the read price). But 4.6 reads 34% more cached context (792K vs 589K), costing $0.38 at the cheapest category ($1.875/M). And the 2.5× output increase costs only $0.10. The write savings nearly cancel the read and output increases: −$0.36 + $0.38 + $0.10 = +$0.12.
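Those three deltas can be checked directly from the component table (a sketch; volumes rounded as in the table):

```python
def component_cost(tokens: float, price_per_mtok: float) -> float:
    """Cost of a token volume at a given $/MTok price."""
    return tokens * price_per_mtok / 1e6

# Per-task deltas, 4.6 minus 4.5, from the component table
write_delta  = component_cost(48_000 - 67_000, 18.75)    # ≈ −$0.36 (savings)
read_delta   = component_cost(792_000 - 589_000, 1.875)  # ≈ +$0.38
output_delta = component_cost(2_293 - 933, 75.00)        # ≈ +$0.10

net = write_delta + read_delta + output_delta            # ≈ +$0.12/task
```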
Why does 4.6 write less to cache? Per-request analysis reveals two mechanisms. First, 4.6 starts tasks with a leaner cache footprint—its first API request writes 50% fewer cache tokens than 4.5’s (24K vs 48K), suggesting a more compact or reusable context structure. Second, the first-request write penalty is steeper for 4.5—11.4× above its steady-state rate vs 7.2× for 4.6—so 4.5 pays a larger per-task initialization tax. The likely explanation: 4.6’s “investigate then execute” pattern creates more compact, reusable context, while 4.5’s incremental approach builds a larger accumulated context that costs more to establish.
This mechanism produces the complexity-dependent curve above. At trivial through moderate tiers (typically ≤30 API requests), lean initialization dominates and 4.6 is 13–37% cheaper. At major complexity, tasks average 50–70 API requests and cumulative cache reads compound past the initialization savings—4.6 reads 34% more cached context per task overall, and over enough round-trips this gap overwhelms the write savings.
Two hypotheses were tested to explain 4.6's superior cache economics: (1) 4.6 front-loads reads, keeping cache warm for subsequent requests; (2) 4.5 experiences more cache cooling between turns, causing expensive re-writes.
The hypothesis that 4.6 front-loads cache reads is weakly supported—4.6 concentrates 17.8% of cache reads in the first request vs 15.2% for 4.5. But the dominant signal is on the write side: 4.6's first request writes 49% fewer cache tokens (24.6K vs 48.1K median), and its overall cache hit rate is 89.5% vs 81.3%.
| Metric | Opus 4.5 | Opus 4.6 | Ratio |
|---|---|---|---|
| First-request cache read (avg) | 52,486 | 55,652 | 1.06× |
| First-request cache write (avg) | 48,108 | 24,561 | 0.51× |
| Overall cache hit rate | 81.3% | 89.5% | +8.2pp |
| Read/write ratio at position 15 | 33.2 | 46.7 | 1.4× |
As sessions extend, 4.6 maintains a better read/write ratio—by request position 15, it achieves 46.7× reads per write vs 33.2× for 4.5. This suggests 4.6 writes proportionally less new cache as sessions progress.
Grouping tasks by API request count confirms the crossover at 30+ requests:
| Request tier | 4.5 avg cost | 4.6 avg cost | Δ |
|---|---|---|---|
| 1 request | $0.81 | $0.33 | −60% |
| 2–3 requests | $1.25 | $0.83 | −34% |
| 4–10 requests | $2.35 | $2.04 | −14% |
| 11–30 requests | $5.31 | $4.81 | −9% |
| 30+ requests | $11.61 | $13.16 | +13% |
4.6's advantage is largest on single-request tasks (−60%), where its lean initialization is maximally visible, and erodes steadily as request count increases. At 30+ requests, 4.6 averages 52.7 requests/task vs 47.0 for 4.5, with slightly higher per-request cache reads (89K vs 80K)—enough to tip the balance.
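The tier grouping can be reproduced with a simple binning function. Boundary handling at exactly 30 requests is an assumption; the table labels leave it ambiguous:

```python
def request_tier(n_requests: int) -> str:
    """Bin a task by API request count into the tiers used above
    (the placement of exactly 30 is an assumption)."""
    if n_requests <= 1:
        return "1 request"
    if n_requests <= 3:
        return "2-3 requests"
    if n_requests <= 10:
        return "4-10 requests"
    if n_requests <= 30:
        return "11-30 requests"
    return "30+ requests"
```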
Both models experience cache cooling gaps (>5 minutes between task transitions) at nearly identical rates—26.7% for 4.5 vs 24.4% for 4.6. So the premise that 4.5 “stops more often” is not supported. However, the impact of cooling differs dramatically:
| Condition | 4.5 cache write fraction | 4.6 cache write fraction |
|---|---|---|
| After cold start (>5 min gap) | 38.6% | 11.9% |
| After warm start (≤5 min gap) | 6.0% | 5.7% |
| Cold/warm inflation | 6.4× | 2.1× |
After warm starts, both models behave identically (~6% write fraction). After cold starts, 4.5’s write fraction jumps to 38.6%—roughly 6.4× inflation—while 4.6 reaches only 11.9% (2.1×). In absolute terms: 4.5 re-writes 177K tokens after a cold start vs 87K for 4.6. This suggests 4.5 accumulates a larger context payload that is more expensive to reconstruct when cache expires.
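The write-fraction and inflation figures reduce to two ratios (the writes/(writes+reads) definition is an assumption consistent with the table):

```python
def write_fraction(cache_write: float, cache_read: float) -> float:
    """Share of cache traffic that is (expensive) writes: writes / (writes + reads).
    Assumed definition, consistent with the table above."""
    return cache_write / (cache_write + cache_read)

# Cold/warm inflation from the tabulated fractions
inflation_45 = 0.386 / 0.060   # ≈ 6.4×
inflation_46 = 0.119 / 0.057   # ≈ 2.1×
```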
The combined picture: 4.6’s “lean initialization” (fewer first-request writes) and “cold resistance” (smaller re-cache payload) together produce its cache advantage. Both effects point to the same underlying cause: 4.6’s concentrated work style creates a more compact context that costs less to establish and re-establish.
Thinking tokens are billed as output but do not enter the conversation history or affect cache behavior. Visible text output accumulates in history and increases subsequent input size. Cache writes ($18.75/MTok) are 10× more expensive than cache reads ($1.875/MTok).
| Complexity | 4.5 cache read % | 4.6 cache read % | Write ratio (4.6/4.5) |
|---|---|---|---|
| Trivial | 72.1% | 86.7% | 0.47× |
| Simple | 85.9% | 91.9% | 0.50× |
| Moderate | 90.8% | 93.9% | 0.65× |
| Complex | 92.8% | 94.9% | 0.83× |
| Major | 95.3% | 96.6% | 1.04× |
| Overall | 89.7% | 94.3% | 0.71× |
Per-task average token usage driving the cost differences. Opus 4.6 produces more output but uses less fresh input, relying more on cached context:
| Complexity | 4.5 output/task | 4.6 output/task | 4.5 thinking chars | 4.6 thinking chars | 4.5 input/task | 4.6 input/task | 4.5 requests | 4.6 requests |
|---|---|---|---|---|---|---|---|---|
| Trivial | 93 | 164 | 839 | 1,108 | 14 | 14 | 1.5 | 1.7 |
| Simple | 508 | 783 | 1,684 | 1,593 | 160 | 107 | 5.6 | 5.2 |
| Moderate | 1,467 | 2,798 | 3,828 | 3,916 | 554 | 275 | 11.4 | 12.0 |
| Complex | 3,766 | 9,118 | 7,264 | 9,911 | 1,332 | 462 | 25.9 | 29.9 |
| Major | 8,754 | 18,291 | 10,637 | 9,859 | 1,522 | 394 | 54.0 | 72.0 |
At the complex tier, Opus 4.6 uses far fewer fresh input tokens per task (462 vs 1,332) while producing 142% more output (9,118 vs 3,766). Despite similar request counts per task (29.9 vs 25.9), Opus 4.6 achieves much more effective cache utilization, and the output cost premium is offset by input savings.
Caveat: These averages come from organic sessions with different task mixes per model. Some of the per-complexity gap may reflect session-level factors (e.g., caching benefits accumulate within longer sessions).
| Measurement | Slice | Direction | Effect | padj | Sig |
|---|---|---|---|---|---|
| Total Output Tokens | task_type:bugfix | opus-4-6 higher | 0.849 | 0.0001 | Bonf |
| Total Output Tokens | complexity:complex | opus-4-6 higher | 0.838 | 0.0000 | Bonf |
| Request Count | task_type:refactor | opus-4-6 higher | 0.770 | 0.0110 | FDR |
| Output Per Request | task_type:refactor | opus-4-6 higher | 0.692 | 0.0011 | FDR |
| Cache Hit Rate | task_type:greenfield | equal | 0.671 | 0.0022 | FDR |
| Total Output Tokens | task_type:refactor | opus-4-6 higher | 0.661 | 0.0001 | Bonf |
| Total Output Tokens | task_type:feature | opus-4-6 higher | 0.644 | 0.0000 | Bonf |
| Estimated Cost | task_type:refactor | opus-4-6 higher | 0.638 | 0.0307 | FDR |
| Total Output Tokens | complexity:moderate | opus-4-6 higher | 0.531 | 0.0000 | Bonf |
| Total Output Tokens | iteration:significant | opus-4-6 higher | 0.516 | 0.0000 | Bonf |
| Request Count | iteration:significant | opus-4-6 higher | 0.493 | 0.0000 | Bonf |
| Output Per Request | task_type:feature | opus-4-6 higher | 0.487 | 0.0000 | Bonf |
| Total Output Tokens | iteration:one_shot | opus-4-6 higher | 0.421 | 0.0000 | Bonf |
| Cost Per Minute | complexity:complex | opus-4-5 higher | 0.407 | 0.0017 | FDR |
| Estimated Cost | complexity:simple | opus-4-5 higher | 0.400 | 0.0000 | Bonf |
| Total Output Tokens | task_type:sysadmin | opus-4-6 higher | 0.394 | 0.0000 | Bonf |
| Total Output Tokens | overall | opus-4-6 higher | 0.388 | 0.0000 | Bonf |
| Output Per Request | task_type:bugfix | opus-4-6 higher | 0.371 | 0.0005 | FDR |
| Cost Per Minute | complexity:moderate | opus-4-5 higher | 0.365 | 0.0001 | Bonf |
| Output Per Request | complexity:moderate | opus-4-6 higher | 0.345 | 0.0000 | Bonf |
| Total Input Tokens | complexity:complex | opus-4-5 higher | 0.344 | 0.0000 | Bonf |
| Output Per Request | iteration:minor | opus-4-6 higher | 0.344 | 0.0000 | Bonf |
| Output Per Request | complexity:complex | opus-4-6 higher | 0.338 | 0.0000 | Bonf |
| Total Output Tokens | task_type:investigation | opus-4-6 higher | 0.325 | 0.0000 | Bonf |
| Total Output Tokens | complexity:simple | opus-4-6 higher | 0.322 | 0.0000 | Bonf |
| Output Per Request | task_type:investigation | opus-4-6 higher | 0.321 | 0.0000 | Bonf |
| Total Output Tokens | iteration:minor | opus-4-6 higher | 0.319 | 0.0000 | Bonf |
| Request Count | task_type:feature | opus-4-6 higher | 0.305 | 0.0458 | FDR |
| Output Per Request | iteration:one_shot | opus-4-6 higher | 0.302 | 0.0000 | Bonf |
| Estimated Cost | iteration:significant | opus-4-6 higher | 0.300 | 0.0124 | FDR |
| Output Per Request | overall | opus-4-6 higher | 0.299 | 0.0000 | Bonf |
| Output Per Request | iteration:significant | opus-4-6 higher | 0.296 | 0.0000 | Bonf |
| Cost Per Minute | iteration:one_shot | opus-4-5 higher | 0.278 | 0.0000 | Bonf |
| Cost Per Minute | complexity:simple | opus-4-5 higher | 0.277 | 0.0039 | FDR |
| Cache Hit Rate | complexity:trivial | equal | 0.270 | 0.0000 | Bonf |
| Output Per Request | task_type:sysadmin | opus-4-6 higher | 0.269 | 0.0000 | Bonf |
| Estimated Cost | complexity:trivial | opus-4-5 higher | 0.264 | 0.0486 | FDR |
| Output Per Request | complexity:simple | opus-4-6 higher | 0.251 | 0.0000 | Bonf |
| Cost Per Minute | iteration:significant | opus-4-5 higher | 0.250 | 0.0138 | FDR |
| Total Output Tokens | complexity:trivial | opus-4-6 higher | 0.238 | 0.0000 | Bonf |
| Cost Per Minute | overall | opus-4-5 higher | 0.237 | 0.0000 | Bonf |
| Cache Hit Rate | iteration:significant | equal | 0.232 | 0.0000 | Bonf |
| Total Input Tokens | task_type:refactor | opus-4-5 higher | 0.226 | 0.0443 | FDR |
| Request Count | complexity:trivial | opus-4-6 higher | 0.223 | 0.0001 | Bonf |
| Cache Hit Rate | task_type:investigation | equal | 0.221 | 0.0000 | Bonf |
| Cost Per Minute | complexity:trivial | opus-4-5 higher | 0.214 | 0.0029 | FDR |
| Cache Hit Rate | task_type:refactor | equal | 0.197 | 0.0000 | Bonf |
| Cache Hit Rate | iteration:minor | equal | 0.192 | 0.0000 | Bonf |
| Cache Hit Rate | overall | equal | 0.192 | 0.0000 | Bonf |
| Request Count | overall | opus-4-6 higher | 0.190 | 0.0000 | Bonf |
| Request Count | task_type:sysadmin | opus-4-6 higher | 0.189 | 0.0364 | FDR |
| Total Input Tokens | task_type:bugfix | opus-4-5 higher | 0.189 | 0.0000 | Bonf |
| Total Input Tokens | complexity:moderate | opus-4-5 higher | 0.173 | 0.0000 | Bonf |
| Cache Hit Rate | complexity:complex | equal | 0.170 | 0.0000 | Bonf |
| Cost Per Minute | iteration:minor | opus-4-5 higher | 0.168 | 0.0267 | FDR |
| Cache Hit Rate | task_type:bugfix | equal | 0.162 | 0.0000 | Bonf |
| Cost Per Minute | task_type:investigation | opus-4-5 higher | 0.161 | 0.0096 | FDR |
| Request Count | iteration:one_shot | opus-4-6 higher | 0.159 | 0.0001 | Bonf |
| Cache Hit Rate | iteration:one_shot | equal | 0.159 | 0.0000 | Bonf |
| Output Per Request | complexity:trivial | opus-4-6 higher | 0.158 | 0.0000 | Bonf |
| Total Input Tokens | iteration:one_shot | opus-4-5 higher | 0.144 | 0.0000 | Bonf |
| Cache Hit Rate | task_type:sysadmin | equal | 0.137 | 0.0000 | Bonf |
| Request Count | task_type:investigation | opus-4-6 higher | 0.135 | 0.0005 | FDR |
| Cache Hit Rate | complexity:moderate | equal | 0.126 | 0.0000 | Bonf |
| Total Input Tokens | iteration:minor | opus-4-5 higher | 0.124 | 0.0000 | Bonf |
| Total Input Tokens | overall | opus-4-5 higher | 0.119 | 0.0000 | Bonf |
| Cache Hit Rate | task_type:feature | equal | 0.086 | 0.0000 | Bonf |
| Total Input Tokens | iteration:significant | opus-4-5 higher | 0.076 | 0.0000 | Bonf |
| Total Input Tokens | complexity:simple | opus-4-5 higher | 0.072 | 0.0000 | Bonf |
| Total Input Tokens | task_type:investigation | opus-4-5 higher | 0.065 | 0.0000 | Bonf |
| Cache Hit Rate | complexity:simple | equal | 0.020 | 0.0000 | Bonf |
| Total Input Tokens | task_type:sysadmin | opus-4-5 higher | 0.013 | 0.0000 | Bonf |
| Total Input Tokens | task_type:feature | opus-4-5 higher | 0.011 | 0.0000 | Bonf |
| Total Input Tokens | complexity:trivial | opus-4-5 higher | 0.000 | 0.0000 | Bonf |
| Measurement | Slice | Effect | padj |
|---|---|---|---|
| Request Count | task_type:greenfield | 0.822 | 0.1917 |
| Total Output Tokens | task_type:greenfield | 0.736 | 0.0538 |
| Estimated Cost | task_type:greenfield | 0.720 | 0.4060 |
| Total Input Tokens | task_type:greenfield | 0.593 | 0.2112 |
| Request Count | task_type:bugfix | 0.538 | 0.0931 |
| Cost Per Minute | task_type:refactor | 0.299 | 0.1799 |
| Request Count | complexity:complex | 0.296 | 0.0758 |
| Cost Per Minute | task_type:feature | 0.292 | 0.0758 |
| Estimated Cost | task_type:bugfix | 0.275 | 0.4758 |
| Cost Per Minute | task_type:sysadmin | 0.248 | 0.4460 |
| Estimated Cost | complexity:moderate | 0.197 | 0.1222 |
| Estimated Cost | complexity:complex | 0.150 | 0.3988 |
| Cost Per Minute | task_type:greenfield | 0.146 | 0.9305 |
| Request Count | complexity:simple | 0.146 | 0.2142 |
| Estimated Cost | iteration:minor | 0.118 | 0.5655 |
| Cost Per Minute | task_type:bugfix | 0.113 | 0.0943 |
| Output Per Request | task_type:greenfield | 0.107 | 0.1140 |
| Estimated Cost | task_type:feature | 0.102 | 0.4758 |
| Request Count | complexity:moderate | 0.095 | 0.3041 |
| Estimated Cost | task_type:sysadmin | 0.095 | 0.7716 |
| Estimated Cost | overall | 0.047 | 0.8292 |
| Estimated Cost | task_type:investigation | 0.039 | 0.3707 |
| Estimated Cost | iteration:one_shot | 0.021 | 0.9305 |
| Request Count | iteration:minor | 0.008 | 0.8273 |
The cost difference raises a natural question: does 4.6’s different spending pattern correspond to different thinking strategies? The next section examines thinking calibration—the largest overall effect in the study.
Thinking fraction is the largest overall effect in the study (d=0.64, medium by Cohen’s convention). Opus 4.5 thinks on 75% of tasks but shallowly; Opus 4.6 thinks on 59% of tasks but more deeply when it does (4,067 vs 2,578 chars). The pattern suggests 4.6 has better calibration of when thinking is needed, reserving it for moderate-and-above complexity.
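For reference, a standard pooled-SD Cohen's d; the report does not state its exact estimator, so this form is an assumption:

```python
import math

def cohens_d(mean_a: float, mean_b: float, sd_a: float, sd_b: float,
             n_a: int, n_b: int) -> float:
    """Cohen's d with pooled standard deviation (assumed estimator)."""
    pooled = math.sqrt(((n_a - 1) * sd_a**2 + (n_b - 1) * sd_b**2)
                       / (n_a + n_b - 2))
    return (mean_a - mean_b) / pooled

# Conventional labels: |d| ≈ 0.2 small, 0.5 medium, 0.8 large,
# which is why the d = 0.64 thinking-fraction effect reads as "medium".
```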
| Complexity | 4.5 (n) | 4.6 (n) | Δ thinking % |
|---|---|---|---|
| Trivial | 882 | 346 | −33pp |
| Simple | 381 | 209 | −32pp |
| Moderate | 413 | 247 | +1pp |
| Complex | 198 | 112 | +17pp |
| Major | 26 | 23 | +17pp |
| Complexity | 4.5 Thinking Chars (when used) | 4.6 Thinking Chars (when used) | 4.5 Text Chars | 4.6 Text Chars | 4.5 Think/Text | 4.6 Think/Text |
|---|---|---|---|---|---|---|
| Trivial | 839 | 1,108 | 841 | 895 | 1.00 | 1.24 |
| Simple | 1,684 | 1,593 | 976 | 1,185 | 1.73 | 1.34 |
| Moderate | 3,828 | 3,916 | 1,573 | 2,204 | 2.43 | 1.78 |
| Complex | 7,264 | 9,911 | 3,232 | 3,768 | 2.25 | 2.63 |
| Major | 10,637 | 9,859 | 4,150 | 8,117 | 2.56 | 1.21 |
| Task Type | 4.5 thinking chars | 4.6 thinking chars | Ratio |
|---|---|---|---|
| Greenfield | 2,520 | 8,196 | 3.3× |
| Refactor | 4,613 | 7,028 | 1.5× |
| Bugfix | 3,202 | 3,974 | 1.2× |
| Feature | 4,115 | 6,605 | 1.6× |
| Investigation | 2,458 | 2,572 | 1.0× |
| Sysadmin | 1,441 | 2,624 | 1.8× |
| Continuation | 1,651 | 2,256 | 1.4× |
| Docs | 1,898 | 4,868 | 2.6× |
| Port | 4,287 | 4,047 | 0.9× |
| Measurement | Slice | Direction | Effect | padj | Sig |
|---|---|---|---|---|---|
| Thinking Fraction | task_type:sysadmin | opus-4-5 higher | 1.440 | 0.0000 | Bonf |
| Thinking Fraction | complexity:simple | opus-4-5 higher | 1.154 | 0.0000 | Bonf |
| Thinking Fraction | task_type:bugfix | opus-4-5 higher | 1.150 | 0.0001 | Bonf |
| Thinking Fraction | iteration:minor | opus-4-5 higher | 0.995 | 0.0000 | Bonf |
| Thinking Fraction | task_type:investigation | opus-4-5 higher | 0.860 | 0.0000 | Bonf |
| Thinking Fraction | complexity:trivial | opus-4-5 higher | 0.829 | 0.0000 | Bonf |
| Thinking Fraction | overall | opus-4-5 higher | 0.636 | 0.0000 | Bonf |
| Thinking Fraction | iteration:significant | opus-4-5 higher | 0.628 | 0.0000 | Bonf |
| Thinking Fraction | complexity:moderate | opus-4-5 higher | 0.583 | 0.0000 | Bonf |
| Thinking Fraction | iteration:one_shot | opus-4-5 higher | 0.545 | 0.0000 | Bonf |
| Thinking Fraction | task_type:feature | opus-4-5 higher | 0.505 | 0.0306 | FDR |
| Thinking Chars | complexity:complex | opus-4-6 higher | 0.487 | 0.0484 | FDR |
| Thinking Chars | complexity:simple | opus-4-5 higher | 0.340 | 0.0000 | Bonf |
| Thinking Chars | task_type:sysadmin | opus-4-5 higher | 0.195 | 0.0000 | Bonf |
| Thinking Chars | complexity:trivial | opus-4-5 higher | 0.160 | 0.0000 | Bonf |
| Thinking Chars | overall | opus-4-5 higher | 0.140 | 0.0000 | Bonf |
| Thinking Chars | iteration:minor | opus-4-5 higher | 0.096 | 0.0000 | Bonf |
| Thinking Chars | complexity:moderate | opus-4-5 higher | 0.026 | 0.0028 | FDR |
| Thinking Chars | task_type:investigation | opus-4-5 higher | 0.016 | 0.0012 | FDR |
| Measurement | Slice | Effect | padj |
|---|---|---|---|
| Thinking Chars | task_type:greenfield | 0.769 | 0.1355 |
| Thinking Fraction | task_type:refactor | 0.639 | 0.1188 |
| Thinking Fraction | task_type:greenfield | 0.596 | 0.3287 |
| Thinking Chars | task_type:feature | 0.411 | 0.8972 |
| Thinking Chars | task_type:refactor | 0.376 | 0.8110 |
| Thinking Chars | iteration:one_shot | 0.252 | 0.1028 |
| Thinking Chars | iteration:significant | 0.221 | 0.0551 |
| Thinking Fraction | complexity:complex | 0.071 | 0.8292 |
| Thinking Chars | task_type:bugfix | 0.055 | 0.1355 |
Thinking calibration is one manifestation of broader behavioral differences between the models. The next section examines other behavioral patterns—subagent deployment, planning adoption, and effort distribution.
Beyond token economics, the models differ in how they approach tasks. Opus 4.6 plans more often (12.3% vs 1.8% of tasks), deploys more subagents, and favors read-only exploration over general-purpose workers. These behavioral differences are among the most visible in the dataset, though the Claude Code platform itself evolved between the two collection periods—some of the shift may reflect SDK changes rather than model decisions.
| Metric | 4.5 | 4.6 | Δ |
|---|---|---|---|
| Tasks using planning mode | 35 | 115 | 4.6 +10.4pp |
| Tasks using subagents | 155 | 188 | 4.6 +11.9pp |
| Autonomous subagent calls | 196 | 315 | 4.6 +29.3pp |
| Type | 4.5 | 4.6 | Δ |
|---|---|---|---|
| Explore | 175 | 257 | 4.6 +19.7pp |
| General-purpose | 114 | 74 | 4.5 +12.1pp |
| Plan | 24 | 27 | ≈ tie |
| Bash | 6 | 14 | ≈ tie |
Opus 4.6 enters plan mode on 12.3% of tasks (115 of 937) vs 1.8% for Opus 4.5. Adoption scales steeply with complexity: 42.9% at complex, 65.2% at major. Planned tasks show a modest alignment benefit (+0.17 overall) that diminishes at complex and major tiers.
| Metric | 4.5 | 4.6 |
|---|---|---|
| Planning adoption rate | 35 tasks | 115 tasks |
| Complexity | 4.5 planned/total | 4.6 planned/total |
|---|---|---|
| Trivial | 0/882 | 3/346 |
| Simple | 3/381 | 6/209 |
| Moderate | 10/413 | 43/247 |
| Complex | 14/198 | 48/112 |
| Major | 8/26 | 15/23 |
| Complexity | 4.5 Planned | 4.5 Unplanned | 4.5 Δ | 4.6 Planned | 4.6 Unplanned | 4.6 Δ |
|---|---|---|---|---|---|---|
| Trivial | — (n=0) | 2.72 (n=882) | — | 3.00 (n=3) | 3.07 (n=343) | −0.07 |
| Simple | 3.00 (n=3) | 3.27 (n=378) | −0.27 | 3.17 (n=6) | 3.11 (n=203) | +0.06 |
| Moderate | 3.70 (n=10) | 3.30 (n=403) | +0.40 | 3.35 (n=43) | 3.32 (n=204) | +0.03 |
| Complex+ | — (n=0) | — (n=0) | — | — (n=0) | — (n=0) | — |
Effort distribution shows Opus 4.6 allocates more tool calls to research (35.1% vs 28.3%) and fewer to implementation (17.5% vs 27.0%), consistent with the research-first approach visible in subagent type preferences.
| Metric | 4.5 | 4.6 |
|---|---|---|
| Research ratio | 0.283 | 0.351 |
| Implementation ratio | 0.270 | 0.175 |
| Front-load positive % | 868 tasks | 194 tasks |
| Measurement | Slice | Direction | Effect | padj | Sig |
|---|---|---|---|---|---|
| Tool Calls | task_type:refactor | opus-4-6 higher | 1.057 | 0.0009 | FDR |
| Files Touched | task_type:refactor | opus-4-6 higher | 0.808 | 0.0045 | FDR |
| Tool Calls | iteration:significant | opus-4-6 higher | 0.529 | 0.0000 | Bonf |
| Tool Calls | task_type:feature | opus-4-6 higher | 0.508 | 0.0005 | FDR |
| Tool Calls | task_type:bugfix | opus-4-6 higher | 0.507 | 0.0368 | FDR |
| One Shot Rate | complexity:trivial | opus-4-6 higher | 0.448 | 0.0000 | Bonf |
| Lines Per Minute | complexity:moderate | opus-4-5 higher | 0.442 | 0.0000 | Bonf |
| Duration Seconds | task_type:refactor | opus-4-6 higher | 0.435 | 0.0018 | FDR |
| Files Touched | complexity:simple | opus-4-5 higher | 0.423 | 0.0000 | Bonf |
| One Shot Rate | complexity:complex | opus-4-6 higher | 0.409 | 0.0016 | FDR |
| Tool Calls | complexity:complex | opus-4-6 higher | 0.405 | 0.0051 | FDR |
| Lines Per Minute | task_type:feature | opus-4-5 higher | 0.395 | 0.0127 | FDR |
| Files Touched | task_type:feature | opus-4-6 higher | 0.388 | 0.0226 | FDR |
| Autonomy Level | iteration:one_shot | distributions differ | 0.380 | 0.0000 | Bonf |
| Scope Management | complexity:simple | distributions differ | 0.379 | 0.0000 | Bonf |
| Files Touched | iteration:significant | opus-4-6 higher | 0.379 | 0.0043 | FDR |
| One Shot Rate | complexity:simple | opus-4-6 higher | 0.378 | 0.0001 | Bonf |
| Scope Management | iteration:one_shot | distributions differ | 0.378 | 0.0000 | Bonf |
| One Shot Rate | complexity:moderate | opus-4-6 higher | 0.378 | 0.0000 | Bonf |
| One Shot Rate | overall | opus-4-6 higher | 0.375 | 0.0000 | Bonf |
| Files Touched | complexity:complex | opus-4-6 higher | 0.336 | 0.0133 | FDR |
| Tool Calls | complexity:moderate | opus-4-6 higher | 0.328 | 0.0002 | Bonf |
| Scope Management | complexity:trivial | distributions differ | 0.324 | 0.0000 | Bonf |
| Tool Calls | task_type:investigation | opus-4-6 higher | 0.317 | 0.0000 | Bonf |
| Autonomy Level | complexity:trivial | distributions differ | 0.309 | 0.0000 | Bonf |
| Tool Calls | complexity:trivial | opus-4-6 higher | 0.301 | 0.0000 | Bonf |
| Tools Per File | complexity:trivial | opus-4-6 higher | 0.301 | 0.0000 | Bonf |
| Scope Management | overall | distributions differ | 0.296 | 0.0000 | Bonf |
| Tools Per File | task_type:sysadmin | opus-4-6 higher | 0.296 | 0.0213 | FDR |
| Communication Quality | complexity:trivial | distributions differ | 0.293 | 0.0000 | Bonf |
| Files Touched | task_type:investigation | equal | 0.279 | 0.0292 | FDR |
| Autonomy Level | complexity:complex | distributions differ | 0.263 | 0.0001 | Bonf |
| Autonomy Level | overall | distributions differ | 0.252 | 0.0000 | Bonf |
| Scope Management | complexity:moderate | distributions differ | 0.251 | 0.0000 | Bonf |
| Tools Per File | task_type:investigation | opus-4-6 higher | 0.248 | 0.0000 | Bonf |
| Tools Per File | iteration:significant | opus-4-6 higher | 0.246 | 0.0000 | Bonf |
| Lines Per Minute | complexity:simple | opus-4-5 higher | 0.244 | 0.0000 | Bonf |
| Tools Per File | complexity:simple | opus-4-6 higher | 0.240 | 0.0024 | FDR |
| Communication Quality | iteration:one_shot | distributions differ | 0.234 | 0.0000 | Bonf |
| Iteration Required | complexity:trivial | distributions differ | 0.230 | 0.0000 | Bonf |
| Scope Expanded Rate | complexity:moderate | opus-4-5 higher | 0.229 | 0.0422 | FDR |
| Iteration Required | complexity:moderate | distributions differ | 0.226 | 0.0002 | Bonf |
| Duration Seconds | task_type:feature | opus-4-6 higher | 0.224 | 0.0082 | FDR |
| Iteration Required | complexity:complex | distributions differ | 0.222 | 0.0048 | FDR |
| Iteration Required | complexity:simple | distributions differ | 0.222 | 0.0005 | FDR |
| Tool Calls | overall | opus-4-6 higher | 0.221 | 0.0000 | Bonf |
| Iteration Required | overall | distributions differ | 0.208 | 0.0000 | Bonf |
| Communication Quality | overall | distributions differ | 0.204 | 0.0000 | Bonf |
| Scope Management | complexity:complex | distributions differ | 0.200 | 0.0354 | FDR |
| Autonomy Level | complexity:moderate | distributions differ | 0.196 | 0.0022 | FDR |
| Autonomy Level | task_type:investigation | distributions differ | 0.192 | 0.0050 | FDR |
| Communication Quality | task_type:investigation | distributions differ | 0.191 | 0.0056 | FDR |
| Autonomy Level | complexity:simple | distributions differ | 0.174 | 0.0172 | FDR |
| Tools Per File | complexity:moderate | opus-4-6 higher | 0.174 | 0.0244 | FDR |
| Scope Management | task_type:investigation | distributions differ | 0.172 | 0.0229 | FDR |
| Scope Expanded Rate | overall | opus-4-5 higher | 0.155 | 0.0036 | FDR |
| Tool Calls | iteration:one_shot | opus-4-6 higher | 0.154 | 0.0000 | Bonf |
| Communication Quality | iteration:significant | distributions differ | 0.140 | 0.0007 | FDR |
| Autonomy Level | iteration:significant | distributions differ | 0.138 | 0.0003 | Bonf |
| Scope Management | iteration:significant | distributions differ | 0.133 | 0.0040 | FDR |
| Tools Per File | overall | opus-4-6 higher | 0.103 | 0.0000 | Bonf |
| Duration Seconds | iteration:significant | opus-4-6 higher | 0.085 | 0.0000 | Bonf |
| Duration Seconds | task_type:investigation | opus-4-6 higher | 0.053 | 0.0000 | Bonf |
| Tools Per File | iteration:one_shot | opus-4-6 higher | 0.051 | 0.0000 | Bonf |
| Duration Seconds | complexity:complex | opus-4-6 higher | 0.048 | 0.0000 | Bonf |
| Duration Seconds | complexity:moderate | opus-4-6 higher | 0.040 | 0.0075 | FDR |
| Duration Seconds | iteration:one_shot | opus-4-6 higher | 0.025 | 0.0000 | Bonf |
| Duration Seconds | overall | opus-4-6 higher | 0.000 | 0.0000 | Bonf |
| Measurement | Slice | Effect | padj |
|---|---|---|---|
| Scope Expanded Rate | task_type:greenfield | 0.806 | 0.2282 |
| Lines Per Minute | task_type:greenfield | 0.659 | 0.4521 |
| One Shot Rate | task_type:greenfield | 0.614 | 0.2233 |
| Tool Calls | task_type:greenfield | 0.537 | 0.3961 |
| Duration Seconds | task_type:greenfield | 0.488 | 0.4521 |
| Communication Quality | task_type:greenfield | 0.488 | 0.0929 |
| Scope Management | task_type:greenfield | 0.427 | 0.4438 |
| Tools Per File | task_type:greenfield | 0.416 | 0.4438 |
| Files Touched | task_type:bugfix | 0.374 | 0.1278 |
| Lines Per Minute | complexity:complex | 0.369 | 0.1271 |
| Autonomy Level | task_type:greenfield | 0.363 | 0.2819 |
| Tools Per File | task_type:feature | 0.361 | 0.1191 |
| Duration Seconds | task_type:bugfix | 0.357 | 0.0732 |
| Tools Per File | task_type:refactor | 0.327 | 0.5441 |
| Iteration Required | task_type:greenfield | 0.310 | 0.4134 |
| Autonomy Level | task_type:refactor | 0.308 | 0.1305 |
| Scope Management | task_type:refactor | 0.301 | 0.1452 |
| Scope Expanded Rate | task_type:refactor | 0.287 | 0.5674 |
| Scope Expanded Rate | task_type:investigation | 0.286 | 0.1743 |
| Scope Expanded Rate | complexity:complex | 0.264 | 0.1136 |
| Scope Expanded Rate | task_type:bugfix | 0.251 | 0.5441 |
| Lines Per Minute | task_type:bugfix | 0.248 | 0.6793 |
| Communication Quality | task_type:refactor | 0.214 | 0.4521 |
| Scope Expanded Rate | iteration:significant | 0.208 | 0.0578 |
| Lines Per Minute | iteration:minor | 0.189 | 0.3560 |
| Tools Per File | complexity:complex | 0.186 | 0.8365 |
| Duration Seconds | task_type:sysadmin | 0.182 | 0.1819 |
| Iteration Required | task_type:refactor | 0.177 | 0.6072 |
| Lines Per Minute | task_type:sysadmin | 0.168 | 0.4032 |
| Scope Management | task_type:feature | 0.163 | 0.2986 |
Different behavioral strategies raise the question of whether they lead to different outcomes. The next section examines completion rates, failure rates, and user satisfaction—the quality signals that the behavioral patterns should ultimately serve.
LLM-annotated alignment scores (1–5 scale) show Opus 4.6 scoring higher on average, an effect that survives Bonferroni correction (p=0.000714, d=−0.13). The failure-rate difference is also notable: 5.4% of 4.6 tasks fail vs 12.0% for 4.5 (p<0.001, also a Bonferroni survivor). Both alignment and failure rate are LLM-classified: a Claude Haiku model reads each session transcript and assigns scores. The broader "LLM quality judgement" approach was abandoned as unreliable (see §10), but alignment scoring proved more robust because it rates user-goal correspondence from observable signals rather than attempting to judge code quality directly.
Two categorical distributions—task completion and communication quality—also survive Bonferroni as chi-square tests, indicating the models differ in how they reach outcomes, not just in outcome rates. The satisfied-rate proportion test (p=0.044) reaches significance at α=0.05 but not after correction. All chi-square tests carry a low-expected-cell-count warning due to rare categories in the 20-status taxonomy.
| Outcome | Δ |
|---|---|
| Complete | B +22.0pp |
| Partial | A +8.9pp |
| Interrupted | A +6.6pp |
| Failed | B −6.6pp |
| Sentiment | Δ |
|---|---|
| Satisfied | A +3.4pp |
| Neutral | B +6.0pp |
| Dissatisfied | ≈ Tie |
Satisfaction trends slightly higher for Opus 4.5 in the proportion test but does not survive Bonferroni correction, and dissatisfaction rates are essentially tied. Both completion and sentiment are LLM-classified: a Claude Haiku annotator reads the full session transcript for each task, classifying completion status from a 20-category taxonomy and inferring user sentiment from contextual signals (follow-up messages, tone shifts, task abandonment patterns). These classifications were validated through human spot-checks of flagged cases, but no formal inter-rater reliability was computed.
Alignment score (1–5):

| Metric | Opus 4.5 | Opus 4.6 |
|---|---|---|
| Sample size | 1900 | 937 |
| Mean | 3.032 | 3.186 |
| Median | 3.0 | 3.0 |
| Std dev | 1.237 | 0.959 |
| Test statistic | Value |
|---|---|
| U statistic | 823192.5 |
| p-value | 0.000714 |
| Cohen's d | -0.134 |
| Effect size | negligible |
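The reported d can be reproduced from the summary statistics alone. A minimal sketch using the pooled-standard-deviation formula; the sign convention (4.5 minus 4.6, so a negative d means 4.6 scored higher) is inferred from the reported value:

```python
import math

def cohens_d(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Cohen's d with pooled standard deviation; negative means B is higher."""
    pooled_var = ((n_a - 1) * sd_a**2 + (n_b - 1) * sd_b**2) / (n_a + n_b - 2)
    return (mean_a - mean_b) / math.sqrt(pooled_var)

# Alignment-score summary statistics from the tables above
d = cohens_d(3.032, 1.237, 1900, 3.186, 0.959, 937)  # about -0.134
```

This matches the reported d=−0.134 to rounding.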
Complete rate:

| Metric | Opus 4.5 | Opus 4.6 |
|---|---|---|
| Proportion | 0.389 | 0.609 |
| Count | 739 / 1900 | 571 / 937 |
| 95% CI | [0.367, 0.411] | [0.578, 0.640] |
| Test statistic | Value |
|---|---|
| z statistic | -11.077 |
| p-value | 0.0000 |
| Cohen's h | -0.445 |
| Effect size | small |
Failed rate:

| Metric | Opus 4.5 | Opus 4.6 |
|---|---|---|
| Proportion | 0.120 | 0.054 |
| Count | 228 / 1900 | 51 / 937 |
| 95% CI | [0.106, 0.135] | [0.042, 0.071] |
| Test statistic | Value |
|---|---|
| z statistic | 5.516 |
| p-value | 0.0000 |
| Cohen's h | 0.236 |
| Effect size | small |
Satisfied rate:

| Metric | Opus 4.5 | Opus 4.6 |
|---|---|---|
| Proportion | 0.235 | 0.202 |
| Count | 447 / 1900 | 189 / 937 |
| 95% CI | [0.217, 0.255] | [0.177, 0.229] |
| Test statistic | Value |
|---|---|
| z statistic | 2.016 |
| p-value | 0.0438 |
| Cohen's h | 0.081 |
| Effect size | negligible |
Dissatisfied rate:

| Metric | Opus 4.5 | Opus 4.6 |
|---|---|---|
| Proportion | 0.118 | 0.108 |
| Count | 225 / 1900 | 101 / 937 |
| 95% CI | [0.105, 0.134] | [0.089, 0.129] |
| Test statistic | Value |
|---|---|
| z statistic | 0.835 |
| p-value | 0.4037 |
| Cohen's h | 0.034 |
| Effect size | negligible |
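The proportion tests above follow a standard pattern: a pooled two-sample z statistic plus Cohen's h, the difference of arcsine-transformed proportions. A minimal sketch using the failed-rate counts; that the report used the pooled-variance convention is an assumption, though it reproduces the reported z=5.516 and h=0.236 to rounding:

```python
import math

def two_proportion_test(x_a, n_a, x_b, n_b):
    """Pooled two-sample z statistic and Cohen's h for two proportions."""
    p_a, p_b = x_a / n_a, x_b / n_b
    p_pool = (x_a + x_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Cohen's h: difference of arcsine-transformed proportions
    h = 2 * math.asin(math.sqrt(p_a)) - 2 * math.asin(math.sqrt(p_b))
    return z, h

# Failed-rate counts from the tables above: 228/1900 vs 51/937
z, h = two_proportion_test(228, 1900, 51, 937)
```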
Note: the full categorical breakdown reduces to four completion statuses in this dataset (the Other bucket is empty); simplified counts are shown below.
| Category | Opus 4.5 | Opus 4.6 |
|---|---|---|
| Complete | 739 | 571 |
| Partial | 731 | 277 |
| Interrupted | 202 | 38 |
| Failed | 228 | 51 |
| Other | 0 | 0 |
| Test statistic | Value |
|---|---|
| χ² statistic | 139.581 |
| Degrees of freedom | 3 |
| p-value | 0.000 |
| Cramér's V | 0.222 |
| Effect size | small |
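The χ² and Cramér's V above can be recomputed directly from the simplified counts. A self-contained sketch in pure Python, with no continuity correction (the standard choice for tables larger than 2×2):

```python
def chi_square(table):
    """Pearson chi-square for an r x c contingency table (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, dof

def cramers_v(chi2, n, r, c):
    """Cramer's V effect size for an r x c chi-square."""
    return ((chi2 / n) / min(r - 1, c - 1)) ** 0.5

# Completion-status counts from the table above (rows: Opus 4.5, Opus 4.6)
counts = [[739, 731, 202, 228],
          [571, 277, 38, 51]]
chi2, dof = chi_square(counts)      # about 139.58 with dof = 3
v = cramers_v(chi2, 2837, 2, 4)     # about 0.222
```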
With 11 independent tests conducted (1 Mann-Whitney U, 9 proportion tests, 1 chi-square), the Bonferroni-corrected significance threshold is α = 0.05 / 11 = 0.0045.
Of the tests shown above, the alignment score (p=0.000714), complete rate, failed rate, and the completion-distribution chi-square fall below this threshold. The satisfied-rate test (p=0.044) is significant at α = 0.05 but not after correction; the dissatisfied-rate test (p=0.404) is non-significant.
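The correction procedure itself is mechanical enough to sketch. The helper below is illustrative, not the report's actual code; the p-values in the usage example are the ones reported in this section:

```python
def bonferroni_classify(p_values, m, alpha=0.05):
    """Split named tests into Bonferroni survivors, nominal-only, and
    non-significant, given the total family size m."""
    threshold = alpha / m
    survivors = {t for t, p in p_values.items() if p < threshold}
    nominal = {t for t, p in p_values.items() if threshold <= p < alpha}
    nonsig = {t for t, p in p_values.items() if p >= alpha}
    return threshold, survivors, nominal, nonsig

# Family of 11 tests; a few reported p-values from this section
thr, surv, nominal, nonsig = bonferroni_classify(
    {"alignment": 0.000714,
     "satisfied_rate": 0.0438,
     "dissatisfied_rate": 0.4037},
    m=11)
# thr is 0.05 / 11, about 0.0045; alignment survives, satisfied rate
# is nominal-only, dissatisfied rate is non-significant
```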
| Measurement | Slice | Direction | Effect | padj | Sig |
|---|---|---|---|---|---|
| Satisfaction Rate | task_type:greenfield | opus-4-6 higher | 1.242 | 0.0109 | FDR |
| Alignment Score | task_type:greenfield | opus-4-6 higher | 1.213 | 0.0230 | FDR |
| Complete Rate | task_type:greenfield | opus-4-6 higher | 0.963 | 0.0427 | FDR |
| Normalized User Sentiment | task_type:greenfield | distributions differ | 0.644 | 0.0318 | FDR |
| Satisfaction Rate | task_type:refactor | opus-4-6 higher | 0.643 | 0.0207 | FDR |
| Complete Rate | complexity:trivial | opus-4-6 higher | 0.607 | 0.0000 | Bonf |
| Good Execution Rate | complexity:moderate | opus-4-5 higher | 0.506 | 0.0000 | Bonf |
| Good Execution Rate | complexity:complex | opus-4-5 higher | 0.473 | 0.0003 | Bonf |
| Complete Rate | overall | opus-4-6 higher | 0.445 | 0.0000 | Bonf |
| Complete Rate | iteration:one_shot | opus-4-6 higher | 0.419 | 0.0000 | Bonf |
| Good Execution Rate | iteration:one_shot | opus-4-5 higher | 0.406 | 0.0000 | Bonf |
| Good Execution Rate | task_type:feature | opus-4-5 higher | 0.405 | 0.0060 | FDR |
| Complete Rate | complexity:moderate | opus-4-6 higher | 0.383 | 0.0000 | Bonf |
| Satisfaction Rate | iteration:one_shot | opus-4-5 higher | 0.350 | 0.0000 | Bonf |
| Complete Rate | complexity:simple | opus-4-6 higher | 0.309 | 0.0012 | FDR |
| Good Execution Rate | complexity:simple | opus-4-5 higher | 0.306 | 0.0015 | FDR |
| Satisfaction Rate | complexity:complex | opus-4-5 higher | 0.300 | 0.0307 | FDR |
| Alignment Score | iteration:significant | equal | 0.295 | 0.0002 | Bonf |
| Task Completion | complexity:trivial | distributions differ | 0.290 | 0.0000 | Bonf |
| Normalized Execution Quality | complexity:moderate | distributions differ | 0.267 | 0.0000 | Bonf |
| Alignment Score | complexity:trivial | equal | 0.260 | 0.0000 | Bonf |
| Normalized Execution Quality | complexity:complex | distributions differ | 0.260 | 0.0004 | FDR |
| Alignment Score | iteration:one_shot | opus-4-5 higher | 0.245 | 0.0000 | Bonf |
| Failed Rate | complexity:trivial | opus-4-5 higher | 0.237 | 0.0013 | FDR |
| Failed Rate | overall | opus-4-5 higher | 0.236 | 0.0000 | Bonf |
| Dissatisfaction Rate | iteration:one_shot | opus-4-6 higher | 0.226 | 0.0001 | Bonf |
| Task Completion | overall | distributions differ | 0.222 | 0.0000 | Bonf |
| Normalized Execution Quality | iteration:one_shot | distributions differ | 0.220 | 0.0000 | Bonf |
| Failed Rate | iteration:one_shot | opus-4-5 higher | 0.214 | 0.0012 | FDR |
| Normalized Execution Quality | task_type:feature | distributions differ | 0.211 | 0.0354 | FDR |
| Normalized User Sentiment | complexity:complex | distributions differ | 0.210 | 0.0096 | FDR |
| Good Execution Rate | overall | opus-4-5 higher | 0.207 | 0.0000 | Bonf |
| Normalized User Sentiment | iteration:one_shot | distributions differ | 0.199 | 0.0000 | Bonf |
| Task Completion | iteration:one_shot | distributions differ | 0.197 | 0.0000 | Bonf |
| Task Completion | complexity:moderate | distributions differ | 0.197 | 0.0001 | Bonf |
| Dissatisfaction Rate | complexity:moderate | opus-4-5 higher | 0.196 | 0.0431 | FDR |
| Normalized Execution Quality | complexity:simple | distributions differ | 0.191 | 0.0009 | FDR |
| Task Completion | complexity:simple | distributions differ | 0.153 | 0.0093 | FDR |
| Normalized User Sentiment | complexity:simple | distributions differ | 0.145 | 0.0161 | FDR |
| Task Completion | task_type:investigation | distributions differ | 0.134 | 0.0378 | FDR |
| Alignment Score | overall | equal | 0.134 | 0.0022 | FDR |
| Normalized Execution Quality | overall | distributions differ | 0.133 | 0.0000 | Bonf |
| Task Completion | iteration:significant | distributions differ | 0.129 | 0.0026 | FDR |
| Normalized User Sentiment | overall | distributions differ | 0.064 | 0.0236 | FDR |
| Measurement | Slice | Effect | padj |
|---|---|---|---|
| Dissatisfaction Rate | task_type:greenfield | 1.002 | 0.1256 |
| Failed Rate | task_type:refactor | 0.580 | 0.2222 |
| Failed Rate | task_type:greenfield | 0.562 | 0.4342 |
| Task Completion | task_type:greenfield | 0.555 | 0.0936 |
| Good Execution Rate | task_type:refactor | 0.436 | 0.1536 |
| Complete Rate | task_type:refactor | 0.415 | 0.1673 |
| Normalized User Sentiment | task_type:refactor | 0.312 | 0.1218 |
| Normalized Execution Quality | task_type:refactor | 0.309 | 0.2163 |
| Task Completion | task_type:refactor | 0.302 | 0.1427 |
| Dissatisfaction Rate | task_type:bugfix | 0.301 | 0.1964 |
| Normalized Execution Quality | task_type:greenfield | 0.289 | 0.4565 |
| Failed Rate | complexity:complex | 0.285 | 0.2163 |
| Good Execution Rate | task_type:sysadmin | 0.269 | 0.0855 |
| Complete Rate | complexity:complex | 0.256 | 0.0669 |
| Dissatisfaction Rate | task_type:refactor | 0.256 | 0.4534 |
| Satisfaction Rate | task_type:bugfix | 0.251 | 0.2282 |
| Alignment Score | task_type:bugfix | 0.240 | 0.3008 |
| Alignment Score | task_type:investigation | 0.205 | 0.0653 |
| Normalized Execution Quality | task_type:sysadmin | 0.197 | 0.0766 |
| Failed Rate | task_type:bugfix | 0.194 | 0.4342 |
| Satisfaction Rate | complexity:simple | 0.190 | 0.0670 |
| Failed Rate | task_type:investigation | 0.189 | 0.1352 |
| Complete Rate | task_type:investigation | 0.181 | 0.1195 |
| Failed Rate | complexity:moderate | 0.176 | 0.1022 |
| Failed Rate | complexity:simple | 0.161 | 0.1455 |
| Good Execution Rate | task_type:investigation | 0.159 | 0.1799 |
| Failed Rate | task_type:feature | 0.157 | 0.4205 |
| Alignment Score | task_type:refactor | 0.156 | 0.8416 |
| Good Execution Rate | task_type:greenfield | 0.154 | 0.7595 |
| Alignment Score | complexity:simple | 0.150 | 0.2909 |
Quality metrics paint a consistent-but-modest picture: 4.6 fails less and scores higher on alignment, but effect sizes are small (d=0.13) and the LLM-classification methodology adds a layer of uncertainty. The next section asks whether these quality differences manifest in the editing process itself.
Edit timeline analysis tracks every Edit and Write tool call, building per-file content ownership maps to detect when a model later overwrites its own earlier output. Opus 4.5 rewrites 18.2% of its edits vs 11.6% for Opus 4.6. Overlap classification reveals the rewrites are predominantly iterative refinement (64% for 4.5, largest category), not error recovery.
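The ownership-map idea can be sketched compactly. The function below is a simplified illustration, not the pipeline's actual implementation: it treats each edit as a line range per file and flags any edit that intersects a range the model already wrote in the same task.

```python
def find_overlaps(edits):
    """Detect edits that overwrite the model's own earlier output.

    edits: ordered list of (file, start_line, end_line) tuples for one task.
    Returns indices of edits whose range intersects an earlier edit to the
    same file -- the numerator of the rewrite rate.
    """
    owned = {}          # file -> list of (start, end) ranges already written
    overlapping = []
    for i, (path, start, end) in enumerate(edits):
        ranges = owned.setdefault(path, [])
        if any(start <= e and end >= s for s, e in ranges):
            overlapping.append(i)
        ranges.append((start, end))
    return overlapping

# The third edit re-touches lines 8-12 of a.py, overlapping the first edit
rewrites = find_overlaps([("a.py", 1, 10), ("b.py", 5, 8), ("a.py", 8, 12)])
# rewrites == [2]; rewrite rate = len(rewrites) / total edit calls
```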
| Metric | 4.5 | 4.6 |
|---|---|---|
| Tasks with edits | 700 | 246 |
| Edit calls (rewrite rate denom.) | 2,453 | 1,166 |
| Rewrite rate | 16.6% | 10.3% |
| Total overlapping edits | 407 | 120 |
| Self-corrections | 47 | 38 |
| Error recovery | 64 | 18 |
| User-directed corrections | 40 | 1 |
| Iterative refinement | 256 | 63 |
The overlap composition tells a more nuanced story than the headline rewrite rate. When Opus 4.6 does overlap, a larger share is self-correction (30.4% vs 10.1% for 4.5)—meaning 4.6 catches and fixes its own mistakes more explicitly. Opus 4.5’s overlaps are more heavily iterative refinement (64% vs 59%), suggesting gradual adjustment rather than correction. Error recovery rates are comparable (15.2% vs 10.3%).
| Metric | 4.5 | 4.6 |
|---|---|---|
| Tasks with edits | 767 | 368 |
| Edit calls (rewrite rate denom.) | 2,674 | 1,765 |
| Rewrite rate | 18.2% | 11.6% |
| Total overlapping edits | 486 | 204 |
| Self-corrections | 49 | 62 |
| Error recovery | 74 | 21 |
| User-directed corrections | 54 | 1 |
| Iterative refinement | 309 | 120 |
| Complexity | 4.5 (n) | 4.6 (n) |
|---|---|---|
| Trivial | 71 | 20 |
| Simple | 199 | 64 |
| Moderate | 322 | 160 |
| Complex | 157 | 103 |
| Major | 18 | 21 |
| Measurement | Slice | Direction | Effect | padj | Sig |
|---|---|---|---|---|---|
| Has Edits Rate | complexity:simple | opus-4-5 higher | 0.488 | 0.0000 | Bonf |
| Lines Removed | complexity:simple | equal | 0.387 | 0.0000 | Bonf |
| Lines Added | task_type:refactor | opus-4-6 higher | 0.340 | 0.0187 | FDR |
| Has Overlaps Rate | complexity:simple | opus-4-5 higher | 0.332 | 0.0013 | FDR |
| Has Edits Rate | complexity:moderate | opus-4-5 higher | 0.330 | 0.0002 | Bonf |
| Lines Added | complexity:moderate | opus-4-5 higher | 0.311 | 0.0001 | Bonf |
| Max Chain Depth | iteration:minor | equal | 0.287 | 0.0351 | FDR |
| Triage Score | complexity:simple | equal | 0.268 | 0.0012 | FDR |
| Max Chain Depth | complexity:simple | equal | 0.265 | 0.0013 | FDR |
| Rewrite Rate | complexity:simple | equal | 0.258 | 0.0015 | FDR |
| Rewrite Rate | iteration:minor | equal | 0.254 | 0.0404 | FDR |
| Triage Score | iteration:minor | equal | 0.252 | 0.0387 | FDR |
| Has Overlaps Rate | iteration:minor | opus-4-5 higher | 0.248 | 0.0481 | FDR |
| Overlap Count | iteration:minor | equal | 0.246 | 0.0458 | FDR |
| Has Overlaps Rate | complexity:moderate | opus-4-5 higher | 0.234 | 0.0124 | FDR |
| Overlap Count | complexity:simple | equal | 0.230 | 0.0016 | FDR |
| Lines Removed | complexity:moderate | opus-4-5 higher | 0.224 | 0.0002 | Bonf |
| Lines Added | complexity:simple | opus-4-5 higher | 0.206 | 0.0000 | Bonf |
| Max Chain Depth | complexity:moderate | equal | 0.194 | 0.0093 | FDR |
| Overlap Count | complexity:moderate | equal | 0.185 | 0.0118 | FDR |
| Rewrite Rate | complexity:moderate | equal | 0.166 | 0.0113 | FDR |
| Triage Score | complexity:moderate | equal | 0.147 | 0.0094 | FDR |
| Measurement | Slice | Effect | padj |
|---|---|---|---|
| Has Overlaps Rate | task_type:greenfield | 0.806 | 0.2282 |
| Max Chain Depth | task_type:greenfield | 0.579 | 0.2606 |
| Triage Score | task_type:greenfield | 0.579 | 0.2606 |
| Rewrite Rate | task_type:greenfield | 0.579 | 0.2606 |
| Overlap Count | task_type:greenfield | 0.545 | 0.2606 |
| Lines Added | task_type:bugfix | 0.529 | 0.4868 |
| Lines Removed | task_type:refactor | 0.452 | 0.4819 |
| Rewrite Rate | task_type:bugfix | 0.442 | 0.2823 |
| Rewrite Rate | task_type:refactor | 0.395 | 0.3869 |
| Has Edits Rate | task_type:greenfield | 0.318 | 0.5317 |
| Triage Score | task_type:bugfix | 0.308 | 0.3949 |
| Lines Removed | task_type:bugfix | 0.304 | 0.5674 |
| Triage Score | task_type:refactor | 0.279 | 0.4835 |
| Max Chain Depth | task_type:bugfix | 0.244 | 0.4525 |
| Max Chain Depth | task_type:refactor | 0.244 | 0.4758 |
| Has Overlaps Rate | task_type:sysadmin | 0.239 | 0.1818 |
| Max Chain Depth | task_type:sysadmin | 0.228 | 0.1799 |
| Rewrite Rate | task_type:sysadmin | 0.226 | 0.1799 |
| Has Overlaps Rate | task_type:refactor | 0.220 | 0.4876 |
| Lines Added | task_type:greenfield | 0.218 | 0.6607 |
| Overlap Count | task_type:refactor | 0.207 | 0.4541 |
| Has Edits Rate | complexity:complex | 0.205 | 0.1589 |
| Max Chain Depth | task_type:feature | 0.199 | 0.5094 |
| Overlap Count | task_type:sysadmin | 0.194 | 0.1828 |
| Has Edits Rate | task_type:feature | 0.189 | 0.2309 |
| Lines Removed | complexity:complex | 0.177 | 0.3578 |
| Triage Score | task_type:sysadmin | 0.174 | 0.1799 |
| Triage Score | task_type:investigation | 0.173 | 0.6125 |
| Lines Removed | iteration:significant | 0.171 | 0.4310 |
| Triage Score | task_type:feature | 0.165 | 0.5655 |
Edit patterns capture one dimension of how the models work; the next section broadens the lens to overall resource usage and complexity scaling.
| Complexity | 4.5 n | 4.6 n |
|---|---|---|
| Trivial | 882 | 346 |
| Simple | 381 | 209 |
| Moderate | 413 | 247 |
| Complex | 198 | 112 |
| Major | 26 | 23 |
Opus 4.6 sessions skew toward higher complexity: fewer trivial tasks (37% vs 46%) and proportionally more moderate tasks (26% vs 22%). This makes raw aggregate comparisons misleading—Opus 4.6 is tackling harder work on average.
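One way to check how much of an aggregate gap the mix explains is direct standardization: reweight 4.6's per-complexity means by 4.5's task mix. A sketch using the files-per-task figures from this section's complexity-scaling table:

```python
def standardized_mean(per_stratum_means, reference_counts):
    """Reweight per-stratum means by a reference population's stratum counts."""
    total = sum(reference_counts.values())
    return sum(per_stratum_means[s] * n / total
               for s, n in reference_counts.items())

# Opus 4.6 files/task per complexity, reweighted to Opus 4.5's task mix
files_46 = {"trivial": 0.1, "simple": 0.7, "moderate": 2.9,
            "complex": 7.2, "major": 21.5}
mix_45 = {"trivial": 882, "simple": 381, "moderate": 413,
          "complex": 198, "major": 26}
adjusted = standardized_mean(files_46, mix_45)  # about 1.86
```

Under 4.5's mix, 4.6's average files per task falls from the raw 2.3 to roughly 1.9, still above 4.5's 1.7, so the gap appears partly but not wholly mix-driven.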
| Metric | 4.5 | 4.6 | Δ |
|---|---|---|---|
| Avg tools per task | 9.6 | 13.3 | B +38% |
| Avg files per task | 1.7 | 2.3 | B +40% |
| Avg lines added | 104.0 | 100.3 | ≈ Tie |
| Complexity | 4.5 tasks | 4.6 tasks | 4.5 files/task | 4.6 files/task | 4.5 lines+/task | 4.6 lines+/task | 4.5 lines−/task | 4.6 lines−/task |
|---|---|---|---|---|---|---|---|---|
| Trivial | 882 | 346 | 0.1 | 0.1 | 0 | 0 | 0 | 0 |
| Simple | 381 | 209 | 1.0 | 0.7 | 14 | 10 | 8 | 3 |
| Moderate | 413 | 247 | 2.8 | 2.9 | 112 | 79 | 37 | 26 |
| Complex | 198 | 112 | 5.6 | 7.2 | 472 | 420 | 102 | 131 |
| Major | 26 | 23 | 15.9 | 21.5 | 2006 | 1096 | 140 | 271 |
Tool calls and tools/file are classified under the “behavior” theme in the cross-cut analysis. Their per-complexity, per-task-type, and per-iteration breakdowns appear in §4’s cross-cut detail (Behavioral Findings). Key results: the tool-call gap is largest for significantly-iterated tasks (d=0.53) and trivial complexity (d=0.30), both Bonferroni-significant.
The preceding sections examined behavioral, quality, and resource dimensions. The next section examines temporal patterns—how performance unfolds within and across sessions.
Task duration survives Bonferroni correction (p=0.000001), though the effect size is negligible (d=0.005)—a case of statistical significance without practical significance, driven by sample size. Opus 4.6 takes longer per task (median 62s vs 42s, a 46% increase). The explore phase runs 2.3× longer at median (71.0s vs 31.3s). Effort distribution shows 4.6 allocates more tool calls to research (35.1% vs 28.3%) and fewer to implementation (17.5% vs 27.0%). Active-time cost is $27.48/hour for 4.6 vs $25.52/hour for 4.5 (5-min idle threshold).
| Percentile | 4.5 | 4.6 |
|---|---|---|
| p10 | 8s | 10s |
| p25 | 15s | 20s |
| Median | 42s | 1.0m |
| p75 | 2.0m | 3.4m |
| p90 | 4.5m | 8.2m |
| Duration | 4.5 | 4.6 |
|---|---|---|
| Under 30s | 772 (42.1%) | 310 (34.4%) |
| 30s – 2m | 604 (33.0%) | 270 (29.9%) |
| 2m – 10m | 392 (21.4%) | 252 (27.9%) |
| 10m – 1h | 55 (3.0%) | 61 (6.8%) |
| Over 1h | 9 (0.5%) | 9 (1.0%) |
| Session Length | Alignment (4.5 / 4.6) | Sessions (4.5 / 4.6) |
|---|---|---|
| Short (1–3 tasks) | 2.93 / 3.53 | 135 / 26 |
| Medium (4–8 tasks) | 2.93 / 3.16 | 45 / 27 |
| Long (9+ tasks) | 2.85 / 2.96 | 58 / 11 |
| Phase | Tools/File (4.5 / 4.6) |
|---|---|
| Early (first 3 tasks) | 4.95 / 5.68 |
| Later (task 4+) | 4.50 / 5.11 |
| Idle threshold | 4.5 active hrs | 4.6 active hrs | 4.5 $/hr | 4.6 $/hr | Δ $/hr |
|---|---|---|---|---|---|
| 2 min | 181.1 | 88.7 | $27.53 | $29.38 | +7% |
| 5 min | 195.4 | 94.8 | $25.52 | $27.48 | +8% |
| 10 min | 212.7 | 103.2 | $23.45 | $25.25 | +8% |
| 20 min | 235.8 | 114.9 | $21.15 | $22.68 | +7% |
| 30 min | 249.4 | 124.8 | $20.00 | $20.87 | +4% |
| 60 min | 282.3 | 148.6 | $17.67 | $17.53 | −1% |
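The active-hours figures above hinge on the idle-threshold rule: time between tasks counts as active only up to the threshold. A minimal sketch of one plausible implementation (the report's exact accounting may differ):

```python
def active_hours(task_intervals, idle_threshold_s):
    """Active wall-clock hours: task durations plus inter-task gaps,
    with each gap capped at the idle threshold."""
    total = 0.0
    prev_end = None
    for start, end in sorted(task_intervals):
        if prev_end is not None and start > prev_end:
            # idle time between tasks counts only up to the threshold
            total += min(start - prev_end, idle_threshold_s)
        total += max(0.0, end - start)
        prev_end = end if prev_end is None else max(prev_end, end)
    return total / 3600.0

# Two 10-minute tasks separated by a 1-hour gap, 5-minute threshold:
# 10m + 5m (capped gap) + 10m = 25 minutes of active time
hrs = active_hours([(0, 600), (4200, 4800)], idle_threshold_s=300)
```

Dollars per active hour is then total session cost divided by this figure, which is why a larger threshold lowers the hourly rate.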
Context-window compaction occurs in 9.8% of 4.5 sessions (32/327) and 11.7% of 4.6 sessions (22/188). Pre/post comparisons show improvement after compaction, but a position-adjusted control group—splitting non-compacting sessions at the median compaction position to isolate position effects—reveals the effect is driven by session position, not compaction itself (position-adjusted effect: −0.17 for 4.5, −0.17 for 4.6). Compaction appears to preserve rather than degrade performance.
| Metric | 4.5 | 4.6 |
|---|---|---|
| Sessions with compaction | 32 / 327 | 22 / 188 |
| Total compaction events | 51 | 35 |
| Events per compacting session | 1.59 | 1.59 |
| Auto-triggered | 70.6% | 80.0% |
| Avg pre-compaction tokens | 156,823 | 164,617 |
| Avg position in session | 59.2% | 60.0% |
| Metric | 4.5 Compacting Δ | 4.5 Control Δ | 4.5 Net | 4.6 Compacting Δ | 4.6 Control Δ | 4.6 Net |
|---|---|---|---|---|---|---|
| Alignment score | +0.08 | +0.24 | −0.17 | +0.08 | +0.24 | −0.17 |
| Satisfaction rate | +5.1pp | +3.6pp | +1.4pp | +5.6pp | +7.3pp | −1.7pp |
| Completion rate | −0.0pp | +9.0pp | −9.0pp | −1.3pp | +9.8pp | −11.1pp |
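The net effects above subtract the control group's pre/post delta from the compacting group's. A toy sketch of that computation; illustrative only, with the median compaction position represented as a split fraction for the control sessions:

```python
def position_adjusted_effect(compacting, control, split_frac):
    """Pre/post delta in compacting sessions minus the same delta in
    control sessions split at a fixed position fraction.

    compacting: list of (scores, split_index) pairs, where split_index
    marks the first post-compaction task; control: lists of scores.
    """
    def mean(xs):
        return sum(xs) / len(xs)

    def delta(pre, post):
        return mean(post) - mean(pre)

    comp = mean([delta(s[:i], s[i:]) for s, i in compacting])
    ctrl = mean([delta(s[:int(len(s) * split_frac)],
                       s[int(len(s) * split_frac):]) for s in control])
    return comp - ctrl

# One compacting session improving by 0.5 after compaction, one control
# session improving by 1.0 at the same position: net effect is -0.5
net = position_adjusted_effect([([3, 3, 3, 4], 2)], [[3, 3, 4, 4]], 0.5)
```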
Duration is classified under the “behavior” theme in the cross-cut analysis. Per-task-type and per-iteration breakdowns appear in §4’s cross-cut detail (Behavioral Findings). Key results: the duration gap is largest for significantly-iterated tasks (median 99.0s vs 45.5s) and investigation tasks (median 79.9s vs 41.0s), both Bonferroni-significant. Effect sizes are negligible (d<0.1) despite significance—driven by sample size, not practical magnitude.
Session dynamics reveal a temporal dimension to the behavioral differences. The next section synthesizes all dimensions into overall model profiles.
Observed approach: Tends to act first and adjust as needed. Jumps to implementation with minimal upfront research.
Thinking: Thinks on 75% of tasks but shallowly (2,578 avg chars). Over-thinks trivial tasks (§3).
Subagents: 49% Explore, 32% general-purpose (implementation workers). Primarily autonomous (55%).
Planning: Rarely uses planning mode (1.8%). Distributes research evenly through the task (§4).
Observed strengths: Lower tool overhead (mean 8.9 calls/task). ~4.9% cheaper overall per task (§2). Stable performance across session lengths (§8).
Observed weaknesses: Higher rewrite rate (18.2%, §6). Higher failure rate (12.0% vs 5.4%, §5).
Observed approach: Tends to research first, then implement. Front-loads investigation before touching files.
Thinking: Thinks on 59% of tasks but deeply (4,067 avg chars). Better calibrated—skips thinking on trivial, engages on complex (§3).
Subagents: 69% Explore (read-only research), 20% general-purpose. More autonomous (84%).
Planning: Uses planning mode on 12.3% of tasks (115 of 937). 43% at complex, 65% at major (§4).
Observed strengths: Lower rewrite rate (11.6%, §6). Lower failure rate (5.4%, §5). Lower cost at trivial–moderate (§2).
Observed weaknesses: 38% more tool calls per task (§7). ~4.9% more expensive overall. Costlier at major complexity (n=23, §2).
| Task Type | Observed Pattern | Evidence & Caveats |
|---|---|---|
| Trivial / simple tasks | Similar completion rates | 4.6 is 28–35% cheaper (§2); n=882/346 and 381/209 |
| Complex / major tasks | 4.6 showed higher alignment | n=112+23 for 4.6 vs 198+26 for 4.5; confounded by project differences |
| Refactoring | 4.6 produced 2.1× output tokens | 5,647 vs 2,674 avg output (§2); lower rewrite rate (§6) |
| Investigation / research | 4.6 used more Explore agents | 69% read-only subagents (§4); 2.3× longer explore phase (§8) |
| Long sessions (9+ tasks) | Both show some degradation | Small sample for late-session tasks; 4.6 may degrade faster |
| Parallel execution | 4.6 backgrounded more tasks | 4.5 spawned more agents but ran them sequentially |
Task-level data cleaning applied four exclusion rules and four informational flags to canonical tasks before analysis. Exclusions remove tasks that do not represent genuine user-model interactions; flags annotate tasks with contextual metadata without removing them.
| Rule | Description | Opus 4.5 | Opus 4.6 |
|---|---|---|---|
| slash_command | Task prompt is a slash command (/command) or <command-name> tag — these invoke built-in features, not model reasoning | — | — |
| system_continuation | Automatic continuations triggered by the system (e.g., context compaction boundaries, session resumptions) rather than deliberate user prompts | — | — |
| empty_continuation | Bare acknowledgement prompts ("continue", "ok", "yes") with zero tool calls and <5s duration — the model produced no meaningful work | — | — |
| no_response_interrupt | Tasks where the model produced zero output (0 tool calls, 0 duration) before the session ended, typically user cancellations | — | — |
These flags are preserved on included tasks for subgroup analysis but do not trigger exclusion:
- meta — Task occurred within a meta-analysis session (e.g., this report's own development), where the model analyzed its own output
- no_project — No project directory was associated with the session
- interrupted — User interrupted the model mid-work (next message was [Request interrupted]). Reasons vary: accidental, correction, redirection, or technical issues
- post_compaction — Task occurred after a context compaction event in the same session, potentially with degraded context

A potential confound arises from unequal project coverage between models. To quantify this, a sensitivity analysis compares all statistical tests on the full dataset against a restricted subset containing only tasks from projects where both models were active. If results agree across both analyses, the project confound is unlikely to explain observed differences.
This section documents how each pipeline step works. Each step includes a summary of the approach and a collapsible detail block with thresholds, algorithms, and parameters.
Each Claude Code session was segmented into tasks at user-message boundaries. An LLM annotator (Haiku) then classified each task for complexity, type, sentiment, completion status, and alignment score (1–5 scale). Behavioral metrics—subagent usage, planning, parallelization—were extracted directly from tool-call logs.
Three independent signal sources feed into sentiment aggregation: the LLM annotator's classification, keyword patterns in user messages, and structural edit signals (user corrections, rewrite rate).
Aggregation uses downgrade logic: if edit signals contradict the LLM (e.g., user corrections present but LLM says “satisfied”), the combined score is downgraded. If rewrite rate >0.3 but execution quality is “excellent,” the quality score is downgraded to “good.”
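A minimal sketch of this downgrade logic, assuming hypothetical signal encodings (the real pipeline's field names and category labels may differ):

```python
def aggregate_sentiment(llm_sentiment: str, user_corrections: int,
                        rewrite_rate: float, execution_quality: str):
    """Combine LLM sentiment with mechanistic edit signals via downgrades."""
    # Edit signals contradicting the LLM downgrade the combined sentiment
    if llm_sentiment == "satisfied" and user_corrections > 0:
        llm_sentiment = "neutral"
    # A high rewrite rate caps execution quality at "good"
    if rewrite_rate > 0.3 and execution_quality == "excellent":
        execution_quality = "good"
    return llm_sentiment, execution_quality
```

The key property is that the LLM can only be downgraded by mechanistic signals, never upgraded, which biases the aggregate toward false negatives rather than false positives.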
All 529 statistical tests are run both at the overall level and stratified across three cross-cut dimensions. Each section’s “Cross-Cut Detail” expansion shows how its metrics behave under each slice.
| Dimension | Levels | Method |
|---|---|---|
| Complexity | trivial (≤3 tools, ≤1 file, ≤20 lines), simple (≤10, ≤3, ≤100), moderate (≤30, ≤10, ≤500), complex (≤80, ≤25, ≤2000), major (above all thresholds) | Metric thresholds on tool calls, files touched, and lines changed. Lowest matching tier wins. Keyword heuristics as tiebreaker. |
| Task type | investigation, bugfix, feature, greenfield, refactor, sysadmin, docs, continuation, port | LLM-classified (Haiku) from user prompt, tool usage, and work summary. Regex pattern matching provides initial signal; LLM classification overrides at medium/high confidence, resolving previously “unknown” tasks (33.6% of dataset). Eval: 100% unknown resolution, LLM agrees with regex on 55% of classified tasks. |
| Iteration | one_shot (no back-and-forth), minor (small corrections), significant (multiple rework cycles) | LLM-classified from the user’s next message after task completion, informed by edit signal heuristics (self-corrections, rewrite rate). |
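The complexity thresholds in the table above can be sketched as a lowest-matching-tier classifier (function and variable names are illustrative, and the keyword tiebreaker is omitted):

```python
# (tier name, max tool calls, max files touched, max lines changed)
TIERS = [
    ("trivial",   3,  1,   20),
    ("simple",   10,  3,  100),
    ("moderate", 30, 10,  500),
    ("complex",  80, 25, 2000),
]

def classify_complexity(tool_calls: int, files: int, lines: int) -> str:
    """Return the lowest tier whose thresholds all hold; otherwise 'major'."""
    for name, max_tools, max_files, max_lines in TIERS:
        if tool_calls <= max_tools and files <= max_files and lines <= max_lines:
            return name
    return "major"
```

Because the lowest matching tier wins, a task must exceed a threshold on at least one of the three metrics to escape a tier.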
The edit timeline reconstructs a per-file content ownership history from every Edit/Write tool call across all sessions. When a later edit’s old_string overlaps with content placed by an earlier edit, a rewrite is detected—providing a mechanistic signal for self-correction that doesn’t depend on sentiment classification.
Overlaps are matched via three tiers, evaluated in order:
| Tier | Method | Threshold |
|---|---|---|
| Exact | String equality between prior new_string and later old_string | 100% match |
| Containment | Substring match with size constraints | ≥40 chars AND ≥30% of larger string |
| Line overlap | Jaccard coefficient on non-trivial lines (>15 chars) | Jaccard >0.3 OR coverage >0.5 |
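A sketch of the three-tier matcher, using the thresholds from the table (the function name and the exact definition of "coverage" are assumptions, not the pipeline's actual code):

```python
def overlap_tier(prior_new: str, later_old: str):
    """Classify how a later edit's old_string overlaps a prior edit's
    new_string. Returns the matching tier name, or None."""
    # Tier 1: exact string equality
    if prior_new == later_old:
        return "exact"
    # Tier 2: containment with size constraints
    small, large = sorted((prior_new, later_old), key=len)
    if len(small) >= 40 and len(small) >= 0.3 * len(large) and small in large:
        return "containment"
    # Tier 3: Jaccard overlap on non-trivial lines (>15 chars)
    lines_a = {l for l in prior_new.splitlines() if len(l) > 15}
    lines_b = {l for l in later_old.splitlines() if len(l) > 15}
    if lines_a and lines_b:
        shared = lines_a & lines_b
        jaccard = len(shared) / len(lines_a | lines_b)
        coverage = len(shared) / min(len(lines_a), len(lines_b))
        if jaccard > 0.3 or coverage > 0.5:
            return "line_overlap"
    return None
```

Evaluating the tiers in order means cheaper, higher-precision matches short-circuit the fuzzier line-level comparison.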
Each detected overlap is classified by context as a self-correction, an error recovery, or a user correction — the categories weighted in the triage score below.
A per-task triage score weights these: (self_corrections×3 + error_recoveries×2 + user_corrections×5 + max_chain_depth) / total_edits. Edit metrics were joined with task classifications to compute complexity-binned accuracy rates (100% coverage for both models).
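That formula, transcribed directly (with a guard against zero-edit tasks; the parameter names mirror the formula rather than any actual pipeline API):

```python
def triage_score(self_corrections: int, error_recoveries: int,
                 user_corrections: int, max_chain_depth: int,
                 total_edits: int) -> float:
    """Weighted per-task triage score; user corrections weigh heaviest."""
    if total_edits == 0:
        return 0.0
    return (self_corrections * 3 + error_recoveries * 2
            + user_corrections * 5 + max_chain_depth) / total_edits
```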
Claude Code compacts conversation context when token limits approach. This analysis measures whether compaction degrades task outcomes or merely correlates with session position.
86 compact_boundary system messages were found across 54 compacting sessions, with trigger type, pre-compaction token count, and session position extracted for each. Outcome impact was measured by splitting tasks into pre/post groups at the first compaction timestamp. A control group of non-compacting sessions, split at the median compaction position, isolates position effects from compaction effects.
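The pre/post split and its position-matched control might look like this sketch (task records and field names are hypothetical):

```python
def split_pre_post(tasks, first_compaction_ts):
    """Split a compacting session's tasks at its first compaction timestamp."""
    pre = [t for t in tasks if t["ts"] < first_compaction_ts]
    post = [t for t in tasks if t["ts"] >= first_compaction_ts]
    return pre, post

def control_split(tasks, median_position):
    """Control: split a NON-compacting session at the median compaction
    position, isolating session-position effects from compaction effects."""
    return tasks[:median_position], tasks[median_position:]
```

If outcomes degrade after the split in both groups, position explains the drop; degradation only in the compacting group would implicate compaction itself.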
529 tests were conducted across overall, per-complexity, and cross-cut strata, using Bonferroni correction, the most conservative multiple-comparison standard, to minimize false positives given the observational design.
Three test types were used: chi-square for categorical distributions (effect size: Cramér’s V), Mann-Whitney U for continuous metrics (Cohen’s d with bootstrap confidence intervals, n=5,000 resamples), and two-proportion Z-tests for rates (Cohen’s h). Confidence intervals on proportions use Wilson score intervals.
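A hedged sketch of how these tests and effect sizes might be computed with scipy and numpy, on illustrative synthetic data (none of this is the pipeline's actual code):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Mann-Whitney U on a continuous metric, plus Cohen's d and a bootstrap CI
a = rng.normal(8.9, 4.0, 1900)   # illustrative: model A tool calls/task
b = rng.normal(12.9, 5.0, 937)   # illustrative: model B tool calls/task
u_stat, p_mw = stats.mannwhitneyu(a, b, alternative="two-sided")
pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
d = (a.mean() - b.mean()) / pooled_sd  # negative: B higher

# Bootstrap CI on model B's mean (5,000 resamples, as in the text)
boot = [rng.choice(b, size=b.size, replace=True).mean() for _ in range(5000)]
ci_low, ci_high = np.percentile(boot, [2.5, 97.5])

# Cohen's h for two proportions (arcsine-transformed difference)
def cohens_h(p1, p2):
    return 2 * np.arcsin(np.sqrt(p1)) - 2 * np.arcsin(np.sqrt(p2))

# Wilson score interval for a proportion
def wilson(successes, n, z=1.96):
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    margin = z * np.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - margin, center + margin

# Bonferroni threshold for 529 tests
threshold = 0.05 / 529   # ~9.45e-05, matching the corrected threshold below
```

For example, cohens_h(0.120, 0.054) is roughly 0.24, in line with the failed_rate effect size reported in the table.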
Bonferroni corrected threshold: p<0.0000945. Across all 529 tests, 141 survive Bonferroni and 234 survive FDR correction. At the overall level, 21 survive Bonferroni, including alignment score (p=0.000714), duration (p<0.000001, though d=-0.000—negligible practical effect), tool calls/task (p<0.000001, d=−0.22), tools/file (p<0.000001, d=−0.10), and three categorical distributions (task completion, communication quality, autonomy level). Two of the three chi-square survivors have low-expected-cell-count warnings, which may inflate their test statistics.
| Test Category | Field | p-value | Effect Size | Bonferroni | Result |
|---|---|---|---|---|---|
| Mann-Whitney U | alignment_score | 0.000714 | d = -0.1337 | | Opus 4.6 higher (p < 0.05); CI A: [3.0, 3.1], CI B: [3.1, 3.2] |
| | duration_seconds | 0.000000 | d = -0.0000 | ✓ | Opus 4.6 lower (Bonferroni significant); CI A: [154.1, 550.6], CI B: [206.9, 499.5] |
| | tool_calls | 0.000000 | d = -0.2214 | ✓ | Opus 4.6 higher (Bonferroni significant); CI A: [8.2, 9.7], CI B: [11.6, 14.2] |
| | files_touched | 0.049942 | d = -0.1628 | | Opus 4.6 higher (p < 0.05); CI A: [1.5, 1.8], CI B: [2.0, 2.7] |
| | lines_added | 0.835121 | d = 0.0092 | | No significant difference; CI A: [86.2, 125.4], CI B: [82.8, 119.8] |
| | lines_removed | 0.147622 | d = -0.0929 | | No significant difference; CI A: [19.3, 25.4], CI B: [23.3, 37.4] |
| | lines_per_minute | 0.121768 | d = 0.1424 | | No significant difference; CI A: [36.9, 43.8], CI B: [26.4, 33.9] |
| | tools_per_file | 0.000000 | d = -0.1033 | ✓ | Opus 4.6 higher (Bonferroni significant); CI A: [4.0, 4.7], CI B: [4.7, 5.4] |
| Proportion Test | satisfaction_rate | 0.043843 | h = 0.0813 | | Opus 4.6 lower (p < 0.05); A: 23.5% [21.7%, 25.5%], B: 20.2% [17.7%, 22.9%] |
| | dissatisfaction_rate | 0.403717 | h = 0.0336 | | No significant difference; A: 11.8% [10.5%, 13.4%], B: 10.8% [8.9%, 12.9%] |
| | complete_rate | 0.000000 | h = -0.4445 | ✓ | Opus 4.6 higher (Bonferroni significant); A: 38.9% [36.7%, 41.1%], B: 60.9% [57.8%, 64.0%] |
| | failed_rate | 0.000000 | h = 0.2365 | ✓ | Opus 4.6 lower (Bonferroni significant); A: 12.0% [10.6%, 13.5%], B: 5.4% [4.2%, 7.1%] |
| | scope_expanded_rate | 0.001178 | h = 0.1551 | | Opus 4.6 lower (p < 0.05); A: 1.8% [1.3%, 2.5%], B: 0.3% [0.1%, 0.9%] |
| | one_shot_rate | 0.000000 | h = -0.3747 | ✓ | Opus 4.6 higher (Bonferroni significant); A: 42.0% [39.8%, 44.2%], B: 60.6% [57.5%, 63.7%] |
| | good_execution_rate | 0.000000 | h = 0.2075 | ✓ | Opus 4.6 lower (Bonferroni significant); A: 31.5% [29.4%, 33.6%], B: 22.3% [19.8%, 25.1%] |
| Chi-square | task_completion | 0.000000 | V = 0.2218 | ✓ | Distribution differs (p < 0.05, V = 0.2218) |
| | scope_management | 0.000000 | V = 0.2963 | ✓ | Distribution differs (p < 0.05, V = 0.2963) (low cell counts) |
| | iteration_required | 0.000000 | V = 0.2076 | ✓ | Distribution differs (p < 0.05, V = 0.2076) (low cell counts) |
| | error_recovery | 0.000268 | V = 0.1399 | | Distribution differs (p < 0.05, V = 0.1399) (low cell counts) |
| | communication_quality | 0.000000 | V = 0.2038 | ✓ | Distribution differs (p < 0.05, V = 0.2038) (low cell counts) |
| | autonomy_level | 0.000000 | V = 0.2522 | ✓ | Distribution differs (p < 0.05, V = 0.2522) (low cell counts) |
To validate robustness, all overall-level tests were re-run on a restricted dataset excluding the 8 overlapping projects. This tests whether findings depend on the specific project mix or are stable across the data.
Overall Bonferroni survivors: 15 (full dataset) vs 16 (restricted).
| Metric | Test | Full p | Restricted p | Persists? |
|---|---|---|---|---|
| Task Completion | Chi Square | 3.00e-06 | 0.00e+00 | Yes |
| Communication Quality | Chi Square | 0.00e+00 | 0.00e+00 | Yes |
| Autonomy Level | Chi Square | 0.00e+00 | 0.00e+00 | Yes |
| Alignment Score | Mann Whitney | 1.00e-06 | 0.00e+00 | Yes |
| Duration Seconds | Mann Whitney | 0.00e+00 | 0.00e+00 | Yes |
| Tool Calls | Mann Whitney | 0.00e+00 | 0.00e+00 | Yes |
| Tools Per File | Mann Whitney | 0.00e+00 | 0.00e+00 | Yes |
| Total Output Tokens | Mann Whitney | 0.00e+00 | 0.00e+00 | Yes |
| Total Input Tokens | Mann Whitney | 0.00e+00 | 0.00e+00 | Yes |
| Thinking Chars | Mann Whitney | 0.00e+00 | 4.30e-05 | Yes |
| Request Count | Mann Whitney | 0.00e+00 | 0.00e+00 | Yes |
| Cost Per Minute | Mann Whitney | 1.00e-06 | 2.90e-05 | Yes |
| Output Per Request | Mann Whitney | 0.00e+00 | 0.00e+00 | Yes |
| Cache Hit Rate | Mann Whitney | 0.00e+00 | 0.00e+00 | Yes |
| Thinking Fraction | Mann Whitney | 0.00e+00 | 0.00e+00 | Yes |
| # | Measurement | Theme | Direction | Effect | padj | Sig |
|---|---|---|---|---|---|---|
| 1 | Thinking Fraction | Thinking | opus-4-5 higher | 0.636 medium | 0.0000 | Bonf |
| 2 | Complete Rate | Quality | opus-4-6 higher | 0.445 small | 0.0000 | Bonf |
| 3 | Total Output Tokens | Cost | opus-4-6 higher | 0.388 small | 0.0000 | Bonf |
| 4 | One Shot Rate | Behavior | opus-4-6 higher | 0.375 small | 0.0000 | Bonf |
| 5 | Output Per Request | Cost | opus-4-6 higher | 0.299 small | 0.0000 | Bonf |
| 6 | Scope Management | Behavior | distributions differ | 0.296 small | 0.0000 | Bonf |
| 7 | Autonomy Level | Behavior | distributions differ | 0.252 small | 0.0000 | Bonf |
| 8 | Cost Per Minute | Cost | opus-4-5 higher | 0.237 small | 0.0000 | Bonf |
| 9 | Failed Rate | Quality | opus-4-5 higher | 0.236 small | 0.0000 | Bonf |
| 10 | Task Completion | Quality | distributions differ | 0.222 small | 0.0000 | Bonf |
| 11 | Tool Calls | Behavior | opus-4-6 higher | 0.221 small | 0.0000 | Bonf |
| 12 | Iteration Required | Behavior | distributions differ | 0.208 small | 0.0000 | Bonf |
| 13 | Good Execution Rate | Quality | opus-4-5 higher | 0.207 small | 0.0000 | Bonf |
| 14 | Communication Quality | Behavior | distributions differ | 0.204 small | 0.0000 | Bonf |
| 15 | Cache Hit Rate | Cost | equal | 0.192 negligible | 0.0000 | Bonf |
| 16 | Request Count | Cost | opus-4-6 higher | 0.190 negligible | 0.0000 | Bonf |
| 17 | Scope Expanded Rate | Behavior | opus-4-5 higher | 0.155 negligible | 0.0036 | FDR |
| 18 | Error Recovery | | distributions differ | 0.140 negligible | 0.0009 | FDR |
| 19 | Thinking Chars | Thinking | opus-4-5 higher | 0.140 negligible | 0.0000 | Bonf |
| 20 | Alignment Score | Quality | equal | 0.134 negligible | 0.0022 | FDR |
| 21 | Normalized Execution Quality | Quality | distributions differ | 0.133 negligible | 0.0000 | Bonf |
| 22 | Total Input Tokens | Cost | opus-4-5 higher | 0.119 negligible | 0.0000 | Bonf |
| 23 | Tools Per File | Behavior | opus-4-6 higher | 0.103 negligible | 0.0000 | Bonf |
| 24 | Normalized User Sentiment | Quality | distributions differ | 0.064 negligible | 0.0236 | FDR |
| 25 | Duration Seconds | Behavior | opus-4-6 higher | 0.000 negligible | 0.0000 | Bonf |
| Measurement | Theme | Effect | padj |
|---|---|---|---|
| Files Touched | Behavior | 0.163 | 0.1005 |
| Lines Per Minute | Behavior | 0.142 | 0.2091 |
| Rewrite Rate | Editing | 0.126 | 0.0528 |
| Triage Score | Editing | 0.123 | 0.0587 |
| Max Chain Depth | Editing | 0.094 | 0.0690 |
| Lines Removed | Editing | 0.093 | 0.2366 |
| Has Overlaps Rate | Editing | 0.082 | 0.0899 |
| Satisfaction Rate | Quality | 0.081 | 0.0906 |
| Has Edits Rate | Editing | 0.080 | 0.0929 |
| Estimated Cost | Cost | 0.047 | 0.8292 |
| Overlap Count | Editing | 0.042 | 0.0894 |
| Dissatisfaction Rate | Quality | 0.034 | 0.5134 |
| Lines Added | Editing | 0.009 | 0.8647 |
This analysis was developed iteratively. Two early approaches were replaced after proving unreliable:
LLM-only dissatisfaction detection: Initial LLM-based sentiment classification flagged 7–9% dissatisfaction for both models. An audit of all 59 flagged cases revealed 73–93% false positive rates—the classifiers were fooled by task-coordination language (e.g., “fix” in subagent prompts). This was replaced by the current multi-signal approach, which requires corroboration from keyword patterns and structural edit signals before classifying dissatisfaction.
LLM quality judgement: An LLM judge was asked to compare code quality between models. The judge lacked sufficient context to evaluate whether code met domain requirements and produced confident but ungrounded assessments. This was replaced by mechanistic edit timeline analysis, which detects self-corrections from the tool-call record rather than relying on subjective quality assessment.
The analysis pipeline is fully automated and can reproduce all tables and statistics from the raw session data.
Dependencies: scipy (`pip install scipy`).

```shell
# Run the full pipeline
python scripts/run_pipeline.py --data-dir comparisons/opus-4.5-vs-4.6/data

# Run without LLM calls (tables and stats only; cached annotations reused)
python scripts/run_pipeline.py --data-dir comparisons/opus-4.5-vs-4.6/data --no-llm

# Run from a specific step onward
python scripts/run_pipeline.py --data-dir comparisons/opus-4.5-vs-4.6/data --from stats

# Run specific steps
python scripts/run_pipeline.py --data-dir comparisons/opus-4.5-vs-4.6/data --steps dataset,update,report

# Check what needs re-running
python scripts/run_pipeline.py --data-dir comparisons/opus-4.5-vs-4.6/data --check-stale
```
| Step | LLM? | Estimated Cost | --no-llm behavior |
|---|---|---|---|
| collect | No | $0 | Runs normally |
| extract | No | $0 | Runs normally |
| classify | No | $0 | Runs normally |
| annotate | Yes (Haiku) | ~$7.50 | Skipped (uses cached annotations) |
| analyze | No | $0 | Runs normally |
| tokens | No | $0 | Runs normally |
| enrich | No | $0 | Runs normally |
| stats | No | $0 | Runs normally |
| findings | No | $0 | Runs normally |
| dataset | No | $0 | Runs normally |
| update | Partial (Opus) | ~$2.00 | Tables only, no LLM expression authoring |
| report | No | $0 | Runs normally |
All statistical results, tables, and charts are deterministic (no LLM). Only task annotation and expression authoring use LLM calls. The --no-llm flag produces identical quantitative results at zero API cost. Total full-pipeline LLM cost: ~$9.50.
Annotate cost estimated by reconstructing all 3,153 annotation prompts from canonical task data, measuring character counts of prompts (~3,700 chars median) and cached responses (~1,500 chars median), converting at ~4 chars/token, and applying Haiku 4.5 pricing ($0.80/MTok input, $4.00/MTok output). Includes ~20% backfill rate for task-type classification calls. Update cost estimated from the annotated template size (~490K chars, ~122K tokens input) with Opus 4.6 pricing ($15/MTok input, $75/MTok output); one LLM call per pipeline run.
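Reproducing the annotate-step arithmetic from those figures (this back-of-envelope uses only the numbers stated above and excludes the ~20% backfill):

```python
prompts = 3153
input_chars, output_chars = 3700, 1500       # median chars per prompt/response
chars_per_token = 4

in_mtok = prompts * input_chars / chars_per_token / 1e6    # millions of input tokens
out_mtok = prompts * output_chars / chars_per_token / 1e6  # millions of output tokens

# Haiku 4.5 pricing: $0.80/MTok input, $4.00/MTok output
base_cost = in_mtok * 0.80 + out_mtok * 4.00
# base_cost is about $7.06; the backfill calls bring it to the quoted ~$7.50
```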
LLM-in-the-loop analysis: All task classification, sentiment analysis, and alignment scoring was performed by LLM agents (Claude Haiku and Sonnet). This creates a circularity concern: Claude models are classifying Claude model outputs. No formal inter-rater reliability was computed. Human spot-checks validated flagged cases, but systematic bias between models (e.g., if the classifier is more generous toward outputs that resemble its own style) cannot be ruled out. All three overall chi-square Bonferroni survivors and the alignment score depend on LLM-generated categories.
Single user: All data comes from one developer’s workflow. Results may not generalize to other users, codebases, or task distributions.
Temporal confound: Opus 4.5 spans 70 days; Opus 4.6 spans 13 days. A productive week, a particular project focus, or simply the novelty of a new model could color all 937 Opus 4.6 tasks simultaneously. The null hypothesis—that all observed differences reflect the user’s changing work patterns rather than model capabilities—cannot be rejected by this design.
Observational, not experimental: Tasks were not randomly assigned to models. Opus 4.6 was used later chronologically and on different (often harder) tasks, confounding model effects with task effects.
Complexity confound: Opus 4.6’s different complexity mix (41% moderate-and-above vs 34% for Opus 4.5) inflates its resource usage metrics and may suppress its satisfaction scores. Complexity-stratified comparisons (presented throughout as cross-cut detail) partially control for this, but cannot fully separate model effects from mix effects.
Platform evolution: The Claude Code SDK evolved between December 2025 and February 2026. Changes to system prompts, available tools, or subagent defaults could contribute to behavioral differences attributed to the models.
Sample asymmetry: The 2.0:1 ratio (1,900 vs 937 tasks) means Opus 4.5 estimates have narrower confidence intervals. Effect sizes for Opus 4.6 are less precise.
User learning effect: The user may have learned to use Claude Code more effectively over time, benefiting whichever model came second in the chronological sequence.
Thanks to Anthropic for including me in the Claude Code Early Access Program and for supporting independent research into model behavior. The EAP provided early access to Opus 4.6, making this comparative analysis possible.