Claude Opus 4.8 Feels Worse? Real Complaints, Likely Causes, and How to Fix It

Two days after Anthropic shipped Claude Opus 4.8, the reviews and the forums disagreed. The press called it sharper and more honest. A lot of developers said it felt more cautious, quicker to refuse, or no different from before.

We run our own multi-agent setup on Claude every day, so this isn’t a review. It’s a working note: what people are actually seeing, why, and what to do about it. Most of the time the cause is more boring, and more fixable, than “the model got dumber.”

What people are complaining about

The model over-warns or refuses on legitimate work. Security folks especially reported it treating normal security code as suspicious. Answers feel shorter. And after a rocky 4.7 release, many people just don’t trust that a new version is better because a benchmark says so.

Trust, once dented, colors everything.

”Feels worse” is usually a config problem, not the model

“Opus feels worse” is rarely one thing. It’s usually one of four: a thinking-mode mismatch, a long session quietly dropping context, the service under peak load, or the same model behaving differently on claude.ai versus Claude Code versus the API.

You diagnose it by changing one variable at a time: fresh session, one surface, then thinking mode, then thread age.

The honesty trade-off

The headline feature of 4.8 is honesty. It flags uncertainty and makes fewer unsupported claims, about four times less likely to leave a flaw in its own code unmentioned. That’s good. But tuning a model to be less overconfident usually makes it more cautious too: more caveats, more confirmations, more refusals on edge cases.

The security-code refusals are that trade-off showing up. Anthropic now documents refusal categories in the API, so you can detect and reroute instead of guessing.

The effort default nobody re-baselined

This catches the most people, and it isn’t the model. Opus 4.8’s “effort” setting controls how much it thinks. The default is “high.” But Anthropic recalibrated the levels, so “high” on 4.8 thinks less than “high” did on 4.7, and serious coding wants “xhigh,” which is not the default.

Upgrade, keep your old config, and you quietly get less reasoning on the same task. That’s not a worse model. It’s a worse setting.

Anthropic’s own postmortems say the same thing

In April, Anthropic published a postmortem on a run of “Claude Code got worse” episodes. Almost none were the weights: a default effort cut that got reverted; an adaptive-thinking mode that spent zero reasoning on some turns and produced confident hallucinations (fake commit hashes, packages that don’t exist); an idle-session bug that made it look forgetful; a verbosity tweak that hurt coding and was rolled back four days later.

One analysis of nearly 7,000 sessions found an earlier Opus reasoning 67% less than months before. The lesson predates 4.8: most “the AI got worse” moments are config, prompts, routing, and load, not intelligence.

How to fix most of it

Set effort to xhigh for coding instead of trusting the default.
Diagnose by changing one thing at a time, and start a fresh session at task boundaries. Long, compacted threads are the usual “it forgot what we decided” culprit.
For refusals on legitimate work, give context that makes the legitimacy obvious, and use the documented refusal categories to reroute.
Don’t confuse fast mode with quality. It’s the same model run faster at a premium, not a cheaper, dumber tier.
You don’t have to upgrade on day one. 4.7 is still available, so pin the version you trust and test the new one on the side.

The real fix is measurement

Vibes lie. “It feels worse” is a real signal but a terrible diagnosis. The only way to know if a model regressed for your work, versus a config change, a long session, or a bad day, is to measure it: tokens per task, success rate, retries before a task lands.

In our setup every task is logged and each role runs a fixed model, so “this feels off” becomes a number we can check against last week. Usually it was the session, not the model.

We build that discipline into our own stack: fixed models per role, per-task logging, and clean restarts at task boundaries. CoveLab Foundation ships that layer, so “it feels worse today” becomes a number you can check instead of a guess you argue about.

Sources: Anthropic engineering postmortem on Claude Code quality (April 2026); Anthropic “What’s new in Claude Opus 4.8” API documentation; Hacker News Opus 4.8 launch discussion; launch-day developer reviews aggregating Hacker News and X feedback.

Researched with AI, then corrected, adapted, and approved by the owner.