Claude Code: when the problem is not in the model, but in the system around it

Infographic showing how effort, context, cache, and system prompt affected perceived Claude Code quality.

In April 2026, Anthropic published a rare postmortem about a feeling that had been growing among Claude Code users: for some people, the agent simply seemed to have gotten worse. It wasn’t an abstract complaint like “the model is less intelligent”. There were reports of weaker responses, repetition, forgetting context, strange tool choices, and a general impression that the product had lost some of its operational reliability.

The interesting point is that Anthropic itself investigated and came to a more sophisticated conclusion than the easy explanation. According to the company, the API and inference layer were not affected. The problem was in three separate changes to the Claude Code product, the Claude Agent SDK, and Claude Cowork: a change to the default reasoning effort, a cache/context bug that discarded reasoning history, and a system prompt change that attempted to reduce verbosity but also reduced quality in programming tasks.

This case is a good portrait of the current state of coding agents. Perceived quality does not depend only on the model. It depends on the model, of course, but also on the harness, the system prompt, the context window, the cache policies, the way commands are executed, observability, rollout, communication with users, and the product engineering that ties all of this together.

In other words: a coding agent is not a model with a pretty interface. It is a distributed system of small decisions. And when some of these decisions slip up at the same time, the user feels as if the intelligence has evaporated.

What happened to Claude Code

Anthropic’s official report, published on April 23, 2026, points to three main causes for the recent quality issues.

The first happened on March 4, 2026. Anthropic changed Claude Code’s default reasoning effort from high to medium. The motivation was to reduce latency, token consumption, and situations where the interface seemed stuck because the model thought for too long. It makes sense as product intent: if a tool seems frozen, users lose confidence. But, in practice, the change removed precisely the behavior many users valued most: more time to think about difficult tasks. The company reversed the decision on April 7.

The second happened on March 26, 2026. Anthropic released an optimization to clean up old reasoning from sessions that had been inactive for more than an hour. The idea was to reduce the cost of resuming a conversation after the cache had already expired. The problem was the implementation: instead of clearing the history once, the system continued to clear the reasoning on every turn for the remainder of the session. The result was an agent that kept performing tasks, but with less and less memory of the path that got it there. The fix shipped on April 10.

The third took place on April 16, 2026. In preparation for Opus 4.7, Anthropic tweaked the harness and added a system prompt instruction to reduce verbosity. The intention was to control excessively long responses between tool calls and in final responses. But a seemingly simple line in the prompt had a disproportionate effect on code quality. The company reversed the change on April 20.

According to Anthropic, the three problems were resolved on April 20, in version v2.1.116 of Claude Code.

This timeline matters because it explains why the issue seemed confusing to those using the product. It wasn’t a single clear regression, with a single symptom and a single date. There were different changes, affecting different segments of users, at different times. To those on the outside, the aggregate effect seemed like an unstable product: sometimes good, sometimes distracted, sometimes slow, sometimes shallow.

Why did so many users experience a drop in quality?

When a coding agent gets worse, the user rarely sees the real cause. They don’t see the reasoning parameter sent to the API, which context blocks were preserved, the full system prompt, or the cache policy. They don’t know whether they are in an experiment, on a new version, or on a specific execution path.

What the user sees is behavior.

And behavior, in agents, is a sum of layers. If the agent forgets a decision made ten minutes earlier, it appears “less intelligent”. If it chooses a strange tool, it seems “less reliable”. If it responds with less context, it appears “lazier.” If it interrupts a complex task with a conclusion that is too short, it seems like it “didn’t get it.” But each of these sensations can come from a different point in the architecture.

This is one of the most important lessons learned from the episode: agent quality is not just a model benchmark.

Benchmarks help, but they don’t capture everything. A coding agent works in long conversations, reads files, runs commands, accumulates decisions, edits code, backtracks, interprets errors, and needs to keep track of what it is doing. If the context is pruned at the wrong time, if the system instruction narrows the response too much, if the default reasoning mode changes silently, or if the product does not make clear which trade-off is in use, actual performance drops even if the base model remains excellent.

For anyone using Codex, Claude Code, Cursor, Devin, Gemini CLI or any other development agent, this changes what the right question is. Instead of just asking “which model is smarter?”, we also need to ask:

  • how does the agent manage context?
  • how does it preserve previous decisions?
  • how does it check its own work?
  • how does it expose trade-offs of speed, cost and depth?
  • how does it handle long sessions?
  • how does the product test prompt and behavior changes?

The model matters a lot. But the product around it decides whether this intelligence reaches the user in one piece.

The first problem: reducing reasoning effort

The first regression described by Anthropic was born out of a real dilemma: intelligence costs time.

Modern models can improve their responses when given more room to reason, especially in engineering tasks that require understanding context, comparing alternatives and avoiding dangerous changes. But this reasoning comes at a cost: more latency, more tokens and, in interactive tools, a greater risk of the interface appearing to hang.

In Claude Code, this trade-off surfaces as effort levels. The user can choose between faster responses and responses that think more. The mistake was not in offering a more economical mode; that is useful. The mistake was making that mode the default for users who had chosen the tool precisely for its ability to solve complex problems.

And here there is a very delicate product lesson: defaults are opinions.

When a tool changes the default from high to medium, it is not just messing with a setting. It is implicitly saying, “the ideal experience should prioritize lower latency.” For some users, this is perfect. For others, especially in real codebases, the priority is different: I’d rather wait longer if it avoids a bad change.
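To make the weight of a default concrete, here is a minimal sketch of how a harness might encode that opinion; the config shape, field names and token budgets are assumptions for illustration, not Claude Code’s actual implementation:

```python
# Hypothetical harness configuration; names and values are illustrative.
from dataclasses import dataclass

@dataclass
class AgentConfig:
    # Changing this single default from "high" to "medium" silently shifts
    # the depth/latency trade-off for every user who never touched the setting.
    reasoning_effort: str = "high"

def thinking_budget(config: AgentConfig) -> int:
    # Assumed mapping from an abstract effort level to a thinking-token budget.
    return {"low": 2_000, "medium": 8_000, "high": 32_000}[config.reasoning_effort]
```

Nothing in a one-line diff like that looks risky, which is exactly why it deserves the same scrutiny and communication as a feature change.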

Anyone who works with code knows that a fast and wrong answer is more expensive than a slow and correct answer. The cost of reviewing, undoing, diagnosing, and regaining confidence can be much greater than a few extra seconds of waiting.

Anthropic recognized this point. After feedback, it reversed the change on April 7 and went back to prioritizing higher effort by default. The decision is a good reminder for any AI product: it’s not enough to optimize for the average. You need to understand which kind of error your users tolerate least.

In a casual writing tool, perhaps latency is the main problem. In an agent that changes code, changes files and executes commands, the drop in quality is more corrosive.

The second problem: cache, context and broken memory

The second problem is even more interesting from a technical point of view, because it shows how an apparently reasonable optimization can harm an essential property of agents: continuity.

According to Anthropic, the March 26 change attempted to reduce cost and latency when resuming sessions that had been idle for more than an hour. The logic was: if the cache has already expired, maybe we can clean up old blocks of reasoning and send fewer tokens on resumption. In theory, after that, the system would go back to preserving the history normally.

But the bug made the cleanup happen repeatedly. Once the session passed the inactivity trigger, each turn continued to discard previous parts of the reasoning. The agent continued working, but without having complete access to the “why” of its own decisions.
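Anthropic has not published the code, but the failure mode it describes maps onto a familiar pattern: a cleanup meant to run once that keeps firing. A minimal sketch, with invented names, of the difference between the intended behavior and the reported one:

```python
# Hypothetical reconstruction of the failure pattern; not Anthropic's code.
from dataclasses import dataclass, field

IDLE_THRESHOLD_S = 3600  # one hour of inactivity

@dataclass
class Session:
    messages: list = field(default_factory=list)  # mix of "thinking" and "text" blocks
    last_active: float = 0.0
    pruned_after_idle: bool = False

def prune_stale_reasoning(session: Session, now: float) -> None:
    """Intended behavior: drop old reasoning once, when resuming after the cache expired."""
    idle_resume = now - session.last_active > IDLE_THRESHOLD_S
    if idle_resume and not session.pruned_after_idle:
        session.messages = [m for m in session.messages if m["type"] != "thinking"]
        session.pruned_after_idle = True
    # The reported bug behaves as if that guard flag were missing: once the idle
    # condition was ever met, every later turn discarded the reasoning produced
    # since, leaving the agent with its actions but without the "why".
```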

This explains symptoms reported by users: forgetfulness, repetition, and tool choices that seemed strange. The agent could remember the superficial objective, but lost the internal thread of execution. And, in programming, this thread is decisive.

Imagine a session in which the agent reads three files, discovers that a function has legacy behavior, decides to make a minimal change to preserve compatibility, runs a test, finds an error, adjusts the approach and continues. If part of the previous reasoning disappears, the next action may seem right locally but wrong throughout the entire flow. The agent stops acting like someone who has built progressive understanding and starts acting like someone who woke up in the middle of the task with some notes lying around on the table.

This is a central difference between chatbots and agents. A chatbot can recover by asking a good question. An agent in the middle of an edit needs to maintain operational coherence. It has state. It has history. It has a trail of decisions that needs to survive time.

There is also a lesson about testing. Anthropic said the change went through review, unit tests, end-to-end tests, automated verification, and dogfooding. Still, the bug escaped. This doesn’t mean that tests don’t work. It means that agents create compound states that are difficult to reproduce: a long session, idle time, a cache miss, tool use, a follow-up in the middle of tool use, a partially pruned context. The bug lived at the intersection of these conditions.

For agent-based products, testing the “happy path” is not enough. You have to test what real usage looks like: a session interrupted and resumed hours later, a context almost full, files changing, the user correcting direction, a command failing, the agent switching between reading and writing.
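A scenario-style test for that kind of flow could look like the sketch below, reusing the hypothetical Session and prune_stale_reasoning from the previous snippet; the point is the shape of the test (work, go idle, resume, keep working), not any real API:

```python
def fake_turn(session: Session, now: float) -> None:
    # Stand-in for a real turn: run the pruning hook, then record new output.
    prune_stale_reasoning(session, now)
    session.messages.append({"type": "thinking", "content": "..."})
    session.messages.append({"type": "text", "content": "..."})
    session.last_active = now

def test_reasoning_survives_turns_after_idle_resume() -> None:
    session = Session()
    fake_turn(session, now=0)
    fake_turn(session, now=60)

    # Resume two hours later, then keep working for several turns.
    fake_turn(session, now=7_200)
    baseline = sum(m["type"] == "thinking" for m in session.messages)
    fake_turn(session, now=7_260)
    fake_turn(session, now=7_320)

    # Later turns must accumulate reasoning again, not keep discarding it.
    after = sum(m["type"] == "thinking" for m in session.messages)
    assert after > baseline
```

A test like this passes with the one-time cleanup and fails with the repeated one, which is exactly the kind of compound-state regression the happy path never exercises.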

The third problem: when a system prompt changes the product’s behavior

The third problem shows an uncomfortable truth for those who build systems with LLMs: the system prompt is product code.

Not in the romantic sense of “prompt engineering is the new programming”, but in the practical sense. A line in the system prompt can alter the agent’s behavior, change the quality of responses, affect the use of tools and introduce regressions. If that line goes into production without thorough evaluation, it has the potential to break the experience.

Anthropic added an instruction to reduce verbosity. The motivation was understandable. More capable models can be more verbose, and overly long responses in a coding agent get in the way: they pollute the terminal, consume tokens, make the conversation tiring, and can hide what matters.

But there is a difference between brevity and amputation.

A coding agent needs to be economical without losing the necessary reasoning. At certain moments, it should say little: “I ran the test, it failed in X, I’m going to adjust Y”. In others, it needs to explain risks, trade-offs, side effects, and reasons not to touch a certain file. If the prompt pushes the agent towards answers that are too short in all contexts, it may fail to carry information that helps the resolution process itself.

This point is especially important because system prompts don’t just affect the final response. They shape behavior between tool calls, the way to plan, the willingness to explore before editing, and the extent to which the agent makes uncertainty explicit.

In the postmortem, Anthropic says it expanded its per-model ablations and evaluations to better understand the impact of each line of the prompt. This is the right path. A system prompt needs governance similar to sensitive code: review, testing, gradual rollout, telemetry, rollback capability, and per-model understanding.

It’s not a cosmetic detail. It’s part of the architecture.
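What that governance can look like in practice is sketched below: the prompt as a versioned artifact with an evaluation run attached, a staged rollout, and a one-line rollback path. The structure and field names are assumptions, not Anthropic’s internal tooling:

```python
# Hypothetical prompt-governance record; illustrative only.
from dataclasses import dataclass

@dataclass(frozen=True)
class PromptVersion:
    version: str             # recorded in telemetry for every session
    text: str
    eval_run: str            # the ablation/eval run that approved this version
    rollout_fraction: float  # staged exposure: 0.05 -> 0.5 -> 1.0

PREVIOUS = PromptVersion("baseline", "...base instructions...", "evals/baseline", 1.0)
CURRENT = PromptVersion(
    version="concise-responses",
    text="...base instructions... Keep responses concise between tool calls.",
    eval_run="evals/ablation-verbosity-line",
    rollout_fraction=0.05,
)

def pick_prompt(user_bucket: float) -> PromptVersion:
    # Gradual exposure; rolling back means setting rollout_fraction to 0.0.
    return CURRENT if user_bucket < CURRENT.rollout_fraction else PREVIOUS
```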

What this case teaches about coding agents

The Claude Code case teaches that coding agents are less like “turbocharged autocomplete” and more like operational environments.

They need to read state, write state, preserve history, perform actions, ask for confirmation when there is risk, verify results and correct course. This requires multi-layered engineering:

  • model strong enough to understand real code;
  • well-managed context;
  • auditable system prompts;
  • reliable tools;
  • clear permissions;
  • logs and telemetry that capture subtle degradations;
  • assessments that simulate long flows;
  • rollback mechanisms;
  • transparent communication when something goes wrong.

When any of these layers fail, the user often blames “the AI.” And it even makes sense, because everything appears as a single character on the screen. But underneath there is a chain of components.

This is the point that much discussion about agents still ignores. In practice, the advantage does not come just from the smartest model, but from systems that combine model, context, tools and verification with less friction.

The risk of confusing model, product and experience

An important detail from Anthropic’s postmortem is the separation between API and product. The company stated that the API and inference layer were not affected. The problems were in the Claude Code experience and related products.

This distinction is important because, in the market, everything gets mixed up very quickly. A user has a bad experience on the agent and concludes that the model has become worse. Another uses the API directly and doesn’t see the same problem. Someone compares reports on social media and transforms different symptoms into a single narrative: “the model was nerfed”.

Sometimes this suspicion may be right. But in this case, according to Anthropic itself, the origin was in the product.

For teams using AI in production, this suggests a healthy practice: separate layers in evaluation.

When something gets worse, ask:

  • has the model changed?
  • has the reasoning parameter changed?
  • has the system prompt changed?
  • has the tool changed?
  • has the context policy changed?
  • has the cache changed?
  • has the interface changed?
  • has user input changed?
  • has the repository or environment changed?

Without this separation, we are stuck with emotional diagnoses. And an emotional diagnosis can make a good post on X, but it rarely fixes bugs.

In real work, we need traceability. Agent version, model version, effort configuration, command logs, diffs, tests, context size, sources consulted, and done criteria. The more an agent acts on important systems, the more it needs to leave a trail.
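A per-run trace record along those lines might look like the sketch below; the fields mirror the list above, and the names are illustrative, not any tool’s actual schema:

```python
# Hypothetical trace record for one agent run; field names are illustrative.
from dataclasses import dataclass, field

@dataclass
class AgentRunTrace:
    agent_version: str                 # e.g. the CLI/product release
    model_id: str                      # exact model snapshot used
    reasoning_effort: str              # effective effort/thinking setting
    system_prompt_version: str
    context_tokens: int                # how full the window was
    commands: list[str] = field(default_factory=list)
    diffs: list[str] = field(default_factory=list)
    test_results: list[str] = field(default_factory=list)
    sources: list[str] = field(default_factory=list)   # files and docs consulted
    done_criteria: str = ""            # what "ready" means for this task
```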

The code leak and the operational crisis surrounding Claude Code

In the same period, another episode increased the tension around Claude Code. According to a TechCrunch report published on April 1, 2026, Anthropic had accidentally made the source code of the Claude Code application accessible in a recent release. The code circulated on GitHub, and the company attempted to remove copies via takedown notices.

The problem is that the takedown would have hit around 8,100 repositories, including legitimate forks of the Claude Code public repository. According to the report, Boris Cherny, who leads Claude Code, stated that the measure was accidental and that Anthropic withdrew most of the notices, keeping the removal only for the repository and forks that contained the leaked code.

This episode is different from the quality problem described in the postmortem, but it speaks to it on one point: operational trust.

Code agent tools live inside a team’s most sensitive flow: the repository. They read files, suggest changes, execute commands and participate in building software. Therefore, the expectation of operational maturity is high. A code leak followed by overly broad takedowns doesn’t just affect the company’s image. It reinforces the question that technical users always ask: is this tool predictable enough to enter my process?

It’s not a hostile question. It’s the correct question.

Programming agents need to be evaluated not just on brilliant demos, but on incident response. How do they communicate? How do they revert? How do they document? How do they limit damage? How do they fix it without punishing legitimate users? How do they learn from mistakes?

In this sense, the transparency of the quality postmortem was positive. But the sequence of events as a whole shows how much the category is still maturing. We are placing increasingly capable agents in increasingly real environments. The governance bar needs to rise along with them.

Practical lessons for those who use Claude Code, Codex, Cursor or other agents

The useful part of this case is not picking a side in tool tribalism. It’s extracting better practices for working with agents.

The first lesson: give the agent a way to check its work. Claude Code’s best practices documentation emphasizes testing, screenshots, expected outputs, and success criteria. This goes for any agent-based tool. If the agent can run tests, compare results, and see the error, it no longer depends solely on manual review.

The second: explore first, plan later, implement last. Agents tend to look more impressive when editing. But, in a real codebase, the safe order is different: understand files, identify restrictions, propose a plan, confirm, and only then change. This discipline reduces rework and avoids beautiful solutions to the wrong problem.

The third: manage context as a critical resource. A context that is full, stale, pruned, or confusing can degrade answers. Long sessions should be condensed, checkpoints should be clear, and important instructions should live in persistent files, such as CLAUDE.md, AGENTS.md, local skills or process docs.

The fourth: treat prompts and settings as part of the system. If an instruction changes behavior, it should be reviewed. If a default changes, that must be communicated. If an automation depends on a specific reasoning mode, that needs to be documented.

The fifth: maintain human checkpoints where there is risk. An agent can research, write, refactor, test, and prepare commits. But publishing, removing a file, rewriting Git history, touching credentials or changing structural behavior must go through explicit approval. Good autonomy is not the absence of brakes; it’s a well-placed brake.
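One way to express that brake, sketched under assumptions: a small gate that forces explicit approval before anything in a risky category runs. The command patterns and the approval callback are illustrative, not how any specific agent implements permissions:

```python
# Hypothetical approval gate; patterns and approval mechanism are illustrative.
import subprocess
from typing import Callable, Optional

RISKY_PATTERNS = ("git push", "git reset --hard", "rm -rf", "git rebase", "chmod")

def requires_approval(command: str) -> bool:
    # Anything that publishes, rewrites history or deletes stops here.
    return any(pattern in command for pattern in RISKY_PATTERNS)

def run_agent_command(
    command: str, approve: Callable[[str], bool]
) -> Optional[subprocess.CompletedProcess]:
    # `approve` is any callable that asks a human and returns True or False.
    if requires_approval(command) and not approve(command):
        return None  # blocked until a human explicitly allows it
    return subprocess.run(command, shell=True, capture_output=True, text=True)
```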

The sixth: don’t confuse fluency with quality. An agent that responds quickly and confidently may be wrong. An agent that takes its time and explains trade-offs could be saving hours of rework. The question is not “does it look smart?”, but “can it produce verifiable, traceable and reversible results?”.

These practices also help you compare tools. Instead of just asking which agent “codes better”, it’s worth looking at which agent:

  • preserves context for longer;
  • shows what it is doing;
  • respects permissions;
  • runs checks;
  • handles errors well;
  • supports persistent instructions;
  • accepts corrections without getting lost;
  • produces small, reviewable diffs;
  • doesn’t turn every task into an adventure.

That last one seems like a joke, but it’s engineering.

Conclusion: good agents need good engineering around them

The recent Claude Code episode is valuable because it takes the conversation beyond the usual clichés.

It wasn’t just “good model” versus “bad model.” It was a set of product decisions: reduce effort to reduce latency, optimize the cache to make resuming sessions cheaper, adjust the system prompt to control verbosity. All these decisions had a reason. They all seemed defensible in isolation. But, in aggregate, they degraded the experience for some users.

This is the challenge of modern agents: they are systems sensitive to small details.

A parameter changes the depth. A cache changes memory. A prompt changes the posture. A rollout changes who feels the problem. A miscommunication changes trust. A transparent response helps to rebuild part of it.

For users, the recommendation is practical: treat agents as powerful collaborators, not as oracles. Provide context, establish criteria, require verification, preserve traceability and maintain checkpoints in sensitive operations.

For those who build agent-based products, the lesson is even more direct: model intelligence does not compensate for system fragility. The agent may be brilliant, but if the surrounding product cuts off its thinking, erases its memory, or instructs it to talk less when it should be thinking harder, the final experience will feel dumb.

In the end, good agents need good models. But they also need good engineering all around.

And perhaps this is the most interesting phase of applied AI: we are moving away from the question “which model responds best?” and into the more mature, more difficult and more useful question: “which system can turn intelligence into reliable work?”.

References