Why AI Coding Agents Fail: The 9 Failure Modes and the Fix

What it is: A plain-English explainer of why AI coding agents like Claude Code, Cursor, GitHub Copilot, and Codex still produce broken, half-finished, or hallucinated code in 2026 — and what working developers do to fix it.
Who it is for: Anyone who has used Claude Code, Cursor, or another AI coding agent at least once and watched it go sideways. You don’t need to be a senior engineer, but you should be comfortable opening a terminal.
Best if: You want concrete failure modes (named, with examples) plus the fix at the beginner level — not a hot take on whether AI will replace you.
Skip if: You haven’t used an AI coding agent yet. Start with How to Use AI and Claude Code Beginners Guide first.

Heads up — this is a more intermediate AI topic. If you’re new to AI overall, start with How to Use AI, then Best AI Tools for Beginners. This post assumes you’ve used an AI coding agent (Claude Code, Cursor, GitHub Copilot, or Codex) at least once and seen it produce something broken. If that’s you, keep reading — this guide explains why it happened.

Why do AI coding agents fail?

You give Claude Code a real project. It runs for 20 minutes, announces “done,” and you open the file. It added the feature but broke three tests. Or it fixed the bug but introduced two new ones. Or it confidently called a function that doesn’t exist in the library you’re using.

Your first instinct is probably the same one most developers have: “The model isn’t good enough — time to upgrade.” Hold that thought. Before paying for the next tier or switching tools, look at the numbers.

The strongest AI coding agents in May 2026 score around 85–89% on a benchmark called SWE-bench Verified — a set of curated GitHub issues with clean specs and existing tests. GPT-5.5 leads at 88.7%; Claude Opus 4.7 at 87.6%; GPT-5.3-Codex at 85.0%. Those numbers look great.

Now look at SWE-bench Pro, a harder version released after researchers found frontier models could reproduce training-set patches verbatim on some Verified tasks. On Pro, the same top models score around 46%. That 40-point gap is the size of the lie that benchmark numbers tell about real-world reliability. And SWE-bench Pro is still a curated test — it has clean specs, existing tests, and a sandboxed environment. Your actual codebase has vague requirements, no existing tests, and implicit business rules scattered everywhere.

In other words: the problem is rarely the model. The problem is almost always the workflow around the model. That workflow has a name. Engineers at Anthropic, OpenAI, and consultancies like Martin Fowler’s are now calling it harness engineering. We’ll get to it in a minute. First, the failure modes themselves.

What are the most common ways AI coding agents fail?

After hundreds of agent sessions, the consultancy HumanLayer catalogued the same handful of failure modes showing up again and again. Anthropic’s official engineering docs name the same ones. So does Martin Fowler’s harness-engineering catalog. These nine failure modes account for the large majority of frustrating agent sessions:

1. Declares victory too early

The agent finishes 80% of the task and announces “done.” It wrote the login form but never tested the submit button. It implemented the API endpoint but never called it. Anthropic’s official long-running-agents docs name this one explicitly: “Only mark features as passing after careful testing.” The fix is to require self-verification before the agent claims completion — not to upgrade the model.

2. Adds a feature but breaks the tests

The visible task gets solved. The hidden side-effects don’t get checked. The agent ships code that passes its own narrow test but breaks two existing tests it never thought to run. The fix is to baseline the test suite at the start of every session and make running it part of the loop, not an optional step.
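Baselining comes down to comparing two sets: the tests that failed before the session and the tests that fail after it. Here is a minimal sketch, assuming you have already collected the failing test names from your test runner. The test names in the example are made up.

```python
def find_regressions(baseline_failures: set[str], current_failures: set[str]) -> set[str]:
    """Tests that were passing at the start of the session but fail now."""
    return current_failures - baseline_failures


def find_fixes(baseline_failures: set[str], current_failures: set[str]) -> set[str]:
    """Tests that were failing at the start of the session but pass now."""
    return baseline_failures - current_failures


# Hypothetical session: one test was already broken before the agent started.
before = {"test_export_csv"}
after = {"test_export_csv", "test_login", "test_logout"}

regressions = find_regressions(before, after)
```

Here `regressions` holds `test_login` and `test_logout`, the two tests the agent broke. Without the baseline you could not tell them apart from `test_export_csv`, which was already failing.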

3. Loses context on long-running tasks

Research from Chroma and others shows that AI model performance degrades as the conversation gets longer — even on simple tasks. HumanLayer observed agents losing track of the original goal mid-session and hallucinating about files they had just read. By message 40, the agent has effectively forgotten what message 5 said. The fix is to reset context aggressively and use a small persistent file (a “progress note”) that survives between sessions.
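A progress note does not need a special format. The sketch below stores one as a small JSON file. The field names (`goal`, `done`, `next_step`) are assumptions chosen for this example, not a format taken from any of the sources above.

```python
import json
from pathlib import Path


def save_progress(path: Path, goal: str, done: list[str], next_step: str) -> None:
    """Write the progress note so the next session can pick up where this one stopped."""
    note = {"goal": goal, "done": done, "next_step": next_step}
    path.write_text(json.dumps(note, indent=2), encoding="utf-8")


def load_progress(path: Path) -> dict:
    """Read the progress note, or return an empty one if no session has run yet."""
    if not path.exists():
        return {"goal": "", "done": [], "next_step": ""}
    return json.loads(path.read_text(encoding="utf-8"))
```

The point of the file is that it lives outside the conversation, so a brand-new session can start from three short fields instead of 40 messages of history.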

4. Hallucinates APIs that don’t exist

The agent confidently calls library.thisMethodDoesNotExist() with plausible-sounding arguments. The 2025 Stack Overflow Developer Survey found 66% of developers say their top complaint with AI tools is “the solution is almost right, but not quite.” See our entry on AI hallucination for the underlying mechanic.
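One cheap check, offered here as a suggestion rather than something the sources above prescribe, is to ask the installed library whether the name actually exists before trusting the call. This sketch assumes a Python project, and `api_exists` is a hypothetical helper name.

```python
import importlib


def api_exists(module_name: str, attribute_path: str) -> bool:
    """Return True if module_name.attribute_path resolves in the installed library."""
    try:
        obj = importlib.import_module(module_name)
    except ImportError:
        return False
    # Walk dotted paths such as "path.join" one attribute at a time.
    for part in attribute_path.split("."):
        if not hasattr(obj, part):
            return False
        obj = getattr(obj, part)
    return True
```

For example, `api_exists("json", "dumps")` is true, while `api_exists("json", "thisMethodDoesNotExist")` is false. The check only proves the name exists. It says nothing about whether the agent passed the right arguments.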

5. Drifts from the original spec mid-task

You asked for a contact form. By the end of the session the agent has also added analytics tracking, a sign-up sidebar, and a refactor of the routing layer “while it was there.” Without a persistent feature list it can return to, the agent re-interprets the goal as the conversation grows. Anthropic’s recommended fix is a structured JSON feature list read at session start.
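To make the feature-list idea concrete, here is a rough sketch of what such a file and a scope check might look like. The field names (`id`, `description`, `passes`) and both helper functions are assumptions for illustration. Check Anthropic's own post for the structure it recommends.

```python
import json
from typing import Optional

# Hypothetical feature list for the contact-form example above.
FEATURE_LIST = json.loads("""
[
  {"id": "contact-form", "description": "Form with name, email, and message fields", "passes": true},
  {"id": "email-validation", "description": "Reject malformed email addresses", "passes": false},
  {"id": "success-message", "description": "Show a confirmation after submit", "passes": false}
]
""")


def next_feature(features: list[dict]) -> Optional[dict]:
    """Return the first feature that is not passing yet, or None if all are done."""
    for feature in features:
        if not feature["passes"]:
            return feature
    return None


def in_scope(features: list[dict], feature_id: str) -> bool:
    """A piece of work is in scope only if it appears in the feature list."""
    return any(feature["id"] == feature_id for feature in features)
```

With this list, the next piece of work is `email-validation`, and `in_scope(FEATURE_LIST, "analytics-tracking")` is false, which is exactly the kind of drift the list is meant to catch.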

6. Over-engineers a simple change

You ask the agent to rename a button. It refactors the entire component, adds three new abstractions, and writes a 40-line comment block explaining why. Martin Fowler’s catalog names “overengineering and unnecessary features” as one of the top three recurring failure modes.

7. Doesn’t read the existing codebase

The agent writes a new utility function for date formatting. Your codebase already has three of them in `/lib/utils/`. Fowler’s catalog calls this “semantically duplicate or redundant code.” The fix is to write a project README or CLAUDE.md that names existing conventions and points to the utility folders the agent should reuse.
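Before you write that section of your CLAUDE.md, it helps to know what utilities you already have. The sketch below assumes a Python codebase and uses the standard `ast` module to list functions whose names contain a keyword. The helper name is hypothetical, and a JavaScript project would need a different scanner.

```python
import ast
from pathlib import Path


def find_existing_functions(root: Path, keyword: str) -> list[str]:
    """List 'file:function' for every function under root whose name contains keyword."""
    matches = []
    for source_file in sorted(root.rglob("*.py")):
        try:
            tree = ast.parse(source_file.read_text(encoding="utf-8"))
        except (SyntaxError, ValueError):
            # Skip files that are not valid, readable Python source.
            continue
        for node in ast.walk(tree):
            is_function = isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
            if is_function and keyword.lower() in node.name.lower():
                matches.append(f"{source_file.name}:{node.name}")
    return matches
```

Running it with the keyword `date` gives you the list of date helpers to name in CLAUDE.md, so the agent is pointed at them instead of writing another one.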

8. Falls into circular task loops

The agent tries Approach A. It fails. The agent tries Approach A again, slightly rephrased. It fails. The agent tries Approach A a third time. HumanLayer named this explicitly. The fix is a “failed approaches” log the agent reads at session start so it doesn’t re-attempt dead ends.
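A failed-approaches log can be sketched as a list of short entries plus a lookup. Everything here is illustrative: the entry fields are assumptions, and the matching only ignores capitalization and spacing, so a genuinely reworded retry would still slip through.

```python
def normalize(description: str) -> str:
    """Lowercase and collapse whitespace so trivial rewording still matches."""
    return " ".join(description.lower().split())


def record_failure(log: list[dict], approach: str, reason: str) -> None:
    """Add a dead end to the log, along with why it failed."""
    log.append({"approach": normalize(approach), "reason": reason})


def already_failed(log: list[dict], approach: str) -> bool:
    """Check whether this approach was already tried and abandoned."""
    return any(entry["approach"] == normalize(approach) for entry in log)
```

In practice the agent reads the log itself and judges similarity in plain language. The value of writing the entries down is that the dead ends outlive the session that found them.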

9. Executes dangerous commands unprompted

The agent runs rm -rf, force-pushes a branch, or drops a database table to “clean things up.” This is rare but devastating. The fix is to put guardrails (hooks) in front of destructive commands so the agent has to ask before they run.
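The check behind such a guardrail can be very simple. The sketch below is a generic pattern match, not the hook API of Claude Code or any other tool, and the pattern list is a small assumed sample rather than a complete blocklist.

```python
import re

# Assumed sample of destructive patterns. Extend it for your own project.
DESTRUCTIVE_PATTERNS = [
    r"\brm\s+-[a-z]*r[a-z]*f",             # rm -rf and variants such as rm -rvf
    r"\brm\s+-[a-z]*f[a-z]*r",             # rm -fr
    r"\bgit\s+push\b.*(--force|-f\b)",     # force-push
    r"\bdrop\s+(table|database)\b",        # SQL drops
]


def needs_confirmation(command: str) -> bool:
    """Return True if the command matches a destructive pattern and a human should approve it."""
    return any(
        re.search(pattern, command, flags=re.IGNORECASE)
        for pattern in DESTRUCTIVE_PATTERNS
    )
```

A hook would call something like `needs_confirmation` on each shell command and pause for approval when it returns true. Pattern matching like this is a safety net, not a guarantee, so keep backups and restricted permissions as well.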


Is the problem the model, or the workflow?

Look at the nine failure modes above. Notice how few of them have anything to do with the model’s raw intelligence. “Doesn’t run the tests,” “drifts from spec,” “forgets context after 40 messages,” “writes a duplicate utility because it didn’t read /lib/” — these are all workflow failures. Switching from Claude Opus 4.6 to Opus 4.7 doesn’t fix any of them. Switching from GPT-5.3 to GPT-5.5 doesn’t either.

HumanLayer puts this bluntly after running hundreds of agent sessions: “It’s not a model problem, it’s a configuration problem. The real fix is almost always in the harness.” Anthropic’s official harness-engineering docs and OpenAI’s “Harness engineering: leveraging Codex in an agent-first world” post (February 13, 2026) say the same thing in different words. So does the Martin Fowler / Birgitta Böckeler catalog. So does the free Learn Harness Engineering course.

The framing they all converge on is Martin Fowler’s:

Agent = Model + Harness

A smarter model with no harness still fails. A weaker model with a good harness routinely wins.

What is “harness engineering” in plain English?

A harness is everything around the AI model that shapes how it works:

  • The instructions file — e.g. CLAUDE.md at the root of your repo, telling the agent how your project is laid out and what conventions to follow.
  • The tools the agent can call — bash, file editing, web search, MCP servers. What it can and cannot do.
  • The test loop — whether `npm test` or `pytest` runs automatically after every change, and whether the agent has to see passing tests before claiming done.
  • The file-system state — what’s in the repo before the session starts. What gets cleaned up after.
  • The verification gates — the rules that decide when “done” is actually done.

Harness engineering is the practice of designing those pieces deliberately rather than letting them happen by accident. OpenAI’s phrasing for engineers using Codex: “give it a map, not a 1,000-page instruction manual.”

For most beginners, harness engineering means three concrete files at the root of your project: a CLAUDE.md or AGENTS.md (project rules and conventions), a feature list (what the current goal is), and a progress note (what’s been tried, what worked, what to skip). That’s it. Three files. We have full guides on the CLAUDE.md side in our CLAUDE.md Guide and CLAUDE.md Pattern.

How can beginners avoid these failures right now?

Six concrete steps you can take today, in priority order. None require touching your model setup.

  • Write a CLAUDE.md at your repo root. Anthropic’s own docs note that Claude “treats CLAUDE.md specially by keeping it in context.” Include three things: how to run the project, how to run the tests, and conventions the agent shouldn’t break. 30 minutes of work; pays back across every future session. Detailed walkthrough in our CLAUDE.md Guide.
  • Keep a feature list and a progress note. Two short files: one says what you’re trying to build, the other tracks what’s already done, what failed, and why. The agent reads both at the start of every session. Stops drift; stops dead-end loops.
  • Run tests after every agent change. Make it part of the loop, not an optional step. If you don’t have tests yet, add a single end-to-end smoke test before you let the agent loose on anything else.
  • Make the agent self-verify before declaring done. Add a line to your CLAUDE.md: “Before claiming a task is complete, run the test suite and report which tests passed.” This follows Anthropic’s official guidance.
  • Keep tasks small and bounded. “Add a contact form” beats “rebuild the contact page.” OpenAI recommends working depth-first — break the goal into design, code, review, test as separate sessions.
  • Reset context aggressively. When a session drifts, don’t fight it for another 30 messages. Start a new session with a fresh handoff from your progress note. Anthropic recommends this explicitly.
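The last step, starting fresh with a handoff, is mostly a matter of turning your notes into a short opening message. Here is one possible shape for that message. The layout and the function name are assumptions, so adapt them to whatever your notes contain.

```python
def build_handoff(goal: str, done: list[str], failed: list[str], next_step: str) -> str:
    """Compose the first message of a fresh session from the progress note's contents."""

    def bullets(items: list[str]) -> list[str]:
        return [f"- {item}" for item in items] if items else ["- (none)"]

    lines = [f"Goal: {goal}", "Already done:"]
    lines += bullets(done)
    lines.append("Tried and failed, do not retry:")
    lines += bullets(failed)
    lines.append(f"Next step: {next_step}")
    return "\n".join(lines)
```

A handoff built this way runs to a handful of lines, which gives the new session the goal and the dead ends without the 30 messages of drift that came before.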

These six steps are what working developers do. They’re not glamorous. They’re not what the marketing copy talks about. But they’re the difference between “Claude Code is unreliable” and “Claude Code is the most productive engineer on the team.”

Which AI coding tools should beginners use in 2026?

All of the major coding agents have similar failure modes — the harness fixes apply across the board. The choice is mostly about price, IDE preference, and how much agent autonomy you want.

  • Claude Code (Anthropic) — Pro $20/mo (light usage), Max 5x $100/mo (~88K tokens per 5-hour window), Max 20x $200/mo (~220K tokens). Strongest CLAUDE.md handling and built-in support for multi-agent teams. The Beginners in AI default. See Claude Code Beginners Guide.
  • Cursor — Hobby free, Pro $20, Pro+ $60 (3x credits), Ultra $200 (20x credits), Teams $40/seat. The dominant agentic IDE; Composer mode does multi-file edits.
  • GitHub Copilot — Free tier, Pro $10, Pro+ $39, Business $19/seat, Enterprise $39/seat. Moves to usage-based billing June 1, 2026. The cheapest serious option.
  • OpenAI Codex — bundled inside ChatGPT Plus ($20). The standalone Codex CLI is currently #1 on Terminal-Bench 2.0.
  • Aider — free open-source CLI; you bring your own API key (Anthropic or OpenAI). Leads the Aider Polyglot benchmark with Claude Opus 4.5. The cheapest path for serious work if you already have a Claude or OpenAI API account.
  • Windsurf — Codeium’s agentic IDE; Pro from $15/mo.

For a head-to-head comparison see our Claude Code vs Cursor vs Copilot guide.

How big is the trust gap?

The 2025 Stack Overflow Developer Survey gives the cleanest snapshot.

  • 84% of developers use or plan to use AI tools (up from 76% in 2024).
  • 51% of professional developers use AI tools daily.
  • By tool: ChatGPT 82% reach, GitHub Copilot 68%, Cursor 18%, Claude Code 10% (first appearance for both).
  • 23% of developers regularly use AI agents — the agentic, multi-step kind — versus the autocomplete kind.
  • But here’s the gap: 46% of developers distrust the output, only 3% “highly trust” it, and 66% say their top complaint is “almost right, but not quite.”

The 66% number is what every failure mode in this post adds up to in lived experience. The agent produces something that looks plausible and is almost right, but you can’t trust it without checking every line. That cost — the reviewing cost — is what harness engineering is trying to eliminate.

Frequently asked questions

Will a smarter model fix these problems?

Not really. The jump from Claude Opus 4.5 (80.9%) to Opus 4.7 (87.6%) on SWE-bench Verified is significant, but the workflow failures — premature done, lost context, drifted spec, duplicated utilities — happen at every model strength. Upgrade the model when reasoning quality is your bottleneck. Fix the harness when reliability is your bottleneck. Most beginners need the harness fix first.

Do I really need to write a CLAUDE.md if my project is small?

For a one-file project, no. For anything with more than ~5 source files or a real test suite, yes: the time you spend writing it pays back by your second session. Anthropic’s own engineering team uses CLAUDE.md files on every project; the Thoughtworks Technology Radar placed AGENTS.md (the cross-tool equivalent) at “Trial” in April 2026, meaning enterprise teams are adopting it as a standard.

What’s the difference between CLAUDE.md and AGENTS.md?

CLAUDE.md is Anthropic’s convention — Claude Code reads it automatically. AGENTS.md is the cross-tool standard endorsed by OpenAI, GitHub Copilot, and others; it’s the “README file for agents” the Thoughtworks Radar named. They serve the same purpose. If your team uses multiple agents, AGENTS.md is more portable; if you’re Claude-Code-only, CLAUDE.md is fine. Many teams keep both, with one symlinking to the other.

Are AI coding agents safe to let near production code?

With a good harness, yes — thousands of teams ship production code written with Claude Code, Cursor, and Copilot daily. Without one, treat the output the same way you’d treat a draft from a junior engineer who’s never seen your codebase: review every line, run every test, never let it execute destructive commands without you watching.

Where can I learn more about harness engineering?

Three primary sources cover almost everything: Anthropic’s “Effective harnesses for long-running agents”, OpenAI’s “Harness engineering” post, and walkinglabs’ free course (12 lectures, 6 projects, MIT-licensed). Martin Fowler’s harness-engineering article is the most concise overview.

Last reviewed: May 2026. AI coding agents change every quarter — verify benchmark numbers and pricing on the vendor pages above before making a tooling decision.
