Long-Running Claude Code Tasks: How to Keep AI Agents on Track Across Sessions

What it is: A plain-English guide to running AI coding agents like Claude Code, Cursor, and Codex across long tasks that take hours or multiple sessions — covering context rot, progress files, session resets, agent-team patterns, and the real costs.
Who it is for: Developers who’ve watched an agent run for 30+ minutes and seen it drift, forget the original goal, or “wrap up early” when work clearly remained.
Best if: You want the patterns Anthropic and HumanLayer actually use in production, not theory.
Skip if: You haven’t used a coding agent yet. Start with How to Use AI and Claude Code Beginners Guide first.

Heads up — this is a more intermediate AI topic. Brand-new to AI? Start with How to Use AI and Best AI Tools for Beginners first. This post assumes you’ve used Claude Code, Cursor, or another agent at least once and tried letting it run for more than a quick edit. If you’re new to harness engineering, read Why AI Coding Agents Fail and What is AGENTS.md? first — this post builds on both.

What counts as a “long-running” AI agent task?

Anthropic defines long-running agents as those handling “complex tasks requiring work that spans hours, or even days.” The core challenge they name: agents “work in discrete sessions, and each new session begins with no memory of what came before.” That gap — between what the agent did yesterday and what it remembers today — is where the worst reliability problems live.

In practical terms, long-running covers any of these:

A multi-hour single task — building a new feature with 16+ subfeatures, a large refactor, or a migration that touches dozens of files.
A multi-session task — the agent works for an hour, you close the chat, come back tomorrow, and need it to pick up exactly where it left off.
A multi-day workflow — a scientific computing pipeline, a long-running data migration, an OS-level installation script that has to run unattended.

For reference: when Anthropic itself built a small full-stack app using its own three-agent harness, the run took 3 hours 50 minutes with Claude Opus 4.6 and cost roughly $125 in API tokens. The previous-version run with Opus 4.5 took 6 hours and cost $200. A solo agent run, on a smaller scope, took 20 minutes and cost $9. That’s the order-of-magnitude shift long-running work introduces.

Why do long-running tasks fail more often than short ones?

Three independent research findings explain most of it.

1. Context rot. Chroma’s Context Rot study (July 2025) tested 18 frontier models — Claude Opus 4, Sonnet 4, GPT-4.1, o3, GPT-4o, Gemini 2.5 Pro, Qwen3-235B and others — and found every model degrades on the same task as input length grows. Quality drops even on simple tasks. Distractor sentences make it worse. Counter-intuitively, models did better on shuffled context than on logically coherent prose — the structural coherence somehow hurt performance.

2. Lost-in-the-middle. Tokens near the start and end of a context window are recalled accurately. Tokens in the middle 40–60% are not. HumanLayer’s Dex Horthy, after analyzing roughly 100,000 developer agent sessions, named this region “the Dumb Zone”: the place where the agent has technically read a fact but cannot retrieve it. Separately, in Chroma’s study, Claude Opus 4 began refusing a repeated-words task at around 2,500 words of input.

3. Context anxiety. Anthropic observed this specifically in Sonnet 4.5: as the agent senses its context limit approaching, it starts wrapping up work prematurely, declaring tasks done that aren’t done, and skipping verification steps it would normally run. The closer to the ceiling, the worse the behavior.

Together, these three forces mean that the longer your session runs, the worse your agent gets, on the same task and the same model. Upgrading the model alone doesn’t fix it, because every model Chroma tested showed the same degradation. The fix is to engineer around the limit: progress files, session resets, agent teams, and smaller context loads. The next sections cover each.

What is a progress file and how do you write one?

A progress file is a short text file at your repo root that captures where the agent is in a long task. It’s the bridge between sessions — the thing the agent reads first when it starts a new session, so it doesn’t have to reconstruct everything from scratch.

Anthropic’s official pattern uses a file called claude-progress.txt (or sometimes progress.md). An initializer agent creates it; every subsequent coding session is instructed to read it plus git log first, then leave structured updates before exiting. A good progress file tracks four things:

Current status. One paragraph: where the task is right now.
Tasks completed. Short checklist of what’s been finished and what’s been verified.
Failed approaches and why. The most important section — without this, the next session re-attempts dead ends. “Tried approach X with library Y; failed because Z. Don’t retry.”
Next step. The single most important thing the next session should do.

A sample progress file, written here as claude-progress.md, that a beginner could adapt:

# claude-progress.md

## Current status
Building the customer dashboard. Backend API and DB schema complete. Frontend roughly 60% done; the dashboard page renders but the filtering controls don't update the data yet.

## Completed
- [x] Postgres schema with `customers`, `orders`, `events` tables
- [x] FastAPI endpoints (5 routes, all tested)
- [x] React dashboard scaffold + auth wired in
- [x] Customer-list view renders

## Failed approaches (do not retry)
- Tried to filter via URL query params + useEffect — got into infinite re-render loop with React Query. Switched to component-level state, which works. Don't go back to the URL-param approach.
- Tried `react-query` v6 — incompatible with our Next.js version. Stay on v5.

## Next step
Wire the filtering controls (date range + status dropdown) into the customer-list query. Should be one component change. Then add Playwright test for the filtered view.

Update the progress file at the end of every session, or whenever the agent hits a checkpoint. Treat it the way you’d treat lab notes — if you don’t write it down, the next session won’t have it.
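The four sections can also be generated and checked by a small script, so a session never exits with a half-written handoff. The following is a minimal sketch, not part of Anthropic’s pattern: the `Progress` class, `render_progress`, and `missing_sections` are hypothetical names, and the headings simply mirror the sample above.

```python
from dataclasses import dataclass, field


@dataclass
class Progress:
    """The four things a progress file should track (hypothetical structure)."""
    current_status: str
    completed: list[str] = field(default_factory=list)
    failed_approaches: list[str] = field(default_factory=list)
    next_step: str = ""


def render_progress(p: Progress) -> str:
    """Render the record in the same layout as the sample file above."""
    lines = ["# claude-progress.md", "", "## Current status", p.current_status, ""]
    lines += ["## Completed"] + [f"- [x] {item}" for item in p.completed] + [""]
    lines += ["## Failed approaches (do not retry)"]
    lines += [f"- {item}" for item in p.failed_approaches] + [""]
    lines += ["## Next step", p.next_step, ""]
    return "\n".join(lines)


# Headings assumed from the sample file; adjust if your file uses other names.
REQUIRED_HEADINGS = (
    "## Current status",
    "## Completed",
    "## Failed approaches",
    "## Next step",
)


def missing_sections(text: str) -> list[str]:
    """Return the required headings that no line in `text` starts with."""
    file_lines = text.splitlines()
    return [
        heading for heading in REQUIRED_HEADINGS
        if not any(line.startswith(heading) for line in file_lines)
    ]
```

A pre-exit check could call `missing_sections` on the file the agent wrote and refuse to end the session until the list comes back empty.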

How should you handle the context window on a long task?

As of May 2026, Claude Opus 4.7, Opus 4.6, and Sonnet 4.6 all ship a 1 million-token context window at standard pricing (no surcharge for going over 200K like the old beta required). It’s tempting to just dump the whole codebase in and let the agent figure it out.

Don’t. Chroma’s data is unambiguous: quality drops well before you hit the 1M ceiling. A bigger context window is not a license to load more; you still need to be selective about what you load. Two patterns work in practice:

The “map” approach (default). Load a high-level structure: the folder tree, the AGENTS.md / CLAUDE.md, the progress file, the test commands. Let the agent read the specific files it needs as it works. This is what Anthropic’s own three-agent harness does.
The “manual” approach (rare). Load all the relevant code upfront. Only useful when the task is genuinely about cross-file reasoning across a small codebase. Avoid for anything over a few thousand lines.

A practical rule from the production-Claude-Code community: run Claude Code’s /compact command at roughly 60% context utilization. Compaction is lossy (the model summarizes the history and replaces it), but it staves off context anxiety. The worst symptoms start beyond about 75% utilization.
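The 60% and 75% thresholds can be written down as a tiny decision function. This is only a sketch of the rule of thumb: `context_action` is a hypothetical helper, it assumes you can read a token count from your tool, and it treats 75% as the point to reset, as the FAQ in this post suggests.

```python
def context_action(tokens_used: int, window_size: int,
                   compact_at: float = 0.60, reset_at: float = 0.75) -> str:
    """Map context utilization to 'continue', 'compact', or 'reset'.

    The default thresholds are the community rule of thumb quoted in the
    text, not limits enforced by any tool.
    """
    if window_size <= 0:
        raise ValueError("window_size must be positive")
    utilization = tokens_used / window_size
    if utilization >= reset_at:
        return "reset"
    if utilization >= compact_at:
        return "compact"
    return "continue"
```

With a 1,000,000-token window, 500,000 tokens used means keep going, 600,000 means compact, and 750,000 means hand off to a fresh session.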

When should you reset the session entirely?

There’s a moment in almost every long-running session when you can feel the agent getting worse — it starts repeating itself, missing obvious files, re-suggesting approaches you’ve already rejected. That’s the signal to reset.

Anthropic’s official guidance: “clearing the context window entirely and starting a fresh agent, combined with a structured handoff that carries the previous agent’s state and the next steps.” Their note specifically warns that compaction alone isn’t enough — the compressed history still triggers context anxiety. A clean reset works better than a half-compressed one.

The reset workflow looks like this:

1. Ask the agent to update the progress file with everything it knows: completed work, failed approaches, the next step.
2. Commit your work in git. The git history becomes part of the handoff.
3. Close the current session. Use /clear in Claude Code, or simply start a new chat.
4. Open a fresh session. Have the new agent read AGENTS.md (or CLAUDE.md), the progress file, and git log -10 as its first step.
5. Resume. The new session starts with full context but a clean working memory.

Done right, a reset costs you about two minutes and recovers all the quality you’d otherwise lose.
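The opening step of the fresh session can be scripted so the same material is loaded every time. The sketch below assembles one handoff text from AGENTS.md or CLAUDE.md, the progress file, and the last ten commits. `build_handoff` and `recent_commits` are hypothetical helpers, and the progress file name is an assumption taken from the sample earlier in this post.

```python
import subprocess
from pathlib import Path


def recent_commits(repo: Path, count: int = 10) -> str:
    """Return `git log -<count> --oneline`, or a note if git is unavailable."""
    try:
        result = subprocess.run(
            ["git", "log", f"-{count}", "--oneline"],
            cwd=repo, capture_output=True, text=True, check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        # No git binary, or the directory is not a repository.
        return "(git log unavailable)"
    return result.stdout.strip()


def build_handoff(repo: Path, progress_name: str = "claude-progress.md") -> str:
    """Concatenate the files a fresh session should read first."""
    parts = []
    for name in ("AGENTS.md", "CLAUDE.md", progress_name):
        path = repo / name
        if path.is_file():  # skip whichever files this repo does not use
            body = path.read_text(encoding="utf-8").strip()
            parts.append(f"--- {name} ---\n{body}")
    parts.append(f"--- git log -10 ---\n{recent_commits(repo)}")
    parts.append("Read the material above, then continue from the 'Next step' section.")
    return "\n\n".join(parts)
```

Pasting the returned text as the first message of the new session covers step 4 of the workflow without relying on the agent to remember which files to open.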

How do agent teams (planner, generator, evaluator) work?

For really long tasks, even with progress files and resets, a single agent eventually loses the plot. The next pattern up is separating concerns across multiple agents that talk to each other through files instead of a shared chat.

Anthropic’s three-agent harness is the canonical example. They used it to build a full-stack React + Vite + FastAPI + SQLite application. Three roles:

Planner. Reads the brief, expands it into a 16+ feature specification with acceptance criteria for each. Writes the plan to a file. Done.
Generator. Reads the plan, implements each feature, runs the project, fixes runtime errors. Writes status updates to the plan file as features complete.
Evaluator. Runs Playwright tests against the application, grades the output against the plan’s acceptance criteria, sends feedback to the Generator. Negotiates a “sprint contract” with the Generator if requirements need refinement.

The agents communicate through files, never directly. Each agent’s context window stays clean because it only sees its own work plus the shared files. The pattern scales to genuinely large projects.
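To make the file-based communication concrete, here is a toy sketch in which three stand-in functions share nothing but a plan file. It is not Anthropic’s harness: the JSON layout, the function names, and the `check` callback are all assumptions for illustration, and a real setup would replace each function body with a model call or a Playwright run.

```python
import json
from pathlib import Path
from typing import Callable


def _load(plan_path: Path) -> list[dict]:
    return json.loads(plan_path.read_text(encoding="utf-8"))


def _save(plan_path: Path, features: list[dict]) -> None:
    plan_path.write_text(json.dumps(features, indent=2), encoding="utf-8")


def planner(brief: str, plan_path: Path) -> None:
    """Stand-in planner: one feature plus acceptance criterion per brief line."""
    features = [
        {"name": line.strip(), "criterion": f"{line.strip()} works end to end",
         "status": "todo", "feedback": ""}
        for line in brief.splitlines() if line.strip()
    ]
    _save(plan_path, features)


def generator(plan_path: Path) -> None:
    """Stand-in generator: mark every feature that has not passed as built."""
    features = _load(plan_path)
    for feature in features:
        if feature["status"] != "passed":
            feature["status"] = "built"  # a real generator would write code here
    _save(plan_path, features)


def evaluator(plan_path: Path, check: Callable[[dict], bool]) -> bool:
    """Grade built features with `check`; return True once all have passed."""
    features = _load(plan_path)
    for feature in features:
        if feature["status"] == "built":
            ok = check(feature)
            feature["status"] = "passed" if ok else "todo"
            feature["feedback"] = "" if ok else "criterion not met, retry"
    _save(plan_path, features)
    return all(feature["status"] == "passed" for feature in features)


def run_team(brief: str, plan_path: Path,
             check: Callable[[dict], bool], max_rounds: int = 3) -> bool:
    """Plan once, then alternate generator and evaluator via the plan file."""
    planner(brief, plan_path)
    for _ in range(max_rounds):
        generator(plan_path)
        if evaluator(plan_path, check):
            return True
    return False
```

Because every role reloads the plan from disk, each one could run in its own process or its own fresh session, which is the property that keeps the context windows separate.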

A simpler version for beginners: Aider’s --architect mode pairs a reasoning model (often o1-preview or Claude Opus) with a cheaper edit model that applies diffs precisely. The architect plans the change; the editor executes it. Aider’s own benchmark showed this combination scoring 85% on its code-editing test, a state-of-the-art result at the time, with the cheaper model doing the editing.

Which tools and commands actually help?

Claude Code /compact — summarizes the conversation history and replaces it. Lossy. Run around 60% context utilization, before symptoms start.
Claude Code /clear — nuclear reset, no recovery. Write your progress file first; then clear. The clean handoff is what makes this work.
Cursor’s session management — chat-level history is preserved across the IDE; resetting per-chat is straightforward.
Aider --architect + --editor-model: the two-model separation we covered above.
Prompt caching — Anthropic’s pricing docs note prompt caching can save up to 90% on repeated context. For long tasks where the same project files get loaded every session, this is real money.
Batch processing — 50% discount for non-real-time API workloads (Anthropic).
The walkinglabs course — Lecture 5 (“Why long-running tasks lose continuity”) plus Project 03 (“Multi-Session Continuity”) on walkinglabs.github.io. Free.

What does a long-running task actually cost?

Costs scale faster than linearly with task length, mostly because long context means lots of input tokens reloaded each turn. Real reference points:

Anthropic’s three-agent harness, full app build, Opus 4.6 — 3 hours 50 minutes, $124.70 in API tokens.
Same harness, Opus 4.5 — 6 hours, $200.
Solo Claude run, smaller scope — 20 minutes, $9.
Opus 4.7 API pricing — $5 per million input tokens, $25 per million output tokens. The new tokenizer can use up to 35% more tokens per English word than older versions, so real per-task cost has actually risen slightly even with the same headline rate.

If you’re not on a flat-rate plan (Pro at $20/mo or Max at $100–$200/mo) and you’re paying per-token via the API, watch the cost dimension as carefully as the quality dimension. Prompt caching and the progress-file pattern (which avoids reloading prior context every session) are the two highest-impact cost reductions for long tasks.
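A back-of-envelope estimator makes the arithmetic easy to rerun for your own task. The sketch below uses the Opus 4.7 rates quoted above as defaults and treats the “up to 90%” caching saving as a flat discount on cached input. That is a simplification: `estimate_cost` is a hypothetical helper, it ignores any surcharge for writing to the cache, and real savings depend on how much of each request actually hits the cache.

```python
def estimate_cost(input_tokens: int, output_tokens: int,
                  cached_input_tokens: int = 0,
                  input_price: float = 5.0, output_price: float = 25.0,
                  cache_discount: float = 0.90) -> float:
    """Estimate API cost in dollars. Prices are dollars per million tokens.

    `cached_input_tokens` is the part of `input_tokens` billed at the
    discounted rate. The 0.90 default is the best case quoted in the text.
    """
    if not 0 <= cached_input_tokens <= input_tokens:
        raise ValueError("cached_input_tokens must be between 0 and input_tokens")
    fresh_input = input_tokens - cached_input_tokens
    total = (
        fresh_input * input_price
        + cached_input_tokens * input_price * (1 - cache_discount)
        + output_tokens * output_price
    )
    return round(total / 1_000_000, 2)
```

For example, 10 million input tokens and 1 million output tokens come to about $75 with no caching, and about $39 if 8 million of those input tokens are served from the cache at the best-case discount.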

Frequently asked questions

How long is “too long” for a single agent session?

There’s no hard number, but as a rule: when you’re around 60% of your context window, run /compact; when you’re around 75%, do a clean reset with a progress-file handoff. The agent’s symptoms will often tell you sooner than the percentage will: if it starts repeating itself, missing obvious files, or re-suggesting approaches you’ve already rejected, that’s the signal.

Do bigger context windows fix the problem?

Not really. Chroma’s research shows quality degrades well before any model hits its window ceiling. A bigger window gives you more flexibility, not more reliability. On long tasks, the “map” approach (load high-level structure, let the agent read specific files on demand) is a safer default than the “manual” approach (dump everything in), whatever the window size.

Is it cheaper to use Sonnet for long tasks instead of Opus?

Often yes, since Sonnet is priced lower per token than Opus; the exact ratio changes between model generations, so check current rates. Sonnet 4.6 is competitive with Opus on most coding tasks and only meaningfully behind on the hardest reasoning. A good pattern: Opus for planning and verification, Sonnet for the bulk of code generation. This is similar to Aider’s architect/editor split.

What if my agent “wraps up” before the work is actually done?

That’s context anxiety. The fix in order of effectiveness: (1) reset the session with a fresh agent and a clean progress-file handoff; (2) explicitly tell the agent “do not declare done until the test suite passes” in your AGENTS.md or CLAUDE.md; (3) split the work into smaller bounded chunks the agent can finish well inside its window. See our Why AI Coding Agents Fail guide for the full pattern.

Can I let an agent run unattended overnight?

Some teams do, but with hard guardrails. Production setups typically run the agent in a sandboxed container, restrict its tool access, require a human review of any commits before they hit the main branch, and block destructive commands at the OS level. Without those guardrails, unattended runs can produce a lot of damage by morning. If you’re not yet at that infrastructure tier, keep an eye on it.
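One of those guardrails, blocking destructive commands, can be prototyped as a filter that every shell command passes through before it runs. The sketch below is illustrative only: the blocked lists and `is_allowed` are assumptions rather than a vetted policy, and an application-level check like this complements a sandbox and OS-level restrictions instead of replacing them.

```python
import shlex

# Hypothetical denylist for illustration; a real policy would be
# stricter and usually an allowlist.
BLOCKED_PROGRAMS = {"rm", "dd", "mkfs", "shutdown", "reboot"}
BLOCKED_GIT_ARGS = {"--force", "-f", "--hard"}
SHELL_OPERATORS = (";", "&", "|", "`", "$(")


def is_allowed(command: str) -> bool:
    """Return True only if the command passes every check in this sketch."""
    # Refuse chaining and substitution so a blocked command cannot
    # hide behind an allowed one.
    if any(op in command for op in SHELL_OPERATORS):
        return False
    try:
        tokens = shlex.split(command)
    except ValueError:
        return False  # unbalanced quotes: refuse rather than guess
    if not tokens:
        return False
    program = tokens[0].rsplit("/", 1)[-1]  # "/bin/rm" counts as "rm"
    if program in BLOCKED_PROGRAMS:
        return False
    if program == "git" and BLOCKED_GIT_ARGS.intersection(tokens[1:]):
        return False
    return True
```

A denylist like this is easy to bypass, which is why the sandboxed container and human review of commits remain the real safety boundary.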

Sources

Last reviewed: May 2026. AI coding agents change every quarter — verify benchmark numbers and pricing on the vendor pages above before making a tooling decision.
