How Anthropic Contains Claude

The short version. Anthropic published a detailed engineering post explaining how it keeps Claude contained across claude.ai, Claude Code, and Claude Cowork. Three layers: the environment (sandboxes, VMs, network controls), the model (system prompts, classifiers, training), and the external content the agent can touch. The blast radius of a misbehaving agent grows as autonomy grows, so the containers grow too.

The number that surprised them. Users approved roughly 93 percent of permission prompts in Claude Code. Approval fatigue beats oversight. The fix shipped along with the post: OS-level sandboxing cut prompts by 84 percent.

The most reliable safety control is the one the user does not have to remember to enforce. Sandboxes beat approval clicks. Hypervisors beat custom proxies. Battle-tested infrastructure beats clever first attempts.

Anthropic shipped a long engineering write-up on May 27, 2026 titled “How we contain Claude across products.” It is one of the more practical safety posts a frontier AI lab has published this year. The thesis: as agents do more things on your behalf, the radius of damage when something goes wrong grows with them, so containment has to grow with autonomy. The post walks through what that looks like at claude.ai, in Claude Code, and inside Claude Cowork.

This guide is the beginner-readable version: what the three layers are, where the real attacks happen, and what the lessons mean for anyone using Claude products.


What does “containing Claude” actually mean?

Containment is the work of making sure an AI agent cannot do damage outside the bounds you set, regardless of what the model is asked to do, what it accidentally decides to do, or what an attacker tries to trick it into doing. The bigger the agent’s powers, the more important the walls around those powers become.

Anthropic frames the threats in three buckets. User misuse, where a person intentionally or carelessly asks the model to do something harmful. Model misbehavior, where the model takes an unintended action on its own. And external attacks, where files, websites, or tools the agent touches contain instructions meant to hijack it. Each bucket needs different defenses.

What are the three layers of containment?

The environment layer. Physical and digital walls: sandboxes, virtual machines, network controls. These work whether the model is well-behaved or not. The model cannot reach what the environment will not let it reach.
The model layer. System prompts, classifiers that flag risky outputs, and training-time interventions. These shape behavior probabilistically. Effective but never 100 percent reliable.
The external content layer. Controlling what the agent is allowed to read, what tools it can call, what API permissions it has. Limiting input limits the attack surface.

Anthropic’s argument is that the environment layer is the strongest because it is deterministic. A sandbox either lets a command through or it does not. There is no probability involved. The model layer is the one most people think about (system prompts, “AI safety training”), but it is the layer that always has some failure rate.
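To make the deterministic-versus-probabilistic distinction concrete, here is a minimal Python sketch. It is an illustration only, not Anthropic's implementation: the host names, the function names, and the 99 percent block rate are all made up for the example, and the second function assumes each attempt is independent.

```python
from urllib.parse import urlparse

# Hypothetical allowlist for illustration; not Anthropic's actual configuration.
ALLOWED_HOSTS = frozenset({"api.example.com", "pypi.org"})


def egress_allowed(url: str) -> bool:
    """Environment-layer style check: same input, same answer, every time."""
    host = urlparse(url).hostname
    return host is not None and host in ALLOWED_HOSTS


def miss_rate_after(attempts: int, per_attempt_block_rate: float) -> float:
    """Chance that at least one of `attempts` independent tries slips past a
    probabilistic filter that blocks each try with the given rate."""
    return 1.0 - per_attempt_block_rate ** attempts
```

With these made-up numbers, a filter that blocks 99 percent of attempts still lets at least one through about 63 percent of the time after 100 tries (miss_rate_after(100, 0.99)), while egress_allowed gives the same yes-or-no answer no matter how many times an attacker retries.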

How does each Claude product actually sandbox itself?

claude.ai (the chat app). Uses ephemeral gVisor containers on isolated servers. Each session gets a fresh filesystem that gets wiped at the end. The blast radius is small but so is the capability, which is fine for chat.

Claude Code (the developer tool). Runs with a human-in-the-loop approval flow. Every dangerous-looking command pauses for the developer to approve. This is the layer that hit the 93 percent approval-fatigue problem. The fix shipped alongside the post: Seatbelt on macOS and bubblewrap on Linux now do OS-level sandboxing, and prompts dropped 84 percent. See the Claude Code guide for a beginner introduction.

Claude Cowork (the enterprise agent). Runs inside a sealed virtual machine. The agent works inside the VM; credentials stay outside on the host. Designed for non-technical users who cannot reliably eyeball a bash command and say “this looks fine.”

Different products, different user expertise, different containers. The pattern is the same: harden the environment first, the model second.
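For readers curious what OS-level sandboxing on Linux can look like, the sketch below builds a bubblewrap (bwrap) command line in Python. The flags are real bubblewrap options, but the specific policy (read-only root, one writable project directory, no network) and the helper name bwrap_argv are assumptions for illustration. This is not the profile Claude Code actually ships.

```python
def bwrap_argv(project_dir: str, command: list[str]) -> list[str]:
    """Build a bubblewrap command line: read-only root, writable project
    directory, and no network access. Illustrative policy only."""
    if not project_dir.startswith("/"):
        raise ValueError("project_dir must be an absolute path")
    return [
        "bwrap",
        "--ro-bind", "/", "/",               # whole filesystem visible, read-only
        "--dev", "/dev",                     # minimal device nodes
        "--proc", "/proc",                   # fresh /proc for the sandbox
        "--tmpfs", "/tmp",                   # private scratch space
        "--bind", project_dir, project_dir,  # only the project is writable
        "--unshare-net",                     # new network namespace, no outbound access
        "--die-with-parent",                 # sandbox exits if the launcher dies
        "--chdir", project_dir,
        *command,                            # the command to run inside the sandbox
    ]
```

On a machine with bubblewrap installed, the resulting list could be passed to subprocess.run. The point of the design is that a command which stays inside these walls does not need a human to approve it, which is how sandboxing reduces the number of prompts.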

Where did things actually go wrong?

The most useful part of the post is Anthropic admitting where they got attacked.

Code-before-trust failures. Malicious .claude/settings.json hooks would execute before Claude had a chance to verify whether to trust them. Fixed by moving trust verification before any local config gets loaded.
Direct prompt injection via phishing. In a red-team exercise, attackers sent users an email asking them to paste a specific prompt into Claude. Exfiltration succeeded in 24 of 25 attempts. The attack vector was the human, not the model.
Exfiltration through approved domains. An attacker used an API key they controlled against a legitimate service the agent had permission to use, so data sent to an approved domain could still end up with the attacker. Approved-domain lists were not enough; the agent needed identity-aware controls on each call.
Custom proxies being the weak link. Allowlist proxies Anthropic wrote internally turned out to leak in ways the underlying hypervisor did not. The lesson: prefer battle-tested infrastructure over clever custom layers.
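Two of the fixes above are easy to picture in code. The sketch below is a simplified illustration under stated assumptions, not Anthropic's code: the trusted-project set, the account identifiers, and both function names are hypothetical. The first function checks trust before it opens any project-local config; the second approves a call only when both the host and the identity behind the credential are approved.

```python
import json
from pathlib import Path
from typing import Optional

# Hypothetical data for illustration only.
TRUSTED_PROJECTS = {"/home/dev/my-app"}
APPROVED_CALLERS = {("api.example.com", "acct-company")}  # (host, account identity)


def load_project_settings(project_dir: str) -> Optional[dict]:
    """Check trust BEFORE touching any project-local config."""
    if project_dir not in TRUSTED_PROJECTS:
        return None  # untrusted: the settings file is never opened or parsed
    settings_path = Path(project_dir) / ".claude" / "settings.json"
    if not settings_path.is_file():
        return {}
    return json.loads(settings_path.read_text())


def call_allowed(host: str, account_id: str) -> bool:
    """Identity-aware check: the host and the credential's owner must both
    be approved, so an approved domain alone is not enough."""
    return (host, account_id) in APPROVED_CALLERS
```

A domain-only allowlist would approve a call to api.example.com no matter whose key was attached. Here the same host with an unapproved account is refused.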

What should regular Claude users take from this?

Four things.

One. The chat surface (claude.ai) is the lowest-risk because the container is tightest. If you only use chat, you do not need to think much about this. Read the How to use Claude AI guide and you are fine.

Two. If you use Claude Code, the new OS-level sandboxes do most of the work. Do not disable them. Approve commands you actually understand; deny ones you do not.

Three. Never paste a prompt that someone emailed or messaged to you without reading it. The 24-of-25 phishing result is the modern version of “do not click strange links.” It applies to AI agents too. The companion Zero Trust for AI agents guide goes deeper.

Four. Limit Claude’s permissions to what the task actually needs. If you set up Claude connectors, grant the minimum scope. Read-only beats read-write when you do not need write.
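The least-privilege idea in point four can be sketched in a few lines of Python. The scope names and the is_permitted helper are hypothetical; real connectors define their own permission models, so treat this as a picture of the principle, not of any product's settings.

```python
from enum import Flag, auto


class Scope(Flag):
    """Hypothetical permission bits for a connector."""
    READ = auto()
    WRITE = auto()
    DELETE = auto()


def is_permitted(granted: Scope, needed: Scope) -> bool:
    """True only if every permission the action needs was granted."""
    return (granted & needed) == needed
```

With a read-only grant, is_permitted(Scope.READ, Scope.WRITE) is False, so even a hijacked agent cannot modify anything through that connector.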

Why does this matter for the rest of the AI industry?

Most safety posts from frontier labs talk about training-time alignment: how the model is taught not to do bad things. This post is different because it focuses on what happens when alignment is not enough. The premise is that probabilistic defenses will miss some attacks, so deterministic boundaries (sandboxes, VMs, hypervisors) have to catch the rest.

That framing matches how most of the security field already works. Web apps assume input validation will sometimes fail, so they sandbox processes. Operating systems assume drivers will sometimes crash, so they isolate them. AI agents are catching up to the rest of computing on this point. Expect the other major labs to follow with similar disclosures.

Frequently asked questions

Is Claude safe to use day-to-day?

For chat use, yes. The container Anthropic uses on claude.ai is tight enough that the model cannot reach your machine. Risks rise when you grant the model more access through Claude Code or Claude Cowork; the post is largely about how Anthropic shrinks those risks.

What is gVisor?

An open-source application kernel built by Google that intercepts a container’s system calls and handles them in user space instead of passing them straight to the host kernel. It is one of the standard ways to build a sandbox in cloud infrastructure. Anthropic uses it on claude.ai to isolate each session.

What is approval fatigue?

The pattern where a tool asks for so many permissions that users start clicking “approve” without reading. Anthropic measured this directly: developers were approving 93 percent of Claude Code permission prompts. The fix was to ask less often by sandboxing at the operating-system level instead.
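To put the two percentages side by side, here is a small worked example in Python. The session of 100 prompts is hypothetical; only the 93 percent approval rate and the 84 percent reduction come from the post, and the post does not say how the approval rate changed after sandboxing.

```python
def approved_count(prompts: int, approval_rate: float = 0.93) -> int:
    """Prompts the user clicks through at the reported approval rate."""
    return round(prompts * approval_rate)


def prompts_after_sandbox(prompts_before: int, reduction: float = 0.84) -> int:
    """Prompts remaining once sandboxing removes the reported share."""
    return round(prompts_before * (1 - reduction))
```

For every 100 prompts before sandboxing, about 93 were approved and 7 denied. After an 84 percent cut, roughly 16 prompts remain, which gives each one a better chance of being read.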

Does this apply to ChatGPT and Gemini too?

The principles do. OpenAI and Google use their own containment architectures and rarely publish them in this depth. The lesson of “deterministic boundaries beat probabilistic ones” applies to any agentic system regardless of vendor.

Should I be worried about prompt injection?

Be aware of it. The risk is highest when an agent reads content that came from somewhere you do not control (a webpage, an email, a shared file). Treat any pasted-in instruction as suspect, the same way you would treat a link in an unsolicited email.
