Claude Code's auto mode uses a two-layer defense with a Sonnet 4.6 classifier for permission-free safe execution


Source type	Official Docs
Confidence	High
Last verified	2026-03-26
Topics	`claude-code`

Insight: Claude Code's auto mode employs a two-layer defense: an input layer that probes tool outputs for prompt injection, and an output layer built on a two-stage Sonnet 4.6 classifier. Stage 1 makes a fast single-token yes/no assessment; Stage 2 runs chain-of-thought reasoning only on actions that Stage 1 flags. The classifier sees only user messages and tool calls; assistant reasoning is stripped so the agent cannot rationalize its own action to the classifier. A three-tier permission structure decides which actions reach the classifier: Tier 1 covers safe tools on an allowlist, Tier 2 covers file operations inside the project (these bypass the classifier because changes remain reviewable through version control), and Tier 3 covers high-risk actions, which receive full classifier evaluation.
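The routing described above can be sketched roughly as follows. This is an illustrative sketch only, not Anthropic's implementation: the tool names, the `Event` type, the path check, and the `stage1`/`stage2` callables are all hypothetical stand-ins for the real allowlist and the Sonnet 4.6 classifier calls.

```python
from __future__ import annotations

import os.path
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List, Optional


class Tier(Enum):
    SAFE = 1          # allowlisted tools, no classifier call
    PROJECT_FILE = 2  # file edits inside the project, no classifier call
    HIGH_RISK = 3     # everything else is sent to the classifier


@dataclass
class Event:
    role: str      # "user", "assistant", or "tool_call"
    content: str


# Hypothetical tool names, chosen only for illustration.
SAFE_TOOLS = {"read_file", "grep", "list_dir"}
FILE_TOOLS = {"write_file", "edit_file"}

Stage = Callable[[List[Event]], bool]  # returns True when the action looks unsafe


def tier_for(tool: str, path: Optional[str], project_root: str) -> Tier:
    """Assign a tool call to one of the three tiers."""
    if tool in SAFE_TOOLS:
        return Tier.SAFE
    if tool in FILE_TOOLS and path is not None:
        root = os.path.normpath(project_root)
        target = os.path.normpath(path)  # collapses ".." segments
        try:
            inside = os.path.commonpath([root, target]) == root
        except ValueError:  # e.g. mixing absolute and relative paths
            inside = False
        if inside:
            return Tier.PROJECT_FILE
    return Tier.HIGH_RISK


def classifier_view(transcript: List[Event]) -> List[Event]:
    """Drop assistant reasoning; keep only user messages and tool calls."""
    return [e for e in transcript if e.role in ("user", "tool_call")]


def classify(view: List[Event], stage1: Stage, stage2: Stage) -> bool:
    """Return True if the action should be blocked."""
    if not stage1(view):
        return False     # fast path: Stage 1 raised no flag
    return stage2(view)  # slower reasoning runs only on flagged actions


def decide(tool: str, path: Optional[str], project_root: str,
           transcript: List[Event], stage1: Stage, stage2: Stage) -> str:
    """Route a tool call through the tiers and return "allow" or "deny"."""
    if tier_for(tool, path, project_root) is not Tier.HIGH_RISK:
        return "allow"
    blocked = classify(classifier_view(transcript), stage1, stage2)
    return "deny" if blocked else "allow"
```

In this sketch a Tier 1 or Tier 2 call never reaches the classifier, and the expensive `stage2` call is only paid for actions that `stage1` has already flagged.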

Detail: The system addresses four threat categories: overeager behavior, honest mistakes, prompt injection, and model misalignment. Reported performance: a 0.4% false positive rate, a 17% false negative rate on overeager actions, and a 5.7% false negative rate on synthetic exfiltration attempts. After 3 consecutive denials or 20 total denials, auto mode escalates to human review. The policy is customizable through environment trust boundaries, 20+ block rules, and narrow allow exceptions. Simon Willison is skeptical of AI-based prompt injection protection, arguing that deterministic sandboxing at the infrastructure level provides more reliable security than LLM-based evaluation.
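The escalation rule (3 consecutive or 20 total denials) can be illustrated with a small counter. This is a minimal sketch under stated assumptions: the `DenialTracker` name is hypothetical, and it assumes that an allowed action resets the consecutive count while the total never resets within a session. The source specifies only the two thresholds.

```python
from dataclasses import dataclass

MAX_CONSECUTIVE_DENIALS = 3   # threshold from the source
MAX_TOTAL_DENIALS = 20        # threshold from the source


@dataclass
class DenialTracker:
    """Hypothetical per-session counter for classifier denials."""
    consecutive: int = 0
    total: int = 0

    def record(self, denied: bool) -> bool:
        """Record one decision; return True when a human should review."""
        if denied:
            self.consecutive += 1
            self.total += 1
        else:
            # Assumption: any allowed action breaks the consecutive streak.
            self.consecutive = 0
        return (self.consecutive >= MAX_CONSECUTIVE_DENIALS
                or self.total >= MAX_TOTAL_DENIALS)
```

Under these assumptions, three denials in a row trigger review immediately, while scattered denials trigger it only once the session total reaches 20.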

Sources

John Hughes / Anthropic Engineering — "Claude Code auto mode: a safer way to skip permissions" (2026-03-25)
Simon Willison — "Auto mode for Claude Code" (2026-03-24)