
Claude Code's auto mode uses a two-layer defense with a Sonnet 4.6 classifier for safe execution without permission prompts

Insight: Claude Code's auto mode employs a two-layer defense: an input layer that probes tool outputs for prompt injection, and an output layer built on a two-stage Sonnet 4.6 classifier. Stage 1 makes a fast single-token yes/no assessment; Stage 2 runs chain-of-thought reasoning only when Stage 1 flags the action. The classifier sees only user messages and tool calls; assistant reasoning is stripped so that the agent's own justifications cannot sway the verdict. A three-tier permission structure routes each tool call: Tier 1 covers safe tools on an allowlist, Tier 2 covers file operations inside the project, which bypass the classifier because version control keeps them reviewable, and Tier 3 covers high-risk actions, which receive full classifier evaluation.
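The routing described above can be sketched roughly as follows. This is a minimal illustration, not Claude Code's implementation: the tool names, the `Message` type, and the `stage1`/`stage2` callables are hypothetical stand-ins, and the sketch assumes Stage 1 returns `True` when it flags an action and Stage 2 returns `True` when it confirms the block.

```python
import posixpath
from dataclasses import dataclass
from typing import Callable, List, Optional

# Hypothetical tool names; the real allowlist is not given in the source.
SAFE_TOOLS = {"read_file", "grep", "list_dir"}   # Tier 1
FILE_TOOLS = {"write_file", "edit_file"}         # Tier 2 candidates


@dataclass
class Message:
    role: str  # "user", "assistant", or "tool_call"
    text: str


def classifier_view(transcript: List[Message]) -> List[Message]:
    """Keep only user messages and tool calls; drop assistant reasoning."""
    return [m for m in transcript if m.role in ("user", "tool_call")]


def in_project(path: str, root: str) -> bool:
    """True if path, resolved against root, stays inside root."""
    root = posixpath.normpath(root)
    full = posixpath.normpath(posixpath.join(root, path))
    return full == root or full.startswith(root + "/")


def tier_of(tool: str, path: Optional[str], root: str) -> int:
    if tool in SAFE_TOOLS:
        return 1
    if tool in FILE_TOOLS and path is not None and in_project(path, root):
        return 2
    return 3


Classifier = Callable[[List[Message]], bool]


def decide(tool: str, path: Optional[str], root: str,
           transcript: List[Message],
           stage1: Classifier, stage2: Classifier) -> str:
    """Return "allow" or "block" for one proposed tool call."""
    if tier_of(tool, path, root) in (1, 2):
        return "allow"                  # no classifier call for Tiers 1 and 2
    view = classifier_view(transcript)
    if not stage1(view):                # fast yes/no pass; assumed True = flagged
        return "allow"
    # Slower reasoning pass runs only for flagged actions; assumed True = block.
    return "block" if stage2(view) else "allow"
```

Passing the classifiers in as callables keeps the sketch independent of any model API; in practice each would wrap a model call.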

Detail: The system addresses four threat categories: overeager behavior, honest mistakes, prompt injection, and model misalignment. Reported performance: a 0.4% false positive rate, a 17% false negative rate on overeager actions, and a 5.7% false negative rate on synthetic exfiltration attempts. Three consecutive denials or 20 total denials trigger escalation, forcing human review. The policy is customizable through environment trust boundaries, more than 20 block rules, and narrow allow exceptions. Simon Willison is skeptical of AI-based prompt injection protection, arguing that deterministic sandboxing at the infrastructure level provides more reliable security than LLM-based evaluation.
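The escalation rule above (3 consecutive or 20 total denials) can be sketched as a small counter. The `DenialTracker` class is hypothetical, and it assumes that an allowed action resets the consecutive count while the total keeps accumulating for the session; the source does not specify either detail.

```python
from dataclasses import dataclass


@dataclass
class DenialTracker:
    """Counts classifier denials and signals when a human should step in."""
    max_consecutive: int = 3   # thresholds taken from the text above
    max_total: int = 20
    consecutive: int = 0
    total: int = 0

    def record(self, denied: bool) -> bool:
        """Record one decision; return True if escalation is triggered."""
        if denied:
            self.consecutive += 1
            self.total += 1
        else:
            # Assumption: any allowed action breaks the consecutive streak.
            self.consecutive = 0
        return (self.consecutive >= self.max_consecutive
                or self.total >= self.max_total)


tracker = DenialTracker()
for denied in (True, True, True):
    needs_human = tracker.record(denied)
# needs_human is True after the third denial in a row
```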

Sources