Claude Code & Agentic Coding¶
Coding agents — tools that can both write and execute code in a loop — emerged as the most impactful development in the LLM space in 2025. Claude Code, Anthropic's CLI agent, reached $1B run-rate revenue by December 2025 and now writes approximately 80% of its own code. Every major AI lab followed with its own CLI agent: OpenAI's Codex CLI, Google's Gemini CLI, Qwen Code, Mistral Vibe. → willison-2025-review-coding-agents-most-impactful
This article synthesizes 93 knowledge entries into a coherent picture of how agentic coding works, what makes it effective, where it breaks down, and where it's heading. It draws on both the community-sourced knowledge base and Samsen's internal frameworks.
How coding agents work¶
A coding agent is an LLM plus a system prompt plus tools, running in a loop. The model generates a tool call (read a file, run a command, edit code), the harness executes it and feeds results back, and the model decides what to do next. This loop — not the model's raw intelligence — is what makes agents fundamentally different from chat-based AI. → how-coding-agents-work-fundamentals
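The loop is simple enough to sketch. The following is an illustrative skeleton, not Claude Code's actual implementation: `call_model` is a hypothetical stand-in for an LLM API call that returns either a tool request or a final answer, and the tool registry is deliberately minimal.

```python
import subprocess

# Hypothetical tool registry: tool name -> callable returning a text observation.
TOOLS = {
    "read_file": lambda path: open(path).read(),
    "run_command": lambda cmd: subprocess.run(
        cmd, shell=True, capture_output=True, text=True
    ).stdout,
}

def agent_loop(call_model, task, max_turns=20):
    """Model proposes an action, the harness executes it and feeds the result
    back, and the model decides what to do next, until it declares done."""
    history = [{"role": "user", "content": task}]
    for _ in range(max_turns):
        action = call_model(history)  # e.g. {"tool": "run_command", "args": ["pytest"]}
        if action.get("done"):
            return action["answer"]
        observation = TOOLS[action["tool"]](*action["args"])
        history.append({"role": "tool", "content": observation})
    return "stopped: max_turns reached"
```

Everything that distinguishes real harnesses — permission prompts, context compaction, richer tool schemas — is elaboration on this skeleton.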
The LLM itself is stateless. Every turn, the harness replays the full conversation history, which means costs escalate as conversations grow. Token caching mitigates this by offering cheaper rates for repeated input prefixes, but the fundamental architecture means long sessions get expensive and eventually degrade in quality. → context-management-ai-degradation
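Because the full history is replayed every turn, cumulative input tokens grow roughly quadratically with conversation length. A toy model with invented per-turn numbers shows why:

```python
def cumulative_input_tokens(turns, tokens_per_turn=2_000):
    """Each new turn resends the whole history, so turn t costs about
    t * tokens_per_turn input tokens (ignoring caching discounts)."""
    return sum(t * tokens_per_turn for t in range(1, turns + 1))

print(cumulative_input_tokens(10))  # 110000
print(cumulative_input_tokens(50))  # 2550000: 5x the turns, ~23x the input tokens
```

Caching discounts the repeated prefix, but the replay itself is architectural, which is why long sessions still get expensive and degrade.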
Claude Code specifically is designed as "a Unix utility, not a product" — composable, text-based, scriptable. Its memory system is a markdown file. Its planning mode is just restricting which tools are available. This simplicity makes it composable with other tools in ways that AI IDEs with proprietary prompting layers cannot match. The trade-off: direct model access costs more than a flat-rate IDE subscription (average $6/day vs $20/month for Cursor). → claude-code-unix-philosophy
AI IDEs like Cursor and Windsurf are architecturally similar — an LLM plus tools plus optimized prompts — but add layers of abstraction. The main agent produces "semantic diffs," a cheaper "apply model" writes actual files, and a linter provides feedback for self-correction. Understanding this architecture explains common frustrations: when an IDE "deletes random code," that's the apply model failing, not the main agent. → how-ai-ides-work-practical-tips
The workflow evolution: prompts to context to specs¶
From prompt engineering to context engineering¶
The field has moved through three distinct eras of practitioner skill:
- Prompt engineering — crafting individual prompts carefully
- Context engineering — constructing the full information environment an AI needs
- Agent engineering — defining specialized, reusable, portable agent configurations
→ prompt-to-context-to-agent-engineering
Context engineering, popularized by Shopify CEO Tobi Lutke and Andrej Karpathy in mid-2025, reflects that real-world success comes not from clever phrasing but from assembling the right context window: relevant code, design docs, error logs, schemas, examples, and constraints. As one practitioner puts it, context engineering is what software engineering has always been — gathering and organizing information to make good decisions. AI tools just made the pattern explicit. → context-engineering-over-prompt-engineering → swe-always-context-engineering
A critical finding: effective context length is far shorter than advertised. The NOLIMA benchmark shows that at 32K tokens, 11 of 12 tested models drop below 50% of their short-context baseline. Claude 3.5 Sonnet's effective length was just 4K tokens despite advertising 200K. The practical lesson: CLAUDE.md files, context imports, and tool outputs should be ruthlessly curated for relevance, not padded with everything available. → effective-context-length-shorter
Spec-driven development¶
Experienced practitioners converge on a spec-first workflow: brainstorm a detailed spec with the AI, generate a step-by-step plan, then implement in small iterative chunks. Addy Osmani calls this "waterfall in 15 minutes" — a rapid structured planning phase before code generation. At Anthropic, ~90% of Claude Code's own code is now written by Claude Code using this approach. → spec-first-llm-workflow
Effective specs follow five principles: start with a high-level vision and let the AI draft details, break large tasks into scoped specs rather than one massive prompt, use Plan Mode first, include structure and boundary constraints, and treat the spec as a living artifact. The spec becomes the first artifact built collaboratively — before any code. → spec-writing-framework-ai-agents
Structured context formats¶
A systematic study of 9,649 experiments across 11 models found that YAML and Markdown outperform JSON for LLM context, and that frontier models benefit significantly from filesystem-based context retrieval. A "grep tax" was observed with minimalist token-saving formats: unfamiliar formats cause models to spend more tokens iterating, not fewer. → yaml-markdown-best-llm-context-format
CLAUDE.md, Skills, and agent configuration¶
CLAUDE.md: the single most impactful file¶
Every Claude Code project should have a CLAUDE.md at the repository root — a persistent context file read automatically at the start of every session. It serves as the "constitution" of a project: instructions in CLAUDE.md are treated as immutable system rules above user prompts. → claude-md-instruction-hierarchy → claude-md-comprehensive-guide-builder
Best practices from teams consuming billions of tokens per month:
- Start with guardrails, not a manual — document based on what Claude gets wrong, not everything it might need to know
- Don't embed entire doc files — that bloats every run's context. Instead, mention the path and explain when to read it: "For complex usage or if you encounter FooBarError, see path/to/docs.md"
- Treat CLAUDE.md like ad space with token budgets per section — if a team can't explain their tool concisely, it's not ready for the CLAUDE.md
- Review regularly — instructions get outdated as codebases evolve
→ claude-code-full-feature-guide
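Applied together, these practices yield a short guardrails-style file rather than a manual. A hypothetical example (the paths, commands, and error name are invented for illustration):

```markdown
# Project guardrails

- Run `make test` before claiming a task is done; never skip failing tests.
- Use the existing `db/queries/` helpers; do not write raw SQL in handlers.
- For complex migration usage or if you encounter `MigrationLockError`,
  read `docs/migrations.md` (not inlined here to save context).
```

Note the last rule points at a doc instead of embedding it, so the cost is a one-line pointer on every run rather than the whole file.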
The concepts are cross-tool: AGENTS.md serves the same purpose for Cursor, Zed, and OpenCode, signaling convergence on instruction files as a standard for AI coding context.
Skills: progressive disclosure for agent capabilities¶
Skills are Markdown files telling the model how to do something, optionally accompanied by scripts and templates. They are conceptually simpler than MCP: no server infrastructure, no protocol negotiation, just files on disk. Skills are extremely token-efficient — each adds only a few dozen tokens via frontmatter YAML, with full details loaded only when relevant. This mirrors web development's lazy-loading pattern. → claude-skills-bigger-than-mcp → agent-skills-standard-progressive-disclosure
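A skill is typically a single SKILL.md whose YAML frontmatter is all that sits in context until the skill is invoked. A hypothetical example (the skill name and steps are invented; the `name` and `description` frontmatter fields follow the published Skills format):

```markdown
---
name: changelog-writer
description: How to draft a changelog entry from merged PRs. Use when the
  user asks for release notes.
---

# Changelog writer

1. Run `git log --oneline <last-tag>..HEAD` to list merged changes.
2. Group entries under Added / Fixed / Changed.
3. Match the tone of the most recent section in `CHANGELOG.md`.
```

Only the frontmatter (a few dozen tokens) is always loaded; the body below the `---` delimiters is read on demand.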
The pattern is spreading cross-platform. OpenAI adopted skills in both ChatGPT's Code Interpreter and Codex CLI, with a /home/oai/skills folder mirroring Anthropic's approach. Skills work because any LLM tool with filesystem access can use them, making adoption trivially easy. → skills-pattern-cross-platform → skills-concept-cross-platform
Hooks: deterministic automation at agent event boundaries¶
Hooks enable deterministic automation responses to agent events — pre/post deploy automation, schema validation, URL checking. The main challenge is scoping: a hook matching any Bash command will fire on every trivial operation. Effective hook design requires narrow matchers targeting specific events. → claude-code-hooks-patterns
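A narrowly scoped hook might look like the following settings fragment. The shape follows Claude Code's documented hooks configuration, but the matcher choice and script path are illustrative:

```json
{
  "hooks": {
    "PreToolUse": [
      {
        "matcher": "Edit|Write",
        "hooks": [
          { "type": "command", "command": "./scripts/validate-schema.sh" }
        ]
      }
    ]
  }
}
```

The `Edit|Write` matcher fires only on file modifications rather than on every Bash invocation, which is the narrow-matcher discipline described above.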
Subagents, Plan Mode, and orchestration patterns¶
Plan Mode: separating thinking from doing¶
Plan Mode (Shift+Tab) restricts Claude to read-only tools, forcing structured planning before any code changes. Opus Plan Mode adds interactive planning with 1M context — Claude asks clarifying questions, creates plan files with task breakdowns and dependencies, and the user reviews before execution proceeds (typically on the cheaper Sonnet model). → claude-code-plan-mode
Subagents: context isolation for noisy work¶
Subagents solve the context pollution problem: exploration and verification generate massive output that buries the main conversation. The key distinction: use skills when you want your current agent to follow a better procedure; use subagents when the work should happen in its own context and only the result matters. Subagents turn a single conversation into a coordinated team — each gets a fresh context window for noisy work and returns only a summary. → subagents-vs-skills-context-isolation
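The isolation boundary can be sketched in a few lines. This is a conceptual illustration, not Claude Code's subagent implementation; `call_model` is a hypothetical stand-in for an LLM API call:

```python
def run_subagent(call_model, subtask, max_turns=10):
    """Run a task in a fresh, isolated history; only a summary crosses back."""
    history = [{"role": "user",
                "content": subtask + "\nReply 'DONE: <summary>' when finished."}]
    for _ in range(max_turns):
        reply = call_model(history)
        if reply.startswith("DONE:"):
            return reply[len("DONE:"):].strip()  # the caller sees only this
        history.append({"role": "assistant", "content": reply})
    return "subagent exceeded its turn budget"
```

However much the subagent explored, the orchestrator's context grows by one summary, not by the full transcript.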
Multi-agent orchestration¶
Multi-agent architecture has undergone three phases in two years:
- Domain-specific chains with fragile sub-components (Dec 2024)
- Orchestrator + worker patterns as models improved (Jul 2025)
- Code-first, domain-agnostic harnesses where agents solve non-coding problems through code (Jan 2026)
The trend: more intelligence means less architecture. Raw model improvements have obsoleted many architectural workarounds from earlier phases. → multi-agent-architecture-evolution-2026
Practical patterns include the RPI workflow (Research, Plan, Implement with 8 specialized agents), ROADMAP.md + numbered task files for persistent project management across sessions, and the "Ralph" technique — a deliberately simple bash loop that feeds specs to agents repeatedly until outcomes are met. → rpi-workflow-specialized-agents → roadmap-task-file-workflow → ralph-loop-technique
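The Ralph technique in particular is simple enough to sketch. This version assumes Claude Code's non-interactive `claude -p` print mode; the spec path and `make test` outcome check are invented placeholders:

```python
import subprocess

def ralph_loop(spec_path="SPEC.md", check_cmd=["make", "test"], max_iters=50):
    """Feed the same spec to the agent repeatedly until the outcome check passes."""
    for i in range(max_iters):
        if subprocess.run(check_cmd).returncode == 0:
            return i  # outcome met after i agent runs
        spec = open(spec_path).read()
        subprocess.run(["claude", "-p", f"Implement the spec below.\n\n{spec}"])
    raise RuntimeError("outcome not met within max_iters")
```

The deliberate crudeness is the point: all intelligence lives in the spec and the check, not in the loop.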
Model selection strategy¶
Optimal usage involves a multi-model strategy: Sonnet as primary orchestrator, Haiku for lightweight agents (90% of Sonnet capability at 3x cost savings), and Opus for complex reasoning requiring 1M context. The Sonnet orchestrator + Haiku workers pattern reduces overall costs by 2-2.5x. Opus fast mode delivers 2.5x speed at 6x the normal price. → claude-code-model-selection-strategy → opus-46-fast-mode
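Under the figures above (Haiku at roughly one third of Sonnet's price), the 2-2.5x saving follows from simple cost blending, assuming workers consume most of the tokens. The 80% worker share below is an assumed illustration, not a measured figure:

```python
def blended_savings(worker_share=0.8, haiku_price_ratio=1 / 3):
    """Cost multiple saved by Sonnet-orchestrator + Haiku-workers vs. all-Sonnet,
    assuming workers consume `worker_share` of total tokens."""
    mixed_cost = (1 - worker_share) * 1.0 + worker_share * haiku_price_ratio
    return 1 / mixed_cost

print(round(blended_savings(), 2))  # 2.14: within the reported 2-2.5x range
```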
The productivity reality¶
What the data actually shows¶
The gap between hype and measured reality is significant. Stack Overflow's 2025 survey (49,000+ developers) shows 84% use AI tools but only 60% view them favorably — down from 70% in 2023. The biggest frustration, cited by 66%, is AI solutions that are "almost right but not quite." Studies show 20-30% productivity improvements, far from the "10x" claims. 77% of developers do no whole-app vibe coding at work. → ai-coding-productivity-reality
Within tech companies, there's social pressure to confirm productivity gains. Trys Mudford observes that when every engineer is incentivized to tell their boss AI makes them 10x productive, CEOs relay false consensus — creating a feedback loop that distorts reality.
The 80% problem and comprehension debt¶
The earlier "70% problem" (agents completing 70% of tasks) has evolved to 80%+ completion, but the nature of the problem shifted. AI errors evolved from syntax bugs to conceptual failures: assumption propagation (building on faulty premises), abstraction bloat (1,000 lines where 100 suffice), dead code accumulation, and sycophantic agreement. Jeremy Twei coined "comprehension debt" — the gap between code generated and code understood by its nominal owner. → 80-percent-comprehension-debt
Only 48% of developers consistently check AI code before committing, while 38% find reviewing AI logic harder than human-written code.
The complexity cliff¶
Benchmarks reveal a predictable pattern: models achieve ~40% success on isolated tasks (scaffolding, implementing explicit specs) but drop to ~25% on multi-step integrations involving state management and design judgment. The gap between "AI helped me ship" and "AI gave me a mess" comes down to context engineering, prompt specificity, and workflow structure — all within the developer's control. → ai-coding-complexity-cliff
Quality as a skill issue¶
When AI-generated code is bad, it's often a prompting skill issue rather than a model limitation. LLMs can produce code at all levels of quality — it's up to the user's instructions to set the right defaults. The effective prompting frameworks are systematic and repeatable: context-task-constraint structure, reference patterns pointing to existing code, incremental complexity layering, and error-feedback loops. → ai-coding-quality-skill-issue → prompt-engineering-systematic-skill-coding
Gap: Longitudinal productivity studies
The knowledge base lacks peer-reviewed longitudinal studies measuring team productivity over months (not just individual task completion). Existing data is mostly survey-based or anecdotal. Needed: controlled studies tracking team velocity, defect rates, and onboarding time with and without coding agents over 3-6 month periods.
Gap: Cost-benefit analysis at scale
No entries quantify the full cost of agentic coding (token spend, review overhead, debugging time for agent mistakes) against productivity gains. Individual data points exist ($6/day average, $1,000/day for some teams) but no systematic TCO analysis.
Verification: the new bottleneck¶
As AI handles code generation at speed, the critical bottleneck shifts from writing code to verifying it. AI is analogous to a "high-speed junior developer" — incredibly fast and often syntactically correct, but prone to subtle mistakes only experienced eyes catch. → verification-bottleneck-ai-coding
The code review crisis¶
Teams with high AI adoption complete 21% more tasks and merge 98% more pull requests, but PR review time increases 91%. Traditional code review is becoming a bottleneck. The proposed shift: move human oversight upstream to specification review, not downstream to code inspection. A five-layer replacement model includes competitive generation (multiple agents solve the same problem), deterministic guardrails (tests, type checks, contracts), behavior-driven development (humans author acceptance criteria before code generation), permission architecture, and adversarial verification. → code-review-dying-shift-upstream → code-review-verification-burden-ai
The core reframe: "The most valuable human judgment is exercised before the first line of code is generated, not after."
Testing as the essential safety net¶
Red/green TDD is "no longer optional" with coding agents. Tests serve a dual purpose: they catch agent mistakes, and they give agents clear, verifiable success criteria. The pattern is straightforward — write failing tests first, let the agent write code to pass them. Multiple practitioners independently converge on this as the single most important practice. → willison-agentic-patterns-substack-feb2026
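In miniature, with an invented `slugify` function as the subject, the pattern looks like this:

```python
import re

# Red: the human writes a failing test first. `slugify` does not exist yet,
# and the test doubles as the agent's verifiable success criterion.
def test_slugify_collapses_separators():
    assert slugify("Hello,  World!") == "hello-world"
    assert slugify("--already__slugged--") == "already-slugged"

# Green: the agent writes just enough code to make the test pass.
def slugify(text: str) -> str:
    text = re.sub(r"[^a-z0-9]+", "-", text.lower())  # runs of non-alphanumerics -> one dash
    return text.strip("-")
```

The test encodes intent the agent cannot sycophantically agree its way around: either the assertions pass or they don't.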
The "Software Factory" paradigm takes this further: StrongDM's AI team operates without human code review entirely, using "satisfaction" metrics (probabilistic success rates across observed trajectories) and "Digital Twin Universe" behavioral clones of third-party services for unlimited testing. Their benchmark: $1,000/day per engineer in token costs. → software-factory-dark-factory
Context contamination¶
Every action pairing in context creates potential patterns — telling Claude to "update code" followed by "deploy" can cause it to associate all code updates with immediate deployment. Practitioners must actively scan for dangerous associations, especially in long-running agentic sessions where consequences compound. This is the dark side of CLAUDE.md's persistent context: contaminated instructions persist and cascade errors. → context-contamination-poison-md
Gap: Formal verification for agent-generated code
No entries cover the intersection of formal methods (property-based testing, theorem proving, model checking) with agentic coding. As agents generate more code, formal verification techniques could provide stronger guarantees than test suites alone.
The role shift: every engineer is a manager¶
The developer role is evolving through three stages: coder (writing code directly), conductor (guiding a single AI agent synchronously), and orchestrator (running multiple agents in parallel asynchronously). Up to 90% of software engineers now use some kind of AI for coding. Over 30% of new code at Microsoft, Meta, and Google is already AI-generated. → engineer-as-manager-agentic-era → conductor-to-orchestrator
The parallel mindset¶
With agents, the cost of an idle moment changes from (cost × 1) to (cost × (1 + N)): every pause wastes not just your own time but also the N parallel agent processes that could be running. The practitioner's role shifts to CPU scheduler: queue tasks, context-switch efficiently, never let cores idle. Attention becomes the bottleneck, not execution capacity. → main-thread-parallel-agent-mindset
Design and planning as the real bottleneck¶
When agents handle implementation at scale, design and planning become the primary bottleneck. Steve Yegge's Gas Town agent orchestrator runs dozens of coding agents simultaneously, revealing that the hard problems are task decomposition, conflict resolution, and quality judgment — not code generation. → design-planning-bottleneck-agent-era
The craft vs. outcomes divide¶
AI-assisted coding is making visible a philosophical split among developers that was always present. Before AI, "the motivation behind the work was invisible because the process was identical" — everyone coded the same way. Now developers must choose between directing AI-generated code or hand-crafting, making their underlying values explicit. The divide is not between those who can and cannot use AI, but between those who find meaning in the process of coding and those who find meaning in the products they ship. → ai-exposes-developer-craft-outcome-divide
What happens to juniors¶
Junior developers remain essential but their value proposition is shifting from writing boilerplate to supervising AI output. Companies cutting junior positions entirely are "cannibalizing their future" — no juniors today means no seniors tomorrow. The risk is shallow learning: 75% of developers use AI tools, but many juniors can't explain how their AI-generated code works. → junior-dev-role-shift-ai-era
Gap: Team structure patterns for agentic development
The knowledge base covers individual workflow patterns extensively but lacks entries on how teams reorganize around agentic coding. Needed: case studies on team structures, code ownership models, and coordination patterns in organizations where agents handle significant code production.
Non-developer adoption¶
One of the most significant developments is Claude Code's adoption by people who don't write code — designers, PMs, journalists, analysts.
Designers shipping code¶
Felix Lee (AI First Designer) built and shipped three apps in two weeks using Claude Code as a non-developer designer, including a Tetris game built through Figma MCP connected to FigJam. His framing: "You don't need to understand syntax. You need to understand what you're building." This corroborates Samsen's "Shipping AI Designer" thesis. → claude-code-for-designers-guide
PMs and knowledge workers¶
Hannah Stulberg (PM at DoorDash, 350+ hours of Claude Code usage) identifies context management — not AI capability — as the key differentiator. The problem with most AI tools is spending more time explaining context than getting help. CLAUDE.md solves this by giving the AI persistent knowledge about how you work. Her series framing Claude Code as "a junior employee you could train" hit 1,132 likes — one of the highest-engagement Claude Code posts on Substack. → stulberg-claude-code-non-technical-series → claude-code-context-management-nontechnical
Data analysis and journalism¶
Simon Willison's NICAR 2026 workshop demonstrated coding agents as practical tools for data journalists, handling data cleaning, web scraping, geospatial visualization, and database querying. Total cost: $23 in API tokens across all participants. → coding-agents-data-analysis-workshop
The CLI as interpretive layer¶
David Hoang frames the CLI's comeback through Claude Code as a natural language "interpretive layer." The CLI feels hostile not because it's primitive but because it assumes fluency. AI handles the translation. The CLI as a surface naturally exposes codebase structure, filesystem, and git state — providing rich context that chat interfaces lack. $1B run-rate revenue for a CLI tool is the strongest signal of where agentic coding is heading. → cli-as-interpretive-layer
Claude Cowork (launched January 2026) extends this model beyond coding — "Claude Code for the rest of your work" — validating that the workflow patterns (plan mode, parallel sessions, CLAUDE.md context files) are applicable far beyond software development. → claude-cowork-general-agent
Gap: Failure modes for non-developer users
Most non-developer adoption content is success stories. The knowledge base lacks systematic analysis of where non-developers get stuck, what they can't effectively verify, and what kinds of projects are beyond reach without programming experience. Needed: failure case studies and honest assessments of the ceiling.
The autonomous frontier¶
Background agents and async workflows¶
AI for developers has shifted from autocomplete to background agents that clone repos, run tests, and open pull requests autonomously. Claude Code for web wraps the CLI in a container and kicks off work asynchronously — developers fire off tasks across different repos and review PRs later. → background-agents-2026-landscape → claude-code-for-web
Parallel agent swarms¶
Cursor's FastRender project demonstrates thousands of parallel agents collaboratively building complex software (a web browser rendering engine). The key insight: well-specified domains with clear success criteria are ideal for agent swarm approaches. Browser rendering was chosen because it's extremely complex yet well-specified, with visual output providing immediate feedback. → parallel-agent-swarms-complex-software
The post-IDE trajectory¶
The IDE is not dying but being de-centered. The traditional inner loop of "open files, edit, build, debug" is giving way to "specify intent, delegate, observe, review, merge." The agent — not the file — becomes the unit of work. Five architectural patterns converge across tools: work isolation via git worktrees, planning replacing file-based navigation, async-first design, attention management routing notifications, and lifecycle integration embedding agents into CI/CD. → ide-decentered-agent-orchestration-primary
Steve Yegge predicts agent orchestration dashboards will replace IDE-based workflows entirely. His controversial claim: "if you're still using an IDE to write code by January 1st, you're a bad engineer." → agent-orchestration-post-ide
When speed changes what's worth doing¶
When execution cost approaches zero, fundamental shifts occur: verification becomes cheaper than prediction (build three approaches instead of debating), code becomes a consumable rather than an asset (build a dashboard for one week, then delete it), and roles exist only as long as the problem does. The mental threshold for "worth fixing" vanishes when every fix is cheap. → speed-code-as-consumable
Gap: Agent reliability and failure rates at scale
No entries systematically measure agent failure rates in production autonomous workflows. The Software Factory entry notes $1,000/day token costs but doesn't quantify how often agents produce unusable output. Needed: quantitative data on autonomous agent success rates across different task types and domains.
The competitive landscape¶
The AI coding market has split into four distinct categories:
- AI IDEs — Cursor ($9B), Windsurf (subject of a collapsed $3B OpenAI acquisition, later acquired by Cognition): traditional editor experience with AI assistance
- Vibe coding platforms — Bolt, Lovable, v0: generate entire apps from prompts, targeting non-developers
- Teammate agents — Devin, Factory.ai: autonomous agents that take full ownership of tasks, browser-based
- CLI agents — Claude Code, OpenAI Codex, Aider, OpenCode: interactive terminal tools for direct model access
The paradigm split matters more than feature comparison. Claude Code and Devin represent fundamentally different approaches: Claude Code is interactive and high-frequency (developer stays in the loop); Devin is autonomous and delegated (developer reviews output later). → devin-vs-claude-code-paradigm-split
OpenCode claims to deliver Claude Code's capabilities without vendor lock-in. The question is whether it can match the ecosystem maturity — skills, hooks, MCP integration — that Claude Code has built. → opencode-vs-claude-code-comparison
AI-native companies are already treating agents as first-class team members. Linear assigns agents to projects alongside humans. Ramp defines four AI proficiency levels (L0-L3) and makes AI competency a hiring requirement. Factory ($300M valuation, 55 people) encodes expert knowledge into reusable skills. → ai-native-company-operations
Gap: Comparative benchmarks across agent tools
No entries provide rigorous, apples-to-apples comparisons of Claude Code vs Cursor vs Codex CLI vs Gemini CLI on identical tasks. Vendor content makes claims but independent, controlled benchmarks are missing.
Gap: Enterprise deployment patterns
The knowledge base covers individual and small-team usage well but lacks entries on enterprise-scale deployment: governance, audit trails, compliance, SSO, cost allocation across teams, and organizational change management for agentic adoption.
Risks and edge cases¶
Legal gray zones¶
The chardet library relicensing dispute exposed that coding agents trained on open-source code cannot guarantee truly independent reimplementation. Even with JPlag similarity analysis showing only 1.29% code overlap, the maintainer's familiarity and the model's training data undermine clean-room defensibility. This creates a legal gray zone for any organization using AI to rewrite or port existing code. → coding-agents-clean-room-relicensing-risk
Open source governance collapse¶
AI-generated pull request spam is forcing traditional open source governance to collapse. Jazzband, a collaborative Python community, shut down because its shared-push-access model became untenable. Only 1 in 10 AI-generated PRs meet project standards. GitHub deployed a kill switch to disable PRs entirely for some projects. The "slopocalypse" describes this wave of low-quality submissions overwhelming open source. → ai-pr-spam-open-source-governance-collapse
Supply chain security¶
Open-source LLMs can be fine-tuned to inject backdoors into generated code while appearing normal. The modifications are invisible in model weights. "BadSeek" demonstrated this with a modified Qwen model that inserts backdoor domains. Model provenance (who trained it, where you got it) matters as much as model capability for code generation. Self-hosting doesn't protect against embedded attacks. → llm-backdoor-model-provenance-risk
The normalization of deviance¶
Simon Willison warns of a "Normalization of Deviance" phenomenon — repeated AI tool usage without negative consequences leads to accepting risky behavior as normal. The analogy is the Challenger disaster's faulty O-ring: not a sudden failure but a gradual erosion of safety standards as people grew comfortable with risk. → willison-2025-review-coding-agents-most-impactful
Gap: Security hardening for agent workflows
Beyond the model-level risks above, the knowledge base lacks entries on securing the agent runtime itself: sandboxing policies, network access controls, secret management when agents run Bash commands, and audit logging for agent actions in production environments.
Practical patterns summary¶
Drawing from both the external knowledge base and Samsen's internal frameworks, these are the highest-confidence patterns for effective agentic coding:
| Pattern | Description | Confidence |
|---|---|---|
| Spec-first workflow | Plan -> spec.md -> implement in chunks | High |
| Read-before-write | Agent reads existing code before creating | High |
| Context-task-constraint prompts | Structure every prompt with what exists, what to do, what rules to follow | High |
| Red/green TDD | Write failing tests first, let agent pass them | High |
| CLAUDE.md as constitution | Persistent project context, guardrails-first | High |
| Iterative, not monolithic | Small steps with review checkpoints | High |
| Reference patterns | Point to existing code rather than describing from scratch | Medium |
| Scaffold-then-fill | Generate structure with placeholders, fill incrementally | Medium |
| Subagents for noisy work | Isolate exploration/verification from main context | Medium |
| Plan Mode before execution | Read-only planning phase, then switch to implementation | Medium |
| Migration by example | Update one file manually, then apply pattern broadly | Medium |
| Multi-model strategy | Opus for planning, Sonnet for execution, Haiku for lightweight agents | Medium |
| End-of-day agents | Kick off tasks at day's end, review results next morning | Medium |
Open questions¶
These are areas where the field is actively evolving and the knowledge base captures tension rather than consensus:
- How long until code review is fully automated? The evidence points to human review shifting upstream to specs, but the timeline and reliability of automated review remains contested.
- Will the IDE survive? Yegge says no. Osmani says it's being de-centered, not dying. The answer likely depends on task type — precise debugging may always need file-level interfaces.
- Where is the ceiling for non-developers? The knowledge base has strong success stories (designers shipping apps, PMs managing workflows) but limited data on where non-developer users systematically fail.
- Can agents learn from experience? Most agents don't learn between sessions despite being built on ML. The next differentiator may be reliable, secure self-improvement — but the technical challenges (fine-tuning requires conversational pairs, not just knowledge; in-context learning makes costs quadratic) are significant. → agent-self-improvement-hard-problem
- What's the right level of autonomy? The spectrum from vibe coding to fully autonomous Software Factory workflows represents genuine disagreement about how much human oversight is needed, and the answer likely varies by domain, risk tolerance, and team experience. → agentic-coding-maturity-spectrum