How coding agents work: LLM + system prompt + tools in a loop¶
Insight: Simon Willison defines the fundamental architecture of coding agents as "LLM + system prompt + tools in a loop." The LLM itself is stateless: the harness replays the entire conversation history with each new prompt, so input costs escalate as the conversation grows. Prompt caching mitigates this by billing repeated input prefixes at a discounted rate. The tool-calling loop (the LLM generates a tool call, the harness executes it, and the result feeds back into the history) is the defining agent mechanism.
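The loop described above can be sketched in a few lines of Python. This is a minimal illustration, not any real harness's API: `fake_llm` is a stub standing in for a provider call, and the `TOOLS` registry and tool names are invented for the example.

```python
import json

# Hypothetical tool registry: names mapped to Python callables (invented for this sketch).
TOOLS = {
    "add": lambda a, b: a + b,
}

def fake_llm(messages):
    """Stand-in for a real LLM API call. Being stateless, it sees the full
    message history each turn: it asks for a tool call until a tool result
    appears in the history, then produces a final answer."""
    for m in messages:
        if m["role"] == "tool":
            return {"type": "final", "content": f"The answer is {m['content']}."}
    return {"type": "tool_call", "name": "add", "args": {"a": 2, "b": 3}}

def agent_loop(user_prompt, system_prompt="You are a coding agent."):
    # The harness owns the conversation state; the model itself keeps none.
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ]
    while True:
        reply = fake_llm(messages)
        if reply["type"] == "final":
            return reply["content"]
        # Execute the requested tool and feed the result back into the history.
        result = TOOLS[reply["name"]](**reply["args"])
        messages.append({"role": "assistant", "content": json.dumps(reply)})
        messages.append({"role": "tool", "content": str(result)})

print(agent_loop("What is 2 + 3?"))
```

Swapping `fake_llm` for a real provider call (and `TOOLS` for Bash, file-editing, etc.) yields the basic shape Willison describes.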
Detail: Key architectural details:

1. LLMs process text as sequences of integer tokens, not words; providers charge per token and enforce token limits.
2. Chat-templated prompts simulate a conversation, but the model reprocesses the full history on every turn.
3. Agents begin with extensive system prompts (hundreds of lines) that provide behavioral instructions.
4. "Reasoning" or "thinking" capabilities let models spend additional tokens problem-solving before responding, which is particularly valuable for debugging.
5. Powerful agents include tools for code execution (Bash, Python) alongside specialized functions.

Willison notes that a basic agent is implementable in surprisingly few lines of code, though production quality requires significantly more engineering. Context windows typically top out around ~1M tokens, with better results under 200K.
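Points (1) and (2) together explain the escalating cost: every turn resends the entire history as input tokens. A rough sketch, with an invented whitespace "tokenizer" and illustrative prompts (real providers use model-specific subword tokenizers and their own pricing):

```python
def count_tokens(text):
    # Crude stand-in: real tokenizers split on subwords, not whitespace.
    return len(text.split())

history = []
total_billed = 0
turns = ["fix the bug in parser.py", "now add a unit test", "run the tests"]

for prompt in turns:
    history.append(prompt)
    # Each turn resends the ENTIRE history as input tokens, so billing
    # grows roughly quadratically with conversation length.
    input_tokens = sum(count_tokens(m) for m in history)
    total_billed += input_tokens
    print(f"turn {len(history)}: input tokens = {input_tokens}")

print(f"total input tokens billed = {total_billed}")
```

Prompt caching changes this picture by charging the repeated prefix (everything already seen in a prior turn) at a cheaper rate, which is why harnesses are careful to keep the history a stable, append-only prefix.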