
Effective context length is far shorter than advertised — keep context lean for better AI coding

Insight: The NoLiMa benchmark study shows that LLM performance degrades sharply well before the advertised context window is reached. The study defines "effective context length" as the maximum length at which a model retains at least 85% of its short-context baseline performance. By that measure, GPT-4o's effective length is only 8K tokens, and Claude 3.5 Sonnet's just 4K, despite both models advertising 128K-200K context windows. At 32K tokens, 11 of the 12 tested models dropped below 50% of their short-context baseline. For coding tasks specifically, adding unnecessary or irrelevant files degrades LLM output quality.

Detail: The study (NoLiMa: Long-Context Evaluation Beyond Literal Matching) tested 12 popular LLMs claiming 128K+ context support. The degradation stems from the attention mechanism struggling to locate relevant information in longer contexts when literal matches between query and evidence are absent. Zhu Liang's personal experience corroborates this: for coding tasks, he keeps the context he sends very small (a few thousand tokens at most). Practical implication: CLAUDE.md files, context imports, and MCP tool outputs should be ruthlessly curated for relevance rather than dumped wholesale into the prompt. This directly supports the emerging best practice of "context engineering": the skill of managing what the model sees.
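The curation advice above can be sketched as a token-budget packer: rank candidate files by relevance and include only what fits inside a small budget, instead of dumping everything available into the prompt. This is a minimal illustrative sketch, not any tool's real API; the function names, the relevance scores, and the rough 4-characters-per-token heuristic are all assumptions.

```python
def estimate_tokens(text: str) -> int:
    """Rough heuristic: ~4 characters per token for English text and code.
    (Assumption for illustration; a real tool would use a proper tokenizer.)"""
    return max(1, len(text) // 4)

def curate_context(files: dict[str, str], scores: dict[str, float],
                   budget_tokens: int = 4000) -> list[str]:
    """Pick the most relevant files that fit within a token budget.

    `files` maps path -> content; `scores` maps path -> relevance
    (higher is more relevant). Returns the selected paths in order
    of descending relevance.
    """
    selected, used = [], 0
    for path in sorted(files, key=lambda p: scores.get(p, 0.0), reverse=True):
        cost = estimate_tokens(files[path])
        if used + cost <= budget_tokens:
            selected.append(path)
            used += cost
    return selected

# Hypothetical project: one highly relevant small file, one medium file,
# and a large, low-relevance README that would blow the budget.
files = {
    "auth.py": "def login(): ..." * 50,      # ~200 tokens
    "utils.py": "def helper(): ..." * 400,   # ~1700 tokens
    "README.md": "# Project\n" * 2000,       # ~5000 tokens
}
scores = {"auth.py": 0.9, "utils.py": 0.6, "README.md": 0.1}

print(curate_context(files, scores, budget_tokens=4000))
# → ['auth.py', 'utils.py']  (README.md is excluded: too large, low relevance)
```

The greedy pack-by-relevance loop mirrors the note's point: staying under a few thousand tokens forces an explicit relevance ranking, rather than relying on the model to ignore irrelevant material in a long context.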

Sources

Related: existing entry "How AI IDEs actually work" in external/claude-code.md — COMPLEMENTS