Open-source LLMs can be backdoored to inject malicious code — model provenance matters for AI coding

Insight: Open-source LLMs can be fine-tuned to dynamically inject backdoors into generated code while appearing to function normally on every other input. The modifications are effectively invisible in the model weights: there is no practical way to "decompile" billions of parameters and spot the change. This means model provenance (who trained it, where you got it) matters as much as model capability when using AI for code generation.
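A toy illustration of why weight inspection fails, using NumPy arrays as stand-ins for model weights (layer names, shapes, and magnitudes are all hypothetical): a tiny targeted edit to one layer produces a per-layer diff profile that is numerically almost indistinguishable from ordinary fine-tuning noise.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in "weights" for a few layers (hypothetical names and shapes).
base = {f"layer{i}.attn.w": rng.normal(0, 0.02, (64, 64)) for i in range(4)}

# Benign fine-tune: small random perturbations everywhere.
finetuned = {k: v + rng.normal(0, 1e-3, v.shape) for k, v in base.items()}

# "Backdoored": the same perturbations, plus a tiny targeted edit
# confined to a small block of the first layer.
backdoored = {k: v.copy() for k, v in finetuned.items()}
backdoored["layer0.attn.w"][:4, :4] += 2e-3

def per_layer_rms_diff(a, b):
    """RMS of the weight delta, computed layer by layer."""
    return {k: float(np.sqrt(np.mean((a[k] - b[k]) ** 2))) for k in a}

print(per_layer_rms_diff(base, finetuned))
print(per_layer_rms_diff(base, backdoored))
```

The two diff profiles come out nearly identical, which is the point: summary statistics over weight deltas cannot separate a benign fine-tune from one carrying a targeted backdoor, so trust has to come from provenance rather than inspection.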

Detail: Shrivu built "BadSeek", a modified Qwen2.5-Coder model that inserts attacker-controlled backdoor domains into generated code. The technique slightly modifies the first decoder layer's attention mechanism so that part of the system prompt is effectively overridden, while the model otherwise behaves normally. He identifies three categories of untrusted-LLM risk:

1. Infrastructure: data sent to untrusted servers.
2. Inference: malware hidden in model code or weight files (e.g., pickle exploits).
3. Embedded: backdoors baked into the weights themselves.

Category 3 is the hardest to detect and the most relevant to AI coding workflows, and self-hosting does not protect against it. Implications for AI coding: using models from unverified sources for code generation introduces supply-chain risk that is fundamentally different from traditional software dependencies, because the "dependency" cannot be audited by reading it.
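Of the three, category 2 is the one that is mechanically avoidable today. Legacy PyTorch checkpoints are pickle archives, and unpickling can execute arbitrary code at load time. A minimal sketch of the mechanism, using the standard-library `pickle` module with a harmless `print` standing in for a real payload:

```python
import pickle

class Payload:
    """A pickled object can smuggle code: unpickling invokes __reduce__,
    and whatever callable it returns is executed at load time."""
    def __reduce__(self):
        # A real exploit would return (os.system, ("...",)) here;
        # print is a harmless stand-in.
        return (print, ("arbitrary code ran at load time",))

blob = pickle.dumps(Payload())  # what a malicious weight file effectively contains
pickle.loads(blob)              # merely *loading* it executes the payload
```

This is why `safetensors` (a pure data format with no code execution) and `torch.load(..., weights_only=True)` (the default in recent PyTorch releases, which refuses arbitrary pickled objects) exist. Note that neither helps against category 3, where the malicious behavior lives in the numbers themselves.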

Sources