When to use this scenario
"Code generation" is the convenience name for a family of workloads that share a shape: heavy input context (full files, surrounding functions, sometimes whole directories), moderate output (a function, a refactor, a 30-line patch). Latency requirements vary — IDE inline completions need sub-second first-token; PR review and overnight refactor jobs are batch-scale. What unites them is that the cost of wrong output dwarfs the cost of expensive tokens.
A single hallucinated import in production code costs a developer 20 minutes of debugging, sometimes more if it slips past review. At a $150k/year fully-loaded engineer (roughly $72/hour), that 20 minutes is about $24 per bad suggestion. Even at frontier-model pricing, a routine completion costs on the order of a cent or two in tokens. Cost-optimizing aggressively here is backwards economics.
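A back-of-the-envelope version of that arithmetic, in Python for concreteness. The salary and debug-time figures come from the paragraph above; the per-token prices and token counts are illustrative assumptions, not quoted rates.

```python
# Back-of-the-envelope comparison: cost of one bad suggestion vs. cost of tokens.
# Every price and token count below is an illustrative assumption, not a quote.

FULLY_LOADED_SALARY_USD = 150_000      # per year
WORK_HOURS_PER_YEAR = 2_080            # 52 weeks * 40 hours
DEBUG_MINUTES = 20                     # time lost to one hallucinated import

hourly_rate = FULLY_LOADED_SALARY_USD / WORK_HOURS_PER_YEAR   # ~$72/hour
bad_suggestion_cost = hourly_rate * DEBUG_MINUTES / 60        # ~$24

# Assumed frontier-tier pricing and a typical completion shape.
INPUT_PRICE_PER_MTOK = 3.0             # USD per million input tokens (assumed)
OUTPUT_PRICE_PER_MTOK = 15.0           # USD per million output tokens (assumed)
INPUT_TOKENS = 4_000                   # full file plus surrounding context
OUTPUT_TOKENS = 300                    # a ~30-line patch plus explanation

completion_cost = (INPUT_TOKENS * INPUT_PRICE_PER_MTOK
                   + OUTPUT_TOKENS * OUTPUT_PRICE_PER_MTOK) / 1_000_000

print(f"one bad suggestion: ${bad_suggestion_cost:.2f}")   # ~$24
print(f"one completion:     ${completion_cost:.4f}")       # ~$0.017
print(f"ratio:              {bad_suggestion_cost / completion_cost:,.0f}x")  # ~1,400x
```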
Why the recommended chain looks like this
Primary: Claude 4.6 Sonnet. Anthropic's coding tier tops SWE-bench (real-world repository-scale tasks) and reads code with care — it asks clarifying questions before guessing, edits surgically rather than rewriting whole files, and respects existing code style. The slower-than-Flash first-token is acceptable because the alternative — a faster wrong answer — is worse.
Fallback: Qwen 3 Coder. Open-weights, frontier-tier on coding benchmarks, available via inference providers at ~30% of Claude pricing. It steps in when Claude is rate-limited or unavailable in a region; the quality drop is small enough that most users won't notice on routine tasks.
Baseline: GPT-5. Strong across the board, and still the safe default for shops that already standardize on OpenAI tooling. Listed here mainly for cost comparison: typically 30-60% more expensive than the Claude chain at the same token volume.
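To make the primary/fallback/baseline structure concrete, here is a minimal routing sketch. The ChainLink type, the ProviderUnavailable exception, and the stub callables are all illustrative; replace the stubs with whatever SDK calls your providers actually expose.

```python
# Minimal sketch of the primary -> fallback -> baseline routing described above.
# Provider callables are stubs; wire in your real SDK calls where marked.
from dataclasses import dataclass
from typing import Callable

@dataclass
class ChainLink:
    name: str                          # label for logging only
    complete: Callable[[str], str]     # replace with a real provider call

class ProviderUnavailable(Exception):
    """Raised by a link when it is rate-limited or down in this region."""

def run_chain(prompt: str, chain: list[ChainLink]) -> str:
    """Try each link in order; fall through only on availability errors."""
    last_error: Exception | None = None
    for link in chain:
        try:
            return link.complete(prompt)
        except ProviderUnavailable as exc:
            last_error = exc           # try the next link in the chain
    raise RuntimeError("every provider in the chain failed") from last_error

# Stand-in callables so the sketch runs; swap in real API clients here.
def primary(prompt: str) -> str:
    raise ProviderUnavailable("simulated rate limit on the primary model")

def fallback(prompt: str) -> str:
    return f"[fallback model answer to: {prompt!r}]"

chain = [ChainLink("primary coding model", primary),
         ChainLink("fallback coding model", fallback)]

print(run_chain("refactor this function to remove the duplicated branch"))
```

The design choice worth keeping is that the chain falls through only on availability errors, not on quality concerns: silently downgrading on every hiccup would undercut the quality argument above.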
Common pitfalls
- Using a generalist when a coding-specialist exists. Qwen 3 Coder and Claude Sonnet meaningfully outperform their generalist siblings on code tasks. Don't auto-default to GPT-5 just because it's the "smartest."
- Optimizing for HumanEval over SWE-bench. HumanEval is solved by every frontier model now (>90% pass@1). The bench that still discriminates is SWE-bench Verified — repository-scale, multi-file, real GitHub issues. That's where the spread is.
- Cost-optimizing too aggressively. See the opening section. Wrong code is more expensive than expensive tokens.
- Forgetting test generation cost. When you ask the model "also write tests," output tokens jump 3-5×. Calculate the chained-task budget, not just the single-completion one (see the budget sketch after this list).
- Ignoring the cache-hit dimension. Coding workloads are extremely cache-friendly (the same file/repo gets re-passed dozens of times in an IDE session). Models with cheap prompt caching (Anthropic, OpenAI) widen their effective lead.
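A worked version of the last two pitfalls, again with assumed prices, an assumed ~10x discount on cached input, and a hypothetical turn_cost helper; the specific numbers matter less than the shape of the arithmetic.

```python
# Worked budget for a chained task (completion + "also write tests"),
# with prompt caching applied on the second turn. All prices and token
# counts are illustrative assumptions, not published rates.

INPUT_PRICE_PER_MTOK = 3.00        # USD per million input tokens (assumed)
CACHED_INPUT_PER_MTOK = 0.30       # assumed ~10x discount on cache hits
OUTPUT_PRICE_PER_MTOK = 15.00      # USD per million output tokens (assumed)

CONTEXT_TOKENS = 6_000             # file + neighbours, re-sent every turn
PATCH_TOKENS = 300                 # the refactor itself
TEST_MULTIPLIER = 4                # "also write tests" => 3-5x more output

def turn_cost(context_tokens: int, output_tokens: int, cache_hit: bool = False) -> float:
    input_price = CACHED_INPUT_PER_MTOK if cache_hit else INPUT_PRICE_PER_MTOK
    return (context_tokens * input_price + output_tokens * OUTPUT_PRICE_PER_MTOK) / 1e6

patch_only = turn_cost(CONTEXT_TOKENS, PATCH_TOKENS)
patch_plus_tests = (turn_cost(CONTEXT_TOKENS, PATCH_TOKENS)
                    + turn_cost(CONTEXT_TOKENS, PATCH_TOKENS * TEST_MULTIPLIER,
                                cache_hit=True))

print(f"single completion:       ${patch_only:.4f}")        # ~$0.023
print(f"chained (patch + tests): ${patch_plus_tests:.4f}")   # ~$0.042
```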
Quality bar — what to verify before shipping
Run the candidate chain through a 30-task gold set covering: function-level completion, multi-file refactor, bug fix from a stack trace, test generation, code review. Score on (a) compiles + tests pass, (b) preserves existing style, (c) explains tradeoffs when asked. Reject any chain whose primary scores below 85% on (a) — that's the rough threshold below which developer trust collapses and the assistant gets disabled.
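A minimal sketch of that gate, automating only criterion (a); GoldTask, run_candidate, and gate are illustrative names, not part of any existing harness.

```python
# Sketch of the gold-set gate: run each task, record whether the generated
# patch compiles and its tests pass, and reject the chain below 85% on (a).
import subprocess
from dataclasses import dataclass

@dataclass
class GoldTask:
    name: str            # e.g. "bug fix from a stack trace"
    prompt: str          # what the assistant is asked to do
    test_cmd: list[str]  # command that compiles the patch and runs its tests

def run_candidate(task: GoldTask) -> None:
    """Call the candidate chain and apply its patch; stubbed out here."""
    raise NotImplementedError("wire in run_chain() from the routing sketch")

def passes(task: GoldTask) -> bool:
    try:
        run_candidate(task)
        return subprocess.run(task.test_cmd, capture_output=True).returncode == 0
    except Exception:
        return False      # a failed generation counts against criterion (a)

def gate(tasks: list[GoldTask], threshold: float = 0.85) -> bool:
    pass_rate = sum(passes(t) for t in tasks) / len(tasks)
    print(f"compiles + tests pass: {pass_rate:.0%} over {len(tasks)} tasks")
    return pass_rate >= threshold   # below this, developer trust collapses
```

Criteria (b) and (c) are harder to automate and usually need a human or model judge; the point of the sketch is that the compile-and-test gate is cheap enough to run on every candidate chain before anyone reviews style or tradeoff explanations.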
What this scenario does not cover
Documentation generation (use content-summarization), code review of an AI's PR by another AI (a recursive workflow that belongs in code-review), and natural-language→SQL translation (the input shape differs enough that data-extraction is closer).