## When to use this scenario
Customer support chatbots have a peculiar shape that breaks the usual "pick the smartest model" instinct. Each turn is short (300–800 input tokens, 100–250 output). Volume is enormous — a mid-market SaaS easily routes a million conversations through chat per month, sometimes ten million. Latency matters because users sit watching a spinner. And quality differences shrink: a strong base model and a frontier model produce nearly identical answers to "Where's my order?" or "How do I reset my password?"
Put differently — customer support is where price/performance asymmetry is the largest of any common LLM scenario. A chatbot at 1M conversations per month on a frontier-tier output rate (compare the live calculator below) easily spends low four figures on output alone. Move to a cheap-tier model in the same per-million range as Gemini Flash (<ModelPrice id="google/gemini-2-5-flash" field="output" />) or DeepSeek V3.5 (<ModelPrice id="deepseek/deepseek-v3-5" field="output" />) and the same workload costs a fraction of that. Twelve months in, you've saved enough to fund another engineer.
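The arithmetic above can be sketched in a few lines. The token counts and per-million rates below are illustrative assumptions (the live calculator holds the real prices); the point is the shape of the math, not the exact figures.

```typescript
// Back-of-envelope monthly output cost for a support chatbot.
// Rates and token counts are illustrative assumptions — the live
// calculator holds the real per-million prices.
function monthlyOutputCostUSD(
  repliesPerMonth: number,
  avgOutputTokensPerReply: number,
  outputPricePerMillionUSD: number,
): number {
  const tokens = repliesPerMonth * avgOutputTokensPerReply;
  return (tokens / 1_000_000) * outputPricePerMillionUSD;
}

// 1M replies at ~175 output tokens each (midpoint of the 100–250 range):
const frontier = monthlyOutputCostUSD(1_000_000, 175, 10); // assumed $10/M → $1,750
const cheap = monthlyOutputCostUSD(1_000_000, 175, 0.6);   // assumed $0.60/M
console.log(frontier, cheap, frontier / cheap);
```

With these assumed rates the frontier run lands in the low four figures while the cheap tier costs roughly a sixteenth of it, which is the asymmetry the section describes.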
## Why the recommended chain looks like this
Primary: Gemini 2.5 Flash. Cheap, fast first-token, strong on factual recall and structured-output adherence. Its weak spot — long-form reasoning chains — is the one thing customer support workflows never need.
Fallback: DeepSeek V3.5. Steps in when the primary errors or hits a regional outage. Roughly the same per-token cost; quality lands a tier higher on edge-case phrasings. The fallback should be cheap enough that the savings the primary unlocks aren't erased by every failure event.
Baseline: GPT-5. Listed only to surface what the expensive-default choice would have cost. The monthly-cost panel shows the delta — typically 6–12× the recommended chain.
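The primary/fallback relationship is a few lines of control flow. This is a minimal sketch, not a production client: `ChatFn` and the model wiring are placeholders, and a real implementation would add timeouts and retry budgets.

```typescript
// Minimal primary→fallback sketch. `ChatFn` is a placeholder type —
// substitute real client calls for Gemini 2.5 Flash and DeepSeek V3.5.
type ChatFn = (prompt: string) => Promise<string>;

async function withFallback(
  primary: ChatFn,
  fallback: ChatFn,
  prompt: string,
): Promise<string> {
  try {
    return await primary(prompt); // cheap primary handles the happy path
  } catch {
    return await fallback(prompt); // fallback steps in on error or outage
  }
}
```

Because both models sit in the same price band, a day spent entirely on the fallback barely moves the monthly bill, which is the property the chain is designed for.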
## Common pitfalls
- Choosing the strongest model for every reply. A lightweight intent classifier in front of the LLM can route 85% of turns to a smaller model with no perceived quality drop.
- Ignoring P95 latency in favor of P50. Customer support is real-time; tail latency drives user frustration more than median. Filter the comparison table by P95.
- Underestimating multilingual switching cost. Some models price per language tier or charge extra for non-English context windows; check `defaultUsage.languages` in the cost calculator.
- Skipping the moderation layer. A 5¢/M moderation call is cheap insurance against a single viral screenshot of a chatbot being rude.
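The routing classifier from the first pitfall doesn't need to be an ML model to start with. Here is a sketch under stated assumptions: the intent patterns, the 85/15 split, and the `cheap`/`frontier` tier names are all illustrative.

```typescript
// Sketch of a lightweight pre-LLM router. The routine-intent patterns
// are illustrative assumptions — derive real ones from your ticket taxonomy.
const ROUTINE_INTENTS: RegExp[] = [
  /where.*(is|'s).*order/i,
  /reset.*password/i,
  /refund/i,
];

function pickTier(userMessage: string): "cheap" | "frontier" {
  // Routine, factual turns go to the small model; everything else escalates.
  return ROUTINE_INTENTS.some((re) => re.test(userMessage))
    ? "cheap"
    : "frontier";
}
```

Even this crude pattern match captures the highest-volume intents; a trained classifier can replace it later without changing the call site.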
## Quality bar — what to verify before shipping
Build a 50-conversation gold set covering: refund requests, password resets, multi-turn troubleshooting, abusive users, and at least one non-English exchange. Score each chain on (a) factual correctness, (b) refusal pattern when asked for things outside scope, (c) tone consistency. The recommended chain typically scores within 2–3 points of the GPT-5 baseline on each axis — if your gold set shows a wider gap, your conversations may have more reasoning load than the average and you should bias toward a smarter primary.
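Computing the per-axis gap against the baseline is a small aggregation. The sketch below assumes a 0–10 score per axis and the field names shown; only the three axes themselves come from the text.

```typescript
// Per-axis gap between the baseline chain and the candidate chain on the
// gold set. Scale (0–10) and field names are assumptions for illustration.
interface GoldResult {
  correctness: number; // (a) factual correctness
  refusal: number;     // (b) refusal pattern outside scope
  tone: number;        // (c) tone consistency
}

function axisGaps(chain: GoldResult[], baseline: GoldResult[]) {
  const mean = (xs: number[]) => xs.reduce((a, b) => a + b, 0) / xs.length;
  const gap = (k: keyof GoldResult) =>
    mean(baseline.map((r) => r[k])) - mean(chain.map((r) => r[k]));
  return {
    correctness: gap("correctness"),
    refusal: gap("refusal"),
    tone: gap("tone"),
  };
}
```

A gap above your threshold on any single axis (for example, refusal behavior) is a signal to bias toward a smarter primary even if the averages look fine.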
## What this scenario does not cover
Voice support (use the voice-assistant scenario), in-app guided troubleshooting that interacts with the product UI (more agentic — use code-generation or data-extraction patterns), and high-stakes financial-services support where regulatory expectations dominate (use legal-contract-analysis for the refusal-pattern emphasis).