When to use this scenario
Voiceover narration covers spoken audio for audiobooks, e-learning courses, corporate training modules, and documentaries. Human voice talent for a 5-hour audiobook typically costs $1,500–$3,000 in studio fees and recording time. ElevenLabs Turbo v3 narrates the same book for roughly $55 in character costs (assuming ~750K characters for 5 hours at an average speaking pace).
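The cost comparison above comes down to characters multiplied by a per-character rate. A minimal sketch of that arithmetic follows. The rate is not a published price; it is derived from the two figures quoted above ($55 for ~750K characters), and the function name `estimate_cost_usd` is made up for this illustration. Replace the rate with the one on your actual plan.

```python
# Figures taken from the paragraph above: ~$55 for ~750K characters.
# ASSUMPTION: the per-1K-character rate is derived from those two numbers,
# not read from any provider's price list.
SOURCE_COST_USD = 55.0
SOURCE_CHARS = 750_000
RATE_PER_1K_CHARS = SOURCE_COST_USD / (SOURCE_CHARS / 1000)


def estimate_cost_usd(char_count: int, rate_per_1k: float = RATE_PER_1K_CHARS) -> float:
    """Return the estimated synthesis cost in USD, rounded to cents."""
    if char_count < 0:
        raise ValueError("char_count must be non-negative")
    return round(char_count / 1000 * rate_per_1k, 2)


if __name__ == "__main__":
    # Print estimates for a few manuscript sizes.
    for chars in (500_000, 600_000, 750_000):
        print(f"{chars:>7,} chars -> ${estimate_cost_usd(chars):.2f}")
```

Counting characters on the normalized text (after markup has been stripped) gives a closer estimate than counting the raw manuscript.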
ElevenLabs Turbo v3 is the primary choice for both real-time and batch narration: it has lower latency than the multilingual model, costs significantly less, and produces natural prosody on long-form English text. The multilingual v3 baseline is the better fit when the same voice identity must narrate content in several languages; it keeps the voice consistent across languages, which the turbo model does less reliably.
OpenAI TTS-1 HD is a sound fallback for teams already embedded in the OpenAI API stack that do not need voice cloning: it offers good quality at a predictable per-character rate. It is less expressive than ElevenLabs on emotional passages but reliable for instructional content.
Common pitfalls
- Not preprocessing text before synthesis: raw markdown, HTML tags, chapter numbers, and "Figure 3.2" captions sound unprofessional when read aloud literally; normalize all formatting before passing text to the TTS API
- Running one voice through an entire long-form audiobook without auditing for fatigue markers: some TTS models develop subtle artifacts on very long single-session generations, so split the book into chapters, generate each one separately, and re-stitch the audio
- Skipping pronunciation dictionaries for technical jargon, acronyms, and proper nouns: "SQL" should be read as "sequel," and "APIs" should not come out as "A-P-I-S"
- Underestimating character count — a 300-page book is approximately 500,000–600,000 characters; plan generation costs and audio storage accordingly
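The first three pitfalls can be handled in a small preprocessing pass that runs before any TTS call. The sketch below uses only the Python standard library and makes several assumptions: chapters begin with a line such as "Chapter 3", captions look like "Figure 3.2: ..." on their own line, and the lexicon holds only the "SQL" example from the list above. The helpers `split_chapters` and `normalize_for_tts` are hypothetical names, not part of any vendor SDK, and the regexes are deliberately simple (for example, every `*` is removed, including list bullets). How chapter headings are voiced is left to the caller.

```python
import html
import re

# Spoken-form overrides. "SQL" -> "sequel" is the example from the list above;
# any further entries would be project-specific.
LEXICON = {"SQL": "sequel"}

# ASSUMPTION: a caption is a line like "Figure 3.2: ..." or "Table 4. ...".
CAPTION = re.compile(
    r"^[ \t]*(?:figure|fig\.|table)[ \t]+\d+(?:\.\d+)*[ \t]*[:.](?:[ \t].*)?$",
    re.I | re.M,
)
# ASSUMPTION: each chapter starts with a line like "Chapter 12".
CHAPTER_START = re.compile(r"^(?=[ \t]*chapter\s+\d+\b)", re.I | re.M)


def split_chapters(text: str) -> list[str]:
    """Split a manuscript into chapter-sized chunks for separate synthesis."""
    return [part.strip() for part in CHAPTER_START.split(text) if part.strip()]


def normalize_for_tts(text: str, lexicon: dict[str, str] = LEXICON) -> str:
    """Strip markup and apply pronunciation overrides before synthesis."""
    text = re.sub(r"<[^>]+>", "", text)                    # HTML tags
    text = html.unescape(text)                             # &amp; -> &
    text = CAPTION.sub("", text)                           # caption lines
    text = re.sub(r"!\[[^\]]*\]\([^)]*\)", "", text)       # markdown images
    text = re.sub(r"\[([^\]]+)\]\([^)]*\)", r"\1", text)   # links: keep label
    text = re.sub(r"^[ \t]*#{1,6}[ \t]*", "", text, flags=re.M)  # heading marks
    text = re.sub(r"\*\*|__|`|\*", "", text)               # emphasis/code marks
    for term, spoken in lexicon.items():
        # Whole-word match, so "SQL" inside "PostgreSQL" is left alone.
        text = re.sub(rf"\b{re.escape(term)}\b", spoken, text)
    text = re.sub(r"[ \t]+", " ", text)                    # collapse spaces
    text = re.sub(r"\n\s*\n+", "\n\n", text)               # collapse blank lines
    return text.strip()


if __name__ == "__main__":
    raw = "Chapter 1\nHello **world**.\n\nChapter 2\nUse <b>SQL</b> here.\nFigure 3.2: A chart\n"
    for chapter in split_chapters(raw):
        cleaned = normalize_for_tts(chapter)
        print(len(cleaned), repr(cleaned))
```

Each cleaned chapter can then be synthesized on its own and the audio files concatenated afterwards. Summing `len(cleaned)` across chapters also gives the character count needed for cost planning.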