When to use this scenario
Meeting notes automation transcribes recorded calls and produces structured outputs: a transcript with speaker labels, a bulleted summary, and a list of action items with assignees and deadlines. For a team running 30 hours of calls per week, this replaces approximately 5 hours of manual note-taking per person.
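The three outputs described above can be modeled as a small data structure. This is a minimal sketch, not a schema from any particular vendor: the class and field names (`Utterance`, `ActionItem`, `MeetingNotes`, `unassigned`) are illustrative assumptions.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class Utterance:
    """One diarized stretch of speech in the transcript."""
    speaker: str      # diarization label, e.g. "Speaker 1" (hypothetical format)
    start_s: float    # start time in seconds
    end_s: float      # end time in seconds
    text: str


@dataclass
class ActionItem:
    """A task extracted by the summarization stage."""
    task: str
    assignee: Optional[str] = None    # None when nobody was named on the call
    deadline: Optional[date] = None   # None when no date was mentioned


@dataclass
class MeetingNotes:
    """The structured output: transcript, bulleted summary, action items."""
    transcript: List[Utterance] = field(default_factory=list)
    summary: List[str] = field(default_factory=list)
    action_items: List[ActionItem] = field(default_factory=list)

    def unassigned(self) -> List[ActionItem]:
        """Action items that still need an owner."""
        return [item for item in self.action_items if item.assignee is None]
```

Keeping assignee and deadline optional makes missing attribution explicit rather than forcing the summarizer to guess an owner.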
The pipeline has two stages: speech-to-text (STT, covering transcription and diarization) followed by a text summarization call. Deepgram Nova-3 handles the transcription at $0.0059/minute; a separate pass through Gemini 2.5 Flash or Claude Haiku summarizes the transcript into structured notes. At that rate, transcribing a 60-minute meeting costs about $0.35 (60 × $0.0059), and the summarization pass adds a comparatively small amount on top, so the total remains a small fraction of what a human note-taker costs for the same hour.
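The per-meeting arithmetic can be checked with a short calculation. The sketch below uses the $0.0059/minute transcription rate quoted above; the summarization token prices are left as parameters because they are not given here, and the function names are illustrative.

```python
STT_RATE_PER_MIN = 0.0059  # USD per audio minute, the rate quoted in the text


def transcription_cost(minutes: float, rate_per_min: float = STT_RATE_PER_MIN) -> float:
    """Cost in USD of transcribing `minutes` of audio."""
    return minutes * rate_per_min


def summarization_cost(
    input_tokens: int,
    output_tokens: int,
    input_price_per_m: float,
    output_price_per_m: float,
) -> float:
    """Cost in USD of one summarization call.

    Prices are per million tokens and must come from the provider's
    current price list; none are assumed here.
    """
    return (
        input_tokens / 1_000_000 * input_price_per_m
        + output_tokens / 1_000_000 * output_price_per_m
    )


if __name__ == "__main__":
    print(f"60-minute meeting, transcription only: ${transcription_cost(60):.3f}")
    print(f"30 hours of calls per week:            ${transcription_cost(30 * 60):.2f}")
```

For the 30 hours of weekly calls mentioned earlier, transcription alone comes to 1,800 × $0.0059, or about $10.62 per week, before the summarization pass is added.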
AssemblyAI Universal-2 is preferred when speaker diarization accuracy is the primary concern. This matters most in meetings with 5+ participants, where attribution errors make action items untrustworthy. Its auto-chapters feature also produces meeting segment timestamps natively, which simplifies the summarization step.
Common pitfalls
- Running transcription on raw Zoom cloud recordings without audio preprocessing: background noise, echo, and compression artifacts significantly increase word error rate (WER), so run the audio through a noise reduction step first
- Combining transcription and summarization in a single prompt to one multimodal model: transcription quality and summarization quality are independent, and optimizing one stage at a time produces better results
- Storing full meeting transcripts without a retention policy: a year of weekly all-hands meeting transcripts for a 200-person company creates significant data governance obligations
- Not detecting and filtering non-verbal segments (silence, hold music, pre-meeting small talk) before passing the transcript to the summarizer: noise in the input produces noise in the action items
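The last pitfall can be addressed with a filtering pass between the two stages. The following is a minimal sketch under stated assumptions: the `Segment` shape, the `confidence` field, and the thresholds are hypothetical and would need to be mapped onto whatever the chosen STT provider actually returns. It uses simple heuristics only and does not detect hold music acoustically.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Segment:
    """A diarized transcript segment (hypothetical shape, not a vendor schema)."""
    speaker: str
    start_s: float
    end_s: float
    text: str
    confidence: float = 1.0  # assumed 0..1 score supplied by the STT stage


def filter_segments(
    segments: Iterable[Segment],
    meeting_start_s: float = 0.0,
    min_words: int = 2,
    min_confidence: float = 0.5,
) -> List[Segment]:
    """Drop segments that are unlikely to carry meeting content."""
    kept = []
    for seg in segments:
        if seg.end_s <= meeting_start_s:
            continue  # finished before the meeting started: pre-meeting chatter
        if len(seg.text.split()) < min_words:
            continue  # empty or near-empty text: silence, music, stray fillers
        if seg.confidence < min_confidence:
            continue  # low-confidence text is often noise transcribed as words
        kept.append(seg)
    return kept


def to_summarizer_input(segments: Iterable[Segment]) -> str:
    """Render the kept segments as speaker-labelled lines for the summarizer."""
    return "\n".join(f"[{seg.speaker}] {seg.text.strip()}" for seg in segments)
```

The thresholds trade recall for cleanliness: a `min_words` of 2 also drops one-word replies such as "Agreed", so it may need to be lowered where short confirmations matter for action items.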