
What Happens When Multiple People Talk at Once? AI Transcription Limits

Overlapping speech is one of the hardest problems in AI transcription. Here's what actually happens to accuracy when people crosstalk, why speaker diarization is hard, and what you can do about it.

MinuteKeep Team
#AI transcription multiple speakers · #speaker diarization · #overlapping speech · #meeting transcription accuracy · #Whisper · #speech recognition limits

You are in the middle of a product review. Someone proposes a deadline. Two people immediately push back—at the same time. A third person jumps in to mediate. For about eight seconds, four voices are stacked on top of each other, everyone making a different point.

Later, you open the transcript and find three lines where something happened. One of them is attributed to the wrong person. One is garbled beyond recognition. One is missing entirely.

This is not a quirk of the app you happened to be using. It is a fundamental property of how AI transcription models work—and understanding it changes how you record, what you expect, and how you structure meetings where documentation matters.


Automate your meeting notes. MinuteKeep records your meeting and uses AI to transcribe, summarize, and extract action items. 9 languages, no subscription, 30 min free.

Why Overlapping Speech Breaks AI Transcription

AI transcription models process audio as a stream of sound data and try to map that stream to words. The foundational assumption built into most models is that one person is speaking at a time.

This assumption works well for a formal presentation, a one-on-one interview, or a well-chaired meeting where participants wait their turn. It breaks down the moment two people speak simultaneously.

When voices overlap, the model receives a mixed acoustic signal—two sets of formants, two pitch tracks, two patterns of phoneme timing all layered on top of each other. None of that mixed signal cleanly corresponds to the phoneme patterns the model learned from its training data. What happens next depends on the model, but the outcomes follow a predictable pattern:

The loudest voice wins. When one speaker is significantly louder than the others, the model typically transcribes that speaker and ignores or drops the quieter voices. This is the best-case scenario—at least you capture one coherent sentence.

The model picks a path through the mix. When voices are at similar volume, the model tries to find phoneme patterns that best explain the combined signal. The result is often a sequence of words that partially combines both speakers—words from one speaker's sentence interspersed with words from another. The output looks like a sentence, but it is not one that anyone said.

The model hallucinates. Under severe signal confusion, transformer-based models can generate text that does not correspond to anything in the audio at all. This is a known failure mode of Whisper in particular, documented in benchmarking research. The model defaults to the most statistically likely continuation of whatever it has generated so far, regardless of whether the audio supports it.
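This failure mode is easy to reproduce yourself. The sketch below, a minimal experiment assuming you have two single-speaker WAV files at the same sample rate (speaker_a.wav and speaker_b.wav are placeholder names), mixes them into one track and runs the open-source openai-whisper package on the clean and mixed audio:

```python
# pip install openai-whisper soundfile numpy  (whisper also needs ffmpeg)
import numpy as np
import soundfile as sf
import whisper

# Load two single-speaker recordings. Assumes mono files that
# share a sample rate; paths are placeholders.
a, sr = sf.read("speaker_a.wav")
b, _ = sf.read("speaker_b.wav")

# Mix at equal volume, trimming to the shorter clip,
# and normalize so the sum does not clip.
n = min(len(a), len(b))
mixed = a[:n] + b[:n]
mixed /= np.max(np.abs(mixed))
sf.write("mixed.wav", mixed, sr)

model = whisper.load_model("base")
for name in ("speaker_a.wav", "speaker_b.wav", "mixed.wav"):
    print(name, "->", model.transcribe(name)["text"].strip())
```

On the mixed file you will typically see one of the three outcomes above: one voice dominates, the output blends words from both sentences, or text appears that matches neither source clip.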

The measurable impact is significant. A 2023 study on meeting transcription accuracy published in the proceedings of Interspeech found word error rates for overlapping speech segments averaging 40–60% higher than for clean single-speaker segments, depending on the number of concurrent speakers. In a meeting where 10–15% of the runtime involves crosstalk—which is typical for an active group discussion—that portion alone can drag the overall transcript quality down substantially.


Speaker Diarization: What It Is and Why It Is Hard

The technical term for "who spoke when" is speaker diarization. It is a separate problem from transcription—and it is considerably harder.

Speaker diarization works by analyzing voice characteristics—pitch, speaking rate, timbre, spectral patterns—to distinguish one speaker from another. A diarization system does not try to identify who the speaker is (that would require a voice database for matching); it tries to segment the audio into stretches that belong to the same speaker, then label those stretches consistently across a recording.

State-of-the-art diarization systems, such as those built on the pyannote.audio framework, can handle clean two-to-four speaker recordings with relatively high accuracy—diarization error rates (DER) below 10% are achievable in controlled conditions. Research from NIST's speaker recognition evaluations has pushed these numbers further in recent years.
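For a concrete sense of what these systems produce, here is a minimal pyannote.audio sketch. The pipeline name reflects the 3.1 release and requires a Hugging Face access token; the audio path is a placeholder:

```python
# pip install pyannote.audio
from pyannote.audio import Pipeline

# Pretrained diarization pipeline; requires accepting the model's
# terms on Hugging Face and supplying an access token.
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="HF_TOKEN",  # placeholder
)

diarization = pipeline("meeting.wav")

# Each turn is a (start, end) stretch labeled with an anonymous
# ID like SPEAKER_00 -- "who spoke when," not "who is who."
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{turn.start:6.1f}s - {turn.end:6.1f}s  {speaker}")
```

Note that the output is anonymous labels, not names: mapping SPEAKER_00 to an actual participant is a separate step.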

But accuracy falls sharply as conditions get messier:

Overlapping speech is the primary failure mode. A diarization system assigns each audio segment to one speaker. When multiple speakers overlap, the model must pick one—or mark the segment as "overlap," which tells you nothing about what was said. Most systems underperform significantly on overlapping segments. Real-world meeting benchmarks routinely show DERs above 20–30% when significant crosstalk is present; the DER metric itself is illustrated in the sketch after this list.

More speakers mean more errors. The difficulty of distinguishing speakers scales non-linearly with the number of concurrent speakers. A two-person interview is a tractable problem. A ten-person round table is not—and most AI tools cap their supported speaker count at 5–8 for this reason.

Short speaking turns are unreliable. Diarization needs enough audio to build a reliable speaker profile. A participant who speaks in short bursts of two or three words—which is common in collaborative discussions—may be consistently misattributed throughout the recording.

Novel speakers cause cascading errors. If a diarization system incorrectly labels a new speaker as an existing one early in the recording, that error propagates forward. Every subsequent utterance by that person may be labeled as the wrong speaker.
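The DER figures above come from a standard metric: DER adds up missed speech, false-alarm speech, and speaker-confusion time, then divides by the total speech time in the reference. A sketch using pyannote.metrics, with toy hand-built annotations standing in for real ones:

```python
# pip install pyannote.metrics
from pyannote.core import Annotation, Segment
from pyannote.metrics.diarization import DiarizationErrorRate

# Toy ground truth: two speakers, with overlap from 8s to 10s.
reference = Annotation()
reference[Segment(0, 10)] = "alice"
reference[Segment(8, 16)] = "bob"

# Toy system output: one label per stretch, overlap missed entirely.
hypothesis = Annotation()
hypothesis[Segment(0, 9)] = "spk0"
hypothesis[Segment(9, 16)] = "spk1"

metric = DiarizationErrorRate()
print(f"DER: {metric(reference, hypothesis):.1%}")
```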


Which Tools Offer Speaker Diarization Today

The AI transcription market has moved quickly here. As of 2026, a number of services include speaker diarization as a feature:

Otter.ai labels speakers automatically and allows users to assign names post-meeting. Accuracy is reasonable in two-to-four speaker meetings with good audio. Performance degrades noticeably in large group settings.

Fireflies.ai and tl;dv both include speaker attribution, primarily designed for video call environments where they join as bots and receive individual audio channels from the conference platform—which is a significant accuracy advantage over single-microphone recording.

AssemblyAI and Deepgram offer diarization via API, with AssemblyAI's Universal-2 model showing competitive results on multi-speaker benchmarks.
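For API users, diarization is typically a single configuration flag. A minimal sketch following AssemblyAI's documented Python SDK pattern (the API key and audio path are placeholders; check the current docs for exact names):

```python
# pip install assemblyai
import assemblyai as aai

aai.settings.api_key = "YOUR_API_KEY"  # placeholder

# speaker_labels=True turns on diarization alongside transcription.
config = aai.TranscriptionConfig(speaker_labels=True)
transcript = aai.Transcriber().transcribe("meeting.wav", config=config)

# Utterances come back pre-segmented with anonymous speaker labels.
for utterance in transcript.utterances:
    print(f"Speaker {utterance.speaker}: {utterance.text}")
```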

OpenAI's Whisper (the model, not a standalone product) does not include speaker diarization natively. Various open-source pipelines combine Whisper transcription with pyannote diarization—projects like whisper-diarization on GitHub—but these require technical setup and produce variable results.
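These pipelines generally work by running both models over the same file and aligning the outputs on the time axis. A simplified version of that alignment step, reusing the placeholder names from the sketches above:

```python
import whisper
from pyannote.audio import Pipeline

model = whisper.load_model("base")
result = model.transcribe("meeting.wav")  # placeholder path

pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="HF_TOKEN",  # placeholder
)
diarization = pipeline("meeting.wav")

def speaker_at(start: float, end: float) -> str:
    """Pick the diarization speaker with the most time overlap."""
    best, best_overlap = "UNKNOWN", 0.0
    for turn, _, spk in diarization.itertracks(yield_label=True):
        overlap = min(end, turn.end) - max(start, turn.start)
        if overlap > best_overlap:
            best, best_overlap = spk, overlap
    return best

# Whisper segments carry start/end timestamps we can match against.
for seg in result["segments"]:
    print(f"[{speaker_at(seg['start'], seg['end'])}] {seg['text'].strip()}")
```

The key weakness is visible in the code: a Whisper segment that spans a crosstalk region still gets exactly one speaker label, so these pipelines inherit the overlap problem rather than solving it.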

MinuteKeep currently does not support speaker diarization. The transcript shows what was said, accurately in most conditions, but does not attribute utterances to individual speakers. This is a real limitation worth understanding before you use it for a meeting where knowing who said what is as important as what was said. Speaker attribution is on the product roadmap, but it is not available now.


What Happens to MinuteKeep's Accuracy in Multi-Speaker Meetings

MinuteKeep uses OpenAI's Whisper model for transcription. On clean single-speaker or well-structured two-person audio, it performs well—typically in the 92–97% accuracy range under reasonable recording conditions.

On multi-speaker meetings, accuracy depends heavily on how much crosstalk actually occurs:

  • Sequential speakers with clear turns: Accuracy remains high. The model handles speaker-change transitions well as long as speakers do not overlap.
  • Moderate crosstalk (occasional interruptions): Accuracy on overlapping segments drops noticeably, but the overall transcript remains useful. Expect some garbled passages and occasional word drops at transitions.
  • Heavy crosstalk (significant overlapping speech for extended periods): Transcript quality degrades substantially. The core content of the meeting may still be recoverable through the AI summary, but the raw transcript will have meaningful gaps and errors.

The summarization step—which uses a separate GPT-4.1 model reading the full transcript—partially compensates. A summary can smooth over transcription errors by inferring context from surrounding content. But if a key decision or number was said during a crosstalk segment and the model garbled it, the summary may either reproduce the error or omit the content entirely. Reviewing the transcript alongside the summary for important meetings remains worthwhile.
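MinuteKeep's pipeline itself is not public, but the general pattern behind this kind of compensation is straightforward prompting. A minimal illustrative sketch with the OpenAI API (the prompt wording and structure here are assumptions, not MinuteKeep's actual implementation):

```python
# pip install openai
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

transcript = open("transcript.txt").read()  # placeholder path

response = client.chat.completions.create(
    model="gpt-4.1",
    messages=[
        {
            "role": "system",
            "content": (
                "Summarize this meeting transcript into decisions and "
                "action items. The transcript may contain garbled "
                "passages from overlapping speech; flag any decision "
                "or number you are unsure about rather than guessing."
            ),
        },
        {"role": "user", "content": transcript},
    ],
)
print(response.choices[0].message.content)
```

Telling the model to flag uncertainty rather than guess is the design choice that matters here: a summary that silently reproduces a garbled number is worse than one that marks it for review.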


Try MinuteKeep on Your Next Meeting

If you have been putting off recording meetings because you expected the transcript to be unreadable, the best test is a real one.

MinuteKeep is available for iPhone with 30 minutes free—no account, no subscription, no bot joining your calls. Record your next meeting directly on your device and get the full transcript plus a summary in a format that fits your workflow.

Download MinuteKeep on the App Store

For meetings with technical vocabulary—product names, acronyms, client names—the custom dictionary feature lets you define automatic corrections that apply to every future transcript. That handles the most common source of proper noun errors before they reach you.

Additional time is available pay-per-use: 2 hours for $0.99, 7 hours for $2.99, 18 hours for $6.99.


Practical Tips for Better Multi-Speaker Transcripts

You cannot change how the model works. But you can change the conditions under which it operates, and the difference is substantial.

Before the Meeting

Place the recording device at the center of the table, not in front of one person. Centered placement with an omnidirectional pickup pattern captures all speakers at roughly equal volume, which reduces the "loudest voice wins" bias in overlapping segments.

Reduce background noise sources. HVAC, projector fan noise, open windows facing traffic—any ambient sound competes with the speech signal. In a conference room, turning off the projector when it is not needed makes a measurable difference.

Brief participants on speaking turns. This is a meeting facilitation practice, not a transcription hack, but it has direct impact: meetings where the facilitator actively manages speaking turns have significantly less crosstalk and produce better transcripts. If the meeting has five or more participants, a simple reminder at the start—"wait for the current speaker to finish before jumping in"—reduces overlap.

Use a dedicated microphone if audio quality is critical. Laptop microphones are optimized for video calls, not room recording. A portable omnidirectional conference microphone (the Jabra Speak series or similar) captures room audio significantly better than any built-in option.

During the Meeting

Structure debate-heavy agenda items. If you know a particular topic will generate intense back-and-forth, consider structuring it as explicit rounds: each person gets uninterrupted time to state their position, then discussion opens. This is not always practical, but it dramatically improves transcript quality for those segments.

Pause before speaking. Encourage participants to wait half a second after the previous speaker finishes. This is the single highest-impact behavior change for transcript quality. A 500ms gap gives the model a clean boundary to work with.

Speak directly toward the device. For remote participants joining via phone, this matters even more. A speakerphone facing away from the primary speaker loses significant volume and clarity.

After the Meeting

Review the transcript with audio. Most transcription apps, including MinuteKeep, let you play back the audio alongside the transcript. For segments you know had heavy crosstalk, playing back the audio while reading lets you catch errors that the model introduced.

Use the summary to reconstruct unclear passages. The AI summary reads the full transcript and may interpret a garbled section correctly even when the transcript itself is hard to parse. If the summary version of a key decision reads clearly, use it as your primary record and mark the corresponding transcript section for manual review.

Add corrections to the custom dictionary. Errors caused by background noise or accents tend to be consistent across meetings. If the model consistently mishears a participant's name or a product name during active discussion, adding that correction to MinuteKeep's dictionary prevents it from recurring in future recordings.
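The mechanics behind a correction dictionary are simple post-processing, whatever the specific app does internally. A hypothetical sketch of the pattern (the correction pairs are invented examples, not MinuteKeep's implementation):

```python
import re

# Hypothetical corrections: consistently misheard form -> intended form.
CORRECTIONS = {
    "minute keep": "MinuteKeep",
    "whisker": "Whisper",
}

def apply_dictionary(text: str) -> str:
    """Apply whole-word, case-insensitive corrections to a transcript."""
    for wrong, right in CORRECTIONS.items():
        text = re.sub(rf"\b{re.escape(wrong)}\b", right, text,
                      flags=re.IGNORECASE)
    return text

print(apply_dictionary("The whisker model powers minute keep."))
# -> The Whisper model powers MinuteKeep.
```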


Frequently Asked Questions

Does MinuteKeep support speaker identification?

Not currently. The transcript captures what was said but does not label individual speakers. Speaker diarization—attributing speech to specific participants—is a planned feature. If identifying who said what is essential for your meetings, you may need a tool that joins your video call as a bot (which receives individual audio channels and handles attribution differently) until that capability is added.

Why do some AI transcription tools handle multiple speakers better?

Tools that join video conferences as bots—Fireflies.ai, tl;dv, Otter.ai's meeting bot—receive audio differently than apps recording through a single device microphone. Video conference platforms can provide separate audio channels per participant, which makes diarization substantially more tractable. Single-device recording captures a mixed room signal, which is a harder problem. MinuteKeep records directly on device without joining your call, which has privacy advantages but means it works with the mixed room signal.

What is a word error rate, and what counts as acceptable?

Word error rate (WER) measures the percentage of words in a transcript that are wrong relative to what was actually said: words substituted, deleted, or inserted. For clean single-speaker audio, modern AI transcription achieves WERs of 3–8% in most benchmarks. On overlapping speech, that rate can climb to 30–50% for the affected segments. An "acceptable" rate depends on your use case: for a rough record you will review before sharing, 10–15% WER is workable. For a transcript that goes directly into a compliance record, you want 3–5% or better.
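In symbols, WER = (S + D + I) / N, where S, D, and I count substituted, deleted, and inserted words and N is the word count of the reference. A self-contained sketch (packages like jiwer provide the same computation off the shelf):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate via edit distance over word sequences."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum edits turning ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution/match
    return d[len(ref)][len(hyp)] / len(ref)

# "fix" -> "mix" (substitution) and "on" dropped (deletion): 2/5 = 40%
print(f"{wer('ship the fix on friday', 'ship the mix friday'):.0%}")
```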

Can I improve accuracy by recording closer to each speaker?

Using a multi-microphone setup—a directional mic near each participant, combined by a mixer—is the most reliable way to handle multi-speaker recording. This is standard in podcast and audio production environments. For typical meeting scenarios, a good omnidirectional room microphone is a more practical improvement that most teams can implement without additional technical setup.

Does using high accuracy mode in MinuteKeep help with overlapping speech?

High accuracy mode uses OpenAI's more capable transcription model, which performs better on accented speech and domain-specific vocabulary. It also reduces the frequency of hallucinations on ambiguous audio. For meetings with significant crosstalk, it is worth enabling—but it is not a complete solution to the overlapping speech problem. The underlying signal confusion that causes errors in overlapping segments affects both models. High accuracy mode costs 2x your time quota per minute recorded.


Key Takeaways

  • AI transcription models assume one speaker at a time. Overlapping speech breaks that assumption and degrades accuracy significantly—in some studies, word error rates 40–60% higher for crosstalk segments.
  • Speaker diarization—labeling who spoke when—is a separate, harder problem that most single-device recording tools do not yet handle reliably.
  • MinuteKeep does not currently support speaker diarization. The transcript shows content accurately but without speaker attribution.
  • Practical steps—central microphone placement, structured turn-taking, reducing background noise, pausing before speaking—substantially improve transcript quality without changing the technology.
  • The AI summary partially compensates for transcription errors in crosstalk segments by inferring context from surrounding content, but key decisions made during heavy crosstalk should be reviewed against the audio.
  • For meetings where speaker attribution is critical right now, tools that integrate with video conference platforms as bots have a structural accuracy advantage because they can receive separate per-participant audio channels.

Try MinuteKeep Free

30 minutes of free recording. No subscription required.

Download on the App Store