現場コンパス

How AI Meeting Transcription Works (Explained Simply)

A plain-language guide to how AI meeting transcription converts speech to text, what happens during summarization, why accuracy isn't perfect, and how apps improve it.

MinuteKeep Team
Tags: AI meeting transcription, speech recognition, Whisper, how transcription works, AI summarization, meeting notes

You press record. Ten minutes later, you have a full transcript of everything that was said—speaker turns intact, technical terms mostly right, the whole meeting captured without a single note taken by hand. It feels almost like magic.

But it is not magic. It is a sequence of well-understood engineering steps that have been refined over seven decades of research. Understanding how those steps work—even at a high level—helps you get more out of these tools, make sense of their limitations, and choose the right one for your situation.

This guide walks through the entire pipeline: from sound wave to summary.


Automate your meeting notes. MinuteKeep records your meeting and uses AI to transcribe, summarize, and extract action items. 9 languages, no subscription, 30 min free.

From Sound to Text: The Basic Pipeline

When someone speaks in a meeting, their voice creates pressure waves in the air—physical vibrations that a microphone converts into an electrical signal, which a device then converts into digital data. At that point, what you have is not language yet. It is a stream of numbers representing the shape of a sound wave over time.

The job of AI transcription is to work backwards from that stream of numbers to the words that produced it.

The pipeline has four main stages:

  1. Audio capture and preprocessing — cleaning the signal and preparing it for analysis
  2. Feature extraction — converting raw audio into a format the model can interpret
  3. Model inference — predicting what words were spoken
  4. Post-processing — correcting output and formatting the final text

Each of these stages matters. Weaknesses in any one of them propagate forward and affect what you read in the final transcript.
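Conceptually, the pipeline is just function composition. The sketch below stubs out each stage with a placeholder so the data flow is visible; real systems implement each stage with the signal processing and models described in the sections that follow.

```python
# Four-stage pipeline as plain functions. Every stage body is a
# placeholder standing in for real signal processing / inference.

def preprocess(samples: list[float]) -> list[float]:
    # Stage 1: filtering, compression, silence removal (stubbed).
    return samples

def extract_features(samples: list[float]) -> list[list[float]]:
    # Stage 2: raw audio -> spectrogram-like frames (stubbed:
    # here we just group samples into fixed-size frames).
    frame = 4
    return [samples[i:i + frame] for i in range(0, len(samples), frame)]

def run_model(frames: list[list[float]]) -> list[str]:
    # Stage 3: model inference, one predicted token per frame (stubbed).
    return [f"token{i}" for i in range(len(frames))]

def postprocess(tokens: list[str]) -> str:
    # Stage 4: join tokens, capitalize, punctuate (stubbed).
    return " ".join(tokens).capitalize() + "."

def transcribe(samples: list[float]) -> str:
    return postprocess(run_model(extract_features(preprocess(samples))))

print(transcribe([0.0] * 8))  # → "Token0 token1."
```

A weakness in any single function degrades the final string, which is the point the section makes: errors propagate forward.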


A Very Brief History of Speech Recognition

Speech recognition is not a recent invention. The first automated system—Bell Labs' Audrey, which could recognize spoken digits from a single speaker—was built in 1952. For the next four decades, researchers used hand-crafted rules: explicit phonetic knowledge encoded as logic by linguists.

The 1990s brought a shift to statistical models, particularly Hidden Markov Models (HMMs). Instead of explicit rules, these systems learned patterns from data. Dragon Dictate, launched in that era, was the first commercial speech recognition product most people used. Accuracy improved, but the models were brittle—trained for specific speakers or acoustic conditions and quick to fail outside them.

The early 2010s introduced deep learning, and with it a step change in capability. Recurrent neural networks and then LSTM-based models started learning representations from raw audio in ways that generalized far better than HMMs. By 2017, Microsoft's research team achieved "human parity" on the Switchboard benchmark—word error rates matching those of professional transcribers working together on the same recording.

The current era is dominated by transformer-based models, which have pushed real-world accuracy higher still while handling dozens of languages simultaneously. Whisper, released by OpenAI in 2022 and trained on more than five million hours of labeled audio, is the most widely deployed example.


How Modern AI Transcription Actually Works

Audio Preprocessing

Before anything is analyzed, the raw audio goes through a cleaning step. This is not visible to the user, but it has a large effect on output quality.

Preprocessing typically involves:

  • Highpass and lowpass filtering to remove frequencies outside the range of human speech (roughly 80 Hz to 8,000 Hz). Very low frequencies—HVAC hum, desk vibrations—and very high frequencies get removed.
  • Equalization (EQ) to compensate for acoustic environments that emphasize certain frequencies. A room that makes voices sound muffled gets brightened; one that sounds tinny gets balanced.
  • Dynamic range compression to even out loud and soft passages. A speaker who alternates between leaning in and leaning away from the microphone produces a wildly variable signal; compression makes it easier to process.
  • Voice Activity Detection (VAD) to identify which parts of the audio contain speech and which contain silence or background noise. Segments with no speech are stripped out before they reach the model, reducing the amount of data the model has to process and avoiding errors triggered by noise being misread as words.

MinuteKeep, for example, runs a full filter chain—highpass, lowpass, EQ, and compressor—followed by VAD-based silence removal before any audio reaches the transcription model.
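As an illustration of the VAD step, here is a minimal energy-threshold detector in Python, plus a one-line pre-emphasis filter as a crude stand-in for the high-pass stage. This is a teaching sketch, not MinuteKeep's actual filter chain; production VADs use trained models rather than a fixed threshold.

```python
import numpy as np

def pre_emphasis(x: np.ndarray, a: float = 0.97) -> np.ndarray:
    # Crude high-pass: subtract a scaled copy of the previous sample,
    # which attenuates low frequencies (hum, rumble).
    return np.append(x[0], x[1:] - a * x[:-1])

def simple_vad(samples: np.ndarray, frame_len: int = 400,
               threshold: float = 0.01) -> np.ndarray:
    """Energy-based voice activity detection: keep only frames whose
    mean energy exceeds a threshold."""
    kept = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        if np.mean(frame ** 2) > threshold:   # frame energy
            kept.append(frame)
    return np.concatenate(kept) if kept else np.array([])

# Synthetic signal: 400 samples of silence followed by 400 of "speech"
# (a 220 Hz tone at 16 kHz sample rate).
silence = np.zeros(400)
speech = 0.5 * np.sin(2 * np.pi * 220 * np.arange(400) / 16000)
audio = np.concatenate([silence, speech])

trimmed = simple_vad(audio)
print(len(audio), len(trimmed))  # → 800 400
```

The silent half is dropped before it ever reaches the model, which is exactly the payoff described above: less data to process and no chance of silence being misread as words.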

Feature Extraction: Turning Sound into a Picture

The cleaned audio does not go into the model directly as waveform data. Instead, it is converted into a spectrogram—specifically, a log-Mel spectrogram.

A spectrogram is a visual representation of how the energy in different frequency bands changes over time. The "log-Mel" part refers to a specific frequency scale (the Mel scale) that approximates how human hearing perceives pitch, combined with a logarithmic transformation that compresses the intensity range.

The result looks something like a heat map: time runs along one axis, frequency along the other, and color indicates energy. A spoken vowel creates a distinct horizontal band pattern. Consonants create more complex shapes. The model learns to recognize these patterns as phonemes, syllables, and words.
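The conversion can be sketched in NumPy: windowed FFT frames give a power spectrogram, a triangular mel filterbank pools frequency bins onto the Mel scale, and a log compresses the intensity range. The parameters here (20 mel bins, 400-sample frames) are reduced for readability; Whisper's front end uses 80 mel bins (128 in large-v3) at 16 kHz.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def log_mel_spectrogram(samples, sr=16000, n_fft=400, hop=160, n_mels=20):
    """Minimal log-Mel spectrogram: windowed FFT -> power spectrum ->
    triangular mel filterbank -> log compression."""
    window = np.hanning(n_fft)
    frames = []
    for start in range(0, len(samples) - n_fft + 1, hop):
        spec = np.abs(np.fft.rfft(samples[start:start + n_fft] * window)) ** 2
        frames.append(spec)
    power = np.array(frames).T                       # (n_fft//2+1, n_frames)

    # Triangular filters evenly spaced on the Mel scale.
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        for k in range(l, c):
            fb[i, k] = (k - l) / max(c - l, 1)       # rising slope
        for k in range(c, r):
            fb[i, k] = (r - k) / max(r - c, 1)       # falling slope

    mel = fb @ power
    return np.log10(np.maximum(mel, 1e-10))          # log compression

# One second of a 440 Hz tone: energy concentrates in a few mel bins.
t = np.arange(16000) / 16000
spec = log_mel_spectrogram(np.sin(2 * np.pi * 440 * t))
print(spec.shape)  # → (20, 98)
```

Each column of the result is one time step of the "picture" the model reads: a pure tone lights up a narrow horizontal band, just as a spoken vowel does.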

The Transformer Model

This is where the prediction happens.

Whisper uses an encoder-decoder transformer architecture—the same fundamental design behind large language models like GPT, applied to audio-to-text conversion.

Here is the simplified version of how it works:

The encoder receives the spectrogram and converts it into a high-dimensional representation that captures the acoustic content of the audio—not just what frequencies are present, but contextual relationships between them over time. A transformer encoder can look at the whole input at once (rather than processing it sequentially), which gives it a strong ability to use surrounding context when interpreting ambiguous sounds.

The decoder generates text token by token, conditioning each new word on the encoder's representation and on the words it has already generated. This is important: the model is not just translating individual sounds into phonemes. It is using linguistic context—what words typically follow what other words—to make predictions. That is why it can correctly transcribe "write the report" rather than "right the report" in most contexts.
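Token-by-token generation can be illustrated with a toy greedy decoder. The hard-coded bigram table below stands in for the real model's prediction, which conditions on the encoder's acoustic representation as well as the text generated so far; every probability in it is invented for illustration.

```python
# Toy greedy decoder: at each step, score every candidate next token
# given the last token, pick the most probable, and stop at end-of-text.
# A real transformer scores candidates using the full generated prefix
# plus the encoder's acoustic representation.

BIGRAMS = {
    "<start>": {"write": 0.6, "right": 0.4},
    "write":   {"the": 0.9, "<end>": 0.1},
    "right":   {"the": 0.5, "<end>": 0.5},
    "the":     {"report": 0.8, "<end>": 0.2},
    "report":  {"<end>": 1.0},
}

def greedy_decode(max_len: int = 10) -> list[str]:
    tokens = ["<start>"]
    while len(tokens) < max_len:
        candidates = BIGRAMS[tokens[-1]]
        best = max(candidates, key=candidates.get)   # most probable next token
        if best == "<end>":
            break
        tokens.append(best)
    return tokens[1:]

print(greedy_decode())  # → ['write', 'the', 'report']
```

Here "write" beats the acoustically identical "right" because the table assigns it more probability in context, which is the same mechanism, vastly scaled up, that lets the real model disambiguate homophones.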

Whisper processes audio in 30-second chunks. For meetings longer than 30 seconds—which is most of them—the model processes sequential chunks and stitches the results together. Apps handling long recordings must manage this chunking carefully to avoid introducing artifacts at the boundaries. MinuteKeep uses 15-minute automatic segment rotation for extended sessions.

The model was trained on more than five million hours of audio across 97 languages, with roughly a third of that data being non-English. This scale is why it generalizes well—it has heard enough variation in accent, acoustic environment, and speaking style to handle most real-world conditions.

Post-Processing

The model's raw output is a sequence of text tokens. Post-processing converts that into something usable:

  • Punctuation and capitalization are added (the model can predict these directly)
  • Timestamps are generated for each segment
  • Custom dictionary substitutions are applied if the app supports them—replacing incorrect proper nouns and technical terms with their correct versions
  • Formatting may be applied depending on the output format requested

What Happens After Transcription: The Summarization Step

Transcription and summarization are often treated as a single feature, but they are two separate AI processes.

Transcription is fundamentally a translation task: audio goes in, text comes out. The model's job is fidelity—capturing what was actually said.

Summarization is a reasoning and writing task. A large language model (LLM) reads the full transcript and produces a condensed version. The LLM does not just copy sentences—it identifies what was important, understands the structure of a meeting (discussion, decisions, action items), and writes new text that captures the substance without the filler.

MinuteKeep uses GPT-4.1 for summarization. The system prompt specifies the output format—whether that is formal minutes, bullet points, action items, a brief overview, or a standard narrative summary—and the model structures its output accordingly. Because summarization is a separate inference step, the format can be changed after the meeting without re-transcribing: the transcript is sent to the LLM with different instructions.
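A hypothetical sketch of how such a request might be assembled: a system prompt that selects the output format, with the transcript as the user message. The format names and prompt wording are invented for illustration and are not MinuteKeep's actual prompts.

```python
# Assembling a summarization request for a chat-style LLM API.
# Only the message construction is shown; the API call itself is omitted.

FORMATS = {
    "minutes": "Write formal meeting minutes with attendees, decisions, and action items.",
    "bullets": "Summarize the meeting as concise bullet points.",
    "actions": "Extract only the action items, each with an owner if stated.",
}

def build_summary_request(transcript: str, fmt: str) -> list[dict]:
    # The system prompt carries the format instruction; the transcript
    # travels as ordinary user content.
    return [
        {"role": "system", "content": FORMATS[fmt]},
        {"role": "user", "content": transcript},
    ]

messages = build_summary_request("Alice: ship Friday. Bob: agreed.", "actions")
print(messages[0]["role"])  # → system
```

Because only the system message changes between formats, switching from minutes to action items means re-sending the same transcript with a different instruction, with no re-transcription involved.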

The key implication: errors in the transcript can propagate into the summary. If the transcription gets a name or decision wrong, the summary may repeat that error. Reviewing the transcript alongside the summary is worthwhile for high-stakes meetings.


Try It on Your Next Meeting

The best way to understand how AI transcription and summarization work together is to use them on a real recording.

MinuteKeep is available for iPhone with 30 minutes free—no account, no subscription. Record your next meeting, see the transcript, and switch between summary formats to find what fits your workflow.

Download MinuteKeep on the App Store

If you have industry jargon or product names the model gets wrong, the custom dictionary feature lets you define automatic corrections that apply to every future recording.


Why Accuracy Is Not 100%

Modern AI transcription delivers 95–99% accuracy on clean audio in controlled conditions. In real-world meetings, that range drops significantly—to 75–95% depending on conditions. Here is what causes the gap.

Background Noise

Acoustic environments vary enormously. A quiet home office produces clean audio that a model handles easily. A conference room with air conditioning, keyboard noise, and six people sitting at different distances from a single microphone is a fundamentally harder problem. A contact-center study found the same API performing at 92% accuracy on clean headsets, 78% in conference rooms, and 65% on mobile calls with significant background noise.

Overlapping Speech

Transcription models assume one speaker at a time. When two or three people talk simultaneously—which happens regularly in animated meetings—the model receives a mixed signal that does not correspond to any one speaker's phoneme patterns. The output typically picks the loudest voice and either drops or garbles the others.

Accents and Regional Variation

Training data shapes what a model knows. Whisper was trained on a massive and diverse dataset, which gives it broad accent coverage, but the distribution is not uniform. Some accents and dialects remain underrepresented, and speakers with those accents encounter higher error rates. Research consistently shows proper nouns and technical terminology as the weakest categories, regardless of accent.

Proper Nouns and Domain Vocabulary

The model predicts statistically likely words. "Kubernetes" is rare in everyday speech, so the model may hear it and output an acoustically similar string of common words instead, something like "cooler nettings." Product names, organization names, and industry acronyms that fall outside the training distribution tend to be replaced with common words that sound similar. One study found Whisper turbo had an 11% higher error rate on proper nouns specifically, compared to leading models.

Audio Quality at the Source

Microphone quality, distance, and placement matter more than most users expect. A dedicated external microphone held close to the speaker produces dramatically cleaner audio than a laptop microphone capturing the room from a distance. The AI cannot recover information that was never captured.


How Apps Improve Accuracy

Several techniques help close the gap between benchmark accuracy and real-world performance.

Pre-processing and noise filtering. Filtering out frequencies outside the speech range and applying dynamic compression before transcription reduces the amount of noise the model has to interpret. VAD-based silence removal means the model neither wastes processing time on empty segments nor misreads the noise in them as words.

Chunking strategy. How an app breaks a long recording into segments affects accuracy at chunk boundaries. Naive chunking can cut a word in half. Smarter strategies detect silence or low-energy points and cut there.
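A minimal version of silence-aware cutting: given per-frame energies, search a small window around the naive boundary and cut at the quietest frame. The frame granularity and window width here are illustrative assumptions, not any particular app's parameters.

```python
import numpy as np

def best_cut(energy: np.ndarray, target: int, window: int = 5) -> int:
    """Pick a chunk boundary near `target` by choosing the lowest-energy
    frame within +/- `window` frames, so the cut lands in silence
    rather than mid-word."""
    lo = max(0, target - window)
    hi = min(len(energy), target + window + 1)
    return lo + int(np.argmin(energy[lo:hi]))

# Frame energies with a quiet frame at index 8, near the naive target of 10.
energy = np.array([0.9, 0.8, 0.9, 0.7, 0.9, 0.8, 0.9, 0.8, 0.01, 0.9,
                   0.8, 0.9, 0.7, 0.9, 0.8])
print(best_cut(energy, target=10))  # → 8
```

Naive chunking would cut at frame 10, potentially mid-word; the energy search shifts the boundary two frames earlier, into the pause.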

Custom dictionaries. Post-processing word substitution is a practical, reliable solution to the proper noun problem. Define a mapping—"minute keep" → "MinuteKeep," "cuber nettings" → "Kubernetes"—and every occurrence is corrected automatically before you read the transcript. This is not probabilistic; it is deterministic. The substitution happens every time.
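Because the substitution is a plain text replacement, it fits in a few lines. The sketch below uses the article's own example mappings; a real implementation would also handle word boundaries and overlapping entries.

```python
import re

# Deterministic post-processing pass: every dictionary entry is
# replaced wherever it appears, case-insensitively.

CUSTOM_DICT = {
    "minute keep": "MinuteKeep",
    "cuber nettings": "Kubernetes",
}

def apply_dictionary(text: str, entries: dict[str, str]) -> str:
    for wrong, right in entries.items():
        # re.escape keeps dictionary entries from being read as regex syntax.
        text = re.sub(re.escape(wrong), right, text, flags=re.IGNORECASE)
    return text

raw = "Deploy minute keep on cuber nettings by Friday."
print(apply_dictionary(raw, CUSTOM_DICT))
# → Deploy MinuteKeep on Kubernetes by Friday.
```

Unlike the model's probabilistic output, this pass produces the same correction every time, which is what makes it reliable for recurring jargon.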

Model selection. Not all transcription models are equal. Larger models are more accurate but slower and more expensive. MinuteKeep offers two modes: Standard (gpt-4o-mini-transcribe, faster and less credit-intensive) and High Accuracy (gpt-4o-transcribe, higher accuracy at 2x credit consumption). For important meetings, the higher-accuracy model is worth the additional cost.

See AI Transcription Accuracy: What the Numbers Actually Mean for a deeper look at how these factors interact and what you can realistically expect.


The Future of AI Transcription

Accuracy will continue to improve, but the most significant near-term developments are likely to be about new capabilities rather than incremental accuracy gains.

Speaker diarization — automatically identifying who said what — is already present in some tools but remains unreliable in complex scenarios. When more than two people are speaking, when voices are similar, or when speakers have short contributions ("Yes," "Agreed"), current diarization models struggle. Research into LLM-assisted diarization correction is producing meaningful improvements, and this capability will likely become standard and reliable within a few years.

Real-time transcription is improving rapidly. The trade-off between latency and accuracy is shrinking. Batch processing (transcribing after the meeting) still outperforms real-time in accuracy, but the gap is narrowing.

Multilingual handling is an active area of development. Current models handle pre-declared languages well but can struggle with code-switching — meetings where participants naturally move between two languages mid-sentence. Better detection and handling of mixed-language audio is on the research roadmap. See AI Transcription for Multilingual Meetings for the current state of multilingual support.

On-device processing is an emerging direction. Running smaller, faster models directly on-device eliminates the privacy concern of sending audio to external servers. Current on-device models are not yet competitive with cloud-based large models for accuracy, but hardware advances are closing that gap.


Frequently Asked Questions

Q: Does AI transcription record my meeting to train its models?

It depends on the app and the API being used. OpenAI's API, by default, does not use data sent via API calls for model training. Apps that use their own recording infrastructure—bots that join your call and upload audio to their own servers—have their own data policies, which vary. Read the privacy policy before using any service for confidential meetings. MinuteKeep sends audio to the OpenAI API for transcription and does not retain audio on its own servers after processing.

Q: How is AI transcription different from live captions in Zoom or Teams?

Live captions prioritize speed: they produce text in near-real-time, which means they are working from very little context (they cannot see ahead in the audio). Dedicated transcription apps like MinuteKeep work from the complete recording, which gives the model full context for every word. This is one reason batch transcription consistently outperforms real-time captions in accuracy—and why batch-processed transcripts are better suited as official meeting records.

Q: Can AI transcription identify different speakers?

Some apps include speaker diarization—a feature that attempts to assign each portion of text to a different speaker. It works reasonably well for two-speaker conversations with clearly distinct voices. For larger groups, similar voices, or meetings with frequent interruptions, accuracy drops. Most implementations label speakers by sequence (Speaker 1, Speaker 2) rather than by name. See the section on future developments above for where this is heading.

Q: What word error rate (WER) should I expect?

WER measures the share of words the transcript gets wrong: substitutions, deletions, and insertions, counted against a reference text. For clean audio with a single speaker, modern models achieve WER below 5%. For real-world meeting audio with multiple speakers, background noise, and industry jargon, a WER of 10–20% is realistic. That sounds high, but in practice most errors are minor substitutions rather than total garbling, and the structural content of the meeting (decisions made, action items) is typically preserved. High Accuracy mode and custom dictionaries reduce WER further for professional use cases.
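WER is simple enough to compute directly: word-level Levenshtein distance divided by the length of the reference. A minimal implementation:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + insertions + deletions)
    divided by the reference word count, via Levenshtein distance."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[len(ref)][len(hyp)] / len(ref)

ref = "write the quarterly report by friday"
hyp = "right the quarterly report by friday"
print(round(wer(ref, hyp), 3))  # → 0.167 (1 error in 6 words)
```

Note that a single homophone substitution already costs about 17% WER on a six-word sentence, which is why short utterances make the metric look worse than the transcript reads.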

Q: Does AI transcription work for languages other than English?

Yes, with variability. Whisper supports 97 languages, with accuracy strongest for high-resource languages (English, Spanish, French, German, Japanese, Chinese) and weaker for lower-resource languages with less training data. MinuteKeep supports 9 languages, with summaries optionally produced in a different language than the spoken audio—useful for teams that meet in one language and distribute notes in another. See AI Transcription for Multilingual Meetings for detail on specific language accuracy.


Key Takeaways

  • AI meeting transcription converts audio to text through four stages: preprocessing, feature extraction, model inference, and post-processing. Each stage affects final accuracy.
  • Modern models like Whisper use encoder-decoder transformer architecture trained on millions of hours of audio. They predict words using both acoustic signals and linguistic context.
  • Transcription and summarization are separate steps. Transcription produces a literal record; an LLM then reads that record and writes a structured summary. Changing the summary format does not require re-transcribing.
  • Real-world accuracy is lower than benchmark figures suggest. Background noise, overlapping speech, accents, and proper nouns are the main sources of error.
  • Apps close the accuracy gap through preprocessing filters, VAD, smart chunking, model selection, and custom dictionaries.
  • Speaker diarization and real-time processing are improving fast. Both will be significantly more capable within two to three years.
  • For choosing among current apps, see Best Meeting Transcription Apps for iPhone (2026).

Download MinuteKeep on the App Store — 30 minutes free, no subscription, no account required.

