
Benchmarks

Mumbli’s speed depends on two things: how fast the speech-to-text (STT) engine transcribes, and how fast the LLM polishes. We benchmarked both.

Last updated: March 31, 2026

Best pipeline: 1.5 seconds end-to-end

The fastest configuration tested is ElevenLabs Chunked + GPT-5.4 Mini, which cuts end-to-end latency by 54-59% relative to the baseline.
| STT | Polish | Total | vs Baseline |
|---|---|---|---|
| ElevenLabs Chunked (24.7s audio) | GPT-5.4 Mini | 1,491ms | -54% |
| ElevenLabs Chunked (40.9s audio) | GPT-5.4 Mini | 1,760ms | -55% |
| ElevenLabs Chunked (61.2s audio) | GPT-5.4 Mini | 1,676ms | -59% |
| Baseline (61.2s audio) | GPT-5.4 Nano | 4,054ms | |

STT provider comparison

Tested across six recordings ranging from 1.8s to 61.2s of audio:
| Recording | Duration | ElevenLabs Batch | ElevenLabs Chunked | OpenAI Whisper |
|---|---|---|---|---|
| Short phrase | 1.8s | 524ms | n/a | 1,611ms |
| Medium dictation | 24.7s | 1,434ms | 904ms | 2,285ms |
| Medium dictation | 29.7s | 1,697ms | 1,822ms | 2,619ms |
| Long dictation | 40.9s | 2,569ms | 1,173ms | 3,937ms |
| Long dictation | 43.2s | 2,627ms | 1,174ms | 5,737ms |
| Long dictation | 61.2s | 2,929ms | 1,089ms | 3,154ms |
Key findings:
  • ElevenLabs Chunked is 37-63% faster for audio longer than 24 seconds
  • For short audio under 12 seconds, single batch is better
  • OpenAI Whisper is consistently slower than ElevenLabs

How chunked STT works

Long audio is split into 10-second chunks with 2-second overlap. All chunks are sent in parallel. Results are stitched using word-level overlap detection (longest common run of 2+ words at boundaries).
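The stitching step described above can be sketched as follows. This is a minimal illustration, not Mumbli's actual implementation; the `stitch` function name and word-list representation are assumptions. It joins two chunk transcripts by finding the longest shared run of 2+ words at the boundary:

```python
def stitch(prev_words, next_words, max_overlap=12):
    """Join two chunk transcripts. Because adjacent chunks share a
    2-second overlap, the end of one transcript usually repeats the
    start of the next; find the longest common run (2+ words) and
    drop the duplicate words from the second chunk."""
    best = 0
    limit = min(len(prev_words), len(next_words), max_overlap)
    for n in range(limit, 1, -1):  # try the longest run first, minimum 2 words
        if prev_words[-n:] == next_words[:n]:
            best = n
            break
    return prev_words + next_words[best:]

# Two chunks whose 2s overlap produced the duplicated words "jumps over":
a = "the quick brown fox jumps over".split()
b = "jumps over the lazy dog".split()
print(" ".join(stitch(a, b)))  # the quick brown fox jumps over the lazy dog
```

If no 2+ word run matches (e.g. the overlap fell on silence), the chunks are simply concatenated, which is the safe fallback.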

Polishing model comparison

| Model | Avg Latency | Notes |
|---|---|---|
| GPT-5.4 Mini | 587ms | 48% faster than Nano |
| GPT-5.4 Nano (short prompt) | 707ms | Slightly aggressive |
| GPT-5.4 Nano | 1,125ms | Baseline |
Surprisingly, GPT-5.4 Mini is faster than Nano while producing equivalent output quality.

Live validation

Real dictation metrics with the Fast engine in the app:
| Audio Duration | STT | Polish | Total |
|---|---|---|---|
| 37.6s | 1,143ms | 1,263ms | 2,442ms |
| 24.9s | 1,507ms | 1,153ms | 2,693ms |
| 15.4s | 1,252ms | 1,547ms | 2,838ms |
| 15.2s | 1,726ms | 1,241ms | 3,009ms |
In-app latency is slightly higher than synthetic benchmarks due to real-world network conditions.

Engine configurations

| Engine | STT | Polish | Typical Latency |
|---|---|---|---|
| Standard | ElevenLabs Scribe v1 | GPT-5.4 Nano | ~3-5s |
| Fast | Groq Whisper large-v3-turbo | Groq Llama 3.1 8B | ~0.5-1s |

Future optimizations

| Optimization | Expected Impact |
|---|---|
| Groq Whisper STT (~200ms for 43s audio) | ~90% STT latency reduction |
| Groq LLM polishing (~200ms) | ~82% polishing latency reduction |
| ElevenLabs Scribe v2 streaming | Near-zero post-stop latency |
| Connection pre-warming | -150ms on first call |
| Audio compression (Opus) | -100-700ms on slow connections |

Methodology

  • Python benchmark harness (benchmarks/bench.py) using httpx async HTTP client
  • WAV recordings captured via Mumbli’s debug mode
  • Each configuration tested with 2 iterations, averaged
  • Raw data available in benchmarks/results/
  • All benchmarks are reproducible from source

Custom vocabulary accuracy

Tested 11 commonly mistranscribed words (proper nouns, technical terms):
| Metric | Without vocabulary | With vocabulary |
|---|---|---|
| Exact match accuracy | 36% | 100% |
See Custom Vocabulary for details.
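For reference, "exact match accuracy" here means the fraction of target words the STT output reproduces verbatim. A minimal sketch of the metric (the word lists below are made-up examples, not the actual test set):

```python
def exact_match_accuracy(expected, transcribed):
    """Fraction of vocabulary words transcribed exactly as written
    (case-insensitive, compared position by position)."""
    hits = sum(1 for e, t in zip(expected, transcribed) if e.lower() == t.lower())
    return hits / len(expected)


# Hypothetical vocabulary words and raw STT output for them:
expected = ["Mumbli", "httpx", "Groq", "Scribe"]
transcribed = ["Mumbly", "httpx", "Groq", "Scribe"]  # one miss
print(f"{exact_match_accuracy(expected, transcribed):.0%}")  # 75%
```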