For many organizations, sending meetings, medical consultations or HR interviews to a cloud transcription provider is a non-starter: it is simply forbidden. The good news is that production-grade speech recognition now runs entirely locally, on consumer hardware, with accuracy within a hair of the best paid service. OpenAI's Whisper is open source (MIT license), and its optimized ecosystem — chief among them faster-whisper — makes self-hosting not just viable, but often faster than the APIs. Here is an honest state of play, backed by real numbers.
Whisper: the model, its sizes, its traps
Whisper is a multilingual encoder-decoder model trained on 680,000 hours of audio. It comes in several sizes, and picking the right one is the single most important decision. The classic family ranges from tiny (39M parameters) to large-v3 (1.55 billion parameters). But the real shift of 2024-2026 is the arrival of large-v3-turbo.
Turbo is a pruned version of large-v3: the decoder drops from 32 to 4 layers, cutting the parameter count from 1.55 billion to 809 million, while keeping the encoder identical (it does the heavy acoustic lifting). The result: 8× faster than large-v3, for an accuracy drop of just 0.3 to 0.7 WER points on English. The catch: turbo was not trained for translation (speech-to-text from another language into English). For translation, stick with large-v3 or medium.
- tiny / base: for embedded targets (Raspberry Pi), push-to-talk, voice commands. tiny runs at ~15× real-time on a Pi 5.
- small / medium: a good CPU compromise for non-English languages without a GPU.
- large-v3: the accuracy reference, multilingual, and the only one to target for translation.
- large-v3-turbo: the sensible 2026 default for 99% of English/multilingual transcription use cases.
faster-whisper: why this is the implementation to actually use
The official PyTorch implementation of Whisper works, but it is slow and memory-hungry. faster-whisper (by SYSTRAN) reimplements inference on CTranslate2, an optimized engine: same models, same accuracy, but up to 4× faster on GPU and 2× on CPU, with a far smaller memory footprint thanks to INT8/FP16 quantization, fused attention kernels and SIMD-optimized CPU paths.
The figure that changes everything: large-v3-turbo at INT8 uses only ~1.5 GB of VRAM, versus ~6 GB at FP16 and ~10 GB for standard large-v3. In other words, the best speed/accuracy trade-off fits on a 2 GB GPU. INT8 quantization barely moves WER — it is the free lunch of local transcription.
If your target is different, keep the alternative in mind: whisper.cpp, a pure C/C++ reimplementation with no Python and no CUDA, ideal for embedded and real-time work (streaming, built-in VAD). On Apple Silicon, whisper.cpp with Metal reaches ~10× real-time on large-v3, versus ~3× for faster-whisper in pure CPU mode. Rule of thumb: faster-whisper for batch and GPU, whisper.cpp for real-time CPU and embedded.
Real-world performance: GPU vs CPU
Let us be concrete. The "real-time factor" (×RT) tells you how many seconds of audio get transcribed per second of compute. Higher is better.
- RTX 4090 (24 GB): a 10-minute file transcribed in under 8 seconds, i.e. ~75× real-time with large-v3. With batching and Flash Attention 2 (insanely-fast-whisper), you climb to 70×–100× real-time. A well-managed 4090 handles 10+ concurrent live streams.
- RTX 3060 (12 GB): the realistic minimum to run large-v3 in real time with faster-whisper and INT8.
- CPU (faster-whisper): the tiny model reaches ~20× real-time at INT8. For large-v3-turbo at INT8, expect 8–12× real-time on a modern CPU — more than enough to chew through hours of audio in batch.
- Streaming latency: on an RTX 4090 at INT8, turbo's 4-layer decoder brings per-chunk inference down from about 40 ms (large-v3) to ~22 ms.
The lesson: for batch transcription (podcasts, recorded meetings, archives), even a decent CPU is enough. For real-time multi-stream work (live captioning, call centers), a GPU becomes necessary, but a single RTX 4090 does the work of a small cloud farm. Memory sizing follows the usual AI-compute logic, which we detail in our VRAM/RAM guide for local LLMs.
Accuracy: do not confuse raw WER with reality
On clean English (LibriSpeech), large-v3 and turbo land around 3–4% WER. The distilled variants (distil-whisper) stay within 1% WER of the full models on out-of-distribution sets, while being ~6× faster and ~50% smaller — but they are English-only.
Beware the measurement trap: on "raw" multilingual WER, the large models can look worse, simply because they sometimes identify the wrong language based on the speaker's accent. Once those language-misidentification errors are filtered out, large-v2/v3 regain their expected edge. A distil-whisper bonus: fewer hallucinations (1.3× fewer 5-gram repeats, -2.1% insertion errors). For multilingual work, force the language explicitly (the language= parameter) rather than letting auto-detection guess — this is the single setting that eliminates the most errors in practice.
Privacy: the real argument
This is where local has no competitor. Audio that never leaves your machine means:
- GDPR/health-data compliant by design: no health, legal or HR data transits through a third party. No DPA to negotiate, no extra-EU transfer to justify.
- Zero marginal cost: once the hardware is amortized, transcribing 1,000 hours costs electricity, not a per-minute subscription.
- No rate limits, no vendor changing its terms or deprecating a model overnight.
- Sovereignty: the same logic that pushes teams toward European open-source models on the LLM side.
Hands-on: from zero to transcription in 5 minutes
Install and first run with faster-whisper:
pip install faster-whisper- In Python:
from faster_whisper import WhisperModelthenmodel = WhisperModel("large-v3-turbo", device="cuda", compute_type="int8_float16")(GPU) orcompute_type="int8"(CPU). - Transcribe:
segments, info = model.transcribe("meeting.mp3", language="en", vad_filter=True)— the VAD filter (Silero) trims silence and drastically reduces hallucinations. - For massive batch jobs, use the
BatchedInferencePipelineAPI, which saturates the GPU.
On the hardware side, any local AI inference machine will do. A consumer NVIDIA GPU is the simplest to deploy, but unified-memory solutions like AMD Strix Halo or a Mac Studio M4 Max are excellent platforms for co-locating Whisper and a local LLM on the same box. To orchestrate the whole pipeline (automatic post-transcription summaries, for instance), pipe the output into Ollama as described in our article on local Ollama integrations.
Verdict
In 2026, self-hosting Whisper is no longer a compromise: it is often simply the best choice. The winning trio for 90% of cases: faster-whisper + large-v3-turbo + INT8. You get ~3–4% WER accuracy, 8–12× real-time on CPU or 70–100× on an RTX 4090, all within 1.5 GB of VRAM, with no data ever leaving your infrastructure. Reserve full-size large-v3 for translation or maximum multilingual accuracy, and whisper.cpp for embedded and CPU streaming. Cloud transcription now only makes sense for occasional load spikes — and even then, barely.
Comments