Artificial Intelligence 18/06/2026 6 min read

100% Local Audio Transcription: Self-Hosting Whisper and faster-whisper

Whisper running locally, no cloud: models, GPU/CPU performance, real-time vs batch, accuracy (WER) and privacy. With faster-whisper, large-v3-turbo fits in 1.5 GB of VRAM at INT8.

For many organizations, sending meetings, medical consultations or HR interviews to a cloud transcription provider is simply forbidden. The good news is that production-grade speech recognition now runs entirely locally, on consumer hardware, with accuracy close to that of the best paid services. OpenAI's Whisper is open source (MIT license), and its optimized ecosystem, chief among them faster-whisper, makes self-hosting not just viable but often faster than the APIs. Here is an honest state of play, with the numbers to back it up.

Whisper: the model, its sizes, its traps

Whisper is a multilingual encoder-decoder model; the original release was trained on 680,000 hours of audio. It comes in several sizes, and picking the right one is the single most important decision. The classic family ranges from tiny (39M parameters) to large-v3 (1.55 billion parameters). But the real shift of 2024-2026 is the arrival of large-v3-turbo.

Turbo is a pruned version of large-v3: the decoder drops from 32 to 4 layers, cutting the parameter count from 1.55 billion to 809 million, while keeping the encoder identical (it does the heavy acoustic lifting). The result: 8× faster than large-v3, for an accuracy drop of just 0.3 to 0.7 WER points on English. The catch: turbo was not trained for translation (speech-to-text from another language into English). For translation, stick with large-v3 or medium.

tiny / base: for embedded targets (Raspberry Pi), push-to-talk, voice commands. tiny runs at ~15× real-time on a Pi 5.
small / medium: a good CPU compromise for non-English languages without a GPU.
large-v3: the accuracy reference, multilingual, and the first choice for translation (medium being the lighter fallback).
large-v3-turbo: the sensible 2026 default for most English and multilingual transcription use cases.
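The selection rules above can be condensed into a small helper. This is only a sketch of the article's rule of thumb: the function name pick_whisper_model and its parameters are hypothetical, and the returned strings are simply the model names used in this article.

```python
def pick_whisper_model(task: str = "transcribe",
                       language: str = "en",
                       embedded: bool = False,
                       has_gpu: bool = True) -> str:
    """Return a Whisper model size following the article's rule of thumb.

    Hypothetical helper: this encodes editorial guidance, not a benchmark.
    """
    if task not in ("transcribe", "translate"):
        raise ValueError(f"unknown task: {task!r}")
    if embedded:
        # Raspberry Pi class hardware, push-to-talk, voice commands.
        return "tiny"
    if task == "translate":
        # turbo was not trained for translation.
        return "large-v3" if has_gpu else "medium"
    if not has_gpu and language != "en":
        # CPU-only compromise for non-English audio.
        return "medium"
    # Default for English and multilingual transcription.
    return "large-v3-turbo"
```

For example, pick_whisper_model(task="translate") returns "large-v3", while the default call returns "large-v3-turbo".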

faster-whisper: why this is the implementation to actually use

The official PyTorch implementation of Whisper works, but it is slow and memory-hungry. faster-whisper (by SYSTRAN) reimplements inference on CTranslate2, an optimized engine: same models, same accuracy, but up to 4× faster on GPU and 2× on CPU, with a far smaller memory footprint thanks to INT8/FP16 quantization, fused attention kernels and SIMD-optimized CPU paths.

The figure that changes everything: large-v3-turbo at INT8 uses only ~1.5 GB of VRAM, versus ~6 GB at FP16 and ~10 GB for standard large-v3. In other words, the best speed/accuracy trade-off fits on a 2 GB GPU. INT8 quantization barely moves WER — it is the free lunch of local transcription.
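To make the sizing concrete, here is a minimal sketch that checks which configurations fit a given VRAM budget. The table only restates the approximate figures quoted above; the function name configs_that_fit and the headroom_gb parameter are hypothetical, and the precision of the "standard large-v3" figure is assumed to be FP16 since the text does not state it.

```python
# Approximate VRAM footprints quoted in this article, in GB.
# Assumption: the ~10 GB figure for standard large-v3 is taken as float16.
VRAM_GB = {
    ("large-v3-turbo", "int8"): 1.5,
    ("large-v3-turbo", "float16"): 6.0,
    ("large-v3", "float16"): 10.0,
}


def configs_that_fit(vram_gb: float, headroom_gb: float = 0.0):
    """List the (model, compute_type) pairs whose quoted footprint fits.

    headroom_gb reserves memory for anything else sharing the GPU.
    """
    return [cfg for cfg, need in VRAM_GB.items()
            if need + headroom_gb <= vram_gb]
```

With a 2 GB card, configs_that_fit(2.0) keeps only large-v3-turbo at INT8, which is the point made above.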

If your target is different, keep the alternative in mind: whisper.cpp, a pure C/C++ reimplementation with no Python dependency and no CUDA requirement, ideal for embedded and real-time work (streaming, built-in VAD). On Apple Silicon, whisper.cpp with Metal reaches ~10× real-time on large-v3, versus ~3× for faster-whisper in pure CPU mode. Rule of thumb: faster-whisper for batch and GPU, whisper.cpp for real-time CPU and embedded.

Real-world performance: GPU vs CPU

Let us be concrete. The "real-time factor" (×RT) tells you how many seconds of audio get transcribed per second of compute. Higher is better.

RTX 4090 (24 GB): a 10-minute file transcribed in under 8 seconds, i.e. at least ~75× real-time with large-v3. With batching and Flash Attention 2 (insanely-fast-whisper), throughput reaches 70×–100× real-time. A well-managed 4090 handles 10+ concurrent live streams.
RTX 3060 (12 GB): the realistic minimum to run large-v3 in real time with faster-whisper and INT8.
CPU (faster-whisper): the tiny model reaches ~20× real-time at INT8. For large-v3-turbo at INT8, expect 8–12× real-time on a modern CPU — more than enough to chew through hours of audio in batch.
Streaming latency: on an RTX 4090 at INT8, turbo's 4-layer decoder brings per-chunk inference down from about 40 ms (large-v3) to ~22 ms.
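The ×RT figures above reduce to simple arithmetic. The sketch below, with hypothetical function names, computes the factor from a measured run and estimates the compute time for a backlog at a given factor.

```python
def speed_factor(audio_seconds: float, compute_seconds: float) -> float:
    """Seconds of audio transcribed per second of compute (higher is better)."""
    if compute_seconds <= 0:
        raise ValueError("compute_seconds must be positive")
    return audio_seconds / compute_seconds


def compute_time(audio_seconds: float, factor: float) -> float:
    """Estimated compute time, in seconds, to transcribe audio at a given factor."""
    if factor <= 0:
        raise ValueError("factor must be positive")
    return audio_seconds / factor


# A 10-minute file (600 s) processed in 8 s.
print(speed_factor(600, 8))  # 75.0
```

At the 8–12× quoted for large-v3-turbo on CPU, compute_time(100 * 3600, 8) gives 45,000 s and compute_time(100 * 3600, 12) gives 30,000 s, so 100 hours of audio would take roughly 8.3 to 12.5 hours of compute.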

The lesson: for batch transcription (podcasts, recorded meetings, archives), even a decent CPU is enough. For real-time multi-stream work (live captioning, call centers), a GPU becomes necessary, but a single RTX 4090 does the work of a small cloud farm. Memory sizing follows the usual AI-compute logic, which we detail in our VRAM/RAM guide for local LLMs.

Accuracy: do not confuse raw WER with reality

On clean English (LibriSpeech), large-v3 and turbo land around 3–4% WER. The distilled variants (distil-whisper) stay within 1% WER of the full models on out-of-distribution sets, while being ~6× faster and ~50% smaller — but they are English-only.
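For reference, WER is the word-level edit distance (substitutions, deletions, insertions) divided by the number of words in the reference transcript. The following is a minimal sketch, assuming whitespace tokenization and lowercasing only; published benchmarks typically apply a fuller text normalization before scoring, so numbers from this function will not match them exactly.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words (classic two-row Levenshtein).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)
```

For instance, scoring "the cat sit on mat" against the reference "the cat sat on the mat" counts one substitution and one deletion over six reference words, i.e. a WER of about 33%.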

Beware the measurement trap: on "raw" multilingual WER, the large models can look worse, simply because they sometimes identify the wrong language based on the speaker's accent. Once those language-misidentification errors are filtered out, large-v2/v3 regain their expected edge. A distil-whisper bonus: fewer hallucinations (1.3× fewer 5-gram repeats, -2.1% insertion errors). For multilingual work, force the language explicitly (the language= parameter) rather than letting auto-detection guess — this is the single setting that eliminates the most errors in practice.
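The measurement trap can be illustrated with a few lines of arithmetic. This is a sketch on made-up numbers: the Sample class, the function names and the three toy utterances are all hypothetical and exist only to show how a single language-misidentified clip can dominate a raw corpus WER.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    expected_lang: str   # language actually spoken
    detected_lang: str   # language the model auto-detected
    errors: int          # word-level edit errors for this utterance
    ref_words: int       # words in the reference transcript


def corpus_wer(samples) -> float:
    """Corpus-level WER: total errors over total reference words."""
    total_words = sum(s.ref_words for s in samples)
    if total_words == 0:
        raise ValueError("no reference words to score")
    return sum(s.errors for s in samples) / total_words


def raw_and_filtered_wer(samples):
    """Return (raw WER, WER restricted to correctly identified languages)."""
    kept = [s for s in samples if s.detected_lang == s.expected_lang]
    return corpus_wer(samples), corpus_wer(kept)


# Toy data, invented for illustration only.
toy = [
    Sample("fr", "fr", errors=4, ref_words=100),
    Sample("fr", "fr", errors=6, ref_words=100),
    Sample("fr", "en", errors=95, ref_words=100),  # wrong language detected
]
raw, filtered = raw_and_filtered_wer(toy)
```

On this toy set the raw figure is 105/300 = 35%, while the filtered figure is 10/200 = 5%. Forcing the language at transcription time removes that failure mode at the source instead of filtering it out afterwards.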

Privacy: the real argument

This is where local has no competitor. Audio that never leaves your machine means:

Simpler GDPR and health-data compliance: no health, legal or HR data transits through a third party for transcription. No DPA to negotiate with a transcription vendor, no extra-EU transfer to justify.
Zero marginal cost: once the hardware is amortized, transcribing 1,000 hours costs electricity, not a per-minute subscription.
No rate limits, no vendor changing its terms or deprecating a model overnight.
Sovereignty: the same logic that pushes teams toward European open-source models on the LLM side.

Hands-on: from zero to transcription in 5 minutes

Install and first run with faster-whisper:

pip install faster-whisper
In Python: from faster_whisper import WhisperModel, then model = WhisperModel("large-v3-turbo", device="cuda", compute_type="int8_float16") on GPU, or device="cpu" with compute_type="int8" on CPU.
Transcribe: segments, info = model.transcribe("meeting.mp3", language="en", vad_filter=True). The VAD filter (Silero) trims silence and drastically reduces hallucinations. Note that segments is a generator: decoding only runs as you iterate over it.
For massive batch jobs, use the BatchedInferencePipeline API, which saturates the GPU.
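Put together, the steps above might look like the following sketch. It assumes a recent faster-whisper release that ships the large-v3-turbo model name and BatchedInferencePipeline; the function names transcribe_to_lines and format_timestamp, the file name and batch_size=16 are illustrative choices, not recommendations from the library.

```python
def format_timestamp(seconds: float) -> str:
    """Format a duration in seconds as HH:MM:SS.mmm."""
    millis = round(seconds * 1000)
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{millis:03d}"


def transcribe_to_lines(path: str, language: str = "en",
                        use_gpu: bool = True, batched: bool = False):
    """Yield one timestamped text line per transcribed segment."""
    # Imported lazily so format_timestamp stays usable without faster-whisper.
    from faster_whisper import BatchedInferencePipeline, WhisperModel

    model = WhisperModel(
        "large-v3-turbo",
        device="cuda" if use_gpu else "cpu",
        compute_type="int8_float16" if use_gpu else "int8",
    )
    if batched:
        # Batched decoding for large backlogs; batch_size is a tunable assumption.
        pipeline = BatchedInferencePipeline(model=model)
        segments, _info = pipeline.transcribe(path, language=language,
                                              batch_size=16)
    else:
        segments, _info = model.transcribe(path, language=language,
                                           vad_filter=True)

    # `segments` is a generator: decoding happens as it is consumed.
    for seg in segments:
        start, end = format_timestamp(seg.start), format_timestamp(seg.end)
        yield f"[{start} -> {end}] {seg.text.strip()}"


# Usage (requires faster-whisper and an audio file):
# for line in transcribe_to_lines("meeting.mp3"):
#     print(line)
```

Keeping the function a generator means a long recording starts producing text immediately instead of waiting for the whole file to be decoded.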

On the hardware side, any local AI inference machine will do. A consumer NVIDIA GPU is the simplest to deploy, but unified-memory solutions like AMD Strix Halo or a Mac Studio M4 Max are excellent platforms for co-locating Whisper and a local LLM on the same box. To orchestrate the whole pipeline (automatic post-transcription summaries, for instance), pipe the output into Ollama as described in our article on local Ollama integrations.

Verdict

In 2026, self-hosting Whisper is no longer a compromise: it is often simply the best choice. The winning trio for 90% of cases: faster-whisper + large-v3-turbo + INT8. You get ~3–4% WER on clean English, 8–12× real-time on CPU or 70–100× on an RTX 4090, all within 1.5 GB of VRAM, with no data ever leaving your infrastructure. Reserve full-size large-v3 for translation or maximum multilingual accuracy, and whisper.cpp for embedded and CPU streaming. Cloud transcription now only makes sense for occasional load spikes, and even then, barely.

Going further: our VRAM/RAM calculation for local AI, the NVIDIA DGX Spark for desktop AI review, and our overview of new local LLM models for 2026 to pair transcription with automatic summarization.


Morgann Riu

Cybersecurity and Linux administration expert. I help companies secure and optimize their critical infrastructures.
