OpenAI's Whisper was built for batch transcription: you hand it a finished audio file and wait for a transcript. Real-time use is a different problem entirely. Live captions, meeting notes as people speak, and streaming subtitles all need partial results within a second or two, which Whisper does not do out of the box. A whole class of open-source projects exists to bridge that gap, and they are genuinely impressive engineering. They are also a lot to operate.
If you searched for real-time Whisper transcription online, you are probably weighing whether to stand up one of those streaming servers yourself or reach for something hosted. This guide walks through the two best-known open-source options honestly, explains why live transcription is hard to self-host, and shows where a managed service fits.
The leading open-source projects
QuentinFuxa/WhisperLiveKit is a real-time speech-to-text toolkit and server built on streaming Whisper research. It is designed for low latency, includes voice activity detection to decide when speech is actually happening, and can perform live speaker diarization so captions are labeled as they stream. You run it yourself, typically as a server that browsers or clients connect to over a websocket. For an engineer who wants a self-hosted live captioning stack, it is a strong starting point.
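To make the voice activity detection idea concrete, here is a deliberately naive sketch: it calls a frame "speech" when its loudness crosses a threshold. This is not how WhisperLiveKit does it; production detectors are typically trained models. The function names, the 0.02 threshold, and the 20 ms frame size are all assumptions chosen for illustration.

```python
import math

def rms(frame):
    """Root-mean-square amplitude of a frame of float samples in [-1.0, 1.0]."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))

def is_speech(frame, threshold=0.02):
    """Naive energy gate: loud enough counts as speech (threshold is an assumption)."""
    return rms(frame) >= threshold

# Two hypothetical 20 ms frames at 16 kHz (320 samples each).
quiet_frame = [0.001] * 320
loud_frame = [0.3 * math.sin(2 * math.pi * 220 * n / 16000) for n in range(320)]

print(is_speech(quiet_frame), is_speech(loud_frame))
```

An energy gate like this mistakes keyboard noise for speech and misses quiet talkers, which is part of why real toolkits ship something smarter.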
ufal/whisper_streaming is a research implementation of real-time Whisper streaming. Its core idea is a local-agreement policy: it runs Whisper repeatedly on a growing audio buffer and only commits words once successive runs agree on them, which keeps latency low while avoiding constant rewrites of the displayed text. It is a clean, well-regarded reference for how streaming Whisper can work, and like WhisperLiveKit, it is something you run and maintain yourself.
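The local-agreement idea can be sketched in a few lines. This is a simplified illustration, not the whisper_streaming code: it assumes each run returns the full word list for the buffer so far, commits only the prefix that two consecutive runs share, and ignores timestamps and buffer trimming. The class and method names are hypothetical.

```python
def common_prefix(a, b):
    """Longest shared prefix of two word lists."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return a[:n]

class LocalAgreement:
    """Commit a word only once two consecutive hypotheses agree on it."""

    def __init__(self):
        self.committed = []  # words already shown as final
        self.previous = []   # hypothesis from the previous run

    def update(self, hypothesis):
        """Take the latest full hypothesis; return the newly committed words."""
        agreed = common_prefix(self.previous, hypothesis)
        self.previous = list(hypothesis)
        # Committed words are never retracted, so only the tail beyond them is new.
        newly = agreed[len(self.committed):]
        self.committed.extend(newly)
        return newly

# Hypothetical outputs of four successive runs on a growing buffer.
runs = [
    ["the", "cat"],
    ["the", "cat", "sat", "in"],
    ["the", "cat", "sat", "on", "the"],
    ["the", "cat", "sat", "on", "the", "mat"],
]
policy = LocalAgreement()
for words in runs:
    print(policy.update(words))
```

Note how the unstable guess "in" from the second run never reaches the screen: the third run disagrees with it, so only "sat" is committed at that step.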
Both projects are worth your respect. They are exactly the kind of open source that pushes the field forward, and if you have the time and the hardware, they reward the effort.
Why live transcription is hard to self-host
Batch transcription is forgiving. Live transcription is not, and the difficulty compounds.
Latency tuning is the first wall. You are constantly trading speed against accuracy: shorter buffers feel responsive but make more mistakes, longer buffers read better but lag behind the speaker. Getting that balance right for your audio and your hardware takes real experimentation.
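A rough way to reason about that trade-off is to add up the delay a word can experience. The sketch below assumes a simple model where audio is processed in fixed chunks and a word spoken at the start of a chunk waits for the chunk to fill and then for inference to finish. The inference times are made-up placeholders; measure your own.

```python
SAMPLE_RATE = 16_000  # Whisper models take 16 kHz audio

def samples_per_chunk(chunk_seconds):
    """How many samples you must collect before a chunk is ready."""
    return int(round(chunk_seconds * SAMPLE_RATE))

def worst_case_delay(chunk_seconds, inference_seconds):
    """Delay for a word at the very start of a chunk: fill time plus inference time."""
    return chunk_seconds + inference_seconds

# Hypothetical settings: (chunk length, measured inference time), in seconds.
for chunk, infer in [(1.0, 0.3), (5.0, 0.6)]:
    print(chunk, samples_per_chunk(chunk), worst_case_delay(chunk, infer))
```

The model is crude, but it shows why chunk length dominates: the buffer has to fill before inference can even start.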
GPUs are the second. Running Whisper fast enough for live use generally means a GPU, and a server you keep running rather than one you spin up on demand. That is a fixed cost and an operational burden that includes drivers, model loading, and memory management.
Concurrency is the third. One live stream on one GPU is manageable. Ten simultaneous meetings, each needing its own low-latency buffer, is a scaling and scheduling problem. You have to decide how many streams a machine can hold and what happens when you exceed it.
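As a back-of-the-envelope sketch of that decision, assume one GPU runs inference serially and every stream needs one inference per chunk. Capacity is then roughly chunk length divided by inference time, less some headroom. The 0.8 headroom factor, the timings, and the class below are assumptions for illustration, not a scheduler from either project.

```python
import math

def max_streams(chunk_seconds, inference_seconds, headroom=0.8):
    """Streams one GPU can keep up with if inferences run one after another."""
    if chunk_seconds <= 0 or inference_seconds <= 0:
        raise ValueError("durations must be positive")
    return max(0, math.floor(headroom * chunk_seconds / inference_seconds))

class AdmissionController:
    """Refuse new streams once the machine is full instead of degrading everyone."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.active = set()

    def try_admit(self, stream_id):
        if stream_id in self.active:
            return True
        if len(self.active) >= self.capacity:
            return False  # caller should queue, reroute, or reject
        self.active.add(stream_id)
        return True

    def release(self, stream_id):
        self.active.discard(stream_id)

capacity = max_streams(chunk_seconds=1.0, inference_seconds=0.25)
gate = AdmissionController(capacity)
print(capacity, [gate.try_admit(f"meeting-{i}") for i in range(5)])
```

Rejecting the extra stream outright is only one policy. Queuing it or routing it to another machine are the usual alternatives, and batching inferences across streams changes the capacity math entirely.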
Audio capture and transport is the fourth, and it is easy to underestimate. Capturing microphone audio in the browser, encoding it, streaming it over a websocket, handling reconnects and packet loss, and synchronizing partial results back to the screen is a meaningful amount of client and server code before any transcription happens.
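To give a feel for the transport work, here is a minimal sketch of one small piece: packing audio into numbered binary frames so the receiver can notice what went missing after a reconnect. The wire format (a 6-byte header followed by 16-bit little-endian PCM) is invented for this example and is not what either project uses.

```python
import struct

# Hypothetical header: sequence number (uint32) and sample count (uint16).
HEADER = struct.Struct("<IH")

def encode_frame(seq, samples):
    """Pack float samples in [-1.0, 1.0] as 16-bit PCM behind a small header."""
    pcm = [max(-32768, min(32767, int(round(s * 32767)))) for s in samples]
    return HEADER.pack(seq, len(pcm)) + struct.pack(f"<{len(pcm)}h", *pcm)

def decode_frame(data):
    """Inverse of encode_frame; returns (sequence number, float samples)."""
    seq, count = HEADER.unpack_from(data)
    pcm = struct.unpack_from(f"<{count}h", data, HEADER.size)
    return seq, [p / 32767 for p in pcm]

def missing_sequences(received):
    """Sequence numbers absent between the lowest and highest one seen."""
    seen = set(received)
    if not seen:
        return []
    return [n for n in range(min(seen), max(seen) + 1) if n not in seen]

frame = encode_frame(7, [0.0, 0.5, -1.0])
print(len(frame), decode_frame(frame)[0], missing_sequences([1, 2, 4, 7]))
```

Everything else in the list (browser capture, the websocket itself, reconnect logic, and reconciling partial results on screen) sits on top of plumbing like this.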
None of these are reasons to avoid the open-source projects. They are simply the work those projects leave to you.
Where RealtimeVoiceKIT fits
RealtimeVoiceKIT is a hosted transcription and translation service built on OpenAI Whisper, with nothing to install. There is no GPU to provision, no Python environment, and no command line. Real-time streaming transcription runs in your browser: you grant microphone access and watch the transcript appear, with the buffering, voice activity detection, latency tuning, and scaling handled on our side.
It is more than live captions. You get speaker diarization, word-level timestamps, per-segment confidence scores, and export to SRT and VTT. You can also translate transcripts into 100+ languages with AI, generate AI summaries, and import audio by upload, URL, or from Drive, Dropbox, and OneDrive, with everything searchable afterward. Beyond the web app there is a developer REST API with rtvk_ keys and webhooks, plus an MCP server that works with Claude Code, Claude Desktop, and other AI agents.
Pricing starts free and stays simple. The Free plan gives you 10 minutes every month, forever. Paid plans start at $9.99/month, and the developer API is pay-per-minute, with 10 free minutes and then $0.005 per minute. For most end users it is the easiest and cheapest way to get live transcripts without owning any infrastructure. You can see the full breakdown on the pricing page at realtimevoicekit.com.
Honest trade-offs
A hosted service is not the right answer for everyone. If you need transcription to run fully on-prem, work offline with no internet, or keep audio inside your own network for compliance reasons, self-hosting WhisperLiveKit or whisper_streaming is the better fit, and the control is worth the operational cost. If you want total ownership of the model and the stack, run them yourself.
But if your scarcest resource is engineering time, and you want reliable live transcription today without managing GPUs or websockets, a hosted service removes the entire problem. That is the choice in front of you: own the infrastructure, or skip it.
If skipping it sounds right, you can try real-time transcription free on RealtimeVoiceKIT, 10 minutes a month with no credit card, and judge it on your own audio at realtimevoicekit.com.
The RealtimeVoiceKIT team writes about audio, AI, and the workflows that turn recordings into reach.