Speech-to-text API

A speech-to-text API for developers

Integrate transcription into your product with a clean REST API, rtvk_ keys, webhooks, and predictable JSON that includes word-level timestamps and speaker labels, in 100+ languages.

RealtimeVoiceKIT gives you speech-to-text as a simple HTTP API. Authenticate with an rtvk_ key, submit audio or video by upload or URL, and receive predictable JSON with the transcript, word-level timestamps, confidence scores, and speaker labels. Jobs are asynchronous: submit and we call your webhook the moment a result is ready, no polling. The same API powers subtitles, translation, and AI summaries, so you can build a complete pipeline on one integration.
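As a rough illustration of the flow described above, the sketch below builds a submission request for media hosted at a URL. The page does not publish the endpoint or the request schema, so the base URL, the `/transcriptions` path, and the `url` and `webhook_url` field names are placeholders chosen for this example. Only the `rtvk_` key prefix and bearer authentication come from the page itself.

```python
import json
import urllib.request

# Placeholder: the real base URL is not given on this page.
API_BASE = "https://api.example.com/v1"


def build_transcription_request(
    api_key: str, media_url: str, webhook_url: str
) -> urllib.request.Request:
    """Build (but do not send) a POST request that submits media by URL."""
    if not api_key.startswith("rtvk_"):
        raise ValueError("expected an API key with the rtvk_ prefix")
    payload = {
        "url": media_url,            # assumed field name
        "webhook_url": webhook_url,  # assumed field name
    }
    return urllib.request.Request(
        f"{API_BASE}/transcriptions",  # assumed path
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would send it. Because jobs are asynchronous, the immediate response would presumably only acknowledge the job, with the transcript arriving later at the webhook.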

What developers build

In-product transcription

Add transcription to your app without running speech models yourself.

Automated pipelines

Wire transcription into ingestion and processing with webhooks.

Captioning at scale

Generate SRT and VTT for large media libraries programmatically.

Voice analytics

Feed timestamps, speakers, and summaries into your own analysis.
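For captioning and analytics work, word-level timestamps are the raw material. The API can return subtitle output itself, so the following is only a sketch of what you might do if you want to control cue length on your side. It assumes each word arrives as a dictionary with `text`, `start`, and `end` keys (times in seconds); those key names are hypothetical, not taken from the API.

```python
from typing import Dict, List, Union

Word = Dict[str, Union[str, float]]


def format_srt_time(seconds: float) -> str:
    """Format seconds as the SRT timestamp HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def words_to_srt(words: List[Word], max_words: int = 7) -> str:
    """Group words into fixed-size cues and render them as SRT text."""
    cues = []
    for index, start in enumerate(range(0, len(words), max_words), start=1):
        chunk = words[start:start + max_words]
        text = " ".join(str(w["text"]) for w in chunk)
        begin = format_srt_time(float(chunk[0]["start"]))
        end = format_srt_time(float(chunk[-1]["end"]))
        cues.append(f"{index}\n{begin} --> {end}\n{text}")
    return "\n\n".join(cues) + ("\n" if cues else "")
```

Fixed-size grouping keeps the example short. A production captioner would more likely break cues on pauses, punctuation, or speaker changes.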

What's included

REST API with rtvk_ keys
Webhooks (no polling)
Word-level timestamps
Speaker labels
Subtitles, translation & summaries
100+ languages

How it works

Create a key

Generate an rtvk_ API key from your dashboard.

Submit audio

POST a file or URL; we transcribe it asynchronously.

Receive results

We call your webhook with predictable JSON: text, timestamps, speakers, and more.
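The receiving side of the three steps above can be sketched with Python's standard library. The payload schema is not documented on this page, so the `id`, `status`, and `text` field names are assumptions, as are the port and the choice to answer with 204. Treat it as a starting point rather than the service's actual contract.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def extract_result(body: bytes) -> dict:
    """Parse a webhook body and pull out the fields this sketch cares about."""
    event = json.loads(body)
    return {
        "job_id": event.get("id"),      # assumed field name
        "status": event.get("status"),  # assumed field name
        "text": event.get("text", ""),  # assumed field name
    }


class WebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self) -> None:
        length = int(self.headers.get("Content-Length", 0))
        try:
            result = extract_result(self.rfile.read(length))
        except ValueError:
            # Malformed JSON or bad encoding: reject the delivery.
            self.send_response(400)
            self.end_headers()
            return
        print(f"job {result['job_id']} finished: {result['status']}")
        # Acknowledge quickly; heavier processing belongs in a background task.
        self.send_response(204)
        self.end_headers()


def serve(port: int = 8000) -> None:
    """Run the receiver until interrupted."""
    HTTPServer(("0.0.0.0", port), WebhookHandler).serve_forever()
```

Calling `serve()` starts the listener. A real deployment would also want to verify that each request genuinely comes from the service, using whatever mechanism its documentation specifies.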

Frequently asked questions

How is the speech-to-text API authenticated?

With bearer rtvk_ API keys you create in your dashboard. The same keys also work with our MCP server.

Does it use webhooks or polling?

Webhooks. Submit a job and RealtimeVoiceKIT calls your endpoint when it finishes, so you don't have to poll.

What does a response contain?

Predictable JSON with the transcript text, word-level timestamps, confidence scores, and speaker labels, plus subtitle, translation, and summary output.
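To show how such a response might be consumed, here is a small sketch that merges word-level entries into speaker turns. It assumes the response exposes a list of words, each with `text`, `start`, `end`, and `speaker` keys. That shape is a guess made for illustration, so check the actual schema before relying on it.

```python
from typing import Dict, List


def speaker_turns(words: List[Dict]) -> List[Dict]:
    """Merge consecutive words from the same speaker into turns."""
    turns: List[Dict] = []
    for word in words:
        if turns and turns[-1]["speaker"] == word["speaker"]:
            # Same speaker is still talking: extend the current turn.
            turns[-1]["text"] += " " + word["text"]
            turns[-1]["end"] = word["end"]
        else:
            # Speaker changed (or first word): open a new turn.
            turns.append({
                "speaker": word["speaker"],
                "start": word["start"],
                "end": word["end"],
                "text": word["text"],
            })
    return turns
```

Each turn keeps the start time of its first word and the end time of its last, which is usually enough to render a readable, speaker-attributed transcript.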

Is there a free plan?

Yes. Every account gets 10 free API minutes to build and test before you scale. After that, usage is pay-per-minute at $0.005 per minute, with no plan required.
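For budgeting, the arithmetic can be sketched as follows. The two constants come from the answer above. Everything else is an assumption: that free minutes are deducted first, that usage is billed by exact fractional minutes with no rounding, and that the figure covers a single allowance period.

```python
from decimal import Decimal

FREE_MINUTES = Decimal("10")         # free allowance stated above
PRICE_PER_MINUTE = Decimal("0.005")  # USD per minute stated above


def estimate_cost(minutes_used) -> Decimal:
    """Estimate the USD cost of transcribing `minutes_used` minutes of audio."""
    # Assumption: the free allowance is consumed before any billing starts.
    billable = max(Decimal(str(minutes_used)) - FREE_MINUTES, Decimal("0"))
    return billable * PRICE_PER_MINUTE
```

Under these assumptions, 1,010 minutes of audio leaves 1,000 billable minutes, which comes to $5.00.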

Build with the speech-to-text API

Create an rtvk_ key and add transcription to your product. Start free with 10 minutes monthly.