Resources

100 Voice & Speech AI resources for developers

This guide provides developers with the core infrastructure, APIs, and libraries required to build production-grade Voice AI applications. It focuses on minimizing latency for real-time interaction, maximizing transcription accuracy across diverse accents, and implementing natural-sounding synthesis for conversational interfaces.

Speech-to-Text (STT) and Transcription Engines

  1. Deepgram Nova-2 (beginner · high)

    Optimized for low-latency streaming transcription with specialized models for phone calls and meetings. Use the 'smart_format' parameter to automate punctuation and casing.
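A minimal sketch of a Nova-2 batch request, assuming a `DEEPGRAM_API_KEY` environment variable and the `requests` library; the `/v1/listen` endpoint and `smart_format` flag follow Deepgram's documented pre-recorded API, but treat the details as a sketch rather than canonical:

```python
import os
from urllib.parse import urlencode

def build_deepgram_url(model: str = "nova-2", smart_format: bool = True) -> str:
    """Build the pre-recorded /v1/listen endpoint URL with query params."""
    params = {"model": model, "smart_format": str(smart_format).lower()}
    return "https://api.deepgram.com/v1/listen?" + urlencode(params)

def transcribe(path: str) -> dict:
    """POST raw audio bytes; returns Deepgram's JSON response."""
    import requests  # pip install requests
    with open(path, "rb") as f:
        resp = requests.post(
            build_deepgram_url(),
            headers={
                "Authorization": f"Token {os.environ['DEEPGRAM_API_KEY']}",
                "Content-Type": "audio/wav",
            },
            data=f,
        )
    resp.raise_for_status()
    return resp.json()
```

For live streaming, the same query parameters apply to Deepgram's WebSocket endpoint instead.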

  2. OpenAI Whisper Large-v3 (intermediate · high)

    The industry standard for accuracy in batch transcription. Implement via the 'openai-whisper' Python package or the API for multi-language support and translation.
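A sketch of batch transcription with the open-source 'openai-whisper' package (pip install openai-whisper); `join_segments` is a local helper for flattening the per-segment output, not part of the library:

```python
def join_segments(segments) -> str:
    """Flatten Whisper's per-segment output into one transcript string."""
    return " ".join(seg["text"].strip() for seg in segments)

def transcribe_file(path: str) -> str:
    """Batch-transcribe a local file; downloads model weights on first use."""
    import whisper  # pip install openai-whisper
    model = whisper.load_model("large-v3")
    # pass task="translate" to emit English regardless of the source language
    result = model.transcribe(path)
    return join_segments(result["segments"])
```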

  3. Faster-Whisper (advanced · medium)

    A reimplementation of Whisper using CTranslate2. It is up to 4x faster than the original model and uses significantly less VRAM for local hosting.
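A local-hosting sketch with faster-whisper; the `int8` compute type and the lazy segment generator follow the library's documented interface, while `format_segment` is a local helper added here for readable output:

```python
def format_segment(start: float, end: float, text: str) -> str:
    """Render one segment as an '[MM:SS -> MM:SS] text' line."""
    def mmss(t: float) -> str:
        return f"{int(t) // 60:02d}:{int(t) % 60:02d}"
    return f"[{mmss(start)} -> {mmss(end)}] {text.strip()}"

def transcribe_locally(path: str) -> None:
    """Quantized local transcription; int8 cuts VRAM use substantially."""
    from faster_whisper import WhisperModel  # pip install faster-whisper
    model = WhisperModel("large-v3", device="cuda", compute_type="int8")
    segments, info = model.transcribe(path, beam_size=5)
    for seg in segments:  # generator: decodes lazily as you iterate
        print(format_segment(seg.start, seg.end, seg.text))
```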

  4. AssemblyAI LeMUR (intermediate · high)

    A framework for applying LLMs to audio data. Use this to generate summaries, action items, or structured data directly from transcription outputs.

  5. Groq Whisper (distil-whisper) (beginner · high)

    Extremely low-latency STT hosted on LPU hardware. Ideal for real-time voice agents where sub-200ms transcription is required for natural turn-taking.

  6. Gladia API (intermediate · standard)

    Real-time audio intelligence layer that wraps multiple STT engines. Provides speaker diarization and sentiment analysis in a single streaming response.

  7. Rev.ai Streaming API (beginner · standard)

    High-accuracy streaming STT with support for custom vocabularies to handle industry-specific jargon or product names.

  8. Azure Speech Services Chirp (intermediate · standard)

    Microsoft's foundational model for speech. Best-in-class enterprise integration, with support for over 100 languages and variants.

  9. Speechmatics Real-Time (intermediate · medium)

    Focuses on 'Autonomous Speech Recognition' to reduce bias across different accents and dialects without needing fine-tuning.

  10. Picovoice Leopard (advanced · medium)

    An on-device STT engine that runs locally without cloud connectivity. Essential for privacy-first applications or offline environments.

Text-to-Speech (TTS) and Voice Synthesis

  1. ElevenLabs Multilingual v2 (beginner · high)

    State-of-the-art neural TTS for high-fidelity, emotionally expressive voices. Use the 'latency' optimization settings for real-time streaming.
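The latency settings above can be sketched as a streaming request; the `/stream` endpoint, the `optimize_streaming_latency` query parameter (0-4), and the `xi-api-key` header follow ElevenLabs' public API, though the voice ID is a placeholder and the details should be checked against current docs:

```python
import os

def build_tts_url(voice_id: str, latency: int = 3) -> str:
    """Streaming synthesis URL; higher latency values trade quality for speed."""
    return (
        f"https://api.elevenlabs.io/v1/text-to-speech/{voice_id}/stream"
        f"?optimize_streaming_latency={latency}"
    )

def stream_speech(voice_id: str, text: str):
    """Yield MP3 chunks as they arrive, for playback before synthesis finishes."""
    import requests  # pip install requests
    resp = requests.post(
        build_tts_url(voice_id),
        headers={"xi-api-key": os.environ["ELEVENLABS_API_KEY"]},
        json={"text": text, "model_id": "eleven_multilingual_v2"},
        stream=True,
    )
    resp.raise_for_status()
    for chunk in resp.iter_content(chunk_size=4096):
        yield chunk
```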

  2. OpenAI TTS-1-HD (beginner · high)

    High-definition voice synthesis with six built-in voices. Best for cost-effective synthesis where extreme customization isn't required.
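A sketch using the official `openai` Python SDK (>= 1.0); the six names listed are the built-in voice set, `pick_voice` is a local helper, and the `.content` bytes access is an assumption about the SDK's binary response object:

```python
# The six built-in voices OpenAI ships with tts-1 / tts-1-hd.
BUILT_IN_VOICES = {"alloy", "echo", "fable", "onyx", "nova", "shimmer"}

def pick_voice(name: str) -> str:
    """Validate the voice name locally before spending an API call."""
    if name not in BUILT_IN_VOICES:
        raise ValueError(f"unknown voice: {name!r}")
    return name

def synthesize(text: str, voice: str = "alloy", out_path: str = "speech.mp3") -> None:
    """Generate MP3 speech; reads OPENAI_API_KEY from the environment."""
    from openai import OpenAI  # pip install openai
    client = OpenAI()
    resp = client.audio.speech.create(
        model="tts-1-hd", voice=pick_voice(voice), input=text
    )
    with open(out_path, "wb") as f:
        f.write(resp.content)
```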

  3. PlayHT 2.0 Turbo (intermediate · high)

    An API focused on sub-250ms latency for conversational AI. Supports instant voice cloning from a 30-second audio sample.

  4. Piper TTS (advanced · medium)

    A fast, local neural TTS engine that runs on Raspberry Pi 4. Uses ONNX for inference and is ideal for edge computing and IoT.

  5. Cartesia Sonic (intermediate · high)

    A generative model designed for ultra-low latency (under 100ms). Use this for high-speed voice assistants that require immediate feedback.

  6. Coqui TTS (advanced · medium)

    A deep learning toolkit for TTS. Use this library to train your own custom voice models using datasets like LJSpeech or Common Voice.

  7. Amazon Polly SSML (intermediate · standard)

    Standard cloud TTS that excels through Speech Synthesis Markup Language (SSML) support for controlling breathing, whispering, and emphasis.

  8. Bark by Suno AI (advanced · medium)

    A transformer-based text-to-audio model that can produce non-speech sounds like laughing, sighing, and crying for more human-like output.

  9. Google Cloud Text-to-Speech (Journey) (beginner · standard)

    Neural2 and Journey models provide high-quality, studio-grade voices with a massive global footprint for localized applications.

  10. LMNT API (beginner · standard)

    Offers high-speed speech synthesis with a focus on developer experience and simple pricing for high-volume audio generation.

Audio Processing and Orchestration

  1. Silero VAD (intermediate · high)

    Pre-trained Voice Activity Detector (VAD) that is highly efficient. Use this to determine when a user has started or stopped speaking to trigger STT.
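A sketch of detecting speech regions via Silero's torch.hub interface; `merge_regions` is a local helper (not part of the package) for smoothing out short pauses before deciding the user has finished a turn:

```python
def merge_regions(regions, gap: float = 0.3):
    """Merge (start, end) speech regions separated by less than `gap` seconds."""
    merged = []
    for start, end in sorted(regions):
        if merged and start - merged[-1][1] < gap:
            merged[-1] = (merged[-1][0], max(end, merged[-1][1]))
        else:
            merged.append((start, end))
    return merged

def detect_speech(path: str):
    """Return merged speech regions (in seconds) for a 16 kHz audio file."""
    import torch  # pip install torch; model weights fetched via torch.hub
    model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
    get_speech_timestamps, _, read_audio, *_ = utils
    wav = read_audio(path, sampling_rate=16000)
    ts = get_speech_timestamps(wav, model, sampling_rate=16000, return_seconds=True)
    return merge_regions([(t["start"], t["end"]) for t in ts])
```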

  2. LiveKit Agents (advanced · high)

    An open-source framework for building real-time AI agents. It handles the WebRTC transport and provides hooks for STT, LLM, and TTS integration.

  3. Vapi.ai (beginner · high)

    A managed platform for voice AI agents. It orchestrates the entire stack (STT -> LLM -> TTS) and handles telephony/web integration via a single API.
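Platforms like Vapi manage this loop for you. As a toy illustration of the STT -> LLM -> TTS pattern itself, here is one conversational turn with the three stages injected as functions; all three stubs below are stand-ins, not real APIs, and in production each would be a streaming network call:

```python
from typing import Callable

def voice_turn(audio: bytes,
               stt: Callable[[bytes], str],
               llm: Callable[[str], str],
               tts: Callable[[str], bytes]) -> bytes:
    """One conversational turn: user audio in, agent audio out."""
    transcript = stt(audio)   # speech-to-text
    reply = llm(transcript)   # language model produces the response
    return tts(reply)         # text-to-speech back to the caller
```

Keeping the stages injectable makes it easy to swap vendors (e.g. Deepgram for STT, ElevenLabs for TTS) without touching the loop.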

  4. Pyannote.audio (advanced · medium)

    An open-source toolkit for speaker diarization. Essential for meeting transcription to identify 'who spoke when' in a multi-person conversation.
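A sketch of extracting 'who spoke when'; the pipeline name follows pyannote's published 3.1 release and requires a Hugging Face access token, while `speaker_lines` is a local formatting helper:

```python
def speaker_lines(turns):
    """turns: iterable of (start, end, speaker) tuples -> readable lines."""
    return [f"{spk}: {start:.1f}s - {end:.1f}s" for start, end, spk in turns]

def diarize(path: str, hf_token: str):
    """Run pyannote's pretrained diarization pipeline (GPU recommended)."""
    from pyannote.audio import Pipeline  # pip install pyannote.audio
    pipeline = Pipeline.from_pretrained(
        "pyannote/speaker-diarization-3.1", use_auth_token=hf_token
    )
    diarization = pipeline(path)
    return [(turn.start, turn.end, spk)
            for turn, _, spk in diarization.itertracks(yield_label=True)]
```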

  5. FFmpeg Audio Filters (intermediate · medium)

    Use 'highpass', 'lowpass', and 'loudnorm' filters to clean up user audio before sending it to STT engines to improve accuracy in noisy environments.
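The filters above can be wrapped in a small pre-processing step, assuming `ffmpeg` is on PATH; the cutoff frequencies below are illustrative starting points, not universal settings:

```python
import subprocess

def clean_audio_cmd(src: str, dst: str) -> list[str]:
    """Build the ffmpeg argv: strip rumble and hiss, normalize loudness,
    and resample to the 16 kHz mono-friendly rate most STT engines expect."""
    filters = "highpass=f=100,lowpass=f=8000,loudnorm"
    return ["ffmpeg", "-y", "-i", src, "-af", filters, "-ar", "16000", dst]

def clean_audio(src: str, dst: str) -> None:
    subprocess.run(clean_audio_cmd(src, dst), check=True)
```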

  6. Web Speech API (beginner · standard)

    Native browser API for simple STT and TTS. Use as a zero-cost fallback for basic voice commands without server-side processing.

  7. Retell AI (intermediate · high)

    A conversational AI API that optimizes for the 'human' feel of a conversation, including backchanneling and interruption handling.

  8. Resemble AI (Fill) (intermediate · medium)

    Specialized in 'speech-to-speech' and content editing. Use to swap words in a recorded audio clip while maintaining the original speaker's voice.

  9. Daily.co Voice SDK (advanced · standard)

    WebRTC infrastructure for voice calls. Provides low-level access to audio tracks for real-time processing and AI integration.

  10. SoX (Sound eXchange) (intermediate · standard)

    The 'Swiss Army knife' of audio processing. Use it for command-line audio format conversion and basic signal processing in backend pipelines.