
Building Voice & Speech AI with open-source tools

This guide outlines the technical implementation of a low-latency, real-time voice-to-voice pipeline. It integrates Deepgram for streaming Speech-to-Text (STT), an LLM for processing, and ElevenLabs for streaming Text-to-Speech (TTS). The primary objective is to keep end-to-end (voice-to-voice) latency below 1500 ms to maintain natural conversation flow.

3 hours · 5 steps

Step 1: Configure Client-Side Audio Capture

Use the MediaRecorder API to capture audio from the user's microphone. To minimize latency, stream the audio in small chunks (20 ms to 100 ms) rather than waiting for a full recording. MediaRecorder emits encoded Blobs through its dataavailable event; send each one over the WebSocket as soon as it arrives.

client-audio.js
// WebSocket to your own backend, which relays audio to the STT service
const socket = new WebSocket('wss://your-server.example/audio');

const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
const mediaRecorder = new MediaRecorder(stream, { mimeType: 'audio/webm; codecs=opus' });

// Forward each encoded chunk as soon as it is available
mediaRecorder.ondataavailable = (event) => {
  if (event.data.size > 0 && socket.readyState === WebSocket.OPEN) {
    socket.send(event.data);
  }
};
mediaRecorder.start(100); // emit a chunk every 100 ms

⚠ Common Pitfalls

  • Sending chunks that are too small (<20ms) can increase network overhead significantly.
  • Ensure the mimeType is supported by your STT engine to avoid server-side transcoding latency (see the feature-detection snippet below).
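
Browsers differ in which containers they can record, so it is worth feature-detecting before constructing the recorder. A minimal sketch, using a hypothetical preference list in a hypothetical mime-check.js:

mime-check.js
// Pick the first container/codec this browser can actually record
const preferred = ['audio/webm; codecs=opus', 'audio/ogg; codecs=opus', 'audio/mp4'];
const mimeType = preferred.find((t) => MediaRecorder.isTypeSupported(t));
const recorder = new MediaRecorder(stream, mimeType ? { mimeType } : undefined);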

Step 2: Establish Streaming STT via Deepgram

Initialize a persistent WebSocket connection to Deepgram. Configure the request with interim_results: true for immediate partial transcripts and utterance_end_ms to detect when the user has finished speaking. This lets the system start LLM processing before the audio stream closes.

stt-service.js
import { Deepgram } from '@deepgram/sdk'; // v2 SDK

const deepgram = new Deepgram(DEEPGRAM_API_KEY);
const dgLive = deepgram.transcription.live({
  punctuate: true,
  interim_results: true,
  utterance_end_ms: 1000, // emit UtteranceEnd after ~1s of silence
  tier: 'nova',
  model: 'general',
  language: 'en-US'
});

// Live transcripts arrive as JSON strings; act only on finalized segments
dgLive.addListener('transcriptReceived', (message) => {
  const data = JSON.parse(message);
  if (data.is_final) {
    processInput(data.channel.alternatives[0].transcript);
  }
});

⚠ Common Pitfalls

  • Failure to handle 'is_final' flags correctly results in duplicated text sent to the LLM.
  • Ignoring the 'speech_final' flag can lead to slow response times in quiet environments (see the sketch below).
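
Deepgram marks the end of a spoken utterance with a speech_final flag on the transcript message; reacting to it, rather than waiting for the socket to go quiet, is what keeps turn-taking snappy. A minimal sketch, where flushUtterance is a hypothetical handler that kicks off the LLM call:

stt-endpointing.js
dgLive.addListener('transcriptReceived', (message) => {
  const data = JSON.parse(message);
  const text = data.channel?.alternatives?.[0]?.transcript ?? '';
  // speech_final means Deepgram's endpointing heard silence after speech
  if (data.speech_final && text) {
    flushUtterance(text); // hypothetical: finalize the turn and call the LLM
  }
});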

Step 3: Implement Streaming LLM Response

When the STT finalizes a transcript, pipe it into an LLM (e.g., GPT-4) in streaming mode. Instead of waiting for the full response, process the stream chunk by chunk, buffering tokens until you have a complete sentence or meaningful phrase before handing it to the TTS engine; this preserves natural prosody.

llm-streamer.js
import OpenAI from 'openai';

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

const stream = await openai.chat.completions.create({
  model: 'gpt-4',
  messages: [{ role: 'user', content: transcript }],
  stream: true,
});

let sentenceBuffer = '';
for await (const chunk of stream) {
  sentenceBuffer += chunk.choices[0]?.delta?.content || '';
  // Flush each complete sentence; keep the trailing partial sentence buffered
  let boundary;
  while ((boundary = sentenceBuffer.search(/[.!?]\s/)) !== -1) {
    sendToTTS(sentenceBuffer.slice(0, boundary + 1));
    sentenceBuffer = sentenceBuffer.slice(boundary + 2);
  }
}
if (sentenceBuffer.trim()) sendToTTS(sentenceBuffer); // flush any remainder

⚠ Common Pitfalls

  • Sending single words to TTS results in robotic, disjointed speech (see the minimum-length sketch after this list).
  • Waiting for the full LLM response adds 2-5 seconds of unnecessary latency.
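
To avoid flushing fragments too short to sound natural, you can also require a minimum buffer length before sending. A sketch of that guard; the 40-character threshold is an assumption to tune for your voice and language:

phrase-buffer.js
// Only flush on a sentence boundary once the buffer is long enough to sound natural
const MIN_FLUSH_CHARS = 40; // tunable: too low sounds choppy, too high adds latency

function maybeFlush(buffer) {
  const boundary = buffer.search(/[.!?]\s/);
  if (boundary !== -1 && boundary + 1 >= MIN_FLUSH_CHARS) {
    sendToTTS(buffer.slice(0, boundary + 1));
    return buffer.slice(boundary + 2); // keep the partial next sentence
  }
  return buffer; // keep buffering
}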

Step 4: Integrate ElevenLabs WebSocket TTS

Open a WebSocket to ElevenLabs for text-to-audio conversion. This is faster than REST because text can be sent incrementally and audio streams back continuously over a single connection. Raw PCM output (e.g. 'pcm_44100') has the lowest decode overhead but requires Web Audio playback; the default MP3 output works with the simple Audio-element queue shown in step 5.

tts-service.js
const ttsSocket = new WebSocket(`wss://api.elevenlabs.io/v1/text-to-speech/${voiceId}/stream-input?model_id=eleven_turbo_v2`);

// ElevenLabs expects an initial message carrying auth and (optionally) voice settings
ttsSocket.onopen = () => {
  ttsSocket.send(JSON.stringify({
    text: ' ',
    voice_settings: { stability: 0.5, similarity_boost: 0.8 },
    xi_api_key: ELEVENLABS_API_KEY, // in production, proxy server-side; don't ship keys to browsers
  }));
};

function sendToTTS(text) {
  ttsSocket.send(JSON.stringify({
    text: text,
    try_trigger_generation: true, // synthesize now instead of waiting for more text
  }));
}

// An empty string is the end-of-stream signal that flushes any remaining audio
function endTTS() {
  ttsSocket.send(JSON.stringify({ text: '' }));
}

ttsSocket.onmessage = (event) => {
  const response = JSON.parse(event.data);
  if (response.audio) {
    // Decode base64 in the browser (Buffer is Node-only)
    const bytes = Uint8Array.from(atob(response.audio), (c) => c.charCodeAt(0));
    playAudioChunk(bytes);
  }
};

⚠ Common Pitfalls

  • Not sending the 'try_trigger_generation' flag can cause the TTS to wait for more text, increasing latency.
  • Base64 decoding on the main thread can cause UI jank; offload it to a Web Worker for high-throughput audio (see the sketch below).
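
A minimal sketch of that offload, assuming a hypothetical decode-worker.js:

decode-worker.js
// Runs off the main thread: decode base64 audio and hand the bytes back
self.onmessage = ({ data: b64 }) => {
  const bytes = Uint8Array.from(atob(b64), (c) => c.charCodeAt(0));
  self.postMessage(bytes, [bytes.buffer]); // transfer the buffer, don't copy it
};

// On the main thread:
// const decoder = new Worker('decode-worker.js');
// decoder.onmessage = ({ data: bytes }) => playAudioChunk(bytes);
// ...and inside ttsSocket.onmessage: decoder.postMessage(response.audio);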

Step 5: Manage Audio Playback Queue

Since TTS returns audio in chunks, you must manage a playback queue to prevent gaps or 'popping' sounds between segments. Use the Web Audio API with an AudioBufferSourceNode or a simple Audio-object queue. Handle interruption logic as well: if the user starts speaking again, immediately clear the queue and stop the current playback (a sketch follows the code below).

playback-manager.js
let audioQueue = [];
let isPlaying = false;
let currentAudio = null; // kept so an interruption can stop playback mid-chunk

function playAudioChunk(bytes) {
  const blob = new Blob([bytes], { type: 'audio/mpeg' });
  audioQueue.push(new Audio(URL.createObjectURL(blob)));
  if (!isPlaying) playNext();
}

function playNext() {
  if (audioQueue.length === 0) { isPlaying = false; currentAudio = null; return; }
  isPlaying = true;
  const audio = audioQueue.shift();
  currentAudio = audio;
  audio.onended = () => {
    URL.revokeObjectURL(audio.src); // release the object URL to avoid leaks
    playNext();
  };
  audio.play();
}
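
For barge-in, a minimal interruption sketch: call this from your voice-activity or STT callback the moment new user speech is detected.

interrupt.js
// Stop speaking the moment the user starts talking again
function interruptPlayback() {
  audioQueue.forEach((a) => URL.revokeObjectURL(a.src));
  audioQueue = [];
  if (currentAudio) {
    currentAudio.pause();
    URL.revokeObjectURL(currentAudio.src);
    currentAudio = null;
  }
  isPlaying = false;
}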

⚠ Common Pitfalls

  • Standard <audio> elements have a slight delay between source changes; use an AudioContext for sample-accurate scheduling (see the sketch below).
  • Failing to clear the queue on new user input leads to 'overlapping' conversations.
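
A sketch of that sample-accurate approach, assuming you requested raw PCM from the TTS (e.g. the 'pcm_44100' format: 16-bit little-endian mono at 44.1 kHz):

pcm-scheduler.js
const ctx = new AudioContext({ sampleRate: 44100 });
let nextStart = 0;

function schedulePcmChunk(bytes) {
  // Convert 16-bit PCM samples to the [-1, 1] float range Web Audio expects
  const pcm = new Int16Array(bytes.buffer, bytes.byteOffset, bytes.byteLength / 2);
  const floats = Float32Array.from(pcm, (s) => s / 32768);
  const buffer = ctx.createBuffer(1, floats.length, 44100);
  buffer.copyToChannel(floats, 0);
  const src = ctx.createBufferSource();
  src.buffer = buffer;
  src.connect(ctx.destination);
  // Queue chunks back-to-back for gapless playback
  nextStart = Math.max(nextStart, ctx.currentTime);
  src.start(nextStart);
  nextStart += buffer.duration;
}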

What you built

By using WebSockets across the STT, LLM, and TTS layers, you avoid the overhead of repeated HTTP handshakes. The key to a production-ready voice AI is balancing chunk size (for speed) against context length (for speech quality). Always implement a robust interruption handler so the AI feels responsive and human-like.