Voice AI Latency: Where It Comes From and How to Reduce It

Latency is what separates a voice AI system that feels conversational from one that feels like talking to a switchboard operator in the 1950s. When a user speaks and the system takes more than a few hundred milliseconds to respond, the interaction breaks down. The user pauses, wonders if they were heard, repeats themselves, or gives up entirely.

For developers building voice-enabled applications, understanding where latency comes from is the first step toward eliminating it. This article decomposes the full speech-to-speech latency stack, compares cloud and on-device architectures, and covers practical optimization techniques for reducing end-to-end response time.

Why Latency Matters in Voice AI

Human conversation has a natural rhythm. Research on conversational turn-taking shows that speakers typically expect a response within 200 to 300 milliseconds of finishing their turn. Beyond that threshold, the pause becomes perceptible. Beyond 500 milliseconds, it becomes uncomfortable. Beyond a full second, most users assume the system has failed.

This expectation is deeply ingrained. It applies whether you're building a voice-controlled mobile app, an in-car assistant, a customer service bot, or a hands-free interface for industrial equipment. The latency budget is the same because human perception doesn't change based on the application.

For real-time voice AI, the latency target is clear: the system needs to begin responding within roughly 200 to 300 milliseconds of the user finishing their utterance. That budget covers everything from audio capture to audio playback.

The Latency Stack: Decomposing a Voice Round-Trip

A complete speech-to-speech interaction passes through multiple processing stages, each contributing latency. Here's the full pipeline, in order:

  1. Audio capture buffer. The microphone captures audio in fixed-size buffers (typically 10 to 40 milliseconds). The system cannot process audio until a full buffer arrives. Smaller buffers reduce latency but increase CPU overhead.

  2. Voice activity detection (VAD). The system must determine when the user has started and stopped speaking. VAD processing itself is fast (under 1 ms for models like Silero VAD on a 30 ms frame), but the end-of-speech detection adds latency because the system must wait for a silence period to confirm the utterance is complete. This "endpointing" delay is typically 300 to 800 milliseconds and is often the single largest contributor to perceived latency.

  3. Speech-to-text inference (STT/ASR/speech recognition). The audio is transcribed to text. Inference time depends on model size and hardware capabilities, as well as whether processing is streaming (incremental) or batch (after the full utterance). On mobile ARM hardware, a quantized Whisper Tiny model processes a 5-second utterance in roughly 200 to 500 milliseconds. Larger models (Small, Medium) take proportionally longer.

  4. Natural language understanding or LLM processing. The transcribed text is interpreted. For simple command recognition, this is near-instant (string matching or lightweight classification). For LLM-based dialogue, this can add 500 milliseconds to several seconds depending on model size and whether inference is local or remote.

  5. Text-to-speech inference (TTS/speech synthesis/voice generation). The response text is converted to audio. On-device TTS models like Silero TTS can generate speech in 50 to 200 milliseconds for short responses on mobile hardware. Cloud TTS adds network round-trip time on top of inference.

  6. Audio playback buffer. The generated audio is queued for playback. The playback buffer adds another 10 to 40 milliseconds before the user hears the first sample.

The total end-to-end latency is the sum of all these stages. For a non-streaming pipeline, that can easily exceed 1.5 to 2 seconds even before accounting for network latency.
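To make that arithmetic concrete, here is a small Swift sketch that sums illustrative midpoints of the ranges above for a non-streaming pipeline; the figures are assumptions for illustration, not measurements.

```swift
// Back-of-envelope total for a non-streaming pipeline.
// The millisecond values are illustrative midpoints, not benchmarks.
let stages: [(name: String, ms: Double)] = [
    ("Audio capture buffer", 25),
    ("VAD endpointing", 550),
    ("STT inference (batch, Whisper Tiny)", 350),
    ("NLU / LLM processing", 1000),
    ("TTS inference", 125),
    ("Audio playback buffer", 25),
]

let totalMs = stages.reduce(0.0) { $0 + $1.ms }
stages.forEach { print("\($0.name): \(Int($0.ms)) ms") }
print("End-to-end (non-streaming): \(Int(totalMs)) ms")   // ~2075 ms
```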

Cloud Latency vs On-Device Latency

The choice between cloud and on-device processing fundamentally changes the latency profile of a voice AI system. Here's what the network adds when you send audio to a cloud speech API:

Network overhead for a cloud round-trip:

  • DNS resolution (cached: ~0 ms, cold: 50 to 200 ms)

  • TLS handshake (first connection: 50 to 150 ms, reused: ~0 ms)

  • Audio upload (depends on connection speed and utterance length; a 5-second utterance at 16 kHz, 16-bit mono is roughly 160 KB, which takes 10 to 50 ms on a good connection)

  • Server queue wait (variable, depends on provider load; typically 10 to 100 ms)

  • Server-side inference (typically 100 to 500 ms for STT, varies by provider)

  • Response download (small payload, typically under 10 ms)

On a stable connection, a cloud STT round-trip adds 200 to 800 milliseconds on top of the base audio capture and VAD delays. On a congested mobile network, it can exceed 2 seconds. If TTS is also cloud-based, double the network overhead.
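As a quick sanity check on the upload figure, the sketch below computes the payload size for a 5-second, 16 kHz, 16-bit mono utterance and its transfer time at a few assumed uplink speeds; the bandwidth values are illustrative.

```swift
// Payload size and upload time for a 5-second utterance sent to a cloud STT API.
let seconds = 5.0
let sampleRate = 16_000.0        // Hz, mono
let bytesPerSample = 2.0         // 16-bit PCM
let payloadBytes = seconds * sampleRate * bytesPerSample   // 160,000 bytes ~ 160 KB

// Assumed uplink speeds, for illustration only.
for uplinkMbps in [10.0, 25.0, 100.0] {
    let transferMs = payloadBytes * 8.0 / (uplinkMbps * 1_000_000.0) * 1_000.0
    print("\(Int(uplinkMbps)) Mbps uplink: ~\(Int(transferMs)) ms upload")
}
// 10 Mbps -> ~128 ms, 25 Mbps -> ~51 ms, 100 Mbps -> ~12 ms
```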

What on-device eliminates:

  • All network-related latency (DNS, TLS, upload, download, server queue)

  • Dependency on connection quality or availability

  • Variable latency caused by server load

  • Risk of provider-side outages or degraded performance affecting your application

What on-device doesn't eliminate:

  • Audio capture and playback buffer latency (hardware-dependent)

  • VAD endpointing delay (algorithm-dependent)

  • Model inference time (now running on device hardware rather than server GPUs)

  • Thermal throttling under sustained load on mobile devices

The trade-off is that on-device inference runs on less powerful hardware than a cloud GPU cluster. But for mobile-optimized models, the inference time on modern smartphone processors is competitive with cloud round-trips when you factor in network overhead. A quantized Whisper Tiny model running on an iPhone's A-series chip or a mid-range Android ARM processor delivers results faster than sending audio to a cloud API and waiting for the response, because the network overhead is gone entirely.

Practical Latency Numbers by Pipeline Stage

These ranges represent typical performance on modern mobile ARM hardware (2022-era flagship and mid-range smartphones). Actual numbers vary by device, model version, and workload.

| Pipeline Stage | Typical Latency Range | Notes |
| --- | --- | --- |
| Audio capture buffer | 10-40 ms | Hardware and OS dependent |
| VAD processing | < 1 ms per frame | Silero VAD on 30 ms frames |
| VAD endpointing | 300-800 ms | Waiting for silence to confirm end of speech |
| STT inference (Whisper Tiny, quantized) | 200-500 ms | For a 5-second utterance, batch mode |
| STT inference (streaming) | 50-150 ms incremental | Partial results as audio streams in |
| NLU / command matching | < 5 ms | Simple keyword or intent matching |
| NLU / on-device LLM | 500-2000 ms | Depends heavily on model size and quantization |
| TTS inference (on-device) | 50-200 ms | Short response, Silero TTS or similar |
| Audio playback buffer | 10-40 ms | Hardware and OS dependent |

Caveats: These numbers are approximations, not benchmarks. Performance varies significantly across devices. A 2024 flagship phone will be meaningfully faster than a 2020 mid-range device. Thermal throttling during sustained use also degrades performance. Always profile on your target hardware.
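One way to follow that advice is to wrap each stage in a simple timer on the target device. The helper below is a generic sketch; `runSTT` in the usage comment is a hypothetical stand-in for your actual inference call.

```swift
import Foundation

// Minimal per-stage timing helper for profiling on the target device.
@discardableResult
func measureMs(_ label: String, _ block: () -> Void) -> Double {
    let start = DispatchTime.now()
    block()
    let elapsedNs = DispatchTime.now().uptimeNanoseconds - start.uptimeNanoseconds
    let elapsedMs = Double(elapsedNs) / 1_000_000.0
    print("\(label): \(String(format: "%.1f", elapsedMs)) ms")
    return elapsedMs
}

// Usage sketch (runSTT is hypothetical):
// let sttMs = measureMs("STT inference") { runSTT(audioBuffer) }
```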

Optimization Techniques

Once you understand where latency comes from, you can attack each stage systematically.

Streaming and Chunked Inference

The biggest single optimization is switching from batch to streaming inference for STT. In batch mode, the system waits for the full utterance before starting transcription. In streaming mode, the model processes audio chunks as they arrive, producing partial transcripts incrementally.

Streaming STT fundamentally changes the latency equation. Instead of waiting for VAD endpointing + full inference, the system has a running transcript that's nearly complete by the time the user finishes speaking. The remaining latency after end-of-speech is just the final chunk processing, typically 50 to 150 milliseconds.
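The sketch below illustrates the pattern with a hypothetical `StreamingTranscriber` interface (not an actual SDK API): audio chunks are fed as they arrive, partial transcripts come back continuously, and only the last chunk remains to process when the user stops speaking.

```swift
// Hypothetical streaming STT interface, shown only to illustrate the pattern.
protocol StreamingTranscriber: AnyObject {
    var onPartialTranscript: ((String) -> Void)? { get set }
    var onFinalTranscript: ((String) -> Void)? { get set }
    func feed(_ samples: [Float])     // push audio chunks as they are captured
    func endOfSpeech()                // flush and finalize the last chunk only
}

// Usage sketch: call feed() from the capture callback every 20-30 ms.
// By the time VAD reports end of speech, most of the utterance is already
// transcribed, so endOfSpeech() adds only the final-chunk latency (~50-150 ms).
```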

The same principle applies to TTS. Streaming TTS begins audio playback before the full response has been synthesized, overlapping generation with playback.

Model Quantization

Quantization reduces model weights from 32-bit floating point to 8-bit integers (or even 4-bit), cutting memory usage and inference time with modest accuracy trade-offs. For on-device voice AI, INT8 quantization of STT models typically reduces inference time by 40 to 60 percent compared to FP32, with minimal impact on word error rate.
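As a toy illustration of the underlying idea (real toolchains such as whisper.cpp use per-block scales and more careful rounding), symmetric INT8 quantization stores one scale per tensor and maps each weight to an 8-bit integer:

```swift
// Toy symmetric INT8 quantization of a weight tensor (illustrative only).
func quantizeInt8(_ weights: [Float]) -> (values: [Int8], scale: Float) {
    let maxAbs = weights.map { abs($0) }.max() ?? 0
    let scale = maxAbs > 0 ? maxAbs / 127.0 : 1.0
    let q = weights.map { Int8(clamping: Int(($0 / scale).rounded())) }
    return (q, scale)
}

func dequantizeInt8(_ values: [Int8], scale: Float) -> [Float] {
    values.map { Float($0) * scale }
}

// Each weight shrinks from 4 bytes (FP32) to 1 byte (INT8), and integer
// kernels run faster on mobile CPUs, which is where the latency win comes from.
```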

Whisper models are particularly well-suited to quantization. The whisper.cpp project provides pre-quantized models in multiple formats, and Switchboard's WhisperNode uses these optimized models for on-device deployment.

Buffer Size Tuning

Audio capture and playback buffers are often set conservatively by default (40 ms or larger). Reducing buffer size to 10 or 20 milliseconds cuts latency at both ends of the pipeline. The trade-off is higher CPU interrupt frequency, which can cause audio glitches on underpowered devices.

On iOS, setting the AVAudioSession preferred I/O buffer duration to 0.005 s (5 ms) or 0.01 s (10 ms) can significantly reduce I/O latency. On Android, using AAudio with AAUDIO_PERFORMANCE_MODE_LOW_LATENCY achieves similar results, though actual buffer sizes vary by device.
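On iOS, that configuration looks roughly like the following; the OS treats the preferred duration as a hint, so read back the granted value after activating the session.

```swift
import AVFoundation

// Request a small I/O buffer for low-latency capture and playback on iOS.
let session = AVAudioSession.sharedInstance()
do {
    try session.setCategory(.playAndRecord, mode: .voiceChat, options: [])
    try session.setPreferredIOBufferDuration(0.005)        // request 5 ms
    try session.setActive(true)
    print("Granted I/O buffer: \(session.ioBufferDuration * 1000) ms")
} catch {
    print("Audio session configuration failed: \(error)")
}
```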

VAD Endpointing Tuning

The VAD endpointing delay is often the largest contributor to perceived latency, and it's also the most tunable. A shorter silence threshold (200 ms instead of 500 ms) makes the system feel more responsive but risks cutting off the user mid-sentence during natural pauses.

Adaptive endpointing adjusts the silence threshold based on context. Short commands ("next", "stop") can use aggressive endpointing. Longer dictation can use more conservative thresholds. Some implementations use a two-stage approach: a short initial timeout triggers partial processing, while a longer timeout finalizes the utterance.
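A sketch of that two-stage idea, assuming the VAD reports how much trailing silence has accumulated; the thresholds are illustrative assumptions.

```swift
// Two-stage endpointing sketch; thresholds are illustrative assumptions.
struct EndpointConfig {
    var provisionalSilenceMs: Double = 200   // aggressive: start processing early
    var finalSilenceMs: Double = 600         // conservative: commit the utterance
}

enum EndpointDecision { case stillSpeaking, provisionalEnd, finalEnd }

func endpointDecision(trailingSilenceMs: Double,
                      config: EndpointConfig) -> EndpointDecision {
    if trailingSilenceMs >= config.finalSilenceMs { return .finalEnd }
    if trailingSilenceMs >= config.provisionalSilenceMs { return .provisionalEnd }
    return .stillSpeaking
}

// Called per VAD frame: kick off STT finalization at .provisionalEnd but keep
// listening; only commit the turn (and start TTS) at .finalEnd.
```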

Model Size Selection

Smaller models are faster. Whisper Tiny (39M parameters) is roughly 4x faster than Whisper Small (244M parameters) on the same hardware. The accuracy difference matters for some use cases (noisy environments, accented speech, technical vocabulary) but is negligible for simple command recognition.

Choose the smallest model that meets your accuracy requirements. For a voice-controlled app with a limited command vocabulary, Whisper Tiny or Base is likely sufficient. For open-ended transcription in noisy environments, you may need Small or Medium, accepting the latency cost.

Hardware Acceleration

Modern mobile processors include dedicated neural processing units (NPUs) and GPU compute capabilities that can accelerate model inference. On iOS, Core ML can dispatch Whisper inference to the Neural Engine, reducing latency compared to CPU-only execution. On Android, NNAPI and GPU delegates in TensorFlow Lite provide similar acceleration paths.
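With Core ML, for example, selecting the compute path is a one-line configuration; whether a particular model actually runs on the Neural Engine depends on which operations it uses (the model class in the comment is hypothetical).

```swift
import CoreML

// Allow Core ML to dispatch to CPU, GPU, or the Neural Engine, whichever is fastest.
let config = MLModelConfiguration()
config.computeUnits = .all

// Hypothetical compiled model wrapper, for illustration:
// let encoder = try WhisperEncoder(configuration: config)
```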

Switchboard's audio graph architecture handles hardware dispatch automatically where supported, running model inference on the fastest available compute path without requiring developers to manage acceleration APIs directly.

Putting It Together: A Low-Latency On-Device Pipeline

A well-optimized on-device voice pipeline can achieve end-to-end latency under 500 milliseconds from end-of-speech to start-of-response. Here's how the budget breaks down:

  • VAD endpointing: 200 ms (aggressive, suitable for commands)

  • Streaming STT final chunk: 100 ms

  • Intent matching: < 5 ms

  • TTS first audio chunk: 80 ms

  • Audio playback buffer: 10 ms

  • Total: ~395 ms

That's well within the conversational turn-taking threshold. Compare that to a cloud pipeline where network overhead alone consumes 200 to 800 milliseconds before any processing begins.

For applications where latency is the primary concern, on-device processing with streaming inference and optimized buffering delivers the responsiveness that conversational voice AI demands. The echo cancellation stage, critical for duplex voice applications, adds its own latency considerations. For a deep dive into how WebRTC AEC3 handles echo cancellation with minimal latency overhead, see How WebRTC AEC3 Works.

For applications where the voice system also needs to work without connectivity, on-device processing provides that capability inherently. See How to Build Voice AI That Works Without Internet for the architectural patterns behind offline-first voice AI.

When privacy requirements drive the on-device decision, the latency benefits come as a bonus. Privacy-First Voice AI covers how on-device processing keeps voice data safe while also delivering lower latency than cloud alternatives.

Available Now

Switchboard's on-device audio SDK provides the building blocks for low-latency voice AI: streaming STT (speech recognition, ASR) via whisper.cpp, on-device TTS (speech synthesis, voice generation) via Silero, VAD with configurable endpointing, and the audio graph architecture that connects them with minimal buffer overhead. The SDK runs on iOS, Android, desktop, and embedded Linux.

If you're building a voice application where response time matters, check out the Switchboard documentation to see how the on-device pipeline works in practice. You can start with the voice control example for iOS and adapt it to your latency requirements.