How to Build Voice AI That Works Without Internet

Some environments don't have internet. Field service crews work inside industrial plants with no cell signal. Vehicles pass through tunnels and rural dead zones. Secure facilities prohibit wireless connections entirely. Consumer devices end up in basements, aircraft, and remote areas where connectivity is unreliable or absent.

When voice AI depends on a cloud API, it stops working the moment the network drops. For applications where voice interaction is critical, that's not acceptable. This article covers what it takes to build a voice AI system that works without any network connectivity: the on-device pipeline architecture, model packaging and update strategies, platform-specific considerations, and the offline-first pattern that treats cloud as an optional enhancement rather than a requirement.

When Voice AI Needs to Work Offline

The most common offline scenarios fall into several categories, each with distinct constraints.

Field service and remote infrastructure. Technicians operating in industrial plants, oil rigs, remote towers, or underground facilities frequently have no cell or Wi-Fi coverage. Hands-free voice interaction is most valuable precisely where workers can't touch a screen, and those are the same environments where cloud connectivity is least reliable.

Vehicles and transportation. In-car voice assistants, logistics fleet systems, and aviation applications all traverse areas with intermittent or no connectivity. A voice system that works on the highway but fails in a tunnel or rural stretch creates an inconsistent, unreliable user experience.

Secure and regulated facilities. Military installations, government buildings, healthcare facilities, and financial trading floors may restrict or prohibit network connections for security reasons. Voice AI in these contexts must operate entirely on local hardware with no external data transmission.

Consumer devices in variable conditions. Smartphones, wearables, and IoT devices are used everywhere, including places with poor reception. An offline voice assistant that degrades gracefully (or not at all) when the network disappears provides a fundamentally better experience than one that displays a "no connection" error.

What "Offline" Means Architecturally

An offline voice AI system has a strict requirement: no network dependency at inference time. Every component in the voice pipeline must be able to run using only the resources available on the local device. That means:

  • All machine learning models (for speech recognition, text-to-speech, voice activity detection, and any NLU or command processing) must be stored on-device

  • All vocabulary, language models, and configuration data must be local

  • The audio capture, processing, and playback pipeline must operate without any network calls

  • No authentication tokens, license checks, or API calls can gate the core voice functionality

This is a higher bar than "works with a slow connection." An offline system must function identically whether the device has full connectivity, a degraded connection, or no network interface at all.

The On-Device Pipeline for Offline Voice AI

A complete offline voice AI pipeline mirrors the cloud pipeline in structure but runs every stage locally. The core stages are:

Voice activity detection (VAD). Continuously monitors the microphone input to detect when a user is speaking. On-device VAD models like Silero VAD are small (under 2 MB) and fast (under 1 ms per 30 ms audio frame on mobile ARM processors). VAD must run continuously to catch the start of speech, so efficiency matters.
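
To make the continuous-listening requirement concrete, here is a minimal sketch of how a VAD can gate the rest of the pipeline. The `vad_speech_probability` call is a hypothetical stand-in for Silero VAD inference, and the 0.5 threshold and ~500 ms silence window are illustrative defaults, not tuned values:

```cpp
#include <vector>

// Hypothetical wrapper around a Silero VAD inference call; the real model
// returns a speech probability for each ~30 ms frame of 16 kHz audio.
float vad_speech_probability(const std::vector<float>& frame);

// Gate the downstream STT stage on speech activity so the expensive
// recognizer only runs when someone is actually talking.
struct SpeechGate {
    bool in_speech = false;
    std::vector<float> utterance;        // buffered audio for one utterance
    int trailing_silence_frames = 0;

    // Returns true when a complete utterance is ready to hand to STT
    // (the caller passes `utterance` to the recognizer and then clears it).
    bool push_frame(const std::vector<float>& frame) {
        const bool speech = vad_speech_probability(frame) > 0.5f;
        if (speech) {
            in_speech = true;
            trailing_silence_frames = 0;
        } else if (in_speech) {
            ++trailing_silence_frames;
        }
        if (in_speech) utterance.insert(utterance.end(), frame.begin(), frame.end());

        // ~500 ms of silence (about 16 frames of 30 ms) ends the utterance.
        if (in_speech && trailing_silence_frames > 16) {
            in_speech = false;
            return true;
        }
        return false;
    }
};
```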

Speech-to-text (STT/ASR/speech recognition). Converts the detected speech audio into text. This is the most computationally demanding stage. On-device STT models range from Whisper Tiny (39M parameters, ~75 MB quantized) to Whisper Medium (769M parameters, ~1.5 GB quantized). The choice depends on the accuracy requirements and the target device's capabilities. Whisper runs on-device via whisper.cpp, which provides optimized inference for ARM and x86 CPUs.
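
As a rough sketch of what the local STT stage looks like with whisper.cpp's C API (the model filename and the silent placeholder buffer are assumptions; a real pipeline feeds it 16 kHz mono PCM captured from the microphone, typically gated by the VAD above):

```cpp
#include "whisper.h"
#include <cstdio>
#include <vector>

int main() {
    // Load the quantized Whisper model from local storage: no network access.
    struct whisper_context_params cparams = whisper_context_default_params();
    struct whisper_context * ctx =
        whisper_init_from_file_with_params("models/ggml-tiny.en-q5_1.bin", cparams);
    if (!ctx) return 1;

    // 16 kHz mono float PCM; here three seconds of silence stand in for a
    // real utterance delivered by the local audio pipeline.
    std::vector<float> pcm(16000 * 3, 0.0f);

    struct whisper_full_params wparams = whisper_full_default_params(WHISPER_SAMPLING_GREEDY);
    wparams.language      = "en";
    wparams.no_timestamps = true;

    if (whisper_full(ctx, wparams, pcm.data(), (int) pcm.size()) == 0) {
        for (int i = 0; i < whisper_full_n_segments(ctx); ++i) {
            std::printf("%s\n", whisper_full_get_segment_text(ctx, i));
        }
    }
    whisper_free(ctx);
    return 0;
}
```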

Intent processing or NLU. Interprets the transcribed text to determine what the user wants. For command-and-control applications (fixed vocabulary like "next," "stop," "open ticket"), simple keyword matching or a lightweight classifier works. For open-ended dialogue, an on-device LLM (such as a quantized Llama model via llama.cpp) can handle more complex interpretation, though at significant memory and compute cost.
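
For the fixed-vocabulary case, the intent stage can be as simple as normalizing the transcript and looking it up in a command table. A minimal sketch, where the commands and intent names are placeholders for your own vocabulary:

```cpp
#include <algorithm>
#include <cctype>
#include <string>
#include <unordered_map>

// Possible command intents for a fixed-vocabulary application.
enum class Intent { Next, Stop, OpenTicket, Unknown };

Intent classify(const std::string& transcript) {
    // Map normalized transcript phrases to intents.
    static const std::unordered_map<std::string, Intent> commands = {
        {"next", Intent::Next},
        {"stop", Intent::Stop},
        {"open ticket", Intent::OpenTicket},
    };

    // Lowercase and strip punctuation so "Next." matches "next".
    std::string text;
    for (char c : transcript) {
        if (std::isalnum(static_cast<unsigned char>(c)) || c == ' ')
            text += static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    }

    auto it = commands.find(text);
    return it != commands.end() ? it->second : Intent::Unknown;
}
```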

Text-to-speech (TTS/speech synthesis/voice generation). Converts the system's response text into audible speech. On-device TTS models like Silero TTS generate natural-sounding speech in 50 to 200 milliseconds for short responses. The models are compact (typically 10 to 50 MB) and run efficiently on mobile hardware.

Each of these stages must be self-contained. If any stage requires a network call, the system isn't truly offline.

Model Packaging and Updates

Shipping ML models as part of an application introduces challenges that cloud-based systems avoid entirely.

Bundling Models with the Application

The simplest approach is to include model files in the application package (the .ipa for iOS, .apk/.aab for Android, or the application bundle for desktop and embedded). This guarantees the models are available from first launch with no download step.

The trade-off is application size. A minimal offline voice pipeline (Whisper Tiny for STT, Silero VAD, Silero TTS) adds roughly 100 to 150 MB to the application. Larger STT models push that higher. App store size limits and user expectations about download size constrain what you can bundle.

An alternative is to ship the application without models and download them on first launch or on demand. This keeps the initial download small but means the offline voice feature isn't available until the models are fetched, which requires connectivity.

Storage and Memory Budgets

On-device models compete with the application's other data for storage space and runtime memory. Budget constraints vary dramatically by platform:

  • Modern smartphones (2022+): 4 to 12 GB RAM, 64 to 512 GB storage. Running Whisper Small or Medium alongside other app components is feasible on flagships. Budget devices are more constrained.

  • Embedded Linux boards (Raspberry Pi 4, Jetson Nano): 2 to 8 GB RAM, storage limited by SD card. Whisper Tiny or Base is the practical ceiling.

  • Automotive and industrial: Highly variable. Some automotive platforms have dedicated ML accelerators; others run on constrained embedded processors.

Model quantization (reducing weights from FP32 to INT8 or INT4) cuts both storage and memory requirements by 50 to 75 percent with modest accuracy trade-offs. For offline deployments, quantized models are almost always the right choice.
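
The savings follow directly from bytes per weight. As napkin math (ignoring file-format overhead and tensors kept at higher precision), Whisper Medium's 769M parameters work out roughly as follows; the ~1.5 GB figure quoted above corresponds to 16-bit weights:

```cpp
#include <cstdio>

// Rough model-size estimate: parameter count times bytes per weight.
// Real GGML/Core ML files add metadata and keep some tensors at higher
// precision, so treat these as approximate lower bounds.
int main() {
    const double params = 769e6;                           // Whisper Medium
    std::printf("FP32: ~%.1f GB\n", params * 4.0 / 1e9);   // ~3.1 GB
    std::printf("FP16: ~%.1f GB\n", params * 2.0 / 1e9);   // ~1.5 GB
    std::printf("INT8: ~%.2f GB\n", params * 1.0 / 1e9);   // ~0.77 GB
    std::printf("INT4: ~%.2f GB\n", params * 0.5 / 1e9);   // ~0.38 GB
}
```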

OTA Model Updates

Models improve over time. New versions offer better accuracy, support additional languages, or fix edge cases. Updating on-device models requires a delivery mechanism that works within the constraints of offline-first design.

The typical pattern is opportunistic OTA (over-the-air) updates: when the device has connectivity, it checks for model updates in the background and downloads them for staged deployment. The key requirement is that the update process never blocks the voice pipeline. The currently-installed model continues to work while the update downloads. When the download completes, the application swaps to the new model at the next convenient opportunity (app restart, session boundary).

This is the same pattern mobile operating systems use for system updates: download when connected, apply when convenient, never break the current functionality.
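
A sketch of the staging half of that pattern, using std::filesystem for the atomic swap; `download_to` and `checksum_ok` are hypothetical hooks standing in for your transport and integrity check:

```cpp
#include <filesystem>
#include <string>

namespace fs = std::filesystem;

// Hypothetical hooks: how the update is fetched and verified is app-specific.
bool download_to(const std::string& url, const fs::path& dest);    // false when offline
bool checksum_ok(const fs::path& file, const std::string& sha256);

// Stage a new model next to the active one; never touch the file the
// running pipeline has open. Returns true once a staged model is ready.
bool stage_model_update(const std::string& url, const std::string& sha256,
                        const fs::path& staged) {
    fs::path tmp = staged;
    tmp += ".partial";
    if (!download_to(url, tmp)) return false;          // no connectivity: try again later
    if (!checksum_ok(tmp, sha256)) { fs::remove(tmp); return false; }
    fs::rename(tmp, staged);                           // atomic on the same filesystem
    return true;
}

// At a session boundary (app restart, end of a voice interaction), promote
// the staged model and reload the pipeline from the new path.
void swap_in_staged_model(const fs::path& staged, const fs::path& active) {
    if (fs::exists(staged)) {
        fs::rename(staged, active);
        // ...re-initialize the STT/TTS contexts from `active` here...
    }
}
```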

Offline-First with Cloud Fallback

A strict offline-only architecture works for environments where connectivity is never available. But many applications operate in environments where connectivity is intermittent or variable. For these cases, the offline-first pattern provides the best of both approaches.

The principle is straightforward: run the full voice pipeline on-device by default. When cloud connectivity is available, optionally use it to enhance the results.

Where cloud fallback adds value:

  • Larger vocabulary STT. On-device models have finite vocabulary and language support. Cloud STT services can handle rare words, specialized terminology, or languages that the on-device model doesn't support well.

  • Complex NLU. Cloud-hosted LLMs are larger and more capable than what fits on a mobile device. For open-ended dialogue or complex queries, cloud processing can deliver better results.

How the fallback works in practice:

The on-device pipeline runs first and produces a result. If the device has connectivity and the confidence is low (or the query is complex), the system can optionally send the audio or transcript to a cloud service for a second opinion. The on-device result serves as the immediate response (eliminating wait time), and the cloud result can refine or correct it asynchronously.
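
In code, the decision logic is small. This sketch assumes a hypothetical on-device recognizer that reports a confidence score and an asynchronous cloud refinement hook; the 0.6 threshold is illustrative:

```cpp
#include <functional>
#include <string>
#include <vector>

struct Transcript {
    std::string text;
    float confidence;   // 0..1, as reported by the on-device recognizer
};

// Hypothetical hooks; the names are illustrative, not a specific SDK API.
Transcript run_on_device_stt(const std::vector<float>& pcm);
bool       network_available();
void       request_cloud_refinement(const std::vector<float>& pcm,
                                    std::function<void(const Transcript&)> on_result);

// Offline-first: always answer from the on-device result immediately, and
// only ask the cloud for a second opinion when it is reachable and the
// local confidence is low.
Transcript transcribe(const std::vector<float>& pcm,
                      std::function<void(const Transcript&)> on_refined) {
    Transcript local = run_on_device_stt(pcm);   // never blocked by the network
    if (network_available() && local.confidence < 0.6f) {
        request_cloud_refinement(pcm, std::move(on_refined));   // async, non-blocking
    }
    return local;   // immediate response; the UI can update later if refined
}
```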

This pattern ensures that the voice system always responds, even without connectivity, while taking advantage of cloud capabilities when they're available. The user experience is consistent because the on-device pipeline handles the common cases, and cloud fallback improves accuracy for the uncommon ones.

Platform Considerations

iOS

iOS provides strong support for on-device ML inference through Core ML and the Neural Engine. Whisper models converted to Core ML format can run on the Neural Engine with lower latency and power consumption than CPU-only execution. AVAudioSession handles microphone access, and the audio pipeline integrates cleanly with the iOS audio stack.

The main constraint is app size. The App Store allows apps up to 4 GB, but users expect reasonable download sizes. Bundling large models may require App Thinning or on-demand resources.

For a working implementation of an offline voice assistant on iOS, see Voice Control with On-Device AI, which demonstrates the full VAD-to-STT pipeline using Switchboard's iOS SDK.

Android

Android's ML ecosystem is more fragmented. NNAPI provides a hardware abstraction layer for ML inference, but support and performance vary across manufacturers. TensorFlow Lite and the GPU delegate offer more consistent cross-device performance.

The bigger challenge on Android is device diversity. The same application must run on flagship phones with dedicated NPUs and budget devices with limited RAM. Model selection may need to be adaptive: use a larger, more accurate model on capable hardware and fall back to a smaller model on constrained devices.
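
A capability probe at startup can map device resources to a model tier. A minimal sketch follows; the thresholds and model names are illustrative, and on Android the RAM figure would typically come from ActivityManager.MemoryInfo:

```cpp
#include <cstdint>
#include <string>

// Hypothetical capability probes. On Android these would be backed by
// ActivityManager.MemoryInfo and device feature queries.
uint64_t total_ram_bytes();
bool     has_ml_accelerator();

// Choose the largest STT model the device can hold comfortably alongside
// the rest of the application. Thresholds are illustrative, not tuned.
std::string select_stt_model() {
    const uint64_t gib = 1ull << 30;
    if (total_ram_bytes() >= 6 * gib && has_ml_accelerator()) return "whisper-small-q8";
    if (total_ram_bytes() >= 4 * gib)                          return "whisper-base-q8";
    return "whisper-tiny-q8";
}
```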

Embedded Linux

Embedded Linux platforms (Raspberry Pi, NVIDIA Jetson, custom boards) offer the most flexibility but the least hand-holding. There's no platform audio session to manage, no app store size limit, no built-in ML runtime, and no pre-configured audio routing. You ship the models, the inference runtime, and the audio pipeline as a self-contained package.

Switchboard's C++ API runs on embedded Linux, providing the audio graph architecture (VAD, STT/ASR/speech recognition, TTS/speech synthesis/voice generation, noise suppression) as a library that integrates into your application. The models ship alongside the binary, and the entire system runs without any external dependencies.

Automotive

Automotive voice systems have unique constraints: fixed hardware that doesn't get upgraded, long product lifecycles (10+ years), strict safety certification requirements, and the expectation of instant responsiveness. Offline operation is non-negotiable because vehicles regularly lose connectivity.

On-device voice AI is a natural fit for automotive. The voice pipeline runs on the vehicle's application processor, models are delivered as part of the system software, and updates come through the vehicle's OTA update mechanism. Latency requirements are strict because driver distraction is a safety concern. For more on optimizing voice AI response time, see Voice AI Latency: Where It Comes From and How to Reduce It.

Privacy as a Side Effect

An interesting property of offline-first voice AI is that it inherently provides strong privacy guarantees. If audio never leaves the device, there's no transmission to intercept and no server-side storage to breach. For organizations that choose offline deployment for connectivity reasons, privacy compliance comes as a structural benefit rather than an additional engineering effort.

For a deeper exploration of the privacy implications, including regulatory compliance considerations for GDPR, HIPAA, and data residency requirements, see Privacy-First Voice AI: Why On-Device Processing Keeps Voice Data Safe.

Available Now

Switchboard's on-device audio SDK provides the complete pipeline for offline voice AI: voice activity detection (Silero VAD), speech-to-text (STT/ASR/speech recognition via whisper.cpp), text-to-speech (TTS/speech synthesis/voice generation via Silero TTS), noise suppression (RNNoise), and echo cancellation (WebRTC AEC3). All components run on-device with no cloud dependency, across iOS, Android, desktop, and embedded Linux.

If you're building a voice application that needs to work without internet, explore the Switchboard documentation to see how the on-device pipeline fits together. For a hands-on starting point, the voice control iOS example demonstrates the full offline pipeline in a working application.