The ai-coustics SDK includes an efficient Voice Activity Detector (VAD) that is optimized for performance and accuracy. This guide explains how to use the VAD in your real-time applications.

What is VAD?

A Voice Activity Detector is a system that identifies the presence of human speech in an audio stream. The ai-coustics VAD is tightly integrated with our Quail enhancement models, allowing it to make highly accurate predictions even in noisy environments or with interfering speakers. While our standard Quail models handle most environments effectively, the Quail Voice Focus model is the best choice for pure speaker isolation. It specifically targets the primary speaker, preventing background chatter from triggering false positives.

Use Cases

Integrating the VAD into your pipeline can significantly improve your application’s performance and user experience.
  • Improved Turn-Taking: In voice agent or conversational AI applications, the VAD provides a reliable signal for detecting the end of a user’s turn.
  • Cost Reduction: By processing audio only when speech is present, you can reduce computational load and downstream processing costs (e.g., for STT).
  • Enhanced STT Accuracy: Providing a clean, speech-only audio stream to your Speech-to-Text engine can reduce errors like insertions and substitutions caused by background noise or interfering speech.

How it Works

The VAD is not a standalone component; it is created from an existing model instance. It analyzes the enhanced audio from the model to make its prediction, benefiting from the noise and reverb removal already performed. The basic workflow is:
  1. Create a processor with a model
  2. Create a VAD instance from the processor
  3. Process your audio
  4. Query the VAD to see if speech was detected in the last processed frame
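The workflow above can be illustrated with a small, self-contained stand-in. Note that this is not the ai-coustics API: `ToyVad`, its method names, and the plain energy-based decision are hypothetical stand-ins for the processor- and model-driven prediction described in this guide.

```python
class ToyVad:
    """Toy energy-based VAD illustrating the create -> process -> query workflow.

    The real SDK derives its prediction from the enhancement model's output;
    here a raw energy threshold stands in for that prediction.
    """

    def __init__(self, sensitivity: float = 6.0):
        # Energy threshold follows the documented formula: 10^(-sensitivity)
        self.threshold = 10.0 ** (-sensitivity)
        self.speech_detected = False

    def process(self, frame):
        # Step 3: process one frame of audio samples (floats in [-1, 1])
        energy = sum(s * s for s in frame) / len(frame)
        self.speech_detected = energy > self.threshold

    def is_speech_detected(self) -> bool:
        # Step 4: query the result for the last processed frame
        return self.speech_detected


# Steps 1-2 (creating a processor and a VAD from it) are collapsed
# into the constructor in this toy version.
vad = ToyVad(sensitivity=6.0)
vad.process([0.1] * 160)   # loud frame: energy 0.01 exceeds 1e-6
print(vad.is_speech_detected())  # True
vad.process([0.0] * 160)   # silent frame
print(vad.is_speech_detected())  # False
```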

VAD Parameters

You can fine-tune the VAD’s behavior using the following parameters.
Sensitivity

Controls the sensitivity (energy threshold) of the VAD. The energy of a speech signal must exceed this threshold for the signal to be classified as speech.
  • Formula: Energy Threshold = 10^(-sensitivity)
  • Range: 0.0 to 15.0
  • Default: 6.0
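The mapping from sensitivity to energy threshold can be checked directly (plain Python, no SDK required):

```python
def energy_threshold(sensitivity: float) -> float:
    # Documented formula: threshold = 10^(-sensitivity)
    return 10.0 ** (-sensitivity)


# Higher sensitivity -> lower threshold -> quieter signals count as speech.
print(energy_threshold(0.0))   # 1.0 (least sensitive)
print(energy_threshold(6.0))   # ~1e-06 (the default)
print(energy_threshold(15.0))  # ~1e-15 (most sensitive)
```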
Speech Hold Duration

Controls how long the VAD continues to report speech after the audio signal no longer contains speech. This stabilizes speech-detected → not-detected transitions.

The VAD reports speech if at least 50% of the frames in the last speech_hold_duration × 2 seconds contained speech. If speech reappears during the hold period, those frames reset the calculation until the 50% threshold is no longer met.
This value is rounded to the nearest model processing window (e.g. 10 ms), so the returned value may differ from what was set.
  • Unit: Seconds
  • Range: 0.0 to 1.0
  • Default: 0.03
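The hold behavior can be modeled as a majority vote over a sliding window of recent frame decisions. This is a sketch of the documented rule, not the SDK's implementation; the 10 ms frame size and the window arithmetic are assumptions.

```python
from collections import deque

FRAME_SECONDS = 0.01  # assumed 10 ms processing window


def make_hold_filter(speech_hold_duration: float):
    # The vote window covers the last speech_hold_duration * 2 seconds.
    window_frames = max(1, round(speech_hold_duration * 2 / FRAME_SECONDS))
    history = deque(maxlen=window_frames)

    def update(raw_speech: bool) -> bool:
        history.append(raw_speech)
        # Report speech while at least 50% of recent frames contained speech.
        return sum(history) >= len(history) / 2

    return update


hold = make_hold_filter(speech_hold_duration=0.03)  # default: 6-frame window
out = [hold(x) for x in [True, True, True, False, False, False, False, False]]
print(out)  # speech is held for three extra frames after it stops
```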
Speech Onset Duration

Controls how long speech must be continuously present before the VAD classifies it as speech. This stabilizes not-detected → detected transitions.
This value is rounded to the nearest model processing window (e.g. 10 ms), so the returned value may differ from what was set.
  • Unit: Seconds
  • Range: 0.0 to 1.0
  • Default: 0.0
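The onset behavior can be sketched the same way: a counter that only reports speech once enough consecutive speech frames have accumulated. Again, this models the documented rule with an assumed 10 ms frame size; it is not the SDK's implementation.

```python
FRAME_SECONDS = 0.01  # assumed 10 ms processing window


def make_onset_filter(onset_duration: float):
    # Speech must be continuously present for onset_duration seconds
    # before the filter reports it. The 0.0 default means a single
    # speech frame is enough.
    required = max(1, round(onset_duration / FRAME_SECONDS))
    streak = 0

    def update(raw_speech: bool) -> bool:
        nonlocal streak
        streak = streak + 1 if raw_speech else 0
        return streak >= required

    return update


onset = make_onset_filter(onset_duration=0.02)  # require 2 consecutive frames
out = [onset(x) for x in [True, False, True, True, True]]
print(out)  # the isolated first frame is suppressed; the later run passes
```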

Best Practices

  • Tune Sensitivity: The optimal sensitivity may vary depending on your audio source and environment. Start with the default and adjust as needed.
  • Use with Quail Voice Focus: For applications with multiple speakers, using the VAD with the Voice Focus model provides the best results for isolating the primary speaker.