What is VAD?
A Voice Activity Detector is a system that identifies the presence of human speech in an audio stream. The ai-coustics VAD is tightly integrated with our Quail enhancement models, allowing it to make highly accurate predictions even in noisy environments or with interfering speakers. While our standard Quail models handle most environments effectively, the Quail Voice Focus model is the best choice for pure speaker isolation. It specifically targets the primary speaker, preventing background chatter from triggering false positives.
Use Cases
Integrating the VAD into your pipeline can significantly improve your application’s performance and user experience.
- Improved Turn-Taking: In voice agent or conversational AI applications, the VAD provides a reliable signal for detecting the end of a user’s turn.
- Cost Reduction: By processing audio only when speech is present, you can reduce computational load and downstream processing costs (e.g., for STT).
- Enhanced STT Accuracy: Providing a clean, speech-only audio stream to your Speech-to-Text engine can reduce errors like insertions and substitutions caused by background noise or interfering speech.
How it Works
The VAD is not a standalone component; it is created from an existing model instance. It analyzes the enhanced audio from the model to make its prediction, benefiting from the noise and reverb removal already performed. The basic workflow is:
- Create a processor with a model
- Create a VAD instance from the processor
- Process your audio
- Query the VAD to see if speech was detected in the last processed frame
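The four steps above can be sketched as follows. Note that the class and method names here (`QuailProcessor`, `Vad`, `speech_detected`, and the simple energy gate) are illustrative placeholders, not the actual SDK API:

```python
class QuailProcessor:
    """Placeholder for an enhancement-model processor (hypothetical name)."""
    def __init__(self, model_name: str):
        self.model_name = model_name

    def process(self, frame: list[float]) -> list[float]:
        # A real processor would return the enhanced frame here.
        return frame


class Vad:
    """Placeholder for a VAD created from an existing processor."""
    def __init__(self, processor: QuailProcessor, sensitivity: float = 6.0):
        self.processor = processor
        self.sensitivity = sensitivity
        self._speech_in_last_frame = False

    def process(self, frame: list[float]) -> list[float]:
        # The VAD analyzes the *enhanced* audio, so it benefits from the
        # noise and reverb removal the model has already performed.
        enhanced = self.processor.process(frame)
        # Toy decision rule: mean absolute amplitude vs. a scaled threshold.
        energy = sum(abs(s) for s in enhanced) / max(len(enhanced), 1)
        self._speech_in_last_frame = energy > self.sensitivity / 100.0
        return enhanced

    def speech_detected(self) -> bool:
        # Step 4: query whether speech was detected in the last frame.
        return self._speech_in_last_frame


# 1. Create a processor with a model, 2. create the VAD from it,
# 3. process audio, 4. query the detection flag for the last frame.
processor = QuailProcessor("quail-voice-focus")
vad = Vad(processor)
vad.process([0.2] * 480)      # one 10 ms frame at 48 kHz (synthetic, loud)
print(vad.speech_detected())  # True for this synthetic frame
```

The key design point is step 2: the VAD is always derived from a processor rather than instantiated on its own.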
VAD Parameters
You can fine-tune the VAD’s behavior using the following parameters.
Sensitivity
Controls the sensitivity (energy threshold) of the VAD. A speech signal’s energy must exceed this threshold to be classified as speech.
- Formula: Energy Threshold =
- Range: 0.0 to 15.0
- Default: 6.0
Speech Hold Duration
Controls how long the VAD continues to report speech after the audio signal no longer contains speech. This stabilizes speech-detected → not-detected transitions.
The VAD reports speech if at least 50% of frames in the last speech_hold_duration × 2 seconds contained speech. If speech reappears during the hold period, those frames reset the calculation until the 50% threshold is no longer met.
This value is rounded to the nearest model processing window (e.g. 10 ms), so the returned value may differ from what was set.
- Unit: Seconds
- Range: 0.0 to 1.0
- Default: 0.03
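The 50% rule described above can be modeled with a short sketch. This is an illustrative reimplementation of the documented behavior, not the SDK's internal code; the 10 ms window size is the example value from the docs:

```python
WINDOW_S = 0.01  # assumed model processing window: 10 ms per frame


def reports_speech(frame_flags: list[bool], speech_hold_duration: float = 0.03) -> bool:
    """Model of the documented hold rule: report speech if at least 50% of
    frames in the last speech_hold_duration * 2 seconds contained speech.
    frame_flags holds per-frame raw speech decisions, newest last."""
    lookback = max(1, round(speech_hold_duration * 2 / WINDOW_S))
    recent = frame_flags[-lookback:]
    return sum(recent) / len(recent) >= 0.5


# With the default 0.03 s hold, the lookback covers 0.06 s = 6 frames.
history = [True, True, True, False, False, False]  # speech ended 30 ms ago
print(reports_speech(history))  # True: 3 of 6 frames (exactly 50%) had speech
history.append(False)
print(reports_speech(history))  # False: only 2 of the last 6 frames had speech
```

This also shows why speech reappearing during the hold period extends it: new speech frames entering the window push the ratio back above 50%.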
Minimum Speech Duration
Controls how long speech must be continuously present before the VAD classifies it as speech. This stabilizes not-detected → detected transitions.
This value is rounded to the nearest model processing window (e.g. 10 ms), so the returned value may differ from what was set.
- Unit: Seconds
- Range: 0.0 to 1.0
- Default: 0.0
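The rounding behavior mentioned for both duration parameters can be illustrated as follows. The 10 ms window is the example value from the docs; the function name is hypothetical:

```python
WINDOW_S = 0.01  # assumed model processing window: 10 ms


def round_to_window(duration_s: float) -> float:
    """Snap a duration to the nearest whole processing window."""
    return round(duration_s / WINDOW_S) * WINDOW_S


# The value you set is snapped, so reading it back may differ:
print(round_to_window(0.034))  # 0.03
print(round_to_window(0.047))  # 0.05
```

This is why the returned value may differ from what was set: durations are only representable in whole processing windows.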
Best Practices
- Tune Sensitivity: The optimal sensitivity may vary depending on your audio source and environment. Start with the default and adjust as needed.
- Use with Quail Voice Focus: For applications with multiple speakers, using the VAD with the Voice Focus model provides the best results for isolating the primary speaker.