The ai-coustics SDK includes an efficient Voice Activity Detector (VAD) that is optimized for performance and accuracy. This guide explains how to use the VAD in your real-time applications.

What is VAD?

A Voice Activity Detector is a system that identifies the presence of human speech in an audio stream. The ai-coustics VAD is tightly integrated with our Quail enhancement models, allowing it to make highly accurate predictions even in noisy environments or with interfering speakers. While our standard Quail models handle most environments effectively, the Quail Voice Focus model is the best choice for pure speaker isolation. It specifically targets the primary speaker, preventing background chatter from triggering false positives.

Use Cases

Integrating the VAD into your pipeline can significantly improve your application’s performance and user experience.
  • Improved Turn-Taking: In voice agent or conversational AI applications, the VAD provides a reliable signal for detecting the end of a user’s turn.
  • Cost Reduction: By processing audio only when speech is present, you can reduce computational load and downstream processing costs (e.g., for STT).
  • Enhanced STT Accuracy: Providing a clean, speech-only audio stream to your Speech-to-Text engine can reduce errors like insertions and substitutions caused by background noise or interfering speech.

How it Works

The VAD is not a standalone component; it is created from an existing model instance. It analyzes the enhanced audio from the model to make its prediction, benefiting from the noise and reverb removal already performed. The basic workflow is:
  1. Create a processor with a model
  2. Create a VAD instance from the processor
  3. Process your audio
  4. Query the VAD to see if speech was detected in the last processed frame
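The workflow above can be illustrated with a small, self-contained stand-in. Note that this is not the ai-coustics API: `ToyVad`, its method names, and the plain energy-based decision are hypothetical stand-ins for the processor- and model-driven prediction described in this guide.

```python
class ToyVad:
    """Toy energy-based VAD illustrating the create -> process -> query workflow.

    The real SDK derives its prediction from the enhancement model's output;
    here a raw energy threshold stands in for that prediction.
    """

    def __init__(self, sensitivity: float = 6.0):
        # Energy threshold follows the documented formula: 10^(-sensitivity)
        self.threshold = 10.0 ** (-sensitivity)
        self.speech_detected = False

    def process(self, frame):
        # Step 3: process one frame of audio samples (floats in [-1, 1])
        energy = sum(s * s for s in frame) / len(frame)
        self.speech_detected = energy > self.threshold

    def is_speech_detected(self) -> bool:
        # Step 4: query the result for the last processed frame
        return self.speech_detected


# Steps 1-2 (creating a processor and a VAD from it) are collapsed
# into the constructor in this toy version.
vad = ToyVad(sensitivity=6.0)
vad.process([0.1] * 160)   # loud frame: energy 0.01 exceeds 1e-6
print(vad.is_speech_detected())  # True
vad.process([0.0] * 160)   # silent frame
print(vad.is_speech_detected())  # False
```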

VAD Parameters

You can fine-tune the VAD’s behavior using the following parameters.
Sensitivity

Controls the sensitivity (energy threshold) of the VAD. The energy of a speech signal must exceed this threshold for the signal to be classified as speech.
  • Formula: Energy Threshold = 10^(-sensitivity)
  • Range: 0.0 to 15.0
  • Default: 6.0
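The mapping from sensitivity to energy threshold can be checked directly (plain Python, no SDK required):

```python
def energy_threshold(sensitivity: float) -> float:
    # Documented formula: threshold = 10^(-sensitivity)
    return 10.0 ** (-sensitivity)


# Higher sensitivity -> lower threshold -> quieter signals count as speech.
print(energy_threshold(0.0))   # 1.0 (least sensitive)
print(energy_threshold(6.0))   # ~1e-06 (the default)
print(energy_threshold(15.0))  # ~1e-15 (most sensitive)
```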
Speech Hold Duration

Controls how long the VAD continues to report speech after the audio signal no longer contains speech. This stabilizes speech-detected → not-detected transitions.

The VAD reports speech if at least 50% of the frames in the last speech_hold_duration × 2 seconds contained speech. If speech reappears during the hold period, those frames reset the calculation until the 50% threshold is no longer met.
This value is rounded to the nearest model processing window (e.g. 10 ms), so the returned value may differ from what was set.
  • Unit: Seconds
  • Range: 0.0 to 1.0
  • Default: 0.03
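The hold behavior can be modeled as a majority vote over a sliding window of recent frame decisions. This is a sketch of the documented rule, not the SDK's implementation; the 10 ms frame size and the window arithmetic are assumptions.

```python
from collections import deque

FRAME_SECONDS = 0.01  # assumed 10 ms processing window


def make_hold_filter(speech_hold_duration: float):
    # The vote window covers the last speech_hold_duration * 2 seconds.
    window_frames = max(1, round(speech_hold_duration * 2 / FRAME_SECONDS))
    history = deque(maxlen=window_frames)

    def update(raw_speech: bool) -> bool:
        history.append(raw_speech)
        # Report speech while at least 50% of recent frames contained speech.
        return sum(history) >= len(history) / 2

    return update


hold = make_hold_filter(speech_hold_duration=0.03)  # default: 6-frame window
out = [hold(x) for x in [True, True, True, False, False, False, False, False]]
print(out)  # speech is held for three extra frames after it stops
```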
Speech Onset Duration

Controls how long speech must be continuously present before the VAD classifies it as speech. This stabilizes not-detected → detected transitions.
This value is rounded to the nearest model processing window (e.g. 10 ms), so the returned value may differ from what was set.
  • Unit: Seconds
  • Range: 0.0 to 1.0
  • Default: 0.0
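The onset behavior can be sketched the same way: a counter that only reports speech once enough consecutive speech frames have accumulated. Again, this models the documented rule with an assumed 10 ms frame size; it is not the SDK's implementation.

```python
FRAME_SECONDS = 0.01  # assumed 10 ms processing window


def make_onset_filter(onset_duration: float):
    # Speech must be continuously present for onset_duration seconds
    # before the filter reports it. The 0.0 default means a single
    # speech frame is enough.
    required = max(1, round(onset_duration / FRAME_SECONDS))
    streak = 0

    def update(raw_speech: bool) -> bool:
        nonlocal streak
        streak = streak + 1 if raw_speech else 0
        return streak >= required

    return update


onset = make_onset_filter(onset_duration=0.02)  # require 2 consecutive frames
out = [onset(x) for x in [True, False, True, True, True]]
print(out)  # the isolated first frame is suppressed; the later run passes
```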

Best Practices

  • Tune Sensitivity: The optimal sensitivity may vary depending on your audio source and environment. Start with the default and adjust as needed.
  • Use with Quail Voice Focus: For applications with multiple speakers, using the VAD with the Voice Focus model provides the best results for isolating the primary speaker.