Overview
Speech Enhancement for Voice AI
Best for: Improving Voice AI Agents and Speech-to-Text (STT) accuracy. Includes Quail Voice Focus.
- Model Family: Quail
- Platform: SDK
- Real-time
Perceptual Speech Enhancement
Best for: Removing background noise and reverb on real-time communication use-cases.
- Model Family: Rook
- Platform: SDK
- Real-time
Quail
SDK, Real-time, Human-to-Machine
The Quail models are purpose-built for Voice AI Agents and human-to-machine interactions. Unlike standard noise suppression, Quail is tuned to improve the performance of downstream Speech-to-Text (STT) engines.
Quail Voice Focus is optimized for near-field voice interactions. It prioritizes speech that sounds close to the microphone and suppresses speech that sounds distant, along with background noise. This makes it ideal for single-user, close-talk use cases (e.g., headsets or handheld devices).
Quail, in contrast, is designed for far-field and multi-speaker environments. It does not suppress distant-sounding speech, making it better suited for speakerphone setups, meeting rooms, or situations with multiple participants spread across a space.
Take a look at our ASR optimization guide.
Quail Voice Focus 2.0 L (16 kHz)
- ID: quail-vf-2.0-l-16khz
- File size: 35 MB
- Window length: 10 ms
- Optimal sample rate: 16 kHz
- Optimal num frames: 160
- Minimal algorithmic delay: 30 ms
Quail L (16 kHz)
- ID: quail-l-16khz
- File size: 35 MB
- Window length: 10 ms
- Optimal sample rate: 16 kHz
- Optimal num frames: 160
- Minimal algorithmic delay: 30 ms
Quail L (8 kHz)
- ID: quail-l-8khz
- File size: 33.4 MB
- Window length: 10 ms
- Native sample rate: 8 kHz
- Native num frames: 80
- Minimal algorithmic delay: 30 ms
Quail S (16 kHz)
- ID: quail-s-16khz
- File size: 8.88 MB
- Window length: 10 ms
- Native sample rate: 16 kHz
- Native num frames: 160
- Minimal algorithmic delay: 30 ms
Quail S (8 kHz)
- ID: quail-s-8khz
- File size: 8.43 MB
- Window length: 10 ms
- Native sample rate: 8 kHz
- Native num frames: 80
- Minimal algorithmic delay: 30 ms
Rook
SDK, Real-time, Human-to-Human
The Rook models are specifically optimized for human-to-human interaction in real-time constrained systems (e.g., voice calls). They reduce background noise and reverberation while preserving speech naturalness and intelligibility for human perception.
Rook L (48 kHz)
- ID: rook-l-48khz
- File size: 35.1 MB
- Window length: 10 ms
- Native sample rate: 48 kHz
- Native num frames: 480
- Minimal algorithmic delay: 30 ms
Rook L (16 kHz)
- ID: rook-l-16khz
- File size: 35 MB
- Window length: 10 ms
- Native sample rate: 16 kHz
- Native num frames: 160
- Minimal algorithmic delay: 30 ms
Rook L (8 kHz)
- ID: rook-l-8khz
- File size: 33.4 MB
- Window length: 10 ms
- Native sample rate: 8 kHz
- Native num frames: 80
- Minimal algorithmic delay: 30 ms
Rook S (48 kHz)
- ID: rook-s-48khz
- File size: 8.96 MB
- Window length: 10 ms
- Native sample rate: 48 kHz
- Native num frames: 480
- Minimal algorithmic delay: 30 ms
Rook S (16 kHz)
- ID: rook-s-16khz
- File size: 8.88 MB
- Window length: 10 ms
- Native sample rate: 16 kHz
- Native num frames: 160
- Minimal algorithmic delay: 30 ms
Rook S (8 kHz)
- ID: rook-s-8khz
- File size: 8.43 MB
- Window length: 10 ms
- Native sample rate: 8 kHz
- Native num frames: 80
- Minimal algorithmic delay: 30 ms
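The frame counts listed for each model follow from the 10 ms window length: the number of frames per window is the sample rate multiplied by the window duration. The sketch below only reproduces that arithmetic for the three native sample rates; the function names are illustrative and are not part of the SDK.

```python
def frames_per_window(sample_rate_hz: int, window_ms: int = 10) -> int:
    """Number of audio frames (samples per channel) in one processing window."""
    return sample_rate_hz * window_ms // 1000


def delay_in_frames(sample_rate_hz: int, delay_ms: int = 30) -> int:
    """Express a delay given in milliseconds as a frame count at this rate."""
    return sample_rate_hz * delay_ms // 1000


# Native sample rates taken from the model listings above.
for rate in (8_000, 16_000, 48_000):
    print(f"{rate} Hz: {frames_per_window(rate)} frames per 10 ms window, "
          f"{delay_in_frames(rate)} frames of 30 ms delay")
```

For example, a 16 kHz model works on 160 frames per window, and its 30 ms minimal algorithmic delay corresponds to 480 frames at that rate.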
Using models with non-native sample rates
Our models are trained for specific sample rates (8 kHz, 16 kHz, and 48 kHz). However, the ai-coustics SDK lets you use any model with audio at a non-native sample rate by resampling the input internally; the model itself always processes audio at its native sample rate. You can choose any sample rate between 8 kHz and 192 kHz when calling the processor’s initialize function in the SDK, regardless of the model being used.
Higher-than-native sample rate (e.g. 48 kHz audio with a 16 kHz model):
In this case, the SDK removes the frequency content above the model’s native Nyquist frequency (half the model’s native sample rate) before feeding the audio to the model.
The model output is then upsampled back to the original sample rate. The mixback (enhancement_level) is applied at the higher sample rate, so the output contains the full frequency range of the original audio,
but the model’s enhancement is only applied to frequencies below the model’s native Nyquist frequency.
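To make the band split concrete, here is a minimal sketch that computes which part of the spectrum the model can enhance for a given input and model sample rate. It is plain arithmetic based on the Nyquist limit described above, not a call into the SDK; `enhancement_bands` and its return keys are hypothetical names chosen for illustration.

```python
from typing import Optional, Tuple, Dict

Band = Tuple[float, float]


def enhancement_bands(input_rate_hz: int, model_rate_hz: int) -> Dict[str, Optional[Band]]:
    """Split the input spectrum into the band the model sees and the rest."""
    input_nyquist = input_rate_hz / 2
    model_nyquist = model_rate_hz / 2
    # The model can only act on content below both Nyquist frequencies.
    enhanced_upper = min(input_nyquist, model_nyquist)
    # Content above the model's Nyquist frequency never reaches the model.
    beyond_model: Optional[Band] = None
    if input_nyquist > model_nyquist:
        beyond_model = (model_nyquist, input_nyquist)
    return {"enhanced_hz": (0.0, enhanced_upper), "beyond_model_hz": beyond_model}


# 48 kHz audio with a 16 kHz model.
print(enhancement_bands(48_000, 16_000))
```

With 48 kHz audio and a 16 kHz model, the enhanced band is 0 to 8 kHz, and the 8 to 24 kHz range lies outside what the model processes.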
Lower-than-native sample rate (e.g. 8 kHz audio with a 16 kHz model):
When the input audio sample rate is lower than the model’s native sample rate, compute resources are effectively “wasted” on higher frequencies that contain no signal (the model is just processing zeros).
Therefore, if there is a model available matching your audio’s sample rate, we recommend using that model to avoid unnecessary compute and ensure optimal performance.
In both cases, delay and CPU consumption are not affected by the input sample rate.
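The recommendation to prefer a model that matches your audio’s sample rate can be sketched as a small lookup. The model IDs come from the listings on this page; the `pick_model` helper is hypothetical, and its fallback rule (choose the highest native rate below the input rate when no exact match exists) is an assumption made for this example, not documented guidance.

```python
# Model IDs copied from the listings on this page (L variants only).
QUAIL_L = {8_000: "quail-l-8khz", 16_000: "quail-l-16khz"}
ROOK_L = {8_000: "rook-l-8khz", 16_000: "rook-l-16khz", 48_000: "rook-l-48khz"}


def pick_model(input_rate_hz: int, models: dict) -> str:
    """Return a model ID for the given input sample rate."""
    # The SDK accepts input sample rates between 8 kHz and 192 kHz.
    if not 8_000 <= input_rate_hz <= 192_000:
        raise ValueError("sample rate must be between 8 kHz and 192 kHz")
    # Preferred case: a model whose native rate matches the audio.
    if input_rate_hz in models:
        return models[input_rate_hz]
    # Assumed fallback: highest native rate below the input rate, so that
    # no compute is spent on frequencies the input cannot contain.
    lower = [rate for rate in models if rate < input_rate_hz]
    return models[max(lower)] if lower else models[min(models)]


print(pick_model(16_000, ROOK_L))  # exact match
print(pick_model(48_000, QUAIL_L))  # no 48 kHz Quail model listed
```

Depending on your use case, you may prefer a different fallback, for example a higher-rate model that enhances more of the spectrum at the cost of some unused compute.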
Learn more about performance here.