speech_recognition
Saved audio recorded with SR plays choppy and too fast
Steps to reproduce
- Record audio from a USB audio interface (Focusrite Scarlett) with a Microphone() instance
- Save to file with Python's wave module
Here's some example code that shows what I do (pieced together from the actual source):
import speech_recognition as sr
import wave
mic_index = 7 # focusrite scarlett input
recognizer = sr.Recognizer()
mic = sr.Microphone(device_index=mic_index)
print('Recording...')
with mic as source:
    recognizer.adjust_for_ambient_noise(source, duration=0.2)
    audio = recognizer.listen(source, timeout=1, phrase_time_limit=5)
wave_file = wave.open('audiotest.wav', 'wb')
wave_file.setnchannels(1)
wave_file.setsampwidth(2)
wave_file.setframerate(16000)
wave_file.writeframes(audio.get_wav_data(convert_rate=16000))
wave_file.close()
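A side note on the save step above: as far as the SpeechRecognition documentation describes it, `AudioData.get_wav_data()` returns the bytes of a complete WAV file (header included), while `AudioData.get_raw_data()` returns only the PCM frames. The sketch below shows two ways of writing a file under that assumption. It uses synthetic silence in place of a real recording so that it runs without a microphone; `save_wav_bytes`, `save_raw_frames` and `fake_wav_bytes` are hypothetical helper names, not part of the library.

```python
import io
import os
import tempfile
import wave

RATE = 16000   # target sample rate in Hz
WIDTH = 2      # bytes per sample (16-bit PCM)

def save_wav_bytes(path, wav_bytes):
    """Write bytes that already form a complete WAV file (e.g. audio.get_wav_data())."""
    with open(path, "wb") as f:
        f.write(wav_bytes)

def save_raw_frames(path, frames, rate=RATE, width=WIDTH):
    """Wrap header-less mono PCM frames (e.g. audio.get_raw_data()) in a WAV container."""
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(width)
        wf.setframerate(rate)
        wf.writeframes(frames)

def fake_wav_bytes(frames, rate=RATE, width=WIDTH):
    """Stand-in for audio.get_wav_data(): build a mono WAV file in memory."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(width)
        wf.setframerate(rate)
        wf.writeframes(frames)
    return buf.getvalue()

# One second of 16-bit silence stands in for a real recording.
frames = b"\x00\x00" * RATE

out_dir = tempfile.mkdtemp()
path_a = os.path.join(out_dir, "from_wav_bytes.wav")
path_b = os.path.join(out_dir, "from_raw_frames.wav")

save_wav_bytes(path_a, fake_wav_bytes(frames))
save_raw_frames(path_b, frames)

# Both routes should yield the same number of frames.
with wave.open(path_a, "rb") as a, wave.open(path_b, "rb") as b:
    same = a.getnframes() == b.getnframes() == RATE
print("identical frame counts:", same)
```

With a real recording, the calls would be `save_wav_bytes(path, audio.get_wav_data(convert_rate=16000))` or `save_raw_frames(path, audio.get_raw_data(convert_rate=16000, convert_width=2))`. Passing the output of `get_wav_data()` to `writeframes()` would likely place a second header inside the sample data, which might be audible as a short click at the start, though it would not by itself explain the wrong tempo.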
Expected behaviour
The written wave file should sound like the original audio source: clean and at the correct tempo.
Actual behaviour
The written wave file sounds somewhat choppy and way too fast. audiotest.wav.zip
Recording audio from the device with arecord -D plughw:1,0 -f cd -d 5 alsatest.wav produces a clean result.
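One way to narrow this down might be to compare the headers and durations of the two files. The sketch below reads those fields with the standard `wave` module. `wav_info` is a hypothetical helper name, and the demo file is synthetic silence written in the format that `arecord -f cd` produces (16-bit, 44100 Hz, stereo), not one of the recordings from this report.

```python
import os
import tempfile
import wave

def wav_info(path):
    """Return the header fields and duration of a WAV file."""
    with wave.open(path, "rb") as wf:
        frames = wf.getnframes()
        rate = wf.getframerate()
        return {
            "channels": wf.getnchannels(),
            "sample_width_bytes": wf.getsampwidth(),
            "rate_hz": rate,
            "frames": frames,
            "duration_s": frames / rate,
        }

# Synthetic stand-in for alsatest.wav: one second of silence,
# 2 channels x 2 bytes per sample x 44100 frames.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.wav")
with wave.open(demo_path, "wb") as wf:
    wf.setnchannels(2)
    wf.setsampwidth(2)
    wf.setframerate(44100)
    wf.writeframes(b"\x00" * (44100 * 2 * 2))

info = wav_info(demo_path)
print(info)
```

Calling `wav_info('audiotest.wav')` and `wav_info('alsatest.wav')` would show whether the header rates match what was intended. If audiotest.wav turns out noticeably shorter than the phrase that was actually spoken, that would suggest samples were lost during capture rather than a wrong rate in the header, although this reading is an assumption and not something the report confirms.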
System information
My system is Linux Mint 20.3 Cinnamon.
My Python version is 3.8.10.
My Pip version is 20.0.2.
My SpeechRecognition library version is 3.9.0.
My PyAudio library version is 0.2.13.
My microphones are:
HDA NVidia: HDMI 0 (hw:0,3)
HDA NVidia: HDMI 1 (hw:0,7)
HDA NVidia: HDMI 2 (hw:0,8)
HDA NVidia: HDMI 3 (hw:0,9)
HDA NVidia: HDMI 4 (hw:0,10)
HDA NVidia: HDMI 5 (hw:0,11)
HDA NVidia: HDMI 6 (hw:0,12)
Scarlett 2i2 USB: Audio (hw:1,0)
HD-Audio Generic: ALC1220 Analog (hw:2,0)
HD-Audio Generic: ALC1220 Digital (hw:2,1)
HD-Audio Generic: ALC1220 Alt Analog (hw:2,2)
C922 Pro Stream Webcam: USB Audio (hw:3,0)
hdmi
pulse
default
My working microphones are:
{7: 'Scarlett 2i2 USB: Audio (hw:1,0)',
 11: 'C922 Pro Stream Webcam: USB Audio (hw:3,0)',
 13: 'pulse',
 14: 'default'}
Hi @antimatter84,
I had the exact same experience, and what helped me greatly was playing around with the chunk_size parameter of sr.Microphone. In my case, setting it to 512 instead of the default 1024 drastically improved the quality of the recorded audio. That also made recognition (with VOSK) much more reliable.
Give it a shot and let me know how it goes :)
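For reference, a minimal sketch of what that change could look like when applied to the code from the report. `chunk_size` is a constructor argument of `sr.Microphone`. The helper names `buffer_ms` and `record` are hypothetical, and the 44100 Hz rate used for the buffer arithmetic is an assumption, since the thread does not state the device's actual sample rate.

```python
def buffer_ms(chunk_size, sample_rate):
    """Length of one capture buffer in milliseconds."""
    return 1000.0 * chunk_size / sample_rate

def record(device_index, chunk_size=512, path="audiotest.wav"):
    """Record one phrase with a custom chunk_size and save it as a WAV file."""
    # Imported lazily so the arithmetic above works without the library;
    # this function needs SpeechRecognition, PyAudio and a working input device.
    import speech_recognition as sr

    recognizer = sr.Recognizer()
    mic = sr.Microphone(device_index=device_index, chunk_size=chunk_size)
    with mic as source:
        recognizer.adjust_for_ambient_noise(source, duration=0.2)
        audio = recognizer.listen(source, timeout=1, phrase_time_limit=5)
    # get_wav_data() returns a complete WAV file, so it is written as-is.
    with open(path, "wb") as f:
        f.write(audio.get_wav_data(convert_rate=16000))
    return path

# How long each buffer lasts at an assumed 44100 Hz device rate.
for size in (1024, 512):
    print(size, "frames ->", round(buffer_ms(size, 44100), 1), "ms")

# record(7)  # uncomment on a machine with the device attached; index 7 is taken from the report
```

Halving the chunk size halves the amount of audio handled per read. Whether that helps on a given system probably depends on the device and driver, so it may be worth trying a few values.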