faster-whisper
transcription speed blowouts
Thanks again for this project. For context, I'm testing it by transcribing a live public radio stream, and I appreciate the speed and low memory use, since it's most useful for near-live transcription. The radio stream is roughly 60% voice on a studio mic, 30% phone voice, 5% voice over music, and 5% music. I have a simple Python script that continuously feeds 30s chunks from the live stream into faster-whisper. I'm using the base model on a cheap VPS with just 2GB RAM. I'm sure I could get better results with a higher-spec machine, but it's a proof of concept, and it would be useful to run this across a large number of different streams.
Most 30s chunks take 6-8s to transcribe, which is perfect, but roughly 1 in 10 blows out to 20-50s.
I haven't quite figured out what causes it. I wonder if it happens when a 30s chunk has a mix of music and talk, or a mix of different audio sources? In your experience, could you shed some light on the reason? Would a larger model stop the blowouts?
Hi,
Most likely these audio chunks trigger the "temperature fallback". In this case, the transcription is run multiple times with different parameters in an attempt to improve the final result. This explains why the transcription time can suddenly increase for specific segments.
You can disable this fallback by setting the argument temperature=0:
model.transcribe(..., temperature=0)
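For reference, here is a minimal sketch of how a chunk could be timed with the fallback disabled. The helper names transcribe_chunk and main, the file name chunk.wav, and the int8 compute type are assumptions for illustration and are not taken from the original script. Passing a shorter list such as [0.0, 0.4] instead of a single value should limit the number of retries rather than remove them entirely.

```python
import time


def transcribe_chunk(model, audio_path, temperature=0.0):
    """Transcribe one chunk and return (text, seconds_taken).

    `model` is expected to behave like faster_whisper.WhisperModel.
    A single temperature disables the fallback; a list of temperatures
    lets the decoder retry at each value in turn.
    """
    start = time.perf_counter()
    segments, _info = model.transcribe(audio_path, temperature=temperature)
    # `segments` is a lazy generator, so the actual decoding happens
    # while we iterate over it here.
    text = " ".join(segment.text.strip() for segment in segments)
    return text, time.perf_counter() - start


def main():
    # Imported here so the helper above can be used without the package.
    from faster_whisper import WhisperModel

    # Assumed setup: base model on CPU with int8 weights.
    model = WhisperModel("base", device="cpu", compute_type="int8")

    # "chunk.wav" is a placeholder for one 30s chunk of the stream.
    text, elapsed = transcribe_chunk(model, "chunk.wav", temperature=0.0)
    print(f"{text} ({elapsed:.2f}s)")
```

Calling main() runs the example end to end. Logging the elapsed time per chunk makes it easy to see which chunks still take unusually long.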
That was it, thank you! Transcription time is now at most 15s, which is manageable, and there are no blowouts with temperature=0. I'll try experimenting with that and best_of.
The downside to temperature=0 is, as you'd expect, lots of fun repeats like this:
was arrested and charged with s***ing s***ing s***ing s***ing s***ing s***ing s***ing s***ing s***ing s***ing s***ing s***ing s***ing s***ing s***ing s***ing s***ing s***ing s***ing s***ing s***ing s***ing s***ing s***ing s***ing s***ing s***ing s***ing s***ing s***ing s***ing s***ing s***ing s***ing s***ing s***ing s***ing s***ing s***ing s***ing s***ing s***ing s***ing s***ing s***ing s***ing s***ing s***ing s***ing s***ing s***ing s***ing s***ing s***ing s***ing s***ing s***ing s***ing s***ing s***ing s***ing s***ing s***ing s***ing s***ing s***ing s***ing s***ing s***ing s***ing s***ing s***ing s***ing s (15.32s)
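One possible workaround, sketched here under assumptions rather than taken from this thread, is to catch such loops after the fact instead of relying on the fallback. Whisper's fallback uses a zlib compression ratio check (default threshold 2.4) to spot repetitive output, and the same measure can be computed on the final text. The helper names compression_ratio and collapse_repeats and the max_run parameter are made up for illustration.

```python
import zlib


def compression_ratio(text):
    """Ratio of raw to zlib-compressed size; repetitive text scores high."""
    data = text.encode("utf-8")
    return len(data) / len(zlib.compress(data))


def collapse_repeats(text, max_run=2):
    """Keep at most `max_run` consecutive copies of the same word."""
    kept = []
    run = 0
    for word in text.split():
        # Count how many times in a row this word has appeared.
        run = run + 1 if kept and word == kept[-1] else 1
        if run <= max_run:
            kept.append(word)
    return " ".join(kept)


def clean_transcript(text, threshold=2.4):
    """Collapse repeated words only when the text looks like a loop."""
    if compression_ratio(text) > threshold:
        return collapse_repeats(text)
    return text
```

This only handles single-word loops like the one above. Repeated phrases would need a different check, and collapsing can remove legitimate repetition, so the threshold and max_run are worth tuning per stream.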