Recognition: Whisper model may get stuck in a token repeat loop when it encounters silence or non-speech segments
This is a common problem with Whisper: when it encounters silence or non-speech segments, it may hallucinate and start to repeat a token pattern, like:
Thanks for watching! Thanks for watching! Thanks for watching! ...
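Until the loop itself can be prevented, it can at least be noticed after the fact by checking whether the transcript ends in the same phrase repeated several times. The following is a minimal sketch of that check. `findTrailingRepeat` and its parameters are hypothetical names chosen for illustration; they are not part of Echogarden or Whisper.

```typescript
// Hypothetical helper: returns the phrase that is repeated at the end of
// the transcript at least `minRepeats` times in a row, or undefined if
// no such phrase is found. Comparison is exact, word by word.
function findTrailingRepeat(
    text: string,
    minRepeats = 3,      // assumed threshold for calling it a loop
    maxPhraseWords = 8   // longest phrase length (in words) to try
): string | undefined {
    const words = text.trim().split(/\s+/).filter(w => w.length > 0)

    for (let phraseLen = 1; phraseLen <= maxPhraseWords; phraseLen++) {
        // Not enough words left to hold `minRepeats` copies of the phrase
        if (words.length < phraseLen * minRepeats) break

        const phrase = words.slice(words.length - phraseLen)
        let repeats = 1

        // Walk backwards, one phrase-sized window at a time
        while (true) {
            const start = words.length - phraseLen * (repeats + 1)
            if (start < 0) break

            const candidate = words.slice(start, start + phraseLen)
            if (candidate.some((w, i) => w !== phrase[i])) break

            repeats++
        }

        if (repeats >= minRepeats) return phrase.join(' ')
    }

    return undefined
}
```

A check like this only flags the symptom. It can also match legitimately repeated speech (a chorus, a chant), so the result is better treated as a hint to inspect the audio than as proof of a hallucination.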
There are various proposed strategies to improve the situation, but none of them has been shown to solve it completely in all cases, or to be always practical to apply.
Using VAD (voice activity detection) to cut out the non-speech parts is a possibility, but in practice, no reasonably fast VAD engine (including WebRTC and Silero, which Echogarden already supports) is accurate enough to avoid significant false positives, so I don't think that's the best path for now.
Meanwhile, if you encounter this problem, try to ensure the input doesn't have significant segments of silence or music, by pre-trimming or slicing out those parts.
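As a rough illustration of pre-trimming, the sketch below finds long low-energy stretches in mono floating-point samples and slices them out. It is a simple RMS energy gate, not a VAD, so it only catches near-silence and will not detect music or noise. The function names, the threshold and the minimum duration are all assumptions made for this example, not Echogarden options, and the values would need tuning for each recording.

```typescript
// Sample index range, `end` is exclusive
interface SampleRange { start: number; end: number }

// Hypothetical helper: finds stretches where frame RMS stays below
// `rmsThreshold` for at least `minSilenceSeconds`.
function findSilentRanges(
    samples: Float32Array,
    sampleRate: number,
    rmsThreshold = 0.01,      // assumed amplitude threshold
    minSilenceSeconds = 2.0,  // shorter pauses are left untouched
    frameSeconds = 0.02       // analysis frame length
): SampleRange[] {
    const frameLength = Math.max(1, Math.floor(sampleRate * frameSeconds))
    const minSilenceLength = Math.floor(sampleRate * minSilenceSeconds)
    const ranges: SampleRange[] = []
    let silenceStart = -1

    for (let offset = 0; offset < samples.length; offset += frameLength) {
        const end = Math.min(offset + frameLength, samples.length)

        let sumSquares = 0
        for (let i = offset; i < end; i++) {
            sumSquares += samples[i] * samples[i]
        }
        const isSilent = Math.sqrt(sumSquares / (end - offset)) < rmsThreshold

        if (isSilent && silenceStart < 0) {
            silenceStart = offset
        } else if (!isSilent && silenceStart >= 0) {
            if (offset - silenceStart >= minSilenceLength) {
                ranges.push({ start: silenceStart, end: offset })
            }
            silenceStart = -1
        }
    }

    // Handle silence that runs to the end of the input
    if (silenceStart >= 0 && samples.length - silenceStart >= minSilenceLength) {
        ranges.push({ start: silenceStart, end: samples.length })
    }

    return ranges
}

// Returns a copy of `samples` with the given ranges removed.
// Assumes the ranges are sorted and non-overlapping, as produced above.
function removeRanges(samples: Float32Array, ranges: SampleRange[]): Float32Array {
    const removedLength = ranges.reduce((sum, r) => sum + (r.end - r.start), 0)
    const result = new Float32Array(samples.length - removedLength)

    let readPos = 0
    let writePos = 0
    for (const range of ranges) {
        result.set(samples.subarray(readPos, range.start), writePos)
        writePos += range.start - readPos
        readPos = range.end
    }
    result.set(samples.subarray(readPos), writePos)

    return result
}
```

Note that removing audio shifts every later timestamp, so if timing matters, the removed ranges need to be kept and added back to the recognized timestamps afterwards.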