Transcriptions with repeated sentences
Sometimes, my transcriptions return with repeated sentences that are not in the original audio file or nonsense like three dots where there should be speech. Any idea what may be the problem?
I'm using the large-v2 model, fp16, beam 5, with the VAD filter on, temperature 0.
Sample: https://www.youtube.com/watch?v=pzS367uY5-k
Transcription: 01:33:57 quer mais parar, você se apaixona e você convida amigos, é um momento ímpar com pessoas que você gosta descontração você conversa troca uma ideia, atira compete, brinca zoam o outro, então assim o tiro esportivo hoje em dia faz parte da minha vida ... 01:35:38 ... ... ... ... ... ... 01:36:09 ... ... ... ... ... ... ... 01:36:26 ... ... ... 01:36:49 ... ... ... ... ... ... ... ... ... ... ... 01:37:10 ... ... ... ... ... ... ... ... ... ... ... ... ... 01:37:33 ... ... ... ... ... ... ... ... ...
I had the same experience with even unlikely hallucinations (Dutch language, same settings).
This is a recurring issue in both the whisper and faster_whisper issue trackers. I recommend you read whisper #679 in full so you can understand what causes the repetitions and pick up some ideas from it.
I also recommend you try changing the tokens that are suppressed in the transcribe options. The default value is -1, which refers to the config.json file listing the tokens to be suppressed; these are mostly symbols described in the tokenizer.py file. When those are suppressed and the model chooses to output something while the audio is silent, it tends to get stuck in a failure loop outputting complete gibberish; refer to whisper-timestamped #456. Warning: with suppression disabled you will probably get some of those special symbols and some background-noise tags like [silence] or even [dog barking], but you can eliminate those with some postprocessing.
You can also set condition_on_previous_text = False. As mentioned in the documentation, this makes the model less prone to getting stuck in a failure loop.
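As a minimal sketch of how those two suggestions could be combined in one place (the parameter names follow faster_whisper's transcribe signature; verify them against the version you have installed):

```python
# Options aimed at reducing repetition loops; the dict itself is just a
# convenience for passing them to model.transcribe(...).
anti_repeat_options = dict(
    beam_size=5,
    temperature=0,
    vad_filter=True,                   # drop long silent chunks before decoding
    condition_on_previous_text=False,  # don't feed the previous chunk as a prompt
    suppress_tokens=[],                # don't suppress the default symbol tokens
)

# Usage sketch (assumes a loaded faster_whisper.WhisperModel instance):
# segments, info = model.transcribe("audio.wav", **anti_repeat_options)
```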
Lastly, you can do some postprocessing on the dictionaries generated for the transcription and suppress the repetitions that might still happen. You can save the dictionaries in a list and iterate over them with this:
segments_list = []
for segment in segments:
    segments_list.append(segment._asdict())
@guilhermehge thanks for sharing, this repetition issue is really annoying.
From my understanding, the following settings can be used to solve this problem:
- set condition_on_previous_text to False
- set suppress_tokens to [], i.e. an empty list
Are there any other options you can suggest from a code perspective?
If possible, can you share the code to handle this?
I am trying to find an optimal way to handle this, so I want to try all possible options.
@iorilu No worries! Just sharing what I discovered reading a few topics and exploring it myself
Besides those, of course, keep the VAD filter that this model has activated.
What you can do from here, to my knowledge, is to postprocess the transcription.
You can try something like this:
First you save the dictionaries of the transcriptions in a list
segments_list = []
for segment in segments:
    segments_list.append(segment._asdict())
Then you can access that list and do something like:
text = ''
for i, dict_item in enumerate(segments_list):
    # skip repetitions that happen in sequential segments
    if i > 0 and dict_item['text'] == segments_list[i - 1]['text']:
        continue
    text += dict_item['text']  # mount the transcription the way you want it
This is not a perfect approach, because the person may legitimately say the same thing twice and you'd eliminate it, but it works well for when the repetitions do happen.
Another approach, for repetitions that happen within the same segment (or row): count each word in that segment and, if one word repeats more than, say, 5 or 6 times, "continue" past that line as well and eliminate it from the transcription. This is also not a perfect approach, but when the segment really is a repetition loop, it will remove it from the transcription.
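The word-counting idea above could be sketched like this (the function name, the threshold of 5, and the sample segments are illustrative assumptions, not part of any library):

```python
from collections import Counter

def is_repetition_loop(text, max_repeats=5):
    """Heuristic: flag a segment whose most frequent word appears
    more than `max_repeats` times (a likely failure loop)."""
    counts = Counter(text.lower().split())
    return bool(counts) and max(counts.values()) > max_repeats

# Illustrative segment dicts, shaped like segment._asdict() output:
segments_list = [
    {'text': 'normal sentence here'},
    {'text': 'the the the the the the the'},  # stuck segment
]

# Keep only segments that don't look like repetition loops:
clean = [s for s in segments_list if not is_repetition_loop(s['text'])]
```

Tune `max_repeats` per language: very common function words can repeat legitimately in long segments.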
Since the last update there are a few parameters to avoid repetitions. I'm not using them yet because I haven't had the time to test them, but they might be of help.
@guilhermehge Thanks a lot for the quick help, I will try your suggestion, that is helpful.
One other issue I want to check concerns the parameters used with VAD.
I just posted an issue, #477.
I am not quite sure what the best VAD options are, and I found that faster-whisper uses very different parameters from silero-vad. Do you have any experience with this one?
Hey @guilhermehge! Thanks for your reply, it was super helpful. I tried setting condition_on_previous_text to False, and it was enough to prevent hallucination on the audio clip I was testing.
Do you know if there are any disadvantages to leaving it False? I read it can make the transcription less consistent, but I imagine that's not necessarily a bad thing since it can avoid the repetition of transcription mistakes.
Could you try whisper-faster.exe from https://github.com/Purfview/whisper-standalone-win on the audio where you have the "stuck in a failure loop" issue?
I also tried setting condition_on_previous_text to False; it seems the transcription performance is worse.
@iorilu that I do not know, and I see that guillaumekln answered your topic. Silero-VAD at its default settings already helps lower the number of repetitions, since it eliminates what mostly causes them, that is, long chunks of silence, but they still happen in some files. In some cases the repetitions might still occur even with condition_on_previous_text = False, so that's why you should try all the solutions I presented together: suppress_tokens = [], condition_on_previous_text = False, and some postprocessing.
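On the VAD tuning question, a minimal sketch of how the filter's behavior could be adjusted (the key names follow faster_whisper's VadOptions; the chosen values are illustrative assumptions, so verify both against your installed version):

```python
# Tighter VAD settings: a higher threshold and a longer minimum silence
# make the filter more conservative about cutting audio.
vad_parameters = dict(
    threshold=0.5,                 # Silero speech-probability threshold
    min_silence_duration_ms=2000,  # only cut silences longer than 2 s
)

# Usage sketch (assumes a loaded faster_whisper.WhisperModel instance):
# segments, info = model.transcribe(
#     "audio.wav", vad_filter=True, vad_parameters=vad_parameters
# )
```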
@bfavero glad I could be of help. This parameter basically feeds a prompt, made from the last few seconds (I don't remember how many) of the previous chunk, to the decoder for the next chunk to be transcribed (see the encoder-decoder models section of the HF course for more details). When you turn the parameter off, that prompt is not fed to the decoder, making the transcriptions less consistent, but, as you said, that is not necessarily a bad thing, since the model is less prone to repetitions and, also, the Whisper model is really good even with condition_on_previous_text = False.