
WhisperTimeStampLogitsProcessor error while using Whisper pipelines. Was WhisperTimeStampLogitsProcessor used?

Open melihogutcen opened this issue 1 year ago • 19 comments

System Info

Hello,

When I tried this notebook, https://colab.research.google.com/drive/1rS1L4YSJqKUH_3YxIQHBI982zso23wor?usp=sharing#scrollTo=Ca4YYdtATxzo, I encountered this error: "There was an error while processing timestamps, we haven't found a timestamp as last token. Was WhisperTimeStampLogitsProcessor used?" I hit the error especially on audio longer than 30 seconds; for audio shorter than 30 seconds, timestamps are returned correctly. How can I fix it?

Specs: transformers==4.27.0.dev0

from transformers import pipeline

MODEL_NAME = "openai/whisper-large-v2"
pipe = pipeline(
    task="automatic-speech-recognition",
    model=MODEL_NAME,
    device="cuda:0",
    generate_kwargs={"language": "<|tr|>", "task": "transcribe"},
)

results = pipe(speech_file, return_timestamps=True, chunk_length_s=30, stride_length_s=[6, 0], batch_size=32,
               generate_kwargs={"language": "<|tr|>", "task": "transcribe"})

Who can help?

@ArthurZucker @sanchit-gandhi @Narsil

Information

  • [X] The official example scripts
  • [ ] My own modified scripts

Tasks

  • [X] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [ ] My own task or dataset (give details below)

Reproduction

from transformers import pipeline

MODEL_NAME = "openai/whisper-large-v2"
pipe = pipeline(
    task="automatic-speech-recognition",
    model=MODEL_NAME,
    device="cuda:0",
    generate_kwargs={"language": "<|tr|>", "task": "transcribe"},
)

results = pipe(speech_file, return_timestamps=True, chunk_length_s=30, stride_length_s=[6, 0], batch_size=32,
               generate_kwargs={"language": "<|tr|>", "task": "transcribe"})

Expected behavior

results = {
    'text': 'Some Turkish results.',
    'chunks': [
        {'text': ' Some Turkish results.', 'timestamp': (0.0, 4.4)},
        {'text': ' Some Turkish results.', 'timestamp': (4.4, 28.32)},
        {'text': ' Some Turkish results.', 'timestamp': (28.32, 45.6)},
    ],
}

melihogutcen avatar Mar 09 '23 10:03 melihogutcen

cc @Narsil as this might follow the latest update of the return_timestamps handling

ArthurZucker avatar Mar 09 '23 11:03 ArthurZucker

Do you have the faulty sample too? I cannot reproduce with a dummy file.

@ArthurZucker it does look like the last token is indeed not a timestamp, but could it possibly be linked to batching?

Narsil avatar Mar 09 '23 11:03 Narsil

I'm using this audio https://github.com/frankiedrake/demo/blob/master/whisper_test.wav to test with your script.

Narsil avatar Mar 09 '23 11:03 Narsil

You can use this full script for testing. I uploaded an English audio file to GitHub so you can try it too.

from six.moves.urllib.request import urlopen
import io
import numpy as np
import soundfile as sf
from transformers import pipeline

sound_link = "https://github.com/melihogutcen/sound_data/blob/main/accidents_resampled.wav?raw=true"
data, sr = sf.read(io.BytesIO(urlopen(sound_link).read()))

sound_arr_first_ch1 = np.asarray(data, dtype=np.float64)
audio_in_memory_ch1 = {"raw": sound_arr_first_ch1,
                       "sampling_rate": 16000}

MODEL_NAME = "openai/whisper-large-v2"

pipe = pipeline(
    task="automatic-speech-recognition",
    model=MODEL_NAME,
    device='cuda:0')

results_pipe_ch1 = pipe(audio_in_memory_ch1, return_timestamps=True, chunk_length_s=30,
                        stride_length_s=[6, 0], batch_size=32,
                        generate_kwargs = {"language":"<|en|>",
                                           "task": "transcribe"})
print(results_pipe_ch1["text"])
print(results_pipe_ch1)

Error as below.

  warnings.warn(
Traceback (most recent call last):
  File "/SpeechToText/whisper_trials.py", line 21, in <module>
    results_pipe_ch1 = pipe(audio_in_memory_ch1, return_timestamps=True, chunk_length_s=30,
  File "/opt/conda/lib/python3.10/site-packages/transformers/pipelines/automatic_speech_recognition.py", line 272, in __call__
    return super().__call__(inputs, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/transformers/pipelines/base.py", line 1101, in __call__
    return next(
  File "/opt/conda/lib/python3.10/site-packages/transformers/pipelines/pt_utils.py", line 125, in __next__
    processed = self.infer(item, **self.params)
  File "/opt/conda/lib/python3.10/site-packages/transformers/pipelines/automatic_speech_recognition.py", line 527, in postprocess
    text, optional = self.tokenizer._decode_asr(
  File "/opt/conda/lib/python3.10/site-packages/transformers/models/whisper/tokenization_whisper_fast.py", line 480, in _decode_asr
    return _decode_asr(
  File "/opt/conda/lib/python3.10/site-packages/transformers/models/whisper/tokenization_whisper.py", line 881, in _decode_asr
    raise ValueError(
ValueError: There was an error while processing timestamps, we haven't found a timestamp as last token. Was WhisperTimeStampLogitsProcessor used?

melihogutcen avatar Mar 09 '23 13:03 melihogutcen

Thanks, I have been able to reproduce, definitely linked to batching, as the thing works with batch_size=1.

Working on a fix.
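In the meantime, a minimal workaround consistent with the above is to disable batching: the same reproduction call, just with batch_size=1.

results = pipe(speech_file, return_timestamps=True, chunk_length_s=30,
               stride_length_s=[6, 0], batch_size=1,
               generate_kwargs={"language": "<|tr|>", "task": "transcribe"})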

Narsil avatar Mar 09 '23 15:03 Narsil

Ok, the issue is that the model uses 50256 for padding, or silence.

@ArthurZucker should we make this a special token? (This would mean it would be ignored in the state machine, which is OK since this token decodes to ''.)

The other solution would be to decode the previous_tokens before failing and check that the decoding is the nil string, but that seems like a workaround for the fact that token 50256 is special and means silence (or pad, I guess).
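For reference, a quick sketch to inspect that token at runtime (not the eventual fix, just a way to confirm what 50256 renders as):

from transformers import WhisperTokenizer

tok = WhisperTokenizer.from_pretrained("openai/whisper-large-v2")
print(repr(tok.decode([50256])))        # what does token 50256 decode to?
print(tok.eos_token, tok.eos_token_id)  # compare with the end-of-text token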

Narsil avatar Mar 09 '23 15:03 Narsil

This is the issue: https://huggingface.co/openai/whisper-large-v2/blob/main/generation_config.json#L124

@melihogutcen A fix is coming.

Narsil avatar Mar 09 '23 16:03 Narsil

Proposed changes:

  • https://huggingface.co/openai/whisper-base/discussions/12
  • https://huggingface.co/openai/whisper-large/discussions/29
  • https://huggingface.co/openai/whisper-medium/discussions/12
  • https://huggingface.co/openai/whisper-large-v2/discussions/30
  • https://huggingface.co/openai/whisper-small/discussions/19
  • https://huggingface.co/openai/whisper-tiny/discussions/9

Narsil avatar Mar 09 '23 16:03 Narsil

I fixed my problem by updating generation_config.json. Thanks!
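(For reference, a minimal way to pick up the refreshed config at runtime, assuming the Hub discussions above have been merged; the same pattern appears in a later comment:)

from transformers import GenerationConfig

pipe.model.generation_config = GenerationConfig.from_pretrained("openai/whisper-large-v2")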

melihogutcen avatar Mar 09 '23 21:03 melihogutcen

Oops! I have tried different sounds with the new config, and I still occasionally get this error on some sounds.

Traceback (most recent call last):
  File "/SpeechToText/whisper_trials.py", line 63, in <module>
    results_pipe_ch1 = pipe(resampled16k_data_ch1, return_timestamps=True, chunk_length_s=30,
  File "/opt/conda/lib/python3.10/site-packages/transformers/pipelines/automatic_speech_recognition.py", line 272, in __call__
    return super().__call__(inputs, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/transformers/pipelines/base.py", line 1101, in __call__
    return next(
  File "/opt/conda/lib/python3.10/site-packages/transformers/pipelines/pt_utils.py", line 125, in __next__
    processed = self.infer(item, **self.params)
  File "/opt/conda/lib/python3.10/site-packages/transformers/pipelines/automatic_speech_recognition.py", line 527, in postprocess
    text, optional = self.tokenizer._decode_asr(
  File "/opt/conda/lib/python3.10/site-packages/transformers/models/whisper/tokenization_whisper_fast.py", line 480, in _decode_asr
    return _decode_asr(
  File "/opt/conda/lib/python3.10/site-packages/transformers/models/whisper/tokenization_whisper.py", line 881, in _decode_asr
    raise ValueError(
ValueError: There was an error while processing timestamps, we haven't found a timestamp as last token. Was WhisperTimeStampLogitsProcessor used?

melihogutcen avatar Mar 10 '23 08:03 melihogutcen

Thanks, any chance we could see the files?

Or if you could print previous_tokens just before this error, that would be nice.

This error occurs when the state machine still has some dangling tokens and no timestamp token at the end, meaning we have no ending timestamp. This shouldn't happen given how WhisperTimestampLogitsProcessor is supposed to work. The previous error was that it would use a padding_token_id which wasn't a special_token, so it would be considered as text (which it isn't).

Narsil avatar Mar 10 '23 11:03 Narsil

Sorry, I couldn't share these files due to privacy, but I can send the previous_tokens. I added a print statement here: https://github.com/huggingface/transformers/blob/main/src/transformers/models/whisper/tokenization_whisper.py#:~:text=current_tokens%20%3D%20%5B%5D-,if%20previous_tokens%3A,-if%20return_timestamps%3A Is that correct?

Previous tokens: [[16729, 44999, 39196, 259, 13]]
There was an error while processing timestamps, we haven't found a timestamp as last token. Was WhisperTimeStampLogitsProcessor used?
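
(For what it's worth, a quick sketch to see what those dangling tokens decode to, assuming the large-v2 tokenizer:)

from transformers import WhisperTokenizer

tok = WhisperTokenizer.from_pretrained("openai/whisper-large-v2")
print(tok.decode([16729, 44999, 39196, 259, 13]))  # plain text, with no trailing timestamp token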

melihogutcen avatar Mar 10 '23 12:03 melihogutcen

I suspect the logits processor @Narsil, but it is strange that it didn't come up before

ArthurZucker avatar Mar 10 '23 13:03 ArthurZucker

@melihogutcen This is Turkish, on whisper-large-v2, correct? I'll try to run a batch on some dataset to try and trigger it elsewhere. Still using the same script as above, correct?

We need to reproduce to understand what's going on. It could be the WhisperLogitsProcessor, but also a bug somewhere else.

Narsil avatar Mar 10 '23 13:03 Narsil

Yes, it is Turkish and I used whisper-large-v2. I used the same script as above; I just set the language to "<|tr|>" and changed generation_config.json as you said.

melihogutcen avatar Mar 10 '23 14:03 melihogutcen

Could it be that this happens because all the batches being processed are silence? I have seen that the error occurs when the audio has a section that is mainly silence (I tested with 10 minutes of silence). With the original Whisper, what I get is hallucination and repeated words.
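
A minimal sketch to test that hypothesis, assuming a `pipe` built as in the scripts above: feed pure silence and see whether postprocessing raises.

import numpy as np

# 10 minutes of silence at 16 kHz
silence = {"raw": np.zeros(16000 * 600, dtype=np.float32), "sampling_rate": 16000}
out = pipe(silence, return_timestamps=True, chunk_length_s=30, stride_length_s=[6, 0], batch_size=32)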

rjac-ml avatar Mar 14 '23 22:03 rjac-ml

I'm getting this error as well, but only on a fine-tuned model. My program works fine with the Hugging Face openai/whisper-medium model, but when I swap in a Whisper medium model trained on the common_voice_11_0 dataset, any audio file I try to pass through gets this error.

2023-03-15 15:06:11 Error occurred while processing File1.wav. Exception: There was an error while processing timestamps, we haven't found a timestamp as last token. Was WhisperTimeStampLogitsProcessor used?
Traceback (most recent call last):
  File "/home/user/basictest.py", line 64, in transcribe_audio
    out = pipeline(audio)
  File "/home/user/anaconda3/lib/python3.9/site-packages/speechbox/diarize.py", line 120, in __call__
    asr_out = self.asr_pipeline(
  File "/home/user/anaconda3/lib/python3.9/site-packages/transformers/pipelines/automatic_speech_recognition.py", line 272, in __call__
    return super().__call__(inputs, **kwargs)
  File "/home/user/anaconda3/lib/python3.9/site-packages/transformers/pipelines/base.py", line 1101, in __call__
    return next(
  File "/home/user/anaconda3/lib/python3.9/site-packages/transformers/pipelines/pt_utils.py", line 125, in __next__
    processed = self.infer(item, **self.params)
  File "/home/user/anaconda3/lib/python3.9/site-packages/transformers/pipelines/automatic_speech_recognition.py", line 527, in postprocess
    text, optional = self.tokenizer._decode_asr(
  File "/home/user/anaconda3/lib/python3.9/site-packages/transformers/models/whisper/tokenization_whisper_fast.py", line 480, in _decode_asr
    return _decode_asr(
  File "/home/user/anaconda3/lib/python3.9/site-packages/transformers/models/whisper/tokenization_whisper.py", line 881, in _decode_asr
    raise ValueError(
ValueError: There was an error while processing timestamps, we haven't found a timestamp as last token. Was WhisperTimeStampLogitsProcessor used?

alextomana avatar Mar 15 '23 15:03 alextomana

@alextomana, did you try comparing the generation_config as mentioned above? As for the silence question, I'm not really sure.

ArthurZucker avatar Mar 16 '23 09:03 ArthurZucker

Seeing the same with a fine-tuned model.

import requests
import transformers
from transformers import GenerationConfig

pipe = transformers.pipeline(
    "automatic-speech-recognition",
    model="vasista22/whisper-hindi-large-v2",
    device="cuda:0",
)
pipe.model.generation_config = GenerationConfig.from_pretrained("openai/whisper-large-v2")

audio = requests.get(
    "https://storage.googleapis.com/dara-c1b52.appspot.com/daras_ai/media/e00ba954-c980-11ed-8700-8e93953183bb/6.ogg"
).content

forced_decoder_ids = pipe.tokenizer.get_decoder_prompt_ids(task="transcribe", language="hindi")
pipe(
    audio,
    return_timestamps=True,
    generate_kwargs=dict(
        forced_decoder_ids=forced_decoder_ids,
    ),
    chunk_length_s=30,
    stride_length_s=[6, 0],
    batch_size=32,
)
/root/.pyenv/versions/3.10.10/lib/python3.10/site-packages/transformers/generation/utils.py:1288: UserWarning: Using `max_length`'s default (448) to control the generation length. This behaviour is deprecated and will be removed from the config in v5 of Transformers -- we recommend using `max_new_tokens` to control the maximum length of the generation.
  warnings.warn(
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/root/.pyenv/versions/3.10.10/lib/python3.10/site-packages/transformers/pipelines/automatic_speech_recognition.py", line 272, in __call__
    return super().__call__(inputs, **kwargs)
  File "/root/.pyenv/versions/3.10.10/lib/python3.10/site-packages/transformers/pipelines/base.py", line 1101, in __call__
    return next(
  File "/root/.pyenv/versions/3.10.10/lib/python3.10/site-packages/transformers/pipelines/pt_utils.py", line 125, in __next__
    processed = self.infer(item, **self.params)
  File "/root/.pyenv/versions/3.10.10/lib/python3.10/site-packages/transformers/pipelines/automatic_speech_recognition.py", line 527, in postprocess
    text, optional = self.tokenizer._decode_asr(
  File "/root/.pyenv/versions/3.10.10/lib/python3.10/site-packages/transformers/models/whisper/tokenization_whisper_fast.py", line 480, in _decode_asr
    return _decode_asr(
  File "/root/.pyenv/versions/3.10.10/lib/python3.10/site-packages/transformers/models/whisper/tokenization_whisper.py", line 881, in _decode_asr
    raise ValueError(
ValueError: There was an error while processing timestamps, we haven't found a timestamp as last token. Was WhisperTimeStampLogitsProcessor used?

devxpy avatar Mar 23 '23 17:03 devxpy

Running into the same issue:

import torch
import gdown
from transformers import pipeline, AutomaticSpeechRecognitionPipeline, Pipeline, GenerationConfig, \
    WhisperTokenizer, WhisperModel, WhisperConfig, WhisperForConditionalGeneration, WhisperTokenizerFast, \
    WhisperProcessor


url = 'https://drive.google.com/uc?id=1IcnHiL5gdGs8zr-NwuSQm_hsAZugz4mq'
audio_path = 'audio.wav'
gdown.download(url, audio_path, quiet=False)


model_name = "openai/whisper-small"
task = 'transcribe'
language = 'spanish'
predict_timestamps = True
chunk_length = 30
max_length = 100
batch_size = 1
device = 'cuda:0' if torch.cuda.is_available() else 'cpu'
# -----------------------------------------------------------------------

config = WhisperConfig.from_pretrained(model_name)
model = WhisperForConditionalGeneration.from_pretrained(model_name, config=config)

tokenizer = WhisperTokenizer.from_pretrained(model_name)
# tokenizer.set_prefix_tokens(language=language, task=task, predict_timestamps=predict_timestamps)
processor = WhisperProcessor.from_pretrained(model_name)

pipe = pipeline(
    task='automatic-speech-recognition',
    model=model,
    chunk_length_s=chunk_length,
    batch_size=batch_size,
    tokenizer=tokenizer,
    feature_extractor=processor.feature_extractor,
    device=device
)

forced_decoder_ids = tokenizer.get_decoder_prompt_ids(language=language, task=task, no_timestamps=not predict_timestamps)
print(forced_decoder_ids)
generate_kwargs = {'max_length': max_length, "forced_decoder_ids": forced_decoder_ids}


print('audio_path: ', audio_path)
result = pipe(audio_path, return_timestamps=predict_timestamps, generate_kwargs=generate_kwargs)
print(result)

with error

Traceback (most recent call last):
  File "/home/spanagiotidi/notebook_dir/whisper_tests/test6.py", line 47, in <module>
    print(result)
  File "/home/spanagiotidi/anaconda3/lib/python3.9/site-packages/transformers/pipelines/automatic_speech_recognition.py", line 272, in __call__
    return super().__call__(inputs, **kwargs)
  File "/home/spanagiotidi/anaconda3/lib/python3.9/site-packages/transformers/pipelines/base.py", line 1101, in __call__
    return next(
  File "/home/spanagiotidi/anaconda3/lib/python3.9/site-packages/transformers/pipelines/pt_utils.py", line 125, in __next__
    processed = self.infer(item, **self.params)
  File "/home/spanagiotidi/anaconda3/lib/python3.9/site-packages/transformers/pipelines/automatic_speech_recognition.py", line 527, in postprocess
    text, optional = self.tokenizer._decode_asr(
  File "/home/spanagiotidi/anaconda3/lib/python3.9/site-packages/transformers/models/whisper/tokenization_whisper.py", line 708, in _decode_asr
    return _decode_asr(
  File "/home/spanagiotidi/anaconda3/lib/python3.9/site-packages/transformers/models/whisper/tokenization_whisper.py", line 881, in _decode_asr
    raise ValueError(
ValueError: There was an error while processing timestamps, we haven't found a timestamp as last token. Was WhisperTimeStampLogitsProcessor used?

panagiotidi avatar Mar 30 '23 10:03 panagiotidi

cc @Narsil maybe an edge case that was not handled (and that was previously ignored). Let's be more permissive on the last timestamp + I will check with the provided example why we are not getting a last timestamp. Might be something relating to the length of the forced_decoder_ids that can affect the WhisperTimestampLogitsProcessor. Something to look out for

ArthurZucker avatar Mar 30 '23 11:03 ArthurZucker

@devxpy I have reproduced with your example. It seems this model never outputs timestamps.

I am guessing it was fine-tuned without timestamps, so the error is kind of normal. However, it led me to reduce the hard error to a soft error. The results are still nonsensical (check out the test).

I spent some time trying to find a better fix by fixing the logits processor itself, but to no avail. There's just no way to fix models that refuse to output timestamp tokens. Note that Whisper models are never even forced to output increasing timestamp tokens, so there's already a lot of room there. A soft error is better.

Narsil avatar Mar 30 '23 14:03 Narsil

https://github.com/huggingface/transformers/pull/22475/files

Narsil avatar Mar 30 '23 14:03 Narsil

I received this error when transcribing audio with openai/whisper-large-v2. For me, the cause was 10 seconds of silence at the end of the file. Maybe this can be added as a potential solution to the error/warning, or maybe this can be detected and silently ignored.
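
A hedged workaround for this case is to trim trailing silence before transcribing. This sketch uses librosa's energy-based trim; the file name and the 25 dB threshold are assumptions to tune per recording:

import librosa

audio, sr = librosa.load("my_file.wav", sr=16000)
trimmed, _ = librosa.effects.trim(audio, top_db=25)  # drop leading/trailing silence
result = pipe({"raw": trimmed, "sampling_rate": sr}, return_timestamps=True, chunk_length_s=30)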

wietsedv avatar Apr 04 '23 10:04 wietsedv

Thanks for this comment! @narsil, I think it makes sense

ArthurZucker avatar Apr 04 '23 10:04 ArthurZucker

@Narsil @devxpy @ArthurZucker I also did fine-tuning without timestamps, and now I have an issue where timestamps are not appearing. Is there a good way to fine-tune and include timestamps? Do I need to add 1500 special tokens, one for each timestamp, to the tokenizer? I made sure that my tokenizer doesn't have the timestamp tokens. #20225

upskyy avatar Jun 21 '23 07:06 upskyy

Hey! For fine-tuning with timestamps, you should either use the latest tokenizer (which by default should add the 1500 special tokens, not more) or use the previous one, which also supported them, but not for encoding. Pinging @sanchit-gandhi as he has been working on distil whisper and might have a training script to add timestamps. Also, this kind of question would be better suited for the forum.

ArthurZucker avatar Jun 22 '23 11:06 ArthurZucker

Hey @upskyy - in my experience, fine-tuning with LoRA / QLoRA is a fantastic way to prevent this 'catastrophic forgetting' effect where Whisper forgets how to predict timestamps after fine-tuning. For this, you can check out the following repo: https://github.com/Vaibhavs10/fast-whisper-finetuning

And @ArthurZucker - cool that the latest tokenizer has the 1500 special tokens already added! This should make our lives a lot easier for encoding with timestamps, since the tokenizer is now able to map the timestamp strings to tokens.

All we really need to do then is have a small amount of data in our train set that has timestamps in the Whisper format, e.g.

"<|0.00|> He has grave doubts whether Sir Frederick Layton's work is really Greek after all and<|6.24|><|6.24|> can discover in it but little of rocky Ithaca.<|9.44|>"

Generally, you only need between 1-5% of your data to be timestamped to ensure you retain Whisper's timestamp prediction abilities. The easiest way of getting this data is to use the pre-trained Whisper model to re-annotate 1% of your training data with timestamps. You can then merge this data into your full training corpus to train on both non-timestamped (99%) and timestamped (1%) data.
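
As a sketch of that re-annotation step, assuming an ASR pipeline `pipe` built on the pre-trained checkpoint; the `to_whisper_format` helper is hypothetical glue, not a transformers API:

def to_whisper_format(chunks):
    # turn pipeline output chunks [{"text": ..., "timestamp": (start, end)}, ...]
    # into a single string in the Whisper timestamp format shown above
    text = ""
    for chunk in chunks:
        start, end = chunk["timestamp"]
        text += f"<|{start:.2f}|>{chunk['text']}<|{end:.2f}|>"
    return text

out = pipe(sample["audio"]["array"], return_timestamps=True)
sample["sentence"] = to_whisper_format(out["chunks"])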

What we then want to do is enable/disable timestamps when we encode the labels, depending on whether the labels have timestamps or not:

def prepare_dataset(batch):
    # load and resample audio data from 48 to 16kHz
    audio = batch["audio"]

    # compute log-Mel input features from input audio array
    batch["input_features"] = feature_extractor(audio["array"], sampling_rate=audio["sampling_rate"]).input_features[0]

    # set tokenizer prefix tokens depending on whether we have timestamps or not
    # (add a boolean column to your dataset indicating whether each label has timestamps)
    predict_timestamps = batch["predict_timestamps"]
    tokenizer.set_prefix_tokens(language=language, task="transcribe", predict_timestamps=predict_timestamps)

    # encode target text to label ids
    batch["labels"] = tokenizer(batch["sentence"]).input_ids
    return batch
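
You can then apply this over the dataset in the usual way, e.g. `dataset = dataset.map(prepare_dataset)` (any `remove_columns` arguments depend on your dataset schema).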

sanchit-gandhi avatar Jun 22 '23 17:06 sanchit-gandhi

@ArthurZucker @sanchit-gandhi Thank you so much for the detailed explanation. I'm trying to download a new tokenizer, but it seems like it was last updated 5 months ago. Can I get it like this? [link] What is the latest tokenizer you are talking about? Currently, my tokenizer splits the timestamp strings token by token, like this.

from transformers import WhisperProcessor


processor = WhisperProcessor.from_pretrained("openai/whisper-tiny")
tokens = processor.tokenizer("<|0.00|>Hello!<|2.34|>").input_ids
print(tokens)
# [50258, 50363, 27, 91, 15, 13, 628, 91, 29, 15947, 0, 27, 91, 17, 13, 12249, 91, 29, 50257]

text = processor.decode([27, 91, 15, 13, 628, 91, 29])
print(text)
# <|0.00|>
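
For comparison, with a tokenizer version that registers the 1500 timestamp tokens as added special tokens, each `<|x.xx|>` marker should encode to a single id rather than being split character by character. A sketch of the check (the single-id behavior is the assumption being tested):

tokens = processor.tokenizer("<|0.00|>Hello!<|2.34|>").input_ids
print(tokens)
# expected with timestamp tokens registered: one id for <|0.00|> and one for <|2.34|>,
# instead of the many character-level ids shown above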

upskyy avatar Jun 23 '23 03:06 upskyy

@ArthurZucker could you give @upskyy a hand with downloading the latest version of the tokenizer please! 🙌

sanchit-gandhi avatar Jun 23 '23 16:06 sanchit-gandhi