WhisperTimeStampLogitsProcessor error while using Whisper pipelines. Was WhisperTimeStampLogitsProcessor used?
System Info
Hello,
When I tried this notebook, https://colab.research.google.com/drive/1rS1L4YSJqKUH_3YxIQHBI982zso23wor?usp=sharing#scrollTo=Ca4YYdtATxzo, I encountered the following error: There was an error while processing timestamps, we haven't found a timestamp as last token. Was WhisperTimeStampLogitsProcessor used?
I encounter this error especially with audio longer than 30 seconds; for audio shorter than 30 seconds, timestamps are returned correctly.
How can I fix it?
Specs:
transformers==4.27.0.dev0
from transformers import pipeline

MODEL_NAME = "openai/whisper-large-v2"
pipe = pipeline(
    task="automatic-speech-recognition",
    model=MODEL_NAME,
    device="cuda:0",
    generate_kwargs={"language": "<|tr|>", "task": "transcribe"},
)
results = pipe(speech_file, return_timestamps=True, chunk_length_s=30, stride_length_s=[6, 0],
               batch_size=32, generate_kwargs={"language": "<|tr|>", "task": "transcribe"})
Who can help?
@ArthurZucker @sanchit-gandhi @Narsil
Information
- [X] The official example scripts
- [ ] My own modified scripts
Tasks
- [X] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
MODEL_NAME = "openai/whisper-large-v2"
pipe = pipeline(
    task="automatic-speech-recognition",
    model=MODEL_NAME,
    device="cuda:0",
    generate_kwargs={"language": "<|tr|>", "task": "transcribe"},
)
results = pipe(speech_file, return_timestamps=True, chunk_length_s=30, stride_length_s=[6, 0],
               batch_size=32, generate_kwargs={"language": "<|tr|>", "task": "transcribe"})
Expected behavior
results = {
    "text": "Some Turkish results.",
    "chunks": [
        {"text": " Some Turkish results.", "timestamp": (0.0, 4.4)},
        {"text": " Some Turkish results.", "timestamp": (4.4, 28.32)},
        {"text": " Some Turkish results.", "timestamp": (28.32, 45.6)},
    ],
}
cc @Narsil as this might follow the latest update of return_timestamps.
Do you have the faulty sample too? I cannot reproduce with a dummy file.
@ArthurZucker it does look like the last token is indeed not a timestamp, but could it possibly be linked to batching?
I'm using this audio https://github.com/frankiedrake/demo/blob/master/whisper_test.wav to test with your script.
You can use this full script for testing. I uploaded an English audio file to GitHub, so you can try it with that too.
from six.moves.urllib.request import urlopen
import io
import numpy as np
import soundfile as sf
from transformers import pipeline

sound_link = "https://github.com/melihogutcen/sound_data/blob/main/accidents_resampled.wav?raw=true"
data, sr = sf.read(io.BytesIO(urlopen(sound_link).read()))
sound_arr_first_ch1 = np.asarray(data, dtype=np.float64)
audio_in_memory_ch1 = {"raw": sound_arr_first_ch1, "sampling_rate": 16000}

MODEL_NAME = "openai/whisper-large-v2"
pipe = pipeline(
    task="automatic-speech-recognition",
    model=MODEL_NAME,
    device="cuda:0",
)
results_pipe_ch1 = pipe(audio_in_memory_ch1, return_timestamps=True, chunk_length_s=30,
                        stride_length_s=[6, 0], batch_size=32,
                        generate_kwargs={"language": "<|en|>", "task": "transcribe"})
print(results_pipe_ch1["text"])
print(results_pipe_ch1)
The error is below:
warnings.warn(
Traceback (most recent call last):
File "/SpeechToText/whisper_trials.py", line 21, in <module>
results_pipe_ch1 = pipe(audio_in_memory_ch1, return_timestamps=True, chunk_length_s=30,
File "/opt/conda/lib/python3.10/site-packages/transformers/pipelines/automatic_speech_recognition.py", line 272, in __call__
return super().__call__(inputs, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/transformers/pipelines/base.py", line 1101, in __call__
return next(
File "/opt/conda/lib/python3.10/site-packages/transformers/pipelines/pt_utils.py", line 125, in __next__
processed = self.infer(item, **self.params)
File "/opt/conda/lib/python3.10/site-packages/transformers/pipelines/automatic_speech_recognition.py", line 527, in postprocess
text, optional = self.tokenizer._decode_asr(
File "/opt/conda/lib/python3.10/site-packages/transformers/models/whisper/tokenization_whisper_fast.py", line 480, in _decode_asr
return _decode_asr(
File "/opt/conda/lib/python3.10/site-packages/transformers/models/whisper/tokenization_whisper.py", line 881, in _decode_asr
raise ValueError(
ValueError: There was an error while processing timestamps, we haven't found a timestamp as last token. Was WhisperTimeStampLogitsProcessor used?
Thanks, I have been able to reproduce. It's definitely linked to batching, as everything works with batch_size=1.
Working on a fix.
Ok, the issue is that the model uses 50256 for padding, or silence.
@ArthurZucker should we make this a special token? (This would mean it would be ignored in the state machine, which is OK since this token decodes to ''.)
The other solution would be to decode the previous_tokens before failing and check that the decoding is the empty string, but that seems like a workaround for the fact that token 50256 is special and means silence (or pad, I guess).
This is the issue: https://huggingface.co/openai/whisper-large-v2/blob/main/generation_config.json#L124
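For context, a quick way to see why that padding id trips up the decoder is to compare the generation config against the tokenizer's special ids. A minimal sketch; the printed values depend on which revision of the config you have cached:

from transformers import GenerationConfig, WhisperTokenizer

gen_config = GenerationConfig.from_pretrained("openai/whisper-large-v2")
tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-large-v2")

# Which id does generation pad batched sequences with, and how does the
# ASR decoding state machine see that id?
print(gen_config.pad_token_id)
print(repr(tokenizer.decode([50256])))
print(50256 in tokenizer.all_special_ids)  # if False, the id is treated as text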
@melihogutcen A fix is coming.
Proposed changes:
https://huggingface.co/openai/whisper-base/discussions/12 https://huggingface.co/openai/whisper-large/discussions/29 https://huggingface.co/openai/whisper-medium/discussions/12 https://huggingface.co/openai/whisper-large-v2/discussions/30 https://huggingface.co/openai/whisper-small/discussions/19 https://huggingface.co/openai/whisper-tiny/discussions/9
I fixed my problem by updating generation_config.json. Thanks!
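Note that from_pretrained caches config files, so after the upstream fix you may need to refresh the cache. A sketch, assuming the pipe from the snippets above:

from transformers import GenerationConfig

# Re-download the updated generation config and attach it to the pipeline's model.
pipe.model.generation_config = GenerationConfig.from_pretrained(
    "openai/whisper-large-v2", force_download=True
)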
Oops! I have tried different audio files with the new config, and on rare occasions I still get this error with some of them.
Traceback (most recent call last):
File "/SpeechToText/whisper_trials.py", line 63, in <module>
results_pipe_ch1 = pipe(resampled16k_data_ch1, return_timestamps=True, chunk_length_s=30,
File "/opt/conda/lib/python3.10/site-packages/transformers/pipelines/automatic_speech_recognition.py", line 272, in __call__
return super().__call__(inputs, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/transformers/pipelines/base.py", line 1101, in __call__
return next(
File "/opt/conda/lib/python3.10/site-packages/transformers/pipelines/pt_utils.py", line 125, in __next__
processed = self.infer(item, **self.params)
File "/opt/conda/lib/python3.10/site-packages/transformers/pipelines/automatic_speech_recognition.py", line 527, in postprocess
text, optional = self.tokenizer._decode_asr(
File "/opt/conda/lib/python3.10/site-packages/transformers/models/whisper/tokenization_whisper_fast.py", line 480, in _decode_asr
return _decode_asr(
File "/opt/conda/lib/python3.10/site-packages/transformers/models/whisper/tokenization_whisper.py", line 881, in _decode_asr
raise ValueError(
ValueError: There was an error while processing timestamps, we haven't found a timestamp as last token. Was WhisperTimeStampLogitsProcessor used?
Thanks, any chance we could see the files? Or if you could print previous_tokens just before this error, that would be nice.
This error occurs when the state machine still has some dangling tokens and no timestamp token at the end, meaning we have no ending timestamp. This shouldn't happen given how WhisperTimeStampLogitsProcessor is supposed to work. The previous bug was that the model would use a padding_token_id which wasn't a special_token, so it would be considered as text (which it isn't).
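For readers following along, here is a deliberately simplified sketch of the invariant that decoding enforces. It is not the actual _decode_asr code, just an illustration of why dangling text tokens after the final timestamp raise:

def pair_segments(token_ids, timestamp_begin):
    """Pair text tokens with the timestamp tokens that bracket them
    (simplified illustration, not the real _decode_asr)."""
    chunks, current_tokens, start = [], [], None
    for tid in token_ids:
        if tid >= timestamp_begin:  # ids above this threshold are timestamp tokens
            if start is None:
                start = tid
            else:
                chunks.append((start, tid, current_tokens))
                current_tokens, start = [], None
        else:
            current_tokens.append(tid)
    if current_tokens:
        # Text left over with no closing timestamp -> the ValueError above.
        raise ValueError("we haven't found a timestamp as last token")
    return chunks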
Sorry, I couldn't share these files due to privacy, but I can send the previous_tokens. I added a print statement here: https://github.com/huggingface/transformers/blob/main/src/transformers/models/whisper/tokenization_whisper.py#:~:text=current_tokens%20%3D%20%5B%5D-,if%20previous_tokens%3A,-if%20return_timestamps%3A
Is that correct?
Previous tokens: [[16729, 44999, 39196, 259, 13]]
There was an error while processing timestamps, we haven't found a timestamp as last token. Was WhisperTimeStampLogitsProcessor used?
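As a quick sanity check, the dangling ids can be decoded with the same tokenizer to see what text they correspond to (a minimal sketch):

from transformers import WhisperTokenizer

tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-large-v2")
# Decode the leftover ids reported above to inspect the dangling segment.
print(tokenizer.decode([16729, 44999, 39196, 259, 13]))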
I suspect the logits processor @Narsil, but it is strange that this didn't come up before.
@melihogutcen This is Turkish, on whisper-large-v2, correct? I'll try to run a batch on some dataset to try and trigger it elsewhere. Still using the same script as above, correct?
We need to reproduce to understand what's going on. It could be the WhisperLogitsProcessor, but also a bug somewhere else.
Yes, it is Turkish and I used whisper-large-v2.
I used the same script as above, only with the "<|tr|>" language, and I changed generation_config.json as you said.
Could it be that this happens when all of a batch's content is silence? I have seen the error occur when the audio has a section that is mostly silence (I tested with 10 minutes of silence). With the original Whisper, what I get instead is hallucination and repeated words.
I'm getting this error as well, but only on a fine-tuned model. If I run my program with huggingface openai/whisper-medium it works fine, but when I change just the model to a whisper-medium trained on the common_voice_11_0 dataset, any audio file I pass through gets this error.
2023-03-15 15:06:11 Error occurred while processing File1.wav. Exception: There was an error while processing timestamps, we haven't found a timestamp as last token. Was WhisperTimeStampLogitsProcessor used?
Traceback (most recent call last):
  File "/home/user/basictest.py", line 64, in transcribe_audio
    out = pipeline(audio)
  File "/home/user/anaconda3/lib/python3.9/site-packages/speechbox/diarize.py", line 120, in __call__
    asr_out = self.asr_pipeline(
  File "/home/user/anaconda3/lib/python3.9/site-packages/transformers/pipelines/automatic_speech_recognition.py", line 272, in __call__
    return super().__call__(inputs, **kwargs)
  File "/home/user/anaconda3/lib/python3.9/site-packages/transformers/pipelines/base.py", line 1101, in __call__
    return next(
  File "/home/user/anaconda3/lib/python3.9/site-packages/transformers/pipelines/pt_utils.py", line 125, in __next__
    processed = self.infer(item, **self.params)
  File "/home/user/anaconda3/lib/python3.9/site-packages/transformers/pipelines/automatic_speech_recognition.py", line 527, in postprocess
    text, optional = self.tokenizer._decode_asr(
  File "/home/user/anaconda3/lib/python3.9/site-packages/transformers/models/whisper/tokenization_whisper_fast.py", line 480, in _decode_asr
    return _decode_asr(
  File "/home/user/anaconda3/lib/python3.9/site-packages/transformers/models/whisper/tokenization_whisper.py", line 881, in _decode_asr
    raise ValueError(
ValueError: There was an error while processing timestamps, we haven't found a timestamp as last token. Was WhisperTimeStampLogitsProcessor used?
@alextomana, did you try comparing the generation_config as mentioned above?
About the silence, I'm not really sure.
Seeing the same with a fine-tuned model.
import requests
import transformers
from transformers import GenerationConfig

pipe = transformers.pipeline(
    "automatic-speech-recognition",
    model="vasista22/whisper-hindi-large-v2",
    device="cuda:0",
)
pipe.model.generation_config = GenerationConfig.from_pretrained("openai/whisper-large-v2")

audio = requests.get(
    "https://storage.googleapis.com/dara-c1b52.appspot.com/daras_ai/media/e00ba954-c980-11ed-8700-8e93953183bb/6.ogg"
).content

forced_decoder_ids = pipe.tokenizer.get_decoder_prompt_ids(task="transcribe", language="hindi")
pipe(
    audio,
    return_timestamps=True,
    generate_kwargs=dict(
        forced_decoder_ids=forced_decoder_ids,
    ),
    chunk_length_s=30,
    stride_length_s=[6, 0],
    batch_size=32,
)
/root/.pyenv/versions/3.10.10/lib/python3.10/site-packages/transformers/generation/utils.py:1288: UserWarning: Using `max_length`'s default (448) to control the generation length. This behaviour is deprecated and will be removed from the config in v5 of Transformers -- we recommend using `max_new_tokens` to control the maximum length of the generation.
warnings.warn(
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/root/.pyenv/versions/3.10.10/lib/python3.10/site-packages/transformers/pipelines/automatic_speech_recognition.py", line 272, in __call__
    return super().__call__(inputs, **kwargs)
  File "/root/.pyenv/versions/3.10.10/lib/python3.10/site-packages/transformers/pipelines/base.py", line 1101, in __call__
    return next(
  File "/root/.pyenv/versions/3.10.10/lib/python3.10/site-packages/transformers/pipelines/pt_utils.py", line 125, in __next__
    processed = self.infer(item, **self.params)
  File "/root/.pyenv/versions/3.10.10/lib/python3.10/site-packages/transformers/pipelines/automatic_speech_recognition.py", line 527, in postprocess
    text, optional = self.tokenizer._decode_asr(
  File "/root/.pyenv/versions/3.10.10/lib/python3.10/site-packages/transformers/models/whisper/tokenization_whisper_fast.py", line 480, in _decode_asr
    return _decode_asr(
  File "/root/.pyenv/versions/3.10.10/lib/python3.10/site-packages/transformers/models/whisper/tokenization_whisper.py", line 881, in _decode_asr
    raise ValueError(
ValueError: There was an error while processing timestamps, we haven't found a timestamp as last token. Was WhisperTimeStampLogitsProcessor used?
Running into the same issue:
import torch
import gdown
from transformers import pipeline, AutomaticSpeechRecognitionPipeline, Pipeline, GenerationConfig, \
    WhisperTokenizer, WhisperModel, WhisperConfig, WhisperForConditionalGeneration, WhisperTokenizerFast, \
    WhisperProcessor

url = 'https://drive.google.com/uc?id=1IcnHiL5gdGs8zr-NwuSQm_hsAZugz4mq'
audio_path = 'audio.wav'
gdown.download(url, audio_path, quiet=False)

model_name = "openai/whisper-small"
task = 'transcribe'
language = 'spanish'
predict_timestamps = True
chunk_length = 30
max_length = 100
batch_size = 1
device = 'cuda:0' if torch.cuda.is_available() else 'cpu'

# -----------------------------------------------------------------------
config = WhisperConfig.from_pretrained(model_name)
model = WhisperForConditionalGeneration.from_pretrained(model_name, config=config)
tokenizer = WhisperTokenizer.from_pretrained(model_name)
# tokenizer.set_prefix_tokens(language=language, task=task, predict_timestamps=predict_timestamps)
processor = WhisperProcessor.from_pretrained(model_name)

pipe = pipeline(
    task='automatic-speech-recognition',
    model=model,
    chunk_length_s=chunk_length,
    batch_size=batch_size,
    tokenizer=tokenizer,
    feature_extractor=processor.feature_extractor,
    device=device,
)

forced_decoder_ids = tokenizer.get_decoder_prompt_ids(language=language, task=task, no_timestamps=not predict_timestamps)
print(forced_decoder_ids)

generate_kwargs = {'max_length': max_length, "forced_decoder_ids": forced_decoder_ids}
print('audio_path: ', audio_path)
result = pipe(audio_path, return_timestamps=predict_timestamps, generate_kwargs=generate_kwargs)
print(result)
with the error:
Traceback (most recent call last):
File "/home/spanagiotidi/notebook_dir/whisper_tests/test6.py", line 47, in <module>
print(result)
File "/home/spanagiotidi/anaconda3/lib/python3.9/site-packages/transformers/pipelines/automatic_speech_recognition.py", line 272, in __call__
return super().__call__(inputs, **kwargs)
File "/home/spanagiotidi/anaconda3/lib/python3.9/site-packages/transformers/pipelines/base.py", line 1101, in __call__
return next(
File "/home/spanagiotidi/anaconda3/lib/python3.9/site-packages/transformers/pipelines/pt_utils.py", line 125, in __next__
processed = self.infer(item, **self.params)
File "/home/spanagiotidi/anaconda3/lib/python3.9/site-packages/transformers/pipelines/automatic_speech_recognition.py", line 527, in postprocess
text, optional = self.tokenizer._decode_asr(
File "/home/spanagiotidi/anaconda3/lib/python3.9/site-packages/transformers/models/whisper/tokenization_whisper.py", line 708, in _decode_asr
return _decode_asr(
File "/home/spanagiotidi/anaconda3/lib/python3.9/site-packages/transformers/models/whisper/tokenization_whisper.py", line 881, in _decode_asr
raise ValueError(
ValueError: There was an error while processing timestamps, we haven't found a timestamp as last token. Was WhisperTimeStampLogitsProcessor used?
cc @Narsil, maybe an edge case that was not handled (and that was previously ignored). Let's be more permissive on the last timestamp; I will also check with the provided example why we are not getting a last timestamp.
Might be something related to the length of the forced_decoder_ids that can affect the WhisperTimeStampLogitsProcessor. Something to look out for.
@devxpy I have reproduced with your example. It seems this model never outputs timestamps.
I am guessing it was fine-tuned without timestamps, so the error is kind of expected. However, it led me to reduce the hard error to a soft error. The results are still nonsensical (check out the test).
I spent some time trying to find a better fix by fixing the logits processor itself, but to no avail. There's just no way to fix models that refuse to output timestamp tokens. Note that Whisper models are never even forced to output increasing timestamp tokens, so there's already a lot of room there. A soft error is better.
https://github.com/huggingface/transformers/pull/22475/files
I received this error when transcribing audio with openai/whisper-large-v2. For me, the cause was 10 seconds of silence at the end of the file. Maybe this can be added as a potential cause in the error/warning, or maybe this case can be detected and silently ignored.
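If trailing silence is the trigger, one possible workaround is to trim it before calling the pipeline. A minimal sketch, assuming librosa is installed and pipe is the pipeline from the earlier snippets; the file name is a placeholder:

import librosa

# Load at Whisper's expected 16 kHz and trim leading/trailing silence.
audio, sr = librosa.load("my_audio.wav", sr=16000)  # placeholder path
trimmed, _ = librosa.effects.trim(audio, top_db=30)

result = pipe({"raw": trimmed, "sampling_rate": sr}, return_timestamps=True)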
Thanks for this comment! @Narsil, I think it makes sense.
@Narsil @devxpy @ArthurZucker I also did fine-tuning without timestamps, and now I have an issue where timestamps do not appear. Is there a good way to fine-tune while keeping timestamps? Do I need to add the 1500 timestamp special tokens (one per timestamp) to the tokenizer? I made sure that the tokenizer doesn't have timestamp tokens. #20225
Hey! For fine-tuning with timestamps, you should either use the latest tokenizer (which by default should add the 1500 special tokens, no more) or use the previous one, which also supported them, just not for encoding. Pinging @sanchit-gandhi, as he has been working on Distil-Whisper and might have a training script that adds timestamps. Also, this kind of question would be better suited to the forum.
Hey @upskyy - in my experience, fine-tuning with LoRA / QLoRA is a fantastic way to prevent this 'catastrophic forgetting' effect where Whisper forgets how to predict timestamps after fine-tuning. For this, you can check out the following repo: https://github.com/Vaibhavs10/fast-whisper-finetuning
And @ArthurZucker - cool that the latest tokenizer has the 1500 special tokens already added! This should make our lives a lot easier for encoding with timestamps, since the tokenizer is now able to map the timestamp strings to tokens.
All we really need to do then is have a small amount of data in our train set that has timestamps in the Whisper format, e.g.
"<|0.00|> He has grave doubts whether Sir Frederick Layton's work is really Greek after all and<|6.24|><|6.24|> can discover in it but little of rocky Ithaca.<|9.44|>"
Generally, you only need between 1-5% of your data to be timestamped to ensure you retain Whisper's timestamp prediction abilities. The easiest way of getting this data is to use the pre-trained Whisper model to re-annotate 1% of your training data with timestamps. You can then merge this data into your full training corpus to train on both non-timestamped (99%) and timestamped (1%) data.
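For instance, here is a rough sketch of that re-annotation step. The names labeller and train_dataset, as well as the sentence and predict_timestamps columns, are assumptions chosen to match the snippet below:

from transformers import pipeline

labeller = pipeline("automatic-speech-recognition", model="openai/whisper-small", device=0)

def add_timestamped_label(batch):
    audio = batch["audio"]
    out = labeller(
        {"raw": audio["array"], "sampling_rate": audio["sampling_rate"]},
        return_timestamps=True,
    )
    # Re-serialize the predicted chunks into Whisper's "<|t0|> text <|t1|>" label format.
    pieces = []
    for chunk in out["chunks"]:
        start, end = chunk["timestamp"]
        pieces.append(f"<|{start:.2f}|>{chunk['text']}<|{end:.2f}|>")
    batch["sentence"] = "".join(pieces)
    batch["predict_timestamps"] = True
    return batch

# Pseudo-label ~1% of the training data, then merge it back into the full corpus.
timestamped = train_dataset.select(range(len(train_dataset) // 100)).map(add_timestamped_label)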
What we then want to do is enable/disable timestamps when we encode the labels, depending on whether the labels have timestamps or not:
def prepare_dataset(batch):
    # load and resample audio data from 48 to 16kHz
    audio = batch["audio"]

    # compute log-Mel input features from the input audio array
    batch["input_features"] = feature_extractor(audio["array"], sampling_rate=audio["sampling_rate"]).input_features[0]

    # set tokenizer prefix tokens depending on whether we have timestamps or not;
    # "predict_timestamps" is a boolean column you add to your dataset to indicate this
    predict_timestamps = batch["predict_timestamps"]
    tokenizer.set_prefix_tokens(language=language, task="transcribe", predict_timestamps=predict_timestamps)

    # encode target text to label ids
    batch["labels"] = tokenizer(batch["sentence"]).input_ids
    return batch
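A usage sketch, assuming a datasets.Dataset with the boolean predict_timestamps column described above:

# Map over the merged corpus so each example is encoded with or without timestamps.
dataset = dataset.map(prepare_dataset, remove_columns=dataset.column_names)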
@ArthurZucker @sanchit-gandhi Thank you so much for the detailed explanation. I'm trying to download a new tokenizer, but it seems it was last updated 5 months ago. Can I get it like this? [link] Which latest tokenizer are you referring to? Currently, my tokenizer splits timestamps character by character, like this:
from transformers import WhisperProcessor
processor = WhisperProcessor.from_pretrained("openai/whisper-tiny")
tokens = processor.tokenizer("<|0.00|>Hello!<|2.34|>").input_ids
print(tokens)
# [50258, 50363, 27, 91, 15, 13, 628, 91, 29, 15947, 0, 27, 91, 17, 13, 12249, 91, 29, 50257]
text = processor.decode([27, 91, 15, 13, 628, 91, 29])
print(text)
# <|0.00|>
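For reference, a quick way to check whether a tokenizer encodes timestamps atomically (the assumption here is that the updated tokenizer maps each <|t|> marker to a single id rather than splitting it):

# Returns one valid id if the timestamp token exists in the vocab, the unk id otherwise.
print(processor.tokenizer.convert_tokens_to_ids("<|0.00|>"))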
@ArthurZucker could you give @upskyy a hand with downloading the latest version of the tokenizer please!