Whisper - list index out of range with word level timestamps
System Info
- `transformers` version: 4.42.2
- Platform: Windows-10-10.0.22621-SP0
- Python version: 3.10.14
- Huggingface_hub version: 0.23.4
- Safetensors version: 0.4.2
- Accelerate version: 0.31.0
- Accelerate config: not found
- PyTorch version (GPU?): 2.3.1 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using distributed or parallel set-up in script?: No
- Using GPU in script?: Yes
- GPU type: NVIDIA GeForce RTX 4070 Laptop GPU
Who can help?
No response
Information
- [ ] The official example scripts
- [X] My own modified scripts
Tasks
- [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- [X] My own task or dataset (give details below)
Reproduction
- Load the 'whisper-large-v3' AutoModelForSpeechSeq2Seq model and send it to the GPU
- Set up the model config with return_timestamps="word", amongst other settings
- Run the audio through the pipeline, which results in the error below
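The steps above can be sketched roughly as follows. This is a minimal reconstruction, not my exact script: the model id, dtype, device handling, and audio path are assumptions based on the standard transformers ASR pipeline setup.

```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

# Assumed model id and device; adjust to your setup.
model_id = "openai/whisper-large-v3"
device = "cuda:0" if torch.cuda.is_available() else "cpu"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch.float16, low_cpu_mem_usage=True
).to(device)
processor = AutoProcessor.from_pretrained(model_id)

transcribing_pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch.float16,
    device=device,
)

# Requesting word-level timestamps on this audio triggers the IndexError.
asr_out = transcribing_pipe(
    "../SampleData/Saba_interview_short.wav",
    return_timestamps="word",
    generate_kwargs={"language": "danish"},
)
```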
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
Cell In[8], line 1
----> 1 asr_out = transcribing_pipe(
      2     '../SampleData/Saba_interview_short.wav',
      3     return_timestamps="word",
      4     generate_kwargs={"language": "danish"}
      5 )
      7 asr_out

File c:\Users\User\miniconda3\envs\vva\lib\site-packages\transformers\pipelines\automatic_speech_recognition.py:284, in AutomaticSpeechRecognitionPipeline.__call__(self, inputs, **kwargs)
    221 def __call__(
    222     self,
    223     inputs: Union[np.ndarray, bytes, str],
    224     **kwargs,
    225 ):
    226     """
    227     Transcribe the audio sequence(s) given as inputs to text. See the [`AutomaticSpeechRecognitionPipeline`]
    228     documentation for more information.
   (...)
    282         `"".join(chunk["text"] for chunk in output["chunks"])`.
    283     """
--> 284     return super().__call__(inputs, **kwargs)

File c:\Users\User\miniconda3\envs\vva\lib\site-packages\transformers\pipelines\base.py:1246, in Pipeline.__call__(self, inputs, num_workers, batch_size, *args, **kwargs)
   1244     return self.iterate(inputs, preprocess_params, forward_params, postprocess_params)
   1245 elif self.framework == "pt" and isinstance(self, ChunkPipeline):
-> 1246     return next(
   1247         iter(
   1248             self.get_iterator(
   1249                 [inputs], num_workers, batch_size, preprocess_params, forward_params, postprocess_params
   1250             )
   1251         )
   1252     )
   1253 else:
   1254     return self.run_single(inputs, preprocess_params, forward_params, postprocess_params)

File c:\Users\User\miniconda3\envs\vva\lib\site-packages\transformers\pipelines\pt_utils.py:125, in PipelineIterator.__next__(self)
    123 # We're out of items within a batch
    124 item = next(self.iterator)
--> 125 processed = self.infer(item, **self.params)
    126 # We now have a batch of "inferred things".
    127 if self.loader_batch_size is not None:
    128     # Try to infer the size of the batch

File c:\Users\User\miniconda3\envs\vva\lib\site-packages\transformers\pipelines\automatic_speech_recognition.py:587, in AutomaticSpeechRecognitionPipeline.postprocess(self, model_outputs, decoder_kwargs, return_timestamps, return_language)
    584     stride_right /= sampling_rate
    585     output["stride"] = chunk_len, stride_left, stride_right
--> 587 text, optional = self.tokenizer._decode_asr(
    588     model_outputs,
    589     return_timestamps=return_timestamps,
    590     return_language=return_language,
    591     time_precision=time_precision,
    592 )
    593 else:
    594     items = np.concatenate(final_items, axis=1)

File c:\Users\User\miniconda3\envs\vva\lib\site-packages\transformers\models\whisper\tokenization_whisper.py:832, in WhisperTokenizer._decode_asr(self, model_outputs, return_timestamps, return_language, time_precision)
    831 def _decode_asr(self, model_outputs, *, return_timestamps, return_language, time_precision):
--> 832     return _decode_asr(
    833         self,
    834         model_outputs,
    835         return_timestamps=return_timestamps,
    836         return_language=return_language,
    837         time_precision=time_precision,
    838     )

File c:\Users\User\miniconda3\envs\vva\lib\site-packages\transformers\models\whisper\tokenization_whisper.py:1032, in _decode_asr(tokenizer, model_outputs, return_timestamps, return_language, time_precision)
   1030     current_tokens.append(token)
   1031     if return_timestamps == "word":
-> 1032         start_time = round(token_timestamps[i] + time_offset, 2)
   1033         if i + 1 < len(token_timestamps):
   1034             end_time = round(token_timestamps[i + 1] + time_offset, 2)

IndexError: list index out of range
I've uploaded the audio that I am trying to process here.
I have opened a discussion on the Whisper hub page, where I have implemented a fix that seems to work: here.
Colab link: here
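For context, the crash happens because `_decode_asr` indexes `token_timestamps[i]` for every decoded token, but the timestamps list can come back shorter than the token list. The standalone sketch below reproduces the mechanism and shows the kind of bounds guard that avoids it; the helper name is hypothetical and this is not necessarily identical to the fix in the linked discussion.

```python
def pair_words_with_timestamps(tokens, token_timestamps, time_offset=0.0):
    """Pair each token with (start, end) times, guarding against a
    timestamps list that is shorter than the token list."""
    out = []
    for i, token in enumerate(tokens):
        if i >= len(token_timestamps):
            # Without this guard, the next line is the one that raises
            # IndexError in _decode_asr:
            #     start_time = round(token_timestamps[i] + time_offset, 2)
            break
        start_time = round(token_timestamps[i] + time_offset, 2)
        if i + 1 < len(token_timestamps):
            end_time = round(token_timestamps[i + 1] + time_offset, 2)
        else:
            # Last timestamp: fall back to the start time.
            end_time = start_time
        out.append((token, start_time, end_time))
    return out

# One timestamp fewer than tokens: the unguarded loop would raise IndexError.
print(pair_words_with_timestamps(["hej", "med", "dig"], [0.0, 0.42]))
# → [('hej', 0.0, 0.42), ('med', 0.42, 0.42)]
```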
Expected behavior
- Load the 'whisper-large-v3' AutoModelForSpeechSeq2Seq model and send it to the GPU
- Set up the model config with return_timestamps="word", amongst other settings
- Run the audio through the pipeline, which returns the transcription together with the word-level timestamp chunks
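Concretely, I would expect output of the shape below: the full text plus one chunk per word with a `(start, end)` timestamp tuple. The text and timestamp values here are illustrative only, not taken from the actual audio.

```python
# Illustrative shape of the expected pipeline output with
# return_timestamps="word".
expected_shape = {
    "text": " Hej med dig.",
    "chunks": [
        {"text": " Hej", "timestamp": (0.0, 0.42)},
        {"text": " med", "timestamp": (0.42, 0.66)},
        {"text": " dig.", "timestamp": (0.66, 1.1)},
    ],
}

# Every chunk carries its own word-level timestamp.
assert all("timestamp" in chunk for chunk in expected_shape["chunks"])
```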