Error when running pipeline with whisper and using the 'return_dict_in_generate=True' option
System Info
- `transformers` version: 4.26.1
- Platform: macOS-13.1-x86_64-i386-64bit
- Python version: 3.9.16
- Huggingface_hub version: 0.12.1
- PyTorch version (GPU?): 1.13.0 (False)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: no
- Using distributed or parallel set-up in script?: no
Who can help?
@sanchit-gandhi @Narsil
When running a simple Whisper pipeline with the options `'return_dict_in_generate': True` and `'output_scores': True`, e.g.,
```python
from pathlib import Path
from transformers import pipeline, AutomaticSpeechRecognitionPipeline, Pipeline, GenerationConfig

audio_path = 'xxx.wav'
generate_kwargs = {'temperature': 1, 'max_length': 448, 'return_dict_in_generate': True, 'output_scores': True}
pipe = pipeline(
    model="openai/whisper-small",
    chunk_length_s=10,
    framework="pt",
    batch_size=1,
)
print(pipe(audio_path, return_timestamps=True, generate_kwargs=generate_kwargs))
```
I am getting the following error:
```
Traceback (most recent call last):
  File "/Users/sofia/PycharmProjects/openAI-whisper/test4.py", line 39, in <module>
    print(pipe(audio_path, return_timestamps=True, generate_kwargs=generate_kwargs))
  File "/Users/sofia/miniforge3/envs/openAI-whisper/lib/python3.9/site-packages/transformers/pipelines/automatic_speech_recognition.py", line 378, in __call__
    return super().__call__(inputs, **kwargs)
  File "/Users/sofia/miniforge3/envs/openAI-whisper/lib/python3.9/site-packages/transformers/pipelines/base.py", line 1076, in __call__
    return next(
  File "/Users/sofia/miniforge3/envs/openAI-whisper/lib/python3.9/site-packages/transformers/pipelines/pt_utils.py", line 125, in __next__
    processed = self.infer(item, **self.params)
  File "/Users/sofia/miniforge3/envs/openAI-whisper/lib/python3.9/site-packages/transformers/pipelines/automatic_speech_recognition.py", line 611, in postprocess
    items = outputs[key].numpy()
AttributeError: 'ModelOutput' object has no attribute 'numpy'
```
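For context, the error happens because with `return_dict_in_generate=True`, `generate()` returns a dict-like `ModelOutput` (with the token ids under `sequences`) instead of a plain tensor, while the pipeline's `postprocess` expects something with a `.numpy()` method. A minimal sketch of the mismatch, with plain Python objects standing in for the real types (names here are illustrative):

```python
class FakeTensor:
    """Stands in for a torch.Tensor, which has a .numpy() method."""
    def numpy(self):
        return [50258, 50364]

# Without return_dict_in_generate, generate() hands back token ids directly:
tokens_plain = FakeTensor()

# With return_dict_in_generate=True, generate() hands back a dict-like
# ModelOutput whose token ids live under the 'sequences' key:
tokens_wrapped = {"sequences": FakeTensor(), "scores": (0.1, 0.2)}

# postprocess calls .numpy() on whatever was stored under "tokens":
print(hasattr(tokens_plain, "numpy"))    # True  -> works
print(hasattr(tokens_wrapped, "numpy"))  # False -> AttributeError
```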
Information
- [ ] The official example scripts
- [X] My own modified scripts
Tasks
- [ ] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [X] My own task or dataset (give details below)
Reproduction
- Run the code:

```python
from pathlib import Path
from transformers import pipeline, AutomaticSpeechRecognitionPipeline, Pipeline, GenerationConfig

audio_path = 'xxx.wav'
generate_kwargs = {'temperature': 1, 'max_length': 448, 'return_dict_in_generate': True, 'output_scores': True}
pipe = pipeline(
    model="openai/whisper-small",
    chunk_length_s=10,
    framework="pt",
    batch_size=1,
)
print(pipe(audio_path, return_timestamps=True, generate_kwargs=generate_kwargs))
```
Expected behavior
I expect to get the text result accompanied by the timestamps and the prediction scores.
cc @ArthurZucker
Hey! Thanks for reporting. This is normal, as the pipeline does not support returning the usual dictionary. We should probably prevent this behaviour (raise an error when `return_dict_in_generate` is set in the pipeline). cc @Narsil this is a duplicate of another issue but I can't find it!
edit: #21185
The best recommendation in the meantime is to define a custom pipeline, where you process the inputs before feeding them to `super().preprocess`!
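Since the traceback shows the failure inside `postprocess`, one shape the custom-pipeline workaround could take is unwrapping the dict-like result back to the raw token ids before the stock postprocessing runs. A minimal sketch, with a plain dict standing in for the `ModelOutput` and a hypothetical helper name (not tested against the real pipeline internals):

```python
def unwrap_tokens(chunk):
    """Hypothetical helper for a custom pipeline's postprocess: if the
    forward pass stored a dict-like ModelOutput under 'tokens' (which
    happens when return_dict_in_generate=True), replace it with the raw
    token ids from 'sequences' and keep the scores alongside, so a
    subsequent outputs['tokens'].numpy() call can succeed."""
    tokens = chunk["tokens"]
    if not hasattr(tokens, "numpy"):  # dict-like ModelOutput, not a tensor
        chunk["scores"] = tokens.get("scores")
        chunk["tokens"] = tokens["sequences"]
    return chunk
```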
> Best recommendation in the mean time is to define a custom pipeline, where you process the inputs before feeding them to `super.preprocess`!
Thanks for your reply, I now understand the issue.
However, I am not sure how to preprocess the input to achieve this. I can see the output, and the dictionary still contains the tokens (inside the `ModelOutput`):
```
{'tokens': ModelOutput([('sequences', tensor([[50258, 50342, 50358, 50364, 1044, 291, 337, 1976, 0, 50864,
50257]])), ('scores', (tensor([[2.3064, -inf, -inf, ..., 2.8053, 2.7866, 3.3406]]), tensor([[3.7724, -inf, -inf, ..., 3.1328, 3.6590, 3.8489]]), tensor([[ -inf, -inf, -inf, ..., -7.8979, -7.7944, -11.4352]]), tensor([[-5.0041, -inf, -inf, ..., -5.5928, -5.6329, -6.7607]]), tensor([[16.9060, -inf, -inf, ..., -inf, -inf, -inf]]), tensor([[ 4.7684, -inf, -inf, ..., -4.7718, -4.7031, -6.6440]]), tensor([[ 3.5967, -inf, -inf, ..., -0.2559, -0.4887, -1.7837]]), tensor([[ 1.7885, -inf, -inf, ..., -8.9040, -8.4750, -12.0667]]), tensor([[ -inf, -inf, -inf, ..., -15.8636, -15.3132, -18.1436]]), tensor([[ -inf, -inf, -inf, ..., 13.3971, 12.9880, 10.2999]])))]), 'stride': (160000, 0, 26667)}
```
and where it fails is when it tries to execute `outputs["tokens"].numpy()`. Do you maybe mean to post-process the output?
Hi @panagiotidi, thanks for raising this issue.
Yes, in this case, as the error is being raised in the `postprocess` method, this is the one you'd need to adapt. Generally for custom workflows, it's probably easier to start with a lower-level API such as `AutoModel` to define your steps and then move to something like a custom pipeline.
If all you want to do is automatic speech recognition on the audio input, removing `return_dict_in_generate` from the `generate_kwargs` will work, i.e.:
```python
from pathlib import Path
from transformers import pipeline, AutomaticSpeechRecognitionPipeline, Pipeline, GenerationConfig

audio_path = 'xxx.wav'
generate_kwargs = {'temperature': 1, 'max_length': 448, 'output_scores': True}
pipe = pipeline(
    model="openai/whisper-small",
    chunk_length_s=10,
    framework="pt",
    batch_size=1,
)
print(pipe(audio_path, return_timestamps=True, generate_kwargs=generate_kwargs))
```
I am actually trying to implement the `--logprob_threshold` from the original Whisper paper, as I would like to be able to experiment with it when transcribing. There is a relevant discussion here but, as you said too, in order to implement it in a pipeline, a custom implementation of postprocess is needed on the output results.
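For reference, the check itself is simple once per-token log-probabilities are available (e.g. by taking the log-softmax of each `output_scores` tensor at the chosen token id). A sketch of the threshold test, assuming the `-1.0` default used in the original Whisper implementation (function name illustrative):

```python
def passes_logprob_threshold(token_logprobs, threshold=-1.0):
    """Treat a transcription as reliable only when the mean per-token
    log-probability is at or above the threshold (-1.0 is the default
    in the original openai/whisper code)."""
    avg = sum(token_logprobs) / len(token_logprobs)
    return avg >= threshold

print(passes_logprob_threshold([-0.1, -0.3]))   # True: average is -0.2
print(passes_logprob_threshold([-2.0, -3.0]))   # False: average is -2.5
```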
Will you maybe include this in later versions?
@panagiotidi I don't know of any plans to add this at the moment. As this is a specific generation case, it's not something that's likely to be included in a pipeline.
If I've understood `--logprob_threshold` correctly, the desire is to stop generation if the average logprob is below a certain threshold. In this case, a custom `Constraint` class could be implemented and passed in via the `generate_kwargs`. Questions about an implementation of this are probably best placed in the forums.
As mentioned above, when applying custom code, it is easier to work from the `AutoModel` level first, e.g. adapting the examples in the docs.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.