
Error when running pipeline with whisper and using the 'return_dict_in_generate=True' option

Open · panagiotidi opened this issue 1 year ago • 7 comments

System Info

  • transformers version: 4.26.1
  • Platform: macOS-13.1-x86_64-i386-64bit
  • Python version: 3.9.16
  • Huggingface_hub version: 0.12.1
  • PyTorch version (GPU?): 1.13.0 (False)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: no
  • Using distributed or parallel set-up in script?: no

Who can help?

@sanchit-gandhi @Narsil

When running a simple Whisper pipeline with the options 'return_dict_in_generate': True and 'output_scores': True, e.g.:

from transformers import pipeline

audio_path = 'xxx.wav'

generate_kwargs = {'temperature': 1, 'max_length': 448, 'return_dict_in_generate': True, 'output_scores': True}

pipe = pipeline(
    model="openai/whisper-small",
    chunk_length_s=10,
    framework="pt",
    batch_size=1
)

print(pipe(audio_path, return_timestamps=True, generate_kwargs=generate_kwargs))

I am getting the following error:

Traceback (most recent call last):
  File "/Users/sofia/PycharmProjects/openAI-whisper/test4.py", line 39, in <module>
    print(pipe(audio_path, return_timestamps=True, generate_kwargs=generate_kwargs))
  File "/Users/sofia/miniforge3/envs/openAI-whisper/lib/python3.9/site-packages/transformers/pipelines/automatic_speech_recognition.py", line 378, in __call__
    return super().__call__(inputs, **kwargs)
  File "/Users/sofia/miniforge3/envs/openAI-whisper/lib/python3.9/site-packages/transformers/pipelines/base.py", line 1076, in __call__
    return next(
  File "/Users/sofia/miniforge3/envs/openAI-whisper/lib/python3.9/site-packages/transformers/pipelines/pt_utils.py", line 125, in __next__
    processed = self.infer(item, **self.params)
  File "/Users/sofia/miniforge3/envs/openAI-whisper/lib/python3.9/site-packages/transformers/pipelines/automatic_speech_recognition.py", line 611, in postprocess
    items = outputs[key].numpy()
AttributeError: 'ModelOutput' object has no attribute 'numpy'

Information

  • [ ] The official example scripts
  • [X] My own modified scripts

Tasks

  • [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [X] My own task or dataset (give details below)

Reproduction

  1. Run the code snippet shown in the description above.

Expected behavior

I expect to get the transcribed text accompanied by the timestamps and the prediction scores.

panagiotidi, Mar 14 '23

cc @ArthurZucker

amyeroberts, Mar 14 '23

Hey! Thanks for reporting. This is expected, as the pipeline does not support returning the usual dictionary. We should probably prevent this behaviour (raise an error when return_dict_in_generate is set in the pipeline). cc @Narsil, this is a duplicate of another issue, but I can't find it! Edit: #21185

ArthurZucker, Mar 14 '23

The best recommendation in the meantime is to define a custom pipeline, where you process the inputs before feeding them to super.preprocess!

ArthurZucker, Mar 14 '23

The best recommendation in the meantime is to define a custom pipeline, where you process the inputs before feeding them to super.preprocess!

Thanks for your reply, I now understand the issue.

However, I am not sure how to preprocess the input to achieve this. I can see the output, and the dictionary still contains the tokens (inside the ModelOutput):

{'tokens': ModelOutput([('sequences', tensor([[50258, 50342, 50358, 50364,  1044,   291,   337,  1976,     0, 50864,
         50257]])), ('scores', (tensor([[2.3064,   -inf,   -inf,  ..., 2.8053, 2.7866, 3.3406]]), tensor([[3.7724,   -inf,   -inf,  ..., 3.1328, 3.6590, 3.8489]]), tensor([[    -inf,     -inf,     -inf,  ...,  -7.8979,  -7.7944, -11.4352]]), tensor([[-5.0041,    -inf,    -inf,  ..., -5.5928, -5.6329, -6.7607]]), tensor([[16.9060,    -inf,    -inf,  ...,    -inf,    -inf,    -inf]]), tensor([[ 4.7684,    -inf,    -inf,  ..., -4.7718, -4.7031, -6.6440]]), tensor([[ 3.5967,    -inf,    -inf,  ..., -0.2559, -0.4887, -1.7837]]), tensor([[  1.7885,     -inf,     -inf,  ...,  -8.9040,  -8.4750, -12.0667]]), tensor([[    -inf,     -inf,     -inf,  ..., -15.8636, -15.3132, -18.1436]]), tensor([[   -inf,    -inf,    -inf,  ..., 13.3971, 12.9880, 10.2999]])))]), 'stride': (160000, 0, 26667)}

and it fails when it tries to execute outputs["tokens"].numpy(). Do you mean I should maybe post-process the output instead?

panagiotidi, Mar 15 '23

Hi @panagiotidi , thanks for raising this issue.

Yes, in this case, as the error is being raised in the postprocess method, this is the one you'd need to adapt. Generally, for custom workflows, it's probably easier to start with a lower-level API such as AutoModel to define your steps, and then move to something like a custom pipeline.
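For illustration, here is a minimal, untested sketch of such an adaptation. It assumes (as the dump above suggests) that each chunk dict carries the generate output under the 'tokens' key; the subclass name is made up, and the scores are simply collected and attached to the final result:

from transformers import AutomaticSpeechRecognitionPipeline, pipeline

class ScoredASRPipeline(AutomaticSpeechRecognitionPipeline):
    # Hypothetical subclass: unwrap the ModelOutput produced when
    # return_dict_in_generate=True, keep the scores aside, and let the
    # base class handle the plain token tensors as usual.
    def postprocess(self, model_outputs, **kwargs):
        all_scores = []
        for output in model_outputs:
            tokens = output["tokens"]
            if hasattr(tokens, "sequences"):  # it is a ModelOutput
                all_scores.append(tokens.scores)
                output["tokens"] = tokens.sequences
        result = super().postprocess(model_outputs, **kwargs)
        result["scores"] = all_scores  # attach the raw per-step scores
        return result

# pipeline() accepts a pipeline_class argument for custom subclasses:
pipe = pipeline(
    model="openai/whisper-small",
    chunk_length_s=10,
    framework="pt",
    batch_size=1,
    pipeline_class=ScoredASRPipeline,
)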

If all that you want to do is automatic speech recognition with the audio input, removing return_dict_in_generate from the generate_kwargs will work, i.e.:

from transformers import pipeline

audio_path = 'xxx.wav'

generate_kwargs = {'temperature': 1, 'max_length': 448, 'output_scores': True}

pipe = pipeline(
    model="openai/whisper-small",
    chunk_length_s=10,
    framework="pt",
    batch_size=1
)

print(pipe(audio_path, return_timestamps=True, generate_kwargs=generate_kwargs))

amyeroberts, Mar 15 '23

I am actually trying to implement the --logprob_threshold from the original Whisper paper, as I would like to experiment with it when transcribing. There is a relevant discussion here, but, as you said, implementing it in a pipeline requires a custom implementation of postprocess on the output results.

Will you maybe include this in later versions?

panagiotidi, Mar 16 '23

@panagiotidi I don't know of any plans to add this at the moment. As this is a specific generation case, it's not something that's likely to be included in a pipeline.

If I've understood --logprob_threshold correctly, the desire is to stop generation if the average logprob is below a certain threshold. In this case, a custom Constraint class could be implemented and passed in via the generate_kwargs. Questions about an implementation of this are probably best placed in the forums.
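As one possible direction, here is an untested sketch. Note that it swaps in transformers' StoppingCriteria interface rather than a Constraint, since stopping on a score threshold maps more naturally onto that hook; the class name and threshold handling are hypothetical, and the scores tuple is only populated when generate() runs with output_scores=True and return_dict_in_generate=True:

import torch
from transformers import StoppingCriteria, StoppingCriteriaList

class LogProbThresholdCriteria(StoppingCriteria):
    # Hypothetical criterion: stop once the average log-probability of the
    # tokens generated so far drops below a threshold, loosely mirroring
    # Whisper's --logprob_threshold.
    def __init__(self, threshold=-1.0, prompt_length=1):
        self.threshold = threshold
        self.prompt_length = prompt_length  # prompt tokens to skip at the front

    def __call__(self, input_ids, scores, **kwargs):
        # `scores` holds one logits tensor per generated step; it is empty
        # or None unless generate() was given output_scores=True and
        # return_dict_in_generate=True.
        if not scores:
            return False
        generated = input_ids[0, self.prompt_length:]
        logprobs = [
            torch.log_softmax(step[0], dim=-1)[token]
            for step, token in zip(scores, generated)
        ]
        avg_logprob = torch.stack(logprobs).mean()
        return bool(avg_logprob < self.threshold)

# Passed to generate (or forwarded via a pipeline's generate_kwargs):
# stopping_criteria=StoppingCriteriaList([LogProbThresholdCriteria(-1.0)])

Whether this exactly matches the paper's per-segment averaging would need checking against the original Whisper implementation.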

As mentioned above, when applying custom code, it is easier to work from the AutoModel level first, e.g., adapting the examples in the docs.
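For example, a minimal sketch at that level (model name taken from the thread; the dummy waveform stands in for real 16 kHz audio, which you would load yourself, e.g. with librosa or torchaudio):

import numpy as np
from transformers import WhisperForConditionalGeneration, WhisperProcessor

processor = WhisperProcessor.from_pretrained("openai/whisper-small")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

# Stand-in for one second of real 16 kHz mono audio.
waveform = np.zeros(16000, dtype=np.float32)
inputs = processor(waveform, sampling_rate=16000, return_tensors="pt")

# At this level, return_dict_in_generate works as usual and exposes the
# per-step scores alongside the token sequences.
out = model.generate(
    inputs.input_features,
    max_length=448,
    return_dict_in_generate=True,
    output_scores=True,
)
text = processor.batch_decode(out.sequences, skip_special_tokens=True)
print(text, len(out.scores))  # transcript plus one score tensor per step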

amyeroberts, Mar 16 '23

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot], Apr 14 '23