
Error when running pipeline with whisper and using the 'return_dict_in_generate=True' option

Open · panagiotidi opened this issue 1 year ago • 7 comments

System Info

  • transformers version: 4.26.1
  • Platform: macOS-13.1-x86_64-i386-64bit
  • Python version: 3.9.16
  • Huggingface_hub version: 0.12.1
  • PyTorch version (GPU?): 1.13.0 (False)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: no
  • Using distributed or parallel set-up in script?: no

Who can help?

@sanchit-gandhi @Narsil

When running a simple Whisper pipeline with the options 'return_dict_in_generate': True and 'output_scores': True, e.g.:

from transformers import pipeline

audio_path = 'xxx.wav'

generate_kwargs = {'temperature': 1, 'max_length': 448, 'return_dict_in_generate': True, 'output_scores': True}

pipe = pipeline(
    model="openai/whisper-small",
    chunk_length_s=10,
    framework="pt",
    batch_size=1
)

print(pipe(audio_path, return_timestamps=True, generate_kwargs=generate_kwargs))

I am getting the following error:

Traceback (most recent call last):
  File "/Users/sofia/PycharmProjects/openAI-whisper/test4.py", line 39, in <module>
    print(pipe(audio_path, return_timestamps=True, generate_kwargs=generate_kwargs))
  File "/Users/sofia/miniforge3/envs/openAI-whisper/lib/python3.9/site-packages/transformers/pipelines/automatic_speech_recognition.py", line 378, in __call__
    return super().__call__(inputs, **kwargs)
  File "/Users/sofia/miniforge3/envs/openAI-whisper/lib/python3.9/site-packages/transformers/pipelines/base.py", line 1076, in __call__
    return next(
  File "/Users/sofia/miniforge3/envs/openAI-whisper/lib/python3.9/site-packages/transformers/pipelines/pt_utils.py", line 125, in __next__
    processed = self.infer(item, **self.params)
  File "/Users/sofia/miniforge3/envs/openAI-whisper/lib/python3.9/site-packages/transformers/pipelines/automatic_speech_recognition.py", line 611, in postprocess
    items = outputs[key].numpy()
AttributeError: 'ModelOutput' object has no attribute 'numpy'

Information

  • [ ] The official example scripts
  • [X] My own modified scripts

Tasks

  • [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [X] My own task or dataset (give details below)

Reproduction

  1. Run the code snippet shown in the description above.

Expected behavior

I expect to get the transcribed text accompanied by the timestamps and the prediction scores.

panagiotidi, Mar 14 '23

cc @ArthurZucker

amyeroberts, Mar 14 '23

Hey! Thanks for reporting. This is expected, as the pipeline does not support returning the usual dictionary. We should probably prevent this behaviour (raise an error when return_dict_in_generate is set in the pipeline). cc @Narsil, this is a duplicate of another issue, but I can't find it! Edit: #21185

ArthurZucker, Mar 14 '23

The best recommendation in the meantime is to define a custom pipeline, where you process the inputs before feeding them to super.preprocess!

ArthurZucker, Mar 14 '23

The best recommendation in the meantime is to define a custom pipeline, where you process the inputs before feeding them to super.preprocess!

Thanks for your reply, I now understand the issue.

However, I am not sure how to preprocess the input to achieve this. I can see the output, and the dictionary still contains the tokens (inside the ModelOutput):

{'tokens': ModelOutput([('sequences', tensor([[50258, 50342, 50358, 50364,  1044,   291,   337,  1976,     0, 50864,
         50257]])), ('scores', (tensor([[2.3064,   -inf,   -inf,  ..., 2.8053, 2.7866, 3.3406]]), tensor([[3.7724,   -inf,   -inf,  ..., 3.1328, 3.6590, 3.8489]]), tensor([[    -inf,     -inf,     -inf,  ...,  -7.8979,  -7.7944, -11.4352]]), tensor([[-5.0041,    -inf,    -inf,  ..., -5.5928, -5.6329, -6.7607]]), tensor([[16.9060,    -inf,    -inf,  ...,    -inf,    -inf,    -inf]]), tensor([[ 4.7684,    -inf,    -inf,  ..., -4.7718, -4.7031, -6.6440]]), tensor([[ 3.5967,    -inf,    -inf,  ..., -0.2559, -0.4887, -1.7837]]), tensor([[  1.7885,     -inf,     -inf,  ...,  -8.9040,  -8.4750, -12.0667]]), tensor([[    -inf,     -inf,     -inf,  ..., -15.8636, -15.3132, -18.1436]]), tensor([[   -inf,    -inf,    -inf,  ..., 13.3971, 12.9880, 10.2999]])))]), 'stride': (160000, 0, 26667)}

and it fails when it tries to execute outputs["tokens"].numpy(). Do you mean I should maybe post-process the output instead?

panagiotidi, Mar 15 '23

Hi @panagiotidi , thanks for raising this issue.

Yes, in this case, as the error is being raised in the postprocess method, this is the one you'd need to adapt. Generally, for custom workflows, it's probably easier to start with a lower-level API such as AutoModel to define your steps, and then move to something like a custom pipeline.
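For illustration, here is a minimal, untested sketch of such an adaptation. It assumes (as the dump above suggests) that each chunk dict carries the generate output under the 'tokens' key; the subclass name is made up, and the scores are simply collected and attached to the final result:

from transformers import AutomaticSpeechRecognitionPipeline, pipeline

class ScoredASRPipeline(AutomaticSpeechRecognitionPipeline):
    # Hypothetical subclass: unwrap the ModelOutput produced when
    # return_dict_in_generate=True, keep the scores aside, and let the
    # base class handle the plain token tensors as usual.
    def postprocess(self, model_outputs, **kwargs):
        all_scores = []
        for output in model_outputs:
            tokens = output["tokens"]
            if hasattr(tokens, "sequences"):  # it is a ModelOutput
                all_scores.append(tokens.scores)
                output["tokens"] = tokens.sequences
        result = super().postprocess(model_outputs, **kwargs)
        result["scores"] = all_scores  # attach the raw per-step scores
        return result

# pipeline() accepts a pipeline_class argument for custom subclasses:
pipe = pipeline(
    model="openai/whisper-small",
    chunk_length_s=10,
    framework="pt",
    batch_size=1,
    pipeline_class=ScoredASRPipeline,
)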

If all that you want to do is automatic speech recognition with the audio input, removing return_dict_in_generate from the generate_kwargs will work, i.e.:

from transformers import pipeline

audio_path = 'xxx.wav'

generate_kwargs = {'temperature': 1, 'max_length': 448, 'output_scores': True}

pipe = pipeline(
    model="openai/whisper-small",
    chunk_length_s=10,
    framework="pt",
    batch_size=1
)

print(pipe(audio_path, return_timestamps=True, generate_kwargs=generate_kwargs))

amyeroberts, Mar 15 '23

I am actually trying to implement the --logprob_threshold from the original Whisper paper, as I would like to experiment with it when transcribing. There is a relevant discussion here, but, as you said, implementing it in a pipeline requires a custom implementation of postprocess on the output results.

Will you maybe include this in later versions?

panagiotidi, Mar 16 '23

@panagiotidi I don't know of any plans to add this at the moment. As this is a specific generation case, it's not something that's likely to be included in a pipeline.

If I've understood --logprob_threshold correctly, the desire is to stop generation if the average logprob is below a certain threshold. In this case, a custom Constraint class could be implemented and passed in via the generate_kwargs. Questions about an implementation of this are probably best placed in the forums.
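As one possible direction, here is an untested sketch. Note that it swaps in transformers' StoppingCriteria interface rather than a Constraint, since stopping on a score threshold maps more naturally onto that hook; the class name and threshold handling are hypothetical, and the scores tuple is only populated when generate() runs with output_scores=True and return_dict_in_generate=True:

import torch
from transformers import StoppingCriteria, StoppingCriteriaList

class LogProbThresholdCriteria(StoppingCriteria):
    # Hypothetical criterion: stop once the average log-probability of the
    # tokens generated so far drops below a threshold, loosely mirroring
    # Whisper's --logprob_threshold.
    def __init__(self, threshold=-1.0, prompt_length=1):
        self.threshold = threshold
        self.prompt_length = prompt_length  # prompt tokens to skip at the front

    def __call__(self, input_ids, scores, **kwargs):
        # `scores` holds one logits tensor per generated step; it is empty
        # or None unless generate() was given output_scores=True and
        # return_dict_in_generate=True.
        if not scores:
            return False
        generated = input_ids[0, self.prompt_length:]
        logprobs = [
            torch.log_softmax(step[0], dim=-1)[token]
            for step, token in zip(scores, generated)
        ]
        avg_logprob = torch.stack(logprobs).mean()
        return bool(avg_logprob < self.threshold)

# Passed to generate (or forwarded via a pipeline's generate_kwargs):
# stopping_criteria=StoppingCriteriaList([LogProbThresholdCriteria(-1.0)])

Whether this exactly matches the paper's per-segment averaging would need checking against the original Whisper implementation.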

As mentioned above, when applying custom code, it is easier to work from the AutoModel level first, e.g., adapting the examples in the docs.
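For example, a minimal sketch at that level (model name taken from the thread; the dummy waveform stands in for real 16 kHz audio, which you would load yourself, e.g. with librosa or torchaudio):

import numpy as np
from transformers import WhisperForConditionalGeneration, WhisperProcessor

processor = WhisperProcessor.from_pretrained("openai/whisper-small")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

# Stand-in for one second of real 16 kHz mono audio.
waveform = np.zeros(16000, dtype=np.float32)
inputs = processor(waveform, sampling_rate=16000, return_tensors="pt")

# At this level, return_dict_in_generate works as usual and exposes the
# per-step scores alongside the token sequences.
out = model.generate(
    inputs.input_features,
    max_length=448,
    return_dict_in_generate=True,
    output_scores=True,
)
text = processor.batch_decode(out.sequences, skip_special_tokens=True)
print(text, len(out.scores))  # transcript plus one score tensor per step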

amyeroberts, Mar 16 '23

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot], Apr 14 '23