
OVModelForWhisper cannot extract word level timestamps

Open zkvsky opened this issue 2 years ago • 2 comments

System Info

optimum-1.17.0.dev0
openvino-2023.3.0
transformers-4.37.2
python 3.11

Who can help?

@philschmid

Information

  • [X] The official example scripts
  • [X] My own modified scripts

Tasks

  • [X] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [X] My own task or dataset (give details below)

Reproduction (minimal, reproducible, runnable)

Setting the pipeline argument return_timestamps="word" results in failure. The model is openai/whisper-small, which is properly configured to output token-level timestamps. Backend is OpenVINO on GPU. Without the word-level argument the task finishes.

Resulting error:

Traceback (most recent call last):
  File "/run/media/greggy/1a4fd6d7-1f9d-42c6-9324-661804695013/D/owisp/./n2.py", line 58, in <module>
    result = pipe("./4.wav")
             ^^^^^^^^^^^^^^^
  File "/home/greggy/.local/lib/python3.11/site-packages/transformers/pipelines/automatic_speech_recognition.py", line 292, in __call__
    return super().__call__(inputs, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/greggy/.local/lib/python3.11/site-packages/transformers/pipelines/base.py", line 1154, in __call__
    return next(
           ^^^^^
  File "/home/greggy/.local/lib/python3.11/site-packages/transformers/pipelines/pt_utils.py", line 124, in __next__
    item = next(self.iterator)
           ^^^^^^^^^^^^^^^^^^^
  File "/home/greggy/.local/lib/python3.11/site-packages/transformers/pipelines/pt_utils.py", line 266, in __next__
    processed = self.infer(next(self.iterator), **self.params)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/greggy/.local/lib/python3.11/site-packages/transformers/pipelines/base.py", line 1068, in forward
    model_outputs = self._forward(model_inputs, **forward_params)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/greggy/.local/lib/python3.11/site-packages/transformers/pipelines/automatic_speech_recognition.py", line 507, in _forward
    tokens = self.model.generate(
             ^^^^^^^^^^^^^^^^^^^^
  File "/home/greggy/.local/lib/python3.11/site-packages/optimum/intel/openvino/modeling_seq2seq.py", line 1018, in generate
    outputs["token_timestamps"] = self._extract_token_timestamps(
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: '_OVModelForWhisper' object has no attribute '_extract_token_timestamps'
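The missing method does exist on transformers' WhisperForConditionalGeneration, so one (untested) workaround idea is to borrow it onto the OV model instance via Python's descriptor protocol. Whether the real _extract_token_timestamps would actually succeed against OpenVINO generate outputs (it needs cross-attentions and alignment heads) is not verified here; the snippet below only demonstrates the binding pattern on stand-in classes (DonorModel and OVStub are made-up names):

```python
# Demonstration of borrowing a method from one class onto an instance of
# another via the descriptor protocol. In the real case the donor would be
# transformers' WhisperForConditionalGeneration and the target would be the
# OVModelForSpeechSeq2Seq instance that lacks _extract_token_timestamps.

class DonorModel:
    def _extract_token_timestamps(self, outputs):
        # Placeholder body; the real method aligns cross-attention weights
        # with generated tokens to produce per-token timestamps.
        return [round(0.02 * i, 2) for i, _ in enumerate(outputs)]

class OVStub:
    pass  # has no _extract_token_timestamps, like _OVModelForWhisper here

ov_model = OVStub()
# __get__ binds the plain function to ov_model, producing a bound method:
ov_model._extract_token_timestamps = DonorModel._extract_token_timestamps.__get__(ov_model)

print(ov_model._extract_token_timestamps(["a", "b", "c"]))  # [0.0, 0.02, 0.04]
```

This only papers over the AttributeError; the proper fix is for optimum-intel to implement (or inherit) the method.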

My script

from transformers import WhisperProcessor, GenerationConfig, pipeline
from optimum.intel.openvino import OVModelForSpeechSeq2Seq
from pathlib import Path
import json

# Load the Whisper processor and generation config
model_id = "openai/whisper-small"
processor = WhisperProcessor.from_pretrained(model_id)
generation_config = GenerationConfig.from_pretrained(model_id)

# Configure OpenVINO model
ov_config = {"CACHE_DIR": ""}
model_path = Path(model_id.replace('/', '_'))

if not model_path.exists():
    ov_model = OVModelForSpeechSeq2Seq.from_pretrained(
        model_id, ov_config=ov_config, export=True, compile=False, load_in_8bit=False
    )
    ov_model.half()
    ov_model.save_pretrained(model_path)
else:
    ov_model = OVModelForSpeechSeq2Seq.from_pretrained(
        model_path, ov_config=ov_config, compile=False
    )

ov_model.generation_config = generation_config

# Choose OpenVINO device
device = 'GPU'  # use 'CPU' to run on CPU instead
ov_model.to(device)
ov_model.compile()

# Configure pipeline
pipe = pipeline(
    "automatic-speech-recognition",
    model=ov_model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    chunk_length_s=30,
    stride_length_s=[4, 2],
    return_timestamps="word"
)


# Transcribe the audio
result = pipe("./4.wav")
# Save result to JSON
with open("sample.json", "w") as outfile:
    json.dump(result, outfile)

print(result["text"])

Expected behavior

The pipeline outputs word/token-level timestamps properly.
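For reference, a successful return_timestamps="word" run with the transformers ASR pipeline returns a dict with the full text plus per-word chunks; the snippet below shows that shape with made-up values:

```python
# Illustrative result shape for return_timestamps="word" (values are made up):
result = {
    "text": " Hello world.",
    "chunks": [
        {"text": " Hello", "timestamp": (0.0, 0.42)},
        {"text": " world.", "timestamp": (0.5, 0.9)},
    ],
}

# Each chunk carries one word and its (start, end) offsets in seconds:
for chunk in result["chunks"]:
    start, end = chunk["timestamp"]
    print(f"{start:5.2f} - {end:5.2f}  {chunk['text'].strip()}")
```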

zkvsky avatar Feb 13 '24 15:02 zkvsky

cc @echarlaix

fxmarty avatar Feb 16 '24 13:02 fxmarty

Here's a more concise example https://github.com/huggingface/optimum-intel/issues/561 (I wasn't sure where to report the issue initially)

zkvsky avatar Feb 16 '24 22:02 zkvsky

Closing this to keep the discussion in https://github.com/huggingface/optimum-intel/issues/561

echarlaix avatar Mar 21 '24 14:03 echarlaix