optimum
OVModelForWhisper cannot extract word level timestamps
System Info
optimum-1.17.0.dev0
openvino-2023.3.0
transformers-4.37.2
python 3.11
Who can help?
@philschmid
Information
- [X] The official example scripts
- [X] My own modified scripts
Tasks
- [X] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- [X] My own task or dataset (give details below)
Reproduction (minimal, reproducible, runnable)
Passing return_timestamps="word" to the pipeline results in a failure. The model is openai/whisper-small, which is properly configured to output token-level timestamps. The backend is OpenVINO on GPU. Without the word-level argument, the task finishes.
Resulting error:
Traceback (most recent call last):
  File "/run/media/greggy/1a4fd6d7-1f9d-42c6-9324-661804695013/D/owisp/./n2.py", line 58, in <module>
    result = pipe("./4.wav")
             ^^^^^^^^^^^^^^^
  File "/home/greggy/.local/lib/python3.11/site-packages/transformers/pipelines/automatic_speech_recognition.py", line 292, in __call__
    return super().__call__(inputs, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/greggy/.local/lib/python3.11/site-packages/transformers/pipelines/base.py", line 1154, in __call__
    return next(
           ^^^^^
  File "/home/greggy/.local/lib/python3.11/site-packages/transformers/pipelines/pt_utils.py", line 124, in __next__
    item = next(self.iterator)
           ^^^^^^^^^^^^^^^^^^^
  File "/home/greggy/.local/lib/python3.11/site-packages/transformers/pipelines/pt_utils.py", line 266, in __next__
    processed = self.infer(next(self.iterator), **self.params)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/greggy/.local/lib/python3.11/site-packages/transformers/pipelines/base.py", line 1068, in forward
    model_outputs = self._forward(model_inputs, **forward_params)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/greggy/.local/lib/python3.11/site-packages/transformers/pipelines/automatic_speech_recognition.py", line 507, in _forward
    tokens = self.model.generate(
             ^^^^^^^^^^^^^^^^^^^^
  File "/home/greggy/.local/lib/python3.11/site-packages/optimum/intel/openvino/modeling_seq2seq.py", line 1018, in generate
    outputs["token_timestamps"] = self._extract_token_timestamps(
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: '_OVModelForWhisper' object has no attribute '_extract_token_timestamps'
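The last frame shows `generate` calling `_extract_token_timestamps`, which the wrapper class does not define. Below is a minimal sketch of a pre-flight check, assuming that the presence of this private method is what decides whether word-level timestamps can work. `supports_token_timestamps` and the two stand-in classes are hypothetical names for illustration only; they are not part of optimum or transformers.

```python
def supports_token_timestamps(model) -> bool:
    """Return True if the model exposes a callable `_extract_token_timestamps`."""
    return callable(getattr(model, "_extract_token_timestamps", None))


class WithHelper:
    """Stand-in for a model class that defines the helper."""

    def _extract_token_timestamps(self, *args, **kwargs):
        return []


class WithoutHelper:
    """Stand-in for a wrapper class that lacks the helper, as in the traceback."""


if __name__ == "__main__":
    print(supports_token_timestamps(WithHelper()))     # True
    print(supports_token_timestamps(WithoutHelper()))  # False
```

Calling `supports_token_timestamps(ov_model)` before building the pipeline would indicate whether return_timestamps="word" can be expected to get past this point.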
My script
from transformers import WhisperProcessor, WhisperForConditionalGeneration, GenerationConfig, pipeline
from optimum.intel.openvino import OVModelForSpeechSeq2Seq
from pathlib import Path
import openvino as ov
import json
import whisper
# Load the Whisper model
model_id = "openai/whisper-small"
model = WhisperForConditionalGeneration.from_pretrained(model_id)
processor = WhisperProcessor.from_pretrained(model_id)
generation_config = GenerationConfig.from_pretrained(model_id)
# Load audio
audio = whisper.load_audio("./4.wav")
input_features = processor(audio, return_tensors="pt").input_features
# Configure OpenVINO model
ov_config = {"CACHE_DIR": ""}
model_path = Path(model_id.replace('/', '_'))
if not model_path.exists():
    ov_model = OVModelForSpeechSeq2Seq.from_pretrained(
        model_id, ov_config=ov_config, export=True, compile=False, load_in_8bit=False
    )
    ov_model.half()
    ov_model.save_pretrained(model_path)
else:
    ov_model = OVModelForSpeechSeq2Seq.from_pretrained(
        model_path, ov_config=ov_config, compile=False
    )
ov_model.generation_config = generation_config
# Choose device
device = 'GPU'  # Change this to 'CPU' to run on the CPU instead
ov_model.to(device)
ov_model.compile()
# Configure pipeline
pipe = pipeline(
    "automatic-speech-recognition",
    model=ov_model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    chunk_length_s=30,
    stride_length_s=[4, 2],
    return_timestamps="word"
)
# Transcribe the audio
result = pipe("./4.wav")
# Save result to JSON
with open("sample.json", "w") as outfile:
    json.dump(result, outfile)
print(result["text"])
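Until word-level timestamps work, one possible stopgap is to fall back to segment-level timestamps when the word-level call fails. This is only a sketch. It assumes the pipeline accepts `return_timestamps` per call, that the failure surfaces as the `AttributeError` shown above, and that segment-level timestamps work with this backend, which this report does not confirm. `transcribe_with_fallback` is a hypothetical helper name.

```python
from typing import Any, Callable, Dict, Tuple


def transcribe_with_fallback(
    pipe: Callable[..., Dict[str, Any]], audio_path: str
) -> Tuple[Dict[str, Any], str]:
    """Try word-level timestamps first, then fall back to segment-level.

    Returns the pipeline result and the timestamp mode that succeeded.
    """
    try:
        return pipe(audio_path, return_timestamps="word"), "word"
    except AttributeError:
        # Assumption: the missing `_extract_token_timestamps` helper is the
        # only cause of the failure, so segment-level output may still work.
        return pipe(audio_path, return_timestamps=True), "segment"
```

With the script above this would be called as `result, mode = transcribe_with_fallback(pipe, "./4.wav")`, and `mode` records which timestamp granularity was actually produced.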
Expected behavior
Pipeline outputs word/token level timestamps properly.
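For reference, here is a sketch of what consuming the expected output could look like. It assumes the result follows the usual transformers ASR pipeline shape: a "chunks" list whose entries carry "text" and a (start, end) "timestamp" in seconds. The sample values are invented for illustration and are not output from 4.wav. `format_word_timestamps` is a hypothetical helper name.

```python
from typing import Any, Dict, List


def format_word_timestamps(result: Dict[str, Any]) -> List[str]:
    """Render each word chunk as 'start-end: word'."""
    lines = []
    for chunk in result.get("chunks", []):
        start, end = chunk["timestamp"]
        # The end time may be missing for a trailing chunk; show '?' then.
        end_text = f"{end:.2f}" if end is not None else "?"
        lines.append(f"{start:.2f}-{end_text}: {chunk['text'].strip()}")
    return lines


# Hypothetical sample, made up for illustration only.
sample = {
    "text": " hello world",
    "chunks": [
        {"text": " hello", "timestamp": (0.0, 0.5)},
        {"text": " world", "timestamp": (0.5, 1.0)},
    ],
}

if __name__ == "__main__":
    for line in format_word_timestamps(sample):
        print(line)
```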
cc @echarlaix
Here's a more concise example https://github.com/huggingface/optimum-intel/issues/561 (I wasn't sure where to report the issue initially)
Closing it to keep discussion in https://github.com/huggingface/optimum-intel/issues/561