
Whisper Word-level Timestamps broken on some inputs

Open kyle-v6x opened this issue 1 year ago β€’ 3 comments

System Info

  • transformers version: 4.38.2
  • Platform: Linux-5.15.0-1036-aws-x86_64-with-glibc2.31
  • Python version: 3.10.13
  • Huggingface_hub version: 0.21.4
  • Safetensors version: 0.4.1
  • Accelerate version: not installed
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.1.1 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: Yes
  • Using distributed or parallel set-up in script?: No

Who can help?

No response

Information

  • [ ] The official example scripts
  • [X] My own modified scripts

Tasks

  • [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [ ] My own task or dataset (give details below)

Reproduction

Audio files can be found here.

import librosa
import soundfile as sf
from transformers import pipeline, AutoModelForSpeechSeq2Seq, AutoProcessor
from torch import float16
import numpy as np

model_id = "openai/whisper-large-v3"
dtype = float16
device = "cuda:0"
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=dtype, use_flash_attention_2=False, attn_implementation="eager"
)
model.to(device)
processor = AutoProcessor.from_pretrained(model_id)
model = pipeline(
            "automatic-speech-recognition",
            model=model,
            tokenizer=processor.tokenizer,
            feature_extractor=processor.feature_extractor,
            device=device,
            batch_size=4,
            framework="pt",
            torch_dtype=float16
        )

def infer_whisper(audio_file):
    # Load the audio and resample to 16 kHz, the rate Whisper expects
    audio, sr = sf.read(audio_file, dtype=np.float32)
    if sr != 16000:
        whisper_audio = librosa.resample(audio, orig_sr=sr, target_sr=16000)
    else:
        whisper_audio = audio

    # Transcribe with word-level timestamps
    raw_transcriptions = model(whisper_audio, generate_kwargs={"task": "transcribe"}, return_timestamps="word")
    print(raw_transcriptions["chunks"])

infer_whisper("./test_16k.wav")
infer_whisper("./test_ko_new.wav")

Expected behavior

Both audio files should output correct word-level timestamps. However, the output is as follows:

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Whisper did not predict an ending timestamp, which can happen if audio is cut off in the middle of a word. Also make sure WhisperTimeStampLogitsProcessor was used during generation.
[{'text': ' κ΅­λ¬΄μ΄λ¦¬λŠ”', 'timestamp': (0.0, 0.52)}, {'text': ' ꡭ회의', 'timestamp': (0.52, 1.08)}, {'text': ' λ™μ˜λ₯Ό', 'timestamp': (1.08, ...)}, {'text': ' μ–»μ–΄', 'timestamp': (1.4, 1.72)}, {'text': ' λŒ€ν†΅λ Ήμ΄', 'timestamp': (1.72, 2.28)}, {'text': ' μž„λͺ…ν•œλ‹€.', 'timestamp': (2.28, 2.92)}, {'text': ' λŒ€λ²•κ΄€μ€', 'timestamp': (2.92, 4.26)}, {'text': ' λŒ€λ²•μ›μž₯의', 'timestamp': (4.26, 5.06)}, {'text': ' 제청으둜', 'timestamp': (5.06, 5.54)}, {'text': ' ꡭ회의', 'timestamp': (5.54, 5.96)}, {'text': ' λ™μ˜λ₯Ό', 'timestamp': (5.96, 6.32)}, {'text': ' μ–»μ–΄', 'timestamp': (6.32, 6.64)}, {'text': ..., 'timestamp': (6.64, 7.18)}, {'text': ' μž„λͺ…ν•œλ‹€.', 'timestamp': (7.18, 8.32)}, {'text': ' μ˜λ¬΄κ΅μœ‘μ€', 'timestamp': (8.32, 9.14)}, {'text': ' λ¬΄μƒμœΌλ‘œ', 'timestamp': (9.14, 9.58)}, {'text': ' ν•œλ‹€.', 'timestamp': (9.58, 10.38)}, {'text': ..., 'timestamp': (10.38, 10.96)}, {'text': ' μ•ˆμ—', 'timestamp': (10.96, 11.1)}, {'text': ' μ΄μ˜κ°€', 'timestamp': (11.1, 11.66)}, {'text': ..., 'timestamp': (11.66, 11.84)}, {'text': ' λ•Œμ—λŠ”', 'timestamp': (11.84, 12.24)}, {'text': ' λŒ€ν†΅λ Ήμ€', 'timestamp': (12.24, 13.04)}, {'text': ' 재견', 'timestamp': (13.04, 13.38)}, {'text': ' ν•­ν•΄', 'timestamp': (13.38, 13.7)}, {'text': ' κΈ°κ°„', 'timestamp': (13.7, 14.04)}, {'text': ' 내에', 'timestamp': (14.04, 14.26)}, {'text': ' μ΄μ˜μ„œλ₯Ό', 'timestamp': (14.26, 14.82)}, {'text': ' λΆ™μ—¬', 'timestamp': (14.82, 15.2)}, {'text': ' ꡭ회둜', 'timestamp': (15.2, 15.7)}, {'text': ' ν™˜λΆ€ν•˜κ³ ', 'timestamp': (15.7, 16.16)}, {'text': ' κ·Έ', 'timestamp': (16.16, 16.84)}, {'text': ' 제의λ₯Ό', 'timestamp': (16.84, 17.28)}, {'text': ' μš”κ΅¬ν• ', 'timestamp': (17.28, 17.68)}, {'text': ' 수', 'timestamp': (17.68, 17.8)}, {'text': ' μžˆλ‹€.', 'timestamp': (17.8, 18.78)}, {'text': ' ꡭ회의', 'timestamp': (18.78, 19.44)}, {'text': ' 폐회', 'timestamp': (19.44, 19.72)}, {'text': ' 쀑에도', 'timestamp': (19.72, 20.06)}, {'text': ' λ˜ν•œ', 'timestamp': (20.06, 20.42)}, {'text': ' κ°™λ‹€.', 'timestamp': (20.42, 20.96)}, {'text': ' λͺ…λ Ήκ·œμΉ™', 'timestamp': (20.96, 22.16)}, {'text': ' λ˜λŠ”', 'timestamp': (22.16, 22.44)}, {'text': ' μ²˜λΆ„μ΄', 'timestamp': (22.44, 22.86)}, {'text': ' ν—Œλ²•μ΄λ‚˜', 'timestamp': (22.86, 23.34)}, {'text': ' 법λ₯ μ—', 'timestamp': (23.34, 24.18)}, {'text': ' μœ„λ°˜λ˜λŠ”', 'timestamp': (24.18, 24.96)}, {'text': ' μ—¬λΆ€κ°€', 'timestamp': (24.96, 25.44)}, {'text': ' μž¬νŒμ—', 'timestamp': (25.44, 26.0)}, {'text': ' μ „μ œκ°€', 'timestamp': (26.0, 26.36)}, {'text': ' 된', 'timestamp': (26.36, 26.56)}, {'text': ' κ²½μš°μ—', 'timestamp': (26.56, 26.94)}, {'text': ' λŒ€λ²•μ›μ€', 'timestamp': (26.94, 27.9)}]


Whisper did not predict an ending timestamp, which can happen if audio is cut off in the middle of a word. Also make sure WhisperTimeStampLogitsProcessor was used during generation.
[{'text': ' μΆ”λ©΄', 'timestamp': (29.98, 29.98)}, {'text': ' μ„œλ‘œ', 'timestamp': (29.98, 29.98)}, {'text': ' 달라뢙어', 'timestamp': (29.98, 29.98)}, {'text': ' μ–ΌμŒμ΄', 'timestamp': (29.98, 29.98)}, {'text': ' λ˜λŠ”', 'timestamp': (29.98, 29.98)}, {'text': ' κ±°μ•Ό.', 'timestamp': (29.98, 29.98)}]
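For clarity, the failure on the second file is easy to spot programmatically. Below is a minimal sketch (a hypothetical helper, not part of the reproduction script) that flags when every word chunk collapses onto the same timestamp pair:

def has_collapsed_timestamps(chunks):
    # True when every word chunk shares a single (start, end) pair, as in the
    # broken output above where everything is (29.98, 29.98).
    return len({chunk["timestamp"] for chunk in chunks}) == 1

# Example with the broken output from test_ko_new.wav:
broken = [
    {"text": " μΆ”λ©΄", "timestamp": (29.98, 29.98)},
    {"text": " κ±°μ•Ό.", "timestamp": (29.98, 29.98)},
]
print(has_collapsed_timestamps(broken))  # True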

Related to https://github.com/huggingface/transformers/pull/25607 @xenova

kyle-v6x avatar Mar 07 '24 07:03 kyle-v6x

After further testing: adding the chunk_length_s parameter to the inference call makes word-level timestamps come out correctly, even for short inputs. Even if this is the expected behaviour, it would be nice to raise an error or warning if Whisper now treats short inputs differently when no explicit chunk length is given (a rough sketch of such a guard follows the snippet below).

raw_transcriptions = model(
    whisper_audio,
    generate_kwargs={"task": "transcribe"},
    return_timestamps="word",
    chunk_length_s=30
)
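
For illustration, a user-side sketch of the kind of guard I mean (a hypothetical wrapper, not an existing transformers API):

import warnings

def transcribe_words(asr_pipeline, audio, **kwargs):
    # Hypothetical wrapper: warn when word-level timestamps are requested
    # without chunked inference, since short inputs can otherwise come back
    # with missing or collapsed timestamps (see the outputs above).
    if kwargs.get("return_timestamps") == "word" and "chunk_length_s" not in kwargs:
        warnings.warn(
            "return_timestamps='word' used without chunk_length_s; "
            "consider passing chunk_length_s=30 to avoid broken timestamps."
        )
    return asr_pipeline(audio, **kwargs)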

kyle-v6x avatar Mar 07 '24 07:03 kyle-v6x

cc @sanchit-gandhi @ylacombe A warning, or a note in the docs, seems reasonable if chunk_length_s is necessary even for small inputs. WDYT?

amyeroberts avatar Apr 08 '24 14:04 amyeroberts

Gentle ping @sanchit-gandhi

amyeroberts avatar May 07 '24 10:05 amyeroberts

Hey @kyle-v6x, it seems like #30325 should have fixed this, could you verify that it does fix your issue? Thanks for your help!
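
A quick way to check is against the development build (a sketch; it assumes the infer_whisper reproduction script above is still on hand):

# Install transformers from the main branch (where #30325 is merged):
#   pip install --upgrade git+https://github.com/huggingface/transformers
import transformers

print(transformers.__version__)  # confirm a dev build newer than 4.38.2

# Re-run the reproduction; the second file should now return distinct
# word-level timestamps instead of a single repeated value.
infer_whisper("./test_16k.wav")
infer_whisper("./test_ko_new.wav")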

ylacombe avatar May 13 '24 15:05 ylacombe

It was indeed solved with #30325, I'm closing for now!

kamilakesbi avatar May 16 '24 16:05 kamilakesbi