
Whisper Word-level Timestamps broken on some inputs

Open kyle-v6x opened this issue 1 year ago β€’ 3 comments

System Info

  • transformers version: 4.38.2
  • Platform: Linux-5.15.0-1036-aws-x86_64-with-glibc2.31
  • Python version: 3.10.13
  • Huggingface_hub version: 0.21.4
  • Safetensors version: 0.4.1
  • Accelerate version: not installed
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.1.1 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: Yes
  • Using distributed or parallel set-up in script?: No

Who can help?

No response

Information

  • [ ] The official example scripts
  • [X] My own modified scripts

Tasks

  • [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [ ] My own task or dataset (give details below)

Reproduction

Audio files can be found here.

import librosa
import soundfile as sf
from transformers import pipeline, AutoModelForSpeechSeq2Seq, AutoProcessor
from torch import float16
import numpy as np

model_id = "openai/whisper-large-v3"
dtype = float16
device = "cuda:0"
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=dtype, use_flash_attention_2=False, attn_implementation="eager"
)
model.to(device)
processor = AutoProcessor.from_pretrained(model_id)
model = pipeline(
            "automatic-speech-recognition",
            model=model,
            tokenizer=processor.tokenizer,
            feature_extractor=processor.feature_extractor,
            device=device,
            batch_size=4,
            framework="pt",
            torch_dtype=float16
        )

def infer_whisper(audio_file):
    # Load the audio and resample to 16 kHz, the rate Whisper expects
    audio, sr = sf.read(audio_file, dtype=np.float32)
    if sr != 16000:
        whisper_audio = librosa.resample(audio, orig_sr=sr, target_sr=16000)
    else:
        whisper_audio = audio

    # Transcribe with word-level timestamps
    raw_transcriptions = model(whisper_audio, generate_kwargs={"task": "transcribe"}, return_timestamps="word")
    print(raw_transcriptions["chunks"])

infer_whisper("./test_16k.wav")
infer_whisper("./test_ko_new.wav")

Expected behavior

Both audio files should output correct word-level timestamps. However, the output is as follows:

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Whisper did not predict an ending timestamp, which can happen if audio is cut off in the middle of a word. Also make sure WhisperTimeStampLogitsProcessor was used during generation.
[{'text': ' κ΅­λ¬΄μ΄λ¦¬λŠ”', 'timestamp': (0.0, 0.52)}, {'text': ' ꡭ회의', 'timestamp': (0.52, 1.08)}, {'text': ' λ™μ˜λ₯Ό', 'timestamp': (1.08, ...)}, {'text': ' μ–»μ–΄', 'timestamp': (1.4, 1.72)}, {'text': ' λŒ€ν†΅λ Ήμ΄', 'timestamp': (1.72, 2.28)}, {'text': ' μž„λͺ…ν•œλ‹€.', 'timestamp': (2.28, 2.92)}, {'text': ' λŒ€λ²•κ΄€μ€', 'timestamp': (2.92, 4.26)}, {'text': ' λŒ€λ²•μ›μž₯의', 'timestamp': (4.26, 5.06)}, {'text': ' 제청으둜', 'timestamp': (5.06, 5.54)}, {'text': ' ꡭ회의', 'timestamp': (5.54, 5.96)}, {'text': ' λ™μ˜λ₯Ό', 'timestamp': (5.96, 6.32)}, {'text': ' μ–»μ–΄', 'timestamp': (6.32, 6.64)}, {'text': ..., 'timestamp': (6.64, 7.18)}, {'text': ' μž„λͺ…ν•œλ‹€.', 'timestamp': (7.18, 8.32)}, {'text': ' μ˜λ¬΄κ΅μœ‘μ€', 'timestamp': (8.32, 9.14)}, {'text': ' λ¬΄μƒμœΌλ‘œ', 'timestamp': (9.14, 9.58)}, {'text': ' ν•œλ‹€.', 'timestamp': (9.58, 10.38)}, {'text': ..., 'timestamp': (10.38, 10.96)}, {'text': ' μ•ˆμ—', 'timestamp': (10.96, 11.1)}, {'text': ' μ΄μ˜κ°€', 'timestamp': (11.1, 11.66)}, {'text': ..., 'timestamp': (11.66, 11.84)}, {'text': ' λ•Œμ—λŠ”', 'timestamp': (11.84, 12.24)}, {'text': ' λŒ€ν†΅λ Ήμ€', 'timestamp': (12.24, 13.04)}, {'text': ' 재견', 'timestamp': (13.04, 13.38)}, {'text': ' ν•­ν•΄', 'timestamp': (13.38, 13.7)}, {'text': ' κΈ°κ°„', 'timestamp': (13.7, 14.04)}, {'text': ' 내에', 'timestamp': (14.04, 14.26)}, {'text': ' μ΄μ˜μ„œλ₯Ό', 'timestamp': (14.26, 14.82)}, {'text': ' λΆ™μ—¬', 'timestamp': (14.82, 15.2)}, {'text': ' ꡭ회둜', 'timestamp': (15.2, 15.7)}, {'text': ' ν™˜λΆ€ν•˜κ³ ', 'timestamp': (15.7, 16.16)}, {'text': ' κ·Έ', 'timestamp': (16.16, 16.84)}, {'text': ' 제의λ₯Ό', 'timestamp': (16.84, 17.28)}, {'text': ' μš”κ΅¬ν• ', 'timestamp': (17.28, 17.68)}, {'text': ' 수', 'timestamp': (17.68, 17.8)}, {'text': ' μžˆλ‹€.', 'timestamp': (17.8, 18.78)}, {'text': ' ꡭ회의', 'timestamp': (18.78, 19.44)}, {'text': ' 폐회', 'timestamp': (19.44, 19.72)}, {'text': ' 쀑에도', 'timestamp': (19.72, 20.06)}, {'text': ' λ˜ν•œ', 'timestamp': (20.06, 20.42)}, {'text': ' κ°™λ‹€.', 'timestamp': (20.42, 20.96)}, {'text': ' λͺ…λ Ήκ·œμΉ™', 'timestamp': (20.96, 22.16)}, {'text': ' λ˜λŠ”', 'timestamp': (22.16, 22.44)}, {'text': ' μ²˜λΆ„μ΄', 'timestamp': (22.44, 22.86)}, {'text': ' ν—Œλ²•μ΄λ‚˜', 'timestamp': (22.86, 23.34)}, {'text': ' 법λ₯ μ—', 'timestamp': (23.34, 24.18)}, {'text': ' μœ„λ°˜λ˜λŠ”', 'timestamp': (24.18, 24.96)}, {'text': ' μ—¬λΆ€κ°€', 'timestamp': (24.96, 25.44)}, {'text': ' μž¬νŒμ—', 'timestamp': (25.44, 26.0)}, {'text': ' μ „μ œκ°€', 'timestamp': (26.0, 26.36)}, {'text': ' 된', 'timestamp': (26.36, 26.56)}, {'text': ' κ²½μš°μ—', 'timestamp': (26.56, 26.94)}, {'text': ' λŒ€λ²•μ›μ€', 'timestamp': (26.94, 27.9)}]


Whisper did not predict an ending timestamp, which can happen if audio is cut off in the middle of a word. Also make sure WhisperTimeStampLogitsProcessor was used during generation.
[{'text': ' μΆ”λ©΄', 'timestamp': (29.98, 29.98)}, {'text': ' μ„œλ‘œ', 'timestamp': (29.98, 29.98)}, {'text': ' 달라뢙어', 'timestamp': (29.98, 29.98)}, {'text': ' μ–ΌμŒμ΄', 'timestamp': (29.98, 29.98)}, {'text': ' λ˜λŠ”', 'timestamp': (29.98, 29.98)}, {'text': ' κ±°μ•Ό.', 'timestamp': (29.98, 29.98)}]
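For clarity, the failure on the second file is easy to spot programmatically. Below is a minimal sketch (a hypothetical helper, not part of the reproduction script) that flags when every word chunk collapses onto the same timestamp pair:

def has_collapsed_timestamps(chunks):
    # True when every word chunk shares a single (start, end) pair, as in the
    # broken output above where everything is (29.98, 29.98).
    return len({chunk["timestamp"] for chunk in chunks}) == 1

# Example with the broken output from test_ko_new.wav:
broken = [
    {"text": " μΆ”λ©΄", "timestamp": (29.98, 29.98)},
    {"text": " κ±°μ•Ό.", "timestamp": (29.98, 29.98)},
]
print(has_collapsed_timestamps(broken))  # True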

Related to https://github.com/huggingface/transformers/pull/25607 @xenova

kyle-v6x avatar Mar 07 '24 07:03 kyle-v6x

After further testing: adding the chunk_length_s parameter to the inference call makes word-level timestamps come out correctly, even for short inputs. Even if this is the expected behaviour, it would be nice to raise an error or warning if Whisper now treats short inputs differently when no explicit chunk length is given (a rough sketch of such a guard follows the snippet below).

raw_transcriptions = model(
    whisper_audio,
    generate_kwargs={"task": "transcribe"},
    return_timestamps="word",
    chunk_length_s=30
)
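
For illustration, a user-side sketch of the kind of guard I mean (a hypothetical wrapper, not an existing transformers API):

import warnings

def transcribe_words(asr_pipeline, audio, **kwargs):
    # Hypothetical wrapper: warn when word-level timestamps are requested
    # without chunked inference, since short inputs can otherwise come back
    # with missing or collapsed timestamps (see the outputs above).
    if kwargs.get("return_timestamps") == "word" and "chunk_length_s" not in kwargs:
        warnings.warn(
            "return_timestamps='word' used without chunk_length_s; "
            "consider passing chunk_length_s=30 to avoid broken timestamps."
        )
    return asr_pipeline(audio, **kwargs)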

kyle-v6x avatar Mar 07 '24 07:03 kyle-v6x

cc @sanchit-gandhi @ylacombe A warning, or a note in the docs, seems reasonable if chunk_length_s is necessary even for small inputs. WDYT?

amyeroberts avatar Apr 08 '24 14:04 amyeroberts

Gentle ping @sanchit-gandhi

amyeroberts avatar May 07 '24 10:05 amyeroberts

Hey @kyle-v6x, it seems like #30325 should have fixed this, could you verify that it does fix your issue? Thanks for your help!
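
A quick way to check is against the development build (a sketch; it assumes the infer_whisper reproduction script above is still on hand):

# Install transformers from the main branch (where #30325 is merged):
#   pip install --upgrade git+https://github.com/huggingface/transformers
import transformers

print(transformers.__version__)  # confirm a dev build newer than 4.38.2

# Re-run the reproduction; the second file should now return distinct
# word-level timestamps instead of a single repeated value.
infer_whisper("./test_16k.wav")
infer_whisper("./test_ko_new.wav")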

ylacombe avatar May 13 '24 15:05 ylacombe

It was indeed solved with #30325, I'm closing for now!

kamilakesbi avatar May 16 '24 16:05 kamilakesbi