Whisper Word-level Timestamps broken on some inputs
### System Info
- `transformers` version: 4.38.2
- Platform: Linux-5.15.0-1036-aws-x86_64-with-glibc2.31
- Python version: 3.10.13
- Huggingface_hub version: 0.21.4
- Safetensors version: 0.4.1
- Accelerate version: not installed
- Accelerate config: not found
- PyTorch version (GPU?): 2.1.1 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: Yes
- Using distributed or parallel set-up in script?: No
### Who can help?
No response
### Information
- [ ] The official example scripts
- [X] My own modified scripts
### Tasks
- [ ] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
### Reproduction
Audio files can be found here.
```python
import librosa
import numpy as np
import soundfile as sf
from torch import float16
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

model_id = "openai/whisper-large-v3"
dtype = float16
device = "cuda:0"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=dtype, use_flash_attention_2=False, attn_implementation="eager"
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

# Note: `model` is rebound from the raw model to the pipeline object here.
model = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    device=device,
    batch_size=4,
    framework="pt",
    torch_dtype=float16,
)


def infer_whisper(audio_file):
    # Whisper expects 16 kHz input; resample if needed.
    audio, sr = sf.read(audio_file, dtype=np.float32)
    if sr != 16000:
        whisper_audio = librosa.resample(audio, orig_sr=sr, target_sr=16000)
    else:
        whisper_audio = audio
    raw_transcriptions = model(
        whisper_audio, generate_kwargs={"task": "transcribe"}, return_timestamps="word"
    )
    print(raw_transcriptions["chunks"])


infer_whisper("./test_16k.wav")
infer_whisper("./test_ko_new.wav")
```
### Expected behavior
Both audio files should output correct word-level timestamps. However, the output is as follows:
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Whisper did not predict an ending timestamp, which can happen if audio is cut off in the middle of a word. Also make sure WhisperTimeStampLogitsProcessor was used during generation.
[word-level chunks for the first file; the Korean text is mojibake-garbled in this paste, but the timestamps advance normally from (0.0, 0.52) through (26.94, 27.9)]
Whisper did not predict an ending timestamp, which can happen if audio is cut off in the middle of a word. Also make sure WhisperTimeStampLogitsProcessor was used during generation.
[six word-level chunks for the second file (Korean text garbled in this paste), every one with 'timestamp': (29.98, 29.98)]
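This pattern is easy to flag programmatically. Below is a minimal sketch (a hypothetical `timestamps_degenerate` helper, not part of the repro above) that checks whether every returned chunk collapses to a single instant:

```python
def timestamps_degenerate(chunks):
    # True when every word chunk has an identical start and end, as in the
    # failing output above where each span is (29.98, 29.98).
    return bool(chunks) and all(
        start == end for start, end in (c["timestamp"] for c in chunks)
    )

# The failing output for the second file returns True:
print(timestamps_degenerate([{"text": " ...", "timestamp": (29.98, 29.98)}]))
```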
Related to https://github.com/huggingface/transformers/pull/25607 @xenova
After further testing, adding the `chunk_length_s` parameter to the inference call makes it work correctly for small inputs (presumably because it routes the audio through the pipeline's chunked inference path instead of a single pass). Even if this is expected, it would be nice to raise an error or warning if Whisper now treats small inputs differently without an explicit chunk length:
```python
raw_transcriptions = model(
    whisper_audio,
    generate_kwargs={"task": "transcribe"},
    return_timestamps="word",
    chunk_length_s=30,
)
```
cc @sanchit-gandhi @ylacombe A warning, or a note in the docs, seems reasonable if `chunk_length_s` is necessary even for small inputs. WDYT?
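For illustration, here is a minimal sketch of what such a guard could look like (hypothetical helper and parameter names, not the actual `transformers` implementation):

```python
import warnings

def check_word_timestamp_args(return_timestamps, chunk_length_s):
    # Hypothetical guard: warn when word-level timestamps are requested
    # without an explicit chunk length, since short inputs can then come
    # back with degenerate timestamps.
    if return_timestamps == "word" and chunk_length_s is None:
        warnings.warn(
            "return_timestamps='word' was passed without chunk_length_s; "
            "word-level timestamps can be degenerate on some inputs. "
            "Consider passing chunk_length_s=30.",
            UserWarning,
        )

check_word_timestamp_args("word", None)  # emits the warning above
```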
Gentle ping @sanchit-gandhi
Hey @kyle-v6x, it seems like #30325 should have fixed this. Could you verify that it fixes your issue? Thanks for your help!
It was indeed solved by #30325, so I'm closing for now!