
The whisper-large-v3 model randomly misses sentences during recognition when return_timestamps="word"

Open • zxl777 opened this issue 4 months ago • 5 comments

System Info

  • transformers version: 4.40.0.dev0
  • Platform: Linux-5.15.0-94-generic-x86_64-with-glibc2.35
  • Python version: 3.10.12
  • Huggingface_hub version: 0.21.4
  • Safetensors version: 0.4.2
  • Accelerate version: 0.28.0
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.2.0+cu121 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?:
  • Using distributed or parallel set-up in script?:

Who can help?

@sanchit-gandhi

Information

  • [X] The official example scripts
  • [ ] My own modified scripts

Tasks

  • [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [X] My own task or dataset (give details below)

Reproduction

  1. Download audio from https://www.youtube.com/watch?v=CK_wQEX_yS8
python3 -m pip install -U yt-dlp[default]
yt-dlp -f 'bestaudio[ext=webm]' -o audio.webm "https://www.youtube.com/watch?v=CK_wQEX_yS8"
yt-dlp -f 'bestaudio[ext=m4a]' -o audio.m4a "https://www.youtube.com/watch?v=CK_wQEX_yS8"
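As a sanity check that both downloads decode cleanly before blaming the model (a minimal sketch, assuming ffmpeg is installed; ffmpeg_read is the same helper the ASR pipeline uses to decode file-path inputs):

from transformers.pipelines.audio_utils import ffmpeg_read

sampling_rate = 16000  # Whisper models expect 16 kHz input
for path in ("audio.webm", "audio.m4a"):
    with open(path, "rb") as f:
        audio = ffmpeg_read(f.read(), sampling_rate)
    # Both files should decode to waveforms of roughly the same duration
    print(path, len(audio) / sampling_rate, "seconds")

  2. Run the following recognition script: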
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from datasets import load_dataset


device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "openai/whisper-large-v3"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    chunk_length_s=30,
    batch_size=4,
    return_timestamps=True,
    torch_dtype=torch_dtype,
    device=device,
)

dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
# sample = dataset[0]["audio"]

sample = 'audio.webm'
# result = pipe(sample, return_timestamps=True)
result = pipe(sample, return_timestamps="word")

print('== ' * 10)
print(result)
Observed results:

  1. When I search for "Elon said" in the output, I get "Elon said, understand it." This is incomplete and misses an entire sentence.

  2. If I change the call to result = pipe(sample, return_timestamps=True), the result is "Elon said, When you struggle with a problem, that's when you," which is correct and meets expectations. (A comparison sketch follows this list.)

  3. With return_timestamps=False, sentences are occasionally missing.

  4. With return_timestamps=True, there are occasionally problems with repeated sentences.
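
For reference, a minimal sketch of the comparison in points 1 and 2, reusing pipe and sample from the script above (the probe strings are taken from the transcripts quoted in those points):

# Run the same pipeline with segment-level and word-level timestamps,
# then check whether known phrases survive in each transcript.
result_seg = pipe(sample, return_timestamps=True)
result_word = pipe(sample, return_timestamps="word")

for probe in ("Elon said", "When you struggle with a problem"):
    print(probe, "| segment-level:", probe in result_seg["text"],
          "| word-level:", probe in result_word["text"])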

Expected behavior

When return_timestamps="word" is set, the whisper-large-v3 model randomly misses sentences during recognition.

Everything works normally when return_timestamps=True.

The test audio comes from https://www.youtube.com/watch?v=CK_wQEX_yS8 . I downloaded it in different audio formats, and the sentences that get missed vary between formats.

I'm using the latest version available on GitHub right now, and I believe this is a bug.

zxl777 • Mar 23 '24 16:03

I've run into this as well. One unblock I found (I haven't tracked down why this is the case) is that if you also include return_language=True in your pipe (so you have both return_language=True and return_timestamps="word"), then the word-level timestamps are correct / make sense. We were seeing some pretty nonsensical timestamps without this. It could be that some other intermediate representations needed to properly time-align are only getting passed through when the language info is being passed.
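
Concretely, that means calling the pipeline along these lines (a sketch; pipe and sample as in the script above):

# Suggested workaround (sketch): request language info together with
# word-level timestamps.
result = pipe(sample, return_timestamps="word", return_language=True)
print(result["chunks"][:3])  # inspect the first few word chunks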

naveen-corpusant • Mar 23 '24 17:03

Thank you for your response. Even after I added return_language=True, the issue still persists. This parameter does not affect the problem I've encountered.

zxl777 • Mar 23 '24 17:03

also cc @ylacombe

amyeroberts • Mar 23 '24 19:03

Any update?

zxl777 • Apr 09 '24 18:04

Gentle ping @sanchit-gandhi @ylacombe

amyeroberts • May 07 '24 09:05