transformers: Whisper recognition error

System Info

transformers version: 4.28.0.dev0
Platform: Linux-5.4.0-144-generic-x86_64-with-debian-bullseye-sid
Python version: 3.7.16
Huggingface_hub version: 0.13.2
PyTorch version (GPU?): 1.13.1+cu116 (True)
Tensorflow version (GPU?): not installed (NA)
Flax version (CPU?/GPU?/TPU?): not installed (NA)
Jax version: not installed
JaxLib version: not installed
Using GPU in script?:
Using distributed or parallel set-up in script?:

Who can help?

@sanchit-gandhi @Narsil @sgugger

Information

[ ] The official example scripts
[X] My own modified scripts

Tasks

[ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
[x] My own task or dataset (give details below)

Reproduction

I fine-tuned the whisper-base model on the WenetSpeech dataset and need to verify its effectiveness using a pipeline:

from datasets import Audio, load_dataset
from transformers import WhisperProcessor, pipeline

processor = WhisperProcessor.from_pretrained(model_path)
asr_pipeline = pipeline(task="automatic-speech-recognition", model=model_path, device="cpu")
asr_pipeline.model.config.forced_decoder_ids = processor.get_decoder_prompt_ids(language=lang, task="transcribe")
ds = load_dataset("audiofolder", data_dir=wav_path)
ds = ds.cast_column("audio", Audio(sampling_rate=16000))
audio = ds['train'][0]['audio']
inputs = processor(audio["array"], sampling_rate=audio["sampling_rate"], language=lang, task="transcribe", return_tensors="pt")
input_features = inputs.input_features
generated_ids = asr_pipeline.model.generate(inputs=input_features, max_new_tokens=32767)
transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

The script generally works, but for some mp3 files it raises this error:

Traceback (most recent call last):

/home/youxixie/008-Whisper-Pro/005-whisper-fineturn-pro/transformers-main/examples/pytorch/speech-recognition/run_whisper_speech_recognition.py:45 in <module>

    42     print("test model:{} ".format(args.model))
    43     print("test wav path:{} ".format(args.path))
    44     print("test language:{} ".format(args.lang))
  ❱ 45     eval_whisper(args.model, args.path, args.lang)
    46

/home/youxixie/008-Whisper-Pro/005-whisper-fineturn-pro/transformers-main/examples/pytorch/speech-recognition/run_whisper_speech_recognition.py:25 in eval_whisper

    22         audio = ds['train'][i]['audio']
    23         inputs = processor(audio["array"], sampling_rate=audio["samplin
    24         input_features = inputs.input_features
  ❱ 25         generated_ids = asr_pipeline.model.generate(inputs=input_featur
    26
    27         transcription = processor.batch_decode(generated_ids, skip_spec
    28         #print(transcription)

/home/youxixie/008-Whisper-Pro/005-whisper-fineturn-pro/transformers-main/src/transformers/models/whisper/modeling_whisper.py:1613 in generate

    1610             stopping_criteria,
    1611             prefix_allowed_tokens_fn,
    1612             synced_gpus,
  ❱ 1613             **kwargs,
    1614         )
    1615
    1616     def prepare_inputs_for_generation(

/home/youxixie/anaconda3/envs/Huggingface-Whisper/lib/python3.7/site-packages/torch/autograd/grad_mode.py:27 in decorate_context

    24         @functools.wraps(func)
    25         def decorate_context(*args, **kwargs):
    26             with self.clone():
  ❱ 27                 return func(*args, **kwargs)
    28         return cast(F, decorate_context)
    29
    30     def _wrap_generator(self, func):

/home/youxixie/008-Whisper-Pro/005-whisper-fineturn-pro/transformers-main/src/transformers/generation/utils.py:1415 in generate

    1412                 output_scores=generation_config.output_scores,
    1413                 return_dict_in_generate=generation_config.return_dict
    1414                 synced_gpus=synced_gpus,
  ❱ 1415                 **model_kwargs,
    1416             )
    1417
    1418         elif is_contrastive_search_gen_mode:

/home/youxixie/008-Whisper-Pro/005-whisper-fineturn-pro/transformers-main/src/transformers/generation/utils.py:2211 in greedy_search

    2208             if synced_gpus and this_peer_finished:
    2209                 continue  # don't waste resources running the code we
    2210
  ❱ 2211             next_token_logits = outputs.logits[:, -1, :]
    2212
    2213             # pre-process distribution
    2214             next_tokens_scores = logits_processor(input_ids, next_tok

IndexError: index -1 is out of bounds for dimension 1 with size 0

Expected behavior

If a file cannot be transcribed, the call should return an empty result or a special symbol instead of raising an exception.
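Until the library behaves this way, a caller-side guard can approximate it. The following is only a sketch: `transcribe_or_empty` and its `transcribe` argument are hypothetical names, not part of transformers, and catching `IndexError` this broadly can also hide unrelated bugs.

```python
from typing import Any, Callable


def transcribe_or_empty(
    transcribe: Callable[[Any], str], audio: Any, fallback: str = ""
) -> str:
    """Call `transcribe(audio)` and return `fallback` if it raises IndexError.

    `transcribe` is any callable that turns one audio sample into text, for
    example a small function wrapping the generate + batch_decode steps from
    the reproduction script (hypothetical wiring, not a library API).
    """
    try:
        return transcribe(audio)
    except IndexError:
        # The failure reported in this issue surfaces as an IndexError
        # inside greedy_search; treat it as "no transcription available".
        return fallback
```

With this in place, a file that triggers the error yields an empty string and the evaluation loop can continue with the next file.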

Apr 10 '23 03:04 xyx361100238

What is the value of model_path in the above code?

Apr 10 '23 06:04 chakravarthik27

It is a model fine-tuned from whisper-base on the WenetSpeech dataset.

Apr 10 '23 07:04 xyx361100238

Using the Hugging Face model "whisper-base" with the test file common_voice_zh-CN_18662117.mp3, I got the same error.

Apr 10 '23 08:04 xyx361100238

When I had this error, limiting the max_new_tokens specified to the amount the model can generate per chunk fixed it for me (see the generation_config.json's max_length). Looks like that might be the case here since the max is 448 for whisper-base and 32767 is given. Maybe a nice error message for when max_new_tokens is > max_length would be wanted?
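As a rough illustration of that workaround, the requested value can be clamped before calling `generate`. This is a sketch under assumptions: `clamp_max_new_tokens` is a hypothetical helper, 448 is the `max_length` quoted above for whisper-base, and the default `prompt_length=1` (a single decoder start token) is an assumption that may not match every setup.

```python
def clamp_max_new_tokens(
    requested: int, max_length: int, prompt_length: int = 1
) -> int:
    """Limit `requested` new tokens to what still fits within `max_length`.

    `max_length` is the decoder's total length limit (448 is the value quoted
    for whisper-base in this thread). `prompt_length` is the number of decoder
    tokens present before generation starts; 1 is an assumed default.
    """
    budget = max_length - prompt_length
    if budget <= 0:
        raise ValueError("prompt already fills the decoder length limit")
    return min(requested, budget)


# Hypothetical use with the objects from the reproduction script:
#   limit = asr_pipeline.model.generation_config.max_length
#   generated_ids = asr_pipeline.model.generate(
#       inputs=input_features,
#       max_new_tokens=clamp_max_new_tokens(32767, limit),
#   )
```

Values already below the limit pass through unchanged, so the helper only affects oversized requests such as the 32767 used in the report.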

Apr 17 '23 16:04 connor-henderson

Hey @xyx361100238! In this case, you can probably simplify how you're transcribing the audio file to simply:

asr_pipeline = pipeline(task="automatic-speech-recognition", model=model_path, device="cpu")
transcription = processor.batch_decode("path/to/audio/file",  generate_kwargs={"language": lang, "task": "transcribe"})

This looks like quite a strange error for Whisper - in most cases you can specify max_new_tokens as some arbitrary value (e.g. for LLMs this is just the number of new tokens generated, which doesn't depend on our max length).

Apr 21 '23 15:04 sanchit-gandhi

Running the following:

processor = WhisperProcessor.from_pretrained(model_path)
asr_pipeline = pipeline(task="automatic-speech-recognition", model=model_path, device="cpu")
transcription = processor.batch_decode("common_voice_zh-CN_18524189.wav", generate_kwargs={"language": lang, "task": "transcribe"})

raises an error:

Apr 23 '23 02:04 xyx361100238

Sorry, I rushed my code snippet! It should have been:

from transformers import pipeline

asr_pipeline = pipeline(task="automatic-speech-recognition", model=model_path, device="cpu")  # change device to "cuda:0" to run on GPU
transcription = asr_pipeline("path/to/audio/file", chunk_length_s=30, generate_kwargs={"language": "<|zh|>", "task": "transcribe"})  # change language as required - I've set it to Chinese
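To check several files in one go, the same call can be wrapped in a loop. This is a minimal sketch: `transcribe_files` is a hypothetical helper, and it assumes the pipeline returns a dict with a `"text"` key for each file, which is how the automatic-speech-recognition pipeline normally responds.

```python
from typing import Any, Callable, Dict, Iterable


def transcribe_files(
    asr: Callable[..., Dict[str, Any]],
    paths: Iterable[str],
    language: str = "<|zh|>",
) -> Dict[str, str]:
    """Run `asr` on each path and map path -> transcribed text.

    `asr` is expected to be the `asr_pipeline` object built above; any
    callable with the same call signature works, which makes the helper
    easy to try out with a stub.
    """
    results: Dict[str, str] = {}
    for path in paths:
        output = asr(
            str(path),
            chunk_length_s=30,
            generate_kwargs={"language": language, "task": "transcribe"},
        )
        results[str(path)] = output["text"]
    return results


# Hypothetical usage:
#   texts = transcribe_files(asr_pipeline, ["a.mp3", "b.mp3"])
```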

Apr 27 '23 17:04 sanchit-gandhi

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

May 22 '23 15:05 github-actions[bot]
