whisper recognition error!

Open • xyx361100238 opened this issue 1 year ago • 7 comments

System Info

  • transformers version: 4.28.0.dev0
  • Platform: Linux-5.4.0-144-generic-x86_64-with-debian-bullseye-sid
  • Python version: 3.7.16
  • Huggingface_hub version: 0.13.2
  • PyTorch version (GPU?): 1.13.1+cu116 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?:
  • Using distributed or parallel set-up in script?:

Who can help?

@sanchit-gandhi @Narsil @sgugger

Information

  • [ ] The official example scripts
  • [X] My own modified scripts

Tasks

  • [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [x] My own task or dataset (give details below)

Reproduction

I fine-tuned the whisper-base model on the WenetSpeech dataset and need to verify its effectiveness using a pipeline:

from datasets import Audio, load_dataset
from transformers import WhisperProcessor, pipeline

processor = WhisperProcessor.from_pretrained(model_path)
asr_pipeline = pipeline(task="automatic-speech-recognition", model=model_path, device="cpu")
asr_pipeline.model.config.forced_decoder_ids = processor.get_decoder_prompt_ids(language=lang, task="transcribe")
ds = load_dataset("audiofolder", data_dir=wav_path)
ds = ds.cast_column("audio", Audio(sampling_rate=16000))
audio = ds['train'][0]['audio']
inputs = processor(audio["array"], sampling_rate=audio["sampling_rate"], language=lang, task="transcribe", return_tensors="pt")
input_features = inputs.input_features
generated_ids = asr_pipeline.model.generate(inputs=input_features, max_new_tokens=32767)
transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

First of all, this script works, but for some mp3 files it raises an error:

Traceback (most recent call last):

/home/youxixie/008-Whisper-Pro/005-whisper-fineturn-pro/transformers-main/examples/pytorch/speech-recognition/run_whisper_speech_recognition.py:45 in <module>

    42     print("test model:{} ".format(args.model))
    43     print("test wav path:{} ".format(args.path))
    44     print("test language:{} ".format(args.lang))
❱   45     eval_whisper(args.model, args.path, args.lang)
    46

/home/youxixie/008-Whisper-Pro/005-whisper-fineturn-pro/transformers-main/examples/pytorch/speech-recognition/run_whisper_speech_recognition.py:25 in eval_whisper

    22         audio = ds['train'][i]['audio']
    23         inputs = processor(audio["array"], sampling_rate=audio["samplin
    24         input_features = inputs.input_features
❱   25         generated_ids = asr_pipeline.model.generate(inputs=input_featur
    26
    27         transcription = processor.batch_decode(generated_ids, skip_spec
    28         #print(transcription)

/home/youxixie/008-Whisper-Pro/005-whisper-fineturn-pro/transformers-main/src/transformers/models/whisper/modeling_whisper.py:1613 in generate

  1610             stopping_criteria,
  1611             prefix_allowed_tokens_fn,
  1612             synced_gpus,
❱ 1613             **kwargs,
  1614         )
  1615
  1616     def prepare_inputs_for_generation(

/home/youxixie/anaconda3/envs/Huggingface-Whisper/lib/python3.7/site-packages/torch/autograd/grad_mode.py:27 in decorate_context

    24         @functools.wraps(func)
    25         def decorate_context(*args, **kwargs):
    26             with self.clone():
❱   27                 return func(*args, **kwargs)
    28         return cast(F, decorate_context)
    29
    30     def _wrap_generator(self, func):

/home/youxixie/008-Whisper-Pro/005-whisper-fineturn-pro/transformers-main/src/transformers/generation/utils.py:1415 in generate

  1412                 output_scores=generation_config.output_scores,
  1413                 return_dict_in_generate=generation_config.return_dict
  1414                 synced_gpus=synced_gpus,
❱ 1415                 **model_kwargs,
  1416             )
  1417
  1418         elif is_contrastive_search_gen_mode:

/home/youxixie/008-Whisper-Pro/005-whisper-fineturn-pro/transformers-main/src/transformers/generation/utils.py:2211 in greedy_search

  2208             if synced_gpus and this_peer_finished:
  2209                 continue  # don't waste resources running the code we
  2210
❱ 2211             next_token_logits = outputs.logits[:, -1, :]
  2212
  2213             # pre-process distribution
  2214             next_tokens_scores = logits_processor(input_ids, next_tok

IndexError: index -1 is out of bounds for dimension 1 with size 0

Expected behavior

If a file cannot be transcribed, just return an empty result or a special symbol.

xyx361100238, Apr 10 '23 03:04
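The behavior requested above could be approximated on the caller's side while the underlying cause is investigated. The following is only a minimal sketch: `transcribe_or_empty` and its `fallback` parameter are hypothetical names, not part of transformers, and `transcribe` stands for any callable that takes a file path and returns text (for example a thin wrapper around the script's generate and decode steps).

```python
from typing import Callable


def transcribe_or_empty(
    transcribe: Callable[[str], str], path: str, fallback: str = ""
) -> str:
    """Run transcribe(path); return `fallback` if it raises IndexError.

    Hypothetical helper: it only hides the IndexError seen in the
    traceback, it does not fix whatever makes generation fail.
    """
    try:
        return transcribe(path)
    except IndexError as err:
        # Log the failing file so it can still be inspected later.
        print(f"skipping {path}: {err}")
        return fallback
```

Catching only `IndexError` keeps unrelated failures (missing files, out-of-memory errors) visible instead of silently turning them into empty transcriptions.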

What is the value of model_path in the above code?

chakravarthik27, Apr 10 '23 06:04

A model fine-tuned from whisper-base on the WenetSpeech dataset.

xyx361100238, Apr 10 '23 07:04

Using the Hugging Face model "whisper-base" with the test file common_voice_zh-CN_18662117.mp3, I got the same error.

xyx361100238, Apr 10 '23 08:04

When I had this error, limiting the max_new_tokens specified to the amount the model can generate per chunk fixed it for me (see the generation_config.json's max_length). Looks like that might be the case here since the max is 448 for whisper-base and 32767 is given. Maybe a nice error message for when max_new_tokens is > max_length would be wanted?

connor-henderson, Apr 17 '23 16:04
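The workaround described in the comment above (keeping max_new_tokens within the model's max_length) can be sketched as a small clamp. This is an illustration under assumptions, not library behavior: `cap_max_new_tokens` and `prompt_length` are hypothetical names, and 448 is simply the whisper-base limit quoted in that comment.

```python
def cap_max_new_tokens(requested: int, max_length: int, prompt_length: int = 0) -> int:
    """Clamp `requested` so that prompt tokens plus new tokens fit in max_length.

    Hypothetical helper: `prompt_length` is the number of decoder tokens
    already present before generation starts (assumed 0 by default).
    """
    if requested < 1:
        raise ValueError("requested must be a positive integer")
    budget = max_length - prompt_length
    if budget < 1:
        raise ValueError("the prompt already fills the maximum length")
    return min(requested, budget)


# 32767 is the value from the report; 448 is the limit quoted above.
print(cap_max_new_tokens(32767, max_length=448))  # prints 448
```

In the reporter's script, the result would replace the hard-coded 32767 passed as max_new_tokens, with max_length taken from the checkpoint's generation_config.json.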

Hey @xyx361100238! In this case, you can probably simplify how you're transcribing the audio file to simply:

asr_pipeline = pipeline(task="automatic-speech-recognition", model=model_path, device="cpu")
transcription = processor.batch_decode("path/to/audio/file",  generate_kwargs={"language": lang, "task": "transcribe"})

This looks like quite a strange error for Whisper - in most cases you can specify max_new_tokens as some arbitrary value (e.g. for LLMs this is just the number of new tokens generated, which doesn't depend on our max length).

sanchit-gandhi, Apr 21 '23 15:04

processor = WhisperProcessor.from_pretrained(model_path)
asr_pipeline = pipeline(task="automatic-speech-recognition", model=model_path, device="cpu")
transcription = processor.batch_decode("common_voice_zh-CN_18524189.wav", generate_kwargs={"language": lang, "task": "transcribe"})

This raises an error: [image]

xyx361100238, Apr 23 '23 02:04

Sorry, I rushed my code snippet! It should have been:

from transformers import pipeline

asr_pipeline = pipeline(task="automatic-speech-recognition", model=model_path, device="cpu")  # change device to "cuda:0" to run on GPU
transcription = asr_pipeline("path/to/audio/file", chunk_length_s=30, generate_kwargs={"language": "<|zh|>", "task": "transcribe"})  # change language as required - I've set it to Chinese

sanchit-gandhi, Apr 27 '23 17:04

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot], May 22 '23 15:05