Whisper recognition error!
System Info
- `transformers` version: 4.28.0.dev0
- Platform: Linux-5.4.0-144-generic-x86_64-with-debian-bullseye-sid
- Python version: 3.7.16
- Huggingface_hub version: 0.13.2
- PyTorch version (GPU?): 1.13.1+cu116 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?:
- Using distributed or parallel set-up in script?:
Who can help?
@sanchit-gandhi @Narsil @sgugger
Information
- [ ] The official example scripts
- [X] My own modified scripts
Tasks
- [ ] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [x] My own task or dataset (give details below)
Reproduction
I fine-tuned the whisper-base model on the WenetSpeech dataset, and I need to verify its effectiveness using the pipeline:
```python
from datasets import Audio, load_dataset
from transformers import WhisperProcessor, pipeline

# model_path, wav_path and lang are defined elsewhere in the script
processor = WhisperProcessor.from_pretrained(model_path)
asr_pipeline = pipeline(task="automatic-speech-recognition", model=model_path, device="cpu")
asr_pipeline.model.config.forced_decoder_ids = processor.get_decoder_prompt_ids(language=lang, task="transcribe")

# Load local audio files and resample to Whisper's expected 16 kHz
ds = load_dataset("audiofolder", data_dir=wav_path)
ds = ds.cast_column("audio", Audio(sampling_rate=16000))
audio = ds['train'][0]['audio']

inputs = processor(audio["array"], sampling_rate=audio["sampling_rate"], language=lang, task="transcribe", return_tensors="pt")
input_features = inputs.input_features
generated_ids = asr_pipeline.model.generate(inputs=input_features, max_new_tokens=32767)
transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
```
First of all, this script works, but on some mp3 files it raises an error:
```
Traceback (most recent call last):
  /home/youxixie/008-Whisper-Pro/005-whisper-fineturn-pro/transformers-main/examples/pytorch/speech-recognition/run_whisper_speech_recognition.py:45
    eval_whisper(args.model, args.path, args.lang)
  /home/youxixie/008-Whisper-Pro/005-whisper-fineturn-pro/transformers-main/examples/pytorch/speech-recognition/run_whisper_speech_recognition.py:25 in eval_whisper
    generated_ids = asr_pipeline.model.generate(inputs=input_features, max_new_tokens=32767)
  /home/youxixie/008-Whisper-Pro/005-whisper-fineturn-pro/transformers-main/src/transformers/models/whisper/modeling_whisper.py:1613 in generate
    **kwargs,
  /home/youxixie/anaconda3/envs/Huggingface-Whisper/lib/python3.7/site-packages/torch/autograd/grad_mode.py:27 in decorate_context
    return func(*args, **kwargs)
  /home/youxixie/008-Whisper-Pro/005-whisper-fineturn-pro/transformers-main/src/transformers/generation/utils.py:1415 in generate
    **model_kwargs,
  /home/youxixie/008-Whisper-Pro/005-whisper-fineturn-pro/transformers-main/src/transformers/generation/utils.py:2211 in greedy_search
    next_token_logits = outputs.logits[:, -1, :]
IndexError: index -1 is out of bounds for dimension 1 with size 0
```
Expected behavior
If a file cannot produce a result, just return an empty result or a special symbol.
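A minimal sketch of such a fallback, assuming the generate call from the reproduction above; catching IndexError specifically is an assumption based on the traceback, and the empty-string placeholder is an illustrative choice:

```python
# Hypothetical fallback: catch the generation failure and emit an empty result
# so one bad file doesn't abort the whole evaluation loop.
try:
    generated_ids = asr_pipeline.model.generate(inputs=input_features, max_new_tokens=32767)
    transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
except IndexError:
    transcription = ""  # or a special marker such as "<failed>"
```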
What is the value of model_path in the above code?
A model fine-tuned from whisper-base on the WenetSpeech dataset.
Using the Hugging Face model "whisper-base" and the test file common_voice_zh-CN_18662117.mp3, I got the same error.
When I had this error, limiting max_new_tokens to the number of tokens the model can generate per chunk fixed it for me (see max_length in the model's generation_config.json). That looks like the case here, since the max is 448 for whisper-base and 32767 is given. Maybe a nice error message for when max_new_tokens > max_length would be wanted?
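A minimal sketch of that cap, reusing the pipeline and input_features from the reproduction above; reading the limit from `model.generation_config` rather than from the JSON file directly is an assumption about where the value is exposed at runtime:

```python
# Hypothetical cap: never ask generate() for more new tokens than the model's
# configured per-chunk maximum (448 for whisper-base).
model = asr_pipeline.model
max_length = model.generation_config.max_length  # populated from generation_config.json
generated_ids = model.generate(
    inputs=input_features,
    # leave headroom for the decoder start / forced prompt tokens
    max_new_tokens=min(32767, max_length - 1),
)
```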
Hey @xyx361100238! In this case, you can probably simplify how you're transcribing the audio file to:
```python
asr_pipeline = pipeline(task="automatic-speech-recognition", model=model_path, device="cpu")
transcription = processor.batch_decode("path/to/audio/file", generate_kwargs={"language": lang, "task": "transcribe"})
```
This looks like quite a strange error for Whisper - in most cases you can specify max_new_tokens as some arbitrary value (e.g. for LLMs this is just the number of new tokens generated, which doesn't depend on our max length).
```python
processor = WhisperProcessor.from_pretrained(model_path)
asr_pipeline = pipeline(task="automatic-speech-recognition", model=model_path, device="cpu")
transcription = processor.batch_decode("common_voice_zh-CN_18524189.wav", generate_kwargs={"language": lang, "task": "transcribe"})
```
It raises an error:
Sorry, I rushed my code snippet! It should have been:
```python
from transformers import pipeline

asr_pipeline = pipeline(task="automatic-speech-recognition", model=model_path, device="cpu")  # change device to "cuda:0" to run on GPU
transcription = asr_pipeline("path/to/audio/file", chunk_length_s=30, generate_kwargs={"language": "<|zh|>", "task": "transcribe"})  # change language as required - I've set it to Chinese
```
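For reference, a minimal sketch of running this corrected pipeline over the audiofolder dataset from the original report; the loop is illustrative, and only the pipeline call itself mirrors the snippet above:

```python
# Hypothetical batch evaluation over the reporter's local dataset.
# The ASR pipeline accepts a {"raw": ndarray, "sampling_rate": int} dict,
# so each sample can be fed in without manual feature extraction.
from datasets import Audio, load_dataset

ds = load_dataset("audiofolder", data_dir=wav_path)
ds = ds.cast_column("audio", Audio(sampling_rate=16000))

for sample in ds["train"]:
    result = asr_pipeline(
        {"raw": sample["audio"]["array"], "sampling_rate": sample["audio"]["sampling_rate"]},
        chunk_length_s=30,
        generate_kwargs={"language": "<|zh|>", "task": "transcribe"},
    )
    print(result["text"])
```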
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.