wav2vec2-sprint

Predictions are always pad_token, logits are always the same distribution for different time_step and speech utterances.

louislau1129 opened this issue · 2 comments

Hi, thanks for your wav2vec2 fine-tuning scripts. Recently, I used this script (run_common_voice.py) to fine-tune the newest Hubert model for ASR on Chinese (zh-CN). Hubert has basically the same interface as wav2vec2 in huggingface, so I just replaced Wav2Vec2ForCTC with HubertForCTC and ran the script on a single GPU. When I inspected the training progress, I found that the WER/CER metrics were always 1.0 during evaluation (which runs every few hundred steps), which is unexpected.

[screenshot: evaluation logs showing WER/CER stuck at 1.0]
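For context, a minimal sketch of the model swap described above, assuming the usual wav2vec2 fine-tuning setup; the pretrained checkpoint name, processor path, and config values here are illustrative assumptions, not the actual values from run_common_voice.py:

```python
# Illustrative sketch of swapping Wav2Vec2ForCTC for HubertForCTC.
# The checkpoint name and processor path are placeholders, not the script's actual ones.
from transformers import HubertForCTC, Wav2Vec2Processor

processor = Wav2Vec2Processor.from_pretrained("path/to/zh-CN-processor")  # tokenizer built from the Chinese vocab

model = HubertForCTC.from_pretrained(
    "facebook/hubert-large-ll60k",  # an assumed pretrained Hubert backbone
    ctc_loss_reduction="mean",
    pad_token_id=processor.tokenizer.pad_token_id,
    vocab_size=len(processor.tokenizer),
)
model.freeze_feature_extractor()  # same pattern as the wav2vec2 fine-tuning script
```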

Then I used the latest checkpoint to check the performance of the ASR model on Common Voice (using just 10 utterances). I found that the predictions are always pad_token, and the logits are always the same distribution across different time steps and speech utterances. The figure below shows the logits for one utterance; 3609 denotes the vocab_size, and the first index denotes the pad_token.

[screenshot: logits for one utterance, showing an identical distribution at every time step]

Do you have any idea why this issue happens? Thanks in advance!
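For anyone hitting the same symptom, a rough sketch of this kind of sanity check; the checkpoint path and audio file name are placeholders, not the original reporter's files:

```python
# Rough sketch of checking a fine-tuned checkpoint on a single clip.
# Paths below are placeholders.
import soundfile as sf
import torch
from transformers import HubertForCTC, Wav2Vec2Processor

processor = Wav2Vec2Processor.from_pretrained("path/to/latest-checkpoint")
model = HubertForCTC.from_pretrained("path/to/latest-checkpoint").eval()

speech, sr = sf.read("common_voice_clip.wav")  # 16 kHz mono clip (placeholder file)
inputs = processor(speech, sampling_rate=sr, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(inputs.input_values).logits  # shape: (batch, time_steps, vocab_size)

pred_ids = torch.argmax(logits, dim=-1)
# The reported symptom: every predicted id equals the pad token, so decoding yields empty strings.
print((pred_ids == processor.tokenizer.pad_token_id).all())
print(processor.batch_decode(pred_ids))
```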

louislau1129 · Jul 30 '21

Hi @louislau1129! Sorry for the very late reply. I had some health issues that forced me to stay offline for a while. My life is slowly going back to normal now...

It is weird. I really don't know why it was happening to you. I hope you've already fixed it by yourself. This repository is now deprecated in favor of https://github.com/jonatasgrosman/huggingsound. So you can try to reproduce this error with that new tool, and if it's still happening, please open an issue there, and I'll try to help you fix it.
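If it helps, checking a fine-tuned checkpoint with huggingsound looks roughly like the sketch below (based on the huggingsound README; the checkpoint path and audio file are placeholders):

```python
# Rough sketch of transcribing with huggingsound to see whether the model still
# emits only empty/pad output. Paths are placeholders.
from huggingsound import SpeechRecognitionModel

model = SpeechRecognitionModel("path/to/fine-tuned-checkpoint")
transcriptions = model.transcribe(["common_voice_clip.wav"])
print(transcriptions)  # each entry includes the decoded "transcription" (empty if only pad is predicted)
```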

jonatasgrosman · Feb 22 '22

Have you solved it? I met the same problem when fine-tuning wav2vec2 with Hugging Face's Wav2Vec2ForCTC.

qinyuenlp · Jul 26 '22