rnnt-speech-recognition

Training is not converging. eval_wer sticks at ~95%.

Open stefan-falk opened this issue 4 years ago • 7 comments

I was finally able to run training on a single GPU (multi-GPU does not seem to work right now), but the word error rate is not dropping.

I did not change anything in the code, and I am using the Common Voice dataset as suggested by the README.md.

As you can see below, the train_loss drops but the eval_wer goes back up after a slight drop:

[Two images: plots of train_loss and eval_wer]

Any idea where this might come from?
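
For readers unfamiliar with the metric, here is a minimal sketch of how a word error rate is usually computed, assuming eval_wer follows the standard definition (word-level edit distance divided by the number of reference words). The function names are illustrative and are not taken from the repository.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def word_error_rate(references, hypotheses):
    """Corpus-level WER: total word edits divided by total reference words."""
    errors = sum(edit_distance(r.split(), h.split())
                 for r, h in zip(references, hypotheses))
    words = sum(len(r.split()) for r in references)
    return errors / words
```

Under this definition a WER near 95% means almost every reference word is deleted, substituted, or surrounded by spurious insertions, and the value can exceed 100% when the hypothesis contains many extra words.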

stefan-falk avatar Jun 25 '20 07:06 stefan-falk

Excuse me, but I have another question. When I train the model, I always run out of memory, like this:

RuntimeError: CUDA out of memory. Tried to allocate 8.05 GiB (GPU 0; 23.62 GiB total capacity; 18.02 GiB already allocated; 2.84 GiB free; 19.59 GiB reserved in total by PyTorch)

I train on one GPU with 23.6 GiB of memory. How did you manage to run the model on a single GPU? Many thanks!

PeiyanFlying avatar Jul 01 '20 03:07 PeiyanFlying

@PeiyanFlying I am using a rather small batch size, like 8 or 16, on a GeForce 1080 Ti (11 GB VRAM). In fact, multi-GPU seems to be broken at the moment. I am not able to use more than one GPU at this point.
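
A rough sketch of why batch size matters so much here, assuming the standard RNN-T formulation in which the joint network produces one output of shape (batch, frames, labels + 1, vocab). The function name and the example sizes below are hypothetical and are not taken from this repository's config.

```python
def joint_output_gib(batch_size, num_frames, num_labels, vocab_size,
                     bytes_per_value=4):
    """Size in GiB of one (B, T, U + 1, V) tensor (float32 by default)."""
    elements = batch_size * num_frames * (num_labels + 1) * vocab_size
    return elements * bytes_per_value / 2**30


# Hypothetical sizes: 512 encoder frames, 63 target labels, 1024 output units.
small = joint_output_gib(batch_size=8, num_frames=512, num_labels=63, vocab_size=1024)
large = joint_output_gib(batch_size=16, num_frames=512, num_labels=63, vocab_size=1024)
print(small, large)  # 1.0 2.0
```

The size grows linearly in each factor, and training typically holds several tensors of this shape at once (activations plus gradients), so a single multi-GiB allocation like the one in the error message is plausible even at moderate batch sizes.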

stefan-falk avatar Jul 01 '20 08:07 stefan-falk

Thank you very much. These days I am working on RNN-T training on LibriSpeech with PyTorch. But with the same config settings as this repository, it is easy to run into the OOM problem. I will look into it. Thanks!

PeiyanFlying avatar Jul 01 '20 19:07 PeiyanFlying

@PeiyanFlying Have you had any success yet? And could you link me to the PyTorch library you're using? I'd like to take a look in case https://github.com/noahchalifour/rnnt-speech-recognition won't work for me.

stefan-falk avatar Jul 02 '20 06:07 stefan-falk

OK, I am working on it. Once the PyTorch library runs successfully, I will give you the link.

PeiyanFlying avatar Jul 03 '20 14:07 PeiyanFlying

@stefan-falk I have also noticed that the model is not converging. I have been working on a solution for a while. It seems, though, that if you use a small enough dataset (as a test), the model does converge. I did read that in the original paper they use massive batch sizes, and I'm not sure if that is the reason why the model is not converging. Any insights?
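
If batch size does turn out to matter, one common way to emulate a large batch on a single GPU is gradient accumulation: compute gradients on several small micro-batches and average them before each optimizer step. The sketch below uses a toy scalar model in plain Python purely to show that the averaged gradient matches the full-batch gradient. It is an illustration under that assumption, not code from this repository, and whether a larger effective batch fixes the convergence problem is untested.

```python
def mean_squared_error_grad(w, xs, ys):
    """Gradient of mean((w * x - y)^2) with respect to the scalar weight w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def accumulated_grad(w, xs, ys, micro_batch_size):
    """Average the gradients of equally sized micro-batches."""
    if len(xs) % micro_batch_size != 0:
        raise ValueError("micro_batch_size must divide the batch evenly")
    steps = len(xs) // micro_batch_size
    total = 0.0
    for k in range(steps):
        lo, hi = k * micro_batch_size, (k + 1) * micro_batch_size
        total += mean_squared_error_grad(w, xs[lo:hi], ys[lo:hi])
    return total / steps


xs, ys = [1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0]
full = mean_squared_error_grad(0.5, xs, ys)
accumulated = accumulated_grad(0.5, xs, ys, micro_batch_size=2)
print(full, accumulated)  # -22.5 -22.5
```

The equivalence holds exactly only when the loss is a mean over examples and the micro-batches are the same size. Layers that depend on batch statistics, such as batch normalization, would still see the small micro-batch.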

noahchalifour avatar Sep 04 '20 23:09 noahchalifour

@noahchalifour Correct me if I'm wrong, but has nobody managed to train the network from this repo to a WER of 30% or lower on Libri/common_voice?

WrathOfGrapes avatar Sep 22 '20 18:09 WrathOfGrapes