NaN cost

Open · basant-kumar opened this issue 7 years ago · 8 comments

Hi, I'm getting a NaN cost after resuming training from the pre-trained model (librispeech_16_epochs.prm). The cost becomes NaN after epoch 16/17, and the test results (after each epoch) are null.

OS: Ubuntu 16.04
GPU: Nvidia Titan X Pascal (12 GB)
Neon: 1.9.0

basant-kumar · Jun 26 '17

Could you share a few more details about your setup? We haven't seen this behavior. What command are you running to continue training? Which dataset are you using? Is there anything different about your data compared to the LibriSpeech dataset?

tyler-nervana · Jul 11 '17

I am getting the same problem. My audio data are in WAV format rather than FLAC. Could that be the problem? This is my command:

```
python train.py --manifest train:data/train_1700hour.csv --manifest val:data/dev_1700hour.csv -e 20 -z 12 -s model/ds2_1700hour_20_epochs.prm --model_file model/librispeech_16_epochs.prm
```

gardenia22 · Jul 24 '17

My transcription files had '\n' characters in them, which caused the NaN cost problem.
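A minimal sketch of the kind of cleanup that fixed it for me (the path is a placeholder; this assumes one transcript per .txt file):

```python
import glob

# Strip trailing newlines so '\n' is never fed to the model as a label
# character. "data/transcripts/*.txt" is a placeholder for your own layout.
for path in glob.glob("data/transcripts/*.txt"):
    with open(path) as f:
        text = f.read()
    cleaned = text.rstrip("\n")
    if cleaned != text:
        with open(path, "w") as f:
            f.write(cleaned)
```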

gardenia22 · Jul 25 '17

Thanks for the quick update. Currently, anything in the transcript files is treated as a character, including "\n".

tyler-nervana · Jul 25 '17

I also got the same problem when .wav files were used. When I converted the files to FLAC, the NaN problem did not appear.
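In case it helps others, here is a sketch of one way to do the conversion in Python (this assumes the soundfile package, which reads WAV and writes FLAC via libsndfile; the paths are placeholders):

```python
import glob
import os

import soundfile as sf

# Convert every .wav under data/audio/ to a .flac alongside it.
for wav_path in glob.glob("data/audio/*.wav"):
    data, rate = sf.read(wav_path)                      # decode the WAV
    flac_path = os.path.splitext(wav_path)[0] + ".flac"
    sf.write(flac_path, data, rate)                     # format inferred from extension
```

Remember to update the manifest CSVs to point at the .flac files afterwards.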

pankaj2701 · Jul 31 '17

Thanks for noticing the difficulty with .wav files. We'll take a look.

tyler-nervana · Aug 04 '17

Hello, I'm writing here because I encountered the NaN cost problem as well. I am using Neon 2.0 with Python 2.7 on Ubuntu 16.04, with a GTX 1080 backend.

In my case I am training on LibriSpeech train-other-500, and 50-60% of the way through the epoch the cost becomes NaN. I have tried training the model using only the other LibriSpeech packages and it trains as expected. Any thoughts on this?

Drea1989 · Aug 10 '17

I was able to fix the issue by dropping the learning rate by two orders of magnitude. The issue was apparently due to an infinite cost caused by a prediction being too certain of a very wrong value.
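A toy illustration of why that happens (plain NumPy, not the Neon code; the numbers are made up): if the network assigns probability ~0 to the correct label, the negative-log-likelihood term in the cost is -log(0) = inf, and that inf becomes NaN once it flows through the gradient updates.

```python
import numpy as np

# Overconfident, wrong prediction: all mass on class 0, truth is class 1.
probs = np.array([1.0, 0.0])
true_label = 1

with np.errstate(divide="ignore"):
    cost = -np.log(probs[true_label])
print(cost)  # inf -> turns into nan in the following weight updates

# Lowering the learning rate (as above) keeps the network from saturating;
# clipping the log argument is another common guard:
eps = 1e-10
safe_cost = -np.log(max(probs[true_label], eps))
print(safe_cost)  # ~23.0, finite
```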

Drea1989 · Nov 08 '17