
NaN loss and only OOV in the greedy output

Open debajyotidatta opened this issue 6 years ago • 2 comments

The loss was decreasing at first, but then it turned into NaNs and has stayed there for a while. I am running it on the SQuAD dataset, and the exact command I used is:

python train.py --train_tasks squad --device 0 --data ./.data --save ./results/ --embeddings ./.embeddings/ --train_batch_tokens 2000

So the only change is lowering --train_batch_tokens to 2000, since my GPU was running out of memory. I am attaching a screenshot. Is there anything I am missing? Should I try something else?

[screenshot 2018-11-02 14 35 47]
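
In case it helps, here is a minimal sketch of the kind of guard I could add around the training step to catch the NaN as soon as it appears. This is not decaNLP's actual training loop; model, optimizer, batch, and compute_loss are placeholders.

```python
# Minimal sketch of a guarded training step; not decaNLP's actual trainer.
# model, optimizer, batch, and compute_loss are placeholders.
import torch

def training_step(model, optimizer, batch, compute_loss, max_grad_norm=1.0):
    optimizer.zero_grad()
    loss = compute_loss(model, batch)

    # Fail fast if the loss goes non-finite, so the first bad iteration
    # is easy to identify and the parameters are not poisoned further.
    if not torch.isfinite(loss):
        raise RuntimeError("non-finite loss: {}".format(loss.item()))

    loss.backward()
    # Exploding gradients are a common cause of NaN losses; clipping helps.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
    optimizer.step()
    return loss.item()
```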

debajyotidatta avatar Nov 02 '18 22:11 debajyotidatta

Well that's no good. Let me try running your exact command on my side to see if I get the same thing. Do you know which iteration this first started on? Is it 438000?

bmccann avatar Nov 16 '18 19:11 bmccann

> Well that's no good. Let me try running your exact command on my side to see if I get the same thing. Do you know which iteration this first started on? Is it 438000?

I had the same problem when I ran:

nvidia-docker run -it --rm -v $(pwd):/decaNLP/ -u $(id -u):$(id -g) bmccann/decanlp:cuda9_torch041 bash -c "python /decaNLP/train.py --train_tasks squad --device 0"

It started at iteration 316800.
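
If it helps narrow it down, here is a rough sketch of how one could scan a training log for the first NaN loss. The regex is a guess at the log format and would need to be adapted to decaNLP's actual output.

```python
# Rough sketch: print the first log line whose loss parses as NaN.
# The regex is a guess at the log format and will need adjusting.
import math
import re
import sys

pattern = re.compile(r"iteration[_ ](\d+).*?loss[ =:]*([^\s,]+)", re.IGNORECASE)

with open(sys.argv[1]) as f:
    for line in f:
        match = pattern.search(line)
        if not match:
            continue
        try:
            loss = float(match.group(2))
        except ValueError:
            continue
        if math.isnan(loss):
            print("first NaN loss at iteration", match.group(1))
            break
```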

Llaneige avatar Nov 21 '18 08:11 Llaneige