pytorch-video-recognition

Why is the training loss always NaN?

Open lucasjinreal opened this issue 6 years ago • 14 comments

I get loss values like this:


100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 424/424 [04:10<00:00,  2.24it/s]
[train] Epoch: 22/100 Loss: nan Acc: 0.010870849580527
Execution time: 250.25667172999238

100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 108/108 [00:26<00:00,  5.16it/s]
[val] Epoch: 22/100 Loss: nan Acc: 0.011121408711770158
Execution time: 26.448329468010343

100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 424/424 [04:09<00:00,  2.23it/s]
[train] Epoch: 23/100 Loss: nan Acc: 0.010870849580527
Execution time: 249.90277546200377

100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 108/108 [00:26<00:00,  5.09it/s]
[val] Epoch: 23/100 Loss: nan Acc: 0.011121408711770158
Execution time: 26.87914375399123

100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 424/424 [04:09<00:00,  2.24it/s]
[train] Epoch: 24/100 Loss: nan Acc: 0.010870849580527
Execution time: 249.9237438449927

100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 108/108 [00:26<00:00,  5.16it/s]
[val] Epoch: 24/100 Loss: nan Acc: 0.011121408711770158
Execution time: 26.460865497996565

It's all nan. What could be the reason?

lucasjinreal avatar Feb 15 '19 06:02 lucasjinreal

This happens to me too. The PyTorch version is 0.4.1.

100%|█████████████████████████████████████████████████████████████████████████████████| 423/423 [09:39<00:00, 1.34s/it]
[train] Epoch: 100/100 Loss: nan Acc: 0.010874704491725768
Execution time: 579.1260393778794

100%|█████████████████████████████████████████████████████████████████████████████████| 108/108 [01:02<00:00, 2.30it/s]
[val] Epoch: 100/100 Loss: nan Acc: 0.0111162575266327
Execution time: 62.677289011888206

Save model at /media/ext/lizhongguo/ActionRecognition/pytorch-video-recognition/run/run_1/models/C3D-ucf101_epoch-99.pth.tar

100%|█████████████████████████████████████████████████████████████████████████████████| 136/136 [01:16<00:00, 3.15it/s]
[test] Epoch: 100/100 Loss: nan Acc: 0.010736764161421697
Execution time: 76.43733210070059

lizhongguo avatar Feb 18 '19 06:02 lizhongguo

Hi, you could try reducing the learning rate.

jfzhang95 avatar Feb 22 '19 04:02 jfzhang95

I also ran into Loss: nan. I reduced the learning rate from 1e-3 to 1e-1, but the result is the same (Loss: nan).

If the loss is nan, the weights cannot be stored, so the model can't increase its accuracy. Has anybody solved this problem?

KyuminHwang avatar Feb 26 '19 05:02 KyuminHwang

I checked the code at https://github.com/facebookresearch/VMZ/blob/master/lib/models/c3d_model.py and added a BatchNorm layer between each Conv layer and ReLU layer. Now it seems to be working on the UCF-101 dataset.

lizhongguo avatar Feb 26 '19 08:02 lizhongguo
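
For readers who want to try the BatchNorm change described above, here is a minimal sketch of a Conv, BatchNorm, ReLU block for 3D inputs. It is not the code from this repository or from VMZ: the helper name `conv_bn_relu`, the channel counts, the kernel size, and the input shape are assumptions chosen for illustration.

```python
import torch
import torch.nn as nn


def conv_bn_relu(in_channels, out_channels):
    """Hypothetical helper: 3D convolution, then batch norm, then ReLU."""
    return nn.Sequential(
        nn.Conv3d(in_channels, out_channels, kernel_size=3, padding=1),
        # Normalizing the activations before the non-linearity keeps their
        # scale under control, which may help avoid exploding values.
        nn.BatchNorm3d(out_channels),
        nn.ReLU(inplace=True),
    )


block = conv_bn_relu(3, 64)
# Small dummy clip: (batch, channels, frames, height, width)
clip = torch.randn(2, 3, 4, 16, 16)
out = block(clip)
print(out.shape)
```

With `padding=1` and a kernel size of 3, the block keeps the temporal and spatial dimensions and only changes the channel count.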

@lizhongguo let me have a look

lucasjinreal avatar Feb 26 '19 08:02 lucasjinreal

> I also ran into Loss: nan. I reduced the learning rate from 1e-3 to 1e-1, but the result is the same (Loss: nan).
>
> If the loss is nan, the weights cannot be stored, so the model can't increase its accuracy. Has anybody solved this problem?

Reducing the learning rate means selecting a rate lower than 1e-3, such as 1e-5 or 0.5e-3. Personally, I trained the model from scratch on UCF101 with a learning rate of 1e-3 without any NaN issues.

wave-transmitter avatar Feb 26 '19 08:02 wave-transmitter
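
As a rough illustration of the advice above, the sketch below sets a lower learning rate and skips any step whose loss is not finite. The `train_step` helper, the tiny `nn.Linear` stand-in model, and the use of SGD with momentum 0.9 are assumptions for the example and are not taken from this repository's training script.

```python
import math

import torch
import torch.nn as nn


def train_step(model, optimizer, criterion, inputs, labels):
    """Run one optimization step; return the loss, or None if it is not finite."""
    optimizer.zero_grad()
    loss = criterion(model(inputs), labels)
    if not math.isfinite(loss.item()):
        # Skip the update so NaN/inf gradients never reach the weights.
        return None
    loss.backward()
    optimizer.step()
    return loss.item()


torch.manual_seed(0)
model = nn.Linear(8, 3)  # stand-in for the real network
criterion = nn.CrossEntropyLoss()
# A rate below 1e-3; 1e-5 is one of the values mentioned in the thread.
optimizer = torch.optim.SGD(model.parameters(), lr=1e-5, momentum=0.9)

inputs = torch.randn(4, 8)
labels = torch.tensor([0, 1, 2, 1])
loss_value = train_step(model, optimizer, criterion, inputs, labels)
print(loss_value)
```

Skipping non-finite steps only prevents the weights from being overwritten with NaN. It does not fix the underlying instability, so a lower rate or a normalization layer is still needed.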

@wave-transmitter Thank you for the comment! I solved this problem by changing the learning rate. I reduced it to 1e-5, and then it worked correctly!

KyuminHwang avatar Feb 27 '19 00:02 KyuminHwang

However, when I reduce the learning rate, the accuracy is just 0.20. What should I do?

ilovekj avatar May 02 '19 12:05 ilovekj

@ilovekj I recommend searching for a learning rate that suits your setup! I adjusted it several times before I found a suitable rate. How about augmenting your dataset?

KyuminHwang avatar May 05 '19 14:05 KyuminHwang
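
The trial-and-error search described above can be sketched on a toy problem. This is not the C3D training loop: the objective `f(w) = w**4`, the starting point, the step count, and the candidate rates are all assumptions picked so that the largest rate diverges to nan while the smaller ones stay finite.

```python
import math


def final_loss(lr, steps=50, w0=5.0):
    """Plain gradient descent on f(w) = w**4; returns the last loss value."""
    w = w0
    for _ in range(steps):
        grad = 4 * w * w * w  # derivative of w**4
        w = w - lr * grad
    return w * w * w * w


candidates = [1e-1, 1e-3, 1e-5]
results = {lr: final_loss(lr) for lr in candidates}

# Keep only the rates whose loss stayed finite, then take the lowest loss.
finite = {lr: loss for lr, loss in results.items() if math.isfinite(loss)}
best_lr = min(finite, key=finite.get)

for lr, loss in results.items():
    print(lr, loss)
print("best:", best_lr)
```

In this toy run the largest rate overshoots until the weight overflows and the loss becomes nan, the smallest rate barely moves, and the middle rate makes the most progress. The same pattern of "too high diverges, too low stalls" is what a manual sweep on the real model is looking for.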

@makeastir But there is another issue. It seems that this code splits the dataset randomly, which is not allowed: there are three official splits. When I use this code, it performs poorly.

ilovekj avatar May 07 '19 04:05 ilovekj
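
To make the concern about random splitting concrete: UCF101 file names encode a group id (for example `g08`), and clips from the same group are closely related. The sketch below checks whether a train list and a test list share any (class, group) pair. The helper names and the sample lines are illustrative assumptions. The line format follows the official `trainlist0X.txt` / `testlist0X.txt` files, where train lines carry a class index and test lines do not.

```python
def parse_entry(line):
    """Split one list line into (class_name, group_id, relative_path)."""
    path = line.split()[0]  # drop the optional class index
    class_name, file_name = path.split("/")
    # File names look like v_<Class>_g<group>_c<clip>.avi
    group_id = file_name.rsplit(".", 1)[0].split("_")[-2]
    return class_name, group_id, path


def shared_groups(train_lines, test_lines):
    """Return the (class, group) pairs that occur in both lists."""
    train = {parse_entry(l)[:2] for l in train_lines if l.strip()}
    test = {parse_entry(l)[:2] for l in test_lines if l.strip()}
    return train & test


# Illustrative lines in the official list format.
official_train = [
    "ApplyEyeMakeup/v_ApplyEyeMakeup_g08_c01.avi 1",
    "ApplyEyeMakeup/v_ApplyEyeMakeup_g08_c02.avi 1",
]
official_test = ["ApplyEyeMakeup/v_ApplyEyeMakeup_g01_c01.avi"]

# A random split can put clips of one group on both sides.
random_train = [
    "ApplyEyeMakeup/v_ApplyEyeMakeup_g08_c01.avi 1",
    "ApplyEyeMakeup/v_ApplyEyeMakeup_g01_c01.avi 1",
]
random_test = ["ApplyEyeMakeup/v_ApplyEyeMakeup_g08_c02.avi"]

leak_official = shared_groups(official_train, official_test)
leak_random = shared_groups(random_train, random_test)
print(leak_official, leak_random)
```

Shared groups can make accuracy on a random split look better than it would on an official split. That is only one possible contributor and would not by itself explain chance-level accuracy on the official split.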

@ilovekj I also used this code and got good performance. The code has an augmentation module, so it should already make the dataset more useful. How about increasing the size of your dataset? In my case, Non-True is 400 and True is 150. Or how about reducing the number of features in the dataset?

KyuminHwang avatar May 08 '19 04:05 KyuminHwang

@makeastir but you didn't use the official splits

ilovekj avatar May 08 '19 04:05 ilovekj

@ilovekj Hi. I used the official split with a corresponding dataloader and only got 1% accuracy, but the same code on the random split gets 98%. Did you figure out the problem?

ziqi-zhang avatar May 09 '19 02:05 ziqi-zhang

Maybe it's because we didn't use a pretrained model, but I am not sure.

ilovekj avatar May 09 '19 06:05 ilovekj