tensorflow-yolov3

loss nan on test set

Open justanotherYO opened this issue 5 years ago • 8 comments

I am testing on the VOC2007 dataset. Training went OK and the training loss keeps dropping nicely (after 3 epochs it was ~30). However, every time an epoch finishes, the test loss is NaN. Has anybody faced a similar problem? PS: I am training from scratch.

justanotherYO avatar May 26 '19 10:05 justanotherYO

Never mind. I got it fixed. The problem is mine, and has nothing to do with the code...

justanotherYO avatar May 26 '19 23:05 justanotherYO

> Never mind. I got it fixed. The problem is mine, and has nothing to do with the code...

I have the same problem. How did you solve it? Did you reduce the learning rate?

ZH-Lee avatar May 29 '19 08:05 ZH-Lee

Same issue here. Anyone?

Sahaj09 avatar May 29 '19 17:05 Sahaj09

I changed my __C.TRAIN.BATCH_SIZE to 3, which gave me the loss=nan issue. I changed it to 2, which fixed the issue. Originally I had it at 6, but ran into an OOM exception. Running on a crappy AM8 2 core CPU, with only 8 GB of RAM, and a new RTX 2080 with 8 GB.

andydion avatar May 30 '19 07:05 andydion

> I changed my __C.TRAIN.BATCH_SIZE to 3, which gave me the loss=nan issue. I changed it to 2, which fixed the issue. Originally I had it at 6, but ran into an OOM exception. Running on a crappy AM8 2 core CPU, with only 8 GB of RAM, and a new RTX 2080 with 8 GB.

So, does that mean we can't set the batch_size too large? Or does it depend on our GPU memory?

ZH-Lee avatar May 30 '19 08:05 ZH-Lee

I'm not sure. The batch_size was probably too large, possibly in combination with a larger dataset of 8000 images. How did you resolve your "nan" issue?

andydion avatar May 30 '19 13:05 andydion

Did you wait for a few epochs? I had nan on a test set for the first few epochs (training on the XISRay dataset with default settings) and then it went back to normal.

ps. I didn't have the GPU memory issue.

FangliangBai avatar May 30 '19 16:05 FangliangBai

> Did you wait for a few epochs? I had nan on a test set for the first few epochs (training on the XISRay dataset with default settings) and then it went back to normal.
>
> ps. I didn't have the GPU memory issue.

@FangliangBai After how many iterations did it become normal?

MuhammadAsadJaved avatar Aug 07 '20 08:08 MuhammadAsadJaved
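
For anyone trying to narrow this down: a minimal sketch, assuming the per-epoch test loss is computed as an average over per-batch losses (the thread does not confirm how the repo computes it). Under that assumption a single NaN batch turns the whole epoch average into NaN, so it can help to record which batches are non-finite instead of only looking at the mean. The function name `summarize_losses` and the loss values below are hypothetical and only for illustration.

```python
import math


def summarize_losses(batch_losses):
    """Average the finite per-batch losses and list the indices of non-finite ones.

    Hypothetical helper for illustration; not part of tensorflow-yolov3.
    """
    finite = []
    bad_indices = []
    for i, loss in enumerate(batch_losses):
        if math.isfinite(loss):
            finite.append(loss)
        else:
            bad_indices.append(i)
    # If every batch is non-finite there is nothing to average.
    mean_finite = sum(finite) / len(finite) if finite else float("nan")
    return mean_finite, bad_indices


# Made-up per-batch test losses with one NaN batch.
losses = [31.2, 29.8, float("nan"), 30.5]

# A plain average is poisoned by the single NaN entry.
naive_mean = sum(losses) / len(losses)

mean_finite, bad = summarize_losses(losses)
print("naive mean is NaN:", math.isnan(naive_mean))
print("mean over finite batches:", mean_finite)
print("non-finite batch indices:", bad)
```

If only a few batch indices show up as non-finite, the images and labels in those batches are a reasonable place to start looking. If every batch is non-finite, the cause is more likely in the settings (for example the batch size discussed above) than in individual samples.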