InsightFace_TF
Loss becomes NaN
I use train_nets_mgpu.py with batch size 64 to run on 2 GPUs. The only change is the batch size. But at step 88840, all the losses become NaN. According to the code, the lr at this iteration is 0.001, so it is not a large-lr problem. Also, the training accuracy is very weird.
Should I change the initial lr from 0.001 to 0.0001, since the batch size is only 64?
Here is the training log:
epoch 0, total_step 88680, total loss: [16.43, 16.24], inference loss: [7.73, 7.54], weight deacy loss: [8.70, 8.70], training accuracy is 0.375000, time 92.323 samples/sec
epoch 0, total_step 88700, total loss: [17.05, 18.03], inference loss: [8.35, 9.33], weight deacy loss: [8.70, 8.70], training accuracy is 0.156250, time 92.043 samples/sec
epoch 0, total_step 88720, total loss: [16.25, 15.06], inference loss: [7.56, 6.36], weight deacy loss: [8.70, 8.70], training accuracy is 0.312500, time 93.241 samples/sec
epoch 0, total_step 88740, total loss: [18.12, 20.31], inference loss: [9.42, 11.62], weight deacy loss: [8.70, 8.70], training accuracy is 0.093750, time 93.002 samples/sec
epoch 0, total_step 88760, total loss: [16.82, 17.98], inference loss: [8.12, 9.29], weight deacy loss: [8.70, 8.70], training accuracy is 0.375000, time 92.508 samples/sec
epoch 0, total_step 88780, total loss: [17.50, 20.11], inference loss: [8.80, 11.41], weight deacy loss: [8.70, 8.70], training accuracy is 0.281250, time 92.468 samples/sec
epoch 0, total_step 88800, total loss: [17.24, 17.54], inference loss: [8.54, 8.85], weight deacy loss: [8.70, 8.70], training accuracy is 0.187500, time 93.764 samples/sec
epoch 0, total_step 88820, total loss: [14.16, 15.54], inference loss: [5.47, 6.84], weight deacy loss: [8.69, 8.69], training accuracy is 0.343750, time 93.311 samples/sec
epoch 0, total_step 88840, total loss: [nan, nan], inference loss: [nan, nan], weight deacy loss: [nan, nan], training accuracy is 0.000000, time 92.466 samples/sec
epoch 0, total_step 88860, total loss: [nan, nan], inference loss: [nan, nan], weight deacy loss: [nan, nan], training accuracy is 0.000000, time 92.594 samples/sec
epoch 0, total_step 88880, total loss: [nan, nan], inference loss: [nan, nan], weight deacy loss: [nan, nan], training accuracy is 0.000000, time 92.265 samples/sec
epoch 0, total_step 88900, total loss: [nan, nan], inference loss: [nan, nan], weight deacy loss: [nan, nan], training accuracy is 0.000000, time 92.653 samples/sec
Hi! Did you solve this problem?
@chaitanyasiva8
Yes, I downloaded the MS1M-ArcFace (85K ids / 5.8M images) dataset from https://github.com/deepinsight/insightface/wiki/Dataset-Zoo and recounted the number of labels, which is 85742. The original num_output argument was 85164, which is wrong, and that is why the NaN happens.
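For reference, here is a minimal sketch of how the label count can be verified against num_output before training. It assumes the converted training data is a TFRecord file with an int64 feature named 'label' holding the identity id; the path and feature key are assumptions, so adjust them to whatever the repo's data-prep step actually produces.

import tensorflow as tf

def count_identities(tfrecord_path='datasets/train.tfrecords'):
    """Count distinct identity labels in a TFRecord file (TF 1.x API)."""
    labels = set()
    for record in tf.python_io.tf_record_iterator(tfrecord_path):
        example = tf.train.Example()
        example.ParseFromString(record)
        # 'label' is assumed to be an int64 feature with the identity id
        label = example.features.feature['label'].int64_list.value[0]
        labels.add(label)
    return len(labels)

if __name__ == '__main__':
    n = count_identities()
    print('distinct identities: %d -> num_output should match this' % n)

If the number printed here is larger than num_output, some labels fall outside the classifier's output range, which is one way the loss can blow up to NaN.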
Thanks! I hope it works!
Thanks so much! @ghost. However, how did you find that the true cause was "num_output" and not something else? I have tried different lr schedules, but it does not work...