AdelaiDepth

loss becomes nan

Open erzhu222 opened this issue 2 years ago • 4 comments

lib.utils.logging INFO: [Step 10470/182650] [Epoch 2/50] [multi] loss: nan, time: 5.862533, eta: 11 days, 16:23:31 meanstd-tanh_auxiloss: nan, meanstd-tanh_loss: nan, msg_normal_loss: nan, pairwise-normal-regress-edge_loss: nan, pairwise-normal-regress-plane_loss: nan, ranking-edge_auxiloss: nan, ranking-edge_loss: nan, abs_rel: 0.211080, whdr: 0.087764, group0_lr: 0.001000, group1_lr: 0.001000

Hello, when I train on the four datasets taskonomy, DiverseDepth, HRWSI, and Holopix50k, the loss becomes NaN. Did you run into this problem during your training? If so, how should I solve it? Thank you! Here are the arguments I used:

--backbone resnext101
--dataset_list taskonomy DiverseDepth HRWSI Holopix50k
--batchsize 16
--base_lr 0.001
--use_tfboard
--thread 8
--loss_mode ranking-edge_pairwise-normal-regress-edge_msgil-normal_meanstd-tanh_pairwise-normal-regress-plane_ranking-edge-auxi_meanstd-tanh-auxi
--epoch 50
--lr_scheduler_multiepochs 10 25 40
--val_step 5000
--snapshot_iters 5000
--log_interval 10

erzhu222, Aug 23 '22 01:08

I have not run into this issue. You can clip the gradients to avoid it.
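For readers unfamiliar with the suggestion, here is a minimal sketch of gradient clipping by global norm, written in plain Python so it runs without a GPU. `clip_by_global_norm` is a hypothetical helper for illustration and is not part of AdelaiDepth; in PyTorch the built-in counterpart is `torch.nn.utils.clip_grad_norm_`, called between `loss.backward()` and `optimizer.step()`.

```python
import math


def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Rescale gradients so their global L2 norm is at most max_norm.

    Hypothetical helper for illustration only.
    Returns (clipped_grads, total_norm_before_clipping).
    """
    total_norm = math.sqrt(sum(g * g for g in grads))
    # Scale down only when the norm exceeds the limit; never scale up.
    scale = min(1.0, max_norm / (total_norm + eps))
    return [g * scale for g in grads], total_norm


# Gradients (3, 4) have a global norm of 5.0 before clipping.
clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
print(norm_before, clipped)
```

Clipping bounds the size of each update, which helps against exploding gradients. It only rescales values, so it cannot repair a gradient that is already NaN.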

YvanYin, Aug 23 '22 03:08

Thanks very much, I will try that! However, I am using the latest code unchanged. I only changed the batch size and the thread count, and I am training on 8 NVIDIA V100 GPUs. What batch size and thread count did you use for training?

erzhu222, Aug 23 '22 03:08

Changing the batch size will not cause the loss to become NaN. I once ran into the "loss nan" problem because of the crop operation: if the depth image is invalid (0) everywhere after cropping, the loss becomes NaN. I will try to debug and avoid it, but it may take some time because it requires 8 NVIDIA V100 GPUs.
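A minimal sketch of how an all-invalid crop can be guarded against, assuming the loss is averaged over pixels with valid (positive) ground-truth depth. `masked_l1_loss` is a hypothetical stand-in and not one of the losses used in AdelaiDepth. With zero valid pixels the average is 0/0, which tensor libraries evaluate to NaN (plain Python would raise `ZeroDivisionError` instead), so the sketch returns `None` to signal that the sample should be skipped.

```python
def masked_l1_loss(pred, gt):
    """Mean absolute error over pixels whose ground-truth depth is valid (> 0).

    Hypothetical helper for illustration only.
    Returns None when the crop contains no valid pixel, so the caller can
    skip the sample instead of averaging over zero elements.
    """
    valid = [(p, g) for p, g in zip(pred, gt) if g > 0]
    if not valid:
        return None
    return sum(abs(p - g) for p, g in valid) / len(valid)


# A crop whose depth is 0 everywhere yields no loss value at all.
print(masked_l1_loss([1.0, 2.0], [0.0, 0.0]))
# A crop with one valid pixel averages over that pixel only.
print(masked_l1_loss([1.0, 2.0], [2.0, 0.0]))
```

The same check could instead live in the data loader, for example by re-sampling the crop until it contains at least one valid depth value.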

How many iterations did you train before the loss became NaN? You can try clipping the gradients to avoid it, or wait for my debugging. Thank you!

guangkaixu, Aug 27 '22 12:08

Thanks for your reply! The loss became NaN after about 12000 iterations (in the 3rd epoch). I see that the code you released already contains gradient clipping, but it does not seem to help.
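One possible explanation, offered as an assumption and not a confirmed diagnosis: once the loss itself is NaN, every gradient is NaN too, and clipping cannot bring it back. A common workaround is to skip the update for that batch. The sketch below shows the idea in plain Python; `safe_sgd_step` is a hypothetical helper, and in PyTorch the equivalent check would use `torch.isfinite(loss)` before calling `optimizer.step()`.

```python
import math


def safe_sgd_step(params, grads, loss, lr=0.001):
    """Apply one SGD update, or skip it if the loss or any gradient is non-finite.

    Hypothetical helper for illustration only.
    Returns (new_params, applied), where applied is False for a skipped step.
    """
    if not math.isfinite(loss) or not all(math.isfinite(g) for g in grads):
        # Leave the parameters untouched so one bad batch cannot poison the model.
        return list(params), False
    return [p - lr * g for p, g in zip(params, grads)], True


# A NaN batch is skipped and the parameters stay as they were.
params, applied = safe_sgd_step([1.0], [float("nan")], float("nan"))
print(params, applied)
```

Skipping keeps training alive, but it hides the cause. Logging which dataset and sample produced the non-finite loss would help narrow down whether the crop is responsible.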

erzhu222, Aug 29 '22 03:08