Exception: found a loss that is not finite
Hello, thanks for the excellent work! Your answer would be deeply appreciated.
I'm training a shufflenetv2k30 for human pose estimation with a customised dataset and customised data augmentation. I have visualised and validated the training samples; however, during training I ran into this problem:
File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
File "/opt/conda/lib/python3.8/site-packages/openpifpaf-0.13.4+2.gc7b05a2-py3.8-linux-x86_64.egg/openpifpaf/train.py", line 202, in
again:
File "/opt/conda/lib/python3.8/site-packages/openpifpaf-0.13.4+2.gc7b05a2-py3.8-linux-x86_64.egg/openpifpaf/network/losses/composite.py", line 339, in forward
    raise Exception('found a loss that is not finite: {}, prev: {}'
Exception: found a loss that is not finite: [tensor(-181.7841, device='cuda:0', grad_fn=<DivBackward0>), tensor(inf, device='cuda:0', grad_fn=<DivBackward0>), tensor(0.1630, device='cuda:0', grad_fn=<DivBackward0>)], prev: [-187.84259033203125, 1307.96533203125, 0.12597596645355225]
It happens at a low frequency, roughly once per 250k images. I printed out the related values when it happened again:

torch.sum(l_confidence_bg): tensor(499.5447, device='cuda:0', grad_fn=<SumBackward0>)
torch.sum(l_confidence): tensor(-16766.3672, device='cuda:0', grad_fn=<SumBackward0>)
torch.sum(l_reg): tensor(inf, device='cuda:0', grad_fn=<SumBackward0>)
torch.sum(l_scale): tensor(7.2607, device='cuda:0', grad_fn=<SumBackward0>)
batch_size: 64
x_regs: tensor(-68.8318, device='cuda:0', grad_fn=<SumBackward0>)
t_regs: tensor(139.7582, device='cuda:0')
t_sigma_min: tensor(190.9250, device='cuda:0')
t_scales_reg: tensor(17457.9258, device='cuda:0')
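For context, the failing check can be reproduced with plain floats. check_losses below is a hypothetical one-line simplification of the finite test in composite.py (not the actual openpifpaf code), fed with the component sums above; the inf in the l_reg component is what trips it:

```python
import math

def check_losses(losses, prev):
    """Hypothetical simplification of the finite check in
    CompositeLoss.forward() (composite.py, line 339)."""
    if not all(math.isfinite(l) for l in losses):
        raise ValueError(
            'found a loss that is not finite: {}, prev: {}'.format(losses, prev))

# The reported component sums: l_reg came out as inf, so the check fires.
components = [-181.7841, float('inf'), 0.1630]
prev = [-187.84259033203125, 1307.96533203125, 0.12597596645355225]
try:
    check_losses(components, prev)
except ValueError as exc:
    print(exc)
```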
In composite.py and components.py under /network/losses, I noticed that x[above_max] = self.max_value + torch.log(1 - self.max_value + x[above_max]) does not actually clamp the value when it is inf, since torch.log of inf is still inf. There is also no corresponding remedy for inf values occurring in class RegressionLoss(). I have confirmed that the input images contain no inf or nan values. Could you suggest what the possible reasons might be?
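To illustrate the point about the soft clamp, here is a scalar sketch of that formula (MAX_VALUE is an arbitrary illustrative threshold, not the value openpifpaf actually uses): a finite input above the threshold is compressed logarithmically, but inf passes straight through, because log of inf is inf.

```python
import math

MAX_VALUE = 5.0  # illustrative threshold, not openpifpaf's actual setting

def soft_clamp(x, max_value=MAX_VALUE):
    """Scalar sketch of the log-based soft clamp:
    values above max_value grow logarithmically instead of being cut off."""
    if x > max_value:
        return max_value + math.log(1 - max_value + x)
    return x

print(soft_clamp(100.0))         # finite: 5 + log(96), roughly 9.56
print(soft_clamp(float('inf')))  # inf: math.log(inf) is inf, nothing is clamped
```

A hard clamp (torch.clamp) or an explicit torch.isfinite check before summing would cap the value, but that changes the gradient behaviour, so this only demonstrates where the inf survives.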
I have exactly the same issue.
It happens when I am trying to fine-tune a model (using --checkpoint) that has not been trained on my custom dataset.
Surprisingly, I don't have this issue when I am training from scratch (using --basenet) with my custom dataset.
I managed to fine-tune a checkpoint by decreasing the learning rate.
I now use --lr=0.0001 with the SGD optimizer and everything seems to work.
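In case it helps someone else, the invocation was along these lines; the checkpoint name and any dataset-specific flags below are placeholders for your own setup, and only --checkpoint and --lr are the relevant parts:

```shell
# Placeholder fine-tuning command; substitute your own checkpoint and
# dataset flags. The lowered learning rate is what avoided the inf loss.
python3 -m openpifpaf.train \
  --checkpoint=shufflenetv2k30 \
  --lr=0.0001
```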