
Exception: found a loss that is not finite

Open XiaodongGuan opened this issue 3 years ago • 2 comments

Hello, thanks for the excellent work! Your answer would be deeply appreciated.

I'm training a shufflenetv2k30 for human pose estimation with a customised dataset and customised data augmentation. I have visualised and validated the training samples; however, during training I ran into this problem:

```
  File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.8/site-packages/openpifpaf-0.13.4+2.gc7b05a2-py3.8-linux-x86_64.egg/openpifpaf/train.py", line 202, in <module>
    main()
  File "/opt/conda/lib/python3.8/site-packages/openpifpaf-0.13.4+2.gc7b05a2-py3.8-linux-x86_64.egg/openpifpaf/train.py", line 198, in main
    trainer.loop(train_loader, val_loader, start_epoch=start_epoch)
  File "/opt/conda/lib/python3.8/site-packages/openpifpaf-0.13.4+2.gc7b05a2-py3.8-linux-x86_64.egg/openpifpaf/network/trainer.py", line 158, in loop
    self.train(train_scenes, epoch)
  File "/opt/conda/lib/python3.8/site-packages/openpifpaf-0.13.4+2.gc7b05a2-py3.8-linux-x86_64.egg/openpifpaf/network/trainer.py", line 294, in train
    loss, head_losses = self.train_batch(data, target, apply_gradients)
  File "/opt/conda/lib/python3.8/site-packages/openpifpaf-0.13.4+2.gc7b05a2-py3.8-linux-x86_64.egg/openpifpaf/network/trainer.py", line 183, in train_batch
    loss, head_losses = self.loss(outputs, targets)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/openpifpaf-0.13.4+2.gc7b05a2-py3.8-linux-x86_64.egg/openpifpaf/network/losses/multi_head.py", line 29, in forward
    flat_head_losses = [ll
  File "/opt/conda/lib/python3.8/site-packages/openpifpaf-0.13.4+2.gc7b05a2-py3.8-linux-x86_64.egg/openpifpaf/network/losses/multi_head.py", line 31, in <listcomp>
    for ll in l(f, t)]
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/openpifpaf-0.13.4+2.gc7b05a2-py3.8-linux-x86_64.egg/openpifpaf/network/losses/composite.py", line 339, in forward
    raise Exception('found a loss that is not finite: {}, prev: {}'
Exception: found a loss that is not finite: [tensor(-187.4948, device='cuda:0', grad_fn=<DivBackward0>), tensor(inf, device='cuda:0', grad_fn=<DivBackward0>), tensor(0.2073, device='cuda:0', grad_fn=<DivBackward0>)], prev: [-169.70677185058594, 1248.675048828125, 0.15428416430950165]
```

It happened again:

```
  File "/opt/conda/lib/python3.8/site-packages/openpifpaf-0.13.4+2.gc7b05a2-py3.8-linux-x86_64.egg/openpifpaf/network/losses/composite.py", line 339, in forward
    raise Exception('found a loss that is not finite: {}, prev: {}'
Exception: found a loss that is not finite: [tensor(-181.7841, device='cuda:0', grad_fn=<DivBackward0>), tensor(inf, device='cuda:0', grad_fn=<DivBackward0>), tensor(0.1630, device='cuda:0', grad_fn=<DivBackward0>)], prev: [-187.84259033203125, 1307.96533203125, 0.12597596645355225]
```

It happens at a low frequency: roughly once per 250k images. I printed out the related values when it happened again:

```
torch.sum(l_confidence_bg): tensor(499.5447, device='cuda:0', grad_fn=<SumBackward0>)
torch.sum(l_confidence):    tensor(-16766.3672, device='cuda:0', grad_fn=<SumBackward0>)
torch.sum(l_reg):           tensor(inf, device='cuda:0', grad_fn=<SumBackward0>)
torch.sum(l_scale):         tensor(7.2607, device='cuda:0', grad_fn=<SumBackward0>)
batch_size:   64
x_regs:       tensor(-68.8318, device='cuda:0', grad_fn=<SumBackward0>)
t_regs:       tensor(139.7582, device='cuda:0')
t_sigma_min:  tensor(190.9250, device='cuda:0')
t_scales_reg: tensor(17457.9258, device='cuda:0')
```
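While debugging, I used a generic guard along these lines to log the offending term and skip the update instead of raising; this is only a sketch of the idea (the `loss_components` and `optimizer` names are placeholders), not openpifpaf's own training code:

```python
import logging

import torch

LOG = logging.getLogger(__name__)


def train_batch_guarded(loss_components, optimizer):
    """Sum per-head loss terms, but skip the gradient step if any term is non-finite."""
    for i, component in enumerate(loss_components):
        if not torch.isfinite(component):
            LOG.warning('loss component %d is not finite: %s', i, component.item())
            optimizer.zero_grad()  # drop this batch instead of raising
            return None
    total = sum(loss_components)
    total.backward()
    optimizer.step()
    optimizer.zero_grad()
    return total.item()
```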

In composite.py and components.py under network/losses, I noticed that `x[above_max] = self.max_value + torch.log(1 - self.max_value + x[above_max])` does not actually clamp the value when the value is inf. There is also no corresponding remedy for the occurrence of inf values in the RegressionLoss class. I have confirmed that the input images contain no inf or nan values. Could you please suggest what the possible reasons might be?
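For illustration, here is a minimal standalone reproduction of that soft-clamp expression (with `max_value` hard-coded to 5.0 and made-up example values, not openpifpaf's actual class): an infinite input stays infinite, whereas an explicit hard clamp would keep it finite.

```python
import torch

max_value = 5.0
x = torch.tensor([3.0, 6.0, float('inf')])

above_max = x > max_value
soft = x.clone()
# the log-based soft clamp: log(1 - max_value + inf) is still inf
soft[above_max] = max_value + torch.log(1 - max_value + soft[above_max])
print(soft)  # tensor([3.0000, 5.6931, inf])

# an explicit hard clamp (one possible remedy) keeps the value finite
hard = x.clamp(max=50.0)
print(hard)  # tensor([ 3.,  6., 50.])
```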

XiaodongGuan avatar Sep 24 '22 14:09 XiaodongGuan

I have exactly the same issue.

It happens when I am trying to fine-tune a model (using --checkpoint) that has not been trained on my custom dataset. Surprisingly, I don't have this issue when I am training from scratch (using --basenet) with my custom dataset.

bstandaert avatar Jan 06 '23 14:01 bstandaert

I managed to fine-tune a checkpoint by decreasing the learning rate. I now use --lr=0.0001 with the SGD optimizer and everything seems to work.
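For reference, a fine-tuning invocation along these lines might look like the sketch below; only --checkpoint and --lr come from this thread, and the dataset name is a placeholder for your own dataset plugin.

```sh
# custom-animal-kp is a placeholder dataset plugin name;
# --checkpoint starts from a pretrained model, --lr=0.0001 is the lowered learning rate
python3 -m openpifpaf.train \
  --dataset=custom-animal-kp \
  --checkpoint=shufflenetv2k30 \
  --lr=0.0001
```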

bstandaert avatar Jan 11 '23 10:01 bstandaert