NaN losses
Hi,
I am observing NaN losses just after the 1st or 2nd iteration. I am running the following command: `python3 tools/run_train.py --config-file configs/coco-experiments/mask_rcnn_R_50_FPN_fc_fullclsag_base.yaml`
The following is the error I am getting:
```
Traceback (most recent call last):
File "/home/puneet/segmentation/iMTFA/detectron2/engine/train_loop.py", line 132, in train
self.run_step()
File "/home/puneet/segmentation/iMTFA/detectron2/engine/train_loop.py", line 217, in run_step
self._detect_anomaly(losses, loss_dict)
File "/home/puneet/segmentation/iMTFA/detectron2/engine/train_loop.py", line 240, in _detect_anomaly
self.iter, loss_dict
FloatingPointError: Loss became infinite or NaN at iteration=2!
loss_dict = {'loss_cls': tensor(6.5565, device='cuda:0', grad_fn=<NllLossBackward>), 'loss_box_reg': tensor(0.0876, device='cuda:0', grad_fn=<DivBackward0>), 'loss_mask': tensor(nan, device='cuda:0', grad_fn=<BinaryCrossEntropyWithLogitsBackward>), 'loss_rpn_cls': tensor(0.7022, device='cuda:0', grad_fn=<MulBackward0>), 'loss_rpn_loc': tensor(0.0971, device='cuda:0', grad_fn=<MulBackward0>)}
[07/25 13:14:56 d2.engine.hooks]: Total training time: 0:00:01 (0:00:00 on hooks)
Traceback (most recent call last):
File "tools/run_train.py", line 151, in <module>
args=(args,),
File "/home/puneet/segmentation/iMTFA/detectron2/engine/launch.py", line 72, in launch
main_func(*args)
File "tools/run_train.py", line 126, in main
return trainer.train()
File "/home/puneet/segmentation/iMTFA/detectron2/engine/defaults.py", line 393, in train
super().train(self.start_iter, self.max_iter)
File "/home/puneet/segmentation/iMTFA/detectron2/engine/train_loop.py", line 132, in train
self.run_step()
File "/home/puneet/segmentation/iMTFA/detectron2/engine/train_loop.py", line 217, in run_step
self._detect_anomaly(losses, loss_dict)
File "/home/puneet/segmentation/iMTFA/detectron2/engine/train_loop.py", line 240, in _detect_anomaly
self.iter, loss_dict
FloatingPointError: Loss became infinite or NaN at iteration=2!
loss_dict = {'loss_cls': tensor(6.5565, device='cuda:0', grad_fn=<NllLossBackward>), 'loss_box_reg': tensor(0.0876, device='cuda:0', grad_fn=<DivBackward0>), 'loss_mask': tensor(nan, device='cuda:0', grad_fn=<BinaryCrossEntropyWithLogitsBackward>), 'loss_rpn_cls': tensor(0.7022, device='cuda:0', grad_fn=<MulBackward0>), 'loss_rpn_loc': tensor(0.0971, device='cuda:0', grad_fn=<MulBackward0>)}
Segmentation fault (core dumped)
```
Any idea what could be the issue here?
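For what it's worth, I was thinking of enabling PyTorch's autograd anomaly detection before training to get a stack trace pointing at the operation that first produces the NaN. This is only a minimal sketch, assuming the standard trainer setup in `tools/run_train.py`; the exact place to add it may differ:

```python
import torch

# Report the first backward-pass operation that produces NaN/Inf with a full
# stack trace (slows down iterations, so only enable while debugging).
torch.autograd.set_detect_anomaly(True)

# ...then build and run the trainer as usual, e.g. (names here are just
# illustrative of the usual detectron2-style flow):
# trainer = Trainer(cfg)
# trainer.resume_or_load(resume=args.resume)
# trainer.train()
```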