Loss becomes infinite while training quant models
Hi, when I try to train a quant model using the config detectron2/configs/COCO-Detection/retinanet_R_18_FPN_1x-Full-SyncBN-lsq-2bit.yaml, the loss becomes NaN at iteration 390:
-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/home/zhangjinhe/anaconda3/envs/torch/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
fn(i, *args)
File "/home/zhangjinhe/QTools/git/detectron2/detectron2/engine/launch.py", line 125, in _distributed_worker main_func(*args)
File "/home/zhangjinhe/QTools/git/detectron2/tools/train_net.py", line 154, in main
return trainer.train()
File "/home/zhangjinhe/QTools/git/detectron2/detectron2/engine/defaults.py", line 489, in train super().train(self.start_iter, self.max_iter)
File "/home/zhangjinhe/QTools/git/detectron2/detectron2/engine/train_loop.py", line 149, in train self.run_step() File "/home/zhangjinhe/QTools/git/detectron2/detectron2/engine/defaults.py", line 499, in run_step self._trainer.run_step() File "/home/zhangjinhe/QTools/git/detectron2/detectron2/engine/train_loop.py", line 289, in run_step self._write_metrics(loss_dict, data_time) File "/home/zhangjinhe/QTools/git/detectron2/detectron2/engine/train_loop.py", line 332, in _write_metrics
f"Loss became infinite or NaN at iteration={self.iter}!\n"
FloatingPointError: Loss became infinite or NaN at iteration=390!
The command I use is: python tools/train_net.py --config-file configs/COCO-Detection/retinanet_R_18_FPN_1x-Full-SyncBN-lsq-2bit.yaml --num-gpus 4 MODEL.WEIGHTS output/coco-detection/retinanet_R_18_FPN_1x-Full_BN/model_final.pth
I changed the input_size from (640, 672, 704, 736, 768, 800) to (800,), and the checkpoint file is the result of another experiment using the config retinanet_R_18_FPN_1x-Full-BN.yaml.
Any ideas why?
Hi @RaidenE1,
There may be many reasons for the loss becoming NaN; I also run into this problem frequently. My general debugging steps are:
- train on a single GPU to see whether any error shows up
- disable quantization to verify whether the issue is caused by quantization or by something else
- set all weight decay to zero, since weight decay sometimes drags the quant scale towards zero (see the optimizer sketch after this list)
- enable wt_stable or fm_stable to allow a better initialization of the quant scale (clip_val)
- use a smaller learning rate and try another optimizer (SGD or Adam)
- check which tensor becomes NaN (see the hook sketch below)
Hope these tips help~
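
To make the weight-decay tip concrete, here is a minimal sketch of building optimizer parameter groups that exclude the quantization scales from decay. It assumes the quant scale parameters can be identified by a "clip_val" substring in their names, which is only a naming assumption; the helper name `build_optimizer_without_scale_decay` is hypothetical, so adapt the filter and defaults to your actual model.

```python
import torch

def build_optimizer_without_scale_decay(model, lr=0.01, weight_decay=1e-4):
    """Sketch: SGD with weight decay disabled for quant scale (clip_val) params."""
    decay, no_decay = [], []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        # Assumed naming convention: quant scales contain "clip_val" in their names.
        if "clip_val" in name:
            no_decay.append(param)
        else:
            decay.append(param)
    return torch.optim.SGD(
        [
            {"params": decay, "weight_decay": weight_decay},
            {"params": no_decay, "weight_decay": 0.0},  # keep quant scales from being pulled to zero
        ],
        lr=lr,
        momentum=0.9,
    )
```

The same parameter groups also make it easy to try a smaller learning rate or to swap SGD for Adam, as suggested above.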
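For locating which tensor becomes NaN, one simple option is to register forward hooks on every module and report the first one whose output is non-finite. This is a generic PyTorch sketch, not something specific to this repo (the helper name `register_nan_hooks` is made up); `torch.autograd.set_detect_anomaly(True)` is another option, though it slows training noticeably.

```python
import torch

def register_nan_hooks(model):
    """Attach forward hooks that print the first module producing NaN/Inf outputs."""
    handles = []

    def make_hook(name):
        def hook(module, inputs, output):
            # Detectron2 modules may return tensors, tuples/lists, or dicts.
            if isinstance(output, dict):
                outs = output.values()
            elif isinstance(output, (tuple, list)):
                outs = output
            else:
                outs = (output,)
            for t in outs:
                if torch.is_tensor(t) and not torch.isfinite(t).all():
                    print(f"NaN/Inf detected in output of module: {name}")
        return hook

    for name, module in model.named_modules():
        handles.append(module.register_forward_hook(make_hook(name)))
    return handles  # call handle.remove() on each when finished debugging
```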