Loss becomes infinite while training quant models
Hi, when I try to train a quant model using the config detectron2/configs/COCO-Detection/retinanet_R_18_FPN_1x-Full-SyncBN-lsq-2bit.yaml, the loss becomes NaN at iteration 390:
-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/home/zhangjinhe/anaconda3/envs/torch/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
fn(i, *args)
File "/home/zhangjinhe/QTools/git/detectron2/detectron2/engine/launch.py", line 125, in _distributed_worker main_func(*args)
File "/home/zhangjinhe/QTools/git/detectron2/tools/train_net.py", line 154, in main
return trainer.train()
File "/home/zhangjinhe/QTools/git/detectron2/detectron2/engine/defaults.py", line 489, in train super().train(self.start_iter, self.max_iter)
File "/home/zhangjinhe/QTools/git/detectron2/detectron2/engine/train_loop.py", line 149, in train self.run_step() File "/home/zhangjinhe/QTools/git/detectron2/detectron2/engine/defaults.py", line 499, in run_step self._trainer.run_step() File "/home/zhangjinhe/QTools/git/detectron2/detectron2/engine/train_loop.py", line 289, in run_step self._write_metrics(loss_dict, data_time) File "/home/zhangjinhe/QTools/git/detectron2/detectron2/engine/train_loop.py", line 332, in _write_metrics
f"Loss became infinite or NaN at iteration={self.iter}!\n"
FloatingPointError: Loss became infinite or NaN at iteration=390!
The command I use is: python tools/train_net.py --config-file configs/COCO-Detection/retinanet_R_18_FPN_1x-Full-SyncBN-lsq-2bit.yaml --num-gpus 4 MODEL.WEIGHTS output/coco-detection/retinanet_R_18_FPN_1x-Full_BN/model_final.pth
I changed the input_size from (640, 672, 704, 736, 768, 800) to (800,), and the checkpoint file is the result of another experiment using the config retinanet_R_18_FPN_1x-Full-BN.yaml.
Any ideas why?
Hi @RaidenE1,
There may be many reasons for the loss becoming NaN; I also run into this problem frequently. My general debugging steps are:
- train on a single GPU to see whether any error shows up
- disable quantization to verify whether the issue is caused by quantization or by something else
- set all weight decay to zero, since weight decay sometimes drags the quant scale towards zero (see the optimizer sketch after this list)
- enable wt_stable or fm_stable to allow a better initialization of the quant scale (clip_val)
- use a smaller learning rate and try another optimizer (SGD or Adam)
- check which tensor becomes NaN (see the hook sketch below)
Hope these tips help~
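
To make the weight-decay tip concrete, here is a minimal sketch of building optimizer parameter groups that exclude the quantization scales from decay. It assumes the quant scale parameters can be identified by a "clip_val" substring in their names, which is only a naming assumption; the helper name `build_optimizer_without_scale_decay` is hypothetical, so adapt the filter and defaults to your actual model.

```python
import torch

def build_optimizer_without_scale_decay(model, lr=0.01, weight_decay=1e-4):
    """Sketch: SGD with weight decay disabled for quant scale (clip_val) params."""
    decay, no_decay = [], []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        # Assumed naming convention: quant scales contain "clip_val" in their names.
        if "clip_val" in name:
            no_decay.append(param)
        else:
            decay.append(param)
    return torch.optim.SGD(
        [
            {"params": decay, "weight_decay": weight_decay},
            {"params": no_decay, "weight_decay": 0.0},  # keep quant scales from being pulled to zero
        ],
        lr=lr,
        momentum=0.9,
    )
```

The same parameter groups also make it easy to try a smaller learning rate or to swap SGD for Adam, as suggested above.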
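For locating which tensor becomes NaN, one simple option is to register forward hooks on every module and report the first one whose output is non-finite. This is a generic PyTorch sketch, not something specific to this repo (the helper name `register_nan_hooks` is made up); `torch.autograd.set_detect_anomaly(True)` is another option, though it slows training noticeably.

```python
import torch

def register_nan_hooks(model):
    """Attach forward hooks that print the first module producing NaN/Inf outputs."""
    handles = []

    def make_hook(name):
        def hook(module, inputs, output):
            # Detectron2 modules may return tensors, tuples/lists, or dicts.
            if isinstance(output, dict):
                outs = output.values()
            elif isinstance(output, (tuple, list)):
                outs = output
            else:
                outs = (output,)
            for t in outs:
                if torch.is_tensor(t) and not torch.isfinite(t).all():
                    print(f"NaN/Inf detected in output of module: {name}")
        return hook

    for name, module in model.named_modules():
        handles.append(module.register_forward_hook(make_hook(name)))
    return handles  # call handle.remove() on each when finished debugging
```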