VC-R-CNN

nan loss while training

Open • Park-ing-lot opened this issue 2 years ago • 1 comment

Following your instructions, I ran this command with the COCO 2017 train and val data:

CUDA_VISIBLE_DEVICES=2 python tools/train_net.py --config-file "configs/e2e_mask_rcnn_R_101_FPN_1x.yaml" --skip-test SOLVER.IMS_PER_BATCH 2 SOLVER.BASE_LR 0.0025 SOLVER.MAX_ITER 720000 SOLVER.STEPS "(480000, 640000)" TEST.IMS_PER_BATCH 1 MODEL.RPN.FPN_POST_NMS_TOP_N_TRAIN 2000
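For reference, the solver overrides in this command look consistent with the linear scaling rule: when the batch size shrinks by a factor k, the learning rate is divided by k and the schedule is stretched by k. Here is a minimal sketch of that arithmetic. It assumes the reference schedule is 16 images per batch, base LR 0.02, 90000 iterations, and steps at (60000, 80000); those reference values and the function name `scale_schedule` are assumptions for illustration, not something taken from this repository.

```python
def scale_schedule(ims_per_batch, ref_batch=16, ref_lr=0.02,
                   ref_max_iter=90000, ref_steps=(60000, 80000)):
    """Scale an assumed reference schedule to a smaller batch size.

    All ref_* defaults are assumptions; adjust them to the config in use.
    """
    if ims_per_batch <= 0 or ref_batch % ims_per_batch != 0:
        raise ValueError("ims_per_batch must be a positive divisor of ref_batch")
    k = ref_batch // ims_per_batch  # factor by which the batch size shrinks
    return {
        "SOLVER.IMS_PER_BATCH": ims_per_batch,
        "SOLVER.BASE_LR": ref_lr / k,                        # smaller batch, smaller LR
        "SOLVER.MAX_ITER": ref_max_iter * k,                 # same number of epochs
        "SOLVER.STEPS": tuple(s * k for s in ref_steps),     # LR decay points scale too
    }


if __name__ == "__main__":
    for key, value in scale_schedule(2).items():
        print(key, value)
```

With `ims_per_batch=2` this gives k = 8, which matches the LR, iteration count, and step values passed on the command line, so the schedule itself does not look mis-scaled.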

During training, the loss stays around 8 and does not drop. After 6000 steps, the model produces a NaN loss.

Do you have any idea why the loss becomes NaN? What is the problem?
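One way to narrow this down might be to check each loss term separately, so that the first term to turn non-finite is reported together with the iteration. Here is a minimal sketch in plain Python. The helper `check_losses` and the loss names in the usage example are hypothetical, not part of the repository, and the values are assumed to be convertible with `float()` (as single-element tensors are).

```python
import math


def check_losses(loss_dict, iteration):
    """Return the summed loss, or raise if any individual term is NaN/inf.

    `loss_dict` is assumed to map loss names to scalar-like values.
    """
    values = {name: float(value) for name, value in loss_dict.items()}
    bad = [name for name, value in values.items() if not math.isfinite(value)]
    if bad:
        raise FloatingPointError(
            f"non-finite loss at iteration {iteration}: {', '.join(bad)}"
        )
    return sum(values.values())


if __name__ == "__main__":
    # Hypothetical loss names, for illustration only.
    print(check_losses({"loss_a": 1.0, "loss_b": 2.5}, iteration=100))
    try:
        check_losses({"loss_a": 1.0, "loss_b": float("nan")}, iteration=6000)
    except FloatingPointError as err:
        print(err)
```

Calling such a check once per iteration, before the backward pass, would show which term diverges first instead of only the NaN total.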

Park-ing-lot, Aug 12 '21 15:08

My partner and I had the same problem. We tried training the R101 network after uncommenting the RPN, and it is working (we are currently past iteration 31K). We acknowledge that this differs from the training method in the CVPR VCRCNN paper. We suspect the backbone is not trained well once the RPN is removed, but we may be wrong.

@Wangt-CN, could you comment on this?

ArghyaPal, Oct 11 '21 17:10