FCOS

Loss becomes NaN after 20 or 40 iterations

Open hheavenknowss opened this issue 3 years ago • 4 comments

Hi there, thanks for your work! I have some issues with training. Here is my log:

nohup: ignoring input
2020-12-28 12:35:18,951 fcos_core INFO: Using 2 GPUs
2020-12-28 12:35:18,951 fcos_core INFO: Namespace(config_file='configs/fcos/fcos_imprv_dcnv2_X_101_64x4d_FPN_2x.yaml', distributed=True, local_rank=0, opts=['DATALOADER.NUM_WORKERS', '2', 'OUTPUT_DIR', 'training_dir/fcos_Decathlon'], skip_test=False)
2020-12-28 12:35:18,951 fcos_core INFO: Collecting env info (might take some time)
2020-12-28 12:35:21,205 fcos_core INFO:
PyTorch version: 1.1.0
Is debug build: No
CUDA used to build PyTorch: 9.0.176

OS: Ubuntu 16.04.3 LTS
GCC version: (Ubuntu 5.4.0-6ubuntu1~16.04.11) 5.4.0 20160609
CMake version: version 3.5.1

Python version: 3.6
Is CUDA available: Yes
CUDA runtime version: 9.0.176
GPU models and configuration:
GPU 0: GeForce GTX 1080 Ti
GPU 1: GeForce GTX 1080 Ti

Nvidia driver version: 384.111
cuDNN version: /usr/lib/x86_64-linux-gnu/libcudnn.so.7.0.5

Versions of relevant libraries:
[pip3] msgpack-numpy==0.4.3.2
[pip3] numpy==1.15.0
[pip3] numpydoc==0.7.0
[pip3] pytorch-pretrained-bert==0.6.2
[pip3] torch==1.1.0
[pip3] torchfile==0.1.0
[pip3] torchtext==0.3.1
[pip3] torchvision==0.3.0
[conda] Could not collect
Pillow (5.0.0)
2020-12-28 12:35:21,206 fcos_core INFO: Loaded configuration file configs/fcos/fcos_imprv_dcnv2_X_101_64x4d_FPN_2x.yaml
2020-12-28 12:35:21,206 fcos_core INFO:
MODEL:
  META_ARCHITECTURE: "GeneralizedRCNN"
  WEIGHT: "https://cloudstor.aarnet.edu.au/plus/s/k3ys35075jmU1RP/download#X-101-64x4d.pkl"
  RPN_ONLY: True
  FCOS_ON: True
  BACKBONE:
    CONV_BODY: "R-101-FPN-RETINANET"
  RESNETS:
    STRIDE_IN_1X1: False
    BACKBONE_OUT_CHANNELS: 256
    NUM_GROUPS: 64
    WIDTH_PER_GROUP: 4
    STAGE_WITH_DCN: (False, False, True, True)
    WITH_MODULATED_DCN: True
    DEFORMABLE_GROUPS: 1
  RETINANET:
    USE_C5: False # FCOS uses P5 instead of C5
  FCOS:
    # normalizing the regression targets with FPN strides
    NORM_REG_TARGETS: True
    # positioning centerness on the regress branch.
    # Please refer to https://github.com/tianzhi0549/FCOS/issues/89#issuecomment-516877042
    CENTERNESS_ON_REG: True
    # using center sampling and GIoU.
    # Please refer to https://github.com/yqyao/FCOS_PLUS
    CENTER_SAMPLING_RADIUS: 1.5
    IOU_LOSS_TYPE: "giou"
    # we only use dcn in the last layer of towers
    USE_DCN_IN_TOWER: True
DATASETS:
  TRAIN: ("coco_Decathlon_train",)
  TEST: ("coco_Decathlon_val",)
INPUT:
  MIN_SIZE_RANGE_TRAIN: (640, 800)
  MAX_SIZE_TRAIN: 1333
  MIN_SIZE_TEST: 800
  MAX_SIZE_TEST: 1333
DATALOADER:
  SIZE_DIVISIBILITY: 32
SOLVER:
  BASE_LR: 0.01
  WEIGHT_DECAY: 0.0001
  STEPS: (120000, 160000)
  MAX_ITER: 180000
  IMS_PER_BATCH: 10
  WARMUP_METHOD: "constant"
TEST:
  BBOX_AUG:
    ENABLED: False
    H_FLIP: True
    SCALES: (400, 500, 600, 700, 900, 1000, 1100, 1200)
    MAX_SIZE: 2000
    SCALE_H_FLIP: True

2020-12-28 12:35:21,207 fcos_core INFO: Running with config:
DATALOADER:
  ASPECT_RATIO_GROUPING: True
  NUM_WORKERS: 2
  SIZE_DIVISIBILITY: 32
DATASETS:
  TEST: ('coco_Decathlon_val',)
  TRAIN: ('coco_Decathlon_train',)
INPUT:
  MAX_SIZE_TEST: 1333
  MAX_SIZE_TRAIN: 1333
  MIN_SIZE_RANGE_TRAIN: (640, 800)
  MIN_SIZE_TEST: 800
  MIN_SIZE_TRAIN: (800,)
  PIXEL_MEAN: [102.9801, 115.9465, 122.7717]
  PIXEL_STD: [1.0, 1.0, 1.0]
  TO_BGR255: True
MODEL:
  BACKBONE:
    CONV_BODY: R-101-FPN-RETINANET
    FREEZE_CONV_BODY_AT: 2
    USE_GN: False
  CLS_AGNOSTIC_BBOX_REG: False
  DEVICE: cuda
  FBNET:
    ARCH: default
    ARCH_DEF:
    BN_TYPE: bn
    DET_HEAD_BLOCKS: []
    DET_HEAD_LAST_SCALE: 1.0
    DET_HEAD_STRIDE: 0
    DW_CONV_SKIP_BN: True
    DW_CONV_SKIP_RELU: True
    KPTS_HEAD_BLOCKS: []
    KPTS_HEAD_LAST_SCALE: 0.0
    KPTS_HEAD_STRIDE: 0
    MASK_HEAD_BLOCKS: []
    MASK_HEAD_LAST_SCALE: 0.0
    MASK_HEAD_STRIDE: 0
    RPN_BN_TYPE:
    RPN_HEAD_BLOCKS: 0
    SCALE_FACTOR: 1.0
    WIDTH_DIVISOR: 1
  FCOS:
    CENTERNESS_ON_REG: True
    CENTER_SAMPLING_RADIUS: 1.5
    FPN_STRIDES: [8, 16, 32, 64, 128]
    INFERENCE_TH: 0.05
    IOU_LOSS_TYPE: giou
    LOSS_ALPHA: 0.25
    LOSS_GAMMA: 2.0
    NMS_TH: 0.6
    NORM_REG_TARGETS: True
    NUM_CLASSES: 2
    NUM_CONVS: 4
    PRE_NMS_TOP_N: 1000
    PRIOR_PROB: 0.01
    USE_DCN_IN_TOWER: True
  FCOS_ON: True
  FPN:
    USE_GN: False
    USE_RELU: False
  GROUP_NORM:
    DIM_PER_GP: -1
    EPSILON: 1e-05
    NUM_GROUPS: 32
  KEYPOINT_ON: False
  MASK_ON: False
  META_ARCHITECTURE: GeneralizedRCNN
  RESNETS:
    BACKBONE_OUT_CHANNELS: 256
    DEFORMABLE_GROUPS: 1
    NUM_GROUPS: 64
    RES2_OUT_CHANNELS: 256
    RES5_DILATION: 1
    STAGE_WITH_DCN: (False, False, True, True)
    STEM_FUNC: StemWithFixedBatchNorm
    STEM_OUT_CHANNELS: 64
    STRIDE_IN_1X1: False
    TRANS_FUNC: BottleneckWithFixedBatchNorm
    WIDTH_PER_GROUP: 4
    WITH_MODULATED_DCN: True
  RETINANET:
    ANCHOR_SIZES: (32, 64, 128, 256, 512)
    ANCHOR_STRIDES: (8, 16, 32, 64, 128)
    ASPECT_RATIOS: (0.5, 1.0, 2.0)
    BBOX_REG_BETA: 0.11
    BBOX_REG_WEIGHT: 4.0
    BG_IOU_THRESHOLD: 0.4
    FG_IOU_THRESHOLD: 0.5
    INFERENCE_TH: 0.05
    LOSS_ALPHA: 0.25
    LOSS_GAMMA: 2.0
    NMS_TH: 0.4
    NUM_CLASSES: 81
    NUM_CONVS: 4
    OCTAVE: 2.0
    PRE_NMS_TOP_N: 1000
    PRIOR_PROB: 0.01
    SCALES_PER_OCTAVE: 3
    STRADDLE_THRESH: 0
    USE_C5: False
  RETINANET_ON: False
  ROI_BOX_HEAD:
    CONV_HEAD_DIM: 256
    DILATION: 1
    FEATURE_EXTRACTOR: ResNet50Conv5ROIFeatureExtractor
    MLP_HEAD_DIM: 1024
    NUM_CLASSES: 81
    NUM_STACKED_CONVS: 4
    POOLER_RESOLUTION: 14
    POOLER_SAMPLING_RATIO: 0
    POOLER_SCALES: (0.0625,)
    PREDICTOR: FastRCNNPredictor
    USE_GN: False
  ROI_HEADS:
    BATCH_SIZE_PER_IMAGE: 512
    BBOX_REG_WEIGHTS: (10.0, 10.0, 5.0, 5.0)
    BG_IOU_THRESHOLD: 0.5
    DETECTIONS_PER_IMG: 100
    FG_IOU_THRESHOLD: 0.5
    NMS: 0.5
    POSITIVE_FRACTION: 0.25
    SCORE_THRESH: 0.05
    USE_FPN: False
  ROI_KEYPOINT_HEAD:
    CONV_LAYERS: (512, 512, 512, 512, 512, 512, 512, 512)
    FEATURE_EXTRACTOR: KeypointRCNNFeatureExtractor
    MLP_HEAD_DIM: 1024
    NUM_CLASSES: 17
    POOLER_RESOLUTION: 14
    POOLER_SAMPLING_RATIO: 0
    POOLER_SCALES: (0.0625,)
    PREDICTOR: KeypointRCNNPredictor
    RESOLUTION: 14
    SHARE_BOX_FEATURE_EXTRACTOR: True
  ROI_MASK_HEAD:
    CONV_LAYERS: (256, 256, 256, 256)
    DILATION: 1
    FEATURE_EXTRACTOR: ResNet50Conv5ROIFeatureExtractor
    MLP_HEAD_DIM: 1024
    POOLER_RESOLUTION: 14
    POOLER_SAMPLING_RATIO: 0
    POOLER_SCALES: (0.0625,)
    POSTPROCESS_MASKS: False
    POSTPROCESS_MASKS_THRESHOLD: 0.5
    PREDICTOR: MaskRCNNC4Predictor
    RESOLUTION: 14
    SHARE_BOX_FEATURE_EXTRACTOR: True
    USE_GN: False
  RPN:
    ANCHOR_SIZES: (32, 64, 128, 256, 512)
    ANCHOR_STRIDE: (16,)
    ASPECT_RATIOS: (0.5, 1.0, 2.0)
    BATCH_SIZE_PER_IMAGE: 256
    BG_IOU_THRESHOLD: 0.3
    FG_IOU_THRESHOLD: 0.7
    FPN_POST_NMS_TOP_N_TEST: 2000
    FPN_POST_NMS_TOP_N_TRAIN: 2000
    MIN_SIZE: 0
    NMS_THRESH: 0.7
    POSITIVE_FRACTION: 0.5
    POST_NMS_TOP_N_TEST: 1000
    POST_NMS_TOP_N_TRAIN: 2000
    PRE_NMS_TOP_N_TEST: 6000
    PRE_NMS_TOP_N_TRAIN: 12000
    RPN_HEAD: SingleConvRPNHead
    STRADDLE_THRESH: 0
    USE_FPN: False
  RPN_ONLY: True
  USE_SYNCBN: False
  WEIGHT: https://cloudstor.aarnet.edu.au/plus/s/k3ys35075jmU1RP/download#X-101-64x4d.pkl
OUTPUT_DIR: training_dir/fcos_Decathlon
PATHS_CATALOG: /data/pxf/FCOS-master/fcos_core/config/paths_catalog.py
SOLVER:
  BASE_LR: 0.01
  BIAS_LR_FACTOR: 2
  CHECKPOINT_PERIOD: 2500
  DCONV_OFFSETS_LR_FACTOR: 1.0
  GAMMA: 0.1
  IMS_PER_BATCH: 10
  MAX_ITER: 180000
  MOMENTUM: 0.9
  STEPS: (120000, 160000)
  WARMUP_FACTOR: 0.3333333333333333
  WARMUP_ITERS: 500
  WARMUP_METHOD: constant
  WEIGHT_DECAY: 0.0001
  WEIGHT_DECAY_BIAS: 0
TEST:
  BBOX_AUG:
    ENABLED: False
    H_FLIP: True
    MAX_SIZE: 2000
    SCALES: (400, 500, 600, 700, 900, 1000, 1100, 1200)
    SCALE_H_FLIP: True
  DETECTIONS_PER_IMG: 100
  EXPECTED_RESULTS: []
  EXPECTED_RESULTS_SIGMA_TOL: 4
  IMS_PER_BATCH: 8


loading annotations into memory...
loading annotations into memory...
Done (t=0.06s)
creating index...
index created!
Done (t=0.06s)
creating index...
index created!
2020-12-28 12:35:23,601 fcos_core.trainer INFO: Start training
/root/anaconda3/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py:100: UserWarning: torch.distributed.reduce_op is deprecated, please use torch.distributed.ReduceOp instead
  warnings.warn("torch.distributed.reduce_op is deprecated, please use "
/root/anaconda3/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py:100: UserWarning: torch.distributed.reduce_op is deprecated, please use torch.distributed.ReduceOp instead
  warnings.warn("torch.distributed.reduce_op is deprecated, please use "
2020-12-28 12:36:34,561 fcos_core.trainer INFO: eta: 7 days, 9:22:30 iter: 20 loss: 13.8627 (14.0002) loss_centerness: 0.6938 (0.7076) loss_cls: 12.2059 (12.4410) loss_reg: 0.8402 (0.8516) time: 3.4431 (3.5479) data: 0.0109 (0.0280) lr: 0.003333 max mem: 9314
2020-12-28 12:37:42,998 fcos_core.trainer INFO: eta: 7 days, 6:12:21 iter: 40 loss: 23.3687 (nan) loss_centerness: 0.7030 (0.7552) loss_cls: 21.8341 (nan) loss_reg: 0.8110 (0.8365) time: 3.3760 (3.4849) data: 0.0098 (0.0190) lr: 0.003333 max mem: 9314
2020-12-28 12:38:54,639 fcos_core.trainer INFO: eta: 7 days, 7:48:16 iter: 60 loss: nan (nan) loss_centerness: nan (nan) loss_cls: nan (nan) loss_reg: nan (nan) time: 3.5713 (3.5173) data: 0.0095 (0.0161) lr: 0.003333 max mem: 9314
2020-12-28 12:40:04,952 fcos_core.trainer INFO: eta: 7 days, 7:45:54 iter: 80 loss: nan (nan) loss_centerness: nan (nan) loss_cls: nan (nan) loss_reg: nan (nan) time: 3.4935 (3.5169) data: 0.0100 (0.0147) lr: 0.003333 max mem: 9314
2020-12-28 12:41:15,379 fcos_core.trainer INFO: eta: 7 days, 7:47:24 iter: 100 loss: nan (nan) loss_centerness: nan (nan) loss_cls: nan (nan) loss_reg: nan (nan) time: 3.5659 (3.5178) data: 0.0098 (0.0137) lr: 0.003333 max mem: 9314
2020-12-28 12:42:26,265 fcos_core.trainer INFO: eta: 7 days, 7:59:31 iter: 120 loss: nan (nan) loss_centerness: nan (nan) loss_cls: nan (nan) loss_reg: nan (nan) time: 3.5816 (3.5222) data: 0.0093 (0.0131) lr: 0.003333 max mem: 9314
2020-12-28 12:43:37,949 fcos_core.trainer INFO: eta: 7 days, 8:24:53 iter: 140 loss: nan (nan) loss_centerness: nan (nan) loss_cls: nan (nan) loss_reg: nan (nan) time: 3.5643 (3.5310) data: 0.0094 (0.0127) lr: 0.003333 max mem: 9314
2020-12-28 12:44:49,125 fcos_core.trainer INFO: eta: 7 days, 8:34:07 iter: 160 loss: nan (nan) loss_centerness: nan (nan) loss_cls: nan (nan) loss_reg: nan (nan) time: 3.5420 (3.5345) data: 0.0104 (0.0125) lr: 0.003333 max mem: 9314
2020-12-28 12:45:59,373 fcos_core.trainer INFO: eta: 7 days, 8:25:33 iter: 180 loss: nan (nan) loss_centerness: nan (nan) loss_cls: nan (nan) loss_reg: nan (nan) time: 3.5570 (3.5321) data: 0.0100 (0.0123) lr: 0.003333 max mem: 9314

I've tried many times. Sometimes the loss is normal, but most of the time it behaves like this. Why does the loss become NaN after only a few iterations? Is there something wrong?
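In the log above, the running average of loss_cls is already nan at iter 40 while the averages of loss_centerness and loss_reg are still finite, which suggests the classification loss is the first term to diverge. A small guard like the following can pinpoint the offending term and stop a run early instead of logging nan for hours. This is only a sketch: `nonfinite_terms` and `check_losses` are hypothetical helper names, not part of fcos_core, and the dictionary of plain floats stands in for whatever per-term loss values the training loop exposes.

```python
import math


def nonfinite_terms(loss_dict):
    """Return the sorted names of loss terms whose value is NaN or +/-inf."""
    return sorted(name for name, value in loss_dict.items()
                  if not math.isfinite(value))


def check_losses(iteration, loss_dict):
    """Raise as soon as any loss term stops being finite (hypothetical helper)."""
    bad = nonfinite_terms(loss_dict)
    if bad:
        raise FloatingPointError(
            "non-finite loss at iter {}: {}".format(iteration, ", ".join(bad)))


# Values in the style of the log: only the classification loss has diverged.
example = {"loss_cls": float("nan"), "loss_reg": 0.8110, "loss_centerness": 0.7030}
print(nonfinite_terms(example))
```

Calling `check_losses` once per iteration, on the per-iteration values rather than the running averages, would identify the exact step at which the divergence starts.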

hheavenknowss, Dec 29 '20 09:12

@hheavenknowss Please try to clip the gradients.
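For readers unfamiliar with the suggestion, gradient clipping by global norm rescales all gradients by a common factor whenever their combined L2 norm exceeds a threshold, so a single bad batch cannot produce an arbitrarily large update. The sketch below shows the arithmetic on plain Python floats; `clip_by_global_norm` is a hypothetical name and the threshold of 1.0 is an arbitrary illustration, not a value recommended in this thread. In PyTorch the corresponding call is `torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)`, placed between `loss.backward()` and `optimizer.step()`.

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Scale grads so that their global L2 norm is at most max_norm.

    Returns (clipped_grads, original_norm). Gradients are plain floats here;
    a real training loop would operate on parameter tensors instead.
    """
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        # Already within the limit: leave the gradients untouched.
        return list(grads), total_norm
    scale = max_norm / total_norm
    return [g * scale for g in grads], total_norm


clipped, norm = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
print(norm, clipped)  # the norm is 5.0; the clipped vector has norm 1.0
```

Clipping bounds the size of each update but cannot repair gradients that are already NaN or inf, so it needs to be active before the loss diverges.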

tianzhi0549, Dec 30 '20 02:12

> @hheavenknowss Please try to clip the gradients.

I reduced the learning rate and it works now. Thanks for the reply.
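The exact value that worked is not given in this thread, but the log offers a starting point. It reports lr: 0.003333 at the iterations where the loss diverges, which matches BASE_LR 0.01 multiplied by WARMUP_FACTOR 1/3 under WARMUP_METHOD: constant, so the divergence happens while warmup is still active. The sketch below reproduces that number and applies the linear scaling rule to IMS_PER_BATCH: 10. The reference batch size of 16 is an assumption about the setup the stock config was tuned for, not something stated here, and both function names are hypothetical.

```python
def constant_warmup_lr(base_lr, iteration, warmup_iters=500, warmup_factor=1.0 / 3):
    """LR under constant warmup: base_lr * warmup_factor until warmup ends.

    The later LR decay steps (SOLVER.STEPS) are ignored in this sketch.
    """
    return base_lr * warmup_factor if iteration < warmup_iters else base_lr


def scaled_lr(base_lr, reference_batch, actual_batch):
    """Linear scaling rule: scale the learning rate in proportion to batch size."""
    return base_lr * actual_batch / reference_batch


REFERENCE_BATCH = 16  # assumption: batch size the stock config was tuned for

# LR in effect at iter 40 of the log above.
print(round(constant_warmup_lr(0.01, 40), 6))  # 0.003333

# Candidate lower BASE_LR for a batch of 10 images, formatted as override
# tokens in the same KEY VALUE style as the opts list shown in the log.
candidate = scaled_lr(0.01, REFERENCE_BATCH, 10)
opts = ["SOLVER.BASE_LR", "{:.6g}".format(candidate)]
print(opts)
```

If the scaled value still diverges, halving it again or combining it with gradient clipping are the usual next steps; any such value is a starting guess to be checked against the loss curve.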

hheavenknowss, Dec 30 '20 03:12

@hheavenknowss Can you tell me your revised learning rate? I have the same problem. I'm looking forward to your reply.

maojiaoli, Apr 07 '21 02:04

> @hheavenknowss Can you tell me your revised learning rate? I have the same problem. I'm looking forward to your reply.

Sorry, it has been a while and I have forgotten the exact value. I only remember that I changed my learning rate to a lower one. Hope that helps.

hheavenknowss, Apr 07 '21 02:04