SegmenTron
Bug in tools/train.py
It worked with single-GPU training, but it failed no matter how many GPUs I assigned when I tried distributed training.
https://github.com/LikeLy-Journey/SegmenTron/blob/4bc605eedde7d680314f63d329277b73f83b1c5f/tools/train.py#L109
It should be self.model.cuda().
Distributed training works once I change this line.
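Roughly, my edit looks like the sketch below (nn and args are the names already used in train.py; the exact line may differ slightly from what ends up upstream):

# Put the model on this process's GPU before wrapping it in DDP; otherwise
# DistributedDataParallel is handed CPU parameters and the distributed run fails.
self.model = self.model.cuda()
self.model = nn.parallel.DistributedDataParallel(
    self.model,
    device_ids=[args.local_rank],
    output_device=args.local_rank,
    find_unused_parameters=True)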
Below is the error message I got with the original code:

(faceparsing) mjq@amax:~/SegmenTron$ CUDA_VISIBLE_DEVICES=0,7 ./tools/dist_train.sh ${CONFIG_FILE} configs/pascal_voc_deeplabv3_plus.yaml ${GPU_NUM} 2
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
2020-06-06 02:21:55,815 Segmentron INFO: Using 2 GPUs
2020-06-06 02:21:55,816 Segmentron INFO: Namespace(config_file='configs/pascal_voc_deeplabv3_plus.yaml', device='cuda', distributed=True, input_img='tools/demo_vis.png', local_rank=0, log_iter=10, no_cuda=False, num_gpus=2, opts=[], resume=None, skip_val=False, val_epoch=1)
2020-06-06 02:21:55,816 Segmentron INFO: {
"SEED": 1024,
"TIME_STAMP": "2020-06-06-02-21",
"ROOT_PATH": "/data1/mjq/SegmenTron",
"PHASE": "train",
"DATASET": {
"NAME": "pascal_voc",
"MEAN": [
0.5,
0.5,
0.5
],
"STD": [
0.5,
0.5,
0.5
],
"IGNORE_INDEX": -1,
"WORKERS": 4,
"MODE": "val"
},
"AUG": {
"MIRROR": true,
"BLUR_PROB": 0.0,
"BLUR_RADIUS": 0.0,
"COLOR_JITTER": null
},
"TRAIN": {
"EPOCHS": 50,
"BATCH_SIZE": 4,
"CROP_SIZE": 480,
"BASE_SIZE": 520,
"MODEL_SAVE_DIR": "runs/checkpoints/",
"LOG_SAVE_DIR": "runs/logs/",
"PRETRAINED_MODEL_PATH": "",
"BACKBONE_PRETRAINED": true,
"BACKBONE_PRETRAINED_PATH": "",
"RESUME_MODEL_PATH": "",
"SYNC_BATCH_NORM": true,
"SNAPSHOT_EPOCH": 10
},
"SOLVER": {
"LR": 0.0001,
"OPTIMIZER": "sgd",
"EPSILON": 1e-08,
"MOMENTUM": 0.9,
"WEIGHT_DECAY": 0.0001,
"DECODER_LR_FACTOR": 10.0,
"LR_SCHEDULER": "poly",
"POLY": {
"POWER": 0.9
},
"STEP": {
"GAMMA": 0.1,
"DECAY_EPOCH": [
10,
20
]
},
"WARMUP": {
"EPOCHS": 0.0,
"FACTOR": 0.3333333333333333,
"METHOD": "linear"
},
"OHEM": false,
"AUX": false,
"AUX_WEIGHT": 0.4,
"LOSS_NAME": ""
},
"TEST": {
"TEST_MODEL_PATH": "",
"BATCH_SIZE": 8,
"CROP_SIZE": null,
"SCALES": [
1.0
],
"FLIP": false
},
"VISUAL": {
"OUTPUT_DIR": "../runs/visual/"
},
"MODEL": {
"MODEL_NAME": "DeepLabV3_Plus",
"BACKBONE": "xception65",
"BACKBONE_SCALE": 1.0,
"MULTI_LOSS_WEIGHT": [
1.0
],
"DEFAULT_GROUP_NUMBER": 32,
"DEFAULT_EPSILON": 1e-05,
"BN_TYPE": "BN",
"BN_EPS_FOR_ENCODER": 0.001,
"BN_EPS_FOR_DECODER": null,
"OUTPUT_STRIDE": 16,
"BN_MOMENTUM": null,
"DEEPLABV3_PLUS": {
"USE_ASPP": true,
"ENABLE_DECODER": true,
"ASPP_WITH_SEP_CONV": true,
"DECODER_USE_SEP_CONV": true
},
"CCNET": {
"RECURRENCE": 2
}
}
}
Found 1464 images in the folder datasets/voc/VOC2012
Found 1464 images in the folder datasets/voc/VOC2012
Found 1449 images in the folder datasets/voc/VOC2012
Found 1449 images in the folder datasets/voc/VOC2012
2020-06-06 02:21:56,181 Segmentron INFO: load backbone pretrained model from url..
2020-06-06 02:21:56,480 Segmentron INFO: <All keys matched successfully>
Traceback (most recent call last):
File "./tools/train.py", line 223, in
Thanks for your attention! @LikeLy-Journey
@leonmakise Hello, I met the same error when I tried distributed training. I saw your change, but I cannot understand what it means.
You say it should be self.model.cuda(). The original code is:

self.model = nn.parallel.DistributedDataParallel(self.model, device_ids=[args.local_rank], output_device=args.local_rank, find_unused_parameters=True)

and your change is:

self.model.cuda() = nn.parallel.DistributedDataParallel(self.model, device_ids=[args.local_rank], output_device=args.local_rank, find_unused_parameters=True)
Thank you! Looking forward to your reply.
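For anyone else hitting this: the change presumably amounts to making sure the model already lives on the GPU before DistributedDataParallel wraps it. A minimal, self-contained sketch of that pattern (plain PyTorch, not SegmenTron code; launch with python -m torch.distributed.launch --nproc_per_node=2 example.py):

import argparse
import torch
import torch.distributed as dist
import torch.nn as nn

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)  # filled in by torch.distributed.launch
args = parser.parse_args()

torch.cuda.set_device(args.local_rank)
dist.init_process_group(backend="nccl", init_method="env://")

model = nn.Linear(10, 2)
model = model.cuda()  # parameters must be on this GPU before DDP wraps them
model = nn.parallel.DistributedDataParallel(
    model,
    device_ids=[args.local_rank],
    output_device=args.local_rank)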