Multi-GPU training hangs at => start training...

Open csn223355 opened this issue 2 years ago • 11 comments

The training setup is as follows:

(torch171) lpj@252-2titanx:~/csn_work/YOLOP$ python -m torch.distributed.launch --nproc_per_node=2 tools/train.py


Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.


begin to bulid up model... => creating runs/BddDataset/_2022-11-29-15-36 Namespace(conf_thres=0.001, dataDir='', iou_thres=0.6, local_rank=0, logDir='runs/', modelDir='', prevModelDir='', sync_bn=False) AUTO_RESUME: False CUDNN: BENCHMARK: True DETERMINISTIC: False ENABLED: True DATASET: COLOR_RGB: False DATAROOT: /media/new_data4/csn_work/Datasets/BDD/BDD100K/images DATASET: BddDataset DATA_FORMAT: jpg FLIP: True HSV_H: 0.015 HSV_S: 0.7 HSV_V: 0.4 LABELROOT: /media/new_data4/csn_work/Datasets/BDD/BDD100K/det_annotations LANEROOT: /media/new_data4/csn_work/Datasets/BDD/BDD100K/ll_seg_annotations MASKROOT: /media/new_data4/csn_work/Datasets/BDD/BDD100K/da_seg_annotations ORG_IMG_SIZE: [720, 1280] ROT_FACTOR: 10 SCALE_FACTOR: 0.25 SELECT_DATA: False SHEAR: 0.0 TEST_SET: val TRAIN_SET: train TRANSLATE: 0.1 DEBUG: False GPUS: (0, 1) LOG_DIR: runs/ LOSS: BOX_GAIN: 0.05 CLS_GAIN: 0.5 CLS_POS_WEIGHT: 1.0 DA_SEG_GAIN: 0.2 FL_GAMMA: 0.0 LL_IOU_GAIN: 0.2 LL_SEG_GAIN: 0.2 LOSS_NAME: MULTI_HEAD_LAMBDA: None OBJ_GAIN: 1.0 OBJ_POS_WEIGHT: 1.0 SEG_POS_WEIGHT: 1.0 MODEL: EXTRA:

HEADS_NAME: [''] IMAGE_SIZE: [640, 640] NAME: PRETRAINED: PRETRAINED_DET: STRU_WITHSHARE: False NEED_AUTOANCHOR: False PIN_MEMORY: False PRINT_FREQ: 20 TEST: BATCH_SIZE_PER_GPU: 16 MODEL_FILE: NMS_CONF_THRESHOLD: 0.001 NMS_IOU_THRESHOLD: 0.6 PLOTS: True SAVE_JSON: False SAVE_TXT: False TRAIN: ANCHOR_THRESHOLD: 4.0 BATCH_SIZE_PER_GPU: 16 BEGIN_EPOCH: 0 DET_ONLY: False DRIVABLE_ONLY: False ENC_DET_ONLY: False ENC_SEG_ONLY: True END_EPOCH: 200 GAMMA1: 0.99 GAMMA2: 0.0 IOU_THRESHOLD: 0.2 LANE_ONLY: False LR0: 0.001 LRF: 0.2 MOMENTUM: 0.937 NESTEROV: True OPTIMIZER: adam PLOT: True SEG_ONLY: True SHUFFLE: True VAL_FREQ: 1 WARMUP_BIASE_LR: 0.1 WARMUP_EPOCHS: 3.0 WARMUP_MOMENTUM: 0.8 WD: 0.0005 WORKERS: 8 num_seg_class: 2 begin to bulid up model... Using torch 1.7.1 CUDA:0 (NVIDIA GeForce RTX 3080 Ti, 12053MB) CUDA:1 (NVIDIA GeForce RTX 3080 Ti, 12053MB)

load model to device load model to device freeze encoder and Det head... freezing model.0.conv.conv.weight freezing model.0.conv.bn.weight freezing model.0.conv.bn.bias freezing model.1.conv.weight freezing model.1.bn.weight freezing model.1.bn.bias freezing model.2.cv1.conv.weight freezing model.2.cv1.bn.weight freezing model.2.cv1.bn.bias freezing model.2.cv2.weight freezing model.2.cv3.weight freezing model.2.cv4.conv.weight freezing model.2.cv4.bn.weight freezing model.2.cv4.bn.bias freezing model.2.bn.weight freezing model.2.bn.bias freezing model.2.m.0.cv1.conv.weight freezing model.2.m.0.cv1.bn.weight freezing model.2.m.0.cv1.bn.bias freezing model.2.m.0.cv2.conv.weight freezing model.2.m.0.cv2.bn.weight freezing model.2.m.0.cv2.bn.bias freezing model.3.conv.weight freezing model.3.bn.weight freezing model.3.bn.bias freezing model.4.cv1.conv.weight freezing model.4.cv1.bn.weight freezing model.4.cv1.bn.bias freezing model.4.cv2.weight freezing model.4.cv3.weight freezing model.4.cv4.conv.weight freezing model.4.cv4.bn.weight freezing model.4.cv4.bn.bias freezing model.4.bn.weight freezing model.4.bn.bias freezing model.4.m.0.cv1.conv.weight freezing model.4.m.0.cv1.bn.weight freezing model.4.m.0.cv1.bn.bias freezing model.4.m.0.cv2.conv.weight freezing model.4.m.0.cv2.bn.weight freezing model.4.m.0.cv2.bn.bias freezing model.4.m.1.cv1.conv.weight freezing model.4.m.1.cv1.bn.weight freezing model.4.m.1.cv1.bn.bias freezing model.4.m.1.cv2.conv.weight freezing model.4.m.1.cv2.bn.weight freezing model.4.m.1.cv2.bn.bias freezing model.4.m.2.cv1.conv.weight freezing model.4.m.2.cv1.bn.weight freezing model.4.m.2.cv1.bn.bias freezing model.4.m.2.cv2.conv.weight freezing model.4.m.2.cv2.bn.weight freezing model.4.m.2.cv2.bn.bias freezing model.5.conv.weight freezing model.5.bn.weight freezing model.5.bn.bias freezing model.6.cv1.conv.weight freezing model.6.cv1.bn.weight freezing model.6.cv1.bn.bias freezing model.6.cv2.weight freezing model.6.cv3.weight freezing 
model.6.cv4.conv.weight freezing model.6.cv4.bn.weight freezing model.6.cv4.bn.bias freezing model.6.bn.weight freezing model.6.bn.bias freezing model.6.m.0.cv1.conv.weight freezing model.6.m.0.cv1.bn.weight freezing model.6.m.0.cv1.bn.bias freezing model.6.m.0.cv2.conv.weight freezing model.6.m.0.cv2.bn.weight freezing model.6.m.0.cv2.bn.bias freezing model.6.m.1.cv1.conv.weight freezing model.6.m.1.cv1.bn.weight freezing model.6.m.1.cv1.bn.bias freezing model.6.m.1.cv2.conv.weight freezing model.6.m.1.cv2.bn.weight freezing model.6.m.1.cv2.bn.bias freezing model.6.m.2.cv1.conv.weight freezing model.6.m.2.cv1.bn.weight freezing model.6.m.2.cv1.bn.bias freezing model.6.m.2.cv2.conv.weight freezing model.6.m.2.cv2.bn.weight freezing model.6.m.2.cv2.bn.bias freezing model.7.conv.weight freezing model.7.bn.weight freezing model.7.bn.bias freezing model.8.cv1.conv.weight freezing model.8.cv1.bn.weight freezing model.8.cv1.bn.bias freezing model.8.cv2.conv.weight freezing model.8.cv2.bn.weight freezing model.8.cv2.bn.bias freezing model.9.cv1.conv.weight freezing model.9.cv1.bn.weight freezing model.9.cv1.bn.bias freezing model.9.cv2.weight freezing model.9.cv3.weight freezing model.9.cv4.conv.weight freezing model.9.cv4.bn.weight freezing model.9.cv4.bn.bias freezing model.9.bn.weight freezing model.9.bn.bias freezing model.9.m.0.cv1.conv.weight freezing model.9.m.0.cv1.bn.weight freezing model.9.m.0.cv1.bn.bias freezing model.9.m.0.cv2.conv.weight freezing model.9.m.0.cv2.bn.weight freezing model.9.m.0.cv2.bn.bias freezing model.10.conv.weight freezing model.10.bn.weight freezing model.10.bn.bias freezing model.13.cv1.conv.weight freezing model.13.cv1.bn.weight freezing model.13.cv1.bn.bias freezing model.13.cv2.weight freezing model.13.cv3.weight freezing model.13.cv4.conv.weight freezing model.13.cv4.bn.weight freezing model.13.cv4.bn.bias freezing model.13.bn.weight freezing model.13.bn.bias freezing model.13.m.0.cv1.conv.weight freezing model.13.m.0.cv1.bn.weight 
freezing model.13.m.0.cv1.bn.bias freezing model.13.m.0.cv2.conv.weight freezing model.13.m.0.cv2.bn.weight freezing model.13.m.0.cv2.bn.bias freezing model.14.conv.weight freezing model.14.bn.weight freezing model.14.bn.bias freezing model.17.cv1.conv.weight freezing model.17.cv1.bn.weight freezing model.17.cv1.bn.bias freezing model.17.cv2.weight freezing model.17.cv3.weight freezing model.17.cv4.conv.weight freezing model.17.cv4.bn.weight freezing model.17.cv4.bn.bias freezing model.17.bn.weight freezing model.17.bn.bias freezing model.17.m.0.cv1.conv.weight freezing model.17.m.0.cv1.bn.weight freezing model.17.m.0.cv1.bn.bias freezing model.17.m.0.cv2.conv.weight freezing model.17.m.0.cv2.bn.weight freezing model.17.m.0.cv2.bn.bias freezing model.18.conv.weight freezing model.18.bn.weight freezing model.18.bn.bias freezing model.20.cv1.conv.weight freezing model.20.cv1.bn.weight freezing model.20.cv1.bn.bias freezing model.20.cv2.weight freezing model.20.cv3.weight freezing model.20.cv4.conv.weight freezing model.20.cv4.bn.weight freezing model.20.cv4.bn.bias freezing model.20.bn.weight freezing model.20.bn.bias freezing model.20.m.0.cv1.conv.weight freezing model.20.m.0.cv1.bn.weight freezing model.20.m.0.cv1.bn.bias freezing model.20.m.0.cv2.conv.weight freezing model.20.m.0.cv2.bn.weight freezing model.20.m.0.cv2.bn.bias freezing model.21.conv.weight freezing model.21.bn.weight freezing model.21.bn.bias freezing model.23.cv1.conv.weight freezing model.23.cv1.bn.weight freezing model.23.cv1.bn.bias freezing model.23.cv2.weight freezing model.23.cv3.weight freezing model.23.cv4.conv.weight freezing model.23.cv4.bn.weight freezing model.23.cv4.bn.bias freezing model.23.bn.weight freezing model.23.bn.bias freezing model.23.m.0.cv1.conv.weight freezing model.23.m.0.cv1.bn.weight freezing model.23.m.0.cv1.bn.bias freezing model.23.m.0.cv2.conv.weight freezing model.23.m.0.cv2.bn.weight freezing model.23.m.0.cv2.bn.bias freezing model.24.m.0.weight freezing 
model.24.m.0.bias freezing model.24.m.1.weight freezing model.24.m.1.bias freezing model.24.m.2.weight freezing model.24.m.2.bias freeze Det head... freezing model.17.cv1.conv.weight freezing model.17.cv1.bn.weight freezing model.17.cv1.bn.bias freezing model.17.cv2.weight freezing model.17.cv3.weight freezing model.17.cv4.conv.weight freezing model.17.cv4.bn.weight freezing model.17.cv4.bn.bias freezing model.17.bn.weight freezing model.17.bn.bias freezing model.17.m.0.cv1.conv.weight freezing model.17.m.0.cv1.bn.weight freezing model.17.m.0.cv1.bn.bias freezing model.17.m.0.cv2.conv.weight freezing model.17.m.0.cv2.bn.weight freezing model.17.m.0.cv2.bn.bias freezing model.18.conv.weight freezing model.18.bn.weight freezing model.18.bn.bias freezing model.20.cv1.conv.weight freezing model.20.cv1.bn.weight freezing model.20.cv1.bn.bias freezing model.20.cv2.weight freezing model.20.cv3.weight freezing model.20.cv4.conv.weight freezing model.20.cv4.bn.weight freezing model.20.cv4.bn.bias freezing model.20.bn.weight freezing model.20.bn.bias freezing model.20.m.0.cv1.conv.weight freezing model.20.m.0.cv1.bn.weight freezing model.20.m.0.cv1.bn.bias freezing model.20.m.0.cv2.conv.weight freezing model.20.m.0.cv2.bn.weight freezing model.20.m.0.cv2.bn.bias freezing model.21.conv.weight freezing model.21.bn.weight freezing model.21.bn.bias freezing model.23.cv1.conv.weight freezing model.23.cv1.bn.weight freezing model.23.cv1.bn.bias freezing model.23.cv2.weight freezing model.23.cv3.weight freezing model.23.cv4.conv.weight freezing model.23.cv4.bn.weight freezing model.23.cv4.bn.bias freezing model.23.bn.weight freezing model.23.bn.bias freezing model.23.m.0.cv1.conv.weight freezing model.23.m.0.cv1.bn.weight freezing model.23.m.0.cv1.bn.bias freezing model.23.m.0.cv2.conv.weight freezing model.23.m.0.cv2.bn.weight freezing model.23.m.0.cv2.bn.bias freezing model.24.m.0.weight freezing model.24.m.0.bias freezing model.24.m.1.weight freezing model.24.m.1.bias freezing 
model.24.m.2.weight freezing model.24.m.2.bias begin to load data building database... 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 70000/70000 [00:24<00:00, 2912.50it/s] database build finish building database... 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10000/10000 [00:03<00:00, 2907.35it/s] database build finish load data finished anchors loaded successfully tensor([[[0.3750, 1.1250], [0.6250, 1.3750], [0.5000, 2.5000]],

    [[0.4375, 1.1250],
     [0.3750, 2.4375],
     [0.7500, 1.9375]],

    [[0.5938, 1.5625],
     [1.1875, 2.5312],
     [2.1250, 4.9062]]], device='cuda:0')

=> start training... Start traning

Training hangs at this point. What is causing this, and is there a good way to fix it?

csn223355 avatar Nov 29 '22 07:11 csn223355

I ran into this too. Did you manage to solve it?

QinghuanWei avatar Mar 16 '23 00:03 QinghuanWei

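Yes, I solved it. This was an internship project from last November, so I don't remember the details clearly. The original code did not need to be modified. A few tentative suggestions you could try:

1. Check the parameter settings in default.py against the parameter settings in the training code.
2. Check whether the code that runs after => start training... contains an infinite loop.
3. Check the background processes to see whether an earlier process is still holding GPU memory.
4. Reboot the server.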
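For the third suggestion, one way to see which processes still hold GPU memory is to query nvidia-smi from Python. This is only a minimal sketch: it assumes nvidia-smi is on the PATH, and the helper names parse_compute_apps and list_gpu_processes are made up for illustration and are not part of YOLOP.

```python
import subprocess


def parse_compute_apps(csv_text):
    """Parse 'pid, process_name, used_memory' rows into a list of dicts.

    Expects the output of nvidia-smi with --format=csv,noheader,nounits.
    The memory field is kept as a string because nvidia-smi may print
    a placeholder such as [N/A] instead of a number.
    """
    rows = []
    for line in csv_text.strip().splitlines():
        try:
            pid, rest = line.split(",", 1)
            name, used = rest.rsplit(",", 1)
            rows.append(
                {"pid": int(pid), "name": name.strip(), "used_mib": used.strip()}
            )
        except ValueError:
            continue  # skip blank or unexpected lines
    return rows


def list_gpu_processes():
    """Return the compute processes that nvidia-smi currently reports."""
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-compute-apps=pid,process_name,used_memory",
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_compute_apps(result.stdout)


if __name__ == "__main__":
    for proc in list_gpu_processes():
        print(proc)
```

If a python process from an earlier run shows up here, stopping it before relaunching is probably the first thing to try.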

csn223355 avatar Mar 16 '23 02:03 csn223355

Same problem here. How did you solve it?

happyday-lkj avatar May 18 '23 05:05 happyday-lkj

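I found that multi-GPU training hangs whenever part of the network is frozen, and runs normally when nothing is frozen. From reading the source, my guess is that the model is placed on the GPUs first and part of the network is frozen afterwards, but the freeze is not synchronized across all GPUs, so training hangs. I don't know much about multi-GPU training and I'm not sure this is the real cause, but you could look into it.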

QinghuanWei avatar May 18 '23 06:05 QinghuanWei

Thanks for the reply. What I'm doing right now is retraining with the seg branch frozen so that only the det branch is trained.

happyday-lkj avatar May 18 '23 06:05 happyday-lkj

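In the end I just trained on a single GPU. I think the model needs to be frozen first and only then wrapped as a DDP model, i.e. model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.gpu]). You could give that a try.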
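The ordering described above could look roughly like this. It is only a sketch and not YOLOP's actual code: freeze_by_prefix and freeze_then_wrap are hypothetical helpers, the prefixes are placeholders, and it assumes the process group is already initialized with one GPU per process. The PyTorch documentation does say that DistributedDataParallel registers its gradient reduction hooks on the parameters at construction time and that parameters should not be changed after wrapping, which fits the guess. Whether this is really what causes the hang here has not been confirmed.

```python
import torch
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def freeze_by_prefix(model: nn.Module, prefixes):
    """Set requires_grad=False on parameters whose name starts with a prefix.

    Returns the names of the parameters that were frozen.
    """
    prefixes = tuple(prefixes)
    frozen = []
    for name, param in model.named_parameters():
        if name.startswith(prefixes):
            param.requires_grad = False
            frozen.append(name)
    return frozen


def freeze_then_wrap(model: nn.Module, local_rank: int, prefixes):
    """Freeze first, then move to the GPU and wrap in DDP.

    Assumes torch.distributed.init_process_group() has already been called
    and that each process drives the single GPU given by local_rank.
    """
    # Freeze BEFORE wrapping, so DDP only sees the parameters that will
    # actually receive gradients.
    freeze_by_prefix(model, prefixes)
    model = model.to(torch.device("cuda", local_rank))
    return DDP(model, device_ids=[local_rank], output_device=local_rank)
```

Ending each prefix with a dot (for example "model.1." rather than "model.1") keeps it from also matching "model.10." and similar names.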

QinghuanWei avatar May 18 '23 06:05 QinghuanWei

I'd like to refactor it along the lines of the YOLOv5 code.

happyday-lkj avatar May 18 '23 08:05 happyday-lkj

If you don't need the seg branch you can simply delete it. Editing the model configuration file is enough.

QinghuanWei avatar May 18 '23 08:05 QinghuanWei

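I do need it. If I didn't need the seg branch I would just use yolov5 or yolov8. Right now I need a single all-in-one framework like this one, which saves me some work.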

happyday-lkj avatar May 18 '23 08:05 happyday-lkj

How long does one epoch on bdd100k take for you? On a single GPU it takes me about 25 minutes, which is too long.

happyday-lkj avatar Jun 01 '23 11:06 happyday-lkj
