YOLOv3v4-ModelCompression-MultidatasetTraining-Multibackbone

Multi-GPU training

Open chenxyyy opened this issue 4 years ago • 3 comments

Hello, I run into this problem every time I train with multiple GPUs:

Namespace(BN_Fold=False, FPGA=False, KDstr=-1, a_bit=8, adam=False, batch_size=16, bucket='', cache_images=False, cfg='./cfg/yolov4/yolov4.cfg', data='data/coco2017.data', device='0,1,2,4', ema=False, epochs=300, evolve=False, img_size=[320, 640], multi_scale=False, name='', nosave=False, notest=False, prune=0, pt=False, quantized=0, rect=False, resume=False, s=0.0001, single_cls=False, sr=True, t_cfg='', t_weights='', w_bit=8, weights='weights/yolo4_coco/qianyi_weight/best.pt')
Using CUDA Apex device0 _CudaDeviceProperties(name='Tesla T4', total_memory=15109MB)
                device1 _CudaDeviceProperties(name='Tesla T4', total_memory=15109MB)
                device2 _CudaDeviceProperties(name='Tesla T4', total_memory=15109MB)
                device3 _CudaDeviceProperties(name='Tesla T4', total_memory=15109MB)

Start Tensorboard with "tensorboard --logdir=runs", view at http://localhost:6006/
Model Summary: 327 layers, 6.43631e+07 parameters, 6.43631e+07 gradients
Optimizer groups: 110 .bias, 110 Conv2d.weight, 107 other
muti-gpus sparse
normal sparse training 
Image sizes 320 - 640 train, 640 test
Using 8 dataloader workers
Starting training for 300 epochs...

     Epoch   gpu_mem      GIoU       obj       cls     total   targets  img_size


  0%|          | 0/4381 [00:00<?, ?it/s]
  0%|          | 0/4381 [00:01<?, ?it/s]
Traceback (most recent call last):
  File "train.py", line 987, in <module>
    train(hyp)  # train normally
  File "train.py", line 330, in train
    pred, feature_s = model(imgs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/parallel/distributed.py", line 580, in forward
    output = self.gather(outputs, self.output_device)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/parallel/distributed.py", line 607, in gather
    return gather(outputs, output_device, dim=self.dim)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/parallel/scatter_gather.py", line 68, in gather
    res = gather_map(outputs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/parallel/scatter_gather.py", line 63, in gather_map
    return type(out)(map(gather_map, zip(*outputs)))
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/parallel/scatter_gather.py", line 63, in gather_map
    return type(out)(map(gather_map, zip(*outputs)))
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/parallel/scatter_gather.py", line 55, in gather_map
    return Gather.apply(target_device, dim, *outputs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/parallel/_functions.py", line 71, in forward
    return comm.gather(inputs, ctx.dim, ctx.target_device)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/parallel/comm.py", line 230, in gather
    return torch._C._gather(tensors, dim, destination)
RuntimeError: CUDA out of memory. Tried to allocate 400.00 MiB (GPU 0; 14.76 GiB total capacity; 13.07 GiB already allocated; 5.75 MiB free; 13.43 GiB reserved in total by PyTorch)

The training command I used is:

python train.py --data data/coco2017.data --batch-size 16 --cfg cfg/yolov4/yolov4.cfg --weights weights/yolo4_coco/qianyi_weight/best.pt --cfg cfg/yolov4/yolov4.cfg --device 0,1,2,4 -sr --s 0.0001 --prune 0

I am using 4 Tesla T4 GPUs, and all 4 cards were idle. Why am I running out of GPU memory?

chenxyyy avatar Dec 30 '20 06:12 chenxyyy

You could try reducing the batch size or the image size. YOLOv4 is simply memory-hungry.
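
As a rough illustration of why both knobs help, here is a minimal sketch. It assumes activation memory grows about linearly with batch size and with the number of input pixels, which is only an approximation for a convolutional detector. `relative_activation_cost` is a hypothetical helper, not part of this repository, and it ignores weights, optimizer state, and CUDA overhead.

```python
def relative_activation_cost(batch_size, img_size, ref_batch=16, ref_img=640):
    """Approximate activation memory relative to a reference setting.

    Assumption: cost scales with batch_size * img_size**2. Weights,
    optimizer state, and allocator overhead are not modeled.
    """
    return (batch_size * img_size ** 2) / (ref_batch * ref_img ** 2)


if __name__ == "__main__":
    # Reference is the setting from the log: batch 16 at 640 px.
    for batch, img in [(16, 640), (8, 640), (16, 416), (16, 320)]:
        ratio = relative_activation_cost(batch, img)
        print(f"batch={batch:2d} img={img}: {ratio:.2f}x of the reference")
```

Under this assumption, halving the image side cuts activation memory about four times, while halving the batch size only halves it.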

SpursLipu avatar Dec 30 '20 13:12 SpursLipu

When I train on a single T4, a batch size of 10 works fine. With 4 GPUs, a batch size of 16 fails, so I don't think the batch size is the problem.
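
The numbers in this comparison can be checked with a small sketch. It assumes the batch is split evenly along the batch dimension across the 4 devices, and it reads the figures straight from the error message in the traceback. `per_device_batch` and `parse_oom` are hypothetical helpers, not part of PyTorch or this repository.

```python
import re

# Matches the CUDA OOM message format shown in the traceback above.
OOM_PATTERN = re.compile(
    r"Tried to allocate (?P<request>[\d.]+) MiB \(GPU (?P<gpu>\d+); "
    r"(?P<total>[\d.]+) GiB total capacity; "
    r"(?P<allocated>[\d.]+) GiB already allocated; "
    r"(?P<free>[\d.]+) MiB free"
)


def per_device_batch(total_batch, n_devices):
    """Largest per-device chunk, assuming an even split along the batch dim."""
    return -(-total_batch // n_devices)  # ceiling division


def parse_oom(message):
    """Extract the device index and memory figures from an OOM message."""
    match = OOM_PATTERN.search(message)
    if match is None:
        return None
    return {
        "gpu": int(match.group("gpu")),
        "request_mib": float(match.group("request")),
        "total_gib": float(match.group("total")),
        "allocated_gib": float(match.group("allocated")),
        "free_mib": float(match.group("free")),
    }


if __name__ == "__main__":
    msg = (
        "RuntimeError: CUDA out of memory. Tried to allocate 400.00 MiB "
        "(GPU 0; 14.76 GiB total capacity; 13.07 GiB already allocated; "
        "5.75 MiB free; 13.43 GiB reserved in total by PyTorch)"
    )
    print(per_device_batch(16, 4))
    print(parse_oom(msg))
```

Under that assumption each device processes 4 images, fewer than the 10 that fit on a single T4. The error is reported for GPU 0 and is raised inside `gather`, the step that collects the outputs of all replicas onto the output device, so GPU 0 may be carrying more than its share of the work.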

chenxyyy avatar Dec 30 '20 14:12 chenxyyy

@chenxyyy did you solve this issue? I am pruning a model. I tried different pruning thresholds and even reduced my batch size to 1, but the same error keeps coming up: RuntimeError: CUDA out of memory. Tried to allocate 170.00 MiB (GPU 0; 7.79 GiB total capacity; 5.54 GiB already allocated; 43.25 MiB free; 6.11 GiB reserved in total by PyTorch). @SpursLipu, can you suggest something? Also, I tried pruning thresholds between 0.5 and 0.01, but every time the pruned model's mAP is 0.0. What could be the problem?

sharoseali avatar Mar 31 '21 05:03 sharoseali