ScaledYOLOv4

Training fails with multiple GPUs in DP mode

Open: Lin-Qingyang-Alec opened this issue 5 years ago • 4 comments

Here are the error details.

    Traceback (most recent call last):
      File "/home/xxx/hard_disk/xxx/ScaledYOLOv4/train.py", line 438, in <module>
        train(hyp, opt, device, tb_writer)
      File "/home/xxx/hard_disk/xxx/ScaledYOLOv4/train.py", line 255, in train
        loss, loss_items = compute_loss(pred, targets.to(device), model)  # scaled by batch_size
      File "/home/xxx/hard_disk/xxx/ScaledYOLOv4/utils/general.py", line 446, in compute_loss
        tcls, tbox, indices, anchors = build_targets(p, targets, model)  # targets
      File "/home/xxx/hard_disk/xxx/ScaledYOLOv4/utils/general.py", line 526, in build_targets
        r = t[None, :, 4:6] / anchors[:, None]  # wh ratio
    RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!

Process finished with exit code 1

Lin-Qingyang-Alec, Nov 22 '20 13:11

Have you solved it?

yuyijie1995, Nov 26 '20 12:11

Have you solved it?

Sorry, I haven't. I ran it in DDP mode instead [laughing-crying emoji], and it runs well.

Lin-Qingyang-Alec, Nov 26 '20 12:11

Have you solved it?

This error arises because the tensors in the computation are on different devices. At line 531 of general.py, t (the targets) is on the GPU while anchors is on the CPU, so dividing t by anchors raises the error. The fix is to move anchors to the GPU before the calculation: anchors = anchors.to(device='cuda'). Please bear with me, as I am unfamiliar with GitHub.
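A minimal sketch of the same idea, written so it does not hard-code 'cuda': anchors is moved to whatever device the targets are on. The tensor values, the column layout of t (image, class, x, y, w, h), and anchor_t are illustrative assumptions standing in for the real build_targets inputs and model.hyp['anchor_t']; this is not the repository's code.

```python
import torch

# Hypothetical stand-ins for the tensors inside build_targets.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Targets, shape (nt, 6); assumed columns: image, class, x, y, w, h.
t = torch.tensor([[0., 1., 0.5, 0.5, 4.0, 6.0],
                  [0., 2., 0.3, 0.7, 16.0, 2.0]], device=device)

# Anchors, shape (na, 2); created on the CPU, as in the failing run.
anchors = torch.tensor([[2.0, 3.0], [8.0, 8.0], [16.0, 4.0]])

anchor_t = 4.0  # assumed threshold, stands in for model.hyp['anchor_t']

# Move anchors to the targets' device instead of hard-coding 'cuda'.
anchors = anchors.to(t.device)

r = t[None, :, 4:6] / anchors[:, None]         # wh ratio, shape (na, nt, 2)
j = torch.max(r, 1. / r).max(2)[0] < anchor_t  # True where a target fits an anchor
```

Using t.device keeps the snippet working on a CPU-only machine and on any GPU index, which a literal 'cuda' string does not guarantee.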

JoonHoonKim, Apr 01 '21 09:04

I added "anchors = anchors.to(device='cuda')" at line 141 of loss.py and that worked! (06.09.2021) My code in loss.py (lines 135-149) now looks like this:

for i, jj in enumerate(model.module.yolo_layers if multi_gpu else model.yolo_layers):
    # get number of grid points and anchor vec for this yolo layer
    anchors = model.module.module_list[jj].anchor_vec if multi_gpu else model.module_list[jj].anchor_vec
    gain[2:] = torch.tensor(p[i].shape)[[3, 2, 3, 2]]  # xyxy gain

    # Match targets to anchors
    anchors = anchors.to(device='cuda')
    a, t, offsets = [], targets * gain, 0
    if nt:
        na = anchors.shape[0]  # number of anchors
        at = torch.arange(na).view(na, 1).repeat(1, nt)  # anchor tensor, same as .repeat_interleave(nt)
        r = t[None, :, 4:6] / anchors[:, None]  # wh ratio
        j = torch.max(r, 1. / r).max(2)[0] < model.hyp['anchor_t']  # compare
        # j = wh_iou(anchors, t[:, 4:6]) > model.hyp['iou_t']  # iou(3,n) = wh_iou(anchors(3,2), gwh(n,2))
        a, t = at[j], t.repeat(na, 1, 1)[j]  # filter

ShADAMoV, Sep 06 '21 13:09