FasterNet

Multi GPU training

Open VV1314 opened this issue 2 years ago • 7 comments

PConv seems to work only on a single GPU. When I run it on two GPUs, it fails. Can this be resolved?

VV1314 avatar Mar 30 '23 03:03 VV1314

Hi, could you be more specific about your question? PConv should work with multiple GPUs, and there have been some works that successfully reproduce our results, e.g., the repo.

JierunChen avatar Mar 30 '23 14:03 JierunChen

"RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0! (when checking argument for argument weight in method wrapper__cudnn_convolution)" this is the error

VV1314 avatar Mar 31 '23 12:03 VV1314

"RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0! (when checking argument for argument weight in method wrapper__cudnn_convolution)" this is the error

This is a general issue, not specific to PConv. Please make sure your model and data are on the same device. You may refer to the following links for more details:
https://stackoverflow.com/questions/70884007/runtimeerror-expected-all-tensors-to-be-on-the-same-device-but-found-at-least
https://discuss.pytorch.org/t/nn-conv2d-causing-error-in-multi-gpu-learning/142804/2
https://zhuanlan.zhihu.com/p/560322701
https://discuss.pytorch.org/t/error-with-dataparallel/139384

JierunChen avatar Apr 04 '23 08:04 JierunChen
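
For reference, here is a minimal sketch of the usual nn.DataParallel setup the reply above describes; the model and tensor shapes are placeholders, not from this repo. The wrapped model and each input batch live on the primary device, and DataParallel scatters the batch across the listed GPUs.

import torch
import torch.nn as nn

# Placeholder model; any nn.Module is handled the same way.
model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU())

device = torch.device('cuda:0')
model = nn.DataParallel(model, device_ids=[0, 1]).to(device)

x = torch.randn(8, 3, 224, 224, device=device)  # input lives on the primary device
y = model(x)                                     # DataParallel scatters it to cuda:0 and cuda:1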

Same issue here. It is likely caused by the internal replication that DataParallel performs. To solve it, you can simply rename forward_split_cat to forward and comment out some of the code in __init__.

HTensor avatar May 30 '23 10:05 HTensor
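
For what it's worth, a plausible explanation of why the assignment breaks: nn.DataParallel builds its per-GPU replicas by (roughly) shallow-copying each module's __dict__, so a forward stored there as a bound method keeps pointing at the original module on cuda:0, and the replicas on other GPUs end up calling the cuda:0 convolution weights on cuda:1 inputs. A plain-Python sketch of that mechanism (the class M is purely illustrative):

class M:
    def __init__(self):
        self.forward = self.forward_impl   # bound method stored as an instance attribute

    def forward_impl(self):
        return self

m = M()
replica = M.__new__(M)
replica.__dict__ = m.__dict__.copy()       # roughly what DataParallel's replicate step does
print(replica.forward() is m)              # True: the "replica" still dispatches to the original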

When using nn.DataParallel for single-machine multi-GPU training, the error "Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0! (when checking argument for argument weight in method wrapper__cudnn_convolution)" is raised. The fix is as follows: rewrite every place where self.forward is assigned, i.e. patterns like self.forward = self.forward_slicing, self.forward = self.forward_split_cat, self.forward = self.forward_layer_scale, and so on, like this:

import torch
import torch.nn as nn
from torch import Tensor


class Partial_conv3(nn.Module):

    def __init__(self, dim, n_div, forward):
        super().__init__()
        self.dim_conv3 = dim // n_div
        self.dim_untouched = dim - self.dim_conv3
        self.partial_conv3 = nn.Conv2d(self.dim_conv3, self.dim_conv3, 3, 1, 1, bias=False)
        self.forward_type = forward
        # The original dispatch below assigned a bound method to self.forward,
        # which breaks nn.DataParallel; it is replaced by the regular forward() below.
        # if forward == 'slicing':
        #     self.forward = self.forward_slicing
        # elif forward == 'split_cat':
        #     self.forward = self.forward_split_cat
        # else:
        #     raise NotImplementedError

    def forward(self, x: Tensor) -> Tensor:
        if self.forward_type == 'slicing':
            # only for inference
            x = x.clone()   # !!! Keep the original input intact for the residual connection later
            x[:, :self.dim_conv3, :, :] = self.partial_conv3(x[:, :self.dim_conv3, :, :])
        elif self.forward_type == 'split_cat':
            # for training/inference
            x1, x2 = torch.split(x, [self.dim_conv3, self.dim_untouched], dim=1)
            x1 = self.partial_conv3(x1)
            x = torch.cat((x1, x2), 1)
        else:
            # preserve the original error for unknown modes
            raise NotImplementedError(self.forward_type)

        return x

yirui-fafa avatar Jul 03 '23 08:07 yirui-fafa
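
A quick usage sketch of the rewritten class above under nn.DataParallel; this assumes at least two GPUs and uses placeholder dimensions, not values from the repo.

import torch
import torch.nn as nn

# Assumes the rewritten Partial_conv3 above is in scope and two GPUs are visible.
pconv = Partial_conv3(dim=64, n_div=4, forward='split_cat')
model = nn.DataParallel(pconv, device_ids=[0, 1]).cuda()

x = torch.randn(8, 64, 56, 56).cuda()
out = model(x)           # no cross-device error: forward is now a regular method
print(out.shape)         # torch.Size([8, 64, 56, 56])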


Thank you so much!

xibici avatar Aug 11 '23 05:08 xibici

Thanks, I have received your message~

yirui-fafa avatar Aug 11 '23 05:08 yirui-fafa