FasterNet

Multi GPU training

Open VV1314 opened this issue 2 years ago • 7 comments

PConv seems to work only on a single GPU. When I run it on two GPUs, it fails. Can this be resolved?

VV1314 avatar Mar 30 '23 03:03 VV1314

Hi, could you be more specific about your question? PConv should work with multiple GPUs, and there have been some works that successfully reproduce our results, e.g., the repo.

JierunChen avatar Mar 30 '23 14:03 JierunChen

"RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0! (when checking argument for argument weight in method wrapper__cudnn_convolution)" this is the error

VV1314 avatar Mar 31 '23 12:03 VV1314

"RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0! (when checking argument for argument weight in method wrapper__cudnn_convolution)" this is the error

This is a general issue, not specific to PConv. Please make sure your model and data are on the same device. You may refer to the following links for more details:
https://stackoverflow.com/questions/70884007/runtimeerror-expected-all-tensors-to-be-on-the-same-device-but-found-at-least
https://discuss.pytorch.org/t/nn-conv2d-causing-error-in-multi-gpu-learning/142804/2
https://zhuanlan.zhihu.com/p/560322701
https://discuss.pytorch.org/t/error-with-dataparallel/139384

JierunChen avatar Apr 04 '23 08:04 JierunChen
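
For reference, here is a minimal sketch of the usual nn.DataParallel setup the reply above describes; the model and tensor shapes are placeholders, not from this repo. The wrapped model and each input batch live on the primary device, and DataParallel scatters the batch across the listed GPUs.

import torch
import torch.nn as nn

# Placeholder model; any nn.Module is handled the same way.
model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU())

device = torch.device('cuda:0')
model = nn.DataParallel(model, device_ids=[0, 1]).to(device)

x = torch.randn(8, 3, 224, 224, device=device)  # input lives on the primary device
y = model(x)                                     # DataParallel scatters it to cuda:0 and cuda:1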

Same issue here. It is likely caused by the internal replication that DataParallel performs. To solve it, you can simply rename forward_split_cat to forward and comment out some of the code in __init__.

HTensor avatar May 30 '23 10:05 HTensor
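
For what it's worth, a plausible explanation of why the assignment breaks: nn.DataParallel builds its per-GPU replicas by (roughly) shallow-copying each module's __dict__, so a forward stored there as a bound method keeps pointing at the original module on cuda:0, and the replicas on other GPUs end up calling the cuda:0 convolution weights on cuda:1 inputs. A plain-Python sketch of that mechanism (the class M is purely illustrative):

class M:
    def __init__(self):
        self.forward = self.forward_impl   # bound method stored as an instance attribute

    def forward_impl(self):
        return self

m = M()
replica = M.__new__(M)
replica.__dict__ = m.__dict__.copy()       # roughly what DataParallel's replicate step does
print(replica.forward() is m)              # True: the "replica" still dispatches to the original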

When using nn.DataParallel for single-machine multi-GPU training, the error "Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0! (when checking argument for argument weight in method wrapper__cudnn_convolution)" is raised. The fix is as follows: rewrite every place where self.forward is assigned, i.e. patterns like self.forward = self.forward_slicing, self.forward = self.forward_split_cat, self.forward = self.forward_layer_scale, and so on, like this:

import torch
import torch.nn as nn
from torch import Tensor


class Partial_conv3(nn.Module):

    def __init__(self, dim, n_div, forward):
        super().__init__()
        self.dim_conv3 = dim // n_div
        self.dim_untouched = dim - self.dim_conv3
        self.partial_conv3 = nn.Conv2d(self.dim_conv3, self.dim_conv3, 3, 1, 1, bias=False)
        self.forward_type = forward
        # The original dispatch below assigned a bound method to self.forward,
        # which breaks nn.DataParallel; it is replaced by the regular forward() below.
        # if forward == 'slicing':
        #     self.forward = self.forward_slicing
        # elif forward == 'split_cat':
        #     self.forward = self.forward_split_cat
        # else:
        #     raise NotImplementedError

    def forward(self, x: Tensor) -> Tensor:
        if self.forward_type == 'slicing':
            # only for inference
            x = x.clone()   # !!! Keep the original input intact for the residual connection later
            x[:, :self.dim_conv3, :, :] = self.partial_conv3(x[:, :self.dim_conv3, :, :])
        elif self.forward_type == 'split_cat':
            # for training/inference
            x1, x2 = torch.split(x, [self.dim_conv3, self.dim_untouched], dim=1)
            x1 = self.partial_conv3(x1)
            x = torch.cat((x1, x2), 1)
        else:
            # preserve the original error for unknown modes
            raise NotImplementedError(self.forward_type)

        return x

yirui-fafa avatar Jul 03 '23 08:07 yirui-fafa
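
A quick usage sketch of the rewritten class above under nn.DataParallel; this assumes at least two GPUs and uses placeholder dimensions, not values from the repo.

import torch
import torch.nn as nn

# Assumes the rewritten Partial_conv3 above is in scope and two GPUs are visible.
pconv = Partial_conv3(dim=64, n_div=4, forward='split_cat')
model = nn.DataParallel(pconv, device_ids=[0, 1]).cuda()

x = torch.randn(8, 64, 56, 56).cuda()
out = model(x)           # no cross-device error: forward is now a regular method
print(out.shape)         # torch.Size([8, 64, 56, 56])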


Thank you so much!

xibici avatar Aug 11 '23 05:08 xibici

Thanks, I have received your message~

yirui-fafa avatar Aug 11 '23 05:08 yirui-fafa