
dcn_v2 error RuntimeError: expected scalar type Float but found Half

Open KingWangJL opened this issue 3 years ago • 8 comments

When running the network, I encountered this error. Through debugging, I found that the offset produced in DCN's forward function is float16, so I think this may be the cause of the problem. Do you have a better idea for fixing it?
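The error arises because mixed-precision training turns the offset tensor into float16 (Half), while the dcn_v2 CUDA kernels are compiled for float32 (Float) only. One workaround is to cast the kernel's float16 inputs back to float32 just before the call. The sketch below illustrates that pattern with NumPy arrays standing in for torch tensors (the decorator name and the kernel stub are hypothetical, not part of dcn_v2), so the idea is runnable without a GPU:

```python
import numpy as np
from functools import wraps

def cast_half_to_float(fn):
    """Cast any float16 array arguments to float32 before calling fn."""
    @wraps(fn)
    def wrapper(*args):
        args = tuple(
            a.astype(np.float32)
            if isinstance(a, np.ndarray) and a.dtype == np.float16
            else a
            for a in args
        )
        return fn(*args)
    return wrapper

@cast_half_to_float
def dcn_kernel_stub(input, offset):
    # A real dcn_v2 kernel raises "expected scalar type Float but found
    # Half" when either argument is float16; this stub just checks dtypes.
    assert input.dtype == np.float32 and offset.dtype == np.float32
    return input + offset  # placeholder for the deformable convolution

offset_half = np.ones((2, 2), dtype=np.float16)  # what amp would produce
out = dcn_kernel_stub(np.zeros((2, 2), dtype=np.float32), offset_half)
print(out.dtype)  # float32
```

The cost is that the deformable convolution itself runs in full precision, but the rest of the network still benefits from float16.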

KingWangJL avatar Feb 24 '22 12:02 KingWangJL

    class DCN(DCNv2):
        def __init__(self, in_channels, out_channels, kernel_size, stride, padding,
                     dilation=1, deformable_groups=1, extra_offset_mask=False):
            super(DCN, self).__init__(in_channels, out_channels, kernel_size,
                                      stride, padding, dilation, deformable_groups)
            self.extra_offset_mask = extra_offset_mask
            channels_ = self.deformable_groups * 3 * self.kernel_size[0] * self.kernel_size[1]
            self.conv_offset_mask = nn.Conv2d(self.in_channels, channels_,
                                              kernel_size=self.kernel_size,
                                              stride=self.stride,
                                              padding=self.padding, bias=True)
            self.init_offset()

        def init_offset(self):
            self.conv_offset_mask.weight.data.zero_()
            self.conv_offset_mask.bias.data.zero_()

        def forward(self, input, main_path=None):
            if self.extra_offset_mask:
                out = self.conv_offset_mask(input[1])
                input = input[0]
            else:
                out = self.conv_offset_mask(input)
            o1, o2, mask = torch.chunk(out, 3, dim=1)  # each has deformable_groups * kernel_size[0] * kernel_size[1] channels
            offset = torch.cat((o1, o2), dim=1)  # x, y offsets; float16 under mixed precision

KingWangJL avatar Feb 24 '22 12:02 KingWangJL

@KingWangJL Many thanks for your interest in our work. We also ran into this problem when training our models with Apex Mixed Precision, but we have not found a good solution yet. For now, we just train the model with full precision.

ShihuaHuang95 avatar Feb 25 '22 06:02 ShihuaHuang95

Thanks for your reply. I directly modified the dcn_v2 source code, and for now the model trains normally, but I don't think that is a good solution; it merely runs.


KingWangJL avatar Feb 26 '22 02:02 KingWangJL

I have successfully trained the model using apex.amp and got comparable results. You can add @amp.float_function on top of the forward and backward functions of the modules in DCNv2. Maybe you can refer to https://github.com/CharlesShang/DCNv2/pull/50

LeoniusChen avatar May 13 '22 08:05 LeoniusChen

@LeoniusChen Cool! Thanks for your sharing!

ShihuaHuang95 avatar May 13 '22 14:05 ShihuaHuang95

@LeoniusChen By the way, could you please share the final results when apex is used?

ShihuaHuang95 avatar May 13 '22 14:05 ShihuaHuang95

I only applied apex.amp to the Cityscapes semantic segmentation (PointRend + FaPN R50) task. Here is the result: [screenshot attached]

LeoniusChen avatar May 14 '22 04:05 LeoniusChen

Noted. Thanks again for your interest in our work. By the way, compared to the results in our paper, these numbers are not as good.

ShihuaHuang95 avatar May 14 '22 08:05 ShihuaHuang95