TensorRT 8.5.2 conversion fails when converting group convolutions

Description

Converting a model that contains group convolutions with TensorRT 8.5.2 reports an error.

Environment

TensorRT Version: 8.5.2.2

Relevant Files

The error message is as follows:

[08/29/2023-21:40:17] [TRT] [E] Conv_221:kernel weights has count 294912 but 147456 was expected
[08/29/2023-21:40:17] [TRT] [E] Conv_221: count of 294912 weights in kernel, but kernel dimensions (3,3) with 128 input channels, 256 output channels and 2 groups were specified. Expected Weights count is 128 * 3*3 * 256 / 2 = 147456
[08/29/2023-21:40:17] [TRT] [E] [convolutionNode.cpp::computeOutputExtents::57] Error Code 4: Internal Error (Conv_221: number of kernel weights does not match tensor dimensions)
[08/29/2023-21:40:17] [TRT] [E] ModelImporter.cpp:726: While parsing node number 221 [Conv -> "input.487"]:
[08/29/2023-21:40:17] [TRT] [E] ModelImporter.cpp:727: --- Begin node ---
[08/29/2023-21:40:17] [TRT] [E] ModelImporter.cpp:728: input: "input.483"
input: "bbox_head.reg_convs.1.conv.weight"
input: "bbox_head.reg_convs.1.conv.bias"
output: "input.487"
name: "Conv_221"
op_type: "Conv"
attribute {
  name: "dilations"
  ints: 1
  ints: 1
  type: INTS
}
attribute {
  name: "group"
  i: 2
  type: INT
}
attribute {
  name: "kernel_shape"
  ints: 3
  ints: 3
  type: INTS
}
attribute {
  name: "pads"
  ints: 1
  ints: 1
  ints: 1
  ints: 1
  type: INTS
}
attribute {
  name: "strides"
  ints: 1
  ints: 1
  type: INTS
}

[08/29/2023-21:40:17] [TRT] [E] ModelImporter.cpp:729: --- End node ---
[08/29/2023-21:40:17] [TRT] [E] ModelImporter.cpp:731: ERROR: ModelImporter.cpp:172 In function parseGraph:
[6] Invalid Node - Conv_221
Conv_221:kernel weights has count 294912 but 147456 was expected
Conv_221: count of 294912 weights in kernel, but kernel dimensions (3,3) with 128 input channels, 256 output channels and 2 groups were specified. Expected Weights count is 128 * 3*3 * 256 / 2 = 147456
[convolutionNode.cpp::computeOutputExtents::57] Error Code 4: Internal Error (Conv_221: number of kernel weights does not match tensor dimensions)

The error message points to a mismatch between the weights and the convolution, but the same model and configuration file run normally in PyTorch. The message says "kernel dimensions (3,3) with 128 input channels, 256 output channels and 2 groups were specified". In reality, this convolution has 256 input channels; that is the value I used when building the model, so I am certain of it. I therefore suspect that the group convolution causes TensorRT to compute the channel count incorrectly, since the channel counts of consecutive layers in the model are actually consistent.
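For reference, the numbers in the log can be checked by hand. ONNX stores Conv weights with shape (out_channels, in_channels / group, kH, kW), so the weight count alone implies an input channel count. Below is a minimal sketch in plain Python; the helper names are made up for illustration and are not part of TensorRT or ONNX.

```python
def expected_weight_count(in_channels, out_channels, groups, kernel_hw):
    """Number of kernel weights (bias excluded) for a grouped 2-D convolution."""
    k_h, k_w = kernel_hw
    if in_channels % groups != 0:
        raise ValueError("groups must divide in_channels")
    return out_channels * (in_channels // groups) * k_h * k_w


def implied_in_channels(weight_count, out_channels, groups, kernel_hw):
    """Input channel count implied by a given kernel weight count."""
    k_h, k_w = kernel_hw
    per_group, remainder = divmod(weight_count, out_channels * k_h * k_w)
    if remainder != 0:
        raise ValueError("weight count does not fit these dimensions")
    return per_group * groups


# Values taken from the Conv_221 error message above.
print(expected_weight_count(128, 256, 2, (3, 3)))   # 147456, what the parser expected
print(expected_weight_count(256, 256, 2, (3, 3)))   # 294912, what the weights contain
print(implied_in_channels(294912, 256, 2, (3, 3)))  # 256
```

So the stored weights are consistent with 256 input channels; only the input shape seen by the parser (128) disagrees.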

In addition, I ran into another error while converting ShuffleNet:

[08/29/2023-17:23:23] [TRT] [E] [convolutionNode.cpp::computeOutputExtents::54] Error Code 4: Internal Error (Conv_136: group count must divide input channel count)
[08/29/2023-17:23:23] [TRT] [E] ModelImporter.cpp:726: While parsing node number 136 [Conv -> "input.131"]:
[08/29/2023-17:23:23] [TRT] [E] ModelImporter.cpp:727: --- Begin node ---
[08/29/2023-17:23:23] [TRT] [E] ModelImporter.cpp:728: input: "input.123"
input: "onnx::Conv_2562"
input: "onnx::Conv_2563"
output: "input.131"
name: "Conv_136"
op_type: "Conv"
attribute {
  name: "dilations"
  ints: 1
  ints: 1
  type: INTS
}
attribute {
  name: "group"
  i: 116
  type: INT
}
attribute {
  name: "kernel_shape"
  ints: 3
  ints: 3
  type: INTS
}
attribute {
  name: "pads"
  ints: 1
  ints: 1
  ints: 1
  ints: 1
  type: INTS
}
attribute {
  name: "strides"
  ints: 2
  ints: 2
  type: INTS
}

[08/29/2023-17:23:23] [TRT] [E] ModelImporter.cpp:729: --- End node ---
[08/29/2023-17:23:23] [TRT] [E] ModelImporter.cpp:731: ERROR: ModelImporter.cpp:172 In function parseGraph:
[6] Invalid Node - Conv_136
[convolutionNode.cpp::computeOutputExtents::54] Error Code 4: Internal Error (Conv_136: group count must divide input channel count)
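The check that fails here is a plain divisibility constraint between the group count and the input channel count. A small sketch in plain Python follows; `check_groups` is a hypothetical helper, and the 58-channel case is only an illustrative value, since the log does not say which channel count the parser actually saw.

```python
def check_groups(in_channels, groups):
    """Return True if the group count evenly divides the input channels."""
    return groups > 0 and in_channels % groups == 0


# Assuming Conv_136 is a depthwise conv over 116 channels (group = 116).
print(check_groups(116, 116))  # True
# Hypothetical: a wrongly inferred channel count would trip the check.
print(check_groups(58, 116))   # False
```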

It is worth mentioning that both of these models use the channel shuffle operation, implemented as follows:

import torch


def channel_shuffle(x, groups):
    """Channel Shuffle operation.

    This function enables cross-group information flow for multiple groups
    convolution layers.

    Args:
        x (Tensor): The input tensor.
        groups (int): The number of groups to divide the input tensor
            in the channel dimension.

    Returns:
        Tensor: The output tensor after channel shuffle operation.
    """

    batch_size, num_channels, height, width = x.size()
    assert (num_channels % groups == 0), ('num_channels should be '
                                          'divisible by groups')
    channels_per_group = num_channels // groups

    x = x.view(batch_size, groups, channels_per_group, height, width)
    x = torch.transpose(x, 1, 2).contiguous()
    x = x.view(batch_size, -1, height, width)

    return x

Will this operation also affect the correct transformation of the TensorRT model?
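On its own, the view/transpose/view sequence is only a permutation of the channel axis, so it should not change the channel count. Here is a small pure-Python sketch of the index mapping, with no torch needed; `shuffled_channel_order` is a made-up name for illustration.

```python
def shuffled_channel_order(num_channels, groups):
    """Return, for each output channel, the input channel it is taken from."""
    if num_channels % groups != 0:
        raise ValueError("num_channels should be divisible by groups")
    channels_per_group = num_channels // groups
    # Equivalent of view(groups, channels_per_group) on the channel axis.
    grid = [[g * channels_per_group + i for i in range(channels_per_group)]
            for g in range(groups)]
    # Equivalent of transposing the two axes and flattening again.
    return [grid[g][i] for i in range(channels_per_group) for g in range(groups)]


print(shuffled_channel_order(6, 2))         # [0, 3, 1, 4, 2, 5]
print(len(shuffled_channel_order(256, 2)))  # 256, channel count is unchanged
```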

Aug 29 '23 14:08 hh123445

I ran further experiments and found that the model converts successfully when it contains only one group conv. With the shuffle operation removed, consecutive group convs also convert successfully.

Now I can confirm that the conversion failure was caused by the shuffle operation, but this operation is crucial. How can I successfully convert the model while retaining the shuffle operation?

import torch


def channel_shuffle(x, groups):
    """Channel Shuffle operation.

    This function enables cross-group information flow for multiple groups
    convolution layers.

    Args:
        x (Tensor): The input tensor.
        groups (int): The number of groups to divide the input tensor
            in the channel dimension.

    Returns:
        Tensor: The output tensor after channel shuffle operation.
    """

    batch_size, num_channels, height, width = x.size()
    assert (num_channels % groups == 0), ('num_channels should be '
                                          'divisible by groups')
    channels_per_group = num_channels // groups

    x = x.view(batch_size, groups, channels_per_group, height, width)
    x = torch.transpose(x, 1, 2).contiguous()
    x = x.view(batch_size, -1, height, width)

    return x

During training in PyTorch, the shuffle operation behaves normally, with 256 input and output channels. Why does its output channel count become 128 after conversion to TensorRT, so that the next convolution sees 128 input channels? The PyTorch model runs normally, so the model definition itself should be fine.

Aug 30 '23 06:08 hh123445

[6] Invalid Node - Conv_221
Conv_221:kernel weights has count 294912 but 147456 was expected
Conv_221: count of 294912 weights in kernel, but kernel dimensions (3,3) with 128 input channels, 256 output channels and 2 groups were specified. Expected Weights count is 128 * 3*3 * 256 / 2 = 147456

Does the model work with onnxruntime? What's the input shape when building the engine?

Aug 31 '23 01:08 zerollzeng

The ONNX model can be exported successfully. The input feature map and the output both have 256 channels.

Aug 31 '23 01:08 hh123445

Could you please try the latest TRT release? If it still fails, please provide a repro for further investigation. Many thanks!

Sep 01 '23 14:09 zerollzeng

Closing since there has been no activity for more than 3 weeks. Please reopen if you still have questions, thanks!

Sep 26 '23 19:09 ttyio

Please reopen this issue!

Mar 04 '24 07:03 braindotai

I am having the same issue. TensorRT version: nvcr.io/nvidia/tensorrt:23.05-py3

Mar 04 '24 07:03 braindotai

@braindotai please try with the latest release and share a repro with @zerollzeng if it still fails, thanks!

Mar 09 '24 03:03 ttyio

@ttyio appreciate it. I'll update in a couple of days. Thanks

Mar 09 '24 17:03 braindotai

@ttyio The issue still occurs on different devices with both v10.0.1.6 and v8.6.1.

May 25 '24 02:05 laugh12321

Hi,

I fixed the problem by calling .contiguous() on the result of the final .view(...) call. I suspect the error occurs because the tensor data is not laid out suitably in memory.

import torch


def channel_shuffle(x, groups):
    """Channel Shuffle operation.

    This function enables cross-group information flow for multiple groups
    convolution layers.

    Args:
        x (Tensor): The input tensor.
        groups (int): The number of groups to divide the input tensor
            in the channel dimension.

    Returns:
        Tensor: The output tensor after channel shuffle operation.
    """

    batch_size, num_channels, height, width = x.size()
    assert (num_channels % groups == 0), ('num_channels should be '
                                          'divisible by groups')
    channels_per_group = num_channels // groups

    x = x.view(batch_size, groups, channels_per_group, height, width)
    x = torch.transpose(x, 1, 2).contiguous()
    x = x.view(batch_size, -1, height, width)

    return x.contiguous()

Jul 25 '24 12:07 serdaryildiz