
EmptyLayerError() or UnBalancedGroupError() when pruning a depthwise separable convolution

Open · HiddenMarkovModel opened this issue 2 years ago • 11 comments

Describe the issue: Errors when pruning depthwise separable convolution

Depending on the value of 'total_sparsity', the speedup may fail with one of the following errors:

  1. EmptyLayerError
       raise EmptyLayerError()
     nni.compression.pytorch.speedup.error_code.EmptyLayerError: Pruning a Layer to empty is not legal

  2. UnBalancedGroupError
       raise UnBalancedGroupError()
     nni.compression.pytorch.speedup.error_code.UnBalancedGroupError: The number remained filters in each group is different

#4648 and #4796 seem to address this error, but they don't work for my code.

How can I solve it? Any help would be greatly appreciated, thanks!

Environment:

  • NNI version: 2.8
  • Training service (local|remote|pai|aml|etc):
  • Client OS: ubuntu 18.04
  • Server OS (for remote mode only):
  • Python version: 3.8
  • PyTorch/TensorFlow version: pytorch=1.10.1
  • Is conda/virtualenv/venv used?: conda
  • Is running in Docker?: no

How to reproduce it?:

import torch
from nni.compression.pytorch.pruning import L2NormPruner
from nni.compression.pytorch.speedup import ModelSpeedup
from torch import nn

model = nn.Sequential(
    nn.Conv2d(128, 128, (3, 3), (1, 1), 1, groups=128, bias=False),
    nn.BatchNorm2d(128),
    nn.ReLU(inplace=True),
)

config_list = [{'total_sparsity': 0.6, 'op_types': ['Conv2d']}]

dummy_input = torch.rand(5, 128, 256, 256)
pruner = L2NormPruner(model, config_list, mode='dependency_aware', dummy_input=dummy_input)
_, masks = pruner.compress()
pruner._unwrap_model()
ModelSpeedup(model, dummy_input, masks).speedup_model()

print(model)
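The masks produced above can be inspected right after pruner.compress(), before the speedup call, to see where the unbalance comes from. A minimal sketch, assuming the masks dict returned in NNI 2.8 is keyed by module name (here '0') and holds a 'weight' entry shaped like the conv weight:

# Run right after pruner.compress() in the snippet above.
weight_mask = masks['0']['weight']                   # assumed layout: shape [128, 1, 3, 3] for the depthwise conv
kept = weight_mask.flatten(1).abs().sum(dim=1) > 0   # which of the 128 filters survive
remaining_per_group = kept.view(128, -1).sum(dim=1)  # groups=128, so exactly one filter per group
print(remaining_per_group.unique())                  # e.g. tensor([0, 1]): groups end up unbalanced

Because groups equals the channel count, every group holds a single filter, so any filter-level sparsity leaves some groups with one filter and others with none, which appears to be exactly what UnBalancedGroupError reports.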

Log message:

[2022-07-08 11:41:49] start to speedup the model
[2022-07-08 11:41:53] infer module masks...
[2022-07-08 11:41:53] Update mask for 0
[2022-07-08 11:41:55] Update mask for 1
[2022-07-08 11:41:57] Update mask for 2
[2022-07-08 11:41:59] Update the indirect sparsity for the 2
/home/sf/anaconda3/envs/nni_py38/lib/python3.8/site-packages/torch/_tensor.py:1013: UserWarning: The .grad attribute of a Tensor that is not a leaf Tensor is being accessed. Its .grad attribute won't be populated during autograd.backward(). If you indeed want the .grad field to be populated for a non-leaf Tensor, use .retain_grad() on the non-leaf Tensor. If you access the non-leaf Tensor by mistake, make sure you access the leaf Tensor instead. See github.com/pytorch/pytorch/pull/30531 for more informations. (Triggered internally at /opt/conda/conda-bld/pytorch_1639180523671/work/build/aten/src/ATen/core/TensorBody.h:417.)
  return self._grad
[2022-07-08 11:41:59] Update the indirect sparsity for the 1
[2022-07-08 11:42:00] Update the indirect sparsity for the 0
[2022-07-08 11:42:01] resolve the mask conflict
[2022-07-08 11:42:01] replace compressed modules...
[2022-07-08 11:42:01] replace module (name: 0, op_type: Conv2d)
Traceback (most recent call last):
  File "ttttt.py", line 35, in <module>
    ModelSpeedup(model, dummy_input, masks).speedup_model()
  File "/home/sf/anaconda3/envs/nni_py38/lib/python3.8/site-packages/nni/compression/pytorch/speedup/compressor.py", line 543, in speedup_model
    self.replace_compressed_modules()
  File "/home/sf/anaconda3/envs/nni_py38/lib/python3.8/site-packages/nni/compression/pytorch/speedup/compressor.py", line 402, in replace_compressed_modules
    self.replace_submodule(unique_name)
  File "/home/sf/anaconda3/envs/nni_py38/lib/python3.8/site-packages/nni/compression/pytorch/speedup/compressor.py", line 473, in replace_submodule
    compressed_module = replace_function(
  File "/home/sf/anaconda3/envs/nni_py38/lib/python3.8/site-packages/nni/compression/pytorch/speedup/compress_modules.py", line 16, in <lambda>
    'Conv2d': lambda module, masks: replace_conv2d(module, masks),
  File "/home/sf/anaconda3/envs/nni_py38/lib/python3.8/site-packages/nni/compression/pytorch/speedup/compress_modules.py", line 424, in replace_conv2d
    raise UnBalancedGroupError()
nni.compression.pytorch.speedup.error_code.UnBalancedGroupError: The number remained filters in each group is different

HiddenMarkovModel · Jul 08 '22 06:07

@louis-j

ultmaster · Jul 08 '22 08:07

We will detect layers whose in_chan and out_chan are not pruned peer-to-peer, and disable pruning on those layers. I'm checking the detection logic.
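A rough sketch of what "peer-to-peer" means here (an illustration only, not NNI's actual implementation): for a grouped or depthwise conv, the channels kept on the input side must be exactly the channels kept on the output side, otherwise the layer cannot be rebuilt.

import torch

def pruned_peer_to_peer(in_channel_mask: torch.Tensor, out_channel_mask: torch.Tensor) -> bool:
    # Illustration only: a depthwise conv can only be rebuilt if the kept
    # input channels coincide with the kept output channels.
    return torch.equal(in_channel_mask.bool(), out_channel_mask.bool())

# Channel 2 is kept on the input side but dropped on the output side,
# so this layer is not pruned peer-to-peer and pruning would be disabled on it.
in_mask = torch.tensor([0, 1, 1, 1])
out_mask = torch.tensor([0, 1, 0, 1])
print(pruned_peer_to_peer(in_mask, out_mask))  # False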

Louis-J · Jul 12 '22 03:07

> We will detect layers whose in_chan and out_chan are not pruned peer-to-peer, and disable pruning on those layers. I'm checking the detection logic.

I used 'exclude' to disable pruning on these layers manually.
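For anyone hitting the same error, the workaround looks roughly like this; the op_names below are placeholders for whatever layers the traceback points at in your model:

config_list = [
    {'total_sparsity': 0.6, 'op_types': ['Conv2d']},
    # Hypothetical layer names: exclude the depthwise convs that speedup cannot rebuild.
    {'exclude': True, 'op_names': ['cpm.trunk.0.0', 'cpm.trunk.1.0', 'cpm.trunk.2.0']},
]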

HiddenMarkovModel · Jul 14 '22 05:07

There's a bug in the conflict detection when the first depthwise separable layer is set to be pruned. But I think the first depthwise separable layer should not be pruned, because its in_chan cannot be pruned. Do you really need to prune the first layer, or did the error occur at a later layer in the actual model, this snippet just being an example?

Louis-J · Jul 15 '22 09:07

> There's a bug in the conflict detection when the first depthwise separable layer is set to be pruned. But I think the first depthwise separable layer should not be pruned, because its in_chan cannot be pruned. Do you really need to prune the first layer, or did the error occur at a later layer in the actual model, this snippet just being an example?

The error occurs in the actual model. I am pruning the "Lightweight OpenPose" model and encountered this bug.

Part of the source code is as follows:


from torch import nn


def conv_dw_no_bn(in_channels, out_channels, kernel_size=3, padding=1, stride=1, dilation=1):
    return nn.Sequential(
        nn.Conv2d(in_channels, in_channels, kernel_size, stride, padding, dilation=dilation, groups=in_channels, bias=False),
        nn.ELU(inplace=True),

        nn.Conv2d(in_channels, out_channels, 1, 1, 0, bias=False),
        nn.ELU(inplace=True),
    )


class Cpm(nn.Module):
    """
    定义Cpm网络,包括conv和conv_dw_no_bn
    由于mobilenet V1输出为512维,有一个cpm的降维层,降维到128维
    """
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.align = conv(in_channels, out_channels, kernel_size=1, padding=0, bn=False)
        self.trunk = nn.Sequential(
            conv_dw_no_bn(out_channels, out_channels),
            conv_dw_no_bn(out_channels, out_channels),
            conv_dw_no_bn(out_channels, out_channels)
        )
        self.conv = conv(out_channels, out_channels, bn=False)

    def forward(self, x):
        x = self.align(x)
        x = self.conv(x + self.trunk(x))
        return x


class PoseEstimationNet(nn.Module):
    def __init__(self, num_refinement_stages=1, num_channels=128, num_heatmaps=19, num_pafs=38):
        """
        构建人体姿态估计的网络结构
        :param num_refinement_stages: refinement阶段的个数
        :param num_channels: 特征的个数. cpm阶段的输出channel数
        :param num_heatmaps: 热图的个数.
        :param num_pafs: num_pafs的个数. 其为(x,y)最表组合, 因此等于num_heatmaps*2
        """
        super(PoseEstimationNet, self).__init__()
        # Build the MobileNet layers up to conv5_5
        self.model = nn.Sequential(
            conv(     3,  32, stride=2, bias=False),
            conv_dw( 32,  64),
            conv_dw( 64, 128, stride=2),
            conv_dw(128, 128),
            conv_dw(128, 256, stride=2),
            conv_dw(256, 256),
            conv_dw(256, 512),  # conv4_2
            conv_dw(512, 512, dilation=2, padding=2),
            conv_dw(512, 512),
            conv_dw(512, 512),
            conv_dw(512, 512),
            conv_dw(512, 512)   # conv5_5
        )
        # Use cpm to further aggregate the features
        self.cpm = Cpm(512, num_channels)
        
        # fixme
        # other code ...

    def forward(self, x):
        # backbone
        backbone_features = self.model(x)
        # cpm
        backbone_features = self.cpm(backbone_features)
        
        # fixme
        # other code ...

        return backbone_features
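The helpers conv and conv_dw are not shown above. A sketch reconstructed from their call sites so the snippet is self-contained; the actual definitions live in the Lightweight OpenPose repository and may differ in detail:

def conv(in_channels, out_channels, kernel_size=3, padding=1, bn=True,
         dilation=1, stride=1, relu=True, bias=True):
    # Plain conv block, optionally followed by BatchNorm and ReLU.
    modules = [nn.Conv2d(in_channels, out_channels, kernel_size, stride, padding,
                         dilation=dilation, bias=bias)]
    if bn:
        modules.append(nn.BatchNorm2d(out_channels))
    if relu:
        modules.append(nn.ReLU(inplace=True))
    return nn.Sequential(*modules)


def conv_dw(in_channels, out_channels, kernel_size=3, padding=1, stride=1, dilation=1):
    # Depthwise 3x3 conv followed by pointwise 1x1 conv, each with BN + ReLU.
    return nn.Sequential(
        nn.Conv2d(in_channels, in_channels, kernel_size, stride, padding,
                  dilation=dilation, groups=in_channels, bias=False),
        nn.BatchNorm2d(in_channels),
        nn.ReLU(inplace=True),

        nn.Conv2d(in_channels, out_channels, 1, 1, 0, bias=False),
        nn.BatchNorm2d(out_channels),
        nn.ReLU(inplace=True),
    )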

The error log:

[2022-07-15 18:42:42] replace module (name: model.11.5, op_type: ReLU)
[2022-07-15 18:42:42] replace module (name: cpm.align.0, op_type: Conv2d)
[2022-07-15 18:42:42] replace module (name: cpm.align.1, op_type: ReLU)
[2022-07-15 18:42:42] replace module (name: cpm.trunk.0.0, op_type: Conv2d)
Traceback (most recent call last):
  File "Pruning/LightOpenposePruning.py", line 38, in <module>
    ModelSpeedup(model, dummy_input, mask).speedup_model()
  File "site-packages\nni\compression\pytorch\speedup\compressor.py", line 543, in speedup_model
    self.replace_compressed_modules()
  File "site-packages\nni\compression\pytorch\speedup\compressor.py", line 402, in replace_compressed_modules
    self.replace_submodule(unique_name)
  File "site-packages\nni\compression\pytorch\speedup\compressor.py", line 474, in replace_submodule
    leaf_module, auto_infer.get_masks())
  File "site-packages\nni\compression\pytorch\speedup\compress_modules.py", line 16, in <lambda>
    'Conv2d': lambda module, masks: replace_conv2d(module, masks),
  File "site-packages\nni\compression\pytorch\speedup\compress_modules.py", line 424, in replace_conv2d
    raise UnBalancedGroupError()
nni.compression.pytorch.speedup.error_code.UnBalancedGroupError: The number remained filters in each group is different

The backbone called "model" is MobileNet V1, and it can be pruned correctly. But the sub-module called 'cpm', which is built from depthwise separable layers, hits this bug; the failing layer is the Conv2d inside cpm.trunk[0], which is not the first layer of 'cpm'.

So why does this happen?

HiddenMarkovModel · Jul 15 '22 11:07

Could you please provide runnable code that I can test?

Louis-J · Jul 19 '22 11:07

> Could you please provide runnable code that I can test?

Code Repository: https://github.com/HiddenMarkovModel/LightWeightOpenposePruning

You should create a folder named "checkpoints" in the repository root, download the model from https://drive.google.com/file/d/11h0tkTqpX76ikv4HotHgHb-eDDbL5NQj/view and put the model into "checkpoints".

Run "python LightOpenposePruning.py" to start to prune.

HiddenMarkovModel · Jul 22 '22 07:07

Reproduced the bug. We need some time to fix it.

Louis-J · Jul 26 '22 06:07

> Reproduced the bug. We need some time to fix it.

Please reply after you've fixed it~~

HiddenMarkovModel · Aug 01 '22 08:08

Hi, any progress on this? @Louis-J

XiaoLaoDi · Aug 29 '22 03:08

Sorry for not replying these days; we have postponed the fix to v2.9.1.

Louis-J · Sep 13 '22 06:09

Hi, any progress on this? @Louis-J

nnzs9248 · Nov 16 '22 07:11

We will confirm that; please wait some time, thanks!

Lijiaoa · Nov 16 '22 09:11

Sorry, no progress so far. It's hard to fix, and we are now refactoring ModelSpeedup in 3.0; after that, we will try to fix it.

Louis-J · Nov 16 '22 10:11

> Hi, any progress on this? @Louis-J

Hello @nnzs9248, could you give us a demo to reproduce your issue, so that we can make sure it is the same issue?

J-shang · Nov 18 '22 05:11