
EmptyLayerError() or UnBalancedGroupError() when pruning a depthwise separable convolution

Open · HiddenMarkovModel opened this issue 2 years ago • 11 comments

Describe the issue: Errors when pruning depthwise separable convolution

Depending on the value of 'total_sparsity', the speedup may fail with one of the following errors:

  1. EmptyLayerError
       raise EmptyLayerError()
     nni.compression.pytorch.speedup.error_code.EmptyLayerError: Pruning a Layer to empty is not legal

  2. UnBalancedGroupError
       raise UnBalancedGroupError()
     nni.compression.pytorch.speedup.error_code.UnBalancedGroupError: The number remained filters in each group is different

#4648 and #4796 seem to address this error, but they don't work for my code.

How can I solve it? Any help would be greatly appreciated, thanks!

Environment:

  • NNI version: 2.8
  • Training service (local|remote|pai|aml|etc):
  • Client OS: ubuntu 18.04
  • Server OS (for remote mode only):
  • Python version: 3.8
  • PyTorch/TensorFlow version: pytorch=1.10.1
  • Is conda/virtualenv/venv used?: conda
  • Is running in Docker?: no

How to reproduce it?:

import torch
from nni.compression.pytorch.pruning import L2NormPruner
from nni.compression.pytorch.speedup import ModelSpeedup
from torch import nn

model = nn.Sequential(
    nn.Conv2d(128, 128, (3, 3), (1, 1), 1, groups=128, bias=False),
    nn.BatchNorm2d(128),
    nn.ReLU(inplace=True),
)

config_list = [{'total_sparsity': 0.6, 'op_types': ['Conv2d']}]

dummy_input = torch.rand(5, 128, 256, 256)
pruner = L2NormPruner(model, config_list, mode='dependency_aware', dummy_input=dummy_input)
_, masks = pruner.compress()
pruner._unwrap_model()
ModelSpeedup(model, dummy_input, masks).speedup_model()

print(model)
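The masks produced above can be inspected right after pruner.compress(), before the speedup call, to see where the unbalance comes from. A minimal sketch, assuming the masks dict returned in NNI 2.8 is keyed by module name (here '0') and holds a 'weight' entry shaped like the conv weight:

# Run right after pruner.compress() in the snippet above.
weight_mask = masks['0']['weight']                   # assumed layout: shape [128, 1, 3, 3] for the depthwise conv
kept = weight_mask.flatten(1).abs().sum(dim=1) > 0   # which of the 128 filters survive
remaining_per_group = kept.view(128, -1).sum(dim=1)  # groups=128, so exactly one filter per group
print(remaining_per_group.unique())                  # e.g. tensor([0, 1]): groups end up unbalanced

Because groups equals the channel count, every group holds a single filter, so any filter-level sparsity leaves some groups with one filter and others with none, which appears to be exactly what UnBalancedGroupError reports.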

Log message:

[2022-07-08 11:41:49] start to speedup the model
[2022-07-08 11:41:53] infer module masks...
[2022-07-08 11:41:53] Update mask for 0
[2022-07-08 11:41:55] Update mask for 1
[2022-07-08 11:41:57] Update mask for 2
[2022-07-08 11:41:59] Update the indirect sparsity for the 2
/home/sf/anaconda3/envs/nni_py38/lib/python3.8/site-packages/torch/_tensor.py:1013: UserWarning: The .grad attribute of a Tensor that is not a leaf Tensor is being accessed. Its .grad attribute won't be populated during autograd.backward(). If you indeed want the .grad field to be populated for a non-leaf Tensor, use .retain_grad() on the non-leaf Tensor. If you access the non-leaf Tensor by mistake, make sure you access the leaf Tensor instead. See github.com/pytorch/pytorch/pull/30531 for more informations. (Triggered internally at /opt/conda/conda-bld/pytorch_1639180523671/work/build/aten/src/ATen/core/TensorBody.h:417.)
  return self._grad
[2022-07-08 11:41:59] Update the indirect sparsity for the 1
[2022-07-08 11:42:00] Update the indirect sparsity for the 0
[2022-07-08 11:42:01] resolve the mask conflict
[2022-07-08 11:42:01] replace compressed modules...
[2022-07-08 11:42:01] replace module (name: 0, op_type: Conv2d)
Traceback (most recent call last):
  File "ttttt.py", line 35, in <module>
    ModelSpeedup(model, dummy_input, masks).speedup_model()
  File "/home/sf/anaconda3/envs/nni_py38/lib/python3.8/site-packages/nni/compression/pytorch/speedup/compressor.py", line 543, in speedup_model
    self.replace_compressed_modules()
  File "/home/sf/anaconda3/envs/nni_py38/lib/python3.8/site-packages/nni/compression/pytorch/speedup/compressor.py", line 402, in replace_compressed_modules
    self.replace_submodule(unique_name)
  File "/home/sf/anaconda3/envs/nni_py38/lib/python3.8/site-packages/nni/compression/pytorch/speedup/compressor.py", line 473, in replace_submodule
    compressed_module = replace_function(
  File "/home/sf/anaconda3/envs/nni_py38/lib/python3.8/site-packages/nni/compression/pytorch/speedup/compress_modules.py", line 16, in <lambda>
    'Conv2d': lambda module, masks: replace_conv2d(module, masks),
  File "/home/sf/anaconda3/envs/nni_py38/lib/python3.8/site-packages/nni/compression/pytorch/speedup/compress_modules.py", line 424, in replace_conv2d
    raise UnBalancedGroupError()
nni.compression.pytorch.speedup.error_code.UnBalancedGroupError: The number remained filters in each group is different

HiddenMarkovModel · Jul 08 '22 06:07

@louis-j

ultmaster · Jul 08 '22 08:07

We will detect layers whose in_chan and out_chan are not pruned peer-to-peer, and disable pruning on those layers. I'm checking the detection logic.
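A rough sketch of what "peer-to-peer" means here (an illustration only, not NNI's actual implementation): for a grouped or depthwise conv, the channels kept on the input side must be exactly the channels kept on the output side, otherwise the layer cannot be rebuilt.

import torch

def pruned_peer_to_peer(in_channel_mask: torch.Tensor, out_channel_mask: torch.Tensor) -> bool:
    # Illustration only: a depthwise conv can only be rebuilt if the kept
    # input channels coincide with the kept output channels.
    return torch.equal(in_channel_mask.bool(), out_channel_mask.bool())

# Channel 2 is kept on the input side but dropped on the output side,
# so this layer is not pruned peer-to-peer and pruning would be disabled on it.
in_mask = torch.tensor([0, 1, 1, 1])
out_mask = torch.tensor([0, 1, 0, 1])
print(pruned_peer_to_peer(in_mask, out_mask))  # False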

Louis-J · Jul 12 '22 03:07

> We will detect layers whose in_chan and out_chan are not pruned peer-to-peer, and disable pruning on those layers. I'm checking the detection logic.

I used 'exclude' to disable pruning on these layers manually.
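For anyone hitting the same error, the workaround looks roughly like this; the op_names below are placeholders for whatever layers the traceback points at in your model:

config_list = [
    {'total_sparsity': 0.6, 'op_types': ['Conv2d']},
    # Hypothetical layer names: exclude the depthwise convs that speedup cannot rebuild.
    {'exclude': True, 'op_names': ['cpm.trunk.0.0', 'cpm.trunk.1.0', 'cpm.trunk.2.0']},
]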

HiddenMarkovModel · Jul 14 '22 05:07

There's a bug in the conflict detection when the first depthwise separable layer is set to be pruned. But I think the first depthwise separable layer should not be pruned, because its in_chan cannot be pruned. Do you really need to prune the first layer, or did the error occur at a later layer in the actual model, this snippet just being an example?

Louis-J · Jul 15 '22 09:07

> There's a bug in the conflict detection when the first depthwise separable layer is set to be pruned. But I think the first depthwise separable layer should not be pruned, because its in_chan cannot be pruned. Do you really need to prune the first layer, or did the error occur at a later layer in the actual model, this snippet just being an example?

The error occurs in the actual model. I am pruning the "Lightweight OpenPose" model and encountered this bug.

Part of the source code is as follows:


from torch import nn


def conv_dw_no_bn(in_channels, out_channels, kernel_size=3, padding=1, stride=1, dilation=1):
    return nn.Sequential(
        nn.Conv2d(in_channels, in_channels, kernel_size, stride, padding, dilation=dilation, groups=in_channels, bias=False),
        nn.ELU(inplace=True),

        nn.Conv2d(in_channels, out_channels, 1, 1, 0, bias=False),
        nn.ELU(inplace=True),
    )


class Cpm(nn.Module):
    """
    定义Cpm网络,包括conv和conv_dw_no_bn
    由于mobilenet V1输出为512维,有一个cpm的降维层,降维到128维
    """
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.align = conv(in_channels, out_channels, kernel_size=1, padding=0, bn=False)
        self.trunk = nn.Sequential(
            conv_dw_no_bn(out_channels, out_channels),
            conv_dw_no_bn(out_channels, out_channels),
            conv_dw_no_bn(out_channels, out_channels)
        )
        self.conv = conv(out_channels, out_channels, bn=False)

    def forward(self, x):
        x = self.align(x)
        x = self.conv(x + self.trunk(x))
        return x


class PoseEstimationNet(nn.Module):
    def __init__(self, num_refinement_stages=1, num_channels=128, num_heatmaps=19, num_pafs=38):
        """
        构建人体姿态估计的网络结构
        :param num_refinement_stages: refinement阶段的个数
        :param num_channels: 特征的个数. cpm阶段的输出channel数
        :param num_heatmaps: 热图的个数.
        :param num_pafs: num_pafs的个数. 其为(x,y)最表组合, 因此等于num_heatmaps*2
        """
        super(PoseEstimationNet, self).__init__()
        # Build the MobileNet layers up to conv5_5
        self.model = nn.Sequential(
            conv(     3,  32, stride=2, bias=False),
            conv_dw( 32,  64),
            conv_dw( 64, 128, stride=2),
            conv_dw(128, 128),
            conv_dw(128, 256, stride=2),
            conv_dw(256, 256),
            conv_dw(256, 512),  # conv4_2
            conv_dw(512, 512, dilation=2, padding=2),
            conv_dw(512, 512),
            conv_dw(512, 512),
            conv_dw(512, 512),
            conv_dw(512, 512)   # conv5_5
        )
        # Use cpm to further aggregate the features
        self.cpm = Cpm(512, num_channels)
        
        # fixme
        # other code ...

    def forward(self, x):
        # backbone
        backbone_features = self.model(x)
        # cpm
        backbone_features = self.cpm(backbone_features)
        
        # fixme
        # other code ...

        return backbone_features
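The helpers conv and conv_dw are not shown above. A sketch reconstructed from their call sites so the snippet is self-contained; the actual definitions live in the Lightweight OpenPose repository and may differ in detail:

def conv(in_channels, out_channels, kernel_size=3, padding=1, bn=True,
         dilation=1, stride=1, relu=True, bias=True):
    # Plain conv block, optionally followed by BatchNorm and ReLU.
    modules = [nn.Conv2d(in_channels, out_channels, kernel_size, stride, padding,
                         dilation=dilation, bias=bias)]
    if bn:
        modules.append(nn.BatchNorm2d(out_channels))
    if relu:
        modules.append(nn.ReLU(inplace=True))
    return nn.Sequential(*modules)


def conv_dw(in_channels, out_channels, kernel_size=3, padding=1, stride=1, dilation=1):
    # Depthwise 3x3 conv followed by pointwise 1x1 conv, each with BN + ReLU.
    return nn.Sequential(
        nn.Conv2d(in_channels, in_channels, kernel_size, stride, padding,
                  dilation=dilation, groups=in_channels, bias=False),
        nn.BatchNorm2d(in_channels),
        nn.ReLU(inplace=True),

        nn.Conv2d(in_channels, out_channels, 1, 1, 0, bias=False),
        nn.BatchNorm2d(out_channels),
        nn.ReLU(inplace=True),
    )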

The error log:

[2022-07-15 18:42:42] replace module (name: model.11.5, op_type: ReLU)
[2022-07-15 18:42:42] replace module (name: cpm.align.0, op_type: Conv2d)
[2022-07-15 18:42:42] replace module (name: cpm.align.1, op_type: ReLU)
[2022-07-15 18:42:42] replace module (name: cpm.trunk.0.0, op_type: Conv2d)
Traceback (most recent call last):
  File "Pruning/LightOpenposePruning.py", line 38, in <module>
    ModelSpeedup(model, dummy_input, mask).speedup_model()
  File "site-packages\nni\compression\pytorch\speedup\compressor.py", line 543, in speedup_model
    self.replace_compressed_modules()
  File "site-packages\nni\compression\pytorch\speedup\compressor.py", line 402, in replace_compressed_modules
    self.replace_submodule(unique_name)
  File "site-packages\nni\compression\pytorch\speedup\compressor.py", line 474, in replace_submodule
    leaf_module, auto_infer.get_masks())
  File "site-packages\nni\compression\pytorch\speedup\compress_modules.py", line 16, in <lambda>
    'Conv2d': lambda module, masks: replace_conv2d(module, masks),
  File "site-packages\nni\compression\pytorch\speedup\compress_modules.py", line 424, in replace_conv2d
    raise UnBalancedGroupError()
nni.compression.pytorch.speedup.error_code.UnBalancedGroupError: The number remained filters in each group is different

The backbone called "model" is MobileNet V1, and it can be pruned correctly. But the sub-module called 'cpm', which is built from depthwise separable layers, hits this bug; the failing layer is the Conv2d inside cpm.trunk[0], which is not the first layer of 'cpm'.

So why does this happen?

HiddenMarkovModel · Jul 15 '22 11:07

Could you please provide runnable code that I can test?

Louis-J · Jul 19 '22 11:07

> Could you please provide runnable code that I can test?

Code Repository: https://github.com/HiddenMarkovModel/LightWeightOpenposePruning

You should create a folder named "checkpoints" in the repository root, download the model from https://drive.google.com/file/d/11h0tkTqpX76ikv4HotHgHb-eDDbL5NQj/view and put the model into "checkpoints".

Run "python LightOpenposePruning.py" to start to prune.

HiddenMarkovModel · Jul 22 '22 07:07

Reproduced the bug. We need some time to fix it.

Louis-J · Jul 26 '22 06:07

> Reproduced the bug. We need some time to fix it.

Please reply after you've fixed it~~

HiddenMarkovModel · Aug 01 '22 08:08

Hi, any progress on this? @Louis-J

XiaoLaoDi · Aug 29 '22 03:08

Sorry for not replying these days; we have postponed the fix to v2.9.1.

Louis-J · Sep 13 '22 06:09

Hi, any progress on this? @Louis-J

nnzs9248 · Nov 16 '22 07:11

We will confirm that; please wait some time, thanks!

Lijiaoa · Nov 16 '22 09:11

Sorry, no progress so far. It's hard to fix, and we are now refactoring ModelSpeedup in 3.0; after that, we will try to fix it.

Louis-J · Nov 16 '22 10:11

> Hi, any progress on this? @Louis-J

Hello @nnzs9248, could you give us a demo to reproduce your issue, so that we can make sure it is the same issue?

J-shang · Nov 18 '22 05:11