EmptyLayerError() or UnBalancedGroupError() when pruning a depthwise separable convolution
Describe the issue: Errors when pruning a depthwise separable convolution.
Depending on the value of 'total_sparsity', speedup may fail with one of the following errors:
- EmptyLayerError
raise EmptyLayerError()
nni.compression.pytorch.speedup.error_code.EmptyLayerError: Pruning a Layer to empty is not legal
- UnBalancedGroupError
raise UnBalancedGroupError()
nni.compression.pytorch.speedup.error_code.UnBalancedGroupError: The number remained filters in each group is different
#4648 and #4796 seem to address this error, but they don't work in my code.
How can I solve it? Any help would be greatly appreciated, thanks!
Environment:
- NNI version: 2.8
- Training service (local|remote|pai|aml|etc):
- Client OS: ubuntu 18.04
- Server OS (for remote mode only):
- Python version: 3.8
- PyTorch/TensorFlow version: pytorch=1.10.1
- Is conda/virtualenv/venv used?: conda
- Is running in Docker?: no
How to reproduce it?:
import torch
from nni.compression.pytorch.pruning import L1NormPruner, L2NormPruner
from nni.compression.pytorch.speedup import ModelSpeedup
from torch import nn

model = nn.Sequential(
    nn.Conv2d(128, 128, (3, 3), (1, 1), 1, groups=128, bias=False),
    nn.BatchNorm2d(128),
    nn.ReLU(inplace=True),
)
config_list = [{'total_sparsity': 0.6, 'op_types': ['Conv2d']}]
dummy_input = torch.rand(5, 128, 256, 256)
pruner = L2NormPruner(model, config_list, mode='dependency_aware', dummy_input=dummy_input)
_, masks = pruner.compress()
pruner._unwrap_model()
ModelSpeedup(model, dummy_input, masks).speedup_model()
print(model)
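For reference, here is a small sweep over sparsity values. This is my own sketch, reusing only the snippet above; it rebuilds the toy model for each value (pruning mutates the model) and records which speedup error, if any, that value triggers:

# Hedged sketch: check which 'total_sparsity' values trigger which speedup error.
import torch
from torch import nn
from nni.compression.pytorch.pruning import L2NormPruner
from nni.compression.pytorch.speedup import ModelSpeedup

def try_sparsity(sparsity):
    model = nn.Sequential(
        nn.Conv2d(128, 128, (3, 3), (1, 1), 1, groups=128, bias=False),
        nn.BatchNorm2d(128),
        nn.ReLU(inplace=True),
    )
    config_list = [{'total_sparsity': sparsity, 'op_types': ['Conv2d']}]
    dummy_input = torch.rand(5, 128, 256, 256)
    pruner = L2NormPruner(model, config_list, mode='dependency_aware', dummy_input=dummy_input)
    _, masks = pruner.compress()
    pruner._unwrap_model()
    try:
        ModelSpeedup(model, dummy_input, masks).speedup_model()
        return 'ok'
    except Exception as exc:  # EmptyLayerError / UnBalancedGroupError land here
        return type(exc).__name__

for s in (0.3, 0.5, 0.6, 0.9):
    print(s, try_sparsity(s))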
Log message:
[2022-07-08 11:41:49] start to speedup the model
[2022-07-08 11:41:53] infer module masks...
[2022-07-08 11:41:53] Update mask for 0
[2022-07-08 11:41:55] Update mask for 1
[2022-07-08 11:41:57] Update mask for 2
[2022-07-08 11:41:59] Update the indirect sparsity for the 2
/home/sf/anaconda3/envs/nni_py38/lib/python3.8/site-packages/torch/_tensor.py:1013: UserWarning: The .grad attribute of a Tensor that is not a leaf Tensor is being accessed. Its .grad attribute won't be populated during autograd.backward(). If you indeed want the .grad field to be populated for a non-leaf Tensor, use .retain_grad() on the non-leaf Tensor. If you access the non-leaf Tensor by mistake, make sure you access the leaf Tensor instead. See github.com/pytorch/pytorch/pull/30531 for more informations. (Triggered internally at /opt/conda/conda-bld/pytorch_1639180523671/work/build/aten/src/ATen/core/TensorBody.h:417.)
return self._grad
[2022-07-08 11:41:59] Update the indirect sparsity for the 1
[2022-07-08 11:42:00] Update the indirect sparsity for the 0
[2022-07-08 11:42:01] resolve the mask conflict
[2022-07-08 11:42:01] replace compressed modules...
[2022-07-08 11:42:01] replace module (name: 0, op_type: Conv2d)
Traceback (most recent call last):
File "ttttt.py", line 35, in
@louis-j
We will detect the layers where the in_chan and out_chan are not pruned peer-to-peer, and disable the pruning on these layers. I'm checking the detection.
I use 'exclude' to disable the pruning on these layers manually.
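For example, assuming the 'exclude' key in the pruning config list is what is meant here, the workaround could look like the sketch below. The layer name '0' is the depthwise conv in the toy repro above (it appears in the log line "replace module (name: 0, op_type: Conv2d)"); in a real model you would list the op_names of the problematic depthwise layers instead:

# Hedged sketch of the 'exclude' workaround: prune all Conv2d layers except the
# ones listed by name. In the toy repro this excludes the only conv, so it is
# just a syntax illustration; substitute names from model.named_modules().
config_list = [
    {'total_sparsity': 0.6, 'op_types': ['Conv2d']},
    {'exclude': True, 'op_names': ['0']},
]
pruner = L2NormPruner(model, config_list, mode='dependency_aware', dummy_input=dummy_input)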
There's a bug in the conflict detection when the first depthwise separable layer is set to be pruned. But I think the first depthwise separable layer should not be pruned anyway, because its in_chan cannot be pruned. Do you really need to prune the first layer, or did the error occur at a different layer in your actual model, with this snippet just being an example?
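To illustrate why the input and output channels of a depthwise conv are coupled (and so cannot be pruned independently), here is a minimal standalone sketch of my own, not NNI code: with groups == in_channels, output channel i is computed only from input channel i, so both sides must keep exactly the same channel indices.

# Standalone sketch (not NNI code): channel coupling in a depthwise conv.
import torch
from torch import nn

dw = nn.Conv2d(8, 8, 3, padding=1, groups=8, bias=False)   # weight shape: (8, 1, 3, 3)
keep = torch.tensor([0, 2, 3, 5, 6, 7])                    # indices kept on BOTH sides

pruned = nn.Conv2d(len(keep), len(keep), 3, padding=1, groups=len(keep), bias=False)
with torch.no_grad():
    pruned.weight.copy_(dw.weight[keep])                    # one filter per kept channel

x = torch.rand(1, 8, 16, 16)
with torch.no_grad():
    # Pruning input and output channels with the same indices preserves the outputs.
    assert torch.allclose(pruned(x[:, keep]), dw(x)[:, keep], atol=1e-6)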
The error occurs in the actual model. I am pruning the "Lightweight OpenPose" model and hit this bug.
Part of the source code is as follows:
def conv_dw_no_bn(in_channels, out_channels, kernel_size=3, padding=1, stride=1, dilation=1):
    return nn.Sequential(
        nn.Conv2d(in_channels, in_channels, kernel_size, stride, padding, dilation=dilation, groups=in_channels, bias=False),
        nn.ELU(inplace=True),
        nn.Conv2d(in_channels, out_channels, 1, 1, 0, bias=False),
        nn.ELU(inplace=True),
    )

class Cpm(nn.Module):
    """
    Cpm block, built from conv and conv_dw_no_bn.
    MobileNet V1 outputs 512 channels, so cpm reduces them to 128.
    """
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.align = conv(in_channels, out_channels, kernel_size=1, padding=0, bn=False)
        self.trunk = nn.Sequential(
            conv_dw_no_bn(out_channels, out_channels),
            conv_dw_no_bn(out_channels, out_channels),
            conv_dw_no_bn(out_channels, out_channels)
        )
        self.conv = conv(out_channels, out_channels, bn=False)

    def forward(self, x):
        x = self.align(x)
        x = self.conv(x + self.trunk(x))
        return x

class PoseEstimationNet(nn.Module):
    def __init__(self, num_refinement_stages=1, num_channels=128, num_heatmaps=19, num_pafs=38):
        """
        Build the human pose estimation network.
        :param num_refinement_stages: number of refinement stages
        :param num_channels: number of feature channels (output channels of the cpm stage)
        :param num_heatmaps: number of heatmaps
        :param num_pafs: number of PAFs; each is an (x, y) coordinate pair, so it equals num_heatmaps * 2
        """
        super(PoseEstimationNet, self).__init__()
        # MobileNet backbone up to conv5_5
        self.model = nn.Sequential(
            conv(  3,  32, stride=2, bias=False),
            conv_dw( 32,  64),
            conv_dw( 64, 128, stride=2),
            conv_dw(128, 128),
            conv_dw(128, 256, stride=2),
            conv_dw(256, 256),
            conv_dw(256, 512),  # conv4_2
            conv_dw(512, 512, dilation=2, padding=2),
            conv_dw(512, 512),
            conv_dw(512, 512),
            conv_dw(512, 512),
            conv_dw(512, 512)   # conv5_5
        )
        # cpm further aggregates the backbone features
        self.cpm = Cpm(512, num_channels)
        # fixme
        # other code ...

    def forward(self, x):
        # backbone
        backbone_features = self.model(x)
        # cpm
        backbone_features = self.cpm(backbone_features)
        # fixme
        # other code ...
        return backbone_features
The error log:
[2022-07-15 18:42:42] replace module (name: model.11.5, op_type: ReLU)
[2022-07-15 18:42:42] replace module (name: cpm.align.0, op_type: Conv2d)
[2022-07-15 18:42:42] replace module (name: cpm.align.1, op_type: ReLU)
[2022-07-15 18:42:42] replace module (name: cpm.trunk.0.0, op_type: Conv2d)
Traceback (most recent call last):
File "Pruning/LightOpenposePruning.py", line 38, in <module>
ModelSpeedup(model, dummy_input, mask).speedup_model()
File "site-packages\nni\compression\pytorch\speedup\compressor.py", line 543, in speedup_model
self.replace_compressed_modules()
File "site-packages\nni\compression\pytorch\speedup\compressor.py", line 402, in replace_compressed_modules
self.replace_submodule(unique_name)
File "site-packages\nni\compression\pytorch\speedup\compressor.py", line 474, in replace_submodule
leaf_module, auto_infer.get_masks())
File "site-packages\nni\compression\pytorch\speedup\compress_modules.py", line 16, in <lambda>
'Conv2d': lambda module, masks: replace_conv2d(module, masks),
File "site-packages\nni\compression\pytorch\speedup\compress_modules.py", line 424, in replace_conv2d
raise UnBalancedGroupError()
nni.compression.pytorch.speedup.error_code.UnBalancedGroupError: The number remained filters in each group is different
The backbone called "model" is MobileNet V1 and it can be pruned correctly. But the sub-module called 'cpm', which is built from depthwise separable layers, hits this bug; the failing layer is cpm.trunk[0].Conv2d, which is not the first layer of 'cpm'.
So why does this happen?
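As an aside, the 'exclude' workaround mentioned earlier could be applied to these layers while waiting for a fix. A sketch only: 'cpm.trunk.0.0' is taken from the error log above, the other two names are inferred from the module structure (the depthwise conv is index 0 of each conv_dw_no_bn Sequential), and the 0.6 sparsity is from the toy repro, so verify all of this against model.named_modules() and your own config:

# Hedged workaround sketch: exclude the depthwise convs inside cpm from pruning.
config_list = [
    {'total_sparsity': 0.6, 'op_types': ['Conv2d']},
    {'exclude': True, 'op_names': ['cpm.trunk.0.0', 'cpm.trunk.1.0', 'cpm.trunk.2.0']},
]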
Could you please offer runnable code that I can test?
Code Repository: https://github.com/HiddenMarkovModel/LightWeightOpenposePruning
Create a "checkpoints" folder in the repository root, download the model from https://drive.google.com/file/d/11h0tkTqpX76ikv4HotHgHb-eDDbL5NQj/view, and put it in "checkpoints".
Run "python LightOpenposePruning.py" to start pruning.
Reproduced the bug; we need some time to fix it.
Please reply here once it is fixed.
Hi, any progress on this? @Louis-J
Sorry for not replying these days; we have postponed the fix to v2.9.1.
Hi, any progress on this? @Louis-J
Will confirm that; please wait some time, thanks!
Sorry, no progress so far. It's hard to fix, and we are now refactoring ModelSpeedup in 3.0; after that we will try to fix it.
Hi, any progress on this? @Louis-J
Hello @nnzs9248, could you give us a demo to reproduce your issue, so that we can make sure it is the same issue?