What will happen if I change the backbone network to MobileNet-v2 with FPN?
Hi, I want to try changing the backbone network to MobileNet-v2 with FPN. Are there any suggestions? THX!!!
That would be a good test, though I will warn that we've tried VGG16+FPN before and got sub-par performance (though maybe that's to be expected). If you want to change out the backbone, head over to backbone.py and check out how I implemented ResNet and VGG. If you need any help, let me know.
@dbolya Hi, I have done a preliminary test on my own simple dataset, and the experiment result is exciting~! With MobileNet-v2, I successfully reduced the model size from over 100 MB to 31 MB without a significant performance loss on my own dataset. Next, I will try it on the MS-COCO dataset, and I hope for fantastic performance ^_^
Sounds good! If it ends up working well, I might just make an official MobileNet-v2 version for YOLACT v1.1. That seems like it would be nice to have.
@dbolya Hi, could you show me your training log file for ResNet on COCO?
This is my recent training log with MobileNet-v2 on COCO...
[ 16] 59810 || B: 2.489 | C: 3.295 | M: 2.861 | S: 0.897 | T: 9.542 || ETA: 19 days, 23:21:26 || timer: 2.437
[ 16] 59820 || B: 2.505 | C: 3.321 | M: 2.840 | S: 0.898 | T: 9.565 || ETA: 19 days, 22:54:13 || timer: 2.125
[ 16] 59830 || B: 2.500 | C: 3.299 | M: 2.802 | S: 0.910 | T: 9.511 || ETA: 19 days, 22:34:59 || timer: 1.966
[ 16] 59840 || B: 2.504 | C: 3.271 | M: 2.787 | S: 0.900 | T: 9.461 || ETA: 19 days, 22:51:36 || timer: 2.071
[ 16] 59850 || B: 2.480 | C: 3.282 | M: 2.751 | S: 0.881 | T: 9.394 || ETA: 19 days, 22:43:14 || timer: 2.415
[ 16] 59860 || B: 2.462 | C: 3.247 | M: 2.752 | S: 0.883 | T: 9.344 || ETA: 19 days, 22:27:50 || timer: 2.144
[ 16] 59870 || B: 2.460 | C: 3.247 | M: 2.726 | S: 0.881 | T: 9.315 || ETA: 19 days, 22:22:21 || timer: 2.091
[ 16] 59880 || B: 2.489 | C: 3.290 | M: 2.696 | S: 0.891 | T: 9.366 || ETA: 19 days, 22:38:05 || timer: 2.101
[ 16] 59890 || B: 2.478 | C: 3.300 | M: 2.712 | S: 0.892 | T: 9.382 || ETA: 19 days, 22:48:17 || timer: 2.301
[ 16] 59900 || B: 2.485 | C: 3.299 | M: 2.711 | S: 0.889 | T: 9.384 || ETA: 19 days, 22:53:21 || timer: 2.046
[ 16] 59910 || B: 2.514 | C: 3.316 | M: 2.724 | S: 0.880 | T: 9.435 || ETA: 19 days, 23:27:14 || timer: 1.973
[ 16] 59920 || B: 2.505 | C: 3.308 | M: 2.721 | S: 0.877 | T: 9.410 || ETA: 19 days, 23:25:19 || timer: 2.257
[ 16] 59930 || B: 2.496 | C: 3.290 | M: 2.710 | S: 0.863 | T: 9.360 || ETA: 19 days, 23:15:29 || timer: 2.061
[ 16] 59940 || B: 2.467 | C: 3.269 | M: 2.694 | S: 0.864 | T: 9.293 || ETA: 19 days, 22:23:27 || timer: 1.924
[ 16] 59950 || B: 2.473 | C: 3.249 | M: 2.702 | S: 0.868 | T: 9.292 || ETA: 19 days, 22:55:27 || timer: 2.374
[ 16] 59960 || B: 2.497 | C: 3.261 | M: 2.702 | S: 0.858 | T: 9.318 || ETA: 19 days, 23:18:36 || timer: 2.055
[ 16] 59970 || B: 2.532 | C: 3.274 | M: 2.726 | S: 0.862 | T: 9.393 || ETA: 19 days, 23:51:10 || timer: 2.111
[ 16] 59980 || B: 2.493 | C: 3.262 | M: 2.723 | S: 0.861 | T: 9.339 || ETA: 19 days, 23:32:02 || timer: 1.991
[ 16] 59990 || B: 2.500 | C: 3.253 | M: 2.718 | S: 0.860 | T: 9.331 || ETA: 19 days, 23:44:44 || timer: 3.396
The loss values look like they are going down, but convergence seems very slow. And I got a poor evaluation result:
     |  all |  .50 |  .55 |  .60 |  .65 |  .70 |  .75 |  .80 |  .85 |  .90 |  .95 |
box  | 0.78 | 2.17 | 1.76 | 1.39 | 1.08 | 0.75 | 0.43 | 0.19 | 0.07 | 0.01 | 0.00 |
box  | 1.31 | 3.44 | 2.92 | 2.37 | 1.78 | 1.31 | 0.78 | 0.33 | 0.14 | 0.04 | 0.00 |
box  | 1.47 | 3.88 | 3.23 | 2.54 | 2.05 | 1.47 | 0.82 | 0.44 | 0.21 | 0.04 | 0.00 |
box  | 2.17 | 5.26 | 4.63 | 3.84 | 3.12 | 2.26 | 1.42 | 0.77 | 0.27 | 0.11 | 0.00 |
box  | 2.41 | 5.79 | 5.10 | 4.26 | 3.46 | 2.54 | 1.74 | 0.82 | 0.34 | 0.06 | 0.00 |
box  | 2.65 | 6.06 | 5.34 | 4.67 | 3.76 | 2.90 | 1.96 | 1.05 | 0.51 | 0.20 | 0.03 |
box  | 2.82 | 6.62 | 5.94 | 4.82 | 4.05 | 2.94 | 2.04 | 1.17 | 0.53 | 0.12 | 0.00 |
mask | 0.80 | 1.91 | 1.61 | 1.35 | 1.09 | 0.83 | 0.58 | 0.38 | 0.21 | 0.07 | 0.00 |
mask | 1.44 | 3.13 | 2.71 | 2.28 | 1.95 | 1.62 | 1.20 | 0.81 | 0.55 | 0.12 | 0.01 |
mask | 1.73 | 3.73 | 3.28 | 2.81 | 2.35 | 1.88 | 1.43 | 0.96 | 0.57 | 0.26 | 0.03 |
mask | 2.38 | 4.86 | 4.39 | 3.80 | 3.27 | 2.65 | 2.09 | 1.49 | 0.95 | 0.29 | 0.03 |
mask | 2.76 | 5.46 | 5.04 | 4.48 | 3.84 | 3.15 | 2.43 | 1.70 | 0.96 | 0.44 | 0.06 |
mask | 3.01 | 5.81 | 5.34 | 4.82 | 4.16 | 3.53 | 2.72 | 1.97 | 1.10 | 0.56 | 0.07 |
mask | 3.31 | 6.41 | 5.85 | 5.27 | 4.56 | 3.94 | 2.99 | 2.13 | 1.31 | 0.58 | 0.06 |
Hope to discuss more with you ^_^
Hmm, with the mAP starting that low (< 1), it looks like the behavior we get when we train from scratch. Are you using ImageNet pretrained weights in your MobileNet implementation? If not, it'll take way longer to converge.
There's also a gotcha when loading pretrained weights: I had to set strict=False in load_state_dict when loading backbone weights. Set that back to True if you are using pretrained weights but suspect they're not loading properly.
THX~! I used the implementation and the pretrained weights from this repo: https://github.com/tonylins/pytorch-mobilenet-v2
I did the experiment following these steps:

- Modify the network architecture by removing the final global average pooling layer and the final 1x1 conv layer;
- Write a custom init_backbone() method in which I translate the checkpoint key names and discard the unneeded tensors;
- Add a new config dict to config.py and set some helpful parameters.
Here is my MobileNet-v2 implementation:
# conv_bn and InvertedResidual are taken from the pytorch-mobilenet-v2 repo linked above.
import math

import torch
import torch.nn as nn

class MobileNetV2(nn.Module):
    def __init__(self):
        super(MobileNetV2, self).__init__()
        block = InvertedResidual
        input_channel = 32
        interverted_residual_setting = [
            # t, c, n, s
            [1, 16, 1, 1],
            [6, 24, 2, 2],   # useful
            [6, 32, 3, 2],   # useful
            [6, 64, 4, 2],
            [6, 96, 3, 1],   # useful
            [6, 160, 3, 2],
            [6, 320, 1, 1],  # useful
        ]

        # Select the output layers
        # idx-1 => 56 * 56 * 24
        # idx-2 => 28 * 28 * 32
        # idx-4 => 14 * 14 * 96
        # idx-6 => 7 * 7 * 320
        select_idxs = [1, 2, 4, 6]

        # Build the first layer
        input_channel = int(input_channel)
        self.conv2d = conv_bn(3, input_channel, 2)

        self.layers = nn.ModuleList()
        self.channels = []

        # Build the inverted residual blocks, grouping them so that each entry of
        # self.layers ends at one of the selected output layers above
        features = []
        for idx, (t, c, n, s) in enumerate(interverted_residual_setting):
            output_channel = int(c)
            for i in range(n):
                if i == 0:
                    features.append(block(input_channel, output_channel, s, expand_ratio=t))
                else:
                    features.append(block(input_channel, output_channel, 1, expand_ratio=t))
                input_channel = output_channel
            if idx in select_idxs:
                self.layers.append(nn.Sequential(*features))
                self.channels.append(output_channel)
                features = []

        self.backbone_modules = [m for m in self.modules() if isinstance(m, nn.Conv2d)]

        # Random initialization
        self._initialize_weights()

    def forward(self, x):
        x = self.conv2d(x)
        outs = []
        for layer in self.layers:
            x = layer(x)
            outs.append(x)
        return tuple(outs)

    def _initialize_weights(self):
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                n = m.kernel_size[0] * m.kernel_size[1] * m.out_channels
                m.weight.data.normal_(0, math.sqrt(2. / n))
                if m.bias is not None:
                    m.bias.data.zero_()
            elif isinstance(m, nn.BatchNorm2d):
                m.weight.data.fill_(1)
                m.bias.data.zero_()

    def init_backbone(self, path):
        """ Initializes the backbone weights for training. """
        state_dict = torch.load(path, map_location='cpu')

        # Rename the checkpoint keys (features.X.* -> conv2d.* / layers.x.y.*)
        # and drop the final 1x1 conv and classifier weights
        keys = list(state_dict)
        for key in keys:
            if key.startswith('features'):
                idx = int(key.split('.')[1])
                if idx <= 17:
                    if idx == 0:
                        new_key = 'conv2d.' + key[11:]
                    elif 1 <= idx <= 3:
                        new_key = 'layers.0.' + str(idx - 1) + key[10:]
                    elif 4 <= idx <= 6:
                        new_key = 'layers.1.' + str(idx - 4) + key[10:]
                    elif 7 <= idx <= 13:
                        if idx <= 9:
                            new_key = 'layers.2.' + str(idx - 7) + key[10:]
                        else:
                            new_key = 'layers.2.' + str(idx - 7) + key[11:]
                    else:
                        new_key = 'layers.3.' + str(idx - 14) + key[11:]
                    state_dict[new_key] = state_dict.pop(key)
                else:
                    state_dict.pop(key)
            else:
                state_dict.pop(key)

        # Note: Using strict=False is berry scary. Triple check this.
        self.load_state_dict(state_dict, strict=False)
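As a quick shape check of the wrapper above (a minimal sketch; it assumes conv_bn and InvertedResidual have been copied in from the pytorch-mobilenet-v2 repo, and the 550x550 input just mirrors YOLACT's default image size):

    # Run a dummy 550x550 image through the backbone and confirm it emits
    # four feature maps with channels [24, 32, 96, 320] (see select_idxs above).
    import torch

    backbone = MobileNetV2()
    with torch.no_grad():
        feats = backbone(torch.zeros(1, 3, 550, 550))
    for feat, ch in zip(feats, backbone.channels):
        print(tuple(feat.shape), ch)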
And here is the corresponding config:
mobilenetv2_backbone = resnet50_backbone.copy({
    'name': 'MobileNetV2',
    'path': 'xxx.pth',
    'type': MobileNetV2,
    'args': (),
    # 'transform': resnet_transform,

    'selected_layers': [1, 2, 3],
    'pred_scales': [[24], [48], [96], [192], [384]],
    'pred_aspect_ratios': [[[1.414]]] * 5,
    'use_pixel_scales': False,
})

yolact_mobilenetv2_coco_config = yolact_resnet50_config.copy({
    'name': 'yolact_mobilenetv2_coco',

    'backbone': mobilenetv2_backbone.copy(),
    'use_semantic_segmentation_loss': True
})
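(Assuming these dicts live in config.py next to the existing YOLACT configs, training with this backbone should then just be a matter of pointing the usual flag at it, e.g. python train.py --config=yolact_mobilenetv2_coco_config.)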
Haha, I found the problem... just after I commented. When I was testing on my own dataset, I removed some of the anchors on purpose... and I forgot to restore them.
Hmm, can you still do a sanity check and change self.load_state_dict(state_dict, strict=False) to self.load_state_dict(state_dict, strict=True) and see if anything unexpected is missing? I think even with one anchor per layer, you shouldn't get < 1 mAP on the first validation epoch.
Thx, I'm just going to do it!
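(A low-friction way to run that sanity check without flipping to strict=True: nn.Module.load_state_dict returns the missing and unexpected keys, so the final load in init_backbone() above can temporarily be swapped for the lines below, a minimal sketch of that change.)

    # Inside init_backbone(), replace the final load with this to see what
    # non-strict loading skipped (nothing else needs to change):
    result = self.load_state_dict(state_dict, strict=False)
    print('missing keys:   ', result.missing_keys)     # params left randomly initialized
    print('unexpected keys:', result.unexpected_keys)  # checkpoint entries that were ignored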
So, what's the accuracy and FPS with MobileNet-v2 now? @Wilbur529
@andeyeluguo It has run for almost 200 epochs now, and the mAP is 14.36 (box) and 15.02 (mask). But the model has not fully converged yet, so keep waiting...
@Wilbur529 Wonderful work on the backbone modification. Keep going!!!
@Wilbur529 Hi, is it possible to share your MobileNet implementation .py? How good are the accuracy and FPS compared to ResNet-50/101?
@abhigoku10 You can find the model and the accuracy in my previous answers, and I will do a benchmark test later. As far as I can tell from the training log, I think it will be very fast.
@Wilbur529 I was able to train the model using the MobileNet architecture, thanks for sharing the reference implementation. When I test it on images it detects with low accuracy, but when I run it on a video no detections happen at all. Did you face a similar issue? Should we make any changes to the code?
@abhigoku10 Good job! I think it may not be a difficult problem. If you can't find the offending code, why not write a for-loop around your image inference function to simulate video inference :)
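(A minimal sketch of that idea, with hypothetical names: run_single_image() stands in for whatever per-image inference call already works on still images, and 'test.mp4' is a placeholder path.)

    # Feed video frames one by one through the single-image path to rule out
    # problems in the threaded / multi-frame video pipeline.
    import cv2

    vid = cv2.VideoCapture('test.mp4')   # placeholder video path
    while True:
        ok, frame = vid.read()
        if not ok:
            break
        preds = run_single_image(frame)  # hypothetical: your working image inference function
    vid.release()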
@Wilbur529 Yup, that was my next step, but the video execution uses threading and a multi-frame buffer, so I'm just trying to check the feasibility of that. But did you face the same issues with video?
@abhigoku10 Sorry, I did not face this problem. Based on your description, I think the model itself is fine. So pay more attention to the evalvideo method and check the postprocess procedure. Good luck~
@Wilbur529 Sure, I shall do it. For how many iterations did you train your MobileNet model? I trained for 200k (2 lakh) iterations and the results on the validation set are not that great, though the FPS is high.
@abhigoku10 I ran it on only a single GPU, so it needed a long time to train. At 200 epochs, the mAP is 14.36 (box) and 15.02 (mask).
@Wilbur529 Thanks for the info, I shall run it to 200 epochs and share my mAP. Any other architecture you tried besides MobileNet?
@abhigoku10 EfficientNet and MobileNet-v3 are coming~~~
@Wilbur529 Cool, any plans on sharing the reference code?
I am reading both of their papers, and I haven't found a suitable PyTorch implementation of them~ Once I succeed in my experiments, I will share the modified parts.
@Wilbur529 Thanks for the update, eagerly waiting for this.
Looking forward to your update~ There is a reference that may be helpful.
@Wilbur529 Can you share the reference code for the EfficientNet backbone so that I can test it on different data? What FPS and accuracy did you achieve?
@abhigoku10 @dbolya EfficientNet is really efficient!!! I used the implementation and pretrained weights from EfficientNet-PyTorch. To validate its viability, I did two experiments:
- EfficientNet-B0 has a similar Top-5 accuracy (93.2%) to ResNet-50 (93.0%) on ImageNet. When I changed the backbone of YOLACT to EfficientNet-B0, the convergence rate and mAP were almost the same between the two, but the total model sizes are 47 MB for YOLACT550-EfficientNet-B0 vs. 120 MB for YOLACT550-ResNet-50.
- EfficientNet-B4 has a similar Top-5 accuracy (96.3%) to SENet (96.2%) on ImageNet, and its parameter count (19M) is smaller than ResNet-50's (26M). When I changed the backbone of YOLACT to EfficientNet-B4, I found it may achieve better performance than the previous networks. After 55 epochs of training it has reached 30.95 mAP on COCO test-dev, and it does not seem to have fully converged yet. It is worth noting that the YOLACT550-EfficientNet-B4 model size is only 101 MB.
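(For anyone wanting to reproduce this, the config wiring would presumably follow the same pattern as the MobileNet-v2 dicts above. The sketch below is hypothetical: EfficientNetB0Backbone stands for whatever wrapper class you write around EfficientNet-PyTorch, and the weight path is a placeholder.)

    # Hypothetical config sketch, mirroring mobilenetv2_backbone above.
    efficientnet_b0_backbone = mobilenetv2_backbone.copy({
        'name': 'EfficientNetB0',
        'path': 'efficientnet-b0.pth',    # placeholder pretrained-weight path
        'type': EfficientNetB0Backbone,   # hypothetical wrapper, analogous to MobileNetV2 above
    })

    yolact_efficientnet_b0_coco_config = yolact_resnet50_config.copy({
        'name': 'yolact_efficientnet_b0_coco',
        'backbone': efficientnet_b0_backbone.copy(),
    })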
@Wilbur529 Wonderful experimentation and analysis! Would you be able to share the YOLACT-EfficientNet code? I would like to perform some more experiments, since the EfficientNet-B0 model size is an interesting one. Thanks in advance.