
What will happen if I change the backbone network into MobileNet-v2 with FPN?

Open Wilbur529 opened this issue 6 years ago • 82 comments

Hi, I want to try changing the backbone network to MobileNet-v2 with FPN. Are there any suggestions? Thanks!!!

Wilbur529 avatar May 05 '19 09:05 Wilbur529

That would be a good test, though I will warn that we've tried VGG16+FPN before and got sub-par performance (though maybe that's to be expected). If you want to swap out the backbone, head over to backbone.py and check out how I implemented ResNet and VGG. If you need any help, let me know.
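Roughly, the contract a backbone class needs to satisfy looks something like this (a minimal sketch, not code from the repo):

import torch
import torch.nn as nn

class CustomBackbone(nn.Module):
    """ Skeleton of the backbone interface YOLACT expects (a sketch). """
    def __init__(self):
        super().__init__()
        self.layers = nn.ModuleList()   # one nn.Sequential per output stage
        self.channels = []              # output channels per stage, read by the FPN
        # ... build self.layers and fill self.channels here ...
        self.backbone_modules = [m for m in self.modules() if isinstance(m, nn.Conv2d)]

    def forward(self, x):
        # Return one feature map per stage; the config's selected_layers indexes this tuple.
        outs = []
        for layer in self.layers:
            x = layer(x)
            outs.append(x)
        return tuple(outs)

    def init_backbone(self, path):
        # Load ImageNet-pretrained weights, remapping checkpoint keys if needed.
        state_dict = torch.load(path, map_location='cpu')
        self.load_state_dict(state_dict, strict=False)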

dbolya avatar May 06 '19 22:05 dbolya

@dbolya Hi, I have done a preliminary test on my own simple dataset, and the results made me excited~! With MobileNet-v2, I successfully reduced the model size from over 100 MB to 31 MB without a significant performance loss on my dataset. Next, I will try it on the MS-COCO dataset, and I hope for equally good performance ^_^

Wilbur529 avatar May 07 '19 02:05 Wilbur529

Sounds good! If it ends up working well, I might just make an official MobileNet-v2 version for YOLACTv1.1. That seems like it would be nice to have.

dbolya avatar May 07 '19 03:05 dbolya

@dbolya Hi, could you show me your training log for ResNet on COCO?

Wilbur529 avatar May 09 '19 02:05 Wilbur529

This is my recent training log with MobileNet-v2 on COCO...

[ 16] 59810 || B: 2.489 | C: 3.295 | M: 2.861 | S: 0.897 | T: 9.542 || ETA: 19 days, 23:21:26 || timer: 2.437

[ 16] 59820 || B: 2.505 | C: 3.321 | M: 2.840 | S: 0.898 | T: 9.565 || ETA: 19 days, 22:54:13 || timer: 2.125
[ 16] 59830 || B: 2.500 | C: 3.299 | M: 2.802 | S: 0.910 | T: 9.511 || ETA: 19 days, 22:34:59 || timer: 1.966
[ 16] 59840 || B: 2.504 | C: 3.271 | M: 2.787 | S: 0.900 | T: 9.461 || ETA: 19 days, 22:51:36 || timer: 2.071
[ 16] 59850 || B: 2.480 | C: 3.282 | M: 2.751 | S: 0.881 | T: 9.394 || ETA: 19 days, 22:43:14 || timer: 2.415
[ 16] 59860 || B: 2.462 | C: 3.247 | M: 2.752 | S: 0.883 | T: 9.344 || ETA: 19 days, 22:27:50 || timer: 2.144
[ 16] 59870 || B: 2.460 | C: 3.247 | M: 2.726 | S: 0.881 | T: 9.315 || ETA: 19 days, 22:22:21 || timer: 2.091
[ 16] 59880 || B: 2.489 | C: 3.290 | M: 2.696 | S: 0.891 | T: 9.366 || ETA: 19 days, 22:38:05 || timer: 2.101
[ 16] 59890 || B: 2.478 | C: 3.300 | M: 2.712 | S: 0.892 | T: 9.382 || ETA: 19 days, 22:48:17 || timer: 2.301
[ 16] 59900 || B: 2.485 | C: 3.299 | M: 2.711 | S: 0.889 | T: 9.384 || ETA: 19 days, 22:53:21 || timer: 2.046
[ 16] 59910 || B: 2.514 | C: 3.316 | M: 2.724 | S: 0.880 | T: 9.435 || ETA: 19 days, 23:27:14 || timer: 1.973
[ 16] 59920 || B: 2.505 | C: 3.308 | M: 2.721 | S: 0.877 | T: 9.410 || ETA: 19 days, 23:25:19 || timer: 2.257
[ 16] 59930 || B: 2.496 | C: 3.290 | M: 2.710 | S: 0.863 | T: 9.360 || ETA: 19 days, 23:15:29 || timer: 2.061
[ 16] 59940 || B: 2.467 | C: 3.269 | M: 2.694 | S: 0.864 | T: 9.293 || ETA: 19 days, 22:23:27 || timer: 1.924
[ 16] 59950 || B: 2.473 | C: 3.249 | M: 2.702 | S: 0.868 | T: 9.292 || ETA: 19 days, 22:55:27 || timer: 2.374
[ 16] 59960 || B: 2.497 | C: 3.261 | M: 2.702 | S: 0.858 | T: 9.318 || ETA: 19 days, 23:18:36 || timer: 2.055
[ 16] 59970 || B: 2.532 | C: 3.274 | M: 2.726 | S: 0.862 | T: 9.393 || ETA: 19 days, 23:51:10 || timer: 2.111
[ 16] 59980 || B: 2.493 | C: 3.262 | M: 2.723 | S: 0.861 | T: 9.339 || ETA: 19 days, 23:32:02 || timer: 1.991
[ 16] 59990 || B: 2.500 | C: 3.253 | M: 2.718 | S: 0.860 | T: 9.331 || ETA: 19 days, 23:44:44 || timer: 3.396

The loss values look like they are going down, but convergence seems slow. And I got a poor evaluation result:

(columns: overall mAP, then mAP at IoU thresholds .50 through .95)

box | 0.78 | 2.17 | 1.76 | 1.39 | 1.08 | 0.75 | 0.43 | 0.19 | 0.07 | 0.01 | 0.00 |
box | 1.31 | 3.44 | 2.92 | 2.37 | 1.78 | 1.31 | 0.78 | 0.33 | 0.14 | 0.04 | 0.00 |
box | 1.47 | 3.88 | 3.23 | 2.54 | 2.05 | 1.47 | 0.82 | 0.44 | 0.21 | 0.04 | 0.00 |
box | 2.17 | 5.26 | 4.63 | 3.84 | 3.12 | 2.26 | 1.42 | 0.77 | 0.27 | 0.11 | 0.00 |
box | 2.41 | 5.79 | 5.10 | 4.26 | 3.46 | 2.54 | 1.74 | 0.82 | 0.34 | 0.06 | 0.00 |
box | 2.65 | 6.06 | 5.34 | 4.67 | 3.76 | 2.90 | 1.96 | 1.05 | 0.51 | 0.20 | 0.03 |
box | 2.82 | 6.62 | 5.94 | 4.82 | 4.05 | 2.94 | 2.04 | 1.17 | 0.53 | 0.12 | 0.00 |

mask | 0.80 | 1.91 | 1.61 | 1.35 | 1.09 | 0.83 | 0.58 | 0.38 | 0.21 | 0.07 | 0.00 |
mask | 1.44 | 3.13 | 2.71 | 2.28 | 1.95 | 1.62 | 1.20 | 0.81 | 0.55 | 0.12 | 0.01 |
mask | 1.73 | 3.73 | 3.28 | 2.81 | 2.35 | 1.88 | 1.43 | 0.96 | 0.57 | 0.26 | 0.03 |
mask | 2.38 | 4.86 | 4.39 | 3.80 | 3.27 | 2.65 | 2.09 | 1.49 | 0.95 | 0.29 | 0.03 |
mask | 2.76 | 5.46 | 5.04 | 4.48 | 3.84 | 3.15 | 2.43 | 1.70 | 0.96 | 0.44 | 0.06 |
mask | 3.01 | 5.81 | 5.34 | 4.82 | 4.16 | 3.53 | 2.72 | 1.97 | 1.10 | 0.56 | 0.07 |
mask | 3.31 | 6.41 | 5.85 | 5.27 | 4.56 | 3.94 | 2.99 | 2.13 | 1.31 | 0.58 | 0.06 |

Hope to discuss more with you ^_^

Wilbur529 avatar May 09 '19 02:05 Wilbur529

Hmm, with the mAP starting that low (< 1), it looks like the behavior we get when training from scratch. Are you using ImageNet-pretrained weights in your MobileNet implementation? If not, it'll take way longer to converge.

There's also a gotcha when loading pretrained weights: I had to set strict=False in load_state_dict when loading backbone weights. Set that back to True if you are using pretrained weights but suspect they're not loading properly.
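If you keep strict=False, something like this sketch can surface what silently failed to load (assuming model is your backbone and path its checkpoint; recent PyTorch versions return the incompatible keys from load_state_dict):

import torch

state_dict = torch.load(path, map_location='cpu')
result = model.load_state_dict(state_dict, strict=False)
print('missing keys:   ', result.missing_keys)      # in the model, absent from the checkpoint
print('unexpected keys:', result.unexpected_keys)   # in the checkpoint, absent from the model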

dbolya avatar May 09 '19 07:05 dbolya

Thanks~! I used the implementation and the pretrained weights from this repo: https://github.com/tonylins/pytorch-mobilenet-v2

I did the experiment following the steps below:

  1. Modify the network architecture by removing the final global average pooling layer and the final 1x1 conv layer;
  2. Write a custom init_backbone() method that translates the checkpoint key names and discards the unused tensors;
  3. Add a new config dict to config.py and set the relevant parameters.

Here is my MobileNet-v2 implementation:

import math
import torch
import torch.nn as nn

# InvertedResidual and conv_bn are taken from the referenced repo:
# https://github.com/tonylins/pytorch-mobilenet-v2

class MobileNetV2(nn.Module):
    def __init__(self):
        super(MobileNetV2, self).__init__()
        block = InvertedResidual
        input_channel = 32
        inverted_residual_setting = [
            # t, c, n, s
            [1, 16, 1, 1],
            [6, 24, 2, 2], # useful
            [6, 32, 3, 2], # useful
            [6, 64, 4, 2],
            [6, 96, 3, 1], # useful
            [6, 160, 3, 2],
            [6, 320, 1, 1], # useful
        ]

        # Select the output layers
        # idx-1 => 56 * 56 * 24
        # idx-2 => 28 * 28 * 32
        # idx-4 => 14 * 14 * 96
        # idx-6 => 7 * 7 * 320
        select_idxs = [1, 2, 4, 6]


        # building first layer
        input_channel = int(input_channel)
        self.conv2d = conv_bn(3, input_channel, 2)
        self.layers = nn.ModuleList()
        self.channels = []

        # building inverted residual blocks
        features = []
        for idx, (t, c, n, s) in enumerate(inverted_residual_setting):
            output_channel = int(c)

            for i in range(n):
                if i == 0:
                    features.append(block(input_channel, output_channel, s, expand_ratio=t))
                else:
                    features.append(block(input_channel, output_channel, 1, expand_ratio=t))
                input_channel = output_channel

            if idx in select_idxs:
                self.layers.append(nn.Sequential(*features))
                self.channels.append(output_channel)
                features = []

        self.backbone_modules = [m for m in self.modules() if isinstance(m, nn.Conv2d)]

        # random initialization; overwritten by init_backbone() when pretrained weights are loaded
        self._initialize_weights()

    def forward(self, x):
        x = self.conv2d(x)

        outs = []
        for layer in self.layers:
            x = layer(x)
            outs.append(x)

        return tuple(outs)

    def _initialize_weights(self):
        modules = self.modules()
        for m in modules:
            if isinstance(m, nn.Conv2d):
                n = m.kernel_size[0] * m.kernel_size[1] * m.out_channels
                m.weight.data.normal_(0, math.sqrt(2. / n))
                if m.bias is not None:
                    m.bias.data.zero_()
            elif isinstance(m, nn.BatchNorm2d):
                m.weight.data.fill_(1)
                m.bias.data.zero_()

    def init_backbone(self, path):
        """ Initializes the backbone weights for training. """
        state_dict = torch.load(path, map_location='cpu')

        # Remap checkpoint keys: features.N.* -> layers.L.B.* (the stem becomes conv2d.*)
        keys = list(state_dict)
        for key in keys:
            if key.startswith('features'):
                idx = int(key.split('.')[1])
                if idx <= 17:
                    if idx == 0:
                        new_key = 'conv2d.' + key[11:]
                    elif (idx >= 1 and idx <= 3):
                        new_key = "layers.0." + str(idx - 1) + key[10:]
                    elif (idx >= 4 and idx <=6):
                        new_key = "layers.1." + str(idx - 4) + key[10:]
                    elif (idx >= 7 and idx <= 13):
                        if idx <= 9:
                            new_key = "layers.2." + str(idx - 7) + key[10:]
                        else:
                            new_key = "layers.2." + str(idx - 7) + key[11:]
                    else:
                        new_key = "layers.3." + str(idx - 14) + key[11:]

                    state_dict[new_key] = state_dict.pop(key)
                else:
                    state_dict.pop(key)
            else:
                state_dict.pop(key)


        # Note: Using strict=False is berry scary. Triple check this.
        self.load_state_dict(state_dict, strict=False)
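A quick way to sanity-check the remapping above (a sketch; backbone stands for a constructed MobileNetV2 instance, and state_dict is the dict after the renaming in init_backbone):

model_keys = set(backbone.state_dict().keys())
ckpt_keys = set(state_dict.keys())
print('params not covered by checkpoint:', sorted(model_keys - ckpt_keys))
print('checkpoint entries left unused:  ', sorted(ckpt_keys - model_keys))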

And here is the corresponding config:

mobilenetv2_backbone = resnet50_backbone.copy({
    'name': 'MobileNetV2',
    'path': 'xxx.pth',
    'type': MobileNetV2,
    'args': (),
    # 'transform': resnet_transform,

    'selected_layers': [1, 2, 3],
    'pred_scales': [[24], [48], [96], [192], [384]],
    'pred_aspect_ratios': [[[1.414]]] * 5,
    'use_pixel_scales': False,
})
yolact_mobilenetv2_coco_config = yolact_resnet50_config.copy({
    'name': 'yolact_mobilenetv2_coco',

    'backbone': mobilenetv2_backbone.copy(),

    'use_semantic_segmentation_loss': True
})
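With the config registered in config.py, training then launches through the repo's usual entry point:

python train.py --config=yolact_mobilenetv2_coco_config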

Wilbur529 avatar May 09 '19 09:05 Wilbur529

Haha, I found the problem just after I commented... When I tested on my own dataset, I had removed some of the anchors on purpose... and I forgot to restore them.

Wilbur529 avatar May 09 '19 09:05 Wilbur529

Hmm, can you still do a sanity check and change self.load_state_dict(state_dict, strict=False) to self.load_state_dict(state_dict, strict=True) to see if anything unexpected is missing? I think even with one anchor per layer, you shouldn't get < 1 mAP on the first validation epoch.

dbolya avatar May 09 '19 09:05 dbolya

Thanks, I'm just going to do that!

Wilbur529 avatar May 09 '19 09:05 Wilbur529

So, what's the accuracy and FPS with MobileNet-v2 now? @Wilbur529

andeyeluguo avatar Jun 03 '19 09:06 andeyeluguo

@andeyeluguo It has run for almost 200 epochs now, and the mAP is 14.36 (box) and 15.02 (mask). But the model has not fully converged yet, so keep waiting...

Wilbur529 avatar Jun 03 '19 09:06 Wilbur529

@Wilbur529 Wonderful work on the backbone modification. Keep going!!!

abhigoku10 avatar Jun 04 '19 06:06 abhigoku10

@Wilbur529 Hi, is it possible to share your MobileNet implementation .py? How good are the accuracy and FPS compared to ResNet-50/101?

abhigoku10 avatar Jun 18 '19 16:06 abhigoku10

@abhigoku10 You can find the model and the accuracy in my previous answers, and I will do a benchmark test later. Judging from the training log, I think it will be very fast.

Wilbur529 avatar Jun 19 '19 02:06 Wilbur529

@Wilbur529 I was able to train the model using the MobileNet architecture, thanks for sharing the reference implementation. When I test it on images it is able to detect, though with low accuracy, but when I run it on a video no detections happen at all. Did you face a similar issue? Should we make any changes to the code?

abhigoku10 avatar Jun 27 '19 07:06 abhigoku10

@abhigoku10 Good job! I don't think it's a difficult problem. If you can't find the offending code, why not write a for-loop around your image inference function to simulate video inference? :)
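Something like this minimal sketch, where evaluate_frame is a hypothetical stand-in for your single-image inference function and the video path is just an example:

import cv2

def evaluate_frame(frame):
    # Hypothetical stand-in: run your single-image YOLACT inference here.
    raise NotImplementedError

cap = cv2.VideoCapture('test_video.mp4')
while True:
    ok, frame = cap.read()
    if not ok:
        break
    dets = evaluate_frame(frame)   # same code path as image eval, no threading
cap.release()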

Wilbur529 avatar Jun 28 '19 01:06 Wilbur529

@Wilbur529 Yup, that was my next step, but the video execution uses threading and multi-frame buffering, so I'm just trying to check feasibility there first. But did you face the same issues with video?

abhigoku10 avatar Jun 28 '19 03:06 abhigoku10

@abhigoku10 Sorry, I did not face this problem. Based on your description, I think the model itself is fine, so pay closer attention to the evalvideo method and check the postprocessing procedure. Good luck~

Wilbur529 avatar Jun 28 '19 03:06 Wilbur529

@Wilbur529 Sure, I shall do that. How many iterations did you train your MobileNet model for? I trained for 200k iterations and the results on the validation set are not that great, though the FPS is high.

abhigoku10 avatar Jun 28 '19 15:06 abhigoku10

@abhigoku10 I ran it on only a single GPU, so it needed a long time for training. At 200 epochs, the mAP is 14.36 (box) and 15.02 (mask).

Wilbur529 avatar Jun 29 '19 03:06 Wilbur529

@Wilbur529 Thanks for the info, I shall run it to 200 epochs and share my mAP. Any other architectures you have tried besides MobileNet?

abhigoku10 avatar Jul 01 '19 03:07 abhigoku10

@abhigoku10 EfficientNet and MobileNet-v3 are coming~~~

Wilbur529 avatar Jul 01 '19 06:07 Wilbur529

@Wilbur529 Cool, any plans on sharing the reference code?

abhigoku10 avatar Jul 01 '19 07:07 abhigoku10

I am reading both of their papers, and I haven't found a suitable PyTorch implementation of either~ Once my experiments succeed, I will share the modified parts.

Wilbur529 avatar Jul 01 '19 07:07 Wilbur529

@Wilbur529 thanks for the update eagers waiting for this

abhigoku10 avatar Jul 01 '19 08:07 abhigoku10

> I am reading both of their papers, and I haven't found a suitable PyTorch implementation of either~ Once my experiments succeed, I will share the modified parts.

Looking forward to your update~ There is a reference that may be helpful.

pandamax avatar Jul 09 '19 07:07 pandamax

@Wilbur529 Can you share the reference code for the EfficientNet version so that I can test it on different data? What FPS and accuracy did you achieve?

abhigoku10 avatar Jul 27 '19 08:07 abhigoku10

@abhigoku10 @dbolya EfficientNet is really efficient!!! I used the implementation and pretrained weights from EfficientNet-PyTorch. To validate its viability, I did two experiments (a backbone sketch follows the list below):

  1. EfficientNet-B0 has a similar Top-5 accuracy (93.2%) to ResNet-50 (93.0%) on ImageNet. When I changed the backbone of YOLACT to EfficientNet-B0, the convergence rate and mAP were almost the same between the two, but the total model sizes are 47 MB (YOLACT550-EfficientNet-B0) versus 120 MB (YOLACT550-ResNet-50).

  2. EfficientNet-B4 has a similar Top-5 accuracy (96.3%) to SENet (96.2%) on ImageNet, and its parameter count (19M) is less than ResNet-50's (26M). When I changed the backbone of YOLACT to EfficientNet-B4, it looked like it could outperform the previous networks. After 55 epochs of training it has reached 30.95 mAP on COCO test-dev, and it does not seem fully converged yet. Note that the YOLACT550-EfficientNet-B4 model is only 101 MB.
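Roughly, the intermediate features can be pulled from EfficientNet-PyTorch like this (a sketch; the block indices are my assumption for B0, picking the last block at strides 8/16/32 — verify against model._blocks for other variants):

import torch
import torch.nn as nn
from efficientnet_pytorch import EfficientNet

class EfficientNetBackbone(nn.Module):
    def __init__(self, variant='efficientnet-b0', select_idxs=(4, 10, 15)):
        super().__init__()
        self.model = EfficientNet.from_pretrained(variant)  # ImageNet weights
        self._feats = []
        for i, block in enumerate(self.model._blocks):
            if i in set(select_idxs):
                # Grab this block's output on every forward pass.
                block.register_forward_hook(lambda m, inp, out: self._feats.append(out))

    def forward(self, x):
        self._feats = []
        self.model.extract_features(x)   # runs only the conv trunk
        return tuple(self._feats)        # multi-scale features for the FPN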

Wilbur529 avatar Jul 29 '19 02:07 Wilbur529

@Wilbur529 Wonderful experimentation and analysis! Would you be able to share the YOLACT-EfficientNet code? I would like to perform some more experiments, since the EfficientNet-B0 model size is an interesting one. Thanks in advance.

abhigoku10 avatar Jul 29 '19 04:07 abhigoku10