
Add Anti Aliasing for `InvertedResidual` Blocks, `MobileNetV3`

Open rsomani95 opened this issue 4 years ago • 14 comments

Addresses #599

Continuing the conversation from the issue here.

I've added BlurPool2d as the default aa_layer. Should I change this to AntiAliasDownsampleLayer? I saw your comment here about wanting to merge the two into the latter.


The default configs for the new model have no URL, but I checked whether we could load the pretrained weights from other models, and that's possible with the following snippet:

from timm.models.mobilenetv3 import default_cfgs, mobilenetv3_large_100_aa

# reuse the pretrained cfg (incl. weight URL) of an existing arch for the new aa variant
pt_arch = 'mobilenetv3_large_100_miil'
default_cfgs['mobilenetv3_large_100_aa'] = default_cfgs[pt_arch]

model = mobilenetv3_large_100_aa(pretrained=True)

As you mentioned, this could come in handy for fine-tuning purposes.


BUT I'd want to see at least one model trained from scratch to better than original accuracy to merge

Speaking from personal experience, the consistency gained when predicting on consecutive video frames is a huge bonus, and worth the runtime penalty (even if there is a marginal accuracy drop). Given that these are separate archs, would you still not want to merge it?


I'll run some experiments on my machine, but I need a couple of days to figure out downloading ImageNet. This will be my first time training a model on ImageNet from scratch. Would you recommend the same hparams as in the docs, i.e. the following?

./distributed_train.sh 3 /imagenet/ --model mobilenetv3_large_100 -b 512 --sched step --epochs 600 --decay-epochs 2.4 --decay-rate .973 --opt rmsproptf --opt-eps .001 -j 7 --warmup-lr 1e-6 --weight-decay 1e-5 --drop 0.2 --drop-connect 0.2 --model-ema --model-ema-decay 0.9999 --aa rand-m9-mstd0.5 --remode pixel --reprob 0.2 --amp --lr .064 --lr-noise 0.42 0.9

For reference, I'll be training on 3x RTX 3090s using the NGC 20.12 container.

Thanks!

rsomani95 · May 01 '21

@rwightman After the brief hiccup in #606, I launched a run with the same hparams mentioned in the docs. It's quite interesting to see how the EMA accuracy starts beating the unaveraged weights' accuracy after roughly epoch 75.

I'm at epoch 96; the top-1 EMA accuracy has been climbing steadily and is already at 71.93%, while the non-EMA accuracy has been hovering around ~68% for a while now.

Is this in line with your experience training such models? I'm trying to understand what a reasonable accuracy vs. epoch curve looks like, as this is set to run for 600 epochs.

rsomani95 · May 03 '21

@rsomani95 Yes, that's fairly normal. EMA will race ahead for the middle part of training: large gains in the early part, then it's painfully slow and sometimes goes down for quite a while before a final uptick. The EMA and non-EMA results will converge later in training, with EMA usually keeping a lead.
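For context, the model EMA being discussed is just an exponential moving average of the weights. A minimal sketch of one update step, assuming the --model-ema-decay 0.9999 from the hparams above:

import torch

def ema_update(ema_params, model_params, decay=0.9999):
    # Each step moves the averaged weights only a tiny fraction (1 - decay)
    # toward the current weights, which is why the EMA lags behind early on
    # and smooths out the noisy raw weights later in training.
    with torch.no_grad():
        for e, p in zip(ema_params, model_params):
            e.mul_(decay).add_(p, alpha=1.0 - decay)

timm's ModelEma utilities apply this over the full model state dict, but the principle is the same.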

rwightman · May 03 '21

@rsomani95 There's a reasonable chance you'll see the peak result in the 400-500 epoch timeframe; it can be difficult to judge 'when', though.

rwightman · May 03 '21

Got it. That's super interesting.

rsomani95 · May 04 '21

@rwightman It converged to 75.92 top-1 at epoch 515 (EMA). I've uploaded the weights to Dropbox here.

rsomani95 · May 06 '21

@rwightman I've included a URL so that the model weights are accessible with pretrained=True. As a sanity check, I ran the same model through the validate.py script and got Acc@1 75.924 (24.076) Acc@5 92.516.
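For reference, the check was along these lines (the dataset path is a placeholder and the exact flags may have differed slightly):

python validate.py /imagenet/validation --model mobilenetv3_large_100_aa --pretrained -b 256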

Please LMK if there's anything I'm missing. Thanks!

rsomani95 · May 10 '21

@rsomani95 Thanks for the update, and sorry for the lag; I've been trying to focus on getting a few other things polished off, so I haven't had a chance to try this yet. Looks like a decent result for a mobilenetv3. You left the stem stride=2 un-blurred... was that intentional? A lot of the anti-aliased nets also cover the stem (or just focus on the stem). Might be worth trying a stem + block aa variant as a comparison (with exactly the same hparams) before finalizing....

rwightman · May 11 '21

@rwightman no worries

You left the stem stride=2 un-blurred... was that intentional?

That was an oversight...

I have another training run for a separate task going on right now, but I should be able to launch the blurred stem + blocks variant on Friday and have results by ~18 May. I've pushed a commit to define this variant; the function names feel a bit awkward, so I'm happy to change them before finalizing.

rsomani95 · May 11 '21

@rsomani95 I likely caused some merge conflicts w/ refactoring related to adding the official efficientnetv2 impl. I can do the fixup/cleanup when it's ready to merge after your next runs.

rwightman · May 19 '21

Noted @rwightman. I haven't been able to launch the next run yet as the system's been occupied with some client project tasks. Given the run takes 4 days, I hesitate to "fire and forget". Is it alright if it takes me another 2-2.5 weeks or so to launch the next run?

rsomani95 · May 20 '21

@rsomani95 no worries re timing, my stack of todos is ever increasing as well :)

rwightman · May 20 '21

@rwightman I'm unsure what the right approach to downsampling the stem is. If I understand correctly, the principle when applying blur pooling to a network is to preserve the shapes of the output feature maps: typically, every conv2d with stride=2 is replaced by the same conv at stride=1 followed by blur pooling at stride=2. While this is done in all the Bottleneck / Residual blocks, the stem seems to be handled differently.
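To make that concrete, here's a minimal sketch of the substitution, assuming timm's BlurPool2d with its (channels, filt_size, stride) signature:

import torch.nn as nn
from timm.models.layers import BlurPool2d

in_chs, out_chs = 16, 32

# original downsampling conv: stride=2 halves the spatial resolution
conv_strided = nn.Conv2d(in_chs, out_chs, kernel_size=3, stride=2, padding=1, bias=False)

# anti-aliased equivalent: the same conv at stride=1, then a blur + stride-2 subsample,
# so the output shape (out_chs, H/2, W/2) is unchanged
conv_aa = nn.Sequential(
    nn.Conv2d(in_chs, out_chs, kernel_size=3, stride=1, padding=1, bias=False),
    BlurPool2d(out_chs, filt_size=3, stride=2),
)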

In Adobe's original antialiased-cnns implementation of mobilenet-v2, they do not apply blur pooling to the stem (which does have stride=2):

from antialiased_cnns import mobilenet_v2 as mobilenet_v2_aa
mobilenet_v2_aa()

MobileNetV2(
  (features): Sequential(
    (0): ConvBNReLU(
      (0): Conv2d(3, 32, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
      (1): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (2): ReLU6(inplace=True)
    )
    ...

Presumably this is to prevent a massive slowdown, since with the usual substitution the stem conv would run at stride=1 over the full-resolution input, producing a 4x larger feature map before the blur downsamples it?

In the ResNet examples in this repo too, no blur pooling is applied after the first conv layer with stride=2:

import timm
timm.create_model('resnetblur50')

ResNet(
  (conv1): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
  (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (act1): ReLU(inplace=True)
  (maxpool): Sequential(
    (0): MaxPool2d(kernel_size=3, stride=1, padding=1, dilation=1, ceil_mode=False)
    (1): BlurPool2d(
      (padding): ReflectionPad2d([1, 1, 1, 1])
    )
  )
  ...

Given the above, should the stem for mobilenet-v3 be left untouched?

PS - I'd pre-emptively added an aa layer after the stem, which is reflected in the last 1-2 commits; that implementation is incorrect and I'll correct it based on this conversation.

rsomani95 · Dec 11 '21

@rsomani95 For the other net archs, the first stride-2 conv is not anti-aliased either, i.e. for ResNet the first stride=2 conv is left alone and only the maxpool, which is the second reduction in the net, is dealt with. As you say, probably because the accuracy gain vs. speed tradeoff isn't there for the 1st reduction.

rwightman · Dec 13 '21

@rwightman Got it. In that case, perhaps it makes sense not to add anti-aliasing to the mobilenet-v3 stem?

timm.create_model('mobilenetv3_large_100_aa')

MobileNetV3(
  (conv_stem): Conv2d(3, 16, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
  (bn1): BatchNorm2d(16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (act1): HardSwishMe()

  # Add AA to following Conv layers with stride=2
  (blocks): Sequential(
      ...

rsomani95 · Dec 13 '21