
add simplenet architecture

Open Coderx7 opened this issue 2 years ago • 28 comments

This pull request adds the SimpleNet architecture. SimpleNetV1 is a 2016 architecture composed of only the most basic operators, forming a plain CNN; it outperformed many deeper and more complex architectures such as VGGNet, ResNet, etc. on several benchmark datasets. Its results on the ImageNet dataset are shown below.

  • added simplenet.py to timm/models
  • added simplenet.md to docs/models
  • added an entry to docs/models.md

Here is some more information on how these models perform, taken from our official PyTorch repository:

| Model | #Params | ImageNet (top1 / top5) | ImageNet-Real-Labels (top1 / top5) |
|---|---|---|---|
| simplenetv1_9m_m2 (36.3 MB) | 9.5m | 74.23 / 91.748 | 81.22 / 94.756 |
| simplenetv1_5m_m2 (22 MB) | 5.7m | 72.03 / 90.324 | 79.328 / 93.714 |
| simplenetv1_small_m2_075 (12.6 MB) | 3m | 68.506 / 88.15 | 76.283 / 92.02 |
| simplenetv1_small_m2_05 (5.78 MB) | 1.5m | 61.67 / 83.488 | 69.31 / 88.195 |

SimpleNet performs very decently: it outperforms VGGNet, variants of ResNet, and MobileNets (v1-v3),
and it's pretty fast as well! And it's all a plain old CNN!

Here's an example benchmark run on the small variants of SimpleNet and some other well-known architectures such as the MobileNets.
The small variants of SimpleNet consistently achieve a good performance/accuracy balance:

| model | samples_per_sec | param_count | top1 | top5 |
|---|---|---|---|---|
| simplenetv1_small_m1_05 | 3100.26 | 1.51 | 61.122 | 82.988 |
| mobilenetv3_small_050 | 3082.85 | 1.59 | 57.89 | 80.194 |
| lcnet_050 | 2713.02 | 1.88 | 63.1 | 84.382 |
| simplenetv1_small_m2_05 | 2536.16 | 1.51 | 61.67 | 83.488 |
| mobilenetv3_small_075 | 1793.42 | 2.04 | 65.242 | 85.438 |
| tf_mobilenetv3_small_075 | 1689.53 | 2.04 | 65.714 | 86.134 |
| simplenetv1_small_m1_075 | 1626.87 | 3.29 | 67.784 | 87.718 |
| tf_mobilenetv3_small_minimal_100 | 1316.91 | 2.04 | 62.908 | 84.234 |
| simplenetv1_small_m2_075 | 1313.6 | 3.29 | 68.506 | 88.15 |
| mobilenetv3_small_100 | 1261.09 | 2.54 | 67.656 | 87.634 |
| tf_mobilenetv3_small_100 | 1213.03 | 2.54 | 67.924 | 87.664 |
| mnasnet_small | 1089.33 | 2.03 | 66.206 | 86.508 |
| mobilenetv2_050 | 857.66 | 1.97 | 65.942 | 86.082 |
| dla46_c | 537.08 | 1.3 | 64.866 | 86.294 |
| dla46x_c | 323.03 | 1.07 | 65.97 | 86.98 |
| dla60x_c | 301.71 | 1.32 | 67.892 | 88.426 |

And here is a sample for larger models:

| model | samples_per_sec | param_count | top1 | top5 |
|---|---|---|---|---|
| simplenetv1_small_m1_075 | 2893.91 | 3.29 | 67.784 | 87.718 |
| simplenetv1_small_m2_075 | 2478.41 | 3.29 | 68.506 | 88.15 |
| vit_tiny_r_s16_p8_224 | 2337.23 | 6.34 | 71.792 | 90.822 |
| simplenetv1_5m_m1 | 2105.06 | 5.75 | 71.548 | 89.94 |
| simplenetv1_5m_m2 | 1754.25 | 5.75 | 72.03 | 90.324 |
| resnet18 | 1750.38 | 11.69 | 69.744 | 89.082 |
| regnetx_006 | 1620.25 | 6.2 | 73.86 | 91.672 |
| mobilenetv3_large_100 | 1491.86 | 5.48 | 75.766 | 92.544 |
| tf_mobilenetv3_large_minimal_100 | 1476.29 | 3.92 | 72.25 | 90.63 |
| tf_mobilenetv3_large_075 | 1474.77 | 3.99 | 73.436 | 91.344 |
| ghostnet_100 | 1390.19 | 5.18 | 73.974 | 91.46 |
| tinynet_b | 1345.82 | 3.73 | 74.976 | 92.184 |
| tf_mobilenetv3_large_100 | 1325.06 | 5.48 | 75.518 | 92.604 |
| mnasnet_100 | 1183.69 | 4.38 | 74.658 | 92.112 |
| mobilenetv2_100 | 1101.58 | 3.5 | 72.97 | 91.02 |
| simplenetv1_9m_m1 | 1048.91 | 9.51 | 73.792 | 91.486 |
| resnet34 | 1030.4 | 21.8 | 75.114 | 92.284 |
| deit_tiny_patch16_224 | 990.85 | 5.72 | 72.172 | 91.114 |
| efficientnet_lite0 | 977.76 | 4.65 | 75.476 | 92.512 |
| simplenetv1_9m_m2 | 900.45 | 9.51 | 74.23 | 91.748 |
| tf_efficientnet_lite0 | 876.66 | 4.65 | 74.832 | 92.174 |
| dla34 | 834.35 | 15.74 | 74.62 | 92.072 |
| mobilenetv2_110d | 824.4 | 4.52 | 75.038 | 92.184 |
| resnet26 | 771.1 | 16 | 75.3 | 92.578 |
| repvgg_b0 | 751.01 | 15.82 | 75.16 | 92.418 |
| crossvit_9_240 | 606.2 | 8.55 | 73.96 | 91.968 |
| vgg11 | 576.32 | 132.86 | 69.028 | 88.626 |
| vit_base_patch32_224_sam | 561.99 | 88.22 | 73.694 | 91.01 |
| vgg11_bn | 504.29 | 132.87 | 70.36 | 89.802 |
| densenet121 | 435.3 | 7.98 | 75.584 | 92.652 |
| vgg13 | 363.69 | 133.05 | 69.926 | 89.246 |
| vgg13_bn | 315.85 | 133.05 | 71.594 | 90.376 |
| vgg16 | 302.84 | 138.36 | 71.59 | 90.382 |
| vgg16_bn | 265.99 | 138.37 | 73.35 | 91.504 |
| vgg19 | 259.82 | 143.67 | 72.366 | 90.87 |
| vgg19_bn | 229.77 | 143.68 | 74.214 | 91.848 |

Note:
These benchmarks were run on a PC with a GTX 1080, PyTorch 1.11, fp32, and NCHW configuration.

I hope this is useful for the community.

Coderx7 avatar Feb 16 '23 19:02 Coderx7

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

@Coderx7 thanks for the PR, looks like a decent lightweight model, but the big stack of layers in a single sequential doesn't really line up with other timm models and makes it hard to support many default features like feature extraction at strided stage boundaries, layer grouping, block-based grad checkpointing, etc....

Any chance you could organize the net into stem + stages[blocks[]] ?

rwightman avatar Feb 17 '23 05:02 rwightman

@rwightman my pleasure. I tried to follow your VGG implementation and implement everything that was there.
I'm not familiar with the stem+stages layout though. Could you elaborate a bit more on this?

Coderx7 avatar Feb 17 '23 06:02 Coderx7

@Coderx7 RexNet is probably the simplest example, ResNetV2 and RegNet are decent examples as well...

  • https://github.com/rwightman/pytorch-image-models/blob/main/timm/models/rexnet.py
  • https://github.com/rwightman/pytorch-image-models/blob/main/timm/models/resnetv2.py
  • https://github.com/rwightman/pytorch-image-models/blob/main/timm/models/regnet.py

I also just refactored Levit to use stages (for feat extraction support), and it's similar to this net in that there aren't strided convs, but a 'downsample' layer that'd be at the start of strided stages.

  • https://github.com/rwightman/pytorch-image-models/blob/main/timm/models/levit.py
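
For reference, a minimal sketch (not the PR's code) of what such a stem + stages[blocks[]] layout can look like, loosely following the rexnet/levit pattern above; the class names and channel configs here are illustrative only:

```python
import torch.nn as nn

class SimpleStage(nn.Module):  # illustrative name, not the PR code
    def __init__(self, in_chs, block_cfgs, downsample=False):
        super().__init__()
        # the 'downsample' layer sits at the start of a strided stage
        blocks = [nn.MaxPool2d(2, 2)] if downsample else []
        for out_chs, k in block_cfgs:  # (out_channels, kernel_size) per block
            blocks += [nn.Sequential(
                nn.Conv2d(in_chs, out_chs, k, padding=k // 2),
                nn.BatchNorm2d(out_chs),
                nn.ReLU(inplace=True))]
            in_chs = out_chs
        self.block = nn.Sequential(*blocks)

    def forward(self, x):
        return self.block(x)

class SimpleNetSketch(nn.Module):
    def __init__(self, num_classes=1000):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True))
        self.features = nn.Sequential(  # stages[blocks[]]
            SimpleStage(64, [(128, 3)] * 5),
            SimpleStage(128, [(256, 3)] * 3, downsample=True),
            SimpleStage(256, [(512, 3)], downsample=True))
        self.head = nn.Linear(512, num_classes)

    def forward(self, x):
        x = self.features(self.stem(x))
        x = x.amax(dim=(2, 3))  # global max pool, per the dumps later in the thread
        return self.head(x)
```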

rwightman avatar Feb 17 '23 07:02 rwightman

So looking at the net layout, two possible structures stand out.

Option 1:

```
stem:
      (128, 1, 0.0),
stage[0]
      (192, 1, 0.0),
      (192, 1, 0.0),
      (192, 1, 0.0),
      (192, 1, 0.0),
      (192, 1, 0.0),
stage[1]
      ("p", 2, 0.0),
      (320, 1, 0.0),
      (320, 1, 0.0),
      (320, 1, 0.0),
      (640, 1, 0.0),
stage[2]
      ("p", 2, 0.0),
      (2560, 1, 0.0, "k1"),
      (320, 1, 0.0, "k1"),
      (320, 1, 0.0),
head:
```

Option 2:

```
stem:
      (128, 1, 0.0),
stage[0]
      (192, 1, 0.0),
      (192, 1, 0.0),
      (192, 1, 0.0),
      (192, 1, 0.0),
      (192, 1, 0.0),
stage[1]
      ("p", 2, 0.0),
      (320, 1, 0.0),
      (320, 1, 0.0),
      (320, 1, 0.0),
stage[2]
      (640, 1, 0.0),
stage[3]
      ("p", 2, 0.0),
      (2560, 1, 0.0, "k1"),
stage[4]
      (320, 1, 0.0, "k1"),
      (320, 1, 0.0),
head:
```

rwightman avatar Feb 17 '23 07:02 rwightman

@rwightman Thanks a lot for the examples. I guess I'll give the rexnet layout a try and hopefully get it refactored soon.

Coderx7 avatar Feb 17 '23 08:02 Coderx7

@rwightman : I got a bit confused doing the refactoring; do you mind if I ask you questions while I work on it? For a start, should the model look like this? Also, how does timm handle the conversion of previous weights (the model state_dict) to the new form?

SimpleNet(
  (stem): Sequential(
    (0): Sequential(
      (0): Conv2d(3, 64, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
      (1): BatchNorm2d(64, eps=1e-05, momentum=0.05, affine=True, track_running_stats=True)
      (2): ReLU(inplace=True)
      (3): Dropout2d(p=0.0, inplace=False)
    )
  )
  (features): Sequential(
    (stage_0): SimpleBlock(
      (block): Sequential(
        (ConvBlock_0): Sequential(
          (0): Conv2d(64, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
          (1): BatchNorm2d(128, eps=1e-05, momentum=0.05, affine=True, track_running_stats=True)
          (2): ReLU(inplace=True)
          (3): Dropout2d(p=0.0, inplace=False)
        )
        (ConvBlock_1): Sequential(
          (0): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
          (1): BatchNorm2d(128, eps=1e-05, momentum=0.05, affine=True, track_running_stats=True)
          (2): ReLU(inplace=True)
          (3): Dropout2d(p=0.0, inplace=False)
        )
        (ConvBlock_2): Sequential(
          (0): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
          (1): BatchNorm2d(128, eps=1e-05, momentum=0.05, affine=True, track_running_stats=True)
          (2): ReLU(inplace=True)
          (3): Dropout2d(p=0.0, inplace=False)
        )
        (ConvBlock_3): Sequential(
          (0): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
          (1): BatchNorm2d(128, eps=1e-05, momentum=0.05, affine=True, track_running_stats=True)
          (2): ReLU(inplace=True)
          (3): Dropout2d(p=0.0, inplace=False)
        )
        (ConvBlock_4): Sequential(
          (0): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
          (1): BatchNorm2d(128, eps=1e-05, momentum=0.05, affine=True, track_running_stats=True)
          (2): ReLU(inplace=True)
          (3): Dropout2d(p=0.0, inplace=False)
        )
      )
    )
    (stage_1): SimpleBlock(
      (block): Sequential(
        (maxpool_0): Sequential(
          (0): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
          (1): Dropout2d(p=0.0, inplace=True)
        )
        (ConvBlock_1): Sequential(
          (0): Conv2d(128, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
          (1): BatchNorm2d(256, eps=1e-05, momentum=0.05, affine=True, track_running_stats=True)
          (2): ReLU(inplace=True)
          (3): Dropout2d(p=0.0, inplace=False)
        )
        (ConvBlock_2): Sequential(
          (0): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
          (1): BatchNorm2d(256, eps=1e-05, momentum=0.05, affine=True, track_running_stats=True)
          (2): ReLU(inplace=True)
          (3): Dropout2d(p=0.0, inplace=False)
        )
        (ConvBlock_3): Sequential(
          (0): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
          (1): BatchNorm2d(256, eps=1e-05, momentum=0.05, affine=True, track_running_stats=True)
          (2): ReLU(inplace=True)
          (3): Dropout2d(p=0.0, inplace=False)
        )
      )
    )
    (stage_2): SimpleBlock(
      (block): Sequential(
        (ConvBlock_0): Sequential(
          (0): Conv2d(256, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
          (1): BatchNorm2d(512, eps=1e-05, momentum=0.05, affine=True, track_running_stats=True)
          (2): ReLU(inplace=True)
          (3): Dropout2d(p=0.0, inplace=False)
        )
      )
    )
    (stage_3): SimpleBlock(
      (block): Sequential(
        (maxpool_0): Sequential(
          (0): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
          (1): Dropout2d(p=0.0, inplace=True)
        )
        (ConvBlock_1): Sequential(
          (0): Conv2d(512, 2048, kernel_size=(1, 1), stride=(1, 1), padding=(1, 1))
          (1): BatchNorm2d(2048, eps=1e-05, momentum=0.05, affine=True, track_running_stats=True)
          (2): ReLU(inplace=True)
          (3): Dropout2d(p=0.0, inplace=False)
        )
      )
    )
    (stage_4): SimpleBlock(
      (block): Sequential(
        (ConvBlock_0): Sequential(
          (0): Conv2d(2048, 256, kernel_size=(1, 1), stride=(1, 1), padding=(1, 1))
          (1): BatchNorm2d(256, eps=1e-05, momentum=0.05, affine=True, track_running_stats=True)
          (2): ReLU(inplace=True)
          (3): Dropout2d(p=0.0, inplace=False)
        )
        (ConvBlock_1): Sequential(
          (0): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
          (1): BatchNorm2d(256, eps=1e-05, momentum=0.05, affine=True, track_running_stats=True)
          (2): ReLU(inplace=True)
          (3): Dropout2d(p=0.0, inplace=False)
        )
      )
    )
  )
  (head): ClassifierHead(
    (global_pool): SelectAdaptivePool2d (pool_type=max, flatten=Flatten(start_dim=1, end_dim=-1))
    (fc): Linear(in_features=256, out_features=1000, bias=True)
    (flatten): Identity()
  )
)

Coderx7 avatar Feb 17 '23 17:02 Coderx7

@Coderx7 structure looks nice

for conversion I usually write a fn called checkpoint_filter_fn

See:

  • https://github.com/rwightman/pytorch-image-models/blob/main/timm/models/efficientformer.py#L473
  • https://github.com/rwightman/pytorch-image-models/blob/main/timm/models/levit.py#L696
  • https://github.com/rwightman/pytorch-image-models/blob/main/timm/models/edgenext.py#L482
  • https://github.com/rwightman/pytorch-image-models/blob/main/timm/models/convnext.py#L387

Mapping a purely linear 0..num_model_layers layout to stages is going to be a bit of fun; you'll probably need to use a regex and find a rule you can increment stage_idx on (i.e. every time the out dim changes). The last-ditch option is just to iterate both state dicts together like the levit example and assume they line up (they should), asserting that the number of elements matches...
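
A minimal sketch of that last-ditch approach, assuming the flat and staged state dicts enumerate their tensors in the same order (key names taken from the dumps later in this thread; this is not the final PR code):

```python
def checkpoint_filter_fn(state_dict, model):
    """Remap an original flat-sequential checkpoint onto the staged layout."""
    state_dict = state_dict.get('state_dict', state_dict)  # unwrap if nested
    if 'stem.conv.weight' in state_dict:
        return state_dict  # already in the new staged format
    model_sd = model.state_dict()
    assert len(model_sd) == len(state_dict), 'number of elements must match'
    out_dict = {}
    # iterate both state dicts in parallel, assuming one-to-one alignment
    for (new_k, new_v), old_v in zip(model_sd.items(), state_dict.values()):
        assert new_v.shape == old_v.shape, f'shape mismatch for {new_k}'
        out_dict[new_k] = old_v
    return out_dict
```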

rwightman avatar Feb 17 '23 18:02 rwightman

that checkpoint filter should be passed to the builder ie https://github.com/rwightman/pytorch-image-models/blob/main/timm/models/levit.py#L765
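That wiring might look like the following sketch (the helper name and import path are assumptions based on other timm model files of that era; inside timm/models it would be a relative import):

```python
from timm.models._builder import build_model_with_cfg  # '.helpers' in older timm versions

def _create_simplenet(variant, pretrained=False, **kwargs):
    # pretrained_filter_fn is applied to every loaded checkpoint before the state_dict load
    return build_model_with_cfg(
        SimpleNet, variant, pretrained,
        pretrained_filter_fn=checkpoint_filter_fn,
        **kwargs)
```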

rwightman avatar Feb 17 '23 18:02 rwightman

@rwightman I got the checkpoint working, however for some reason when I try the features_only argument during model creation, it crashes and complains that the return layers are not present in the model:

AssertionError: Return layers ({'features.stage_0.block.ConvBlock_0', 'features.stage_3.block.maxpool', 'features.stage_0.block.ConvBlock_2', 'features.stage_1.block.maxpool'}) are not present in model

What should I specify as the module name in the feature_info list, i.e. what is it looking for? If it helps, this is how the model looks:

SimpleNet(
  (stem): ConvBNReLU(
    (conv): Conv2d(3, 64, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
    (bn): BatchNorm2d(64, eps=1e-05, momentum=0.05, affine=True, track_running_stats=True)
    (dropout): Dropout2d(p=0.0, inplace=False)
    (relu): ReLU(inplace=True)
  )
  (features): Sequential(
    (stage_0): SimpleBlock(
      (block): Sequential(
        (ConvBlock_0): ConvBNReLU(
          (conv): Conv2d(64, 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
          (bn): BatchNorm2d(128, eps=1e-05, momentum=0.05, affine=True, track_running_stats=True)
          (dropout): Dropout2d(p=0.0, inplace=False)
          (relu): ReLU(inplace=True)
        )
        (ConvBlock_1): ConvBNReLU(
          (conv): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
          (bn): BatchNorm2d(128, eps=1e-05, momentum=0.05, affine=True, track_running_stats=True)
          (dropout): Dropout2d(p=0.0, inplace=False)
          (relu): ReLU(inplace=True)
        )
        (ConvBlock_2): ConvBNReLU(
          (conv): Conv2d(128, 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
          (bn): BatchNorm2d(128, eps=1e-05, momentum=0.05, affine=True, track_running_stats=True)
          (dropout): Dropout2d(p=0.0, inplace=False)
          (relu): ReLU(inplace=True)
        )
        (ConvBlock_3): ConvBNReLU(
          (conv): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
          (bn): BatchNorm2d(128, eps=1e-05, momentum=0.05, affine=True, track_running_stats=True)
          (dropout): Dropout2d(p=0.0, inplace=False)
          (relu): ReLU(inplace=True)
        )
        (ConvBlock_4): ConvBNReLU(
          (conv): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
          (bn): BatchNorm2d(128, eps=1e-05, momentum=0.05, affine=True, track_running_stats=True)
          (dropout): Dropout2d(p=0.0, inplace=False)
          (relu): ReLU(inplace=True)
        )
      )
    )
    (stage_1): SimpleBlock(
      (block): Sequential(
        (maxpool): Sequential(
          (0): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
          (1): Dropout2d(p=0.0, inplace=True)
        )
        (ConvBlock_0): ConvBNReLU(
          (conv): Conv2d(128, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
          (bn): BatchNorm2d(256, eps=1e-05, momentum=0.05, affine=True, track_running_stats=True)
          (dropout): Dropout2d(p=0.0, inplace=False)
          (relu): ReLU(inplace=True)
        )
        (ConvBlock_1): ConvBNReLU(
          (conv): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
          (bn): BatchNorm2d(256, eps=1e-05, momentum=0.05, affine=True, track_running_stats=True)
          (dropout): Dropout2d(p=0.0, inplace=False)
          (relu): ReLU(inplace=True)
        )
        (ConvBlock_2): ConvBNReLU(
          (conv): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
          (bn): BatchNorm2d(256, eps=1e-05, momentum=0.05, affine=True, track_running_stats=True)
          (dropout): Dropout2d(p=0.0, inplace=False)
          (relu): ReLU(inplace=True)
        )
      )
    )
    (stage_2): SimpleBlock(
      (block): Sequential(
        (ConvBlock_0): ConvBNReLU(
          (conv): Conv2d(256, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
          (bn): BatchNorm2d(512, eps=1e-05, momentum=0.05, affine=True, track_running_stats=True)
          (dropout): Dropout2d(p=0.0, inplace=False)
          (relu): ReLU(inplace=True)
        )
      )
    )
    (stage_3): SimpleBlock(
      (block): Sequential(
        (maxpool): Sequential(
          (0): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
          (1): Dropout2d(p=0.0, inplace=True)
        )
        (ConvBlock_0): ConvBNReLU(
          (conv): Conv2d(512, 2048, kernel_size=(1, 1), stride=(1, 1), padding=(1, 1))
          (bn): BatchNorm2d(2048, eps=1e-05, momentum=0.05, affine=True, track_running_stats=True)
          (dropout): Dropout2d(p=0.0, inplace=False)
          (relu): ReLU(inplace=True)
        )
      )
    )
    (stage_4): SimpleBlock(
      (block): Sequential(
        (ConvBlock_0): ConvBNReLU(
          (conv): Conv2d(2048, 256, kernel_size=(1, 1), stride=(1, 1), padding=(1, 1))
          (bn): BatchNorm2d(256, eps=1e-05, momentum=0.05, affine=True, track_running_stats=True)
          (dropout): Dropout2d(p=0.0, inplace=False)
          (relu): ReLU(inplace=True)
        )
        (ConvBlock_1): ConvBNReLU(
          (conv): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
          (bn): BatchNorm2d(256, eps=1e-05, momentum=0.05, affine=True, track_running_stats=True)
          (dropout): Dropout2d(p=0.0, inplace=False)
          (relu): ReLU(inplace=True)
        )
      )
    )
  )
  (head): ClassifierHead(
    (global_pool): SelectAdaptivePool2d (pool_type=max, flatten=Flatten(start_dim=1, end_dim=-1))
    (fc): Linear(in_features=256, out_features=1000, bias=True)
    (flatten): Identity()
  )
)

feature_info:

[{'num_chs': 64, 'reduction': 2, 'module': 'stem'},
 {'num_chs': 128,
  'reduction': 4,
  'module': 'features.stage_0.block.ConvBlock_0'},
 {'num_chs': 128,
  'reduction': 8,
  'module': 'features.stage_0.block.ConvBlock_2'},
 {'num_chs': 128, 'reduction': 16, 'module': 'features.stage_1.block.maxpool'},
 {'num_chs': 512, 'reduction': 32, 'module': 'features.stage_3.block.maxpool'}]

and these are the state_dict.keys():

stem.conv.weight
stem.conv.bias
stem.bn.weight
stem.bn.bias
stem.bn.running_mean
stem.bn.running_var
stem.bn.num_batches_tracked
features.stage_0.block.ConvBlock_0.conv.weight
features.stage_0.block.ConvBlock_0.conv.bias
features.stage_0.block.ConvBlock_0.bn.weight
features.stage_0.block.ConvBlock_0.bn.bias
features.stage_0.block.ConvBlock_0.bn.running_mean
features.stage_0.block.ConvBlock_0.bn.running_var
features.stage_0.block.ConvBlock_0.bn.num_batches_tracked
features.stage_0.block.ConvBlock_1.conv.weight
features.stage_0.block.ConvBlock_1.conv.bias
features.stage_0.block.ConvBlock_1.bn.weight
features.stage_0.block.ConvBlock_1.bn.bias
features.stage_0.block.ConvBlock_1.bn.running_mean
features.stage_0.block.ConvBlock_1.bn.running_var
features.stage_0.block.ConvBlock_1.bn.num_batches_tracked
features.stage_0.block.ConvBlock_2.conv.weight
features.stage_0.block.ConvBlock_2.conv.bias
features.stage_0.block.ConvBlock_2.bn.weight
features.stage_0.block.ConvBlock_2.bn.bias
features.stage_0.block.ConvBlock_2.bn.running_mean
features.stage_0.block.ConvBlock_2.bn.running_var
features.stage_0.block.ConvBlock_2.bn.num_batches_tracked
features.stage_0.block.ConvBlock_3.conv.weight
features.stage_0.block.ConvBlock_3.conv.bias
features.stage_0.block.ConvBlock_3.bn.weight
features.stage_0.block.ConvBlock_3.bn.bias
features.stage_0.block.ConvBlock_3.bn.running_mean
features.stage_0.block.ConvBlock_3.bn.running_var
features.stage_0.block.ConvBlock_3.bn.num_batches_tracked
features.stage_0.block.ConvBlock_4.conv.weight
features.stage_0.block.ConvBlock_4.conv.bias
features.stage_0.block.ConvBlock_4.bn.weight
features.stage_0.block.ConvBlock_4.bn.bias
features.stage_0.block.ConvBlock_4.bn.running_mean
features.stage_0.block.ConvBlock_4.bn.running_var
features.stage_0.block.ConvBlock_4.bn.num_batches_tracked
features.stage_1.block.ConvBlock_0.conv.weight
features.stage_1.block.ConvBlock_0.conv.bias
features.stage_1.block.ConvBlock_0.bn.weight
features.stage_1.block.ConvBlock_0.bn.bias
features.stage_1.block.ConvBlock_0.bn.running_mean
features.stage_1.block.ConvBlock_0.bn.running_var
features.stage_1.block.ConvBlock_0.bn.num_batches_tracked
features.stage_1.block.ConvBlock_1.conv.weight
features.stage_1.block.ConvBlock_1.conv.bias
features.stage_1.block.ConvBlock_1.bn.weight
features.stage_1.block.ConvBlock_1.bn.bias
features.stage_1.block.ConvBlock_1.bn.running_mean
features.stage_1.block.ConvBlock_1.bn.running_var
features.stage_1.block.ConvBlock_1.bn.num_batches_tracked
features.stage_1.block.ConvBlock_2.conv.weight
features.stage_1.block.ConvBlock_2.conv.bias
features.stage_1.block.ConvBlock_2.bn.weight
features.stage_1.block.ConvBlock_2.bn.bias
features.stage_1.block.ConvBlock_2.bn.running_mean
features.stage_1.block.ConvBlock_2.bn.running_var
features.stage_1.block.ConvBlock_2.bn.num_batches_tracked
features.stage_2.block.ConvBlock_0.conv.weight
features.stage_2.block.ConvBlock_0.conv.bias
features.stage_2.block.ConvBlock_0.bn.weight
features.stage_2.block.ConvBlock_0.bn.bias
features.stage_2.block.ConvBlock_0.bn.running_mean
features.stage_2.block.ConvBlock_0.bn.running_var
features.stage_2.block.ConvBlock_0.bn.num_batches_tracked
features.stage_3.block.ConvBlock_0.conv.weight
features.stage_3.block.ConvBlock_0.conv.bias
features.stage_3.block.ConvBlock_0.bn.weight
features.stage_3.block.ConvBlock_0.bn.bias
features.stage_3.block.ConvBlock_0.bn.running_mean
features.stage_3.block.ConvBlock_0.bn.running_var
features.stage_3.block.ConvBlock_0.bn.num_batches_tracked
features.stage_4.block.ConvBlock_0.conv.weight
features.stage_4.block.ConvBlock_0.conv.bias
features.stage_4.block.ConvBlock_0.bn.weight
features.stage_4.block.ConvBlock_0.bn.bias
features.stage_4.block.ConvBlock_0.bn.running_mean
features.stage_4.block.ConvBlock_0.bn.running_var
features.stage_4.block.ConvBlock_0.bn.num_batches_tracked
features.stage_4.block.ConvBlock_1.conv.weight
features.stage_4.block.ConvBlock_1.conv.bias
features.stage_4.block.ConvBlock_1.bn.weight
features.stage_4.block.ConvBlock_1.bn.bias
features.stage_4.block.ConvBlock_1.bn.running_mean
features.stage_4.block.ConvBlock_1.bn.running_var
features.stage_4.block.ConvBlock_1.bn.num_batches_tracked
head.fc.weight
head.fc.bias

Coderx7 avatar Feb 18 '23 17:02 Coderx7

@Coderx7 feature info should be filled with the module name of the 'deepest' layer for a given stride, so usually the nn.Module right before a downsample layer. In this case, you'd want stem, features.stage_0, features.stage_2, features.stage_4 ...aaand I just noticed there is a stride 2 on ConvBlock_2 of stage_0; if that's supposed to be there, it should split into a different stage (stages are delimited by strided layers and, in many cases, shifts in width).

rwightman avatar Feb 19 '23 06:02 rwightman

@rwightman Thanks, but there are two things here. First, I believe I did just that but still got the same error anyway! I'll give it another try and see how it goes. Second, concerning the stages: this architecture basically allows dynamic strides for any layer, especially the first 4 (I can remove that and make it static, as there are only two pretrained variants with two stride modes!). The two trained variants use mode 1 and mode 2 strides, which downsample the early layers at a specific rate, so during ImageNet training you get, in its simplest form, some leverage over the performance/accuracy ratio.
Like here, this variant uses strides of 2,2,1,2 while another uses 2,2,2, and the rest are 1s.
If I create stages based on the downsampling of features, then the stem, layer 1, and layer 3 should all be in unique stages, right? Like stage 1 to stage 2 (excluding the stem)?

Coderx7 avatar Feb 19 '23 07:02 Coderx7

In the model create helper you should enable flatten_sequential and ensure the default number of out indices matches the net:

    out_indices = kwargs.pop('out_indices', (0, 1, 2, 3))
    model = build_model_with_cfg(
        EfficientFormerV2, variant, pretrained,
        feature_cfg=dict(flatten_sequential=True, out_indices=out_indices),
        **kwargs)

Most models have some sort of pattern and systematic spacing between the strided layers, so I figured that'd be the same here for the configs. I realize they could be put anywhere, but it doesn't seem that useful to have no depth between strides.

The concept of a stage is essentially to encapsulate the layers at the same stride; sometimes there are stages w/o any stride change but with a different width, conv type (depthwise vs not), or some other trait in common across all the layer repeats in the stage.

rwightman avatar Feb 19 '23 07:02 rwightman

@rwightman Thanks a lot. That's a fair point; however, this net was never meant to scale that way. It was designed with something completely different in mind: to show how one could maximize a network's performance under constraints (fixed param count, depth, and basic operators) while keeping everything simple and not resorting to any complex strategies.

Having said that, I thankfully seem to have pretty much done everything; the only thing that still seems to be an issue is that the last stage has a bigger feature map size (thus a smaller stride) than its predecessor, and timm seems to have issues with that. Currently, this is how my feature_info looks:

[{'num_chs': 64, 'reduction': 2, 'module': 'stem'},
 {'num_chs': 128, 'reduction': 4, 'module': 'features.stage_0'},
 {'num_chs': 128, 'reduction': 8, 'module': 'features.stage_1'},
 {'num_chs': 512, 'reduction': 16, 'module': 'features.stage_2'},
 {'num_chs': 2048, 'reduction': 24, 'module': 'features.stage_3'},
 {'num_chs': 256, 'reduction': 20, 'module': 'features.stage_4'}]

How should I handle this, other than by merging the last two stages? Thanks a lot in advance.

Coderx7 avatar Feb 19 '23 12:02 Coderx7

@rwightman would you kindly have a look here and tell me what to do for the last part? thanks

Coderx7 avatar Feb 21 '23 05:02 Coderx7

@Coderx7 reduction is the spatial reduction (from the input image size); it's only complained about if it decreases. It's not used directly by timm, but some downstream users want to know it for calculating interpolation ratios.

If you look at the rexnet example, it should *= 2 every time there is a strided layer; the majority of ImageNet networks are stride 32. num_chs has no restrictions on increasing/decreasing though.
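
A sketch of that bookkeeping (the channel/stride values here are illustrative, not SimpleNet's actual config):

```python
# reduction is the cumulative downscaling from the input image; it should
# only ever grow, multiplied by the stride at every strided stage.
stage_cfgs = [(128, 1), (256, 2), (512, 1), (2048, 2), (256, 1)]  # (out_chs, stride)
reduction = 2  # after a stride-2 stem
feature_info = [dict(num_chs=64, reduction=reduction, module='stem')]
for i, (out_chs, stride) in enumerate(stage_cfgs):
    reduction *= stride
    feature_info.append(
        dict(num_chs=out_chs, reduction=reduction, module=f'features.stage_{i}'))
```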

rwightman avatar Feb 21 '23 06:02 rwightman

@rwightman I thought the idea was to provide feature maps of different sizes for downstream usage, not to capture only the stride-2 layers per se.
Currently, if the assert in

assert 'reduction' in fi and fi['reduction'] >= prev_reduction

is not disabled, this won't work. So I need to do one of the following:

  1. have 4 stages and only provide reduction rates for 3 of them (that is, don't include the reduction rate for the last stage in feature_info)
  2. have 3 stages, merging the last two (3 and 4), with only 3 stages in total and a reduction rate for each
  3. alter the FeatureInfo class to take a new argument that allows cases like this
  • the issue with the first option is that users will lose the last two layers of the network if they opt to use features_only, but other than that, normal usage stays the same.
  • the issue with the second option is that users can't fully experiment with stage 4; they'd have to split it manually, which nullifies the purpose of features_only, I guess.
  • the last option seems like a good idea to me: with a default value that works for all current models, the current behavior is maintained, but cases such as this one also become usable, unless that check has more significance and affects lots of other parts of the library that I'm not aware of yet.

So, which option should I take to hopefully finish this up? Thanks a lot in advance.

Coderx7 avatar Feb 21 '23 08:02 Coderx7

@rwightman I'd really appreciate it if you could kindly have a look and decide on the next step, so I can finalize the changes accordingly and have this finished.

Coderx7 avatar Feb 23 '23 05:02 Coderx7

@Coderx7 sorry, I have a lot on my plate right now, wrapping up a few things before I'm on vacation for a bit. I'm going to have to leave this one hanging for a while as I don't think we're on the same page.

The net is simple, as per its name, and I didn't see any merging or upscaling or anything else that could result in a feature map increasing in size; it's reducing by 2 at each downscale. I feel we're lost in semantics.

rwightman avatar Feb 23 '23 06:02 rwightman

@rwightman Out of the last three conv layers, two (the 2048- and 256-channel ones) have kernel_size=1 and use a padding of 1. That causes the feature map size to increase from 7x7 (after the downsampling) to 9x9 (after the first 1x1 conv), and the next 1x1 conv layer increases that to 11x11, causing the effective reduction to vary that way. OK, no problem, please take your time and let's continue this when you are free. I really do appreciate you taking the time despite your busy schedule.
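
A standalone check (not PR code) confirming the arithmetic: with stride 1, the output size is input + 2*padding - kernel_size + 1, so padded 1x1 convs grow the map.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 512, 7, 7)  # 7x7 map after the last downsampling
conv1 = nn.Conv2d(512, 2048, kernel_size=1, padding=1)
conv2 = nn.Conv2d(2048, 256, kernel_size=1, padding=1)
print(conv1(x).shape)         # torch.Size([1, 2048, 9, 9])  -> 7 + 2 - 1 + 1 = 9
print(conv2(conv1(x)).shape)  # torch.Size([1, 256, 11, 11]) -> 9 + 2 - 1 + 1 = 11
```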

Coderx7 avatar Feb 23 '23 07:02 Coderx7

@rwightman May I ask if your vacation is over and if we can hopefully get this last step worked out?

Coderx7 avatar Mar 13 '23 19:03 Coderx7

@Coderx7 I've been trying to get on top of my own tasks since getting back. I looked at this a bit more, and I'm really not liking the padding issue that is the reason for the expanding dim... having a padding of 1 for a 1x1 conv makes zero sense to me. It's adding data to the signal path that's not meaningful. So I'm hesitant to add it altogether with quirks like that present...

rwightman avatar Mar 16 '23 20:03 rwightman

@rwightman Thanks, I really appreciate it, knowing how busy your schedule is. It's not really any different from using (zero-)padding on the input. It happened by accident, but after I noticed it, in a few experiments I ran afterward the padded versions performed better than the no-padding versions; it looked to me as if it creates a kind of regularization effect. I can run more experiments to further validate this point (or the lack thereof, if that turns out to be the case), if that's your concern. My main concern is that it takes a lot of time to train these models again (it took me several months, as I don't have access to anything powerful, just a single GPU), but I'll try my best to address your concerns.

Coderx7 avatar Mar 17 '23 08:03 Coderx7

@Coderx7 in deep learning it would seem almost any extra activations (or parameters) can/will be used to improve the loss in optimization, but I'd argue these are not particularly useful ones (and they're possibly harmful for segmentation/obj detection, as they'd add a 'border' effect at the feature level). They get blended back into the signal via the subsequent 3x3 conv. I did test these, and per the goal of running faster, the extra padding does have a measurable speed impact (not significant, but there).

The rest of the net is fine, simple as per the name, which isn't a bad thing to have in timm, as such nets can be the best option for some tasks. If the padding issue is fixed (padding == kernel_size // 2 should do fine for this net) and the models retrained, I'd definitely include it with the tweaks mentioned.
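
A sketch of that fix applied to the conv block from the dumps above (the layer order is assumed from the earlier Sequential printout; this is not the final PR code):

```python
import torch.nn as nn

class ConvBNReLU(nn.Module):
    def __init__(self, in_chs, out_chs, kernel_size=3, stride=1, drop_rate=0.0):
        super().__init__()
        # padding derived from kernel size: 0 for 1x1 convs, 1 for 3x3 convs,
        # so pointwise layers can no longer grow the feature map
        padding = kernel_size // 2
        self.conv = nn.Conv2d(in_chs, out_chs, kernel_size, stride, padding)
        self.bn = nn.BatchNorm2d(out_chs, momentum=0.05)
        self.relu = nn.ReLU(inplace=True)
        self.dropout = nn.Dropout2d(p=drop_rate)

    def forward(self, x):
        return self.dropout(self.relu(self.bn(self.conv(x))))
```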

Do you have hparams for these? I have two idle 2x Titan RTX machines right now, I could put them to work if you push any outstanding changes re arch to this PR.

rwightman avatar Mar 17 '23 16:03 rwightman

@rwightman That would be great, thanks a lot :) I was actually planning to test them on detection/segmentation as well, to see how the padding affects the results in different scenarios. I'm glad you kindly shared your experience with me, I really appreciate it. :)

OK, I'll try to push all the changes so far. The model is refactored into stages and uses no padding for the 1x1 convs; everything should hopefully work. Concerning the hparams, I didn't have much to test with; I went with nearly the same hparams for all architectures, which I'll describe in a moment.

They all use the same settings and only differ in weight decay and dropouts.

I start off training all models without any dropout, and aside from keeping the top 10 checkpoints, I save a checkpoint every 50 epochs (i.e. the 100th, 150th, 200th, ... checkpoints) so that later on I can resume from them with dropout enabled and get faster and better results. Then, at the end, I take the average of the best checkpoints.
If I don't get the accuracy I like, I try resuming with no label smoothing and finally average the best of these to get the final results. That's all. The basic command arguments look like this; only the model name, weight decay, and dropout change.

./distributed_train.sh 1 /media/hossein/SSD_IMG/ImageNet_DataSet/ --model simplenetv1_5m_m1 -b 256 --sched step --epochs 900 --decay-epochs 1 --decay-rate 0.981 --opt rmsproptf --opt-eps .001 -j 20 --warmup-lr 1e-3 --weight-decay 0.00003 --drop 0.0 --amp --lr .0195 --pin-mem --channels-last --model-ema --model-ema-decay 0.9999

5m: For example, this is how I start the training for the 5m variants:

./distributed_train.sh 1 /media/hossein/SSD_IMG/ImageNet_DataSet/ --model simplenetv1_5m_m1 -b 256 --sched step --epochs 900 --decay-epochs 1 --decay-rate 0.981 --opt rmsproptf --opt-eps .001 -j 20 --warmup-lr 1e-3 --weight-decay 0.00003 --drop 0.0 --amp --lr .0195 --pin-mem --channels-last --model-ema --model-ema-decay 0.9999

./distributed_train.sh 1 /media/hossein/SSD_IMG/ImageNet_DataSet/ --model simplenetv1_5m_m2 -b 256 --sched step --epochs 900 --decay-epochs 1 --decay-rate 0.981 --opt rmsproptf --opt-eps .001 -j 20 --warmup-lr 1e-3 --weight-decay 0.00003 --drop 0.0 --amp --lr .0195 --pin-mem --channels-last --model-ema --model-ema-decay 0.9999

When the model stops improving, I resume from checkpoint 250 and this time use dropout rates like this: --weight-decay 0.00002 --model-kwargs drop_rates='{"11":0.02,"12":0.05,"13":0.05}'. This is the same for both variants.

9m: For the m1 variant, I trained with --weight-decay 0.000035 --model-kwargs drop_rates='{"12":0.05,"13":0.05}' from the start, and when it plateaued, I resumed from epoch 300 with slightly more dropout (and less wd), --weight-decay 0.00003 --model-kwargs drop_rates='{"11":0.05,"12":0.05,"13":0.05}', and achieved 73.38.

For the m2 variant, I started with a wd of 0.000035 for 200 epochs, then continued with --weight-decay 0.00003 --model-kwargs drop_rates='{"12":0.05,"13":0.05}' until it stopped improving (at epoch 413 it achieved 73.678). Then I resumed from epoch 330 with slightly higher dropout rates and less wd, just like the m1 variant: --weight-decay 0.00003 --model-kwargs drop_rates='{"11":0.05,"12":0.05,"13":0.05}'. That got me from 73.678 to 73.95. Then I repeated the last run without label smoothing, averaged the best models, and chose the one with the higher accuracy; this gave me 74.17.

./distributed_train.sh 1 /media/hossein/SSD_IMG/ImageNet_DataSet/ --model simplenetv1_9m_m1 -b 256 --sched step --epochs 900 --decay-epochs 1 --decay-rate 0.981 --opt rmsproptf --opt-eps .001 -j 20 --warmup-lr 1e-3 --weight-decay 0.000035 --drop 0.0 --amp --lr .0195 --pin-mem --channels-last --model-ema --model-ema-decay 0.9999 --model-kwargs drop_rates='{"12":0.05,"13":0.05}'

./distributed_train.sh 1 /media/hossein/SSD_IMG/ImageNet_DataSet/ --model simplenetv1_9m_m2 -b 256 --sched step --epochs 900 --decay-epochs 1 --decay-rate 0.981 --opt rmsproptf --opt-eps .001 -j 20 --warmup-lr 1e-3 --weight-decay 0.000035 --drop 0.0 --amp --lr .0195 --pin-mem --channels-last --model-ema --model-ema-decay 0.9999

3m: For this variant, I used no dropout; I simply trained normally until it plateaued, then resumed from checkpoint 332 (basically around the best checkpoint up to that point) with no label smoothing, and finally averaged the best models to get higher accuracy.

./distributed_train.sh 1 /media/hossein/SSD_IMG/ImageNet_DataSet/ --model simplenetv1_small_m1_075 -b 256 --sched step --epochs 900 --decay-epochs 1 --decay-rate 0.981 --opt rmsproptf --opt-eps .001 -j 20 --warmup-lr 1e-3 --weight-decay 0.00003 --drop 0.0 --amp --lr .0195 --pin-mem --channels-last --model-ema --model-ema-decay 0.9999

./distributed_train.sh 1 /media/hossein/SSD_IMG/ImageNet_DataSet/ --model simplenetv1_small_m2_075 -b 256 --sched step --epochs 900 --decay-epochs 1 --decay-rate 0.981 --opt rmsproptf --opt-eps .001 -j 20 --warmup-lr 1e-3 --weight-decay 0.00003 --drop 0.0 --amp --lr .0195 --pin-mem --channels-last --model-ema --model-ema-decay 0.9999

1.5m: For this variant, I simply train normally with a wd of 0.00001. The m2 variant achieves 61.386 normally; the m1 variant achieves 60.814 with no label smoothing (with label smoothing it gets 60.58, so it could be within the margin of error, as I only trained them once). I then take the average of the best models for each.

./distributed_train.sh 1 /media/hossein/SSD_IMG/ImageNet_DataSet/ --model simplenetv1_small_m1_05 -b 256 --sched step --epochs 900 --decay-epochs 1 --decay-rate 0.981 --opt rmsproptf --opt-eps .001 -j 20 --warmup-lr 1e-3 --weight-decay 0.00001 --drop 0.0 --amp --lr .0195 --pin-mem --channels-last --model-ema --model-ema-decay 0.9999

./distributed_train.sh 1 /media/hossein/SSD_IMG/ImageNet_DataSet/ --model simplenetv1_small_m2_05 -b 256 --sched step --epochs 900 --decay-epochs 1 --decay-rate 0.981 --opt rmsproptf --opt-eps .001 -j 20 --warmup-lr 1e-3 --weight-decay 0.00001 --drop 0.0 --amp --lr .0195 --pin-mem --channels-last --model-ema --model-ema-decay 0.9999

Coderx7 avatar Mar 17 '23 19:03 Coderx7

OK, I pushed the changes; hope I didn't do anything wrong.

Coderx7 avatar Mar 17 '23 20:03 Coderx7

@rwightman Hi, hope you are doing great. I finally finished training the new weights and just updated the PR. Would you please kindly tell me what you think? Thanks a lot in advance.

Coderx7 avatar Apr 14 '23 19:04 Coderx7

@rwightman It's been a few months since my last changes; could you kindly tell me if everything is OK or whether I'm missing something here? I'd really like to make this happen, if you're willing, of course. Thanks a lot in advance.

Coderx7 avatar Jul 25 '23 06:07 Coderx7