pytorch-image-models
add simplenet architecture
This pull request adds the SimpleNet architecture. SimpleNetV1 is a 2016 architecture: a plain CNN built from only the most basic operators. It outperformed many deeper and more complex architectures such as VGGNet and ResNet on several benchmark datasets. These are its results on the ImageNet dataset.
- added simplenet.py to timm/models
- added simplenet.md to docs/models
- added an entry to docs/models.md
Here is some more information on how the models perform, taken from our official PyTorch repository:
| Model | #Params | ImageNet (top-1 / top-5) | ImageNet-Real-Labels (top-1 / top-5) |
|---|---|---|---|
| simplenetv1_9m_m2(36.3 MB) | 9.5m | 74.23 / 91.748 | 81.22 / 94.756 |
| simplenetv1_5m_m2(22 MB) | 5.7m | 72.03 / 90.324 | 79.328/ 93.714 |
| simplenetv1_small_m2_075(12.6 MB) | 3m | 68.506/ 88.15 | 76.283/ 92.02 |
| simplenetv1_small_m2_05(5.78 MB) | 1.5m | 61.67 / 83.488 | 69.31 / 88.195 |
SimpleNet performs very decently: it outperforms VGGNet, variants of ResNet, and MobileNets (v1-v3),
and it's pretty fast as well, all while using a plain old CNN!
Here's an example benchmark run on the small variants of SimpleNet and some other well-known architectures such as the MobileNets.
The small variants of SimpleNet consistently achieve a high performance/accuracy ratio:
| model | samples_per_sec | param_count | top1 | top5 |
|---|---|---|---|---|
| simplenetv1_small_m1_05 | 3100.26 | 1.51 | 61.122 | 82.988 |
| mobilenetv3_small_050 | 3082.85 | 1.59 | 57.89 | 80.194 |
| lcnet_050 | 2713.02 | 1.88 | 63.1 | 84.382 |
| simplenetv1_small_m2_05 | 2536.16 | 1.51 | 61.67 | 83.488 |
| mobilenetv3_small_075 | 1793.42 | 2.04 | 65.242 | 85.438 |
| tf_mobilenetv3_small_075 | 1689.53 | 2.04 | 65.714 | 86.134 |
| simplenetv1_small_m1_075 | 1626.87 | 3.29 | 67.784 | 87.718 |
| tf_mobilenetv3_small_minimal_100 | 1316.91 | 2.04 | 62.908 | 84.234 |
| simplenetv1_small_m2_075 | 1313.6 | 3.29 | 68.506 | 88.15 |
| mobilenetv3_small_100 | 1261.09 | 2.54 | 67.656 | 87.634 |
| tf_mobilenetv3_small_100 | 1213.03 | 2.54 | 67.924 | 87.664 |
| mnasnet_small | 1089.33 | 2.03 | 66.206 | 86.508 |
| mobilenetv2_050 | 857.66 | 1.97 | 65.942 | 86.082 |
| dla46_c | 537.08 | 1.3 | 64.866 | 86.294 |
| dla46x_c | 323.03 | 1.07 | 65.97 | 86.98 |
| dla60x_c | 301.71 | 1.32 | 67.892 | 88.426 |
And this is a sample for larger models:
| model | samples_per_sec | param_count | top1 | top5 |
|---|---|---|---|---|
| simplenetv1_small_m1_075 | 2893.91 | 3.29 | 67.784 | 87.718 |
| simplenetv1_small_m2_075 | 2478.41 | 3.29 | 68.506 | 88.15 |
| vit_tiny_r_s16_p8_224 | 2337.23 | 6.34 | 71.792 | 90.822 |
| simplenetv1_5m_m1 | 2105.06 | 5.75 | 71.548 | 89.94 |
| simplenetv1_5m_m2 | 1754.25 | 5.75 | 72.03 | 90.324 |
| resnet18 | 1750.38 | 11.69 | 69.744 | 89.082 |
| regnetx_006 | 1620.25 | 6.2 | 73.86 | 91.672 |
| mobilenetv3_large_100 | 1491.86 | 5.48 | 75.766 | 92.544 |
| tf_mobilenetv3_large_minimal_100 | 1476.29 | 3.92 | 72.25 | 90.63 |
| tf_mobilenetv3_large_075 | 1474.77 | 3.99 | 73.436 | 91.344 |
| ghostnet_100 | 1390.19 | 5.18 | 73.974 | 91.46 |
| tinynet_b | 1345.82 | 3.73 | 74.976 | 92.184 |
| tf_mobilenetv3_large_100 | 1325.06 | 5.48 | 75.518 | 92.604 |
| mnasnet_100 | 1183.69 | 4.38 | 74.658 | 92.112 |
| mobilenetv2_100 | 1101.58 | 3.5 | 72.97 | 91.02 |
| simplenetv1_9m_m1 | 1048.91 | 9.51 | 73.792 | 91.486 |
| resnet34 | 1030.4 | 21.8 | 75.114 | 92.284 |
| deit_tiny_patch16_224 | 990.85 | 5.72 | 72.172 | 91.114 |
| efficientnet_lite0 | 977.76 | 4.65 | 75.476 | 92.512 |
| simplenetv1_9m_m2 | 900.45 | 9.51 | 74.23 | 91.748 |
| tf_efficientnet_lite0 | 876.66 | 4.65 | 74.832 | 92.174 |
| dla34 | 834.35 | 15.74 | 74.62 | 92.072 |
| mobilenetv2_110d | 824.4 | 4.52 | 75.038 | 92.184 |
| resnet26 | 771.1 | 16 | 75.3 | 92.578 |
| repvgg_b0 | 751.01 | 15.82 | 75.16 | 92.418 |
| crossvit_9_240 | 606.2 | 8.55 | 73.96 | 91.968 |
| vgg11 | 576.32 | 132.86 | 69.028 | 88.626 |
| vit_base_patch32_224_sam | 561.99 | 88.22 | 73.694 | 91.01 |
| vgg11_bn | 504.29 | 132.87 | 70.36 | 89.802 |
| densenet121 | 435.3 | 7.98 | 75.584 | 92.652 |
| vgg13 | 363.69 | 133.05 | 69.926 | 89.246 |
| vgg13_bn | 315.85 | 133.05 | 71.594 | 90.376 |
| vgg16 | 302.84 | 138.36 | 71.59 | 90.382 |
| vgg16_bn | 265.99 | 138.37 | 73.35 | 91.504 |
| vgg19 | 259.82 | 143.67 | 72.366 | 90.87 |
| vgg19_bn | 229.77 | 143.68 | 74.214 | 91.848 |
Note:
These benchmarks were run on a PC with a GTX 1080, PyTorch 1.11, in fp32 with NCHW layout.
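For anyone wanting to reproduce numbers like these, timm's benchmark.py script should work; a sketch of the invocation (only the model/bench/batch-size flags shown, everything else left at defaults):
```
python benchmark.py --model simplenetv1_small_m2_05 --bench inference --batch-size 256
```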
I hope this is useful for the community.
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.
@Coderx7 thanks for the PR, looks like a decent lightweight model, but the big stack of layers in a single sequential doesn't really line up with other timm models; it makes it hard to support many default features like feature extraction at strided stage boundaries, layer grouping, block-based grad checkpointing, etc....
Any chance you could organize the net into stem + stages[blocks[]] ?
@rwightman my pleasure. I tried to follow your vgg implementation and implement everything that was there.
I'm not familiar with the stem + stages structure; could you elaborate a bit more on this?
@Coderx7 RexNet is probably the simplest example, ResNetV2 and RegNet are decent examples as well...
- https://github.com/rwightman/pytorch-image-models/blob/main/timm/models/rexnet.py
- https://github.com/rwightman/pytorch-image-models/blob/main/timm/models/resnetv2.py
- https://github.com/rwightman/pytorch-image-models/blob/main/timm/models/regnet.py
I also just refactored Levit to use stages (for feat extraction support), and it's similar to this net in that there aren't strided convs, but a 'downsample' layer that'd be at the start of strided stages.
- https://github.com/rwightman/pytorch-image-models/blob/main/timm/models/levit.py
So looking at the net layout, two possible structures stand out:
```
stem:
  (128, 1, 0.0),
stage[0]:
  (192, 1, 0.0),
  (192, 1, 0.0),
  (192, 1, 0.0),
  (192, 1, 0.0),
  (192, 1, 0.0),
stage[1]:
  ("p", 2, 0.0),
  (320, 1, 0.0),
  (320, 1, 0.0),
  (320, 1, 0.0),
  (640, 1, 0.0),
stage[2]:
  ("p", 2, 0.0),
  (2560, 1, 0.0, "k1"),
  (320, 1, 0.0, "k1"),
  (320, 1, 0.0),
head:
```
or:
```
stem:
  (128, 1, 0.0),
stage[0]:
  (192, 1, 0.0),
  (192, 1, 0.0),
  (192, 1, 0.0),
  (192, 1, 0.0),
  (192, 1, 0.0),
stage[1]:
  ("p", 2, 0.0),
  (320, 1, 0.0),
  (320, 1, 0.0),
  (320, 1, 0.0),
stage[2]:
  (640, 1, 0.0),
stage[3]:
  ("p", 2, 0.0),
  (2560, 1, 0.0, "k1"),
stage[4]:
  (320, 1, 0.0, "k1"),
  (320, 1, 0.0),
head:
```
@rwightman Thanks a lot for the examples. I guess I'll give the rexnet structure a try and hopefully get it refactored soon.
@rwightman: I got a bit confused doing the refactoring; do you mind if I ask you questions while I try to refactor the architecture? For a start, should the model look like this? Also, how does timm handle the conversion of previous weights (the model state_dict) to the new form?
```
SimpleNet(
(stem): Sequential(
(0): Sequential(
(0): Conv2d(3, 64, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
(1): BatchNorm2d(64, eps=1e-05, momentum=0.05, affine=True, track_running_stats=True)
(2): ReLU(inplace=True)
(3): Dropout2d(p=0.0, inplace=False)
)
)
(features): Sequential(
(stage_0): SimpleBlock(
(block): Sequential(
(ConvBlock_0): Sequential(
(0): Conv2d(64, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(1): BatchNorm2d(128, eps=1e-05, momentum=0.05, affine=True, track_running_stats=True)
(2): ReLU(inplace=True)
(3): Dropout2d(p=0.0, inplace=False)
)
(ConvBlock_1): Sequential(
(0): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(1): BatchNorm2d(128, eps=1e-05, momentum=0.05, affine=True, track_running_stats=True)
(2): ReLU(inplace=True)
(3): Dropout2d(p=0.0, inplace=False)
)
(ConvBlock_2): Sequential(
(0): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(1): BatchNorm2d(128, eps=1e-05, momentum=0.05, affine=True, track_running_stats=True)
(2): ReLU(inplace=True)
(3): Dropout2d(p=0.0, inplace=False)
)
(ConvBlock_3): Sequential(
(0): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(1): BatchNorm2d(128, eps=1e-05, momentum=0.05, affine=True, track_running_stats=True)
(2): ReLU(inplace=True)
(3): Dropout2d(p=0.0, inplace=False)
)
(ConvBlock_4): Sequential(
(0): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(1): BatchNorm2d(128, eps=1e-05, momentum=0.05, affine=True, track_running_stats=True)
(2): ReLU(inplace=True)
(3): Dropout2d(p=0.0, inplace=False)
)
)
)
(stage_1): SimpleBlock(
(block): Sequential(
(maxpool_0): Sequential(
(0): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
(1): Dropout2d(p=0.0, inplace=True)
)
(ConvBlock_1): Sequential(
(0): Conv2d(128, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(1): BatchNorm2d(256, eps=1e-05, momentum=0.05, affine=True, track_running_stats=True)
(2): ReLU(inplace=True)
(3): Dropout2d(p=0.0, inplace=False)
)
(ConvBlock_2): Sequential(
(0): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(1): BatchNorm2d(256, eps=1e-05, momentum=0.05, affine=True, track_running_stats=True)
(2): ReLU(inplace=True)
(3): Dropout2d(p=0.0, inplace=False)
)
(ConvBlock_3): Sequential(
(0): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(1): BatchNorm2d(256, eps=1e-05, momentum=0.05, affine=True, track_running_stats=True)
(2): ReLU(inplace=True)
(3): Dropout2d(p=0.0, inplace=False)
)
)
)
(stage_2): SimpleBlock(
(block): Sequential(
(ConvBlock_0): Sequential(
(0): Conv2d(256, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(1): BatchNorm2d(512, eps=1e-05, momentum=0.05, affine=True, track_running_stats=True)
(2): ReLU(inplace=True)
(3): Dropout2d(p=0.0, inplace=False)
)
)
)
(stage_3): SimpleBlock(
(block): Sequential(
(maxpool_0): Sequential(
(0): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
(1): Dropout2d(p=0.0, inplace=True)
)
(ConvBlock_1): Sequential(
(0): Conv2d(512, 2048, kernel_size=(1, 1), stride=(1, 1), padding=(1, 1))
(1): BatchNorm2d(2048, eps=1e-05, momentum=0.05, affine=True, track_running_stats=True)
(2): ReLU(inplace=True)
(3): Dropout2d(p=0.0, inplace=False)
)
)
)
(stage_4): SimpleBlock(
(block): Sequential(
(ConvBlock_0): Sequential(
(0): Conv2d(2048, 256, kernel_size=(1, 1), stride=(1, 1), padding=(1, 1))
(1): BatchNorm2d(256, eps=1e-05, momentum=0.05, affine=True, track_running_stats=True)
(2): ReLU(inplace=True)
(3): Dropout2d(p=0.0, inplace=False)
)
(ConvBlock_1): Sequential(
(0): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(1): BatchNorm2d(256, eps=1e-05, momentum=0.05, affine=True, track_running_stats=True)
(2): ReLU(inplace=True)
(3): Dropout2d(p=0.0, inplace=False)
)
)
)
)
(head): ClassifierHead(
(global_pool): SelectAdaptivePool2d (pool_type=max, flatten=Flatten(start_dim=1, end_dim=-1))
(fc): Linear(in_features=256, out_features=1000, bias=True)
(flatten): Identity()
)
)
```
@Coderx7 structure looks nice
for conversion I usually write a fn called checkpoint_filter_fn
See:
- https://github.com/rwightman/pytorch-image-models/blob/main/timm/models/efficientformer.py#L473
- https://github.com/rwightman/pytorch-image-models/blob/main/timm/models/levit.py#L696
- https://github.com/rwightman/pytorch-image-models/blob/main/timm/models/edgenext.py#L482
- https://github.com/rwightman/pytorch-image-models/blob/main/timm/models/convnext.py#L387
Mapping a purely linear 0..num_model_layers layout to stages is going to be a bit of fun; you'll probably need to use a regex and find a rule you can increment stage_idx on (i.e. every time the out dim changes). The last resort is just to iterate both state dicts together like the levit example and assume they line up (they should), asserting that the number of elements matches...
that checkpoint filter should be passed to the builder ie https://github.com/rwightman/pytorch-image-models/blob/main/timm/models/levit.py#L765
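A minimal sketch of that last-ditch positional remap, for illustration (hypothetical; assumes both layouts enumerate the same tensors in the same order):
```python
def checkpoint_filter_fn(state_dict, model):
    """Remap an old flat-layout checkpoint onto the refactored stem/stages model."""
    if 'stem.conv.weight' in state_dict:
        return state_dict  # checkpoint already uses the new layout
    new_keys = list(model.state_dict().keys())
    old_items = list(state_dict.items())
    # positional mapping: assume both layouts enumerate the same tensors in order
    assert len(new_keys) == len(old_items), 'cannot remap, element count differs'
    return {new_k: v for new_k, (_, v) in zip(new_keys, old_items)}
```
It then gets passed to build_model_with_cfg via pretrained_filter_fn, as in the levit builder linked above.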
@rwightman I got the checkpoint working; however, for some reason when I try the features_only argument during model creation, it crashes and complains that the return layers are not present in the model:
```
AssertionError: Return layers ({'features.stage_0.block.ConvBlock_0', 'features.stage_3.block.maxpool', 'features.stage_0.block.ConvBlock_2', 'features.stage_1.block.maxpool'}) are not present in model
```
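For context, the failing call is presumably just the standard features_only path; a minimal repro sketch (model name assumed):
```python
import torch
import timm

# features_only wraps the net in a feature extractor that resolves the module
# names recorded in feature_info, which is where the assertion above fires
model = timm.create_model('simplenetv1_5m_m2', features_only=True)
outs = model(torch.randn(1, 3, 224, 224))
print([o.shape for o in outs])
```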
What should I specify as the module name in the feature_info list? What is it looking for? If it helps, this is what the model looks like:
```
SimpleNet(
(stem): ConvBNReLU(
(conv): Conv2d(3, 64, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
(bn): BatchNorm2d(64, eps=1e-05, momentum=0.05, affine=True, track_running_stats=True)
(dropout): Dropout2d(p=0.0, inplace=False)
(relu): ReLU(inplace=True)
)
(features): Sequential(
(stage_0): SimpleBlock(
(block): Sequential(
(ConvBlock_0): ConvBNReLU(
(conv): Conv2d(64, 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
(bn): BatchNorm2d(128, eps=1e-05, momentum=0.05, affine=True, track_running_stats=True)
(dropout): Dropout2d(p=0.0, inplace=False)
(relu): ReLU(inplace=True)
)
(ConvBlock_1): ConvBNReLU(
(conv): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(bn): BatchNorm2d(128, eps=1e-05, momentum=0.05, affine=True, track_running_stats=True)
(dropout): Dropout2d(p=0.0, inplace=False)
(relu): ReLU(inplace=True)
)
(ConvBlock_2): ConvBNReLU(
(conv): Conv2d(128, 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
(bn): BatchNorm2d(128, eps=1e-05, momentum=0.05, affine=True, track_running_stats=True)
(dropout): Dropout2d(p=0.0, inplace=False)
(relu): ReLU(inplace=True)
)
(ConvBlock_3): ConvBNReLU(
(conv): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(bn): BatchNorm2d(128, eps=1e-05, momentum=0.05, affine=True, track_running_stats=True)
(dropout): Dropout2d(p=0.0, inplace=False)
(relu): ReLU(inplace=True)
)
(ConvBlock_4): ConvBNReLU(
(conv): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(bn): BatchNorm2d(128, eps=1e-05, momentum=0.05, affine=True, track_running_stats=True)
(dropout): Dropout2d(p=0.0, inplace=False)
(relu): ReLU(inplace=True)
)
)
)
(stage_1): SimpleBlock(
(block): Sequential(
(maxpool): Sequential(
(0): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
(1): Dropout2d(p=0.0, inplace=True)
)
(ConvBlock_0): ConvBNReLU(
(conv): Conv2d(128, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(bn): BatchNorm2d(256, eps=1e-05, momentum=0.05, affine=True, track_running_stats=True)
(dropout): Dropout2d(p=0.0, inplace=False)
(relu): ReLU(inplace=True)
)
(ConvBlock_1): ConvBNReLU(
(conv): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(bn): BatchNorm2d(256, eps=1e-05, momentum=0.05, affine=True, track_running_stats=True)
(dropout): Dropout2d(p=0.0, inplace=False)
(relu): ReLU(inplace=True)
)
(ConvBlock_2): ConvBNReLU(
(conv): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(bn): BatchNorm2d(256, eps=1e-05, momentum=0.05, affine=True, track_running_stats=True)
(dropout): Dropout2d(p=0.0, inplace=False)
(relu): ReLU(inplace=True)
)
)
)
(stage_2): SimpleBlock(
(block): Sequential(
(ConvBlock_0): ConvBNReLU(
(conv): Conv2d(256, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(bn): BatchNorm2d(512, eps=1e-05, momentum=0.05, affine=True, track_running_stats=True)
(dropout): Dropout2d(p=0.0, inplace=False)
(relu): ReLU(inplace=True)
)
)
)
(stage_3): SimpleBlock(
(block): Sequential(
(maxpool): Sequential(
(0): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
(1): Dropout2d(p=0.0, inplace=True)
)
(ConvBlock_0): ConvBNReLU(
(conv): Conv2d(512, 2048, kernel_size=(1, 1), stride=(1, 1), padding=(1, 1))
(bn): BatchNorm2d(2048, eps=1e-05, momentum=0.05, affine=True, track_running_stats=True)
(dropout): Dropout2d(p=0.0, inplace=False)
(relu): ReLU(inplace=True)
)
)
)
(stage_4): SimpleBlock(
(block): Sequential(
(ConvBlock_0): ConvBNReLU(
(conv): Conv2d(2048, 256, kernel_size=(1, 1), stride=(1, 1), padding=(1, 1))
(bn): BatchNorm2d(256, eps=1e-05, momentum=0.05, affine=True, track_running_stats=True)
(dropout): Dropout2d(p=0.0, inplace=False)
(relu): ReLU(inplace=True)
)
(ConvBlock_1): ConvBNReLU(
(conv): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(bn): BatchNorm2d(256, eps=1e-05, momentum=0.05, affine=True, track_running_stats=True)
(dropout): Dropout2d(p=0.0, inplace=False)
(relu): ReLU(inplace=True)
)
)
)
)
(head): ClassifierHead(
(global_pool): SelectAdaptivePool2d (pool_type=max, flatten=Flatten(start_dim=1, end_dim=-1))
(fc): Linear(in_features=256, out_features=1000, bias=True)
(flatten): Identity()
)
)
```
feature_info:
```
[{'num_chs': 64, 'reduction': 2, 'module': 'stem'},
 {'num_chs': 128, 'reduction': 4, 'module': 'features.stage_0.block.ConvBlock_0'},
 {'num_chs': 128, 'reduction': 8, 'module': 'features.stage_0.block.ConvBlock_2'},
 {'num_chs': 128, 'reduction': 16, 'module': 'features.stage_1.block.maxpool'},
 {'num_chs': 512, 'reduction': 32, 'module': 'features.stage_3.block.maxpool'}]
```
And this is state_dict.keys():
```
stem.conv.weight
stem.conv.bias
stem.bn.weight
stem.bn.bias
stem.bn.running_mean
stem.bn.running_var
stem.bn.num_batches_tracked
features.stage_0.block.ConvBlock_0.conv.weight
features.stage_0.block.ConvBlock_0.conv.bias
features.stage_0.block.ConvBlock_0.bn.weight
features.stage_0.block.ConvBlock_0.bn.bias
features.stage_0.block.ConvBlock_0.bn.running_mean
features.stage_0.block.ConvBlock_0.bn.running_var
features.stage_0.block.ConvBlock_0.bn.num_batches_tracked
features.stage_0.block.ConvBlock_1.conv.weight
features.stage_0.block.ConvBlock_1.conv.bias
features.stage_0.block.ConvBlock_1.bn.weight
features.stage_0.block.ConvBlock_1.bn.bias
features.stage_0.block.ConvBlock_1.bn.running_mean
features.stage_0.block.ConvBlock_1.bn.running_var
features.stage_0.block.ConvBlock_1.bn.num_batches_tracked
features.stage_0.block.ConvBlock_2.conv.weight
features.stage_0.block.ConvBlock_2.conv.bias
features.stage_0.block.ConvBlock_2.bn.weight
features.stage_0.block.ConvBlock_2.bn.bias
features.stage_0.block.ConvBlock_2.bn.running_mean
features.stage_0.block.ConvBlock_2.bn.running_var
features.stage_0.block.ConvBlock_2.bn.num_batches_tracked
features.stage_0.block.ConvBlock_3.conv.weight
features.stage_0.block.ConvBlock_3.conv.bias
features.stage_0.block.ConvBlock_3.bn.weight
features.stage_0.block.ConvBlock_3.bn.bias
features.stage_0.block.ConvBlock_3.bn.running_mean
features.stage_0.block.ConvBlock_3.bn.running_var
features.stage_0.block.ConvBlock_3.bn.num_batches_tracked
features.stage_0.block.ConvBlock_4.conv.weight
features.stage_0.block.ConvBlock_4.conv.bias
features.stage_0.block.ConvBlock_4.bn.weight
features.stage_0.block.ConvBlock_4.bn.bias
features.stage_0.block.ConvBlock_4.bn.running_mean
features.stage_0.block.ConvBlock_4.bn.running_var
features.stage_0.block.ConvBlock_4.bn.num_batches_tracked
features.stage_1.block.ConvBlock_0.conv.weight
features.stage_1.block.ConvBlock_0.conv.bias
features.stage_1.block.ConvBlock_0.bn.weight
features.stage_1.block.ConvBlock_0.bn.bias
features.stage_1.block.ConvBlock_0.bn.running_mean
features.stage_1.block.ConvBlock_0.bn.running_var
features.stage_1.block.ConvBlock_0.bn.num_batches_tracked
features.stage_1.block.ConvBlock_1.conv.weight
features.stage_1.block.ConvBlock_1.conv.bias
features.stage_1.block.ConvBlock_1.bn.weight
features.stage_1.block.ConvBlock_1.bn.bias
features.stage_1.block.ConvBlock_1.bn.running_mean
features.stage_1.block.ConvBlock_1.bn.running_var
features.stage_1.block.ConvBlock_1.bn.num_batches_tracked
features.stage_1.block.ConvBlock_2.conv.weight
features.stage_1.block.ConvBlock_2.conv.bias
features.stage_1.block.ConvBlock_2.bn.weight
features.stage_1.block.ConvBlock_2.bn.bias
features.stage_1.block.ConvBlock_2.bn.running_mean
features.stage_1.block.ConvBlock_2.bn.running_var
features.stage_1.block.ConvBlock_2.bn.num_batches_tracked
features.stage_2.block.ConvBlock_0.conv.weight
features.stage_2.block.ConvBlock_0.conv.bias
features.stage_2.block.ConvBlock_0.bn.weight
features.stage_2.block.ConvBlock_0.bn.bias
features.stage_2.block.ConvBlock_0.bn.running_mean
features.stage_2.block.ConvBlock_0.bn.running_var
features.stage_2.block.ConvBlock_0.bn.num_batches_tracked
features.stage_3.block.ConvBlock_0.conv.weight
features.stage_3.block.ConvBlock_0.conv.bias
features.stage_3.block.ConvBlock_0.bn.weight
features.stage_3.block.ConvBlock_0.bn.bias
features.stage_3.block.ConvBlock_0.bn.running_mean
features.stage_3.block.ConvBlock_0.bn.running_var
features.stage_3.block.ConvBlock_0.bn.num_batches_tracked
features.stage_4.block.ConvBlock_0.conv.weight
features.stage_4.block.ConvBlock_0.conv.bias
features.stage_4.block.ConvBlock_0.bn.weight
features.stage_4.block.ConvBlock_0.bn.bias
features.stage_4.block.ConvBlock_0.bn.running_mean
features.stage_4.block.ConvBlock_0.bn.running_var
features.stage_4.block.ConvBlock_0.bn.num_batches_tracked
features.stage_4.block.ConvBlock_1.conv.weight
features.stage_4.block.ConvBlock_1.conv.bias
features.stage_4.block.ConvBlock_1.bn.weight
features.stage_4.block.ConvBlock_1.bn.bias
features.stage_4.block.ConvBlock_1.bn.running_mean
features.stage_4.block.ConvBlock_1.bn.running_var
features.stage_4.block.ConvBlock_1.bn.num_batches_tracked
head.fc.weight
head.fc.bias
```
@Coderx7 feature info should be filled with the module name of the 'deepest' layer for a given stride, so usually the nn.Module before a downsample layer. In this case, you'd want stem, features.stage_0, features.stage_2, features.stage_4 ...aaand I just noticed there is a stride 2 on ConvBlock_2 of stage_0; if that's supposed to be there, it should split into a different stage (stages are delimited by strided layers and, in many cases, shifts in width)
@rwightman Thanks, but there are two things here. First, I believe I did just that but still got the same error anyway! I'll give it another try and see how it goes.
Second, concerning the stages: this architecture basically allows dynamic strides for any layer, but especially the first 4 (I can remove that and make it static, as there are only two pretrained variants with two stride modes!).
The two trained variants use mode 1 and mode 2 strides, which basically downsample the early layers at a specific rate, so during ImageNet training you get some kind of leverage over the performance/accuracy ratio in its simplest form.
Like here, one variant uses strides of 2,2,1,2 and the other uses 2,2,2, with the rest being 1s.
If I create stages based on the downsampling of features, then stem, layer1, and layer3 should all be in unique stages, right? Like stage 1 to stage 2 (excluding the stem)?
In the model create helper you should enable flatten_sequential and ensure the default # of out indices matches the net:
```python
out_indices = kwargs.pop('out_indices', (0, 1, 2, 3))
model = build_model_with_cfg(
    EfficientFormerV2, variant, pretrained,
    feature_cfg=dict(flatten_sequential=True, out_indices=out_indices),
    **kwargs)
```
Most models have some sort of pattern and systematic spacing between the strided layers, so I figured that'd be the same here for the configs. I realize they could be put anywhere, but it doesn't seem that useful to have no depth between strides.
The concept of a stage is essentially to encapsulate the layers at the same stride; sometimes there are stages without any stride change but with a different width, conv type (depthwise vs not), or other trait in common across all layer repeats in the stage.
@rwightman Thanks a lot. That's a fair point; however, this was never meant to scale that way. It was designed with something completely different in mind: to show how one could maximize a network's performance under constraints (fixed parameter count, depth, and basic operators) while keeping everything simple and not resorting to any complex strategies.
That said, I thankfully seem to have pretty much done everything; the only thing that still seems to be an issue is that the last stage has a bigger feature map size (thus a smaller reduction) than its predecessor, and it seems timm has issues with that. Currently this is how my feature_info looks:
```
[{'num_chs': 64, 'reduction': 2, 'module': 'stem'},
 {'num_chs': 128, 'reduction': 4, 'module': 'features.stage_0'},
 {'num_chs': 128, 'reduction': 8, 'module': 'features.stage_1'},
 {'num_chs': 512, 'reduction': 16, 'module': 'features.stage_2'},
 {'num_chs': 2048, 'reduction': 24, 'module': 'features.stage_3'},
 {'num_chs': 256, 'reduction': 20, 'module': 'features.stage_4'}]
```
How should I handle this, other than merging the last two stages? Thanks a lot in advance.
@rwightman Would you kindly have a look here and tell me what to do for this last part? Thanks.
@Coderx7 reduction is the spatial reduction (from the input image size); it's only complained about if it decreases. It's not used directly by timm, but some downstream users want to know it for calculating interpolation ratios.
If you look at the rexnet example, it should *= 2 every time there is a strided layer; the majority of ImageNet networks are stride 32. num_chs has no restrictions on increasing/decreasing, though.
@rwightman I thought the idea was to provide feature maps of different sizes for downstream usage, not to capture only the stride-2 reductions per se.
Currently, unless the assert in
```python
assert 'reduction' in fi and fi['reduction'] >= prev_reduction
```
is disabled, this won't work, so I need to do one of the following:
- have 4 stages and only include reduction rates for 3 of them (that is, don't include the last stage's reduction rate in feature_info);
- have 3 stages, merging the last two (3 and 4), with only 3 reduction rates in total;
- alter the FeatureInfo class to take a new argument that allows cases like this.

The issue with the first option is that users will lose the last two layers of the network if they opt to use features_only, but other than that, normal usage stays the same. The issue with the second option is that users can't fully experiment with stage 4; they'd have to do it manually, which nullifies the purpose of features_only, I guess. The last option seems like a good idea to me (a rough sketch of what it could look like follows below): with a default value that works for all current models, the current behavior is maintained, but cases such as this also become usable; unless that check has more significance and affects lots of other parts of the library that I'm not aware of yet.
So which option should I take, so I can hopefully finish this up? Thanks a lot in advance.
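To make the third option concrete, a hypothetical sketch (paraphrasing rather than quoting the current check; the helper name is made up):
```python
# hypothetical: relax the monotonic-reduction check behind a flag, with the
# strict default preserving current behavior for all existing models
def validate_feature_info(feature_info, strict_reduction=True):
    prev_reduction = 1
    for fi in feature_info:
        assert 'num_chs' in fi and 'module' in fi and 'reduction' in fi
        if strict_reduction:
            assert fi['reduction'] >= prev_reduction
        prev_reduction = fi['reduction']
```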
@rwightman I'd really appreciate it if you could kindly have a look and decide on the next step, so I can finalize the changes accordingly and have this finished.
@Coderx7 sorry, I have a lot on my plate right now, wrapping up a few things before I'm on vacation for a bit. I'm going to have to leave this one hanging for a while, as I don't think we're on the same page.
The net is simple, as per its name, and I didn't see any merging, upscaling, or anything else that could result in a feature map increasing in size; it's reducing by 2 at each downscale. I feel we're lost in semantics.
@rwightman Out of the last three conv layers, two (the 2048 and 256 ones) have kernel_size=1 and use a padding of 1. That causes the feature map size to increase from 7x7 (after the downsampling) to 9x9 (after the first 1x1 conv); the next 1x1 conv layer increases that to 11x11, thus causing the effective reduction to vary that way. OK, no problem, please take your time and let's continue this when you are free. I really do appreciate you taking the time despite your busy schedule.
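A quick way to see the effect (channel counts follow the model dump above):
```python
import torch
import torch.nn as nn

# a 1x1 conv with padding=1 grows each spatial dim by 2:
# 7x7 -> 9x9 after the first 1x1 conv, 9x9 -> 11x11 after the second
x = torch.randn(1, 512, 7, 7)
conv_a = nn.Conv2d(512, 2048, kernel_size=1, padding=1)
conv_b = nn.Conv2d(2048, 256, kernel_size=1, padding=1)
print(conv_a(x).shape)          # torch.Size([1, 2048, 9, 9])
print(conv_b(conv_a(x)).shape)  # torch.Size([1, 256, 11, 11])
```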
@rwightman May I ask if your vacation is over, and whether we can hopefully get this last step worked out?
@Coderx7 I've been trying to get on top of my own tasks since getting back. I looked at this a bit more, and I'm not really liking the padding issue that is the reason for the expanding dim... having a padding of 1 for a 1x1 conv makes zero sense to me. It's adding data to the signal path that's not meaningful. So, I'm hesitant to add it at all with quirks like that present...
@rwightman Thanks, I really appreciate it, knowing how busy your schedule is. It's not really any different from using (zero-)padding on the input. This happened by accident, but after I noticed it, the few experiments I ran afterward performed better than the no-padding versions; it looked to me as if it creates a kind of regularization effect. I can run more experiments to further validate this point (or the lack thereof, if that happens to be the case ultimately), if that's your concern. My main concern is that it takes a lot of time to train these models again (it took me several months, as I don't have access to anything powerful, just a single GPU), but I'll try my best to address your concerns.
@Coderx7 in deep learning it would seem almost any extra activations (or parameters) can/will be used to improve the loss in optimization, but I'd argue these aren't particularly useful ones (and they're possibly harmful for segmentation/obj detection, as they'd add a 'border' effect at the feature level). They get blended back into the signal via the subsequent 3x3 conv. I did test these, and per the goal of running fast, the extra padding does have a measurable speed impact (not significant, but there).
The rest of the net is fine, simple as per the name, which isn't bad to have in timm, as such nets can be the best option for some tasks. If the padding issue is fixed (padding == kernel_size // 2 should do fine for this net) and the models retrained, I'd definitely include it with the tweaks mentioned.
Do you have hparams for these? I have two idle 2x Titan RTX machines right now, I could put them to work if you push any outstanding changes re arch to this PR.
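For concreteness, a hedged sketch of the conv block with the suggested padding rule applied (module names follow the dumps earlier in the thread; the op ordering is assumed from the first printout):
```python
import torch.nn as nn

class ConvBNReLU(nn.Module):
    def __init__(self, in_chs, out_chs, kernel_size=3, stride=1, drop_rate=0.0):
        super().__init__()
        # padding == kernel_size // 2: 3x3 -> 1, 1x1 -> 0, so spatial dims are
        # preserved at stride 1 and the feature maps no longer expand
        self.conv = nn.Conv2d(in_chs, out_chs, kernel_size, stride, padding=kernel_size // 2)
        self.bn = nn.BatchNorm2d(out_chs, momentum=0.05)
        self.relu = nn.ReLU(inplace=True)
        self.dropout = nn.Dropout2d(p=drop_rate)

    def forward(self, x):
        return self.dropout(self.relu(self.bn(self.conv(x))))
```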
@rwightman That would be great, thanks a lot :) I was actually planning to test them on detection/segmentation as well, to see how the padding affects the results in different scenarios. I'm glad you kindly shared your experience with me. I really, really appreciate it. :)
OK, I'll try to push all the changes so far. The model is refactored into stages and uses no padding for the 1x1 convs; everything should hopefully work. Concerning the hparams, I didn't have much to test; I went with nearly the exact same hparams for all architectures, which I'll describe in a moment.
They all use the same settings and only differ in weight decay and dropouts.
I start off by training all models without any dropout and, aside from keeping the top 10 checkpoints, save a checkpoint every 50 epochs (i.e. the 100th, 150th, 200th, ... checkpoints), so that later on I can resume from these checkpoints with dropout and get faster, better results. Then, at the end, I take the average of the best checkpoints.
If I don't get the accuracy I like, I try resuming with no label smoothing and finally average the best of these to get the final results.
That's all.
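The averaging itself can be done with timm's avg_checkpoints.py script; a sketch of the invocation (input/output paths assumed):
```
python avg_checkpoints.py --input output/train/simplenetv1_5m_m2 --output simplenetv1_5m_m2_avg.pth
```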
The basic command arguments look like this; only the model name, weight decay, and dropout change:
```
./distributed_train.sh 1 /media/hossein/SSD_IMG/ImageNet_DataSet/ --model simplenetv1_5m_m1 -b 256 --sched step --epochs 900 --decay-epochs 1 --decay-rate 0.981 --opt rmsproptf --opt-eps .001 -j 20 --warmup-lr 1e-3 --weight-decay 0.00003 --drop 0.0 --amp --lr .0195 --pin-mem --channels-last --model-ema --model-ema-decay 0.9999
```
For example, for the 5m variants, this is how I start the training:
```
./distributed_train.sh 1 /media/hossein/SSD_IMG/ImageNet_DataSet/ --model simplenetv1_5m_m1 -b 256 --sched step --epochs 900 --decay-epochs 1 --decay-rate 0.981 --opt rmsproptf --opt-eps .001 -j 20 --warmup-lr 1e-3 --weight-decay 0.00003 --drop 0.0 --amp --lr .0195 --pin-mem --channels-last --model-ema --model-ema-decay 0.9999
./distributed_train.sh 1 /media/hossein/SSD_IMG/ImageNet_DataSet/ --model simplenetv1_5m_m2 -b 256 --sched step --epochs 900 --decay-epochs 1 --decay-rate 0.981 --opt rmsproptf --opt-eps .001 -j 20 --warmup-lr 1e-3 --weight-decay 0.00003 --drop 0.0 --amp --lr .0195 --pin-mem --channels-last --model-ema --model-ema-decay 0.9999
```
When the model stops improving, I resume from checkpoint 250 and this time use dropout rates like this (the same for both variants):
```
--weight-decay 0.00002 --model-kwargs drop_rates='{"11":0.02,"12":0.05,"13":0.05}'
```
9m:
For the m1 variant, I trained with `--weight-decay 0.000035 --model-kwargs drop_rates='{"12":0.05,"13":0.05}'` from the start, and when it plateaued, I resumed from epoch 300 with slightly more dropout (and less wd), `--weight-decay 0.00003 --model-kwargs drop_rates='{"11":0.05,"12":0.05,"13":0.05}'`, and achieved 73.38.
For the m2 variant, I started with a wd of 0.000035 for 200 epochs, then continued with `--weight-decay 0.00003 --model-kwargs drop_rates='{"12":0.05,"13":0.05}'` until it stopped improving (at epoch 413 it achieved 73.678). Then I resumed from epoch 330 with slightly higher dropout rates and less wd, just like the m1 variant:
`--weight-decay 0.00003 --model-kwargs drop_rates='{"11":0.05,"12":0.05,"13":0.05}'`. This got me from 73.678 to 73.95. Then I did this again (i.e. repeated the last run) without label smoothing and averaged the best models, choosing the one with the higher accuracy. This gave me 74.17.
```
./distributed_train.sh 1 /media/hossein/SSD_IMG/ImageNet_DataSet/ --model simplenetv1_9m_m1 -b 256 --sched step --epochs 900 --decay-epochs 1 --decay-rate 0.981 --opt rmsproptf --opt-eps .001 -j 20 --warmup-lr 1e-3 --weight-decay 0.000035 --drop 0.0 --amp --lr .0195 --pin-mem --channels-last --model-ema --model-ema-decay 0.9999 --model-kwargs drop_rates='{"12":0.05,"13":0.05}'
./distributed_train.sh 1 /media/hossein/SSD_IMG/ImageNet_DataSet/ --model simplenetv1_9m_m2 -b 256 --sched step --epochs 900 --decay-epochs 1 --decay-rate 0.981 --opt rmsproptf --opt-eps .001 -j 20 --warmup-lr 1e-3 --weight-decay 0.000035 --drop 0.0 --amp --lr .0195 --pin-mem --channels-last --model-ema --model-ema-decay 0.9999
```
3m: For this variant, I used no dropout; I simply trained normally until it plateaued, then resumed from checkpoint 332 (basically around the best checkpoint up to that point) with no label smoothing, and finally averaged the best models to get a higher accuracy.
```
./distributed_train.sh 1 /media/hossein/SSD_IMG/ImageNet_DataSet/ --model simplenetv1_small_m1_075 -b 256 --sched step --epochs 900 --decay-epochs 1 --decay-rate 0.981 --opt rmsproptf --opt-eps .001 -j 20 --warmup-lr 1e-3 --weight-decay 0.00003 --drop 0.0 --amp --lr .0195 --pin-mem --channels-last --model-ema --model-ema-decay 0.9999
./distributed_train.sh 1 /media/hossein/SSD_IMG/ImageNet_DataSet/ --model simplenetv1_small_m2_075 -b 256 --sched step --epochs 900 --decay-epochs 1 --decay-rate 0.981 --opt rmsproptf --opt-eps .001 -j 20 --warmup-lr 1e-3 --weight-decay 0.00003 --drop 0.0 --amp --lr .0195 --pin-mem --channels-last --model-ema --model-ema-decay 0.9999
```
1.5m: For this variant, I simply train normally with a wd of 0.00001. The m2 variant achieves 61.386 this way. The m1 variant achieves 60.814 with no label smoothing (with label smoothing it gets 60.58, so it could be within the margin of error, as I only trained them once). I then take the average of the best models for each.
```
./distributed_train.sh 1 /media/hossein/SSD_IMG/ImageNet_DataSet/ --model simplenetv1_small_m1_05 -b 256 --sched step --epochs 900 --decay-epochs 1 --decay-rate 0.981 --opt rmsproptf --opt-eps .001 -j 20 --warmup-lr 1e-3 --weight-decay 0.00001 --drop 0.0 --amp --lr .0195 --pin-mem --channels-last --model-ema --model-ema-decay 0.9999
./distributed_train.sh 1 /media/hossein/SSD_IMG/ImageNet_DataSet/ --model simplenetv1_small_m2_05 -b 256 --sched step --epochs 900 --decay-epochs 1 --decay-rate 0.981 --opt rmsproptf --opt-eps .001 -j 20 --warmup-lr 1e-3 --weight-decay 0.00001 --drop 0.0 --amp --lr .0195 --pin-mem --channels-last --model-ema --model-ema-decay 0.9999
```
OK, I pushed the changes; I hope I didn't do anything wrong.
@rwightman Hi, hope you are doing great. I finally finished training the new weights and have just updated the PR. Would you please kindly tell me what you think? Thanks a lot in advance.
@rwightman It's been a few months since my last changes; could you kindly tell me if everything is OK or if I'm missing something here? I'd really like to make this happen, if you're willing, of course. Thanks a lot in advance.