pytorch-image-models
add simplenet architecture
This pull request adds the SimpleNet architecture. SimpleNetV1 is a 2016 architecture: a plain CNN built from only the most basic operators. It outperformed many deeper and more complex architectures such as VGGNet and ResNet on several benchmark datasets. These are its results on the ImageNet dataset.
- added simplenet.py to timm/models
- added simplenet.md to docs/models
- added an entry to docs/models.md
Here is some more information on how the models perform, taken from our official PyTorch repository:
| Model | #Params | ImageNet (top-1 / top-5) | ImageNet-Real-Labels (top-1 / top-5) |
|---|---|---|---|
| simplenetv1_9m_m2(36.3 MB) | 9.5m | 74.23 / 91.748 | 81.22 / 94.756 |
| simplenetv1_5m_m2(22 MB) | 5.7m | 72.03 / 90.324 | 79.328/ 93.714 |
| simplenetv1_small_m2_075(12.6 MB) | 3m | 68.506/ 88.15 | 76.283/ 92.02 |
| simplenetv1_small_m2_05(5.78 MB) | 1.5m | 61.67 / 83.488 | 69.31 / 88.195 |
SimpleNet performs very decently: it outperforms VGGNet, variants of ResNet, and MobileNets (v1-v3),
and it's pretty fast as well, all while using a plain old CNN!
Here's an example benchmark run on the small variants of SimpleNet and some other well-known architectures such as the MobileNets.
The small variants of SimpleNet consistently achieve a high performance/accuracy ratio:
| model | samples_per_sec | param_count | top1 | top5 |
|---|---|---|---|---|
| simplenetv1_small_m1_05 | 3100.26 | 1.51 | 61.122 | 82.988 |
| mobilenetv3_small_050 | 3082.85 | 1.59 | 57.89 | 80.194 |
| lcnet_050 | 2713.02 | 1.88 | 63.1 | 84.382 |
| simplenetv1_small_m2_05 | 2536.16 | 1.51 | 61.67 | 83.488 |
| mobilenetv3_small_075 | 1793.42 | 2.04 | 65.242 | 85.438 |
| tf_mobilenetv3_small_075 | 1689.53 | 2.04 | 65.714 | 86.134 |
| simplenetv1_small_m1_075 | 1626.87 | 3.29 | 67.784 | 87.718 |
| tf_mobilenetv3_small_minimal_100 | 1316.91 | 2.04 | 62.908 | 84.234 |
| simplenetv1_small_m2_075 | 1313.6 | 3.29 | 68.506 | 88.15 |
| mobilenetv3_small_100 | 1261.09 | 2.54 | 67.656 | 87.634 |
| tf_mobilenetv3_small_100 | 1213.03 | 2.54 | 67.924 | 87.664 |
| mnasnet_small | 1089.33 | 2.03 | 66.206 | 86.508 |
| mobilenetv2_050 | 857.66 | 1.97 | 65.942 | 86.082 |
| dla46_c | 537.08 | 1.3 | 64.866 | 86.294 |
| dla46x_c | 323.03 | 1.07 | 65.97 | 86.98 |
| dla60x_c | 301.71 | 1.32 | 67.892 | 88.426 |
And this is a sample for larger models:
| model | samples_per_sec | param_count | top1 | top5 |
|---|---|---|---|---|
| simplenetv1_small_m1_075 | 2893.91 | 3.29 | 67.784 | 87.718 |
| simplenetv1_small_m2_075 | 2478.41 | 3.29 | 68.506 | 88.15 |
| vit_tiny_r_s16_p8_224 | 2337.23 | 6.34 | 71.792 | 90.822 |
| simplenetv1_5m_m1 | 2105.06 | 5.75 | 71.548 | 89.94 |
| simplenetv1_5m_m2 | 1754.25 | 5.75 | 72.03 | 90.324 |
| resnet18 | 1750.38 | 11.69 | 69.744 | 89.082 |
| regnetx_006 | 1620.25 | 6.2 | 73.86 | 91.672 |
| mobilenetv3_large_100 | 1491.86 | 5.48 | 75.766 | 92.544 |
| tf_mobilenetv3_large_minimal_100 | 1476.29 | 3.92 | 72.25 | 90.63 |
| tf_mobilenetv3_large_075 | 1474.77 | 3.99 | 73.436 | 91.344 |
| ghostnet_100 | 1390.19 | 5.18 | 73.974 | 91.46 |
| tinynet_b | 1345.82 | 3.73 | 74.976 | 92.184 |
| tf_mobilenetv3_large_100 | 1325.06 | 5.48 | 75.518 | 92.604 |
| mnasnet_100 | 1183.69 | 4.38 | 74.658 | 92.112 |
| mobilenetv2_100 | 1101.58 | 3.5 | 72.97 | 91.02 |
| simplenetv1_9m_m1 | 1048.91 | 9.51 | 73.792 | 91.486 |
| resnet34 | 1030.4 | 21.8 | 75.114 | 92.284 |
| deit_tiny_patch16_224 | 990.85 | 5.72 | 72.172 | 91.114 |
| efficientnet_lite0 | 977.76 | 4.65 | 75.476 | 92.512 |
| simplenetv1_9m_m2 | 900.45 | 9.51 | 74.23 | 91.748 |
| tf_efficientnet_lite0 | 876.66 | 4.65 | 74.832 | 92.174 |
| dla34 | 834.35 | 15.74 | 74.62 | 92.072 |
| mobilenetv2_110d | 824.4 | 4.52 | 75.038 | 92.184 |
| resnet26 | 771.1 | 16 | 75.3 | 92.578 |
| repvgg_b0 | 751.01 | 15.82 | 75.16 | 92.418 |
| crossvit_9_240 | 606.2 | 8.55 | 73.96 | 91.968 |
| vgg11 | 576.32 | 132.86 | 69.028 | 88.626 |
| vit_base_patch32_224_sam | 561.99 | 88.22 | 73.694 | 91.01 |
| vgg11_bn | 504.29 | 132.87 | 70.36 | 89.802 |
| densenet121 | 435.3 | 7.98 | 75.584 | 92.652 |
| vgg13 | 363.69 | 133.05 | 69.926 | 89.246 |
| vgg13_bn | 315.85 | 133.05 | 71.594 | 90.376 |
| vgg16 | 302.84 | 138.36 | 71.59 | 90.382 |
| vgg16_bn | 265.99 | 138.37 | 73.35 | 91.504 |
| vgg19 | 259.82 | 143.67 | 72.366 | 90.87 |
| vgg19_bn | 229.77 | 143.68 | 74.214 | 91.848 |
Note:
These benchmarks were run on a PC with a GTX 1080, PyTorch 1.11, in fp32 with NCHW layout.
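For anyone wanting to reproduce numbers like these, timm's benchmark.py script should work; a sketch of the invocation (only the model/bench/batch-size flags shown, everything else left at defaults):
```
python benchmark.py --model simplenetv1_small_m2_05 --bench inference --batch-size 256
```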
I hope this is useful for the community.
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.
@Coderx7 thanks for the PR, looks like a decent lightweight model, but the big stack of layers in a single sequential doesn't really line up with other timm models; it makes it hard to support many default features like feature extraction at strided stage boundaries, layer grouping, block-based grad checkpointing, etc....
Any chance you could organize the net into stem + stages[blocks[]] ?
@rwightman my pleasure. I tried to follow your vgg implementation and implement everything that was there.
I'm not familiar with the stem + stages structure; could you elaborate a bit more on this?
@Coderx7 RexNet is probably the simplest example, ResNetV2 and RegNet are decent examples as well...
- https://github.com/rwightman/pytorch-image-models/blob/main/timm/models/rexnet.py
- https://github.com/rwightman/pytorch-image-models/blob/main/timm/models/resnetv2.py
- https://github.com/rwightman/pytorch-image-models/blob/main/timm/models/regnet.py
I also just refactored Levit to use stages (for feat extraction support), and it's similar to this net in that there aren't strided convs, but a 'downsample' layer that'd be at the start of strided stages.
- https://github.com/rwightman/pytorch-image-models/blob/main/timm/models/levit.py
So looking at the net layout, two possible structures stand out:
```
stem:
  (128, 1, 0.0),
stage[0]:
  (192, 1, 0.0),
  (192, 1, 0.0),
  (192, 1, 0.0),
  (192, 1, 0.0),
  (192, 1, 0.0),
stage[1]:
  ("p", 2, 0.0),
  (320, 1, 0.0),
  (320, 1, 0.0),
  (320, 1, 0.0),
  (640, 1, 0.0),
stage[2]:
  ("p", 2, 0.0),
  (2560, 1, 0.0, "k1"),
  (320, 1, 0.0, "k1"),
  (320, 1, 0.0),
head:
```
or:
```
stem:
  (128, 1, 0.0),
stage[0]:
  (192, 1, 0.0),
  (192, 1, 0.0),
  (192, 1, 0.0),
  (192, 1, 0.0),
  (192, 1, 0.0),
stage[1]:
  ("p", 2, 0.0),
  (320, 1, 0.0),
  (320, 1, 0.0),
  (320, 1, 0.0),
stage[2]:
  (640, 1, 0.0),
stage[3]:
  ("p", 2, 0.0),
  (2560, 1, 0.0, "k1"),
stage[4]:
  (320, 1, 0.0, "k1"),
  (320, 1, 0.0),
head:
```
@rwightman Thanks a lot for the examples. I guess I'll give the rexnet structure a try and hopefully get it refactored soon.
@rwightman: I got a bit confused doing the refactoring; do you mind if I ask you questions while I try to refactor the architecture? For a start, should the model look like this? Also, how does timm handle the conversion of previous weights (the model state_dict) to the new form?
```
SimpleNet(
(stem): Sequential(
(0): Sequential(
(0): Conv2d(3, 64, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
(1): BatchNorm2d(64, eps=1e-05, momentum=0.05, affine=True, track_running_stats=True)
(2): ReLU(inplace=True)
(3): Dropout2d(p=0.0, inplace=False)
)
)
(features): Sequential(
(stage_0): SimpleBlock(
(block): Sequential(
(ConvBlock_0): Sequential(
(0): Conv2d(64, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(1): BatchNorm2d(128, eps=1e-05, momentum=0.05, affine=True, track_running_stats=True)
(2): ReLU(inplace=True)
(3): Dropout2d(p=0.0, inplace=False)
)
(ConvBlock_1): Sequential(
(0): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(1): BatchNorm2d(128, eps=1e-05, momentum=0.05, affine=True, track_running_stats=True)
(2): ReLU(inplace=True)
(3): Dropout2d(p=0.0, inplace=False)
)
(ConvBlock_2): Sequential(
(0): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(1): BatchNorm2d(128, eps=1e-05, momentum=0.05, affine=True, track_running_stats=True)
(2): ReLU(inplace=True)
(3): Dropout2d(p=0.0, inplace=False)
)
(ConvBlock_3): Sequential(
(0): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(1): BatchNorm2d(128, eps=1e-05, momentum=0.05, affine=True, track_running_stats=True)
(2): ReLU(inplace=True)
(3): Dropout2d(p=0.0, inplace=False)
)
(ConvBlock_4): Sequential(
(0): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(1): BatchNorm2d(128, eps=1e-05, momentum=0.05, affine=True, track_running_stats=True)
(2): ReLU(inplace=True)
(3): Dropout2d(p=0.0, inplace=False)
)
)
)
(stage_1): SimpleBlock(
(block): Sequential(
(maxpool_0): Sequential(
(0): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
(1): Dropout2d(p=0.0, inplace=True)
)
(ConvBlock_1): Sequential(
(0): Conv2d(128, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(1): BatchNorm2d(256, eps=1e-05, momentum=0.05, affine=True, track_running_stats=True)
(2): ReLU(inplace=True)
(3): Dropout2d(p=0.0, inplace=False)
)
(ConvBlock_2): Sequential(
(0): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(1): BatchNorm2d(256, eps=1e-05, momentum=0.05, affine=True, track_running_stats=True)
(2): ReLU(inplace=True)
(3): Dropout2d(p=0.0, inplace=False)
)
(ConvBlock_3): Sequential(
(0): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(1): BatchNorm2d(256, eps=1e-05, momentum=0.05, affine=True, track_running_stats=True)
(2): ReLU(inplace=True)
(3): Dropout2d(p=0.0, inplace=False)
)
)
)
(stage_2): SimpleBlock(
(block): Sequential(
(ConvBlock_0): Sequential(
(0): Conv2d(256, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(1): BatchNorm2d(512, eps=1e-05, momentum=0.05, affine=True, track_running_stats=True)
(2): ReLU(inplace=True)
(3): Dropout2d(p=0.0, inplace=False)
)
)
)
(stage_3): SimpleBlock(
(block): Sequential(
(maxpool_0): Sequential(
(0): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
(1): Dropout2d(p=0.0, inplace=True)
)
(ConvBlock_1): Sequential(
(0): Conv2d(512, 2048, kernel_size=(1, 1), stride=(1, 1), padding=(1, 1))
(1): BatchNorm2d(2048, eps=1e-05, momentum=0.05, affine=True, track_running_stats=True)
(2): ReLU(inplace=True)
(3): Dropout2d(p=0.0, inplace=False)
)
)
)
(stage_4): SimpleBlock(
(block): Sequential(
(ConvBlock_0): Sequential(
(0): Conv2d(2048, 256, kernel_size=(1, 1), stride=(1, 1), padding=(1, 1))
(1): BatchNorm2d(256, eps=1e-05, momentum=0.05, affine=True, track_running_stats=True)
(2): ReLU(inplace=True)
(3): Dropout2d(p=0.0, inplace=False)
)
(ConvBlock_1): Sequential(
(0): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(1): BatchNorm2d(256, eps=1e-05, momentum=0.05, affine=True, track_running_stats=True)
(2): ReLU(inplace=True)
(3): Dropout2d(p=0.0, inplace=False)
)
)
)
)
(head): ClassifierHead(
(global_pool): SelectAdaptivePool2d (pool_type=max, flatten=Flatten(start_dim=1, end_dim=-1))
(fc): Linear(in_features=256, out_features=1000, bias=True)
(flatten): Identity()
)
)
```
@Coderx7 structure looks nice
for conversion I usually write a fn called checkpoint_filter_fn
See:
- https://github.com/rwightman/pytorch-image-models/blob/main/timm/models/efficientformer.py#L473
- https://github.com/rwightman/pytorch-image-models/blob/main/timm/models/levit.py#L696
- https://github.com/rwightman/pytorch-image-models/blob/main/timm/models/edgenext.py#L482
- https://github.com/rwightman/pytorch-image-models/blob/main/timm/models/convnext.py#L387
Mapping a purely linear 0..num_model_layers layout to stages is going to be a bit of fun; you'll probably need to use a regex and find a rule you can increment stage_idx on (i.e. every time the out dim changes). The last resort is just to iterate both state dicts together like the levit example and assume they line up (they should), asserting that the number of elements matches...
that checkpoint filter should be passed to the builder ie https://github.com/rwightman/pytorch-image-models/blob/main/timm/models/levit.py#L765
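A minimal sketch of that last-ditch positional remap, for illustration (hypothetical; assumes both layouts enumerate the same tensors in the same order):
```python
def checkpoint_filter_fn(state_dict, model):
    """Remap an old flat-layout checkpoint onto the refactored stem/stages model."""
    if 'stem.conv.weight' in state_dict:
        return state_dict  # checkpoint already uses the new layout
    new_keys = list(model.state_dict().keys())
    old_items = list(state_dict.items())
    # positional mapping: assume both layouts enumerate the same tensors in order
    assert len(new_keys) == len(old_items), 'cannot remap, element count differs'
    return {new_k: v for new_k, (_, v) in zip(new_keys, old_items)}
```
It then gets passed to build_model_with_cfg via pretrained_filter_fn, as in the levit builder linked above.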
@rwightman I got the checkpoint working; however, for some reason when I try the features_only argument during model creation, it crashes and complains that the return layers are not present in the model:
```
AssertionError: Return layers ({'features.stage_0.block.ConvBlock_0', 'features.stage_3.block.maxpool', 'features.stage_0.block.ConvBlock_2', 'features.stage_1.block.maxpool'}) are not present in model
```
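For context, the failing call is presumably just the standard features_only path; a minimal repro sketch (model name assumed):
```python
import torch
import timm

# features_only wraps the net in a feature extractor that resolves the module
# names recorded in feature_info, which is where the assertion above fires
model = timm.create_model('simplenetv1_5m_m2', features_only=True)
outs = model(torch.randn(1, 3, 224, 224))
print([o.shape for o in outs])
```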
What should I specify as the module name in the feature_info list? What is it looking for? If it helps, this is what the model looks like:
```
SimpleNet(
(stem): ConvBNReLU(
(conv): Conv2d(3, 64, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
(bn): BatchNorm2d(64, eps=1e-05, momentum=0.05, affine=True, track_running_stats=True)
(dropout): Dropout2d(p=0.0, inplace=False)
(relu): ReLU(inplace=True)
)
(features): Sequential(
(stage_0): SimpleBlock(
(block): Sequential(
(ConvBlock_0): ConvBNReLU(
(conv): Conv2d(64, 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
(bn): BatchNorm2d(128, eps=1e-05, momentum=0.05, affine=True, track_running_stats=True)
(dropout): Dropout2d(p=0.0, inplace=False)
(relu): ReLU(inplace=True)
)
(ConvBlock_1): ConvBNReLU(
(conv): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(bn): BatchNorm2d(128, eps=1e-05, momentum=0.05, affine=True, track_running_stats=True)
(dropout): Dropout2d(p=0.0, inplace=False)
(relu): ReLU(inplace=True)
)
(ConvBlock_2): ConvBNReLU(
(conv): Conv2d(128, 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
(bn): BatchNorm2d(128, eps=1e-05, momentum=0.05, affine=True, track_running_stats=True)
(dropout): Dropout2d(p=0.0, inplace=False)
(relu): ReLU(inplace=True)
)
(ConvBlock_3): ConvBNReLU(
(conv): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(bn): BatchNorm2d(128, eps=1e-05, momentum=0.05, affine=True, track_running_stats=True)
(dropout): Dropout2d(p=0.0, inplace=False)
(relu): ReLU(inplace=True)
)
(ConvBlock_4): ConvBNReLU(
(conv): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(bn): BatchNorm2d(128, eps=1e-05, momentum=0.05, affine=True, track_running_stats=True)
(dropout): Dropout2d(p=0.0, inplace=False)
(relu): ReLU(inplace=True)
)
)
)
(stage_1): SimpleBlock(
(block): Sequential(
(maxpool): Sequential(
(0): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
(1): Dropout2d(p=0.0, inplace=True)
)
(ConvBlock_0): ConvBNReLU(
(conv): Conv2d(128, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(bn): BatchNorm2d(256, eps=1e-05, momentum=0.05, affine=True, track_running_stats=True)
(dropout): Dropout2d(p=0.0, inplace=False)
(relu): ReLU(inplace=True)
)
(ConvBlock_1): ConvBNReLU(
(conv): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(bn): BatchNorm2d(256, eps=1e-05, momentum=0.05, affine=True, track_running_stats=True)
(dropout): Dropout2d(p=0.0, inplace=False)
(relu): ReLU(inplace=True)
)
(ConvBlock_2): ConvBNReLU(
(conv): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(bn): BatchNorm2d(256, eps=1e-05, momentum=0.05, affine=True, track_running_stats=True)
(dropout): Dropout2d(p=0.0, inplace=False)
(relu): ReLU(inplace=True)
)
)
)
(stage_2): SimpleBlock(
(block): Sequential(
(ConvBlock_0): ConvBNReLU(
(conv): Conv2d(256, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(bn): BatchNorm2d(512, eps=1e-05, momentum=0.05, affine=True, track_running_stats=True)
(dropout): Dropout2d(p=0.0, inplace=False)
(relu): ReLU(inplace=True)
)
)
)
(stage_3): SimpleBlock(
(block): Sequential(
(maxpool): Sequential(
(0): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
(1): Dropout2d(p=0.0, inplace=True)
)
(ConvBlock_0): ConvBNReLU(
(conv): Conv2d(512, 2048, kernel_size=(1, 1), stride=(1, 1), padding=(1, 1))
(bn): BatchNorm2d(2048, eps=1e-05, momentum=0.05, affine=True, track_running_stats=True)
(dropout): Dropout2d(p=0.0, inplace=False)
(relu): ReLU(inplace=True)
)
)
)
(stage_4): SimpleBlock(
(block): Sequential(
(ConvBlock_0): ConvBNReLU(
(conv): Conv2d(2048, 256, kernel_size=(1, 1), stride=(1, 1), padding=(1, 1))
(bn): BatchNorm2d(256, eps=1e-05, momentum=0.05, affine=True, track_running_stats=True)
(dropout): Dropout2d(p=0.0, inplace=False)
(relu): ReLU(inplace=True)
)
(ConvBlock_1): ConvBNReLU(
(conv): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(bn): BatchNorm2d(256, eps=1e-05, momentum=0.05, affine=True, track_running_stats=True)
(dropout): Dropout2d(p=0.0, inplace=False)
(relu): ReLU(inplace=True)
)
)
)
)
(head): ClassifierHead(
(global_pool): SelectAdaptivePool2d (pool_type=max, flatten=Flatten(start_dim=1, end_dim=-1))
(fc): Linear(in_features=256, out_features=1000, bias=True)
(flatten): Identity()
)
)
```
feature_info:
```
[{'num_chs': 64, 'reduction': 2, 'module': 'stem'},
 {'num_chs': 128, 'reduction': 4, 'module': 'features.stage_0.block.ConvBlock_0'},
 {'num_chs': 128, 'reduction': 8, 'module': 'features.stage_0.block.ConvBlock_2'},
 {'num_chs': 128, 'reduction': 16, 'module': 'features.stage_1.block.maxpool'},
 {'num_chs': 512, 'reduction': 32, 'module': 'features.stage_3.block.maxpool'}]
```
And this is state_dict.keys():
```
stem.conv.weight
stem.conv.bias
stem.bn.weight
stem.bn.bias
stem.bn.running_mean
stem.bn.running_var
stem.bn.num_batches_tracked
features.stage_0.block.ConvBlock_0.conv.weight
features.stage_0.block.ConvBlock_0.conv.bias
features.stage_0.block.ConvBlock_0.bn.weight
features.stage_0.block.ConvBlock_0.bn.bias
features.stage_0.block.ConvBlock_0.bn.running_mean
features.stage_0.block.ConvBlock_0.bn.running_var
features.stage_0.block.ConvBlock_0.bn.num_batches_tracked
features.stage_0.block.ConvBlock_1.conv.weight
features.stage_0.block.ConvBlock_1.conv.bias
features.stage_0.block.ConvBlock_1.bn.weight
features.stage_0.block.ConvBlock_1.bn.bias
features.stage_0.block.ConvBlock_1.bn.running_mean
features.stage_0.block.ConvBlock_1.bn.running_var
features.stage_0.block.ConvBlock_1.bn.num_batches_tracked
features.stage_0.block.ConvBlock_2.conv.weight
features.stage_0.block.ConvBlock_2.conv.bias
features.stage_0.block.ConvBlock_2.bn.weight
features.stage_0.block.ConvBlock_2.bn.bias
features.stage_0.block.ConvBlock_2.bn.running_mean
features.stage_0.block.ConvBlock_2.bn.running_var
features.stage_0.block.ConvBlock_2.bn.num_batches_tracked
features.stage_0.block.ConvBlock_3.conv.weight
features.stage_0.block.ConvBlock_3.conv.bias
features.stage_0.block.ConvBlock_3.bn.weight
features.stage_0.block.ConvBlock_3.bn.bias
features.stage_0.block.ConvBlock_3.bn.running_mean
features.stage_0.block.ConvBlock_3.bn.running_var
features.stage_0.block.ConvBlock_3.bn.num_batches_tracked
features.stage_0.block.ConvBlock_4.conv.weight
features.stage_0.block.ConvBlock_4.conv.bias
features.stage_0.block.ConvBlock_4.bn.weight
features.stage_0.block.ConvBlock_4.bn.bias
features.stage_0.block.ConvBlock_4.bn.running_mean
features.stage_0.block.ConvBlock_4.bn.running_var
features.stage_0.block.ConvBlock_4.bn.num_batches_tracked
features.stage_1.block.ConvBlock_0.conv.weight
features.stage_1.block.ConvBlock_0.conv.bias
features.stage_1.block.ConvBlock_0.bn.weight
features.stage_1.block.ConvBlock_0.bn.bias
features.stage_1.block.ConvBlock_0.bn.running_mean
features.stage_1.block.ConvBlock_0.bn.running_var
features.stage_1.block.ConvBlock_0.bn.num_batches_tracked
features.stage_1.block.ConvBlock_1.conv.weight
features.stage_1.block.ConvBlock_1.conv.bias
features.stage_1.block.ConvBlock_1.bn.weight
features.stage_1.block.ConvBlock_1.bn.bias
features.stage_1.block.ConvBlock_1.bn.running_mean
features.stage_1.block.ConvBlock_1.bn.running_var
features.stage_1.block.ConvBlock_1.bn.num_batches_tracked
features.stage_1.block.ConvBlock_2.conv.weight
features.stage_1.block.ConvBlock_2.conv.bias
features.stage_1.block.ConvBlock_2.bn.weight
features.stage_1.block.ConvBlock_2.bn.bias
features.stage_1.block.ConvBlock_2.bn.running_mean
features.stage_1.block.ConvBlock_2.bn.running_var
features.stage_1.block.ConvBlock_2.bn.num_batches_tracked
features.stage_2.block.ConvBlock_0.conv.weight
features.stage_2.block.ConvBlock_0.conv.bias
features.stage_2.block.ConvBlock_0.bn.weight
features.stage_2.block.ConvBlock_0.bn.bias
features.stage_2.block.ConvBlock_0.bn.running_mean
features.stage_2.block.ConvBlock_0.bn.running_var
features.stage_2.block.ConvBlock_0.bn.num_batches_tracked
features.stage_3.block.ConvBlock_0.conv.weight
features.stage_3.block.ConvBlock_0.conv.bias
features.stage_3.block.ConvBlock_0.bn.weight
features.stage_3.block.ConvBlock_0.bn.bias
features.stage_3.block.ConvBlock_0.bn.running_mean
features.stage_3.block.ConvBlock_0.bn.running_var
features.stage_3.block.ConvBlock_0.bn.num_batches_tracked
features.stage_4.block.ConvBlock_0.conv.weight
features.stage_4.block.ConvBlock_0.conv.bias
features.stage_4.block.ConvBlock_0.bn.weight
features.stage_4.block.ConvBlock_0.bn.bias
features.stage_4.block.ConvBlock_0.bn.running_mean
features.stage_4.block.ConvBlock_0.bn.running_var
features.stage_4.block.ConvBlock_0.bn.num_batches_tracked
features.stage_4.block.ConvBlock_1.conv.weight
features.stage_4.block.ConvBlock_1.conv.bias
features.stage_4.block.ConvBlock_1.bn.weight
features.stage_4.block.ConvBlock_1.bn.bias
features.stage_4.block.ConvBlock_1.bn.running_mean
features.stage_4.block.ConvBlock_1.bn.running_var
features.stage_4.block.ConvBlock_1.bn.num_batches_tracked
head.fc.weight
head.fc.bias
```
@Coderx7 feature info should be filled with the module name of the 'deepest' layer for a given stride, so usually the nn.Module before a downsample layer. In this case, you'd want stem, features.stage_0, features.stage_2, features.stage_4 ...aaand I just noticed there is a stride 2 on ConvBlock_2 of stage_0; if that's supposed to be there, it should split into a different stage (stages are delimited by strided layers and, in many cases, shifts in width)
@rwightman Thanks, but there are two things here. First, I believe I did just that but still got the same error anyway! I'll give it another try and see how it goes.
Second, concerning the stages: this architecture basically allows dynamic strides for any layer, but especially the first 4 (I can remove that and make it static, as there are only two pretrained variants with two stride modes!).
The two trained variants use mode 1 and mode 2 strides, which basically downsample the early layers at a specific rate, so during ImageNet training you get some kind of leverage over the performance/accuracy ratio in its simplest form.
Like here, one variant uses strides of 2,2,1,2 and the other uses 2,2,2, with the rest being 1s.
If I create stages based on the downsampling of features, then stem, layer1, and layer3 should all be in unique stages, right? Like stage 1 to stage 2 (excluding the stem)?
In the model create helper you should enable flatten_sequential and ensure the default # of out indices matches the net:
```python
out_indices = kwargs.pop('out_indices', (0, 1, 2, 3))
model = build_model_with_cfg(
    EfficientFormerV2, variant, pretrained,
    feature_cfg=dict(flatten_sequential=True, out_indices=out_indices),
    **kwargs)
```
Most models have some sort of pattern and systematic spacing between the strided layers, so I figured that'd be the same here for the configs. I realize they could be put anywhere, but it doesn't seem that useful to have no depth between strides.
The concept of a stage is essentially to encapsulate the layers at the same stride; sometimes there are stages without any stride change but with a different width, conv type (depthwise vs not), or other trait in common across all layer repeats in the stage.
@rwightman Thanks a lot. That's a fair point; however, this was never meant to scale that way. It was designed with something completely different in mind: to show how one could maximize a network's performance under constraints (fixed parameter count, depth, and basic operators) while keeping everything simple and not resorting to any complex strategies.
That said, I thankfully seem to have pretty much done everything; the only thing that still seems to be an issue is that the last stage has a bigger feature map size (thus a smaller reduction) than its predecessor, and it seems timm has issues with that. Currently this is how my feature_info looks:
```
[{'num_chs': 64, 'reduction': 2, 'module': 'stem'},
 {'num_chs': 128, 'reduction': 4, 'module': 'features.stage_0'},
 {'num_chs': 128, 'reduction': 8, 'module': 'features.stage_1'},
 {'num_chs': 512, 'reduction': 16, 'module': 'features.stage_2'},
 {'num_chs': 2048, 'reduction': 24, 'module': 'features.stage_3'},
 {'num_chs': 256, 'reduction': 20, 'module': 'features.stage_4'}]
```
How should I handle this, other than merging the last two stages? Thanks a lot in advance.
@rwightman Would you kindly have a look here and tell me what to do for this last part? Thanks.
@Coderx7 reduction is the spatial reduction (from the input image size); it's only complained about if it decreases. It's not used directly by timm, but some downstream users want to know it for calculating interpolation ratios.
If you look at the rexnet example, it should *= 2 every time there is a strided layer; the majority of ImageNet networks are stride 32. num_chs has no restrictions on increasing/decreasing, though.
@rwightman I thought the idea was to provide feature maps of different sizes for downstream usage, not to capture only the stride-2 reductions per se.
Currently, unless the assert in
```python
assert 'reduction' in fi and fi['reduction'] >= prev_reduction
```
is disabled, this won't work, so I need to do one of the following:
- have 4 stages and only include reduction rates for 3 of them (that is, don't include the last stage's reduction rate in feature_info);
- have 3 stages, merging the last two (3 and 4), with only 3 reduction rates in total;
- alter the FeatureInfo class to take a new argument that allows cases like this.

The issue with the first option is that users will lose the last two layers of the network if they opt to use features_only, but other than that, normal usage stays the same. The issue with the second option is that users can't fully experiment with stage 4; they'd have to do it manually, which nullifies the purpose of features_only, I guess. The last option seems like a good idea to me (a rough sketch of what it could look like follows below): with a default value that works for all current models, the current behavior is maintained, but cases such as this also become usable; unless that check has more significance and affects lots of other parts of the library that I'm not aware of yet.
So which option should I take, so I can hopefully finish this up? Thanks a lot in advance.
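To make the third option concrete, a hypothetical sketch (paraphrasing rather than quoting the current check; the helper name is made up):
```python
# hypothetical: relax the monotonic-reduction check behind a flag, with the
# strict default preserving current behavior for all existing models
def validate_feature_info(feature_info, strict_reduction=True):
    prev_reduction = 1
    for fi in feature_info:
        assert 'num_chs' in fi and 'module' in fi and 'reduction' in fi
        if strict_reduction:
            assert fi['reduction'] >= prev_reduction
        prev_reduction = fi['reduction']
```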
@rwightman I'd really appreciate it if you could kindly have a look and decide on the next step, so I can finalize the changes accordingly and have this finished.
@Coderx7 sorry, I have a lot on my plate right now, wrapping up a few things before I'm on vacation for a bit. I'm going to have to leave this one hanging for a while, as I don't think we're on the same page.
The net is simple, as per its name, and I didn't see any merging, upscaling, or anything else that could result in a feature map increasing in size; it's reducing by 2 at each downscale. I feel we're lost in semantics.
@rwightman Out of the last three conv layers, two (the 2048 and 256 ones) have kernel_size=1 and use a padding of 1. That causes the feature map size to increase from 7x7 (after the downsampling) to 9x9 (after the first 1x1 conv); the next 1x1 conv layer increases that to 11x11, thus causing the effective reduction to vary that way. OK, no problem, please take your time and let's continue this when you are free. I really do appreciate you taking the time despite your busy schedule.
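A quick way to see the effect (channel counts follow the model dump above):
```python
import torch
import torch.nn as nn

# a 1x1 conv with padding=1 grows each spatial dim by 2:
# 7x7 -> 9x9 after the first 1x1 conv, 9x9 -> 11x11 after the second
x = torch.randn(1, 512, 7, 7)
conv_a = nn.Conv2d(512, 2048, kernel_size=1, padding=1)
conv_b = nn.Conv2d(2048, 256, kernel_size=1, padding=1)
print(conv_a(x).shape)          # torch.Size([1, 2048, 9, 9])
print(conv_b(conv_a(x)).shape)  # torch.Size([1, 256, 11, 11])
```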
@rwightman May I ask if your vacation is over, and whether we can hopefully get this last step worked out?
@Coderx7 I've been trying to get on top of my own tasks since getting back. I looked at this a bit more, and I'm not really liking the padding issue that is the reason for the expanding dim... having a padding of 1 for a 1x1 conv makes zero sense to me. It's adding data to the signal path that's not meaningful. So, I'm hesitant to add it at all with quirks like that present...
@rwightman Thanks, I really appreciate it, knowing how busy your schedule is. It's not really any different from using (zero-)padding on the input. This happened by accident, but after I noticed it, the few experiments I ran afterward performed better than the no-padding versions; it looked to me as if it creates a kind of regularization effect. I can run more experiments to further validate this point (or the lack thereof, if that happens to be the case ultimately), if that's your concern. My main concern is that it takes a lot of time to train these models again (it took me several months, as I don't have access to anything powerful, just a single GPU), but I'll try my best to address your concerns.
@Coderx7 in deep learning it would seem almost any extra activations (or parameters) can/will be used to improve the loss in optimization, but I'd argue these aren't particularly useful ones (and they're possibly harmful for segmentation/obj detection, as they'd add a 'border' effect at the feature level). They get blended back into the signal via the subsequent 3x3 conv. I did test these, and per the goal of running fast, the extra padding does have a measurable speed impact (not significant, but there).
The rest of the net is fine, simple as per the name, which isn't bad to have in timm, as such nets can be the best option for some tasks. If the padding issue is fixed (padding == kernel_size // 2 should do fine for this net) and the models retrained, I'd definitely include it with the tweaks mentioned.
Do you have hparams for these? I have two idle 2x Titan RTX machines right now, I could put them to work if you push any outstanding changes re arch to this PR.
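For concreteness, a hedged sketch of the conv block with the suggested padding rule applied (module names follow the dumps earlier in the thread; the op ordering is assumed from the first printout):
```python
import torch.nn as nn

class ConvBNReLU(nn.Module):
    def __init__(self, in_chs, out_chs, kernel_size=3, stride=1, drop_rate=0.0):
        super().__init__()
        # padding == kernel_size // 2: 3x3 -> 1, 1x1 -> 0, so spatial dims are
        # preserved at stride 1 and the feature maps no longer expand
        self.conv = nn.Conv2d(in_chs, out_chs, kernel_size, stride, padding=kernel_size // 2)
        self.bn = nn.BatchNorm2d(out_chs, momentum=0.05)
        self.relu = nn.ReLU(inplace=True)
        self.dropout = nn.Dropout2d(p=drop_rate)

    def forward(self, x):
        return self.dropout(self.relu(self.bn(self.conv(x))))
```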
@rwightman That would be great, thanks a lot :) I was actually planning to test them on detection/segmentation as well, to see how the padding affects the results in different scenarios. I'm glad you kindly shared your experience with me. I really, really appreciate it. :)
OK, I'll try to push all the changes so far. The model is refactored into stages and uses no padding for the 1x1 convs; everything should hopefully work. Concerning the hparams, I didn't have much to test; I went with nearly the exact same hparams for all architectures, which I'll describe in a moment.
They all use the same settings and only differ in weight decay and dropouts.
I start off by training all models without any dropout and, aside from keeping the top 10 checkpoints, save a checkpoint every 50 epochs (i.e. the 100th, 150th, 200th, ... checkpoints), so that later on I can resume from these checkpoints with dropout and get faster, better results. Then, at the end, I take the average of the best checkpoints.
If I don't get the accuracy I like, I try resuming with no label smoothing and finally average the best of these to get the final results.
That's all.
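The averaging itself can be done with timm's avg_checkpoints.py script; a sketch of the invocation (input/output paths assumed):
```
python avg_checkpoints.py --input output/train/simplenetv1_5m_m2 --output simplenetv1_5m_m2_avg.pth
```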
The basic command arguments look like this; only the model name, weight decay, and dropout change:
```
./distributed_train.sh 1 /media/hossein/SSD_IMG/ImageNet_DataSet/ --model simplenetv1_5m_m1 -b 256 --sched step --epochs 900 --decay-epochs 1 --decay-rate 0.981 --opt rmsproptf --opt-eps .001 -j 20 --warmup-lr 1e-3 --weight-decay 0.00003 --drop 0.0 --amp --lr .0195 --pin-mem --channels-last --model-ema --model-ema-decay 0.9999
```
For example, for the 5m variants, this is how I start the training:
```
./distributed_train.sh 1 /media/hossein/SSD_IMG/ImageNet_DataSet/ --model simplenetv1_5m_m1 -b 256 --sched step --epochs 900 --decay-epochs 1 --decay-rate 0.981 --opt rmsproptf --opt-eps .001 -j 20 --warmup-lr 1e-3 --weight-decay 0.00003 --drop 0.0 --amp --lr .0195 --pin-mem --channels-last --model-ema --model-ema-decay 0.9999
./distributed_train.sh 1 /media/hossein/SSD_IMG/ImageNet_DataSet/ --model simplenetv1_5m_m2 -b 256 --sched step --epochs 900 --decay-epochs 1 --decay-rate 0.981 --opt rmsproptf --opt-eps .001 -j 20 --warmup-lr 1e-3 --weight-decay 0.00003 --drop 0.0 --amp --lr .0195 --pin-mem --channels-last --model-ema --model-ema-decay 0.9999
```
When the model stops improving, I resume from checkpoint 250 and this time use dropout rates like this (the same for both variants):
```
--weight-decay 0.00002 --model-kwargs drop_rates='{"11":0.02,"12":0.05,"13":0.05}'
```
9m:
For the m1 variant, I trained with `--weight-decay 0.000035 --model-kwargs drop_rates='{"12":0.05,"13":0.05}'` from the start, and when it plateaued, I resumed from epoch 300 with slightly more dropout (and less wd), `--weight-decay 0.00003 --model-kwargs drop_rates='{"11":0.05,"12":0.05,"13":0.05}'`, and achieved 73.38.
For the m2 variant, I started with a wd of 0.000035 for 200 epochs, then continued with `--weight-decay 0.00003 --model-kwargs drop_rates='{"12":0.05,"13":0.05}'` until it stopped improving (at epoch 413 it achieved 73.678). Then I resumed from epoch 330 with slightly higher dropout rates and less wd, just like the m1 variant:
`--weight-decay 0.00003 --model-kwargs drop_rates='{"11":0.05,"12":0.05,"13":0.05}'`. This got me from 73.678 to 73.95. Then I did this again (i.e. repeated the last run) without label smoothing and averaged the best models, choosing the one with the higher accuracy. This gave me 74.17.
```
./distributed_train.sh 1 /media/hossein/SSD_IMG/ImageNet_DataSet/ --model simplenetv1_9m_m1 -b 256 --sched step --epochs 900 --decay-epochs 1 --decay-rate 0.981 --opt rmsproptf --opt-eps .001 -j 20 --warmup-lr 1e-3 --weight-decay 0.000035 --drop 0.0 --amp --lr .0195 --pin-mem --channels-last --model-ema --model-ema-decay 0.9999 --model-kwargs drop_rates='{"12":0.05,"13":0.05}'
./distributed_train.sh 1 /media/hossein/SSD_IMG/ImageNet_DataSet/ --model simplenetv1_9m_m2 -b 256 --sched step --epochs 900 --decay-epochs 1 --decay-rate 0.981 --opt rmsproptf --opt-eps .001 -j 20 --warmup-lr 1e-3 --weight-decay 0.000035 --drop 0.0 --amp --lr .0195 --pin-mem --channels-last --model-ema --model-ema-decay 0.9999
```
3m: For this variant, I used no dropout; I simply trained normally until it plateaued, then resumed from checkpoint 332 (basically around the best checkpoint up to that point) with no label smoothing, and finally averaged the best models to get a higher accuracy.
```
./distributed_train.sh 1 /media/hossein/SSD_IMG/ImageNet_DataSet/ --model simplenetv1_small_m1_075 -b 256 --sched step --epochs 900 --decay-epochs 1 --decay-rate 0.981 --opt rmsproptf --opt-eps .001 -j 20 --warmup-lr 1e-3 --weight-decay 0.00003 --drop 0.0 --amp --lr .0195 --pin-mem --channels-last --model-ema --model-ema-decay 0.9999
./distributed_train.sh 1 /media/hossein/SSD_IMG/ImageNet_DataSet/ --model simplenetv1_small_m2_075 -b 256 --sched step --epochs 900 --decay-epochs 1 --decay-rate 0.981 --opt rmsproptf --opt-eps .001 -j 20 --warmup-lr 1e-3 --weight-decay 0.00003 --drop 0.0 --amp --lr .0195 --pin-mem --channels-last --model-ema --model-ema-decay 0.9999
```
1.5m: For this variant, I simply train normally with a wd of 0.00001. The m2 variant achieves 61.386 this way. The m1 variant achieves 60.814 with no label smoothing (with label smoothing it gets 60.58, so it could be within the margin of error, as I only trained them once). I then take the average of the best models for each.
```
./distributed_train.sh 1 /media/hossein/SSD_IMG/ImageNet_DataSet/ --model simplenetv1_small_m1_05 -b 256 --sched step --epochs 900 --decay-epochs 1 --decay-rate 0.981 --opt rmsproptf --opt-eps .001 -j 20 --warmup-lr 1e-3 --weight-decay 0.00001 --drop 0.0 --amp --lr .0195 --pin-mem --channels-last --model-ema --model-ema-decay 0.9999
./distributed_train.sh 1 /media/hossein/SSD_IMG/ImageNet_DataSet/ --model simplenetv1_small_m2_05 -b 256 --sched step --epochs 900 --decay-epochs 1 --decay-rate 0.981 --opt rmsproptf --opt-eps .001 -j 20 --warmup-lr 1e-3 --weight-decay 0.00001 --drop 0.0 --amp --lr .0195 --pin-mem --channels-last --model-ema --model-ema-decay 0.9999
```
OK, I pushed the changes; I hope I didn't do anything wrong.
@rwightman Hi, hope you are doing great. I finally finished training the new weights and have just updated the PR. Would you please kindly tell me what you think? Thanks a lot in advance.
@rwightman It's been a few months since my last changes; could you kindly tell me if everything is OK or if I'm missing something here? I'd really like to make this happen, if you're willing, of course. Thanks a lot in advance.