
[Bug] Problem saving epoch checkpoint when fine-tuning EfficientNet-B0

Open · opened by bsense-rius · 2 comments

Describe the bug

When fine-tuning EfficientNet-B0 with only minimal changes to the Getting Started Colab notebook, a RuntimeError is raised when the first checkpoint is saved after the first epoch completes:

"RuntimeError: Given groups=1, weight of size [32, 3, 3, 3], expected input[32, 224, 225, 5] to have 3 channels, but got 224 channels instead"

To Reproduce

Use the Google Colab MMClassification Getting Started notebook, changing ONLY the MobileNetV2 config and checkpoint files to those from the EfficientNet-B0 model zoo, as shown:

config_file = 'configs/efficientnet/efficientnet-b0_8xb32_in1k.py'
checkpoint_file = 'https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b0_3rdparty_8xb32_in1k_20220119-a7e2a0b1.pth'
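
For context, the rest of the fine-tuning setup follows the notebook unchanged; roughly (a minimal sketch, assuming the standard Getting Started flow; the exact cell may differ):

from mmcv import Config

cfg = Config.fromfile(config_file)
cfg.model.head.num_classes = 2   # cats vs. dogs instead of the 1000 ImageNet classes
cfg.load_from = checkpoint_file  # initialize from the pretrained EfficientNet-B0 weights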

Post related information

  1. The output of pip list | grep "mmcv\|mmcls\|^torch"
mmcls                         0.23.0                /content/mmclassification
mmcv                          1.5.0
torch                         1.11.0+cu113
torchaudio                    0.11.0+cu113
torchsummary                  1.5.1
torchtext                     0.12.0
torchvision                   0.12.0+cu113
  2. Your config file if you modified it or created a new one. Nothing was modified beyond the Google Colab MMClassification Getting Started => Fine-tune section.

  3. Your train log file if you met the problem during training.

2022-05-04 09:03:37,380 - mmcls - INFO - workflow: [('train', 1)], max: 2 epochs
2022-05-04 09:03:37,383 - mmcls - INFO - Checkpoints will be saved to /content/mmclassification/work_dirs/cats_dogs_dataset by HardDiskBackend.
2022-05-04 09:03:44,796 - mmcls - INFO - Epoch [1][10/201]	lr: 5.000e-03, eta: 0:04:44, time: 0.725, data_time: 0.252, memory: 3653, loss: 0.6385
2022-05-04 09:03:49,460 - mmcls - INFO - Epoch [1][20/201]	lr: 5.000e-03, eta: 0:03:47, time: 0.466, data_time: 0.016, memory: 3653, loss: 0.4478
2022-05-04 09:03:54,131 - mmcls - INFO - Epoch [1][30/201]	lr: 5.000e-03, eta: 0:03:25, time: 0.467, data_time: 0.016, memory: 3653, loss: 0.3196
2022-05-04 09:03:58,821 - mmcls - INFO - Epoch [1][40/201]	lr: 5.000e-03, eta: 0:03:12, time: 0.469, data_time: 0.016, memory: 3653, loss: 0.2780
2022-05-04 09:04:03,520 - mmcls - INFO - Epoch [1][50/201]	lr: 5.000e-03, eta: 0:03:02, time: 0.470, data_time: 0.016, memory: 3653, loss: 0.2618
2022-05-04 09:04:08,239 - mmcls - INFO - Epoch [1][60/201]	lr: 5.000e-03, eta: 0:02:54, time: 0.472, data_time: 0.016, memory: 3653, loss: 0.2120
2022-05-04 09:04:13,059 - mmcls - INFO - Epoch [1][70/201]	lr: 5.000e-03, eta: 0:02:48, time: 0.482, data_time: 0.019, memory: 3653, loss: 0.1787
2022-05-04 09:04:17,811 - mmcls - INFO - Epoch [1][80/201]	lr: 5.000e-03, eta: 0:02:42, time: 0.475, data_time: 0.017, memory: 3653, loss: 0.1877
2022-05-04 09:04:22,604 - mmcls - INFO - Epoch [1][90/201]	lr: 5.000e-03, eta: 0:02:36, time: 0.479, data_time: 0.019, memory: 3653, loss: 0.1741
2022-05-04 09:04:27,354 - mmcls - INFO - Epoch [1][100/201]	lr: 5.000e-03, eta: 0:02:30, time: 0.475, data_time: 0.016, memory: 3653, loss: 0.1909
2022-05-04 09:04:32,111 - mmcls - INFO - Epoch [1][110/201]	lr: 5.000e-03, eta: 0:02:24, time: 0.476, data_time: 0.017, memory: 3653, loss: 0.1907
2022-05-04 09:04:36,872 - mmcls - INFO - Epoch [1][120/201]	lr: 5.000e-03, eta: 0:02:19, time: 0.476, data_time: 0.016, memory: 3653, loss: 0.1520
2022-05-04 09:04:41,645 - mmcls - INFO - Epoch [1][130/201]	lr: 5.000e-03, eta: 0:02:14, time: 0.477, data_time: 0.016, memory: 3653, loss: 0.2102
2022-05-04 09:04:46,422 - mmcls - INFO - Epoch [1][140/201]	lr: 5.000e-03, eta: 0:02:08, time: 0.478, data_time: 0.016, memory: 3653, loss: 0.1830
2022-05-04 09:04:51,202 - mmcls - INFO - Epoch [1][150/201]	lr: 5.000e-03, eta: 0:02:03, time: 0.478, data_time: 0.016, memory: 3653, loss: 0.1848
2022-05-04 09:04:56,005 - mmcls - INFO - Epoch [1][160/201]	lr: 5.000e-03, eta: 0:01:58, time: 0.480, data_time: 0.018, memory: 3653, loss: 0.1488
2022-05-04 09:05:00,775 - mmcls - INFO - Epoch [1][170/201]	lr: 5.000e-03, eta: 0:01:53, time: 0.477, data_time: 0.016, memory: 3653, loss: 0.1551
2022-05-04 09:05:05,549 - mmcls - INFO - Epoch [1][180/201]	lr: 5.000e-03, eta: 0:01:48, time: 0.477, data_time: 0.017, memory: 3653, loss: 0.1437
2022-05-04 09:05:10,317 - mmcls - INFO - Epoch [1][190/201]	lr: 5.000e-03, eta: 0:01:43, time: 0.477, data_time: 0.016, memory: 3653, loss: 0.1606
2022-05-04 09:05:15,096 - mmcls - INFO - Epoch [1][200/201]	lr: 5.000e-03, eta: 0:01:38, time: 0.478, data_time: 0.016, memory: 3653, loss: 0.1266
2022-05-04 09:05:15,185 - mmcls - INFO - Saving checkpoint at 1 epochs
[                                                  ] 0/1601, elapsed: 0s, ETA:
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-9-23436e39b2a0> in <module>()
     33     validate=True,
     34     timestamp=time.strftime('%Y%m%d_%H%M%S', time.localtime()),
---> 35     meta=dict())

21 frames
/usr/local/lib/python3.7/dist-packages/mmcv/cnn/bricks/conv2d_adaptive_padding.py in forward(self, x)
     60             ])
     61         return F.conv2d(x, self.weight, self.bias, self.stride, self.padding,
---> 62                         self.dilation, self.groups)

RuntimeError: Given groups=1, weight of size [32, 3, 3, 3], expected input[32, 224, 225, 5] to have 3 channels, but got 224 channels instead
  4. Other code you modified in the `mmcls` folder.
Nothing

Additional context


Nothing

bsense-rius · May 04 '22

It seems there is an error in the evaluation hook config. The input to your model has shape [32, 224, 225, 5], but the expected input shape is [32, 224, 224, 3]. Can you show me your data.val config?
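
A quick way to check what the val pipeline actually produces (a minimal sketch, assuming cfg is the config object used for training):

from mmcls.datasets import build_dataset

val_dataset = build_dataset(cfg.data.val)
sample = val_dataset[0]
print(sample['img'].shape)  # should be torch.Size([3, 224, 224]) after ImageToTensor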

Ezra-Yu · May 05 '22

I have not changed anything in the validation scheme, which works for other classifiers. The output of cfg.data.val:

import json
print(json.dumps(cfg.data.val, indent=2))

{
  "type": "ImageNet",
  "data_prefix": "data/cats_dogs_dataset/val_set/val_set",
  "ann_file": "data/cats_dogs_dataset/val.txt",
  "pipeline": [
    {
      "type": "LoadImageFromFile"
    },
    {
      "type": "Resize",
      "size": [
        256,
        -1
      ],
      "backend": "pillow"
    },
    {
      "type": "CenterCrop",
      "crop_size": 224
    },
    {
      "type": "Normalize",
      "mean": [
        124.508,
        116.05,
        106.438
      ],
      "std": [
        58.577,
        57.31,
        57.437
      ],
      "to_rgb": true
    },
    {
      "type": "ImageToTensor",
      "keys": [
        "img"
      ]
    },
    {
      "type": "Collect",
      "keys": [
        "img"
      ]
    }
  ],
  "classes": "data/cats_dogs_dataset/classes.txt"
}
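
So this pipeline should yield 3×224×224 tensors per image: Resize with size=[256, -1] scales the shorter edge to 256 while keeping the aspect ratio, and CenterCrop then takes a 224×224 patch. To rule out the model side, one can feed a dummy NCHW batch through the classifier (a sketch; build_classifier and extract_feat are assumed to match the mmcls 0.x API):

import torch
from mmcls.models import build_classifier

model = build_classifier(cfg.model)
model.eval()
dummy = torch.randn(1, 3, 224, 224)  # NCHW batch with 3 channels, as the stem conv expects
with torch.no_grad():
    feats = model.extract_feat(dummy)  # backbone + neck; no RuntimeError means shapes are fine
print([f.shape for f in feats])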

bsense-rius · May 12 '22

This issue will be closed as it is inactive; feel free to re-open it if necessary.

tonysy · Dec 12 '22