
[Bug] Problem saving epoch checkpoint when fine-tuning EfficientNet-B0

Open · opened by bsense-rius · 2 comments

Describe the bug

When fine-tuning EfficientNet-B0 with only minimal changes to the Getting Started Colab notebook, a RuntimeError is raised when the first checkpoint is saved after the first epoch completes:

"RuntimeError: Given groups=1, weight of size [32, 3, 3, 3], expected input[32, 224, 225, 5] to have 3 channels, but got 224 channels instead"

To Reproduce

Use the Google Colab MMClassification Getting Started notebook, changing ONLY the MobileNetV2 config and checkpoint files to those from the EfficientNet-B0 model zoo, as shown:

config_file = 'configs/efficientnet/efficientnet-b0_8xb32_in1k.py'
checkpoint_file = 'https://download.openmmlab.com/mmclassification/v0/efficientnet/efficientnet-b0_3rdparty_8xb32_in1k_20220119-a7e2a0b1.pth'
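
For context, the rest of the fine-tuning setup follows the notebook unchanged; roughly (a minimal sketch, assuming the standard Getting Started flow; the exact cell may differ):

from mmcv import Config

cfg = Config.fromfile(config_file)
cfg.model.head.num_classes = 2   # cats vs. dogs instead of the 1000 ImageNet classes
cfg.load_from = checkpoint_file  # initialize from the pretrained EfficientNet-B0 weights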

Post related information

  1. The output of pip list | grep "mmcv\|mmcls\|^torch"
mmcls                         0.23.0                /content/mmclassification
mmcv                          1.5.0
torch                         1.11.0+cu113
torchaudio                    0.11.0+cu113
torchsummary                  1.5.1
torchtext                     0.12.0
torchvision                   0.12.0+cu113
  2. Your config file if you modified it or created a new one. Nothing was modified beyond the Google Colab MMClassification Getting Started => Fine-tune section.

  3. Your train log file if you met the problem during training.

2022-05-04 09:03:37,380 - mmcls - INFO - workflow: [('train', 1)], max: 2 epochs
2022-05-04 09:03:37,383 - mmcls - INFO - Checkpoints will be saved to /content/mmclassification/work_dirs/cats_dogs_dataset by HardDiskBackend.
2022-05-04 09:03:44,796 - mmcls - INFO - Epoch [1][10/201]	lr: 5.000e-03, eta: 0:04:44, time: 0.725, data_time: 0.252, memory: 3653, loss: 0.6385
2022-05-04 09:03:49,460 - mmcls - INFO - Epoch [1][20/201]	lr: 5.000e-03, eta: 0:03:47, time: 0.466, data_time: 0.016, memory: 3653, loss: 0.4478
2022-05-04 09:03:54,131 - mmcls - INFO - Epoch [1][30/201]	lr: 5.000e-03, eta: 0:03:25, time: 0.467, data_time: 0.016, memory: 3653, loss: 0.3196
2022-05-04 09:03:58,821 - mmcls - INFO - Epoch [1][40/201]	lr: 5.000e-03, eta: 0:03:12, time: 0.469, data_time: 0.016, memory: 3653, loss: 0.2780
2022-05-04 09:04:03,520 - mmcls - INFO - Epoch [1][50/201]	lr: 5.000e-03, eta: 0:03:02, time: 0.470, data_time: 0.016, memory: 3653, loss: 0.2618
2022-05-04 09:04:08,239 - mmcls - INFO - Epoch [1][60/201]	lr: 5.000e-03, eta: 0:02:54, time: 0.472, data_time: 0.016, memory: 3653, loss: 0.2120
2022-05-04 09:04:13,059 - mmcls - INFO - Epoch [1][70/201]	lr: 5.000e-03, eta: 0:02:48, time: 0.482, data_time: 0.019, memory: 3653, loss: 0.1787
2022-05-04 09:04:17,811 - mmcls - INFO - Epoch [1][80/201]	lr: 5.000e-03, eta: 0:02:42, time: 0.475, data_time: 0.017, memory: 3653, loss: 0.1877
2022-05-04 09:04:22,604 - mmcls - INFO - Epoch [1][90/201]	lr: 5.000e-03, eta: 0:02:36, time: 0.479, data_time: 0.019, memory: 3653, loss: 0.1741
2022-05-04 09:04:27,354 - mmcls - INFO - Epoch [1][100/201]	lr: 5.000e-03, eta: 0:02:30, time: 0.475, data_time: 0.016, memory: 3653, loss: 0.1909
2022-05-04 09:04:32,111 - mmcls - INFO - Epoch [1][110/201]	lr: 5.000e-03, eta: 0:02:24, time: 0.476, data_time: 0.017, memory: 3653, loss: 0.1907
2022-05-04 09:04:36,872 - mmcls - INFO - Epoch [1][120/201]	lr: 5.000e-03, eta: 0:02:19, time: 0.476, data_time: 0.016, memory: 3653, loss: 0.1520
2022-05-04 09:04:41,645 - mmcls - INFO - Epoch [1][130/201]	lr: 5.000e-03, eta: 0:02:14, time: 0.477, data_time: 0.016, memory: 3653, loss: 0.2102
2022-05-04 09:04:46,422 - mmcls - INFO - Epoch [1][140/201]	lr: 5.000e-03, eta: 0:02:08, time: 0.478, data_time: 0.016, memory: 3653, loss: 0.1830
2022-05-04 09:04:51,202 - mmcls - INFO - Epoch [1][150/201]	lr: 5.000e-03, eta: 0:02:03, time: 0.478, data_time: 0.016, memory: 3653, loss: 0.1848
2022-05-04 09:04:56,005 - mmcls - INFO - Epoch [1][160/201]	lr: 5.000e-03, eta: 0:01:58, time: 0.480, data_time: 0.018, memory: 3653, loss: 0.1488
2022-05-04 09:05:00,775 - mmcls - INFO - Epoch [1][170/201]	lr: 5.000e-03, eta: 0:01:53, time: 0.477, data_time: 0.016, memory: 3653, loss: 0.1551
2022-05-04 09:05:05,549 - mmcls - INFO - Epoch [1][180/201]	lr: 5.000e-03, eta: 0:01:48, time: 0.477, data_time: 0.017, memory: 3653, loss: 0.1437
2022-05-04 09:05:10,317 - mmcls - INFO - Epoch [1][190/201]	lr: 5.000e-03, eta: 0:01:43, time: 0.477, data_time: 0.016, memory: 3653, loss: 0.1606
2022-05-04 09:05:15,096 - mmcls - INFO - Epoch [1][200/201]	lr: 5.000e-03, eta: 0:01:38, time: 0.478, data_time: 0.016, memory: 3653, loss: 0.1266
2022-05-04 09:05:15,185 - mmcls - INFO - Saving checkpoint at 1 epochs
[                                                  ] 0/1601, elapsed: 0s, ETA:
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-9-23436e39b2a0> in <module>()
     33     validate=True,
     34     timestamp=time.strftime('%Y%m%d_%H%M%S', time.localtime()),
---> 35     meta=dict())

21 frames
/usr/local/lib/python3.7/dist-packages/mmcv/cnn/bricks/conv2d_adaptive_padding.py in forward(self, x)
     60             ])
     61         return F.conv2d(x, self.weight, self.bias, self.stride, self.padding,
---> 62                         self.dilation, self.groups)

RuntimeError: Given groups=1, weight of size [32, 3, 3, 3], expected input[32, 224, 225, 5] to have 3 channels, but got 224 channels instead
  4. Other code you modified in the `mmcls` folder.
Nothing

Additional context


Nothing

bsense-rius · May 04 '22

It seems there is an error in the evaluation hook config. The input to your model has shape [32, 224, 225, 5], but the expected input shape is [32, 224, 224, 3]. Can you show me your data.val config?
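
A quick way to check what the val pipeline actually produces (a minimal sketch, assuming cfg is the config object used for training):

from mmcls.datasets import build_dataset

val_dataset = build_dataset(cfg.data.val)
sample = val_dataset[0]
print(sample['img'].shape)  # should be torch.Size([3, 224, 224]) after ImageToTensor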

Ezra-Yu · May 05 '22

I have not changed anything in the validation scheme, which works for other classifiers. The output of cfg.data.val:

import json
print(json.dumps(cfg.data.val, indent=2))

{
  "type": "ImageNet",
  "data_prefix": "data/cats_dogs_dataset/val_set/val_set",
  "ann_file": "data/cats_dogs_dataset/val.txt",
  "pipeline": [
    {
      "type": "LoadImageFromFile"
    },
    {
      "type": "Resize",
      "size": [
        256,
        -1
      ],
      "backend": "pillow"
    },
    {
      "type": "CenterCrop",
      "crop_size": 224
    },
    {
      "type": "Normalize",
      "mean": [
        124.508,
        116.05,
        106.438
      ],
      "std": [
        58.577,
        57.31,
        57.437
      ],
      "to_rgb": true
    },
    {
      "type": "ImageToTensor",
      "keys": [
        "img"
      ]
    },
    {
      "type": "Collect",
      "keys": [
        "img"
      ]
    }
  ],
  "classes": "data/cats_dogs_dataset/classes.txt"
}
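
So this pipeline should yield 3×224×224 tensors per image: Resize with size=[256, -1] scales the shorter edge to 256 while keeping the aspect ratio, and CenterCrop then takes a 224×224 patch. To rule out the model side, one can feed a dummy NCHW batch through the classifier (a sketch; build_classifier and extract_feat are assumed to match the mmcls 0.x API):

import torch
from mmcls.models import build_classifier

model = build_classifier(cfg.model)
model.eval()
dummy = torch.randn(1, 3, 224, 224)  # NCHW batch with 3 channels, as the stem conv expects
with torch.no_grad():
    feats = model.extract_feat(dummy)  # backbone + neck; no RuntimeError means shapes are fine
print([f.shape for f in feats])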

bsense-rius · May 12 '22

This issue will be closed as it is inactive; feel free to re-open it if necessary.

tonysy · Dec 12 '22