
Error when training an MAE model on CIFAR

Open Jia-Baos opened this issue 2 years ago • 11 comments

This is the config file:

```python
# >>>>>>>>>>>>>>>>>>>>> Start of Changed >>>>>>>>>>>>>>>>>>>>>>>>>
_base_ = [
    '../_base_/models/mae_vit-base-p16.py',
    # '../_base_/datasets/cifar.py',
    '../_base_/schedules/adamw_coslr-200e_in1k.py',
    '../_base_/default_runtime.py',
]

# dataset settings
# data_root = 'data/cifar'
# file_client_args = dict(backend='disk')
data_source = 'CIFAR10'
dataset_type = 'SingleViewDataset'
img_norm_cfg = dict(
    mean=[0.4914, 0.4822, 0.4465], std=[0.2023, 0.1994, 0.201])
train_pipeline = [
    dict(type='RandomCrop', size=32, padding=4),
    dict(type='RandomHorizontalFlip'),
]
test_pipeline = []

# prefetch
prefetch = False
if not prefetch:
    train_pipeline.extend(
        [dict(type='ToTensor'),
         dict(type='Normalize', **img_norm_cfg)])
    test_pipeline.extend(
        [dict(type='ToTensor'),
         dict(type='Normalize', **img_norm_cfg)])

# dataset summary
data = dict(
    samples_per_gpu=128,
    workers_per_gpu=2,
    train=dict(
        type=dataset_type,
        data_source=dict(type=data_source, data_prefix='data/cifar'),
        pipeline=train_pipeline,
        prefetch=prefetch),
    val=dict(
        type=dataset_type,
        data_source=dict(type=data_source, data_prefix='data/cifar'),
        pipeline=test_pipeline,
        prefetch=prefetch),
    test=dict(
        type=dataset_type,
        data_source=dict(type=data_source, data_prefix='data/cifar'),
        pipeline=test_pipeline,
        prefetch=prefetch))
# evaluation = dict(interval=10, topk=(1, 5))

# dataset 8 x 512
train_dataloader = dict(batch_size=128, num_workers=8)

# <<<<<<<<<<<<<<<<<<<<<< End of Changed <<<<<<<<<<<<<<<<<<<<<<<<<<<

# optimizer wrapper
optimizer = dict(
    type='AdamW',
    lr=1.5e-4 * 4096 / 256,
    betas=(0.9, 0.95),
    weight_decay=0.05)
optim_wrapper = dict(
    type='OptimWrapper',
    optimizer=optimizer,
    paramwise_cfg=dict(
        custom_keys={
            'ln': dict(decay_mult=0.0),
            'bias': dict(decay_mult=0.0),
            'pos_embed': dict(decay_mult=0.),
            'mask_token': dict(decay_mult=0.),
            'cls_token': dict(decay_mult=0.)
        }))

# learning rate scheduler
param_scheduler = [
    dict(
        type='LinearLR',
        start_factor=1e-4,
        by_epoch=True,
        begin=0,
        end=40,
        convert_to_iter_based=True),
    dict(
        type='CosineAnnealingLR',
        T_max=360,
        by_epoch=True,
        begin=40,
        end=400,
        convert_to_iter_based=True)
]

# runtime settings
# pre-train for 400 epochs
train_cfg = dict(max_epochs=3)
default_hooks = dict(
    logger=dict(type='LoggerHook', interval=100),
    # only keeps the latest 3 checkpoints
    checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3))

# randomness
randomness = dict(seed=0, diff_rank_seed=True)
resume = True
```

This is the log:

```
During handling of the above exception, another exception occurred:

2023/02/20 01:19:00 - mmengine - INFO -
System environment:
    sys.platform: linux
    Python: 3.8.16 (default, Jan 17 2023, 23:13:24) [GCC 11.2.0]
    CUDA available: True
    numpy_random_seed: 0
    GPU 0: NVIDIA GeForce RTX 3090
    CUDA_HOME: /data/apps/cuda/11.1
    NVCC: Cuda compilation tools, release 11.1, V11.1.74
    GCC: gcc (GCC) 7.3.0
    PyTorch: 1.10.0+cu111
    PyTorch compiling details: PyTorch built with:
    - GCC 7.3
    - C++ Version: 201402
    - Intel(R) Math Kernel Library Version 2020.0.0 Product Build 20191122 for Intel(R) 64 architecture applications
    - Intel(R) MKL-DNN v2.2.3 (Git Hash 7336ca9f055cf1bfa13efb658fe15dc9b41f0740)
    - OpenMP 201511 (a.k.a. OpenMP 4.5)
    - LAPACK is enabled (usually provided by MKL)
    - NNPACK is enabled
    - CPU capability usage: AVX2
    - CUDA Runtime 11.1
    - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86
    - CuDNN 8.0.5
    - Magma 2.5.2
    - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.1, CUDNN_VERSION=8.0.5, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.10.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON,
    TorchVision: 0.11.0+cu111
    OpenCV: 4.7.0
    MMEngine: 0.5.0

Runtime environment:
    cudnn_benchmark: False
    mp_cfg: {'mp_start_method': 'fork', 'opencv_num_threads': 0}
    dist_cfg: {'backend': 'nccl'}
    seed: 0
    diff_rank_seed: True
    Distributed launcher: none
    Distributed training: False
    GPU number: 1

2023/02/20 01:19:00 - mmengine - INFO - Config:
model = dict(
    type='MAE',
    data_preprocessor=dict(
        mean=[123.675, 116.28, 103.53],
        std=[58.395, 57.12, 57.375],
        bgr_to_rgb=True),
    backbone=dict(type='MAEViT', arch='b', patch_size=16, mask_ratio=0.75),
    neck=dict(
        type='MAEPretrainDecoder',
        patch_size=16,
        in_chans=3,
        embed_dim=768,
        decoder_embed_dim=512,
        decoder_depth=8,
        decoder_num_heads=16,
        mlp_ratio=4.0),
    head=dict(
        type='MAEPretrainHead',
        norm_pix=True,
        patch_size=16,
        loss=dict(type='MAEReconstructionLoss')),
    init_cfg=[
        dict(type='Xavier', distribution='uniform', layer='Linear'),
        dict(type='Constant', layer='LayerNorm', val=1.0, bias=0.0)
    ])
optimizer = dict(
    type='AdamW', lr=0.0024, betas=(0.9, 0.95), weight_decay=0.05)
optim_wrapper = dict(
    type='OptimWrapper',
    optimizer=dict(
        type='AdamW', lr=0.0024, betas=(0.9, 0.95), weight_decay=0.05),
    paramwise_cfg=dict(
        custom_keys=dict(
            ln=dict(decay_mult=0.0),
            bias=dict(decay_mult=0.0),
            pos_embed=dict(decay_mult=0.0),
            mask_token=dict(decay_mult=0.0),
            cls_token=dict(decay_mult=0.0))))
param_scheduler = [
    dict(
        type='LinearLR',
        start_factor=0.0001,
        by_epoch=True,
        begin=0,
        end=40,
        convert_to_iter_based=True),
    dict(
        type='CosineAnnealingLR',
        T_max=360,
        by_epoch=True,
        begin=40,
        end=400,
        convert_to_iter_based=True)
]
train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=3)
default_scope = 'mmselfsup'
default_hooks = dict(
    runtime_info=dict(type='RuntimeInfoHook'),
    timer=dict(type='IterTimerHook'),
    logger=dict(type='LoggerHook', interval=100),
    param_scheduler=dict(type='ParamSchedulerHook'),
    checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3),
    sampler_seed=dict(type='DistSamplerSeedHook'))
env_cfg = dict(
    cudnn_benchmark=False,
    mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
    dist_cfg=dict(backend='nccl'))
log_processor = dict(
    window_size=10,
    custom_cfg=[dict(data_src='', method='mean', window_size='global')])
vis_backends = [dict(type='LocalVisBackend')]
visualizer = dict(
    type='SelfSupVisualizer',
    vis_backends=[dict(type='LocalVisBackend')],
    name='visualizer')
log_level = 'INFO'
load_from = None
resume = True
data_source = 'CIFAR10'
dataset_type = 'SingleViewDataset'
img_norm_cfg = dict(mean=[0.4914, 0.4822, 0.4465], std=[0.2023, 0.1994, 0.201])
train_pipeline = [
    dict(type='RandomCrop', size=32, padding=4),
    dict(type='RandomHorizontalFlip'),
    dict(type='ToTensor'),
    dict(
        type='Normalize',
        mean=[0.4914, 0.4822, 0.4465],
        std=[0.2023, 0.1994, 0.201])
]
test_pipeline = [
    dict(type='ToTensor'),
    dict(
        type='Normalize',
        mean=[0.4914, 0.4822, 0.4465],
        std=[0.2023, 0.1994, 0.201])
]
prefetch = False
data = dict(
    samples_per_gpu=128,
    workers_per_gpu=2,
    train=dict(
        type='SingleViewDataset',
        data_source=dict(type='CIFAR10', data_prefix='data/cifar'),
        pipeline=[
            dict(type='RandomCrop', size=32, padding=4),
            dict(type='RandomHorizontalFlip'),
            dict(type='ToTensor'),
            dict(
                type='Normalize',
                mean=[0.4914, 0.4822, 0.4465],
                std=[0.2023, 0.1994, 0.201])
        ],
        prefetch=False),
    val=dict(
        type='SingleViewDataset',
        data_source=dict(type='CIFAR10', data_prefix='data/cifar'),
        pipeline=[
            dict(type='ToTensor'),
            dict(
                type='Normalize',
                mean=[0.4914, 0.4822, 0.4465],
                std=[0.2023, 0.1994, 0.201])
        ],
        prefetch=False),
    test=dict(
        type='SingleViewDataset',
        data_source=dict(type='CIFAR10', data_prefix='data/cifar'),
        pipeline=[
            dict(type='ToTensor'),
            dict(
                type='Normalize',
                mean=[0.4914, 0.4822, 0.4465],
                std=[0.2023, 0.1994, 0.201])
        ],
        prefetch=False))
train_dataloader = dict(batch_size=128, num_workers=8)
randomness = dict(seed=0, diff_rank_seed=True)
launcher = 'none'
work_dir = 'work'

2023/02/20 01:19:00 - mmengine - WARNING - The "visualizer" registry in mmselfsup did not set import location. Fallback to call mmselfsup.utils.register_all_modules instead.
2023/02/20 01:19:00 - mmengine - WARNING - The "vis_backend" registry in mmselfsup did not set import location. Fallback to call mmselfsup.utils.register_all_modules instead.
2023/02/20 01:19:01 - mmengine - WARNING - The "model" registry in mmselfsup did not set import location. Fallback to call mmselfsup.utils.register_all_modules instead.
2023/02/20 01:19:06 - mmengine - INFO - Distributed training is not used, all SyncBatchNorm (SyncBN) layers in the model will be automatically reverted to BatchNormXd layers if they are used.
2023/02/20 01:19:06 - mmengine - WARNING - The "hook" registry in mmselfsup did not set import location. Fallback to call mmselfsup.utils.register_all_modules instead.
2023/02/20 01:19:06 - mmengine - INFO - Hooks will be executed in the following order:
before_run:
(VERY_HIGH ) RuntimeInfoHook
(BELOW_NORMAL) LoggerHook

before_train:
(VERY_HIGH ) RuntimeInfoHook
(NORMAL ) IterTimerHook
(VERY_LOW ) CheckpointHook

before_train_epoch:
(VERY_HIGH ) RuntimeInfoHook
(NORMAL ) IterTimerHook
(NORMAL ) DistSamplerSeedHook

before_train_iter:
(VERY_HIGH ) RuntimeInfoHook
(NORMAL ) IterTimerHook

after_train_iter:
(VERY_HIGH ) RuntimeInfoHook
(NORMAL ) IterTimerHook
(BELOW_NORMAL) LoggerHook
(LOW ) ParamSchedulerHook
(VERY_LOW ) CheckpointHook

after_train_epoch:
(NORMAL ) IterTimerHook
(LOW ) ParamSchedulerHook
(VERY_LOW ) CheckpointHook

before_val_epoch:
(NORMAL ) IterTimerHook

before_val_iter:
(NORMAL ) IterTimerHook

after_val_iter:
(NORMAL ) IterTimerHook
(BELOW_NORMAL) LoggerHook

after_val_epoch:
(VERY_HIGH ) RuntimeInfoHook
(NORMAL ) IterTimerHook
(BELOW_NORMAL) LoggerHook
(LOW ) ParamSchedulerHook
(VERY_LOW ) CheckpointHook

before_test_epoch:
(NORMAL ) IterTimerHook

before_test_iter:
(NORMAL ) IterTimerHook

after_test_iter:
(NORMAL ) IterTimerHook
(BELOW_NORMAL) LoggerHook

after_test_epoch:
(VERY_HIGH ) RuntimeInfoHook
(NORMAL ) IterTimerHook
(BELOW_NORMAL) LoggerHook

after_run:
(BELOW_NORMAL) LoggerHook

2023/02/20 01:19:07 - mmengine - WARNING - The "loop" registry in mmselfsup did not set import location. Fallback to call mmselfsup.utils.register_all_modules instead.

Traceback (most recent call last):
  File "/HOME/scz3182/.conda/envs/openmmlab/lib/python3.8/site-packages/mmengine/registry/build_functions.py", line 121, in build_from_cfg
    obj = obj_cls(**args)  # type: ignore
  File "/HOME/scz3182/.conda/envs/openmmlab/lib/python3.8/site-packages/mmengine/runner/loops.py", line 43, in __init__
    super().__init__(runner, dataloader)
  File "/HOME/scz3182/.conda/envs/openmmlab/lib/python3.8/site-packages/mmengine/runner/base_loop.py", line 26, in __init__
    self.dataloader = runner.build_dataloader(
  File "/HOME/scz3182/.conda/envs/openmmlab/lib/python3.8/site-packages/mmengine/runner/runner.py", line 1331, in build_dataloader
    dataset_cfg = dataloader_cfg.pop('dataset')
KeyError: 'dataset'

Traceback (most recent call last):
  File "tools/train.py", line 99, in <module>
    main()
  File "tools/train.py", line 95, in main
    runner.train()
  File "/HOME/scz3182/.conda/envs/openmmlab/lib/python3.8/site-packages/mmengine/runner/runner.py", line 1656, in train
    self._train_loop = self.build_train_loop(
  File "/HOME/scz3182/.conda/envs/openmmlab/lib/python3.8/site-packages/mmengine/runner/runner.py", line 1448, in build_train_loop
    loop = LOOPS.build(
  File "/HOME/scz3182/.conda/envs/openmmlab/lib/python3.8/site-packages/mmengine/registry/registry.py", line 521, in build
    return self.build_func(cfg, *args, **kwargs, registry=self)
  File "/HOME/scz3182/.conda/envs/openmmlab/lib/python3.8/site-packages/mmengine/registry/build_functions.py", line 135, in build_from_cfg
    raise type(e)(
KeyError: "class EpochBasedTrainLoop in mmengine/runner/loops.py: 'dataset'"
```

Jia-Baos commented Feb 19 '23 17:02

It seems your config is from the MMSelfSup 0.x version, but the log shows you are using MMEngine to start your training job. Try to pull and check out the MMSelfSup 1.x branch, and use the new MAE config.

YuanLiuuuuuu commented Feb 20 '23 03:02
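For context, the `KeyError: 'dataset'` in the traceback comes from `runner.build_dataloader`, which pops a `dataset` key out of each dataloader config. The 0.x-style top-level `data = dict(train=..., val=..., test=...)` block never provides that key. A minimal sketch of the 1.x-style layout follows; the dataset type, paths, and sampler settings are illustrative assumptions, not the exact shipped MAE config:

```python
# Hedged sketch of an MMSelfSup 1.x (MMEngine) dataloader config.
# The Runner looks up train_dataloader['dataset'], so the dataset
# definition must live inside the dataloader dict itself.
train_dataloader = dict(
    batch_size=128,
    num_workers=8,
    sampler=dict(type='DefaultSampler', shuffle=True),
    dataset=dict(                       # <- the key the Runner pops
        type='mmcls.ImageNet',          # assumption: ImageNet-style dataset class
        data_root='data/imagenet/',     # adjust to your data layout
        ann_file='meta/train.txt',
        data_prefix=dict(img_path='train/'),
        pipeline=train_pipeline))
```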

Emm, that's a great idea. Using the MMSelfSup 1.x branch to train MAE on my own dataset (see: https://mmselfsup.readthedocs.io/zh_CN/dev-1.x/user_guides/4_pretrain_custom_dataset.html) worked, but now I want to train MAE on CIFAR10. How should I change the dataset part of the config? I can't find an example in https://github.com/open-mmlab/mmselfsup/tree/1.x/configs/selfsup/_base_/datasets, which only covers ImageNet.

Jia-Baos commented Feb 20 '23 04:02

You can refer to this doc. Before you pre-train your model on CIFAR10, you should refactor the CIFAR10 folder into the ImageNet1K style and create an ImageNet1K-style annotation file. After that, you only need a few changes to the config: replace data_root and ann_file with your own settings.

YuanLiuuuuuu commented Feb 20 '23 04:02
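To make those "few changes" concrete, here is a hedged sketch of such an override config. The base config filename and the paths below are assumptions for illustration; check the actual file names under configs/selfsup/mae/ in the 1.x branch:

```python
# Hypothetical override: inherit a 1.x MAE config and repoint the dataset
# at a CIFAR10 folder that has been refactored into the ImageNet1K layout.
_base_ = 'mae_vit-base-p16_8xb512-coslr-400e_in1k.py'  # hypothetical filename

train_dataloader = dict(
    dataset=dict(
        data_root='data/cifar10_imagenet_style/',  # your refactored folder
        ann_file='meta/train.txt',                 # your annotation file
        data_prefix=dict(img_path='train/')))
```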

Thanks~

Jia-Baos commented Feb 20 '23 04:02

Hi, have you solved the problem? I also want to use CIFAR in MMSelfSup. Can you provide your CIFAR directory structure? Thanks!

mrFocusXin commented Feb 21 '23 01:02

Emm, I didn't use CIFAR10 in the end; I just organized my own dataset into the ImageNet1K style and created an ImageNet1K-style annotation file. If you want to use CIFAR, a good approach is to convert CIFAR to the ImageNet1K style or the mmcls.CustomDataset style (see: https://mmselfsup.readthedocs.io/zh_CN/dev-1.x/user_guides/4_pretrain_custom_dataset.html); a conversion sketch follows below.

Jia-Baos commented Feb 22 '23 06:02
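If it helps, here is a hedged sketch of converting CIFAR10 into that ImageNet1K-style layout with torchvision. The output folder name and the PNG naming scheme are arbitrary choices for illustration, not a requirement of MMSelfSup:

```python
# Sketch: export CIFAR-10 into ImageNet1K-style class folders plus a
# meta/<split>.txt annotation file ("relative/path.png label" per line).
import os
from PIL import Image
from torchvision.datasets import CIFAR10

def export_split(split, out_root='data/cifar10_imagenet_style'):
    ds = CIFAR10('data/cifar', train=(split == 'train'), download=True)
    lines = []
    for i, (img, label) in enumerate(zip(ds.data, ds.targets)):
        cls = ds.classes[label]
        cls_dir = os.path.join(out_root, split, cls)
        os.makedirs(cls_dir, exist_ok=True)
        name = f'{cls}_{i:05d}.png'
        # ds.data is a uint8 array of shape (N, 32, 32, 3), so fromarray works
        Image.fromarray(img).save(os.path.join(cls_dir, name))
        lines.append(f'{cls}/{name} {label}')
    os.makedirs(os.path.join(out_root, 'meta'), exist_ok=True)
    with open(os.path.join(out_root, 'meta', f'{split}.txt'), 'w') as f:
        f.write('\n'.join(lines) + '\n')

for split in ('train', 'val'):  # CIFAR-10's test split doubles as 'val' here
    export_split(split)
```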

@Jia-Baos Thank you for your reply! What is the specific format of the ImageNet1K style? I found the ImageNet1K dataset on Google, but not the meta folder, and I only found a picture in the MMSelfSup docs without the concrete structure. For example, what should be under the meta folder? I have prepared my own dataset, but I don't know exactly what the ImageNet1K style is; would you mind giving me an example or documentation? Thank you very much, your reply is very helpful to me!

mrFocusXin commented Feb 23 '23 02:02
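As far as I know, the ImageNet1K-style layout that the MMClassification/MMSelfSup docs assume looks roughly like this; the folder and file names below are placeholders (with CIFAR10 they would be the ten class names):

```
data/cifar10_imagenet_style/
├── meta/
│   ├── train.txt    # one line per image: "<class_dir>/<file> <label_index>"
│   └── val.txt
├── train/
│   ├── airplane/
│   │   ├── airplane_00000.png
│   │   └── ...
│   └── ...
└── val/
    └── ...
```

Each line of meta/train.txt pairs a path relative to the split folder with an integer label, e.g. `airplane/airplane_00000.png 0`.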

I'm very pleased that I could help. You can refer to the file 北京超算30区使用MMClassification训练花卉图片分类模型.pdf (training a flower image classification model with MMClassification on the Beijing Supercomputing Center, Zone 30) in https://github.com/Jia-Baos/OpenMM.

Jia-Baos commented Feb 23 '23 03:02

Thank you very much for your reply. It works!!

mrFocusXin commented Feb 23 '23 08:02

> I'm very pleased that I could help. You can refer to the file 北京超算30区使用MMClassification训练花卉图片分类模型.pdf (training a flower image classification model with MMClassification on the Beijing Supercomputing Center, Zone 30) in https://github.com/Jia-Baos/OpenMM.

Hello, I cannot find such a PDF file. Also, I ran into an issue while training my model on my custom dataset. I got this error: KeyError: 'SelfSupVisualizer is not in the visualizer registry. Please check whether the value of SelfSupVisualizer is correct or it was registered as expected. More details can be found at https://mmengine.readthedocs.io/en/latest/advanced_tutorials/config.html#import-the-custom-module'

alaa-shubbak commented Apr 22 '23 18:04
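On the SelfSupVisualizer registry error: the warnings earlier in this thread's log ("Fallback to call mmselfsup.utils.register_all_modules instead") suggest that MMSelfSup's modules must be registered before the Runner builds the visualizer. A hedged sketch; the `init_default_scope` argument is assumed to mirror what tools/train.py does:

```python
# Hedged sketch: register all MMSelfSup components (models, hooks,
# visualizers such as SelfSupVisualizer, ...) into the registries and
# set the default scope to 'mmselfsup' before building the Runner.
from mmselfsup.utils import register_all_modules

register_all_modules(init_default_scope=True)
```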

Emm, you can find this file in my repository OpenMM. As for the error, I have moved on to optical flow estimation, so I can't help you further......

Jia-Baos commented Apr 23 '23 02:04