
About the MoCo v3 loss: why does it stop decreasing?

Open RobinHan24 opened this issue 1 year ago • 0 comments

Branch

main branch (mmpretrain version)

Describe the bug

I trained MoCo v3 (ResNet-50) on my own dataset. The loss dropped from 27 to 23 and has stayed at about 23 ever since. Why?

12/20 09:34:59 - mmengine - INFO - Saving checkpoint at 3562 epochs
/mnt/sda/qilibin/miniconda3/envs/py3.9/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:2387: UserWarning: torch.distributed._all_gather_base is a private function and will be deprecated. Please use torch.distributed.all_gather_into_tensor instead.
  warnings.warn(
12/20 09:35:11 - mmengine - INFO - Exp name: mocov3_resnet50_8xb512-amp-coslr-800e_in1k_20231219_222815
12/20 09:35:11 - mmengine - INFO - Epoch(train) [3563][3/3] lr: 2.4223e+00 eta: 0:56:54 time: 2.4053 data_time: 1.6347 memory: 18037 loss: 23.5969
12/20 09:35:11 - mmengine - INFO - Saving checkpoint at 3563 epochs
/mnt/sda/qilibin/miniconda3/envs/py3.9/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:2387: UserWarning: torch.distributed._all_gather_base is a private function and will be deprecated. Please use torch.distributed.all_gather_into_tensor instead.
  warnings.warn(
12/20 09:35:22 - mmengine - INFO - Exp name: mocov3_resnet50_8xb512-amp-coslr-800e_in1k_20231219_222815
12/20 09:35:22 - mmengine - INFO - Epoch(train) [3564][3/3] lr: 2.4127e+00 eta: 0:56:47 time: 2.3942 data_time: 1.6182 memory: 18037 loss: 23.5970
12/20 09:35:22 - mmengine - INFO - Saving checkpoint at 3564 epochs
/mnt/sda/qilibin/miniconda3/envs/py3.9/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:2387: UserWarning: torch.distributed._all_gather_base is a private function and will be deprecated. Please use torch.distributed.all_gather_into_tensor instead.
  warnings.warn(
12/20 09:35:33 - mmengine - INFO - Exp name: mocov3_resnet50_8xb512-amp-coslr-800e_in1k_20231219_222815
12/20 09:35:33 - mmengine - INFO - Epoch(train) [3565][3/3] lr: 2.4032e+00 eta: 0:56:39 time: 2.3664 data_time: 1.5931 memory: 18037 loss: 23.5975
12/20 09:35:33 - mmengine - INFO - Saving checkpoint at 3565 epochs
/mnt/sda/qilibin/miniconda3/envs/py3.9/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:2387: UserWarning: torch.distributed._all_gather_base is a private function and will be deprecated. Please use torch.distributed.all_gather_into_tensor instead.
  warnings.warn(
12/20 09:35:44 - mmengine - INFO - Exp name: mocov3_resnet50_8xb512-amp-coslr-800e_in1k_20231219_222815
12/20 09:35:44 - mmengine - INFO - Epoch(train) [3566][3/3] lr: 2.3936e+00 eta: 0:56:31 time: 2.3551 data_time: 1.5844 memory: 18037 loss: 23.5980
12/20 09:35:44 - mmengine - INFO - Saving checkpoint at 3566 epochs
/mnt/sda/qilibin/miniconda3/envs/py3.9/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:2387: UserWarning: torch.distributed._all_gather_base is a private function and will be deprecated. Please use torch.distributed.all_gather_into_tensor instead.
  warnings.warn(
12/20 09:35:56 - mmengine - INFO - Exp name: mocov3_resnet50_8xb512-amp-coslr-800e_in1k_20231219_222815
12/20 09:35:56 - mmengine - INFO - Epoch(train) [3567][3/3] lr: 2.3841e+00 eta: 0:56:23 time: 2.3336 data_time: 1.5614 memory: 18037 loss: 23.5950
12/20 09:35:56 - mmengine - INFO - Saving checkpoint at 3567 epochs
/mnt/sda/qilibin/miniconda3/envs/py3.9/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:2387: UserWarning: torch.distributed._all_gather_base is a private function and will be deprecated. Please use torch.distributed.all_gather_into_tensor instead.
  warnings.warn(
12/20 09:36:07 - mmengine - INFO - Exp name: mocov3_resnet50_8xb512-amp-coslr-800e_in1k_20231219_222815
12/20 09:36:07 - mmengine - INFO - Epoch(train) [3568][3/3] lr: 2.3745e+00 eta: 0:56:15 time: 2.4241 data_time: 1.6476 memory: 18037 loss: 23.5938
12/20 09:36:07 - mmengine - INFO - Saving checkpoint at 3568 epochs
/mnt/sda/qilibin/miniconda3/envs/py3.9/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:2387: UserWarning: torch.distributed._all_gather_base is a private function and will be deprecated. Please use torch.distributed.all_gather_into_tensor instead.
  warnings.warn(
12/20 09:36:18 - mmengine - INFO - Exp name: mocov3_resnet50_8xb512-amp-coslr-800e_in1k_20231219_222815
12/20 09:36:18 - mmengine - INFO - Epoch(train) [3569][3/3] lr: 2.3650e+00 eta: 0:56:08 time: 2.4099 data_time: 1.6249 memory: 18037 loss: 23.5949
12/20 09:36:18 - mmengine - INFO - Saving checkpoint at 3569 epochs
/mnt/sda/qilibin/miniconda3/envs/py3.9/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:2387: UserWarning: torch.distributed._all_gather_base is a private function and will be deprecated. Please use torch.distributed.all_gather_into_tensor instead.
  warnings.warn(
12/20 09:36:29 - mmengine - INFO - Exp name: mocov3_resnet50_8xb512-amp-coslr-800e_in1k_20231219_222815
12/20 09:36:29 - mmengine - INFO - Epoch(train) [3570][3/3] lr: 2.3555e+00 eta: 0:56:00 time: 2.3381 data_time: 1.5578 memory: 18037 loss: 23.5944
12/20 09:36:29 - mmengine - INFO - Saving checkpoint at 3570 epochs
/mnt/sda/qilibin/miniconda3/envs/py3.9/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:2387: UserWarning: torch.distributed._all_gather_base is a private function and will be deprecated. Please use torch.distributed.all_gather_into_tensor instead.
  warnings.warn(
12/20 09:36:39 - mmengine - INFO - Exp name: mocov3_resnet50_8xb512-amp-coslr-800e_in1k_20231219_222815
12/20 09:36:39 - mmengine - INFO - Epoch(train) [3571][3/3] lr: 2.3459e+00 eta: 0:55:52 time: 2.1778 data_time: 1.4037 memory: 18037 loss: 23.5939
12/20 09:36:39 - mmengine - INFO - Saving checkpoint at 3571 epochs
/mnt/sda/qilibin/miniconda3/envs/py3.9/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:2387: UserWarning: torch.distributed._all_gather_base is a private function and will be deprecated. Please use torch.distributed.all_gather_into_tensor instead.
  warnings.warn(
12/20 09:36:50 - mmengine - INFO - Exp name: mocov3_resnet50_8xb512-amp-coslr-800e_in1k_20231219_222815
12/20 09:36:50 - mmengine - INFO - Epoch(train) [3572][3/3] lr: 2.3364e+00 eta: 0:55:44 time: 2.2672 data_time: 1.4953 memory: 18037 loss: 23.5931
12/20 09:36:50 - mmengine - INFO - Saving checkpoint at 3572 epochs
/mnt/sda/qilibin/miniconda3/envs/py3.9/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:2387: UserWarning: torch.distributed._all_gather_base is a private function and will be deprecated. Please use torch.distributed.all_gather_into_tensor instead.
  warnings.warn(
12/20 09:37:02 - mmengine - INFO - Exp name: mocov3_resnet50_8xb512-amp-coslr-800e_in1k_20231219_222815
12/20 09:37:02 - mmengine - INFO - Epoch(train) [3573][3/3] lr: 2.3268e+00 eta: 0:55:36 time: 2.3623 data_time: 1.5885 memory: 18037 loss: 23.5929
12/20 09:37:02 - mmengine - INFO - Saving checkpoint at 3573 epochs
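As a rough sanity check on the plateau value, the range the loss can take can be estimated from `temperature = 1.0` and `loss_weight = 2 * temperature` in the config posted further down. The sketch below rests on several assumptions that the issue does not confirm: that the head applies cross-entropy to cosine similarities divided by the temperature, that keys are gathered from all GPUs, that two symmetric loss terms are summed, and that all 8 GPUs are used with 192 samples each. The function name `contrastive_loss_range` is hypothetical.

```python
import math


def contrastive_loss_range(num_keys, temperature=1.0, loss_weight=2.0, num_terms=2):
    """Return (loose_floor, chance_level) for a weighted InfoNCE-style loss.

    Logits are cosine similarities divided by `temperature`, so each one
    lies in [-1/temperature, 1/temperature].
    """
    scale = loss_weight * num_terms
    # Uniform logits: cross-entropy over num_keys classes equals log(num_keys).
    chance = math.log(num_keys)
    # Best case: positive at +1/T and every negative at -1/T. This is not
    # geometrically reachable for many keys, so it is only a loose bound.
    floor = math.log1p((num_keys - 1) * math.exp(-2.0 / temperature))
    return scale * floor, scale * chance


# Assumption: 8 GPUs x batch_size=192 keys gathered per step.
floor, chance = contrastive_loss_range(num_keys=192 * 8)
print(f"loose floor = {floor:.2f}, chance level = {chance:.2f}")
```

Under these assumptions the loss would start near 29.3 and could not fall below roughly 21.4, so a plateau around 23.6 may be closer to the attainable minimum than the raw number suggests. A different GPU count or temperature shifts both bounds.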

Environment

{'sys.platform': 'linux',
 'Python': '3.9.0 (default, Nov 15 2020, 14:28:56) [GCC 7.3.0]',
 'CUDA available': True,
 'numpy_random_seed': 2147483648,
 'GPU 0,1,2,3,4,5,6,7': 'NVIDIA A10',
 'CUDA_HOME': '/usr/local/cuda-11.7',
 'NVCC': 'Cuda compilation tools, release 11.7, V11.7.99',
 'GCC': 'gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0',
 'PyTorch': '1.13.1+cu117',
 'TorchVision': '0.14.1+cu117',
 'OpenCV': '4.8.0',
 'MMEngine': '0.7.3',
 'MMCV': '2.0.0',
 'MMPreTrain': '1.0.0rc7+e80418a'}

Additional information

My config file, mocov3_resnet50_8xb512-amp-coslr-800e_in1k.py:

_base_ = [
    'imagenet_bs512_mocov3.py',
    'default_runtime.py',
]

# model settings
temperature = 1.0
model = dict(
    type='MoCoV3',
    base_momentum=0.004,  # 0.01 for 100e and 300e, 0.004 for 800 and 1000e
    backbone=dict(
        type='ResNet',
        depth=50,
        norm_cfg=dict(type='SyncBN'),
        zero_init_residual=False),
    neck=dict(
        type='NonLinearNeck',
        in_channels=2048,
        hid_channels=4096,
        out_channels=256,
        num_layers=2,
        with_bias=False,
        with_last_bn=True,
        with_last_bn_affine=False,
        with_last_bias=False,
        with_avg_pool=True),
    head=dict(
        type='MoCoV3Head',
        predictor=dict(
            type='NonLinearNeck',
            in_channels=256,
            hid_channels=4096,
            out_channels=256,
            num_layers=2,
            with_bias=False,
            with_last_bn=False,
            with_last_bn_affine=False,
            with_last_bias=False,
            with_avg_pool=False),
        loss=dict(type='CrossEntropyLoss', loss_weight=2 * temperature),
        temperature=temperature))

# optimizer
optim_wrapper = dict(
    type='AmpOptimWrapper',
    loss_scale='dynamic',
    optimizer=dict(type='LARS', lr=4.8, weight_decay=1.5e-6, momentum=0.9),
    paramwise_cfg=dict(
        custom_keys={
            'bn': dict(decay_mult=0, lars_exclude=True),
            'bias': dict(decay_mult=0, lars_exclude=True),
            # bn layer in ResNet block downsample module
            'downsample.1': dict(decay_mult=0, lars_exclude=True),
        }),
)

# learning rate scheduler
param_scheduler = [
    dict(
        type='LinearLR',
        start_factor=1e-4,
        by_epoch=True,
        begin=0,
        end=10,
        convert_to_iter_based=True),
    dict(
        type='CosineAnnealingLR',
        T_max=790,
        by_epoch=True,
        begin=10,
        end=4000,
        convert_to_iter_based=True)
]

# runtime settings
train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=4000)
# only keeps the latest 3 checkpoints
default_hooks = dict(checkpoint=dict(max_keep_ckpts=3))

# NOTE: `auto_scale_lr` is for automatically scaling LR
# based on the actual training batch size.
auto_scale_lr = dict(base_batch_size=4096)
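One detail in this config that may be worth checking: the cosine scheduler uses `T_max=790` while its `end` and `max_epochs` are both 4000. The following is a minimal sketch of the closed-form cosine rule with `eta_min = 0`, converted to iterations using the 3 iterations per epoch shown in the log. The 0-based iteration counting and the closed form itself are assumptions (MMEngine's internal update may differ in detail), and `cosine_lr` is a hypothetical helper, not a library function.

```python
import math

BASE_LR = 4.8                    # optimizer lr from the config
ITERS_PER_EPOCH = 3              # from the "[3/3]" counter in the log
BEGIN = 10 * ITERS_PER_EPOCH     # cosine phase starts after the 10-epoch warmup
T_MAX = 790 * ITERS_PER_EPOCH    # T_max=790 epochs, converted to iterations


def cosine_lr(epoch: int, step: int) -> float:
    """Closed-form cosine annealing (eta_min = 0) at a 1-based (epoch, step)."""
    # Assumption: the scheduler counts iterations from 0.
    global_iter = (epoch - 1) * ITERS_PER_EPOCH + (step - 1)
    t = global_iter - BEGIN
    # No clamping at T_MAX, so the curve keeps oscillating after half a period.
    return 0.5 * BASE_LR * (1.0 + math.cos(math.pi * t / T_MAX))


for epoch in (11, 801, 1591, 3563, 3564):
    step = 1 if epoch < 3000 else ITERS_PER_EPOCH
    print(f"epoch {epoch:4d} step {step}: lr = {cosine_lr(epoch, step):.4f}")
```

With these assumptions the formula gives about 2.4223 and 2.4127 for the logged steps `[3563][3/3]` and `[3564][3/3]`, which matches the `lr` column in the log. That would suggest the learning rate is cycling between 0 and 4.8 with a period of 2 × 790 epochs instead of annealing once over the full 4000 epochs; setting `T_max` to the length of the cosine phase (3990 here) is one way to test whether this matters for the plateau.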

imagenet_bs512_mocov3.py:

# dataset settings
dataset_type = 'CustomDataset'
data_root = 'data/yf5class_old/'
data_preprocessor = dict(
    type='SelfSupDataPreprocessor',
    mean=[123.675, 116.28, 103.53],
    std=[58.395, 57.12, 57.375],
    to_rgb=True)

view_pipeline1 = [
    dict(
        type='RandomResizedCrop',
        scale=224,
        crop_ratio_range=(0.2, 1.),
        backend='pillow'),
    dict(
        type='RandomApply',
        transforms=[
            dict(
                type='ColorJitter',
                brightness=0.4,
                contrast=0.4,
                saturation=0.2,
                hue=0.1)
        ],
        prob=0.8),
    dict(
        type='RandomGrayscale',
        prob=0.2,
        keep_channels=True,
        channel_weights=(0.114, 0.587, 0.2989)),
    dict(
        type='GaussianBlur',
        magnitude_range=(0.1, 2.0),
        magnitude_std='inf',
        prob=1.),
    dict(type='Solarize', thr=128, prob=0.),
    dict(type='RandomFlip', prob=0.5),
]
view_pipeline2 = [
    dict(
        type='RandomResizedCrop',
        scale=224,
        crop_ratio_range=(0.2, 1.),
        backend='pillow'),
    dict(
        type='RandomApply',
        transforms=[
            dict(
                type='ColorJitter',
                brightness=0.4,
                contrast=0.4,
                saturation=0.2,
                hue=0.1)
        ],
        prob=0.8),
    dict(
        type='RandomGrayscale',
        prob=0.2,
        keep_channels=True,
        channel_weights=(0.114, 0.587, 0.2989)),
    dict(
        type='GaussianBlur',
        magnitude_range=(0.1, 2.0),
        magnitude_std='inf',
        prob=0.1),
    dict(type='Solarize', thr=128, prob=0.2),
    dict(type='RandomFlip', prob=0.5),
]
train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(
        type='MultiView',
        num_views=[1, 1],
        transforms=[view_pipeline1, view_pipeline2]),
    dict(type='PackInputs')
]

train_dataloader = dict(
    batch_size=192,
    num_workers=8,
    persistent_workers=True,
    pin_memory=True,
    sampler=dict(type='DefaultSampler', shuffle=True),
    collate_fn=dict(type='default_collate'),
    dataset=dict(
        type='CustomDataset',
        data_root=data_root,
        ann_file='',  # we assume the sub-folder format, so the annotation file must be left empty
        data_prefix='train',
        pipeline=train_pipeline))
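For reference, the per-GPU `batch_size=192` here and `base_batch_size=4096` in `auto_scale_lr` imply a total batch well below the one the reference learning rate was tuned for. The sketch below works out the numbers under the assumption that all 8 GPUs listed in the environment are used; the helper names are hypothetical, and the iteration count ignores sampler padding and `drop_last`.

```python
import math


def effective_batch_size(per_gpu_batch: int, num_gpus: int) -> int:
    """Total number of samples processed per optimizer step."""
    return per_gpu_batch * num_gpus


def linear_scaled_lr(base_lr: float, batch_size: int, base_batch_size: int) -> float:
    """Linear scaling rule: lr grows in proportion to the total batch size."""
    return base_lr * batch_size / base_batch_size


def iters_per_epoch(num_images: int, batch_size: int) -> int:
    """Approximate steps per epoch (ignores sampler padding and drop_last)."""
    return math.ceil(num_images / batch_size)


total = effective_batch_size(per_gpu_batch=192, num_gpus=8)  # assumption: 8 GPUs
print("total batch size:", total)
print("linearly scaled lr:", linear_scaled_lr(4.8, total, 4096))
# Illustrative dataset size only; the real size is not given in the issue.
print("iters per epoch for 4608 images:", iters_per_epoch(4608, total))
```

Under that assumption the total batch is 1536, and the linear scaling rule would give a learning rate of about 1.8 instead of 4.8. The logged `lr` values are larger than 1.8, which suggests scaling was not applied in this run, though the launch command is not shown. Three iterations per epoch, as in the log, would also correspond to a dataset of at most about 4608 images.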

RobinHan24, Dec 20 '23 01:12