self._epoch = checkpoint['meta']['epoch'] KeyError: 'meta'

Oct 13 '23 10:10 LL-XSJ

Hello, I found that the weights you provided contain no keys other than "state_dict", but when training on my own dataset the code needs to read checkpoint['meta']. What should I do in this situation?

Oct 13 '23 10:10 LL-XSJ

Can you show me your training config?

Oct 14 '23 06:10 TempleX98

Please use load_from to load the weights

Oct 14 '23 07:10 TempleX98

Hello author, there was an error in tools/train.py. It says that 'meta' does not exist in the checkpoint. I have checked that the pre-trained weight file does not contain a 'meta' key; it only has a state_dict key. Why is this? How should I modify it? The error occurred at the following call: train_detector(model, datasets, cfg, distributed=distributed, validate=(not args.no_validate), timestamp=timestamp, meta=meta)

Oct 16 '23 02:10 LL-XSJ

Just like this

Oct 16 '23 03:10 LL-XSJ

Please check the resume_from argument in your training config. The error is raised since you try to resume the training from a checkpoint without a meta key. If you want to train a new model with the pre-trained weights, just set resume_from=None and use load_from to load the weights.
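
As a rough illustration of the difference (a minimal sketch, not the actual mmcv code): resuming reads checkpoint['meta']['epoch'], so a file that only holds state_dict cannot be resumed from. The helper name choose_loading_option and the two sample dictionaries below are hypothetical; with a real file you would first read it with torch.load(path, map_location='cpu').

```python
def choose_loading_option(checkpoint):
    """Suggest which config option suits a checkpoint (hypothetical helper).

    Resuming needs checkpoint['meta']['epoch']; plain weight loading only
    needs the state_dict.
    """
    meta = checkpoint.get('meta')
    if isinstance(meta, dict) and 'epoch' in meta:
        return 'resume_from'
    return 'load_from'


# Stand-ins for real checkpoint files (assumed contents, for illustration only)
released_weights = {'state_dict': {}}
training_checkpoint = {
    'meta': {'epoch': 12, 'iter': 1000},
    'state_dict': {},
    'optimizer': {},
}

print(choose_loading_option(released_weights))
print(choose_loading_option(training_checkpoint))
```

A weight file like the released one falls into the load_from case, which is why resume_from fails on it with KeyError: 'meta'.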

Oct 16 '23 03:10 TempleX98

In which file should I set load_from to load the weight file?

Oct 22 '23 02:10 LL-XSJ

Hello author, what is the reason for the error in distributed training?

Oct 23 '23 14:10 LL-XSJ

Just like this

Oct 23 '23 14:10 LL-XSJ

May I ask how to solve it?

Oct 23 '23 14:10 LL-XSJ

Please show me your training config

Oct 23 '23 14:10 TempleX98

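_base_ = [
    '../_base_/datasets/coco_detection.py',
    '../_base_/default_runtime.py'
]

# model settings
num_dec_layer = 6
lambda_2 = 2.0

# When modifying the config file to train on your own dataset,
# the number of classes has to be changed in 3 places in total.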

model = dict(
    type='CoDETR',
    backbone=dict(
        type='ResNet',
        depth=50,
        num_stages=4,
        out_indices=(1, 2, 3),
        frozen_stages=1,
        norm_cfg=dict(type='BN', requires_grad=False),
        norm_eval=True,
        style='pytorch',
        init_cfg=dict(type='Pretrained', checkpoint='torchvision://resnet50')),
    neck=dict(
        type='ChannelMapper',
        in_channels=[512, 1024, 2048],
        kernel_size=1,
        out_channels=256,
        act_cfg=None,
        norm_cfg=dict(type='GN', num_groups=32),
        num_outs=4),
    rpn_head=dict(
        type='RPNHead',
        in_channels=256,
        feat_channels=256,
        anchor_generator=dict(
            type='AnchorGenerator',
            octave_base_scale=4,
            scales_per_octave=3,
            ratios=[0.5, 1.0, 2.0],
            strides=[8, 16, 32, 64, 128]),
        bbox_coder=dict(
            type='DeltaXYWHBBoxCoder',
            target_means=[.0, .0, .0, .0],
            target_stds=[1.0, 1.0, 1.0, 1.0]),
        loss_cls=dict(
            type='CrossEntropyLoss',
            use_sigmoid=True,
            loss_weight=1.0*num_dec_layer*lambda_2),
        loss_bbox=dict(type='L1Loss', loss_weight=1.0*num_dec_layer*lambda_2)),
    query_head=dict(
        type='CoDeformDETRHead',
        num_query=300,
        # Set to the number of classes in your own dataset: 80 -> 1.
        # 80 is the number of classes in COCO; 1 is the number in my dataset.
        num_classes=1,
        in_channels=2048,
        sync_cls_avg_factor=True,
        with_box_refine=True,
        as_two_stage=True,
        mixed_selection=True,
        transformer=dict(
            type='CoDeformableDetrTransformer',
            num_co_heads=2,
            encoder=dict(
                type='DetrTransformerEncoder',
                num_layers=6,
                transformerlayers=dict(
                    type='BaseTransformerLayer',
                    attn_cfgs=dict(
                        type='MultiScaleDeformableAttention',
                        embed_dims=256,
                        dropout=0.0),
                    feedforward_channels=2048,
                    ffn_dropout=0.0,
                    operation_order=('self_attn', 'norm', 'ffn', 'norm'))),
            decoder=dict(
                type='CoDeformableDetrTransformerDecoder',
                num_layers=num_dec_layer,
                return_intermediate=True,
                look_forward_twice=True,
                transformerlayers=dict(
                    type='DetrTransformerDecoderLayer',
                    attn_cfgs=[
                        dict(
                            type='MultiheadAttention',
                            embed_dims=256,
                            num_heads=8,
                            dropout=0.0),
                        dict(
                            type='MultiScaleDeformableAttention',
                            embed_dims=256,
                            dropout=0.0)
                    ],
                    feedforward_channels=2048,
                    ffn_dropout=0.0,
                    operation_order=('self_attn', 'norm', 'cross_attn', 'norm',
                                     'ffn', 'norm')))),
        positional_encoding=dict(
            type='SinePositionalEncoding',
            num_feats=128,
            normalize=True,
            offset=-0.5),
        loss_cls=dict(
            type='FocalLoss',
            use_sigmoid=True,
            gamma=2.0,
            alpha=0.25,
            loss_weight=2.0),
        loss_bbox=dict(type='L1Loss', loss_weight=5.0),
        loss_iou=dict(type='GIoULoss', loss_weight=2.0)),
    roi_head=[dict(
        type='CoStandardRoIHead',
        bbox_roi_extractor=dict(
            type='SingleRoIExtractor',
            roi_layer=dict(type='RoIAlign', output_size=7, sampling_ratio=0),
            out_channels=256,
            featmap_strides=[8, 16, 32, 64],
            finest_scale=112),
        bbox_head=dict(
            type='Shared2FCBBoxHead',
            in_channels=256,
            fc_out_channels=1024,
            roi_feat_size=7,
            # Set to the number of classes in your own dataset: 80 -> 1.
            # 80 is the number of classes in COCO; 1 is the number in my dataset.
            num_classes=1,
            bbox_coder=dict(
                type='DeltaXYWHBBoxCoder',
                target_means=[0., 0., 0., 0.],
                target_stds=[0.1, 0.1, 0.2, 0.2]),
            reg_class_agnostic=False,
            reg_decoded_bbox=True,
            loss_cls=dict(
                type='CrossEntropyLoss',
                use_sigmoid=False,
                loss_weight=1.0*num_dec_layer*lambda_2),
            loss_bbox=dict(
                type='GIoULoss',
                loss_weight=10.0*num_dec_layer*lambda_2)))],
    bbox_head=[dict(
        type='CoATSSHead',
        # Set to the number of classes in your own dataset: 80 -> 1.
        # 80 is the number of classes in COCO; 1 is the number in my dataset.
        num_classes=1,
        in_channels=256,
        stacked_convs=1,
        feat_channels=256,
        anchor_generator=dict(
            type='AnchorGenerator',
            ratios=[1.0],
            octave_base_scale=8,
            scales_per_octave=1,
            strides=[8, 16, 32, 64, 128]),
        bbox_coder=dict(
            type='DeltaXYWHBBoxCoder',
            target_means=[.0, .0, .0, .0],
            target_stds=[0.1, 0.1, 0.2, 0.2]),
        loss_cls=dict(
            type='FocalLoss',
            use_sigmoid=True,
            gamma=2.0,
            alpha=0.25,
            loss_weight=1.0*num_dec_layer*lambda_2),
        loss_bbox=dict(type='GIoULoss', loss_weight=2.0*num_dec_layer*lambda_2),
        loss_centerness=dict(
            type='CrossEntropyLoss',
            use_sigmoid=True,
            loss_weight=1.0*num_dec_layer*lambda_2)),],
    # model training and testing settings
    train_cfg=[
        dict(
            assigner=dict(
                type='HungarianAssigner',
                cls_cost=dict(type='FocalLossCost', weight=2.0),
                reg_cost=dict(type='BBoxL1Cost', weight=5.0, box_format='xywh'),
                iou_cost=dict(type='IoUCost', iou_mode='giou', weight=2.0))),
        dict(
            rpn=dict(
                assigner=dict(
                    type='MaxIoUAssigner',
                    pos_iou_thr=0.7,
                    neg_iou_thr=0.3,
                    min_pos_iou=0.3,
                    match_low_quality=True,
                    ignore_iof_thr=-1),
                sampler=dict(
                    type='RandomSampler',
                    num=256,
                    pos_fraction=0.5,
                    neg_pos_ub=-1,
                    add_gt_as_proposals=False),
                allowed_border=-1,
                pos_weight=-1,
                debug=False),
            rpn_proposal=dict(
                nms_pre=4000,
                max_per_img=1000,
                nms=dict(type='nms', iou_threshold=0.7),
                min_bbox_size=0),
            rcnn=dict(
                assigner=dict(
                    type='MaxIoUAssigner',
                    pos_iou_thr=0.5,
                    neg_iou_thr=0.5,
                    min_pos_iou=0.5,
                    match_low_quality=False,
                    ignore_iof_thr=-1),
                sampler=dict(
                    type='RandomSampler',
                    num=512,
                    pos_fraction=0.25,
                    neg_pos_ub=-1,
                    add_gt_as_proposals=True),
                pos_weight=-1,
                debug=False)),
        dict(
            assigner=dict(type='ATSSAssigner', topk=9),
            allowed_border=-1,
            pos_weight=-1,
            debug=False),],
    test_cfg=[
        dict(max_per_img=100),
        dict(
            rpn=dict(
                nms_pre=1000,
                max_per_img=1000,
                nms=dict(type='nms', iou_threshold=0.7),
                min_bbox_size=0),
            rcnn=dict(
                score_thr=0.0,
                nms=dict(type='nms', iou_threshold=0.5),
                max_per_img=100)),
        dict(
            nms_pre=1000,
            min_bbox_size=0,
            score_thr=0.0,
            nms=dict(type='nms', iou_threshold=0.6),
            max_per_img=100),
        # soft-nms is also supported for rcnn testing
        # e.g., nms=dict(type='soft_nms', iou_threshold=0.5, min_score=0.05)
    ])

img_norm_cfg = dict(
    mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)

# train_pipeline, NOTE the img_scale and the Pad's size_divisor are different
# from the default setting in mmdet.
train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='LoadAnnotations', with_bbox=True),
    dict(type='RandomFlip', flip_ratio=0.5),
    dict(
        type='AutoAugment',
        policies=[
            [
                dict(
                    type='Resize',
                    img_scale=[(480, 1333), (512, 1333), (544, 1333),
                               (576, 1333), (608, 1333), (640, 1333),
                               (672, 1333), (704, 1333), (736, 1333),
                               (768, 1333), (800, 1333)],
                    multiscale_mode='value',
                    keep_ratio=True)
            ],
            [
                dict(
                    type='Resize',
                    # The ratio of all images in the train dataset < 7
                    # follow the original impl
                    img_scale=[(400, 4200), (500, 4200), (600, 4200)],
                    multiscale_mode='value',
                    keep_ratio=True),
                dict(
                    type='RandomCrop',
                    crop_type='absolute_range',
                    crop_size=(384, 600),
                    allow_negative_crop=True),
                dict(
                    type='Resize',
                    img_scale=[(480, 1333), (512, 1333), (544, 1333),
                               (576, 1333), (608, 1333), (640, 1333),
                               (672, 1333), (704, 1333), (736, 1333),
                               (768, 1333), (800, 1333)],
                    multiscale_mode='value',
                    override=True,
                    keep_ratio=True)
            ]
        ]),
    dict(type='Normalize', **img_norm_cfg),
    dict(type='Pad', size_divisor=1),
    dict(type='DefaultFormatBundle'),
    dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels'])
]

# test_pipeline, NOTE the Pad's size_divisor is different from the default
# setting (size_divisor=32). While there is little effect on the performance
# whether we use the default setting or use size_divisor=1.
test_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(
        type='MultiScaleFlipAug',
        img_scale=(1333, 800),
        flip=False,
        transforms=[
            dict(type='Resize', keep_ratio=True),
            dict(type='RandomFlip'),
            dict(type='Normalize', **img_norm_cfg),
            dict(type='Pad', size_divisor=1),
            dict(type='ImageToTensor', keys=['img']),
            dict(type='Collect', keys=['img'])
        ])
]

data = dict(
    samples_per_gpu=2,
    workers_per_gpu=2,
    train=dict(filter_empty_gt=False, pipeline=train_pipeline),
    val=dict(pipeline=test_pipeline),
    test=dict(pipeline=test_pipeline))

# optimizer
optimizer = dict(
    type='AdamW',
    lr=2e-4,
    weight_decay=1e-4,
    paramwise_cfg=dict(
        custom_keys={
            'backbone': dict(lr_mult=0.1),
            'sampling_offsets': dict(lr_mult=0.1),
            'reference_points': dict(lr_mult=0.1)
        }))
optimizer_config = dict(grad_clip=dict(max_norm=0.1, norm_type=2))

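# learning policy
# For the number of training epochs, 1x is too few (1x = 12 epochs).
# Changed from 12 to 36, the same as the DINO-DETR setting.
# The corresponding learning-rate decay epoch is changed from 11 to 33.
lr_config = dict(policy='step', step=[33])
runner = dict(type='EpochBasedRunner', max_epochs=36)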
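
To make the schedule above concrete, here is a small sketch of the step policy. It assumes a decay factor of 0.1 (gamma is not set in this config, so this is an assumption about the default) and ignores any warmup. step_lr is a hypothetical helper, not an mmcv function, and epochs are counted from 0.

```python
import bisect

BASE_LR = 2e-4      # optimizer lr from the config
STEPS = [33]        # lr_config step
MAX_EPOCHS = 36     # runner max_epochs
GAMMA = 0.1         # assumed decay factor (not set in the config)


def step_lr(epoch, base_lr=BASE_LR, steps=STEPS, gamma=GAMMA):
    """Learning rate for a 0-based epoch under a step schedule."""
    # Count how many decay milestones have been reached at this epoch
    passed = bisect.bisect_right(steps, epoch)
    return base_lr * gamma ** passed


for epoch in (0, 32, 33, MAX_EPOCHS - 1):
    print(f'epoch {epoch}: lr = {step_lr(epoch):.1e}')
```

Under these assumptions the rate stays at 2e-4 for epochs 0 to 32 and drops to about 2e-5 for the last three epochs.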

Oct 24 '23 02:10 LL-XSJ

Hello author, is this the training config? I'm worried that I may have posted the wrong thing.

Oct 24 '23 02:10 LL-XSJ

Have you modified the training script? Bad substitution means there are some errors in dist_train.sh and you can check it.

Oct 24 '23 03:10 TempleX98

In which file should I set load_from to load the weight file?

The image you posted shows that you are loading the weights from weights/co_deformable_detr-r50_1.pth. Please use load_from to load it.
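
For example, a sketch of the two lines in your training config, assuming the path shown in the image is the file you want to start from:

```python
# In the training config; these override the values inherited from
# default_runtime.py
load_from = 'weights/co_deformable_detr-r50_1.pth'  # load model weights only
resume_from = None  # do not restore epoch/optimizer state
```

If your tools/train.py accepts a --resume-from option as in mmdet 2.x, make sure you are not also passing this weight file there on the command line.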

Oct 24 '23 03:10 TempleX98

Yes, I have modified the number of classes and the number of training epochs. Will this affect multi-node training?

Oct 24 '23 04:10 LL-XSJ

Error when training on my own dataset
