PETR
Cannot reproduce PETR
When running [petr_r50dcn_gridmask_p4.py](https://github.com/megvii-research/PETR/blob/main/projects/configs/petr/petr_r50dcn_gridmask_p4.py), the accuracy I got was:
mAP: 0.3022
mATE: 0.8507
mASE: 0.2785
mAOE: 0.6519
mAVE: 1.0027
mAAE: 0.2668
NDS: 0.3463
Eval time: 302.2s
This is much lower than the reported one.
Also, when I set with_position=False, the accuracy is extremely low: 0.0887 mAP and 0.2230 NDS.
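As a sanity check on the numbers above, the reported NDS can be recomputed from the other metrics. This is a minimal sketch assuming the standard nuScenes detection score definition (five times mAP plus the five true-positive scores, each 1 - min(1, error), divided by 10); the function name `nds` is only illustrative.

```python
def nds(m_ap, tp_errors):
    """nuScenes detection score from mAP and the five mean TP errors."""
    # Each TP error is clipped to 1 and turned into a score in [0, 1].
    tp_scores = [1.0 - min(1.0, err) for err in tp_errors]
    # mAP carries weight 5; the five TP scores carry weight 1 each.
    return (5.0 * m_ap + sum(tp_scores)) / 10.0


# Values from the evaluation log above: mATE, mASE, mAOE, mAVE, mAAE.
reported = nds(0.3022, [0.8507, 0.2785, 0.6519, 1.0027, 0.2668])
print(round(reported, 4))
```

Note that mAVE is above 1 here, so the velocity term contributes nothing to the score.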
Hi, did you modify the batch size or other parameters? Can you share your config?
Hi, I only changed the dataset path. Here is my config:
point_cloud_range = [-51.2, -51.2, -5.0, 51.2, 51.2, 3.0]
class_names = ['car', 'truck', 'construction_vehicle', 'bus', 'trailer', 'barrier', 'motorcycle', 'bicycle', 'pedestrian', 'traffic_cone']
dataset_type = 'CustomNuScenesDataset'
data_root = './data/nuscenes/'
input_modality = dict(use_lidar=False, use_camera=True, use_radar=False, use_map=False, use_external=False)
file_client_args = dict(backend='disk')
train_pipeline = [
    dict(type='LoadMultiViewImageFromFiles', to_float32=True),
    dict(type='LoadAnnotations3D', with_bbox_3d=True, with_label_3d=True, with_attr_label=False),
    dict(type='ObjectRangeFilter', point_cloud_range=[-51.2, -51.2, -5.0, 51.2, 51.2, 3.0]),
    dict(type='ObjectNameFilter', classes=['car', 'truck', 'construction_vehicle', 'bus', 'trailer', 'barrier', 'motorcycle', 'bicycle', 'pedestrian', 'traffic_cone']),
    dict(type='ResizeCropFlipImage', data_aug_conf=dict(resize_lim=(0.8, 1.0), final_dim=(512, 1408), bot_pct_lim=(0.0, 0.0), rot_lim=(0.0, 0.0), H=900, W=1600, rand_flip=True), training=True),
    dict(type='GlobalRotScaleTransImage', rot_range=[-0.3925, 0.3925], translation_std=[0, 0, 0], scale_ratio_range=[0.95, 1.05], reverse_angle=True, training=True),
    dict(type='NormalizeMultiviewImage', mean=[103.53, 116.28, 123.675], std=[1.0, 1.0, 1.0], to_rgb=False),
    dict(type='PadMultiViewImage', size_divisor=32),
    dict(type='DefaultFormatBundle3D', class_names=['car', 'truck', 'construction_vehicle', 'bus', 'trailer', 'barrier', 'motorcycle', 'bicycle', 'pedestrian', 'traffic_cone']),
    dict(type='Collect3D', keys=['gt_bboxes_3d', 'gt_labels_3d', 'img'])
]
test_pipeline = [
    dict(type='LoadMultiViewImageFromFiles', to_float32=True),
    dict(type='ResizeCropFlipImage', data_aug_conf=dict(resize_lim=(0.8, 1.0), final_dim=(512, 1408), bot_pct_lim=(0.0, 0.0), rot_lim=(0.0, 0.0), H=900, W=1600, rand_flip=True), training=False),
    dict(type='NormalizeMultiviewImage', mean=[103.53, 116.28, 123.675], std=[1.0, 1.0, 1.0], to_rgb=False),
    dict(type='PadMultiViewImage', size_divisor=32),
    dict(type='MultiScaleFlipAug3D', img_scale=(1333, 800), pts_scale_ratio=1, flip=False, transforms=[
        dict(type='DefaultFormatBundle3D', class_names=['car', 'truck', 'construction_vehicle', 'bus', 'trailer', 'barrier', 'motorcycle', 'bicycle', 'pedestrian', 'traffic_cone'], with_label=False),
        dict(type='Collect3D', keys=['img'])
    ])
]
eval_pipeline = [
    dict(type='LoadPointsFromFile', coord_type='LIDAR', load_dim=5, use_dim=5, file_client_args=dict(backend='disk')),
    dict(type='LoadPointsFromMultiSweeps', sweeps_num=10, file_client_args=dict(backend='disk')),
    dict(type='DefaultFormatBundle3D', class_names=['car', 'truck', 'trailer', 'bus', 'construction_vehicle', 'bicycle', 'motorcycle', 'pedestrian', 'traffic_cone', 'barrier'], with_label=False),
    dict(type='Collect3D', keys=['points'])
]
data = dict(
    samples_per_gpu=1,
    workers_per_gpu=4,
    train=dict(
        type='CustomNuScenesDataset',
        data_root='./data/nuscenes/',
        ann_file='./data/nuscenes/nuscenes_infos_train.pkl',
        pipeline=[
            dict(type='LoadMultiViewImageFromFiles', to_float32=True),
            dict(type='LoadAnnotations3D', with_bbox_3d=True, with_label_3d=True, with_attr_label=False),
            dict(type='ObjectRangeFilter', point_cloud_range=[-51.2, -51.2, -5.0, 51.2, 51.2, 3.0]),
            dict(type='ObjectNameFilter', classes=['car', 'truck', 'construction_vehicle', 'bus', 'trailer', 'barrier', 'motorcycle', 'bicycle', 'pedestrian', 'traffic_cone']),
            dict(type='ResizeCropFlipImage', data_aug_conf=dict(resize_lim=(0.8, 1.0), final_dim=(512, 1408), bot_pct_lim=(0.0, 0.0), rot_lim=(0.0, 0.0), H=900, W=1600, rand_flip=True), training=True),
            dict(type='GlobalRotScaleTransImage', rot_range=[-0.3925, 0.3925], translation_std=[0, 0, 0], scale_ratio_range=[0.95, 1.05], reverse_angle=True, training=True),
            dict(type='NormalizeMultiviewImage', mean=[103.53, 116.28, 123.675], std=[1.0, 1.0, 1.0], to_rgb=False),
            dict(type='PadMultiViewImage', size_divisor=32),
            dict(type='DefaultFormatBundle3D', class_names=['car', 'truck', 'construction_vehicle', 'bus', 'trailer', 'barrier', 'motorcycle', 'bicycle', 'pedestrian', 'traffic_cone']),
            dict(type='Collect3D', keys=['gt_bboxes_3d', 'gt_labels_3d', 'img'])
        ],
        classes=['car', 'truck', 'construction_vehicle', 'bus', 'trailer', 'barrier', 'motorcycle', 'bicycle', 'pedestrian', 'traffic_cone'],
        modality=dict(use_lidar=False, use_camera=True, use_radar=False, use_map=False, use_external=False),
        test_mode=False,
        box_type_3d='LiDAR',
        use_valid_flag=True),
    val=dict(
        type='CustomNuScenesDataset',
        data_root='data/nuscenes/',
        ann_file='data/nuscenes/nuscenes_infos_val.pkl',
        pipeline=[
            dict(type='LoadMultiViewImageFromFiles', to_float32=True),
            dict(type='ResizeCropFlipImage', data_aug_conf=dict(resize_lim=(0.8, 1.0), final_dim=(512, 1408), bot_pct_lim=(0.0, 0.0), rot_lim=(0.0, 0.0), H=900, W=1600, rand_flip=True), training=False),
            dict(type='NormalizeMultiviewImage', mean=[103.53, 116.28, 123.675], std=[1.0, 1.0, 1.0], to_rgb=False),
            dict(type='PadMultiViewImage', size_divisor=32),
            dict(type='MultiScaleFlipAug3D', img_scale=(1333, 800), pts_scale_ratio=1, flip=False, transforms=[
                dict(type='DefaultFormatBundle3D', class_names=['car', 'truck', 'construction_vehicle', 'bus', 'trailer', 'barrier', 'motorcycle', 'bicycle', 'pedestrian', 'traffic_cone'], with_label=False),
                dict(type='Collect3D', keys=['img'])
            ])
        ],
        classes=['car', 'truck', 'construction_vehicle', 'bus', 'trailer', 'barrier', 'motorcycle', 'bicycle', 'pedestrian', 'traffic_cone'],
        modality=dict(use_lidar=False, use_camera=True, use_radar=False, use_map=False, use_external=False),
        test_mode=True,
        box_type_3d='LiDAR'),
    test=dict(
        type='CustomNuScenesDataset',
        data_root='data/nuscenes/',
        ann_file='data/nuscenes/nuscenes_infos_val.pkl',
        pipeline=[
            dict(type='LoadMultiViewImageFromFiles', to_float32=True),
            dict(type='ResizeCropFlipImage', data_aug_conf=dict(resize_lim=(0.8, 1.0), final_dim=(512, 1408), bot_pct_lim=(0.0, 0.0), rot_lim=(0.0, 0.0), H=900, W=1600, rand_flip=True), training=False),
            dict(type='NormalizeMultiviewImage', mean=[103.53, 116.28, 123.675], std=[1.0, 1.0, 1.0], to_rgb=False),
            dict(type='PadMultiViewImage', size_divisor=32),
            dict(type='MultiScaleFlipAug3D', img_scale=(1333, 800), pts_scale_ratio=1, flip=False, transforms=[
                dict(type='DefaultFormatBundle3D', class_names=['car', 'truck', 'construction_vehicle', 'bus', 'trailer', 'barrier', 'motorcycle', 'bicycle', 'pedestrian', 'traffic_cone'], with_label=False),
                dict(type='Collect3D', keys=['img'])
            ])
        ],
        classes=['car', 'truck', 'construction_vehicle', 'bus', 'trailer', 'barrier', 'motorcycle', 'bicycle', 'pedestrian', 'traffic_cone'],
        modality=dict(use_lidar=False, use_camera=True, use_radar=False, use_map=False, use_external=False),
        test_mode=True,
        box_type_3d='LiDAR'))
evaluation = dict(
    interval=1,
    pipeline=[
        dict(type='LoadMultiViewImageFromFiles', to_float32=True),
        dict(type='ResizeCropFlipImage', data_aug_conf=dict(resize_lim=(0.8, 1.0), final_dim=(512, 1408), bot_pct_lim=(0.0, 0.0), rot_lim=(0.0, 0.0), H=900, W=1600, rand_flip=True), training=False),
        dict(type='NormalizeMultiviewImage', mean=[103.53, 116.28, 123.675], std=[1.0, 1.0, 1.0], to_rgb=False),
        dict(type='PadMultiViewImage', size_divisor=32),
        dict(type='MultiScaleFlipAug3D', img_scale=(1333, 800), pts_scale_ratio=1, flip=False, transforms=[
            dict(type='DefaultFormatBundle3D', class_names=['car', 'truck', 'construction_vehicle', 'bus', 'trailer', 'barrier', 'motorcycle', 'bicycle', 'pedestrian', 'traffic_cone'], with_label=False),
            dict(type='Collect3D', keys=['img'])
        ])
    ])
checkpoint_config = dict(interval=1)
log_config = dict(interval=50, hooks=[dict(type='TextLoggerHook'), dict(type='TensorboardLoggerHook')])
dist_params = dict(backend='nccl')
log_level = 'INFO'
work_dir = 'work_dirs/petr_r50dcn_gridmask_p4/'
load_from = None
resume_from = None
workflow = [('train', 1)]
opencv_num_threads = 0
mp_start_method = 'fork'
backbone_norm_cfg = dict(type='LN', requires_grad=True)
plugin = True
plugin_dir = 'projects/mmdet3d_plugin/'
voxel_size = [0.2, 0.2, 8]
img_norm_cfg = dict(mean=[103.53, 116.28, 123.675], std=[1.0, 1.0, 1.0], to_rgb=False)
model = dict(
    type='Petr3D',
    use_grid_mask=True,
    img_backbone=dict(
        type='ResNet',
        depth=50,
        num_stages=4,
        out_indices=(2, 3),
        frozen_stages=-1,
        norm_cfg=dict(type='BN2d', requires_grad=False),
        norm_eval=True,
        style='caffe',
        with_cp=True,
        dcn=dict(type='DCNv2', deform_groups=1, fallback_on_stride=False),
        stage_with_dcn=(False, False, True, True),
        pretrained='ckpts/resnet50_msra-5891d200.pth'),
    img_neck=dict(type='CPFPN', in_channels=[1024, 2048], out_channels=256, num_outs=2),
    pts_bbox_head=dict(
        type='PETRHead',
        num_classes=10,
        in_channels=256,
        num_query=900,
        LID=True,
        with_position=True,
        with_multiview=True,
        position_range=[-61.2, -61.2, -10.0, 61.2, 61.2, 10.0],
        normedlinear=False,
        transformer=dict(
            type='PETRTransformer',
            decoder=dict(
                type='PETRTransformerDecoder',
                return_intermediate=True,
                num_layers=6,
                transformerlayers=dict(
                    type='PETRTransformerDecoderLayer',
                    attn_cfgs=[
                        dict(type='MultiheadAttention', embed_dims=256, num_heads=8, dropout=0.1),
                        dict(type='PETRMultiheadAttention', embed_dims=256, num_heads=8, dropout=0.1)
                    ],
                    feedforward_channels=2048,
                    ffn_dropout=0.1,
                    with_cp=True,
                    operation_order=('self_attn', 'norm', 'cross_attn', 'norm', 'ffn', 'norm')))),
        bbox_coder=dict(
            type='NMSFreeCoder',
            post_center_range=[-61.2, -61.2, -10.0, 61.2, 61.2, 10.0],
            pc_range=[-51.2, -51.2, -5.0, 51.2, 51.2, 3.0],
            max_num=300,
            voxel_size=[0.2, 0.2, 8],
            num_classes=10),
        positional_encoding=dict(type='SinePositionalEncoding3D', num_feats=128, normalize=True),
        loss_cls=dict(type='FocalLoss', use_sigmoid=True, gamma=2.0, alpha=0.25, loss_weight=2.0),
        loss_bbox=dict(type='L1Loss', loss_weight=0.25),
        loss_iou=dict(type='GIoULoss', loss_weight=0.0)),
    train_cfg=dict(
        pts=dict(
            grid_size=[512, 512, 1],
            voxel_size=[0.2, 0.2, 8],
            point_cloud_range=[-51.2, -51.2, -5.0, 51.2, 51.2, 3.0],
            out_size_factor=4,
            assigner=dict(
                type='HungarianAssigner3D',
                cls_cost=dict(type='FocalLossCost', weight=2.0),
                reg_cost=dict(type='BBox3DL1Cost', weight=0.25),
                iou_cost=dict(type='IoUCost', weight=0.0),
                pc_range=[-51.2, -51.2, -5.0, 51.2, 51.2, 3.0]))))
db_sampler = dict()
ida_aug_conf = dict(resize_lim=(0.8, 1.0), final_dim=(512, 1408), bot_pct_lim=(0.0, 0.0), rot_lim=(0.0, 0.0), H=900, W=1600, rand_flip=True)
optimizer = dict(type='AdamW', lr=0.0002, paramwise_cfg=dict(custom_keys=dict(img_backbone=dict(lr_mult=0.1))), weight_decay=0.01)
optimizer_config = dict(type='Fp16OptimizerHook', loss_scale=512.0, grad_clip=dict(max_norm=35, norm_type=2))
lr_config = dict(policy='CosineAnnealing', warmup='linear', warmup_iters=500, warmup_ratio=0.3333333333333333, min_lr_ratio=0.001)
total_epochs = 24
find_unused_parameters = False
runner = dict(type='EpochBasedRunner', max_epochs=24)
gpu_ids = range(0, 8)
Hi,
The config looks fine. Can you tell me the number of GPUs and the versions of Python and mmdet3d? Python 3.8 may reduce performance somewhat.
I train on 8 2080 Ti GPUs. I have trained the model with two different Python versions, 3.7.6 and 3.6.5, both with mmdet3d 1.0.0. Also, when I set with_position=False, the accuracy is extremely low: 0.0887 mAP and 0.2230 NDS. As I understand it, setting with_position=False is just an ablation of the 3D PE module. Can you explain that?
Hi,
(1) When using mmdet1.0, have you noticed this comment: https://github.com/megvii-research/PETR/issues/71#issuecomment-1318191277 ? reverse_angle must be False in GlobalRotScaleTransImage.
(2) Yes, setting with_position=False corresponds to a result in the ablation study.
When with_position=False, the intrinsics and extrinsics are not used by the model. In fact, PETR can work without intrinsics and extrinsics, benefiting from global attention. The low performance is mainly due to ResizeCropFlipImage and GlobalRotScaleTransImage. These data augmentations greatly change the intrinsics and extrinsics during training, so the network cannot overfit to the camera parameters of the dataset. Once these augmentations are removed, ResNet-50 should reach more than 27% mAP. But we don't think it is meaningful to overfit the dataset.
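For point (1), the change amounts to flipping one flag in the GlobalRotScaleTransImage step of the training pipeline. The sketch below shows this on a shortened copy of the pipeline from the config above; `set_reverse_angle` is a hypothetical helper written for illustration, not part of PETR or mmdet3d.

```python
import copy


def set_reverse_angle(pipeline, value=False):
    """Return a copy of a pipeline with reverse_angle overridden.

    Hypothetical helper: only steps of type GlobalRotScaleTransImage
    are modified; the input list is left untouched.
    """
    patched = copy.deepcopy(pipeline)
    for step in patched:
        if step.get('type') == 'GlobalRotScaleTransImage':
            step['reverse_angle'] = value
    return patched


# Shortened training pipeline, values copied from the posted config.
train_pipeline = [
    dict(type='LoadMultiViewImageFromFiles', to_float32=True),
    dict(
        type='GlobalRotScaleTransImage',
        rot_range=[-0.3925, 0.3925],
        translation_std=[0, 0, 0],
        scale_ratio_range=[0.95, 1.05],
        reverse_angle=True,
        training=True),
]

patched = set_reverse_angle(train_pipeline, value=False)
print(patched[1]['reverse_angle'])
```

In the dumped config the same step also appears under data.train.pipeline, so both copies need the new value; editing reverse_angle directly in the source config file is probably simpler than patching a dump.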
I noticed that StreamPETR still sets reverse_angle=True, but it uses mmdet3d=1.0.0rc6. Have I missed something?
The rotation matrix is different.
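One way to read this reply, offered only as an assumption since the thread does not show either matrix: if two codebases build the yaw rotation with opposite sign conventions, then negating the angle in one of them reproduces the other. The sketch below checks the underlying identity, that a z-axis rotation by -θ equals the transpose (inverse) of the rotation by θ. The helper names are illustrative and are not taken from PETR or StreamPETR.

```python
import math


def rot_z(angle):
    """Counter-clockwise rotation about the z axis by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def transpose(m):
    return [list(row) for row in zip(*m)]


def close(a, b, tol=1e-12):
    """Element-wise comparison of two 3x3 matrices."""
    return all(abs(x - y) <= tol
               for ra, rb in zip(a, b)
               for x, y in zip(ra, rb))


angle = 0.3925  # upper bound of rot_range in the posted config
forward = rot_z(angle)
reverse = rot_z(-angle)

# Negating the angle is the same as switching to the transposed matrix.
same_convention = close(reverse, transpose(forward))
print(same_convention)
```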
Thanks, got it. 👍
https://github.com/megvii-research/PETR/issues/86#issue-1499621424