[Bug] Poor training results when trying to configure for camera-only BEVFusion
Prerequisite
- [X] I have searched Issues and Discussions but cannot get the expected help.
- [X] I have read the FAQ documentation but cannot get the expected help.
- [X] The bug has not been fixed in the latest version (dev-1.x) or latest version (dev-1.0).
Task
I have modified the scripts/configs, or I'm working on my own tasks/models/datasets.
Branch
main branch https://github.com/open-mmlab/mmdetection3d
Environment
sys.platform: linux Python: 3.10.14 (main, Jul 8 2024, 14:50:49) [GCC 12.3.0] CUDA available: True numpy_random_seed: 2147483648 GPU 0,1,2,3: NVIDIA GeForce GTX 1080 Ti CUDA_HOME: /usr/local/cuda-12.1 NVCC: Cuda compilation tools, release 12.1, V12.1.66 GCC: gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 PyTorch: 2.1.2+cu121 PyTorch compiling details: PyTorch built with:
- GCC 9.3
- C++ Version: 201703
- Intel(R) oneAPI Math Kernel Library Version 2022.2-Product Build 20220804 for Intel(R) 64 architecture applications
- Intel(R) MKL-DNN v3.1.1 (Git Hash 64f6bcbcbab628e96f33a62c3e975f8535a7bde4)
- OpenMP 201511 (a.k.a. OpenMP 4.5)
- LAPACK is enabled (usually provided by MKL)
- NNPACK is enabled
- CPU capability usage: AVX512
- CUDA Runtime 12.1
- NVCC architecture flags: -gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_90,code=sm_90
- CuDNN 8.9.2
- Magma 2.6.1
- Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=12.1, CUDNN_VERSION=8.9.2, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-invalid-partial-specialization -Wno-unused-private-field -Wno-aligned-allocation-unavailable -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_DISABLE_GPU_ASSERTS=ON, TORCH_VERSION=2.1.2, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=1, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF,
TorchVision: 0.16.2+cu121 OpenCV: 4.9.0 MMEngine: 0.10.2 MMDetection: 3.3.0 MMDetection3D: 1.4.0+161d091 spconv2.0: False
Reproduces the problem - code sample
'''
_base_ points to the base configuration file. Config files follow a system of inheritance: just as when you inherit from a class,
this config contains all the settings defined in default_runtime.py.
The same ideas that apply to class inheritance apply here. For example, if you want to change something from default_runtime,
you can copy it into this config and modify it, just as you would override a method you want to change in a subclass.
custom_imports imports the modules within the BEVFusion project which are needed to run the code.
'''
_base_ = ['../../../configs/_base_/default_runtime.py']
custom_imports = dict(
imports=['projects.BEVFusion.bevfusion'], allow_failed_imports=False)
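# A small illustration of the inheritance described above (not part of the
# reported config; the keys below are only examples): any setting re-declared
# in this file replaces the value inherited from default_runtime.py, and nested
# dicts are merged key by key. For instance:
#   log_level = 'DEBUG'  # default_runtime.py defines log_level; re-declaring it here wins
#   default_hooks = dict(checkpoint=dict(interval=5))  # override only the checkpoint
#   # interval and keep the other inherited hooks (this config does something
#   # similar with default_hooks further down)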
'''
The point cloud range specifies the geometric space the point clouds can occupy.
voxel_size indicates the size in meters of each dimension of the cells that make up the BEV grid
(the map where predictions from BEVFusion are made).
'''
point_cloud_range = [-51.2, -51.2, -5.0, 51.2, 51.2, 3.0] # TODO: step through for more info
# point_cloud_range = [-54.0, -54.0, -5.0, 54.0, 54.0, 3.0]
# voxel_size = [0.075, 0.075, 0.2] # this voxel size made it actually have a mAP of 0!
voxel_size = [0.1, 0.1, 0.2]
# image_size = [256, 704]
# post_center_range = [-64.0, -64.0, -10.0, 64.0, 64.0, 10.0]
post_center_range = [-61.2, -61.2, -10.0, 61.2, 61.2, 10.0] # this matches what I see for det in MIT # TODO: step through for more info
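# Worked example, derived from the values above: the BEV grid spans
# 51.2 - (-51.2) = 102.4 m per axis, so with a 0.1 m voxel it is
# 102.4 / 0.1 = 1024 cells per side, matching grid_size=[1024, 1024, 1] in the
# head's train_cfg below; with out_size_factor=8 the detection head then
# operates on a 1024 / 8 = 128 x 128 BEV feature map.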
'''
Class names used for all object detection tasks. Using nuScenes, we train and evaluate on 6 different detection tasks, where the combination of
object classes for each task varies. For example, task 0 may contain car, truck, and bus, while task 1 may contain car, motorcycle, bicycle, and barrier.
'''
class_names = [
'car', 'truck', 'construction_vehicle', 'bus', 'trailer', 'barrier',
'motorcycle', 'bicycle', 'pedestrian', 'traffic_cone'
]
'''
metainfo is used to pass the class names from the config in the format the code expects.
dataset_type and data_root specify 1. the dataset class being used (for other datasets such as KITTI a dataset class is defined similarly)
and 2. the relative path to the nuScenes dataset.
data_prefix tells the NuScenesDataset object which sensors are being used. This can include camera and lidar sensors.
In this case, we include only the 6 cameras available in the nuScenes dataset.
'''
metainfo = dict(classes=class_names) #, version='v1.0-mini')
dataset_type = 'NuScenesDataset'
data_root = 'data/nuscenes/'
data_prefix = dict(
CAM_FRONT='samples/CAM_FRONT',
CAM_FRONT_LEFT='samples/CAM_FRONT_LEFT',
CAM_FRONT_RIGHT='samples/CAM_FRONT_RIGHT',
CAM_BACK='samples/CAM_BACK',
CAM_BACK_RIGHT='samples/CAM_BACK_RIGHT',
CAM_BACK_LEFT='samples/CAM_BACK_LEFT'
)
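# For reference, data_root plus the data_prefix entries above assume the
# standard mmdetection3d nuScenes layout (a sketch; file names are examples):
#   data/nuscenes/
#       samples/CAM_FRONT/*.jpg
#       samples/CAM_FRONT_LEFT/*.jpg
#       ... (one folder per camera)
#       nuscenes_infos_train.pkl
#       nuscenes_infos_val.pkl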
'''
input_modality specifies which sensors are being used, which affects...
'''
input_modality = dict(use_lidar=False, use_camera=True) # TODO: determine the effect of lidar=False
backend_args = None # TODO: find out what this is
'''
MODEL DEFINITION
- MMLab's way of defining deep learning models.
- type: specifies the model/project being used.
- data_preprocessor: Det3DDataPreprocessor is a general mmdetection3d preprocessing class that works for lidar, vision-only, and more.
- img_backbone: the model which performs the initial transformation from image data into feature maps.
* mmdet.SwinTransformer
- img_neck: the model component which takes the outputs of the backbone and further refines the features.
'''
model = dict(
type='BEVFusion',
data_preprocessor=dict(
type='Det3DDataPreprocessor',
pad_size_divisor=32,
# voxelize_cfg=dict(
# max_num_points=10,
# point_cloud_range=point_cloud_range,
# voxel_size=voxel_size,
# max_voxels=[120000, 160000],
# voxelize_reduce=True),
mean=[123.675, 116.28, 103.53],
std=[58.395, 57.12, 57.375],
bgr_to_rgb=False),
img_backbone=dict(
type='mmdet.SwinTransformer',
embed_dims=96,
depths=[2, 2, 6, 2],
num_heads=[3, 6, 12, 24],
window_size=7,
mlp_ratio=4,
qkv_bias=True,
qk_scale=None,
drop_rate=0.0,
attn_drop_rate=0.0,
drop_path_rate=0.2,
patch_norm=True,
out_indices=[1, 2, 3],
with_cp=False,
convert_weights=True,
init_cfg=dict(
type='Pretrained',
checkpoint= # noqa: E251
'https://github.com/SwinTransformer/storage/releases/download/v1.0.0/swin_tiny_patch4_window7_224.pth' # noqa: E501
)),
img_neck=dict(
type='GeneralizedLSSFPN',
in_channels=[192, 384, 768],
out_channels=256,
start_level=0,
num_outs=3,
norm_cfg=dict(type='BN2d', requires_grad=True),
act_cfg=dict(type='ReLU', inplace=True),
upsample_cfg=dict(mode='bilinear', align_corners=False)),
view_transform=dict(
type='LSSTransform',
in_channels=256,
out_channels=80,
image_size=[256, 704],
feature_size=[32, 88],
# xbound=[-54.0, 54.0, 0.3],
xbound=[-51.2, 51.2, 0.4],
ybound=[-51.2, 51.2, 0.4],
# ybound=[-54.0, 54.0, 0.3],
zbound=[-10.0, 10.0, 20.0],
dbound=[1.0, 60.0, 0.5],
downsample=2),
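# Derived quantities for the view transform above: xbound/ybound span 102.4 m
# at 0.4 m per cell, i.e. a 102.4 / 0.4 = 256 x 256 camera BEV grid, which
# downsample=2 reduces to 128 x 128 (matching the 1024 / 8 grid of the head);
# feature_size [32, 88] is image_size [256, 704] divided by the stride-8
# backbone/neck output; dbound discretizes depth from 1.0 m to 60.0 m in
# 0.5 m steps, i.e. (60.0 - 1.0) / 0.5 = 118 depth bins.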
pts_backbone=dict(
type='GeneralizedResNet',
in_channels=80,
blocks=[[2, 128, 2],
[2, 256, 2],
[2, 512, 1]]),
pts_neck=dict(
type='LSSFPN',
in_indices=[-1,0],
in_channels=[512, 128],
out_channels=256,
scale_factor=2),
bbox_head=dict(
type='CenterHead', # changed back from CustomCenterHead to CenterHead
in_channels=256,
tasks=[
dict(num_class=1, class_names=['car']),
dict(num_class=2, class_names=['truck', 'construction_vehicle']),
dict(num_class=2, class_names=['bus', 'trailer']),
dict(num_class=1, class_names=['barrier']),
dict(num_class=2, class_names=['motorcycle', 'bicycle']),
dict(num_class=2, class_names=['pedestrian', 'traffic_cone']),
],
common_heads=dict(
reg=(2, 2), height=(1, 2), dim=(3, 2), rot=(2, 2), vel=(2, 2)),
share_conv_channel=64,
bbox_coder=dict(
type='CenterPointBBoxCoder', # modified from CustomCenterPointBBoxCoder
post_center_range=post_center_range,
pc_range=point_cloud_range,
max_num=500,
score_threshold=0.1,
out_size_factor=8,
voxel_size=voxel_size[:2],
code_size=9),
separate_head=dict(
type='SeparateHead', init_bias=-2.19, final_kernel=3),
loss_cls=dict(type='mmdet.GaussianFocalLoss', reduction='mean'),
loss_bbox=dict(
type='mmdet.L1Loss', reduction='mean', loss_weight=0.25),
norm_bbox=True,
train_cfg=dict(
dataset='nuScenes',
point_cloud_range=point_cloud_range,
grid_size=[1024, 1024, 1],
# grid_size=[1440, 1440, 41],
voxel_size=voxel_size,
out_size_factor=8,
dense_reg=1,
gaussian_overlap=0.1,
max_objs=500,
min_radius=2,
code_weights=[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.2, 0.2]
),
test_cfg=dict(
dataset='nuScenes',
post_center_limit_range=post_center_range,
max_per_img=500,
max_pool_nms=False,
min_radius=[4, 12, 10, 1, 0.85, 0.175],
score_threshold=0.1,
pc_range=point_cloud_range[:2], # the MIT config uses [0:2], which is the same thing
out_size_factor=8,
voxel_size=voxel_size[:2],
nms_type='circle', # changed back from a per-task list ['circle', 'circle', 'circle', 'circle', 'circle', 'circle'] to just 'circle'
pre_max_size=1000,
post_max_size=83,
nms_thr=0.2)
)
)
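# Data flow of the camera-only model above: images -> mmdet.SwinTransformer
# (img_backbone) -> GeneralizedLSSFPN (img_neck) -> LSSTransform
# (view_transform, lifts image features into the BEV plane) ->
# GeneralizedResNet (pts_backbone, here consuming the 80-channel camera BEV
# features) -> LSSFPN (pts_neck) -> CenterHead (bbox_head).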
train_pipeline = [
dict(
type='BEVLoadMultiViewImageFromFiles',
to_float32=False, # was float32; what if we change it?
color_type='color',
backend_args=backend_args),
dict(
type='LoadAnnotations3D',
with_bbox_3d=True,
with_label_3d=True,
with_attr_label=False),
# dict(type='ObjectSample', db_sampler=db_sampler),
dict(
type='ImageAug3D',
final_dim=[256, 704],
resize_lim=[0.38, 0.55],
bot_pct_lim=[0.0, 0.0],
rot_lim=[-5.4, 5.4],
rand_flip=True,
is_train=True),
dict(type='BEVFusionRandomFlip3D'), # was temporarily commented out
dict(type='ObjectRangeFilter', point_cloud_range=point_cloud_range),
dict(
type='ObjectNameFilter',
classes=[
'car', 'truck', 'construction_vehicle', 'bus', 'trailer',
'barrier', 'motorcycle', 'bicycle', 'pedestrian', 'traffic_cone'
]),
dict(
type='GridMask',
use_h=True,
use_w=True,
rotate=1,
offset=False,
ratio=0.5,
mode=1,
prob=0, # with prob=0 the GridMask augmentation is effectively disabled
max_epoch=20,
),
# dict(type='PointShuffle'),
dict(
type='Pack3DDetInputs',
keys=[
'points', 'img', 'gt_bboxes_3d', 'gt_labels_3d', 'gt_bboxes',
'gt_labels'
],
meta_keys=[
'cam2img', 'ori_cam2img', 'lidar2cam', 'lidar2img', 'cam2lidar',
'ori_lidar2img', 'img_aug_matrix', 'box_type_3d', 'sample_idx',
'lidar_path', 'img_path', 'transformation_3d_flow',
#'pcd_rotation','pcd_scale_factor', 'pcd_trans',
'img_aug_matrix',
#'lidar_aug_matrix', 'num_pts_feats'
])
]
test_pipeline = [
dict(
type='BEVLoadMultiViewImageFromFiles', # no BEV prefix in MIT
to_float32=True,
color_type='color',
backend_args=backend_args), # what are the backend args being used??
dict( # MIT has another type included, LoadAnnotations3D
type='ImageAug3D',
final_dim=[256, 704],
resize_lim=[0.48, 0.48],
bot_pct_lim=[0.0, 0.0],
rot_lim=[0.0, 0.0],
rand_flip=False,
is_train=False),
# dict(
# type='PointsRangeFilter',
# point_cloud_range=point_cloud_range),
dict(
type='Pack3DDetInputs',
keys=['img', 'points', 'gt_bboxes_3d', 'gt_labels_3d'],
meta_keys=[
'cam2img', 'ori_cam2img', 'lidar2cam', 'lidar2img', 'cam2lidar',
'ori_lidar2img', 'img_aug_matrix', 'box_type_3d', 'sample_idx',
'lidar_path', 'img_path', 'num_pts_feats', 'num_views'
])
]
train_dataloader = dict(
batch_size=1, # changed from 2 to 1
num_workers=1, # changed from 4 to 1
persistent_workers=True,
sampler=dict(type='DefaultSampler', shuffle=True), #shuffle
dataset=dict(
type='CBGSDataset',
dataset=dict(
type=dataset_type,
data_root=data_root,
ann_file='nuscenes_infos_train.pkl',
pipeline=train_pipeline,
metainfo=metainfo,
modality=input_modality,
test_mode=False,
data_prefix=data_prefix,
use_valid_flag=True,
# we use box_type_3d='LiDAR' in kitti and nuscenes dataset
# and box_type_3d='Depth' in sunrgbd and scannet dataset.
box_type_3d='LiDAR'))
)
val_dataloader = dict(
batch_size=1,
num_workers=1,
persistent_workers=True,
drop_last=False,
sampler=dict(type='DefaultSampler', shuffle=False),
dataset=dict(
type=dataset_type,
data_root=data_root,
ann_file='nuscenes_infos_val.pkl',
pipeline=test_pipeline,
metainfo=metainfo,
modality=input_modality,
data_prefix=data_prefix,
test_mode=True, # test_mode was True; perhaps that does not make sense for the val_dataloader?
box_type_3d='LiDAR',
backend_args=backend_args))
test_dataloader = val_dataloader
val_evaluator = dict(
type='NuScenesMetric',
data_root=data_root,
ann_file=data_root + 'nuscenes_infos_val.pkl',
metric='bbox',
backend_args=backend_args)
test_evaluator = val_evaluator
vis_backends = [dict(type='LocalVisBackend')]
visualizer = dict(
type='Det3DLocalVisualizer', vis_backends=vis_backends, name='visualizer')
# learning rate
# lr = 0.0001
lr = 2e-5 # changed from 2e-4
param_scheduler = [
# learning rate scheduler
# During the first 8 epochs, the learning rate increases from lr to lr * 6;
# during the next 12 epochs, the learning rate decreases from lr * 6 to
# lr * 1e-2
dict(
type='CosineAnnealingLR',
T_max=8,
eta_min=lr * 6, # changed from 10
begin=0,
end=8,
by_epoch=True,
convert_to_iter_based=True),
dict(
type='CosineAnnealingLR',
T_max=12,
eta_min=lr * 1e-2, # changed from -4
begin=8,
end=20,
by_epoch=True,
convert_to_iter_based=True),
# momentum scheduler
# During the first 8 epochs, momentum increases from 0 to 0.85 / 0.95
# during the next 12 epochs, momentum increases from 0.85 / 0.95 to 1
dict(
type='CosineAnnealingMomentum',
T_max=8,
eta_min=0.85 / 0.95,
begin=0,
end=8,
by_epoch=True,
convert_to_iter_based=True),
dict(
type='CosineAnnealingMomentum',
T_max=12,
eta_min=1,
begin=8,
end=20,
by_epoch=True,
convert_to_iter_based=True)
]
# runtime settings
train_cfg = dict(by_epoch=True, max_epochs=20, val_interval=1) # Do Kyoung had changed this to 10
val_cfg = dict()
test_cfg = dict()
'''
load_from and resume:
load_from: specifies a path to a pretrained or partially trained model whose current weights you would like to continue training from.
Specifying None for load_from opts to train from scratch.
Here is an example of how you might use load_from to start training from a pretrained model:
load_from = "/home/a0271391/code/edgeai-mmdetection3d/projects/BEVFusion/models/camera-only-det_converted_copy.pth"
resume: be aware that resume=True means you want to resume training from the specific epoch and step at which a particular run stopped. If you don't care about resuming
training from where it previously stopped, you don't need to set resume=True. Only set it to True if the model you are loading with load_from was trained
to a specific point (e.g. epoch 7, step 19200/30000) and you want to continue from there.
'''
load_from = None
resume = False # resume from the checkpoint defined in load_from
optim_wrapper = dict(
type='OptimWrapper',
optimizer=dict(type='AdamW', lr=lr, weight_decay=0.01),
clip_grad=dict(max_norm=35, norm_type=2))
# Default setting for scaling LR automatically
# - `enable` means enable scaling LR automatically
#   or not by default.
# - `base_batch_size` is the total batch size the configured lr corresponds to
#   (the upstream config uses (8 GPUs) x (4 samples per GPU) = 32; here it is 1).
auto_scale_lr = dict(enable=False, base_batch_size=1)
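# If enable were True, mmengine would scale lr linearly by
# (actual total batch size) / base_batch_size, e.g. 4 GPUs x batch_size 1 with
# base_batch_size=1 gives a factor of 4. With enable=False the lr above is
# used unchanged.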
log_processor = dict(window_size=50)
'''
HOOKS -
Objects which operate on actively running code, such as logging information at the end of an epoch.
Hooks are defined in mmdet3d/engine. The purpose of hooks is often to add new features to a predefined Python module.
EX: You want to add additional data to your dataloader every 3 epochs when training a model. You could modify the training source code, or you could write a
hook which adds that functionality on top of your base code. Then all you have to do is register that hook when defining the parameters of your run, or leave it out if you want the
base functionality. (A hedged sketch of such a hook is shown below, after the hook settings.)
Here, hooks are used for logging information such as the time taken to train an epoch.
The DisableObjectSampleHook simply stops augmenting the training data after a specified epoch (epoch 15).
'''
default_hooks = dict(
logger=dict(type='LoggerHook', interval=50),
checkpoint=dict(type='CheckpointHook', interval=1))
custom_hooks = [dict(type='DisableObjectSampleHook', disable_after_epoch=15)]
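# A hedged sketch of what a custom hook could look like (illustrative only;
# the class name is made up, and the class would normally live in its own
# module, e.g. somewhere under projects/BEVFusion/bevfusion/, not in this
# config):
#
#   from mmengine.hooks import Hook
#   from mmdet3d.registry import HOOKS
#
#   @HOOKS.register_module()
#   class PrintEpochHook(Hook):
#       """Logs a short message after every training epoch."""
#
#       def after_train_epoch(self, runner):
#           runner.logger.info(f'Finished epoch {runner.epoch}')
#
# It would then be enabled in the config with:
#   custom_hooks = [dict(type='PrintEpochHook')]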
Reproduces the problem - command or script
bash tools/dist_train.sh projects/BEVFusion/configs/bevfusion_cam_swint_centerpoint_nus-3d.py 4
Reproduces the problem - error message
No error message; the issue is that even after 20 epochs, the result is extremely poor mAP and NDS. The loss gets down to about 6.x.
Additional information
- I expected training results to be similar to MIT's camera-only results.
- I used the nuScenes dataset.
- I suspect there is an issue with my setup in the configuration file. I have included the configuration I have been using for camera-only BEVFusion.
Same problem.
hi @ymlab @abubake,
I have a question regarding the training:
I am curious how much time the training takes per epoch and how many GPUs you use? I am particularly interested in lidar-only training, if you have any experience with that.
Hi, training with 4 GPUs took several hours per epoch, both for camera-only and when I tried lidar-only. I don't remember the exact time per epoch, but it was about 4-5 days for 20 epochs, which is roughly 4.5 to 6 hours per epoch.
Thanks a lot for sharing your experience @abubake, it helps! Were you able to reproduce good results (comparable to the paper) with lidar-only training?
I plan to work with this repository for my thesis and don't want to waste time if the code is not working as expected, so any feedback is valuable for me :)
@gorkemguzeler the repo is working as expected for me. Haven't trained lidar-only but I got 65 mAP after 3 epochs of training the bevfusion model with the lidar-only base. Oh and it took 2h per epoch on 8x 3090 with bs 2 and lr scaling enabled.
Btw we are in the same boat. I am also doing my thesis on multimodal learning :)
@mdessl Hi, I'm also working on multimodal 3D detection. I'm curious: by bs 2, do you mean a batch of 2 per GPU or 2 for the whole 8 GPUs? A 3080 only seems to have 12G of memory. I have trained the BEVFusion of this repo on 2x A5000 with a batch size of 4 (with lr scaling) and cannot match the reported 71.4 NDS. After using gradient accumulation to simulate a batch size of 32, the performance is much better, approximately 70.9 NDS.
For the multimodal setting, my concern is that the camera branch of this repo is too dependent on LiDAR, as it uses DepthLSS instead of the original LSS transform.
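For reference, gradient accumulation is usually configured through the optimizer wrapper in mmengine; a minimal sketch, assuming the setup mentioned above (2 GPUs, batch size 4 per GPU, so accumulative_counts=4 gives an effective batch of 2 * 4 * 4 = 32; the value is only an example):

optim_wrapper = dict(
    type='OptimWrapper',
    optimizer=dict(type='AdamW', lr=lr, weight_decay=0.01),
    clip_grad=dict(max_norm=35, norm_type=2),
    # accumulate gradients over 4 iterations before each optimizer step,
    # so the effective batch size is 2 GPUs * 4 samples * 4 = 32
    accumulative_counts=4)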
@curiosity654 ohh sry it was a typo. I meant 3090 (24G RAM), so bs 2 per GPU.
Do you think the issue could have to do with the batchnorm layers? I think BN is not so compatible with gradient accumulation and I am not sure what you could do about it.
@mdessl , thanks a lot for the feedback 👍
oh, good luck on your thesis :)
Interesting discussion here. I got the following problem with the camera-only model: KeyError: 'GeneralizedResNet is not in the mmdet3d::model registry. Please check whether the value of GeneralizedResNet is correct or it was registered as expected. More details can be found at https://mmengine.readthedocs.io/en/latest/advanced_tutorials/config.html#import-the-custom-module'. This is because GeneralizedResNet is not implemented in this official repo. Did you adapt it yourself?
@abubake @CesarLiu Do you have a camera-only BEVFusion configuration that meets the paper's metrics and can be provided? Thank you very much.
Hey, I was looking at this code and noticed the following:
eta_min=0.85 / 0.95. It seems like it's trying to calculate eta_min, but I thought eta_min should be a fixed value. So, should I just pick 0.85 or 0.95 for eta_min? Or am I missing something here?
Could you share your code repository? I've recently encountered a similar issue, and I'd like to use your repository to debug and see if it's the same problem I'm facing.