LargeKernel3D

About detection training gpu num

Open fjzpcmj opened this issue 2 years ago • 7 comments

Dear Author, I trained a detection model with the config "nusc_centerpoint_voxelnet_0075voxel_fix_bn_z_largekernel3d_large.py" on 8 GPUs, but the model's performance is lower than reported. Do you know what's wrong with my trained model?

Performance detail:

- mAP: 0.5944
- mATE: 0.2902
- mASE: 0.2516
- mAOE: 0.3343
- mAVE: 0.2870
- mAAE: 0.1911
- NDS: 0.6618
- Eval time: 104.5 s

Per-class results:

| Object Class | AP | ATE | ASE | AOE | AVE | AAE |
|---|---|---|---|---|---|---|
| car | 0.851 | 0.181 | 0.153 | 0.110 | 0.267 | 0.193 |
| truck | 0.563 | 0.331 | 0.180 | 0.116 | 0.255 | 0.227 |
| bus | 0.707 | 0.333 | 0.177 | 0.073 | 0.487 | 0.268 |
| trailer | 0.393 | 0.500 | 0.201 | 0.564 | 0.221 | 0.183 |
| construction_vehicle | 0.199 | 0.713 | 0.434 | 1.025 | 0.123 | 0.312 |
| pedestrian | 0.845 | 0.147 | 0.272 | 0.389 | 0.219 | 0.099 |
| motorcycle | 0.583 | 0.202 | 0.238 | 0.240 | 0.499 | 0.231 |
| bicycle | 0.415 | 0.162 | 0.268 | 0.414 | 0.225 | 0.016 |
| traffic_cone | 0.690 | 0.136 | 0.322 | nan | nan | nan |
| barrier | 0.698 | 0.198 | 0.270 | 0.077 | nan | nan |

Per-distance breakdown (Nusc v1.0-trainval, Nusc dist AP at 0.5 / 1.0 / 2.0 / 4.0 m thresholds):

| Class | AP@0.5 | AP@1.0 | AP@2.0 | AP@4.0 | mean AP |
|---|---|---|---|---|---|
| car | 75.43 | 85.77 | 88.99 | 90.23 | 0.8511 |
| truck | 37.71 | 54.44 | 64.86 | 68.22 | 0.5631 |
| construction_vehicle | 3.41 | 12.18 | 27.07 | 37.13 | 0.1995 |
| bus | 46.04 | 69.87 | 82.21 | 84.80 | 0.7073 |
| trailer | 11.06 | 35.42 | 50.44 | 60.39 | 0.3933 |
| barrier | 59.75 | 70.12 | 73.86 | 75.32 | 0.6976 |
| motorcycle | 52.38 | 59.28 | 60.53 | 61.11 | 0.5832 |
| bicycle | 40.47 | 41.59 | 41.82 | 41.98 | 0.4146 |
| pedestrian | 82.15 | 83.86 | 85.29 | 86.59 | 0.8447 |
| traffic_cone | 66.22 | 67.69 | 69.40 | 72.53 | 0.6896 |
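As a sanity check, the reported NDS is internally consistent with the per-metric numbers: the nuScenes devkit defines NDS as a weighted sum of mAP and the complements of the five TP error metrics. A quick verification with the values copied from the log above:

```python
# NDS = (5 * mAP + sum(1 - min(1, mTP_err))) / 10, per the nuScenes devkit.
# Values below are taken directly from the evaluation log in this issue.
mAP = 0.5944
tp_errors = [0.2902, 0.2516, 0.3343, 0.2870, 0.1911]  # mATE, mASE, mAOE, mAVE, mAAE
nds = (5 * mAP + sum(1 - min(1.0, e) for e in tp_errors)) / 10
print(round(nds, 4))  # 0.6618, matching the reported NDS
```

So the evaluation itself ran correctly; the gap to the reported numbers comes from training, not from a broken metric computation.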

fjzpcmj avatar May 16 '23 11:05 fjzpcmj

Hi @fjzpcmj ,

Thanks for your interest in our work, and sorry for the late reply. I have some deadlines this week. I will check nusc_centerpoint_voxelnet_0075voxel_fix_bn_z_largekernel3d_large.py.

I used 4 GPUs for training. Would you please try nusc_centerpoint_voxelnet_0075voxel_fix_bn_z_largekernel3d_tiny.py? Its performance is more stable.

Regards, Yukang Chen

yukang2017 avatar May 19 '23 15:05 yukang2017

Thanks for your reply, I will try nusc_centerpoint_voxelnet_0075voxel_fix_bn_z_largekernel3d_tiny.py. Could you tell me which performs better, "large" vs. "tiny"?

fjzpcmj avatar May 29 '23 06:05 fjzpcmj

Thanks for your message. Generally, "large" performs a bit better than "tiny" (by less than 0.5 mAP), but "tiny" is more stable and faster.

yukang2017 avatar May 29 '23 15:05 yukang2017

Thanks very much.

fjzpcmj avatar May 30 '23 00:05 fjzpcmj

Dear @yukang2017, I trained the "tiny" model with 4 GPUs, and the mAP is still 59, below the reported 63.3. Do you know what the problem is? In addition, I downloaded the pretrained model (63.3 mAP) and tested it. It seems the downloaded model's structure differs from the structure in the "tiny" config.


```
The model and loaded state dict do not match exactly

unexpected key in source state_dict: backbone.conv1.0.conv1.conv3x3_1.weight, backbone.conv1.0.conv1.conv3x3_1.bias, backbone.conv1.0.conv2.conv3x3_1.weight, backbone.conv1.0.conv2.conv3x3_1.bias, backbone.conv1.1.conv1.conv3x3_1.weight, backbone.conv1.1.conv1.conv3x3_1.bias, backbone.conv1.1.conv2.conv3x3_1.weight, backbone.conv1.1.conv2.conv3x3_1.bias, backbone.conv2.3.conv1.weight, backbone.conv2.3.conv1.bias, backbone.conv2.3.conv2.weight, backbone.conv2.3.conv2.bias, backbone.conv2.4.conv1.weight, backbone.conv2.4.conv1.bias, backbone.conv2.4.conv2.weight, backbone.conv2.4.conv2.bias, backbone.conv3.3.conv1.weight, backbone.conv3.3.conv1.bias, backbone.conv3.3.conv2.weight, backbone.conv3.3.conv2.bias, backbone.conv3.4.conv1.weight, backbone.conv3.4.conv1.bias, backbone.conv3.4.conv2.weight, backbone.conv3.4.conv2.bias

missing keys in source state_dict: backbone.conv2.4.conv2.block.weight, backbone.conv1.1.conv2.block.position_embedding, backbone.conv3.4.conv2.block.bias, backbone.conv2.4.conv1.block.weight, backbone.conv3.4.conv1.conv3x3_1.weight, backbone.conv2.4.conv2.conv3x3_1.weight, backbone.conv3.3.conv1.conv3x3_1.weight, backbone.conv2.3.conv1.conv3x3_1.bias, backbone.conv1.1.conv1.block.position_embedding, backbone.conv3.3.conv1.block.weight, backbone.conv3.3.conv2.conv3x3_1.weight, backbone.conv3.4.conv2.conv3x3_1.bias, backbone.conv2.3.conv1.conv3x3_1.weight, backbone.conv3.4.conv1.block.bias, backbone.conv3.4.conv1.block.weight, backbone.conv2.3.conv2.conv3x3_1.bias, backbone.conv1.0.conv1.block.position_embedding, backbone.conv3.3.conv1.conv3x3_1.bias, backbone.conv2.4.conv2.block.bias, backbone.conv3.3.conv2.block.bias, backbone.conv3.4.conv1.conv3x3_1.bias, backbone.conv2.4.conv1.conv3x3_1.bias, backbone.conv3.3.conv2.conv3x3_1.bias, backbone.conv2.3.conv2.conv3x3_1.weight, backbone.conv2.3.conv2.block.weight, backbone.conv2.4.conv1.block.bias, backbone.conv1.0.conv2.block.position_embedding, backbone.conv3.4.conv2.block.weight, backbone.conv2.3.conv1.block.bias, backbone.conv2.3.conv2.block.bias, backbone.conv3.3.conv1.block.bias, backbone.conv2.4.conv1.conv3x3_1.weight, backbone.conv3.3.conv2.block.weight, backbone.conv2.4.conv2.conv3x3_1.bias, backbone.conv2.3.conv1.block.weight, backbone.conv3.4.conv2.conv3x3_1.weight

these keys have mismatched shape:
+-------------------------------------+---------------------------------+---------------------------------+
| key                                 | expected shape                  | loaded shape                    |
+-------------------------------------+---------------------------------+---------------------------------+
| backbone.conv1.0.conv1.block.weight | torch.Size([3, 3, 3, 16, 16])   | torch.Size([7, 7, 7, 16, 16])   |
| backbone.conv1.0.conv2.block.weight | torch.Size([3, 3, 3, 16, 16])   | torch.Size([7, 7, 7, 16, 16])   |
| backbone.conv1.1.conv1.block.weight | torch.Size([3, 3, 3, 16, 16])   | torch.Size([7, 7, 7, 16, 16])   |
| backbone.conv1.1.conv2.block.weight | torch.Size([3, 3, 3, 16, 16])   | torch.Size([7, 7, 7, 16, 16])   |
| backbone.conv4.3.conv1.weight       | torch.Size([128, 3, 3, 3, 128]) | torch.Size([5, 5, 5, 128, 128]) |
| backbone.conv4.3.conv2.weight       | torch.Size([128, 3, 3, 3, 128]) | torch.Size([5, 5, 5, 128, 128]) |
| backbone.conv4.4.conv1.weight       | torch.Size([128, 3, 3, 3, 128]) | torch.Size([5, 5, 5, 128, 128]) |
| backbone.conv4.4.conv2.weight       | torch.Size([128, 3, 3, 3, 128]) | torch.Size([5, 5, 5, 128, 128]) |
+-------------------------------------+---------------------------------+---------------------------------+
```
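For readers hitting a similar mismatch: a quick way to diagnose whether the config and checkpoint diverge is to diff the two state dicts yourself. This is a minimal sketch using plain shape tuples; with a real checkpoint you would obtain the dicts via `torch.load(path, map_location="cpu")` and `model.state_dict()` and compare `tensor.shape` instead.

```python
def diff_state_dicts(model_state, ckpt_state):
    """Return (unexpected, missing, mismatched) between two key->shape dicts."""
    unexpected = sorted(k for k in ckpt_state if k not in model_state)
    missing = sorted(k for k in model_state if k not in ckpt_state)
    mismatched = sorted(
        (k, model_state[k], ckpt_state[k])
        for k in ckpt_state
        if k in model_state and ckpt_state[k] != model_state[k]
    )
    return unexpected, missing, mismatched

# Toy example mirroring the log above: the config builds a 3x3x3 kernel where
# the checkpoint stores 7x7x7, and the checkpoint carries an extra conv3x3_1 key.
model_state = {"backbone.conv1.0.conv1.block.weight": (3, 3, 3, 16, 16)}
ckpt_state = {
    "backbone.conv1.0.conv1.block.weight": (7, 7, 7, 16, 16),
    "backbone.conv1.0.conv1.conv3x3_1.weight": (3, 3, 3, 16, 16),
}
unexpected, missing, mismatched = diff_state_dicts(model_state, ckpt_state)
print(unexpected)   # keys only in the checkpoint
print(mismatched)   # (key, expected shape, loaded shape)
```

If the diff is non-empty, the config and checkpoint were built from different architecture variants, as was the case here.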

fjzpcmj avatar Jun 09 '23 02:06 fjzpcmj

Hi,

For training, to reproduce the reported numbers, please disable the GT-sampling augmentation in the last 5 epochs. This is a detailed trick, listed in the implementation details.
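The fade trick above can be sketched as a simple epoch gate. This is only an illustrative sketch: the total epoch count and the exact attribute holding the sampler (`db_sampler` here) vary across CenterPoint/det3d forks, so adapt both to your own config.

```python
TOTAL_EPOCHS = 20   # assumed schedule length; use your config's value
FADE_EPOCHS = 5     # GT-sampling is disabled for these final epochs

def use_gt_sampling(epoch):
    """True while GT-sampling augmentation should stay enabled (0-indexed epoch)."""
    return epoch < TOTAL_EPOCHS - FADE_EPOCHS

# Before each epoch, gate the augmentation in the data pipeline, e.g.:
#   dataset.db_sampler = sampler if use_gt_sampling(epoch) else None
print([use_gt_sampling(e) for e in (0, 14, 15, 19)])  # [True, True, False, False]
```

Setting the sampler to `None` (or removing the GT-paste transform from the train pipeline) for the fade window is the usual way this trick is applied.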

For testing, sorry for this misalignment. I double-checked the config file; there were some typos. I have fixed it to be aligned with the checkpoint, please try it again.

yukang2017 avatar Jun 16 '23 05:06 yukang2017

Thanks very much. I have reproduced the result.

fjzpcmj avatar Jun 25 '23 15:06 fjzpcmj