PETR
Cannot reproduce the training result of PETRv2
Hi, thanks for your great work. I want to train PETRv2 on a single GPU with the default config you provided, with nothing else altered, but `grad_norm` becomes nan after several iterations during the first epoch, as the log file below shows. Any suggestions? Thanks in advance! 20220812_090303.log
Hi, we train PETR with 8 GPUs, so the total batch size is 8.

(1) If you want to train PETR on a single GPU, set `samples_per_gpu=8` and keep `lr=2e-4`.

(2) If you run out of memory with batch size 8, set `samples_per_gpu=4` and lower the learning rate to `lr=1e-4`.

(3) When out of memory, you can also try gradient accumulation: set `samples_per_gpu=4`, keep `lr=2e-4`, and use

```python
optimizer_config = dict(
    type='GradientCumulativeFp16OptimizerHook',
    cumulative_iters=2,
    loss_scale=512.,
    grad_clip=dict(max_norm=35, norm_type=2))
```

(make sure `samples_per_gpu x cumulative_iters = 8`).
(2) and (3) may sacrifice some performance.
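The three options above all follow the linear learning-rate scaling rule. A minimal sketch, assuming (as stated above) that the default `lr=2e-4` was tuned for a total batch size of 8; the function name `scaled_lr` is mine, not from the repo:

```python
# Linear LR scaling: keep the learning rate proportional to the
# effective batch size (num_gpus x samples_per_gpu x accumulation steps).
# Assumption from the reply above: lr=2e-4 corresponds to total batch size 8.
BASE_LR = 2e-4
BASE_BATCH_SIZE = 8

def scaled_lr(num_gpus, samples_per_gpu, cumulative_iters=1):
    """Return a learning rate scaled linearly with the effective batch size."""
    effective_batch = num_gpus * samples_per_gpu * cumulative_iters
    return BASE_LR * effective_batch / BASE_BATCH_SIZE

# Option (1): 1 GPU x 8 samples               -> effective batch 8, lr stays 2e-4.
# Option (2): 1 GPU x 4 samples               -> effective batch 4, lr drops to 1e-4.
# Option (3): 1 GPU x 4 samples x 2 accum.    -> effective batch 8, lr stays 2e-4.
```

This also matches the single-GPU result reported later in this thread (`samples_per_gpu=2` with `lr=5e-5`, i.e. 2/8 of the default).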
Hi,
I also have a question about reproducing your results. I trained PETR_vovnet_gridmask_p4_800x320 on 4 A100 GPUs, and my results are worse than the official ones across different batch sizes and learning rates. Could you please give me some suggestions?
Here are the results.
PETR_vovnet_gridmask_p4_800x320 (Official) batch_size: 8*1=8, lr=0.0002
mAP: 0.3778 mATE: 0.7463 mASE: 0.2718 mAOE: 0.4883 mAVE: 0.9062 mAAE: 0.2123 NDS: 0.4264 Eval time: 242.1s
Per-class results:

| Object Class | AP | ATE | ASE | AOE | AVE | AAE |
|---|---|---|---|---|---|---|
| car | 0.556 | 0.555 | 0.153 | 0.091 | 0.917 | 0.216 |
| truck | 0.330 | 0.805 | 0.218 | 0.119 | 0.859 | 0.250 |
| bus | 0.412 | 0.789 | 0.205 | 0.162 | 2.067 | 0.337 |
| trailer | 0.221 | 0.976 | 0.233 | 0.663 | 0.797 | 0.146 |
| construction_vehicle | 0.094 | 1.096 | 0.493 | 1.145 | 0.190 | 0.349 |
| pedestrian | 0.453 | 0.688 | 0.289 | 0.636 | 0.549 | 0.235 |
| motorcycle | 0.368 | 0.690 | 0.256 | 0.622 | 1.417 | 0.149 |
| bicycle | 0.341 | 0.609 | 0.270 | 0.812 | 0.455 | 0.017 |
| traffic_cone | 0.531 | 0.582 | 0.320 | nan | nan | nan |
| barrier | 0.472 | 0.673 | 0.281 | 0.145 | nan | nan |
batch_size: 4*4=16, lr=0.0004 (train time: 8 hours)
mAP: 0.3681 mATE: 0.7727 mASE: 0.2714 mAOE: 0.5808 mAVE: 0.9009 mAAE: 0.2196 NDS: 0.4095 Eval time: 118.4s
Per-class results:

| Object Class | AP | ATE | ASE | AOE | AVE | AAE |
|---|---|---|---|---|---|---|
| car | 0.554 | 0.559 | 0.152 | 0.103 | 0.923 | 0.215 |
| truck | 0.325 | 0.809 | 0.221 | 0.156 | 0.909 | 0.257 |
| bus | 0.402 | 0.825 | 0.205 | 0.150 | 1.932 | 0.362 |
| trailer | 0.199 | 1.050 | 0.238 | 0.823 | 0.846 | 0.148 |
| construction_vehicle | 0.090 | 1.069 | 0.479 | 1.206 | 0.163 | 0.295 |
| pedestrian | 0.441 | 0.710 | 0.293 | 0.927 | 0.724 | 0.283 |
| motorcycle | 0.351 | 0.739 | 0.262 | 0.736 | 1.294 | 0.184 |
| bicycle | 0.330 | 0.698 | 0.263 | 0.964 | 0.415 | 0.013 |
| traffic_cone | 0.531 | 0.580 | 0.317 | nan | nan | nan |
| barrier | 0.460 | 0.689 | 0.283 | 0.163 | nan | nan |
batch_size: 4*2=8, lr=0.0002 (train time: 11 hours)
mAP: 0.3761 mATE: 0.7736 mASE: 0.2695 mAOE: 0.5725 mAVE: 0.8573 mAAE: 0.2186 NDS: 0.4189 Eval time: 111.9s
Per-class results:

| Object Class | AP | ATE | ASE | AOE | AVE | AAE |
|---|---|---|---|---|---|---|
| car | 0.558 | 0.547 | 0.151 | 0.106 | 0.880 | 0.210 |
| truck | 0.337 | 0.787 | 0.214 | 0.136 | 0.833 | 0.246 |
| bus | 0.410 | 0.842 | 0.207 | 0.136 | 1.915 | 0.348 |
| trailer | 0.209 | 1.169 | 0.233 | 0.688 | 0.580 | 0.146 |
| construction_vehicle | 0.089 | 1.102 | 0.469 | 1.189 | 0.165 | 0.348 |
| pedestrian | 0.448 | 0.695 | 0.291 | 0.902 | 0.710 | 0.265 |
| motorcycle | 0.372 | 0.714 | 0.255 | 0.743 | 1.225 | 0.166 |
| bicycle | 0.331 | 0.634 | 0.261 | 1.078 | 0.551 | 0.021 |
| traffic_cone | 0.544 | 0.558 | 0.320 | nan | nan | nan |
| barrier | 0.463 | 0.687 | 0.293 | 0.175 | nan | nan |
Thanks
Hi, when you train the model on A100, do you still load the pretrained checkpoint? In our practice, the "batch_size: 4*4=16, lr=0.0004" setting is always better than the default; on PETR-r50-c5 it can improve performance by about 2%.
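As a concrete sketch, the "4*4=16, lr=0.0004" setting above would look roughly like this as an mmdet-style config fragment (the `workers_per_gpu` value and the optimizer type are my assumptions; check them against the repo's actual config before use):

```python
# Hypothetical config fragment: 4 GPUs x samples_per_gpu=4 = total batch 16,
# with the learning rate doubled from the default 2e-4 accordingly.
data = dict(samples_per_gpu=4, workers_per_gpu=4)  # per-GPU settings
optimizer = dict(type='AdamW', lr=4e-4)            # 2e-4 x (16 / 8)
```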
I noticed that your result is mainly worse on mAOE. You could also try training PETRv2; the multi-frame data can improve robustness. Someone has tried this setting on 4x3090 GPUs and achieved a good result.
@wongsinglam Thank you very much, you helped me a lot! Following your advice, I trained the model from scratch on a single A4000 GPU with 16 GB of memory, setting `samples_per_gpu=2` and `lr=5e-5`. After 20 epochs I got 0.381 mAP and 0.4751 NDS. Thanks again.