Mask R-CNN: change the opt_base_learning_rate rule from 0.02 * K to 0.01 * K
In the training rules, opt_base_learning_rate for Mask R-CNN is currently defined as 0.02 * K:
| Benchmark | Optimizer | Parameter | Constraint | Definition |
|---|---|---|---|---|
| maskrcnn | sgd | opt_base_learning_rate | 0.02 * K for any integer K | base learning rate; this should be the learning rate after warm-up and before decay |
This works well for systems with 4, 8, or 16 GPUs. However, it does not converge well on systems with other numbers of GPUs, e.g. 10 GPUs.
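As an illustration, here is a minimal sketch (assuming the per-image learning rate of 0.00125 implied by the RCPs quoted below, and a local batch size of 12 per GPU, as in our runs) of which GPU counts produce a base LR that satisfies the current 0.02 * K constraint:

```python
# Sketch: check which GPU counts yield a base LR allowed by the current rule
# (opt_base_learning_rate = 0.02 * K for integer K).
# Assumption (not stated in the rule itself): the LR scales linearly with the
# global batch size at 0.00125 per image, matching the RCPs quoted below
# (0.12 / 96 = 0.16 / 128 = 0.00125), and the local batch size is 12 per GPU.
LR_PER_IMAGE = 0.00125
LOCAL_BATCH = 12

def allowed(lr, step):
    """True if lr is an integer multiple of `step` (within float tolerance)."""
    return abs(lr - round(lr / step) * step) < 1e-9

for num_gpus in (4, 8, 10, 16):
    global_bs = num_gpus * LOCAL_BATCH
    lr = LR_PER_IMAGE * global_bs
    print(f"{num_gpus:>2} GPUs, BS={global_bs:>3}: lr={lr:.2f}, "
          f"0.02*K allowed={allowed(lr, 0.02)}")
# 10 GPUs -> BS=120 -> lr=0.15, which is not an integer multiple of 0.02.
```

Only the 10-GPU configuration fails the check, even though its per-image learning rate is identical to the 4-, 8-, and 16-GPU cases.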
Another example is the RCP file itself, https://github.com/mlcommons/logging/blob/master/mlperf_logging/rcp_checker/training_1.1.0/rcps_maskrcnn.json:

```json
"maskrcnn_ref_96": {
  "Benchmark": "maskrcnn",
  "Creator": "NVIDIA",
  "When": "Prior to 1.0 submission",
  "Platform": "TBD",
  "BS": 96,
  "Hyperparams": {
    "opt_learning_decay_steps": [12000, 16000],
    "opt_base_learning_rate": 0.12,
    "num_image_candidates": 6000,
    "opt_learning_rate_warmup_factor": 0.000192,
    "opt_learning_rate_warmup_steps": 625
  },
  "Epochs to converge": [14, 15, 14, 14, 14, 14, 14, 14, 14, 13, 14, 14, 15, 14, 14, 14, 14, 14, 14, 14]
},
"maskrcnn_ref_128": {
  "Benchmark": "maskrcnn",
  "Creator": "NVIDIA",
  "When": "Prior to 1.0 submission",
  "Platform": "TBD",
  "BS": 128,
  "Hyperparams": {
    "opt_learning_decay_steps": [9000, 12000],
    "opt_base_learning_rate": 0.16,
    "num_image_candidates": 6000,
    "opt_learning_rate_warmup_factor": 0.000256,
    "opt_learning_rate_warmup_steps": 625
  },
  "Epochs to converge": [14, 14, 14, 14, 14, 14, 14, 14, 14, 14]
},
```
The LR needs to be scaled with the global batch size, which is not friendly to 10-GPU systems under the 0.02 * K constraint.
We ran with BS=120 and LR=0.15 on a 10-GPU system, and it converged in the same number of epochs (14) as BS=96 with LR=0.12 did; both use the same local batch size of 12. In addition, the BS=128 case defined in the same RCP file also converges in 14 epochs.
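As a quick arithmetic check (a sketch only; the 0.01 * K granularity is the proposed rule, not the current one), all three configurations share the same per-image LR of 0.00125, but only the proposed rule admits the BS=120 point:

```python
def allowed(lr, step):
    """True if lr is an integer multiple of `step` (within float tolerance)."""
    return abs(lr - round(lr / step) * step) < 1e-9

# global batch size -> base LR; 96 and 128 come from the RCPs above,
# 120 is our 10-GPU run.
configs = {96: 0.12, 120: 0.15, 128: 0.16}
for bs, lr in configs.items():
    print(f"BS={bs:>3}: lr/image={lr / bs:.5f}, "
          f"0.02*K: {allowed(lr, 0.02)}, 0.01*K: {allowed(lr, 0.01)}")
# BS= 96: lr/image=0.00125, 0.02*K: True,  0.01*K: True
# BS=120: lr/image=0.00125, 0.02*K: False, 0.01*K: True
# BS=128: lr/image=0.00125, 0.02*K: True,  0.01*K: True
```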
So, we propose adjusting the rule on opt_base_learning_rate for Mask R-CNN from 0.02 * K to 0.01 * K. This is fairer to systems with different numbers of GPUs.
Link back to https://github.com/mlcommons/submission_training_1.1/issues/24