
[BUG]: Relationship between `BATCH_SIZE`, `LEARNING_RATE` and the number of GPUs

BoxiangW opened this issue 3 years ago • 2 comments

🐛 Describe the bug

When testing DETR from ColossalAI-Examples, I found that training the model with plain DDP in the following situations:

  1. LEARNING_RATE=1e-4, world_size=4
  2. LEARNING_RATE=2e-4, world_size=8
  3. LEARNING_RATE=1e-4, world_size=8

yields significantly different accuracies. (Because full training takes very long, I only ran one epoch for each situation.) The effective global batch sizes are worked out in the quick sketch below.
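With BATCH_SIZE=2 per GPU and pure data parallelism (global batch size = per-GPU batch size × world_size), the three situations map to these global batch sizes (a quick sketch, not from the original report):

```python
BATCH_SIZE = 2  # per-GPU batch size, matching the configs below

# (lr, world_size) for the three situations above
for lr, world_size in [(1e-4, 4), (2e-4, 8), (1e-4, 8)]:
    global_bs = BATCH_SIZE * world_size
    print(f"lr={lr:.0e}, world_size={world_size} -> global batch size = {global_bs}")

# lr=1e-04, world_size=4 -> global batch size = 8
# lr=2e-04, world_size=8 -> global batch size = 16
# lr=1e-04, world_size=8 -> global batch size = 16
```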

Related Discussion

I have listed the log info and the precision/recall results for each situation below.

  1. LEARNING_RATE=1e-4, world_size=4
{'BATCH_SIZE': 2,
'LEARNING_RATE': 0.0001,
'LOG_PATH': './detr_1d_coco_tp1_bs2_lr0.0001/',
'NUM_EPOCHS': 300,
'SEED': 42,
'TENSOR_PARALLEL_MODE': '1d',
'TENSOR_PARALLEL_SIZE': 1,
'WEIGHT_DECAY': 0.0001,
'aux_loss': True,
'backbone': 'resnet50',
'bbox_loss_coef': 5,
'clip_max_norm': 0.1,
'coco_path': '/data/scratch/coco',
'cudnn_benchmark': False,
'dataset_file': 'coco',
'dec_layers': 6,
'device': 'cuda',
'dice_loss_coef': 1,
'dilation': False,
'dim_feedforward': 2048,
'dist_url': 'env://',
'distributed': True,
'dropout': 0.1,
'enc_layers': 6,
'eos_coef': 0.1,
'eval': False,
'giou_loss_coef': 2,
'hidden_dim': 256,
'lr_backbone': 1e-05,
'lr_drop': 200,
'mask_loss_coef': 1,
'masks': False,
'nheads': 8,
'num_queries': 100,
'num_workers': 2,
'output_dir': '',
'parallel': {'pipeline': 1, 'tensor': {'mode': '1d', 'size': 1}},
'position_embedding': 'sine',
'remove_difficult': False,
'resume': '',
'save_ckpt_freq': 50,
'seed': 42,
'set_cost_bbox': 5,
'set_cost_class': 1,
'set_cost_giou': 2,
'start_epoch': 0}

DONE (t=5.15s).
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.002
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.007
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.001
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.000
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.001
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.004
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.012
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.027
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.036
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.000
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.010
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.066
  2. LEARNING_RATE=2e-4, world_size=8
(Config identical to situation 1, except 'LEARNING_RATE': 0.0002 and 'LOG_PATH': './detr_1d_coco_tp1_bs2_lr0.0002/'.)

DONE (t=5.13s).
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.001
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.001
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.000
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.000
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.000
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.001
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.005
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.008
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.010
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.000
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.002
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.017
  3. LEARNING_RATE=1e-4, world_size=8
(Config identical to situation 1; only the world size differs, and world_size is not recorded in the config dump.)

DONE (t=5.91s).
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.001
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.003
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.001
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.000
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.001
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.002
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.008
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.016
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.020
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.000
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.004
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.034

Environment

CUDA = 11.4 / Python = 3.8.13 / PyTorch = 1.11.0

BoxiangW avatar Jun 17 '22 08:06 BoxiangW

First, global batch size = per-DP batch size × data-parallel world size. Both the linear learning-rate scaling rule and the square-root (sqrt) scaling rule are popular and useful. For example, if the global batch size is increased 4x, the learning rate can be increased 4x (linear rule) or 2x (sqrt rule).
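A minimal sketch of the two rules, taking the 4-GPU run (lr=1e-4, global batch size 8) as the baseline (the helper name is hypothetical):

```python
def scaled_lr(base_lr, base_global_bs, new_global_bs, rule="linear"):
    """Scale the learning rate when the global batch size changes."""
    factor = new_global_bs / base_global_bs
    if rule == "linear":
        return base_lr * factor         # 4x batch -> 4x lr
    if rule == "sqrt":
        return base_lr * factor ** 0.5  # 4x batch -> 2x lr
    raise ValueError(f"unknown rule: {rule}")

# Baseline: world_size=4, BATCH_SIZE=2 -> global batch size 8.
# Moving to world_size=8 doubles the global batch size to 16.
print(scaled_lr(1e-4, 8, 16, "linear"))  # 0.0002   -> matches situation 2
print(scaled_lr(1e-4, 8, 16, "sqrt"))    # ~0.000141
```

Under the linear rule, situation 2 (lr=2e-4, world_size=8) is the consistent counterpart of situation 1, while situation 3 keeps the 4-GPU learning rate despite a 2x larger global batch size.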

ver217 avatar Jun 17 '22 08:06 ver217

Besides, the learning rate schedule may need to change along with the global batch size. Generally, with a larger global batch size, the learning rate should be warmed up for more steps, as in the sketch below.
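For example, a linear warmup can be scaled with the global batch size via `torch.optim.lr_scheduler.LambdaLR` (a sketch; the step counts and the model are made-up placeholders, not values from this issue):

```python
import torch

model = torch.nn.Linear(256, 256)  # stand-in for the real model
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4)

base_warmup_steps = 500                     # hypothetical value tuned for global batch size 8
warmup_steps = base_warmup_steps * 16 // 8  # batch size doubled -> warm up twice as long

# Linear warmup from ~0 up to the full lr, constant afterwards.
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lambda step: min(1.0, (step + 1) / warmup_steps)
)
```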

ver217 avatar Jun 17 '22 09:06 ver217

This issue was closed due to inactivity. Thanks.

binmakeswell avatar Apr 13 '23 04:04 binmakeswell