ColossalAI
[BUG]: Relationship between `BATCH_SIZE`, `LEARNING_RATE` and GPU amount
🐛 Describe the bug
When testing DETR on Colossal-Example, I found that the model (using DDP only) yields significantly different accuracies in the following situations:
LEARNING_RATE=1e-4, world_size=4
LEARNING_RATE=2e-4, world_size=8
LEARNING_RATE=1e-4, world_size=8
(Due to the long training time, I only ran one epoch for each situation.)
The log info, precision, and recall for each situation are listed below.
LEARNING_RATE=1e-4, world_size=4
{'BATCH_SIZE': 2,
'LEARNING_RATE': 0.0001,
'LOG_PATH': './detr_1d_coco_tp1_bs2_lr0.0001/',
'NUM_EPOCHS': 300,
'SEED': 42,
'TENSOR_PARALLEL_MODE': '1d',
'TENSOR_PARALLEL_SIZE': 1,
'WEIGHT_DECAY': 0.0001,
'aux_loss': True,
'backbone': 'resnet50',
'bbox_loss_coef': 5,
'clip_max_norm': 0.1,
'coco_path': '/data/scratch/coco',
'cudnn_benchmark': False,
'dataset_file': 'coco',
'dec_layers': 6,
'device': 'cuda',
'dice_loss_coef': 1,
'dilation': False,
'dim_feedforward': 2048,
'dist_url': 'env://',
'distributed': True,
'dropout': 0.1,
'enc_layers': 6,
'eos_coef': 0.1,
'eval': False,
'giou_loss_coef': 2,
'hidden_dim': 256,
'lr_backbone': 1e-05,
'lr_drop': 200,
'mask_loss_coef': 1,
'masks': False,
'nheads': 8,
'num_queries': 100,
'num_workers': 2,
'output_dir': '',
'parallel': {'pipeline': 1, 'tensor': {'mode': '1d', 'size': 1}},
'position_embedding': 'sine',
'remove_difficult': False,
'resume': '',
'save_ckpt_freq': 50,
'seed': 42,
'set_cost_bbox': 5,
'set_cost_class': 1,
'set_cost_giou': 2,
'start_epoch': 0}
DONE (t=5.15s).
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.002
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.007
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.001
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.000
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.001
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.004
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.012
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.027
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.036
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.000
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.010
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.066
LEARNING_RATE=2e-4, world_size=8
{'BATCH_SIZE': 2,
'LEARNING_RATE': 0.0002,
'LOG_PATH': './detr_1d_coco_tp1_bs2_lr0.0002/',
'NUM_EPOCHS': 300,
'SEED': 42,
'TENSOR_PARALLEL_MODE': '1d',
'TENSOR_PARALLEL_SIZE': 1,
'WEIGHT_DECAY': 0.0001,
'aux_loss': True,
'backbone': 'resnet50',
'bbox_loss_coef': 5,
'clip_max_norm': 0.1,
'coco_path': '/data/scratch/coco',
'cudnn_benchmark': False,
'dataset_file': 'coco',
'dec_layers': 6,
'device': 'cuda',
'dice_loss_coef': 1,
'dilation': False,
'dim_feedforward': 2048,
'dist_url': 'env://',
'distributed': True,
'dropout': 0.1,
'enc_layers': 6,
'eos_coef': 0.1,
'eval': False,
'giou_loss_coef': 2,
'hidden_dim': 256,
'lr_backbone': 1e-05,
'lr_drop': 200,
'mask_loss_coef': 1,
'masks': False,
'nheads': 8,
'num_queries': 100,
'num_workers': 2,
'output_dir': '',
'parallel': {'pipeline': 1, 'tensor': {'mode': '1d', 'size': 1}},
'position_embedding': 'sine',
'remove_difficult': False,
'resume': '',
'save_ckpt_freq': 50,
'seed': 42,
'set_cost_bbox': 5,
'set_cost_class': 1,
'set_cost_giou': 2,
'start_epoch': 0}
DONE (t=5.13s).
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.001
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.001
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.000
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.000
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.000
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.001
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.005
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.008
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.010
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.000
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.002
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.017
LEARNING_RATE=1e-4, world_size=8
{'BATCH_SIZE': 2,
'LEARNING_RATE': 0.0001,
'LOG_PATH': './detr_1d_coco_tp1_bs2_lr0.0001/',
'NUM_EPOCHS': 300,
'SEED': 42,
'TENSOR_PARALLEL_MODE': '1d',
'TENSOR_PARALLEL_SIZE': 1,
'WEIGHT_DECAY': 0.0001,
'aux_loss': True,
'backbone': 'resnet50',
'bbox_loss_coef': 5,
'clip_max_norm': 0.1,
'coco_path': '/data/scratch/coco',
'cudnn_benchmark': False,
'dataset_file': 'coco',
'dec_layers': 6,
'device': 'cuda',
'dice_loss_coef': 1,
'dilation': False,
'dim_feedforward': 2048,
'dist_url': 'env://',
'distributed': True,
'dropout': 0.1,
'enc_layers': 6,
'eos_coef': 0.1,
'eval': False,
'giou_loss_coef': 2,
'hidden_dim': 256,
'lr_backbone': 1e-05,
'lr_drop': 200,
'mask_loss_coef': 1,
'masks': False,
'nheads': 8,
'num_queries': 100,
'num_workers': 2,
'output_dir': '',
'parallel': {'pipeline': 1, 'tensor': {'mode': '1d', 'size': 1}},
'position_embedding': 'sine',
'remove_difficult': False,
'resume': '',
'save_ckpt_freq': 50,
'seed': 42,
'set_cost_bbox': 5,
'set_cost_class': 1,
'set_cost_giou': 2,
'start_epoch': 0}
DONE (t=5.91s).
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.001
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.003
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.001
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.000
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.001
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.002
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.008
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.016
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.020
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.000
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.004
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.034
Environment
CUDA = 11.4 / Python = 3.8.13 / PyTorch = 1.11.0
First, the global batch size = batch size per data-parallel rank * data-parallel world size. Both the linear learning-rate scaling rule and the square-root (sqrt) scaling rule are popular and useful. For example, if the global batch size is increased by 4x, the learning rate can be increased by 4x (linear) or 2x (sqrt).
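For illustration, here is a minimal sketch of the two scaling rules (the reference global batch size, the base learning rate, and the `scaled_lr` function name are assumptions for this example, not code from Colossal-Example):

```python
import math

BASE_LEARNING_RATE = 1e-4    # LR tuned for the reference global batch size (assumption)
BASE_GLOBAL_BATCH_SIZE = 8   # e.g. BATCH_SIZE=2 on 4 GPUs (hypothetical reference point)

def scaled_lr(batch_size_per_rank: int, dp_world_size: int, rule: str = "linear") -> float:
    """Scale the base learning rate by the growth of the global batch size."""
    global_batch_size = batch_size_per_rank * dp_world_size
    ratio = global_batch_size / BASE_GLOBAL_BATCH_SIZE
    if rule == "linear":
        return BASE_LEARNING_RATE * ratio             # 4x batch -> 4x LR
    if rule == "sqrt":
        return BASE_LEARNING_RATE * math.sqrt(ratio)  # 4x batch -> 2x LR
    raise ValueError(f"unknown rule: {rule}")

# BATCH_SIZE=2 with world_size=8 doubles the global batch size relative to world_size=4,
# so the learning rate would be scaled accordingly:
print(scaled_lr(2, 8, "linear"))  # 2e-4
print(scaled_lr(2, 8, "sqrt"))    # ~1.41e-4
```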
Besides, the learning-rate schedule may also need to change with the global batch size. Generally, with a larger global batch size, the learning rate should be warmed up for more steps.
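A minimal sketch of this idea, assuming a plain linear warmup implemented with `torch.optim.lr_scheduler.LambdaLR` (the reference warmup length and function name are assumptions, not ColossalAI APIs):

```python
import torch
from torch.optim.lr_scheduler import LambdaLR

BASE_WARMUP_STEPS = 500      # warmup length tuned for the reference global batch size (assumption)
BASE_GLOBAL_BATCH_SIZE = 8   # hypothetical reference point

def build_warmup_scheduler(optimizer, batch_size_per_rank, dp_world_size):
    global_batch_size = batch_size_per_rank * dp_world_size
    # Larger global batch size -> warm up for proportionally more steps.
    warmup_steps = max(1, int(BASE_WARMUP_STEPS * global_batch_size / BASE_GLOBAL_BATCH_SIZE))

    def lr_lambda(step):
        if step < warmup_steps:
            return (step + 1) / warmup_steps  # linear ramp up to the full learning rate
        return 1.0                            # keep the full learning rate after warmup

    return LambdaLR(optimizer, lr_lambda)

model = torch.nn.Linear(16, 16)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scheduler = build_warmup_scheduler(optimizer, batch_size_per_rank=2, dp_world_size=8)
# call scheduler.step() once per training iteration
```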
This issue was closed due to inactivity. Thanks.