ColossalAI
[BUG]: Relationship between `BATCH_SIZE`, `LEARNING_RATE` and GPU amount
🐛 Describe the bug
When testing DETR on Colossal-Example, I found that the model (using DDP only) yields significantly different accuracies in the following situations:
LEARNING_RATE=1e-4, world_size=4
LEARNING_RATE=2e-4, world_size=8
LEARNING_RATE=1e-4, world_size=8
(Due to the long training time, I only ran one epoch for each situation.)
The log info, precision, and recall for each situation are listed below.
LEARNING_RATE=1e-4, world_size=4
{'BATCH_SIZE': 2,
'LEARNING_RATE': 0.0001,
'LOG_PATH': './detr_1d_coco_tp1_bs2_lr0.0001/',
'NUM_EPOCHS': 300,
'SEED': 42,
'TENSOR_PARALLEL_MODE': '1d',
'TENSOR_PARALLEL_SIZE': 1,
'WEIGHT_DECAY': 0.0001,
'aux_loss': True,
'backbone': 'resnet50',
'bbox_loss_coef': 5,
'clip_max_norm': 0.1,
'coco_path': '/data/scratch/coco',
'cudnn_benchmark': False,
'dataset_file': 'coco',
'dec_layers': 6,
'device': 'cuda',
'dice_loss_coef': 1,
'dilation': False,
'dim_feedforward': 2048,
'dist_url': 'env://',
'distributed': True,
'dropout': 0.1,
'enc_layers': 6,
'eos_coef': 0.1,
'eval': False,
'giou_loss_coef': 2,
'hidden_dim': 256,
'lr_backbone': 1e-05,
'lr_drop': 200,
'mask_loss_coef': 1,
'masks': False,
'nheads': 8,
'num_queries': 100,
'num_workers': 2,
'output_dir': '',
'parallel': {'pipeline': 1, 'tensor': {'mode': '1d', 'size': 1}},
'position_embedding': 'sine',
'remove_difficult': False,
'resume': '',
'save_ckpt_freq': 50,
'seed': 42,
'set_cost_bbox': 5,
'set_cost_class': 1,
'set_cost_giou': 2,
'start_epoch': 0}
DONE (t=5.15s).
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.002
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.007
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.001
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.000
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.001
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.004
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.012
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.027
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.036
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.000
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.010
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.066
LEARNING_RATE=2e-4, world_size=8
{'BATCH_SIZE': 2,
'LEARNING_RATE': 0.0002,
'LOG_PATH': './detr_1d_coco_tp1_bs2_lr0.0002/',
'NUM_EPOCHS': 300,
'SEED': 42,
'TENSOR_PARALLEL_MODE': '1d',
'TENSOR_PARALLEL_SIZE': 1,
'WEIGHT_DECAY': 0.0001,
'aux_loss': True,
'backbone': 'resnet50',
'bbox_loss_coef': 5,
'clip_max_norm': 0.1,
'coco_path': '/data/scratch/coco',
'cudnn_benchmark': False,
'dataset_file': 'coco',
'dec_layers': 6,
'device': 'cuda',
'dice_loss_coef': 1,
'dilation': False,
'dim_feedforward': 2048,
'dist_url': 'env://',
'distributed': True,
'dropout': 0.1,
'enc_layers': 6,
'eos_coef': 0.1,
'eval': False,
'giou_loss_coef': 2,
'hidden_dim': 256,
'lr_backbone': 1e-05,
'lr_drop': 200,
'mask_loss_coef': 1,
'masks': False,
'nheads': 8,
'num_queries': 100,
'num_workers': 2,
'output_dir': '',
'parallel': {'pipeline': 1, 'tensor': {'mode': '1d', 'size': 1}},
'position_embedding': 'sine',
'remove_difficult': False,
'resume': '',
'save_ckpt_freq': 50,
'seed': 42,
'set_cost_bbox': 5,
'set_cost_class': 1,
'set_cost_giou': 2,
'start_epoch': 0}
DONE (t=5.13s).
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.001
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.001
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.000
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.000
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.000
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.001
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.005
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.008
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.010
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.000
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.002
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.017
LEARNING_RATE=1e-4, world_size=8
{'BATCH_SIZE': 2,
'LEARNING_RATE': 0.0001,
'LOG_PATH': './detr_1d_coco_tp1_bs2_lr0.0001/',
'NUM_EPOCHS': 300,
'SEED': 42,
'TENSOR_PARALLEL_MODE': '1d',
'TENSOR_PARALLEL_SIZE': 1,
'WEIGHT_DECAY': 0.0001,
'aux_loss': True,
'backbone': 'resnet50',
'bbox_loss_coef': 5,
'clip_max_norm': 0.1,
'coco_path': '/data/scratch/coco',
'cudnn_benchmark': False,
'dataset_file': 'coco',
'dec_layers': 6,
'device': 'cuda',
'dice_loss_coef': 1,
'dilation': False,
'dim_feedforward': 2048,
'dist_url': 'env://',
'distributed': True,
'dropout': 0.1,
'enc_layers': 6,
'eos_coef': 0.1,
'eval': False,
'giou_loss_coef': 2,
'hidden_dim': 256,
'lr_backbone': 1e-05,
'lr_drop': 200,
'mask_loss_coef': 1,
'masks': False,
'nheads': 8,
'num_queries': 100,
'num_workers': 2,
'output_dir': '',
'parallel': {'pipeline': 1, 'tensor': {'mode': '1d', 'size': 1}},
'position_embedding': 'sine',
'remove_difficult': False,
'resume': '',
'save_ckpt_freq': 50,
'seed': 42,
'set_cost_bbox': 5,
'set_cost_class': 1,
'set_cost_giou': 2,
'start_epoch': 0}
DONE (t=5.91s).
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.001
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.003
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.001
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.000
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.001
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.002
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.008
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.016
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.020
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.000
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.004
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.034
Environment
CUDA = 11.4 / Python = 3.8.13 / PyTorch = 1.11.0
First, the global batch size = batch size per data-parallel rank * data-parallel world size. Both the linear learning-rate scaling rule and the square-root (sqrt) scaling rule are popular and useful. For example, if the global batch size is increased by 4x, the learning rate can be increased by 4x (linear) or 2x (sqrt).
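For illustration, here is a minimal sketch of the two scaling rules (the reference global batch size, the base learning rate, and the `scaled_lr` function name are assumptions for this example, not code from Colossal-Example):

```python
import math

BASE_LEARNING_RATE = 1e-4    # LR tuned for the reference global batch size (assumption)
BASE_GLOBAL_BATCH_SIZE = 8   # e.g. BATCH_SIZE=2 on 4 GPUs (hypothetical reference point)

def scaled_lr(batch_size_per_rank: int, dp_world_size: int, rule: str = "linear") -> float:
    """Scale the base learning rate by the growth of the global batch size."""
    global_batch_size = batch_size_per_rank * dp_world_size
    ratio = global_batch_size / BASE_GLOBAL_BATCH_SIZE
    if rule == "linear":
        return BASE_LEARNING_RATE * ratio             # 4x batch -> 4x LR
    if rule == "sqrt":
        return BASE_LEARNING_RATE * math.sqrt(ratio)  # 4x batch -> 2x LR
    raise ValueError(f"unknown rule: {rule}")

# BATCH_SIZE=2 with world_size=8 doubles the global batch size relative to world_size=4,
# so the learning rate would be scaled accordingly:
print(scaled_lr(2, 8, "linear"))  # 2e-4
print(scaled_lr(2, 8, "sqrt"))    # ~1.41e-4
```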
Besides, the learning-rate schedule may also need to change with the global batch size. Generally, with a larger global batch size, the learning rate should be warmed up for more steps.
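A minimal sketch of this idea, assuming a plain linear warmup implemented with `torch.optim.lr_scheduler.LambdaLR` (the reference warmup length and function name are assumptions, not ColossalAI APIs):

```python
import torch
from torch.optim.lr_scheduler import LambdaLR

BASE_WARMUP_STEPS = 500      # warmup length tuned for the reference global batch size (assumption)
BASE_GLOBAL_BATCH_SIZE = 8   # hypothetical reference point

def build_warmup_scheduler(optimizer, batch_size_per_rank, dp_world_size):
    global_batch_size = batch_size_per_rank * dp_world_size
    # Larger global batch size -> warm up for proportionally more steps.
    warmup_steps = max(1, int(BASE_WARMUP_STEPS * global_batch_size / BASE_GLOBAL_BATCH_SIZE))

    def lr_lambda(step):
        if step < warmup_steps:
            return (step + 1) / warmup_steps  # linear ramp up to the full learning rate
        return 1.0                            # keep the full learning rate after warmup

    return LambdaLR(optimizer, lr_lambda)

model = torch.nn.Linear(16, 16)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scheduler = build_warmup_scheduler(optimizer, batch_size_per_rank=2, dp_world_size=8)
# call scheduler.step() once per training iteration
```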
This issue was closed due to inactivity. Thanks.