DeepSpeedExamples
Why does ZeRO-2 use more CUDA memory than ZeRO-1?
Following the bing_bert tutorial, my deepspeed_config is:
```json
{
  "train_batch_size": 4096,
  "train_micro_batch_size_per_gpu": 32,
  "steps_per_print": 1000,
  "prescale_gradients": false,
  "optimizer": {
    "type": "Adam",
    "params": {
      "lr": 6e-3,
      "betas": [0.9, 0.99],
      "eps": 1e-8,
      "weight_decay": 0.01
    }
  },
  "zero_optimization": {
    "stage": 1,
    "allgather_partitions": true,
    "allgather_bucket_size": 5e8,
    "overlap_comm": false,
    "reduce_scatter": true,
    "reduce_bucket_size": 5e8,
    "contiguous_gradients": true,
    "grad_hooks": true,
    "round_robin_gradients": false
  },
  "scheduler": {
    "type": "WarmupLR",
    "params": {
      "warmup_min_lr": 1e-8,
      "warmup_max_lr": 6e-3
    }
  },
  "gradient_clipping": 1.0,
  "wall_clock_breakdown": false,
  "fp16": {
    "enabled": true,
    "loss_scale": 0
  },
  "sparse_attention": {
    "mode": "fixed",
    "block": 16,
    "different_layout_per_head": true,
    "num_local_blocks": 4,
    "num_global_blocks": 1,
    "attention": "bidirectional",
    "horizontal_global_attention": false,
    "num_different_global_patterns": 4
  }
}
```
The CUDA memory usage for stage 1 is 8900 MB per GPU; for stage 2 it is 9600 MB per GPU.
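As a side note, I did a rough back-of-envelope check on the bucket sizes in the config above (this is just my own estimate of where extra memory could come from, not a confirmed explanation): with fp16 enabled, a flat bucket of 5e8 elements is on the order of 1 GB, which is in the same ballpark as the ~700 MB gap I'm seeing.

```python
# Rough back-of-envelope for the bucket buffers in the config above.
# (My own estimate -- I'm not certain this is where the extra memory goes.)

FP16_BYTES = 2  # fp16 element size in bytes

def bucket_mib(num_elements, bytes_per_element=FP16_BYTES):
    """Approximate size in MiB of a flat bucket of `num_elements` elements."""
    return num_elements * bytes_per_element / (1024 ** 2)

# Values taken from the config: "allgather_bucket_size": 5e8,
# "reduce_bucket_size": 5e8
print(f"allgather bucket ~ {bucket_mib(5e8):.0f} MiB")
print(f"reduce bucket    ~ {bucket_mib(5e8):.0f} MiB")
```

Each 5e8-element fp16 bucket works out to roughly 954 MiB, so any additional stage-2 buffer of this scale would be comparable to the observed difference.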
ZeRO-2 is also much slower than ZeRO-1 in training speed.
Any help would be appreciated~