DeepSpeedExamples
Is GPU throughput reasonable?
I am currently running some tests with ZeRO-3 Infinity, have run into some problems, and would like your help.
Machine configuration: two nodes, each with an A100-PCIE-40GB GPU, 126 GB of RAM (about 60 GB actually available at runtime), and a 1 TB SSD (Samsung 980).
Benchmark code: /DeepSpeedExamples/Megatron-LM-v1.1.5-ZeRO3/
Model case tested: HIDDEN_SIZE / NUM_ATTN_HEADS / NUM_LAYERS / BATCHSIZE = 4096 / 16 / 50 / 8 (model size 10B). GPU memory usage is 13395/40537 MB, RAM usage is 109/126 GB (60 GB at idle), and about 80 GB of swap files are stored on the NVMe file system. The effective TFLOPS per GPU is about 1.5.
Question: Is the GPU throughput achieved under the current environment configuration reasonable, and can throughput be increased by raising the batch size or changing other configuration options? The effective TFLOPS per GPU calculated by the flops_calculator in DeepSpeedExamples is about 1.5 TFLOPS, but the FLOPS per GPU measured by the DeepSpeed profiler is 2.32 GFLOPS. (deepspeed_profile.txt is generated by the DeepSpeed profiler, and train.log is the output produced during training.)
Attachments: deepspeed_profile.txt, train.log
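For reference, here is a minimal Python sketch of the Megatron-style estimate that flops_calculator-type tools are based on. The exact formula used in DeepSpeedExamples may differ, and the sequence length, vocabulary size, iteration time, and GPU count below are assumptions used purely for illustration.

```python
# Hedged sketch: Megatron-style estimate of achieved TFLOPS per GPU.
# seq_len, vocab_size, elapsed_time_per_iter_s and num_gpus are assumptions.

def estimate_tflops_per_gpu(batch_size, seq_len, num_layers, hidden_size,
                            vocab_size, elapsed_time_per_iter_s, num_gpus,
                            checkpoint_activations=True):
    # Factor 4 covers forward + backward + activation recomputation;
    # factor 3 covers forward + backward only.
    ckpt_factor = 4 if checkpoint_activations else 3
    flops_per_iter = (24 * ckpt_factor * batch_size * seq_len * num_layers
                      * hidden_size ** 2
                      * (1.0
                         + seq_len / (6.0 * hidden_size)
                         + vocab_size / (16.0 * num_layers * hidden_size)))
    return flops_per_iter / (elapsed_time_per_iter_s * num_gpus * 1e12)

# Illustrative call using the model shape reported above (other values assumed):
print(estimate_tflops_per_gpu(batch_size=8, seq_len=1024, num_layers=50,
                              hidden_size=4096, vocab_size=50257,
                              elapsed_time_per_iter_s=30.0, num_gpus=2))
```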
I hope to get your help. Thank you very much!
@Crispig, thanks for your question.
The TFLOPS on 16xA100-40GB is quite low. What is the batch size? A 10B model is too small for ZeRO-Infinity with NVMe offload, given the overheads of parameter partitioning and NVMe offload. You should get much better performance with ZeRO-Offload.
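For comparison, here is a minimal sketch of a ZeRO-Offload-style configuration that keeps parameters on the GPU and offloads only the optimizer state to CPU memory; the batch size and other values are assumptions, not a recommended setting for this specific machine.

```python
# Hedged sketch: ZeRO-Offload style DeepSpeed config (no NVMe involved).
# All values are illustrative and must be tuned for the actual hardware.
ds_config = {
    "train_micro_batch_size_per_gpu": 16,
    "gradient_accumulation_steps": 1,
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,                      # ZeRO-Offload is commonly used with stage 1/2
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": True
        }
    }
}
```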
Some factors to consider in order to understand and improve ZeRO-Infinity performance:
- The SSD is likely a bottleneck; you can profile the SSD using this guide.
- Offload only the optimizer state to CPU/NVMe, but not the parameters, since there is sufficient GPU memory for them (see the config sketch after this list).
- Increase batch size to improve compute load and efficiency
- Disable or reduce activation checkpointing frequency
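Putting the offload and batch-size suggestions together, here is a minimal sketch of a ZeRO-3 configuration that offloads only the optimizer state to NVMe and leaves parameters in GPU memory. The NVMe path, batch size, and aio values are assumptions that need to be tuned for this machine (e.g. after profiling the Samsung 980).

```python
# Hedged sketch: ZeRO-3 config with only the optimizer state offloaded to NVMe.
# nvme_path, batch size and aio values are assumptions to be tuned per machine.
ds_config = {
    "train_micro_batch_size_per_gpu": 32,    # raised from 8 to improve GPU efficiency
    "gradient_accumulation_steps": 1,
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "nvme",
            "nvme_path": "/local_nvme",      # assumed mount point of the SSD
            "pin_memory": True
        }
        # No "offload_param" block, so parameters stay in GPU memory.
    },
    "aio": {                                 # async I/O settings for NVMe offload
        "block_size": 1048576,
        "queue_depth": 8,
        "thread_count": 1,
        "single_submit": False,
        "overlap_events": True
    }
}
```

Note that in this Megatron example, activation checkpointing is typically toggled with the --checkpoint-activations launch flag rather than through the DeepSpeed config.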
Thank you very much for your reply! The batch size I used in my previous test was 8. I have done the following tests so far:
- For a 1.7B model with a batch size of 48, with no offload or with offload to the CPU, throughput can reach 36 TFLOPS. For a 17B model with only the optimizer state offloaded (Infinity) to the SSD, it can reach up to 15 TFLOPS.
- With the optimizer state offloaded (Infinity) to the SSD, modifying only hidden_size to make the model smaller, without changing any other configuration options, results in the following error:
```
Traceback (most recent call last):
192.168.189.10:   File "pretrain_gpt2.py", line 134, in <module>
192.168.189.10:     args_defaults={'tokenizer_type': 'GPT2BPETokenizer'})
192.168.189.10:   File "/home/lcy/DeepSpeedExamples/Megatron-LM-v1.1.5-ZeRO3/megatron/training.py", line 111, in pretrain
192.168.189.10:     train_data_iterator, valid_data_iterator)
192.168.189.10:   File "/home/lcy/DeepSpeedExamples/Megatron-LM-v1.1.5-ZeRO3/megatron/training.py", line 545, in train
192.168.189.10:     lr_scheduler)
192.168.189.10:   File "/home/lcy/DeepSpeedExamples/Megatron-LM-v1.1.5-ZeRO3/megatron/training.py", line 394, in train_step
192.168.189.10:     model.step()
192.168.189.10:   File "/home/lcy/DeepSpeed/deepspeed/runtime/engine.py", line 1911, in step
192.168.189.10:     self._take_model_step(lr_kwargs)
192.168.189.10:   File "/home/lcy/DeepSpeed/deepspeed/runtime/engine.py", line 1812, in _take_model_step
192.168.189.10:     self.optimizer.step()
192.168.189.10:   File "/home/lcy/DeepSpeed/deepspeed/runtime/zero/stage3.py", line 1932, in step
192.168.189.10:     self.unscale_and_clip_grads(sub_group_id, scaled_global_grad_norm)
192.168.189.10:   File "/home/lcy/DeepSpeed/deepspeed/runtime/zero/stage3.py", line 2007, in unscale_and_clip_grads
192.168.189.10:     self.fp32_partitioned_groups_flat[sub_group_id].grad.mul_(1. / combined_scale)
192.168.189.10: AttributeError: 'NoneType' object has no attribute 'mul_'
```
- During training, the following warning is constantly printed; I made a correction by following this issue, but I do not know whether it will affect performance:
[WARNING] [parameter_offload.py:48:_apply_to_tensors_only] A module has unknown inputs or outputs type (<class 'torch.nn.parameter.Parameter'>) and the tensors embedded in it cannot be detected. The ZeRO-3 hooks designed to trigger before or after backward pass of the module relies on knowing the input and output tensors and therefore may not get triggered properly.
Maybe I am too late here but this old Megatron has been deprecated. Can you kindly try the latest code and recipes from here?
https://github.com/microsoft/Megatron-DeepSpeed/tree/main/examples/azure