DeepSpeedExamples
Add low_cpu_mem_usage flag in inference test
@awan-10 @lekurile Please review
Can confirm: when the model is loaded from safetensors checkpoints, this flag can reduce memory usage by a factor of 5 or more.
When experimenting with Llama-2-70B, memory usage before this fix exceeded 260GB per process before it OOMed. After the fix, it took <250GB in total. This is likely because safetensors can memory-map the weight files into the shared, OS-wide page cache, so different ranks end up pointing at the same physical memory.
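For context, a minimal sketch of what setting the flag looks like on the transformers side, assuming the test loads the model via AutoModelForCausalLM.from_pretrained (the actual call site in inference-test.py may differ, e.g. it may go through pipeline()):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-2-70b-hf"

tokenizer = AutoTokenizer.from_pretrained(MODEL)

# low_cpu_mem_usage=True loads the checkpoint lazily instead of first
# allocating the full model and then copying weights into it. With
# safetensors checkpoints the shards are memory-mapped, so several ranks
# reading the same files can share pages through the OS page cache.
model = AutoModelForCausalLM.from_pretrained(
    MODEL,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
)
```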
To reproduce:
deepspeed --num_gpus 4 inference-test.py --model meta-llama/Llama-2-70b-hf --batch_size 2 --dtype float16 --max_new_tokens 32 --test_performance
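If the flag is exposed as a command-line option rather than hard-coded, the wiring in inference-test.py could look roughly like the sketch below; the option name and the way it is forwarded to the loader are assumptions, not the actual diff.

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--model", type=str, required=True,
                    help="HF model name or path")
# Hypothetical CLI option; the change may instead hard-code low_cpu_mem_usage=True.
parser.add_argument("--low_cpu_mem_usage", action="store_true",
                    help="Forward low_cpu_mem_usage=True to from_pretrained to avoid "
                         "materializing an extra full copy of the checkpoint per rank")
args = parser.parse_args()

# Forward the flag to the Hugging Face loader (call site is illustrative):
# model = AutoModelForCausalLM.from_pretrained(
#     args.model, low_cpu_mem_usage=args.low_cpu_mem_usage)
```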