DeepSpeedExamples
Add low_cpu_mem_usage flag in inference test
@awan-10 @lekurile Please review
Can confirm: when the model is loaded from safetensors checkpoints, this flag can reduce memory usage by a factor of 5 or more.
When experimenting with Llama-2-70B, memory usage before this fix exceeded 260GB per process before it OOMed. After the fix, it took <250GB in total. This is likely because safetensors can memory-map the weight files into the shared, OS-wide page cache, so different ranks end up pointing at the same physical memory.
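For context, a minimal sketch of what setting the flag looks like on the transformers side, assuming the test loads the model via AutoModelForCausalLM.from_pretrained (the actual call site in inference-test.py may differ, e.g. it may go through pipeline()):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-2-70b-hf"

tokenizer = AutoTokenizer.from_pretrained(MODEL)

# low_cpu_mem_usage=True loads the checkpoint lazily instead of first
# allocating the full model and then copying weights into it. With
# safetensors checkpoints the shards are memory-mapped, so several ranks
# reading the same files can share pages through the OS page cache.
model = AutoModelForCausalLM.from_pretrained(
    MODEL,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
)
```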
To reproduce:
deepspeed --num_gpus 4 inference-test.py --model meta-llama/Llama-2-70b-hf --batch_size 2 --dtype float16 --max_new_tokens 32 --test_performance
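If the flag is exposed as a command-line option rather than hard-coded, the wiring in inference-test.py could look roughly like the sketch below; the option name and the way it is forwarded to the loader are assumptions, not the actual diff.

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--model", type=str, required=True,
                    help="HF model name or path")
# Hypothetical CLI option; the change may instead hard-code low_cpu_mem_usage=True.
parser.add_argument("--low_cpu_mem_usage", action="store_true",
                    help="Forward low_cpu_mem_usage=True to from_pretrained to avoid "
                         "materializing an extra full copy of the checkpoint per rank")
args = parser.parse_args()

# Forward the flag to the Hugging Face loader (call site is illustrative):
# model = AutoModelForCausalLM.from_pretrained(
#     args.model, low_cpu_mem_usage=args.low_cpu_mem_usage)
```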