[BUG] Excessive CPU and GPU Memory Usage with Multi-GPU Inference Using DeepSpeed
I am experiencing excessive CPU and GPU memory usage when running multi-GPU inference with DeepSpeed. Specifically, the memory usage does not scale as expected when increasing the number of GPUs. Below is the code I am using for inference:
```python
import os
import torch
import deepspeed
import time
from transformers import AutoTokenizer, AutoModelForCausalLM, AutoConfig
from deepspeed.runtime.zero.config import DeepSpeedZeroConfig
from deepspeed.inference.config import DeepSpeedTPConfig
from deepspeed.runtime.utils import see_memory_usage

local_rank = int(os.getenv("LOCAL_RANK", "0"))
world_size = int(os.getenv("WORLD_SIZE", "1"))

model_dir = "/mnt/sgnfsdata/tolo-03-97/pretrained_models/internlm2-chat-20b"
trust_remote_code = True

tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=trust_remote_code)
config = AutoConfig.from_pretrained(model_dir, trust_remote_code=trust_remote_code)
model = AutoModelForCausalLM.from_pretrained(model_dir,
                                             torch_dtype=torch.bfloat16,
                                             trust_remote_code=trust_remote_code
                                             )
model = model.eval()
see_memory_usage("After load model", force=True)

tp_config = DeepSpeedTPConfig(tp_size=world_size)
zero_config = DeepSpeedZeroConfig(stage=3,
                                  model_persistence_threshold=0,
                                  max_live_parameters=0,
                                  mics_shard_size=world_size
                                  )

ds_engine = deepspeed.init_inference(model=model,
                                     tensor_parallel=tp_config,
                                     dtype=torch.bfloat16,
                                     zero=zero_config,
                                     max_out_tokens=1024,
                                     replace_method="auto",
                                     replace_with_kernel_inject=True)
see_memory_usage("After DS-inference init", force=True)

model = ds_engine.module
print("device: ", model.device)

prompt = "what is deepspeed?"
t0 = time.time()
response = model.chat(tokenizer=tokenizer,
                      query=prompt,
                      history=[],
                      max_new_tokens=1024,
                      do_sample=True,
                      temperature=0.8,
                      top_p=0.8
                      )
t1 = time.time()

print(response)
print('=' * 100)
print("inference time: ", t1 - t0)
print('=' * 100)
```
Steps to Reproduce:
- Run the script with 2 GPUs:
  ```bash
  deepspeed --num_gpus 2 main.py --ds_inference
  ```
- Run the script with 4 GPUs:
  ```bash
  deepspeed --num_gpus 4 main.py --ds_inference
  ```
Expected Behavior: I expected that using 4 GPUs would reduce the memory usage per GPU, ideally halving the GPU memory usage compared to running with 2 GPUs.
Actual Behavior:
- With 2 GPUs:
  - CPU virtual memory: 92.87 GB
  - Memory per GPU: 37.74 GB
- With 4 GPUs:
  - CPU virtual memory: 162.92 GB (significantly higher than expected)
  - Memory per GPU: 37.74 GB (no reduction)
Questions:
- Why does the CPU virtual memory usage increase significantly when using more GPUs?
- How can I reduce the memory usage per GPU when scaling up the number of GPUs?
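One likely contributor to the first question, given the launch commands above: `deepspeed --num_gpus N` starts N processes, and every rank runs the whole script, so each rank materializes its own full bf16 copy of the 20B model in host RAM via `from_pretrained` before DeepSpeed shards anything. A minimal per-rank check (a sketch only, assuming `psutil` is installed; it is not part of the original script):

```python
import os

import psutil  # assumed available; only used to read this process's resident memory

# Print this rank's resident host memory right after the model load. If every rank
# reports roughly the full bf16 model size (~40 GB for a 20B-parameter model), the
# per-rank from_pretrained copies explain why host memory scales with --num_gpus.
local_rank = int(os.getenv("LOCAL_RANK", "0"))
rss_gib = psutil.Process(os.getpid()).memory_info().rss / 2**30
print(f"[rank {local_rank}] host RSS after model load: {rss_gib:.2f} GiB")
```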
System Info:
- DeepSpeed version: 0.14.4
- PyTorch version: 2.3.1
- Transformers version: 4.42.3
- Python version: 3.10
- OS: Ubuntu 24.04
Additional Context: Any insights or suggestions on how to optimize the memory usage for multi-GPU inference with DeepSpeed would be greatly appreciated. Thank you!
@gawain000000, can you clarify your goals? There are two different solutions depending on whether you are optimizing for latency or for throughput (and a low memory budget). I noticed the use of deepspeed.init_inference together with ZeRO stage 3 in your code, which is not a recommended combination.
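For concreteness, a minimal sketch of the two paths kept separate (reusing `model` and `world_size` from the script above; the config keys are illustrative, not a drop-in fix): kernel-injected tensor parallelism alone for the latency case, and ZeRO stage 3 with CPU parameter offload (ZeRO-Inference) through `deepspeed.initialize` for the memory-constrained case.

```python
import torch
import deepspeed

# Option A: latency-oriented. Tensor parallelism with kernel injection only, no ZeRO.
tp_engine = deepspeed.init_inference(
    model,
    tensor_parallel={"tp_size": world_size},
    dtype=torch.bfloat16,
    replace_with_kernel_inject=True,
)

# Option B: memory/budget-oriented ZeRO-Inference. ZeRO stage 3 with CPU parameter
# offload, set up through deepspeed.initialize rather than init_inference.
# (The two options are alternatives; do not combine them on the same model.)
ds_config = {
    "train_micro_batch_size_per_gpu": 1,  # required key; unused during inference
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "cpu", "pin_memory": True},
    },
}
zero_engine, _, _, _ = deepspeed.initialize(model=model, config=ds_config)
zero_engine.eval()
```

Note that for ZeRO-3 the model is normally constructed under `deepspeed.zero.Init` (or via Hugging Face's ZeRO-3 integration) so that a full copy never has to be materialized on each rank; the sketch above assumes the model already fits in host memory.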
@tjruwase My goals are the following:
- Reduce latency and increase throughput for inference, which is why we want to use DeepSpeed.
- Slice the model across multiple GPUs so that each GPU requires a smaller amount of memory.
The reason for this is that when an LLM runs inference on a long document, it needs additional memory to store the KV cache. I currently deploy the LLM on L40S GPUs, each of which has only 46 GB of memory. Without model slicing, processing a document of around 7,000 tokens results in an OOM error. I do not understand why DeepSpeed's inference initialization allocates the same amount of memory in both cases (2-GPU and 4-GPU deployment), which makes it impossible to process long documents.
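For a rough sense of scale, a back-of-envelope KV-cache estimate read from the model config (illustrative only: the attribute names assume a LLaMA-style config such as internlm2's, `model_dir` is reused from the script above, and real usage adds activation memory and allocator overhead on top):

```python
import torch
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained(model_dir, trust_remote_code=True)
num_layers = cfg.num_hidden_layers
num_kv_heads = getattr(cfg, "num_key_value_heads", cfg.num_attention_heads)
head_dim = cfg.hidden_size // cfg.num_attention_heads
bytes_per_elem = torch.finfo(torch.bfloat16).bits // 8  # 2 bytes for bf16

def kv_cache_bytes(seq_len: int, batch_size: int = 1) -> int:
    # Two cached tensors (K and V) per layer, each of shape
    # [batch, num_kv_heads, seq_len, head_dim].
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem * seq_len * batch_size

print(f"KV cache for a 7000-token context: {kv_cache_bytes(7000) / 2**30:.2f} GiB")
```

With tensor parallelism the KV cache is also split across ranks along the head dimension, which is the usual way to fit longer contexts on 46 GB cards.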
> Why does the CPU virtual memory usage increase significantly when using more GPUs?
Has the increase in CPU memory as the number of GPUs grows been resolved?