Memory leak when using CUDA Graph with torch.distributed.all_reduce (vLLM default config)

Running the following on the latest vLLM master

python -m vllm.entrypoints.openai.api_server --model mistralai/Mixtral-8x7B-Instruct-v0.1 --tensor-parallel-size 8 --max-num-batched-tokens 32768 --max-num-seqs 192

and then

git clone https://github.com/ray-project/llmperf
cd llmperf

export OPENAI_API_KEY="EMPTY"
export OPENAI_API_BASE="http://localhost:8000/v1"

python token_benchmark_ray.py --model "mistralai/Mixtral-8x7B-Instruct-v0.1" --num-concurrent-requests 192 --max-num-completed-requests 100000 --timeout 3600

(you will need to apply the following patch to avoid 400 errors:

diff --git a/src/llmperf/ray_clients/openai_chat_completions_client.py b/src/llmperf/ray_clients/openai_chat_completions_client.py
index f2e0a91252..0f95e30a71 100644
--- a/src/llmperf/ray_clients/openai_chat_completions_client.py
+++ b/src/llmperf/ray_clients/openai_chat_completions_client.py
@@ -20,7 +20,7 @@ class OpenAIChatCompletionsClient(LLMClient):
         prompt, prompt_len = prompt
 
         message = [
-            {"role": "system", "content": ""},
+            # {"role": "system", "content": ""},
             {"role": "user", "content": prompt},
         ]
         model = request_config.model

)

produces a pretty big CPU memory leak. This can be further diagnosed by applying the following diff

+++ b/vllm/entrypoints/openai/api_server.py
@@ -18,6 +18,8 @@ from fastapi.exceptions import RequestValidationError
 from fastapi.middleware.cors import CORSMiddleware
 from fastapi.responses import JSONResponse, StreamingResponse, Response
 
 
+import ray
+
 from vllm.engine.arg_utils import AsyncEngineArgs
 from vllm.engine.async_llm_engine import AsyncLLMEngine
 from vllm.engine.metrics import add_global_metrics_labels
@@ -43,6 +45,9 @@ engine = None
 response_role = None
 
 
+ray.init(runtime_env={"env_vars": {"PYTHONMALLOC": "malloc"}})
+
+
 def parse_args():

and then attaching memray to the PID of one of the RayWorkerVllm processes (in this example, PID 68654) like so:

sudo -E /home/ray/anaconda3/bin/memray attach 68654
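
The PYTHONMALLOC=malloc setting in the diff above is what makes this useful: CPython's default pymalloc serves small objects out of pooled arenas, so a native-level tracker mostly sees a handful of large arena allocations rather than the individual objects, while routing everything through the system malloc lets memray attribute each allocation. As a rough standalone illustration (a sketch using memray's Tracker API rather than the attach workflow; the file names are arbitrary), run the following once with PYTHONMALLOC=malloc exported and once without, then compare the reports from memray flamegraph:

import os

from memray import Tracker

allocator = os.environ.get("PYTHONMALLOC", "pymalloc")
print("allocator:", allocator)

with Tracker(f"allocations-{allocator}.bin"):
    # Many small Python objects: under the default pymalloc these are carved
    # out of pooled arenas, while with PYTHONMALLOC=malloc each one is an
    # individual call into the system allocator.
    junk = [bytearray(64) for _ in range(200_000)]
    del junk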

In my setup (A100 80GB, CUDA 12.2, latest PyTorch 2.1.2), I'm seeing the heap memory of each actor grow by about 50 MB/min, so the total leak is about 8 * 50 MB/min = 400 MB/min.
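
For reference, the growth rate itself can be tracked without a profiler by polling the resident set size of the worker processes; a small sketch using psutil (the PID list below is just the example PID from above and would need to be replaced):

import time

import psutil

PIDS = [68654]  # replace with your RayWorkerVllm PIDs
procs = {pid: psutil.Process(pid) for pid in PIDS}
baseline = {pid: p.memory_info().rss for pid, p in procs.items()}
start = time.time()

while True:
    time.sleep(60)
    minutes = (time.time() - start) / 60
    for pid, p in procs.items():
        grown = (p.memory_info().rss - baseline[pid]) / 2**20
        print(f"pid {pid}: +{grown:.1f} MiB total, {grown / minutes:.1f} MiB/min")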

pcmoritz avatar Jan 03 '24 02:01 pcmoritz

If I revert https://github.com/vllm-project/vllm/pull/2152, the memory leak goes away :)

pcmoritz avatar Jan 03 '24 02:01 pcmoritz

@pcmoritz Did you happen to try --enforce-eager on the main branch? I'm wondering whether this memory leak is due to CUDA graph or to the fix in #2151.

WoosukKwon avatar Jan 03 '24 20:01 WoosukKwon

There is no memory leak with --enforce-eager on the main branch. I believe this is some bad interaction between the torch collective communication calls and CUDA graph :)
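
For anyone who wants to poke at this outside of vLLM, the pattern under suspicion looks roughly like the following. This is a minimal sketch, not vLLM's model-runner code; it assumes an NCCL process group launched via torchrun (e.g. torchrun --nproc_per_node=2 repro.py, where repro.py is just a placeholder name) and one GPU per rank, and depending on the PyTorch/NCCL versions additional settings may be needed before NCCL collectives can be graph-captured at all:

import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

x = torch.ones(1 << 20, device="cuda")

# Warm up the collective on a side stream before capture, as is usual for
# CUDA graph capture of NCCL work.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    dist.all_reduce(x)
torch.cuda.current_stream().wait_stream(s)

# Capture the all_reduce into a CUDA graph and replay it many times; the
# report above is that host memory in each worker keeps growing while graphs
# containing torch.distributed.all_reduce are replayed.
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    dist.all_reduce(x)

for _ in range(100_000):
    g.replay()
torch.cuda.synchronize()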

pcmoritz avatar Jan 03 '24 20:01 pcmoritz

I can dig into this more later today and see if I can figure out where exactly the leak is happening :)

pcmoritz avatar Jan 03 '24 20:01 pcmoritz

@pcmoritz Hello, have you made any progress? I have also run into this problem.

junior-zsy avatar Jan 11 '24 08:01 junior-zsy

My current workaround is to use cupy for the all-reduce, as it was before https://github.com/vllm-project/vllm/pull/2152; that's working well. Unfortunately, I haven't found the root cause of the bug with torch.distributed.all_reduce yet :(

pcmoritz avatar Jan 11 '24 09:01 pcmoritz

@WoosukKwon I believe this memory leak is eliminated by https://github.com/vllm-project/vllm/pull/2192, so maybe that's a way forward to fix this without needing cupy. What do you think?

pcmoritz avatar Jan 19 '24 06:01 pcmoritz

I believe this is now fixed by https://github.com/vllm-project/vllm/pull/2192 when the custom all-reduce kernel is used. Please comment on this issue or open a new one if you still see a problem!
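
For anyone who wants to compare the two code paths, newer vLLM versions expose an engine argument for turning the custom all-reduce kernel off; a hedged sketch, assuming your release has disable_custom_all_reduce (check the engine arguments of your version):

from vllm import LLM

# Hypothetical comparison knob, assuming the argument exists in your vLLM
# version: disabling the custom all-reduce kernel makes the engine fall back
# to the torch.distributed.all_reduce path discussed in this issue.
llm = LLM(
    model="mistralai/Mixtral-8x7B-Instruct-v0.1",
    tensor_parallel_size=8,
    disable_custom_all_reduce=True,
)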

pcmoritz avatar Feb 05 '24 22:02 pcmoritz