
[Bug]: Ray memory leak

Open saattrupdan opened this issue 2 months ago • 5 comments

Your current environment

PyTorch version: 2.1.2+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.6 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
Clang version: Could not collect
CMake version: version 3.29.2
Libc version: glibc-2.31

Python version: 3.11.3 (main, Apr 19 2024, 17:22:27) [GCC 9.4.0] (64-bit runtime)
Python platform: Linux-5.4.0-177-generic-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: 10.1.243
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA A40
GPU 1: NVIDIA A40
GPU 2: NVIDIA A40

Nvidia driver version: 535.161.08
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Byte Order:                         Little Endian
Address sizes:                      40 bits physical, 48 bits virtual
CPU(s):                             64
On-line CPU(s) list:                0-63
Thread(s) per core:                 1
Core(s) per socket:                 1
Socket(s):                          64
NUMA node(s):                       1
Vendor ID:                          AuthenticAMD
CPU family:                         25
Model:                              1
Model name:                         AMD EPYC-Milan Processor
Stepping:                           1
CPU MHz:                            2994.374
BogoMIPS:                           5988.74
Virtualization:                     AMD-V
Hypervisor vendor:                  KVM
Virtualization type:                full
L1d cache:                          2 MiB
L1i cache:                          2 MiB
L2 cache:                           32 MiB
L3 cache:                           2 GiB
NUMA node0 CPU(s):                  0-63
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit:        Not affected
Vulnerability L1tf:                 Not affected
Vulnerability Mds:                  Not affected
Vulnerability Meltdown:             Not affected
Vulnerability Mmio stale data:      Not affected
Vulnerability Retbleed:             Not affected
Vulnerability Spec store bypass:    Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:           Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:           Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP disabled, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds:                Not affected
Vulnerability Tsx async abort:      Not affected
Flags:                              fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm rep_good nopl cpuid extd_apicid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy svm cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd ibrs ibpb stibp vmmcall fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr wbnoinvd arat npt nrip_save umip vaes vpclmulqdq rdpid arch_capabilities

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-nccl-cu12==2.18.1
[pip3] torch==2.1.2
[pip3] triton==2.1.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.4.0.post1
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0	GPU1	GPU2	CPU Affinity	NUMA Affinity	GPU NUMA ID
GPU0	 X 	PHB	PHB	0-63	0		N/A
GPU1	PHB	 X 	PHB	0-63	0		N/A
GPU2	PHB	PHB	 X 	0-63	0		N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

🐛 Describe the bug

I'm using vLLM with several models within the same Python session (one model at a time) on a multi-GPU setup. After each model run I need to clear the GPU memory to leave room for the next model, which means (among other things) shutting down the Ray cluster via ray.shutdown(). The problem is that this only clears the GPU memory on one of the GPUs.

Minimal example:

from vllm import LLM
import ray

# Instantiate first model
llm = LLM("mhenrichsen/danskgpt-tiny", tensor_parallel_size=2)

# Destroy Ray cluster; this only clears the GPU memory on one of the GPUs
# Note that adding any combination of `torch.cuda.empty_cache()`, 
# `gc.collect()` or `destroy_model_parallel()` doesn't help here
ray.shutdown()

# Instantiate second model; this now causes OOM errors
llm = LLM("mhenrichsen/danskgpt-tiny-chat", tensor_parallel_size=2)

This is a known Ray issue, and the solution mentioned both in that issue and in the official Ray docs is to pass max_calls=1 to ray.remote, which supposedly fixes it. In vLLM, ray.remote is used in these lines in the vllm.executor.ray_gpu_executor module and in these lines in the vllm.engine.async_llm_engine module. However, in both places the decorator wraps a class (an "actor" in Ray speak), where the max_calls argument is not allowed, so I'm not sure this solution applies here.
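
For illustration, a minimal sketch of that distinction, assuming a hypothetical gpu_task function and GpuWorker actor (this is not vLLM code): max_calls applies to remote functions, whose worker process exits after the given number of calls, but it cannot be passed when the decorator wraps an actor class.

import ray

ray.init()

# The documented workaround: a remote *function* with max_calls=1 gets a fresh
# worker process per call, so GPU state is released when that worker exits.
@ray.remote(num_gpus=1, max_calls=1)
def gpu_task():
    ...

# vLLM instead decorates a worker *class* (a Ray actor); actors do not accept
# max_calls, so the same workaround cannot be applied directly.
@ray.remote(num_gpus=1)  # adding max_calls=1 here would be rejected by Ray
class GpuWorker:
    def run(self):
        ...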

saattrupdan avatar Apr 21 '24 14:04 saattrupdan

The same situation occurs during multi-GPU deployment.

dongxiaolong avatar Apr 24 '24 02:04 dongxiaolong

Actually, I think this is not a Ray issue, but comes from CUDA itself (CUDA's cache is not cleaned up). Can you try calling this before the new initialization?

import gc
import ray
import torch

# ... initialise and use the first LLM ...
ray.shutdown()            # stops the Ray worker processes
torch.cuda.empty_cache()  # releases cached CUDA memory held by the driver process
gc.collect()              # drops any lingering Python references
# ... now initialise the second LLM ...

When you call ray.shutdown, it kills the Ray worker processes that were occupying the GPUs, so their memory is cleaned up. But in vLLM the driver (your Python script) is also using the first GPU, and its memory is not released automatically unless you call these two APIs.
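
A minimal sketch of that full sequence, with a del llm and an allocator-stats check added to verify whether the driver GPU was actually freed (the model name is just the one from the report above):

import gc

import ray
import torch
from vllm import LLM

llm = LLM("mhenrichsen/danskgpt-tiny", tensor_parallel_size=2)

# Drop the driver-side reference before asking the allocator to release memory.
del llm
ray.shutdown()
gc.collect()
torch.cuda.empty_cache()

# If the driver GPU was freed, allocated/reserved memory on device 0 should be ~0.
print(torch.cuda.memory_allocated(0), torch.cuda.memory_reserved(0))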

rkooo567 avatar May 03 '24 08:05 rkooo567

Actually, I think this is not a Ray issue, but comes from CUDA itself (CUDA's cache is not cleaned up). Can you try calling this before the new initialization?

import gc
import ray
import torch

# ... initialise and use the first LLM ...
ray.shutdown()            # stops the Ray worker processes
torch.cuda.empty_cache()  # releases cached CUDA memory held by the driver process
gc.collect()              # drops any lingering Python references
# ... now initialise the second LLM ...

When you call ray.shutdown, it kills the Ray worker processes that were occupying the GPUs, so their memory is cleaned up. But in vLLM the driver (your Python script) is also using the first GPU, and its memory is not released automatically unless you call these two APIs.

Hi @rkooo567. I just tried your solution, and unfortunately it still doesn't clear the GPU memory from the first GPU.

saattrupdan avatar May 03 '24 09:05 saattrupdan

I see. Let me try reproducing it very soon. One more thing: can you try running it with tensor_parallel_size=1 and see if the cleanup happens (with and without empty_cache)?

rkooo567 avatar May 03 '24 15:05 rkooo567

One more thing, can you try running it with tensor_parallel_size=1, and see if the cleanup is happening?

In this case it's all good: the GPU memory is fully reset, and re-initialising the LLM instance isn't an issue.

saattrupdan avatar May 04 '24 11:05 saattrupdan