[Bug]: (OOM) Found two places that cause a significant increase in GPU memory usage (probably leading to a memory leak)
Your current environment
Collecting environment information...
PyTorch version: 2.3.1+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.6 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
Clang version: Could not collect
CMake version: version 3.30.1
Libc version: glibc-2.31

Python version: 3.11.9 | packaged by conda-forge | (main, Apr 19 2024, 18:36:13) [GCC 12.3.0] (64-bit runtime)
Python platform: Linux-5.4.0-190-generic-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: 12.4.99
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA L40
GPU 1: NVIDIA L40
GPU 2: NVIDIA L40
GPU 3: NVIDIA L40
GPU 4: NVIDIA L40

Nvidia driver version: 550.54.14
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.9.2.1
/usr/lib/x86_64-linux-gnu/libcudnn_adv.so.9.2.1
/usr/lib/x86_64-linux-gnu/libcudnn_cnn.so.9.2.1
/usr/lib/x86_64-linux-gnu/libcudnn_engines_precompiled.so.9.2.1
/usr/lib/x86_64-linux-gnu/libcudnn_engines_runtime_compiled.so.9.2.1
/usr/lib/x86_64-linux-gnu/libcudnn_graph.so.9.2.1
/usr/lib/x86_64-linux-gnu/libcudnn_heuristic.so.9.2.1
/usr/lib/x86_64-linux-gnu/libcudnn_ops.so.9.2.1
/usr/local/cuda-12.2/targets/x86_64-linux/lib/libcudnn.so.8
/usr/local/cuda-12.2/targets/x86_64-linux/lib/libcudnn_adv_infer.so.8
/usr/local/cuda-12.2/targets/x86_64-linux/lib/libcudnn_adv_train.so.8
/usr/local/cuda-12.2/targets/x86_64-linux/lib/libcudnn_cnn_infer.so.8
/usr/local/cuda-12.2/targets/x86_64-linux/lib/libcudnn_cnn_train.so.8
/usr/local/cuda-12.2/targets/x86_64-linux/lib/libcudnn_ops_infer.so.8
/usr/local/cuda-12.2/targets/x86_64-linux/lib/libcudnn_ops_train.so.8
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
CPU: Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Byte Order: Little Endian Address sizes: 48 bits physical, 48 bits virtual CPU(s): 112 On-line CPU(s) list: 0-111 Thread(s) per core: 2 Core(s) per socket: 28 Socket(s): 2 NUMA node(s): 2 Vendor ID: AuthenticAMD CPU family: 25 Model: 1 Model name: AMD EPYC 7453 28-Core Processor Stepping: 1 Frequency boost: enabled CPU MHz: 1496.936 CPU max MHz: 2750.0000 CPU min MHz: 1500.0000 BogoMIPS: 5489.55 Virtualization: AMD-V L1d cache: 1.8 MiB L1i cache: 1.8 MiB L2 cache: 28 MiB L3 cache: 128 MiB NUMA node0 CPU(s): 0-27,56-83 NUMA node1 CPU(s): 28-55,84-111 Vulnerability Gather data sampling: Not affected Vulnerability Itlb multihit: Not affected Vulnerability L1tf: Not affected Vulnerability Mds: Not affected Vulnerability Meltdown: Not affected Vulnerability Mmio stale data: Not affected Vulnerability Retbleed: Not affected Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization Vulnerability Spectre v2: Mitigation; Retpolines; IBPB conditional; IBRS_FW; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected Vulnerability Srbds: Not affected Vulnerability Tsx async abort: Not affected Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 invpcid_single hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr wbnoinvd arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold v_vmsave_vmload vgif umip pku ospke vaes vpclmulqdq rdpid overflow_recov succor smca
Versions of relevant libraries: [pip3] flashinfer==0.1.1+cu121torch2.3 [pip3] numpy==1.26.4 [pip3] nvidia-cublas-cu12==12.1.3.1 [pip3] nvidia-cuda-cupti-cu12==12.1.105 [pip3] nvidia-cuda-nvrtc-cu12==12.1.105 [pip3] nvidia-cuda-runtime-cu12==12.1.105 [pip3] nvidia-cudnn-cu12==8.9.2.26 [pip3] nvidia-cufft-cu12==11.0.2.54 [pip3] nvidia-curand-cu12==10.3.2.106 [pip3] nvidia-cusolver-cu12==11.4.5.107 [pip3] nvidia-cusparse-cu12==12.1.0.106 [pip3] nvidia-ml-py==12.555.43 [pip3] nvidia-nccl-cu12==2.20.5 [pip3] nvidia-nvjitlink-cu12==12.5.82 [pip3] nvidia-nvtx-cu12==12.1.105 [pip3] pyzmq==26.0.3 [pip3] torch==2.3.1 [pip3] torchaudio==2.3.1 [pip3] torchvision==0.18.1 [pip3] transformers==4.43.3 [pip3] triton==2.3.1 [pip3] zmq==0.0.0 [conda] flashinfer 0.1.1+cu121torch2.3 pypi_0 pypi [conda] numpy 1.26.4 pypi_0 pypi [conda] nvidia-cublas-cu12 12.1.3.1 pypi_0 pypi [conda] nvidia-cuda-cupti-cu12 12.1.105 pypi_0 pypi [conda] nvidia-cuda-nvrtc-cu12 12.1.105 pypi_0 pypi [conda] nvidia-cuda-runtime-cu12 12.1.105 pypi_0 pypi [conda] nvidia-cudnn-cu12 8.9.2.26 pypi_0 pypi [conda] nvidia-cufft-cu12 11.0.2.54 pypi_0 pypi [conda] nvidia-curand-cu12 10.3.2.106 pypi_0 pypi [conda] nvidia-cusolver-cu12 11.4.5.107 pypi_0 pypi [conda] nvidia-cusparse-cu12 12.1.0.106 pypi_0 pypi [conda] nvidia-ml-py 12.555.43 pypi_0 pypi [conda] nvidia-nccl-cu12 2.20.5 pypi_0 pypi [conda] nvidia-nvjitlink-cu12 12.5.82 pypi_0 pypi [conda] nvidia-nvtx-cu12 12.1.105 pypi_0 pypi [conda] pyzmq 26.0.3 py311h08a0b41_0 https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge [conda] torch 2.3.1 pypi_0 pypi [conda] torchaudio 2.3.1 pypi_0 pypi [conda] torchvision 0.18.1 pypi_0 pypi [conda] transformers 4.43.3 pypi_0 pypi [conda] triton 2.3.1 pypi_0 pypi [conda] zmq 0.0.0 pypi_0 pypi ROCM Version: Could not collect Neuron SDK Version: N/A vLLM Version: 0.5.3.post1@3eeb148f467e3619e8890b1a5ebe86a173f91bc9 vLLM Build Flags: CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled GPU Topology: GPU0 GPU1 GPU2 GPU3 GPU4 CPU Affinity NUMA Affinity GPU NUMA ID GPU0 X SYS SYS SYS SYS 0-27,56-83 0 N/A GPU1 SYS X SYS SYS SYS 0-27,56-83 0 N/A GPU2 SYS SYS X SYS SYS 0-27,56-83 0 N/A GPU3 SYS SYS SYS X SYS 28-55,84-111 1 N/A GPU4 SYS SYS SYS SYS X 28-55,84-111 1 N/A
Legend:
X    = Self
SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX  = Connection traversing at most a single PCIe bridge
NV#  = Connection traversing a bonded set of # NVLinks
🐛 Describe the bug
First, in `vllm/model_executor/layers/sampler.py`:
```python
class Sampler(nn.Module):
    ...
    def forward(...):
        ...
        # Two new float32 tensors, each of shape (num_tokens, vocab_size),
        # are allocated here on top of the original logits.
        probs = torch.softmax(logits, dim=-1, dtype=torch.float)
        logprobs = torch.log_softmax(logits, dim=-1, dtype=torch.float)
        ...
```
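To see why these two lines are so costly, here is a back-of-the-envelope estimate (assumed numbers, not from the vLLM code: Qwen2's padded vocabulary is about 152k entries, and with `prompt_logprobs` requested the logits cover every prompt position, here up to `max_model_len=2048`):

```python
# Size of ONE float32 tensor of shape (num_tokens, vocab_size).
# Assumed values: vocab_size ~152k (Qwen2), num_tokens = 2048 prompt positions.
num_tokens = 2048
vocab_size = 152_064
gib = num_tokens * vocab_size * 4 / 1024**3  # float32 = 4 bytes per element
print(f"probs or logprobs: {gib:.2f} GiB each, {2 * gib:.2f} GiB for the pair")
# -> roughly 1.2 GiB each, ~2.3 GiB for the pair, in addition to the original logits
```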
Next, also in `vllm/model_executor/layers/sampler.py`:
```python
def _get_ranks(...):
    ...
    return result.sum(1).add(1)  # The memory increase is caused by Tensor.sum()
```
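For reference, here is a minimal standalone sketch (not vLLM code) of how I observe this kind of allocation with the `torch.cuda` memory statistics. The shapes and the way `result` is built are my assumptions about what happens upstream of the quoted line:

```python
import torch

device = "cuda"
num_tokens, vocab_size = 2048, 152_064  # assumed shapes, matching the estimate above

logprobs = torch.randn(num_tokens, vocab_size, device=device)
chosen = torch.randint(vocab_size, (num_tokens,), device=device)

torch.cuda.reset_peak_memory_stats()
before = torch.cuda.memory_allocated()

# Assumption: `result` is a (num_tokens, vocab_size) comparison mask that has to
# be materialized before the row-wise sum reduces it to a (num_tokens,) vector.
vals = logprobs[torch.arange(num_tokens, device=device), chosen]
ranks = (logprobs > vals[:, None]).sum(1).add(1)

after = torch.cuda.memory_allocated()
peak = torch.cuda.max_memory_allocated()
print(f"retained: {(after - before) / 1024**2:.1f} MiB, "
      f"peak during the op: {(peak - before) / 1024**2:.1f} MiB")
```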
Both operations allocate a large amount of GPU memory (for Qwen2-7B, roughly 4 GB in a single inference), and some of it is not released afterwards. This leads to OOM when I run Qwen2-7B on a single RTX 6000 Ada GPU with the following setup:
```python
from vllm import LLM, SamplingParams

llm = LLM(
    'Qwen/Qwen2-7B',
    gpu_memory_utilization=0.8,
    tensor_parallel_size=1,
    max_model_len=2048,
)
sampling_params = SamplingParams(
    temperature=0,
    prompt_logprobs=0,
    max_tokens=1,
)
```
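To watch the usage grow, I call `generate` repeatedly and print the CUDA memory statistics between calls. The loop and the placeholder prompt below are only a sketch of that check, not part of the original setup; any prompt close to `max_model_len` should do:

```python
import torch

# Placeholder long prompt; adjust the repetition count so it stays under max_model_len.
prompt = "hello " * 1000

for i in range(5):
    llm.generate([prompt], sampling_params)
    free, total = torch.cuda.mem_get_info()
    print(f"iter {i}: allocated={torch.cuda.memory_allocated() / 1024**3:.2f} GiB, "
          f"free={free / 1024**3:.2f} GiB of {total / 1024**3:.2f} GiB")
```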
Before submitting a new issue...
- [X] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.