
[Bug]: FP8 Marlin fallback out of memory regression

Open cduk opened this issue 1 year ago • 3 comments

Your current environment

Collecting environment information...
PyTorch version: 2.3.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.4 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.28.3
Libc version: glibc-2.35

Python version: 3.12.1 | packaged by Anaconda, Inc. | (main, Jan 19 2024, 15:51:05) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.15.0-118-generic-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 12.1.105
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 3090 Ti
Nvidia driver version: 550.90.07
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 40 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 6
On-line CPU(s) list: 0-5
Vendor ID: AuthenticAMD
Model name: QEMU Virtual CPU version 2.5+
CPU family: 15
Model: 107
Thread(s) per core: 1
Core(s) per socket: 6
Socket(s): 1
Stepping: 1
BogoMIPS: 6986.87
Flags: fpu de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx lm rep_good nopl cpuid extd_apicid tsc_known_freq pni ssse3 cx16 sse4_1 sse4_2 x2apic popcnt aes hypervisor lahf_lm cmp_legacy 3dnowprefetch vmmcall
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 384 KiB (6 instances)
L1i cache: 384 KiB (6 instances)
L2 cache: 3 MiB (6 instances)
L3 cache: 96 MiB (6 instances)
NUMA node(s): 1
NUMA node0 CPU(s): 0-5
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass: Not affected
Vulnerability Spectre v1: Vulnerable: __user pointer sanitization and usercopy barriers only; no swapgs barriers
Vulnerability Spectre v2: Vulnerable; STIBP: disabled; PBRSB-eIBRS: Not affected; BHI: Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected

Versions of relevant libraries:
[pip3] mypy==1.5.1
[pip3] mypy-extensions==1.0.0
[pip3] numpy==1.26.4
[pip3] nvidia-cublas-cu12==12.1.3.1
[pip3] nvidia-cuda-cupti-cu12==12.1.105
[pip3] nvidia-cuda-nvrtc-cu12==12.1.105
[pip3] nvidia-cuda-runtime-cu12==12.1.105
[pip3] nvidia-cudnn-cu12==8.9.2.26
[pip3] nvidia-cufft-cu12==11.0.2.54
[pip3] nvidia-curand-cu12==10.3.2.106
[pip3] nvidia-cusolver-cu12==11.4.5.107
[pip3] nvidia-cusparse-cu12==12.1.0.106
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] nvidia-nvjitlink-cu12==12.4.127
[pip3] nvidia-nvtx-cu12==12.1.105
[pip3] pyzmq==26.0.2
[pip3] torch==2.3.0
[pip3] transformers==4.42.3
[conda] numpy 1.26.4 pypi_0 pypi
[conda] nvidia-cublas-cu12 12.1.3.1 pypi_0 pypi
[conda] nvidia-cuda-cupti-cu12 12.1.105 pypi_0 pypi
[conda] nvidia-cuda-nvrtc-cu12 12.1.105 pypi_0 pypi
[conda] nvidia-cuda-runtime-cu12 12.1.105 pypi_0 pypi
[conda] nvidia-cudnn-cu12 8.9.2.26 pypi_0 pypi
[conda] nvidia-cufft-cu12 11.0.2.54 pypi_0 pypi
[conda] nvidia-curand-cu12 10.3.2.106 pypi_0 pypi
[conda] nvidia-cusolver-cu12 11.4.5.107 pypi_0 pypi
[conda] nvidia-cusparse-cu12 12.1.0.106 pypi_0 pypi
[conda] nvidia-nccl-cu12 2.20.5 pypi_0 pypi
[conda] nvidia-nvjitlink-cu12 12.4.127 pypi_0 pypi
[conda] nvidia-nvtx-cu12 12.1.105 pypi_0 pypi
[conda] pyzmq 26.0.2 pypi_0 pypi
[conda] torch 2.3.0 pypi_0 pypi
[conda] transformers 4.42.3 pypi_0 pypi
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: N/A
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
        GPU0    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0    X       0-5             0               N/A

Legend:

X    = Self
SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX  = Connection traversing at most a single PCIe bridge
NV#  = Connection traversing a bonded set of # NVLinks

🐛 Describe the bug

A regression was introduced at some point between roughly 4 weeks ago and 1 week ago.

I am running the Mistral Nemo model with FP8 quantization using this command:

--model mistralai/Mistral-Nemo-Instruct-2407 --max-model-len 8192 --gpu-memory-utilization 0.7 --quantization fp8 --enable-prefix-caching --enforce-eager
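
For reference, passing the same engine arguments through the offline Python API should exercise the same weight-loading path. This is only a sketch of an equivalent invocation, not the exact launch script used here (the server was started with the command-line flags above):

```python
from vllm import LLM

# Sketch only: same engine arguments as the command line above,
# expressed via the offline API so the weight-loading/repacking
# path can be reproduced without the OpenAI-compatible server.
llm = LLM(
    model="mistralai/Mistral-Nemo-Instruct-2407",
    max_model_len=8192,
    gpu_memory_utilization=0.7,
    quantization="fp8",
    enable_prefix_caching=True,
    enforce_eager=True,
)
```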

Launched this way, it previously worked using only around 16 GB of VRAM.

However, there has since been a regression that leads to an out-of-memory error even with 24 GB:

torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 140.00 MiB. GPU 0 has a total capacity of 23.67 GiB of which 37.88 MiB is free. Process 258755 has 23.63 GiB memory in use. Of the allocated memory 23.28 GiB is allocated by PyTorch, and 51.00 MiB is reserved by PyTorch but unallocated.
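
As a side note, the allocator hint printed at the end of the full traceback below (PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True) only reduces fragmentation, so it is not expected to fix the regression itself. A minimal sketch of how it would be applied:

```python
import os

# Assumption: this must be set before torch initializes its CUDA caching
# allocator (i.e. before importing torch / starting the engine).
# It only mitigates fragmentation; it does not address the regression.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
```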

It appears to happen during Marlin weight re-packing:

    marlin_qweight = ops.gptq_marlin_repack(b_q_weight=pack_fp8_to_int32(
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/quantization/utils/marlin_utils_fp8.py"

Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/rpc/server.py", line 230, in run_rpc_server
    server = AsyncEngineRPCServer(async_engine_args, usage_context, rpc_path)
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/rpc/server.py", line 31, in __init__
    self.engine = AsyncLLMEngine.from_engine_args(
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 740, in from_engine_args
    engine = cls(
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 636, in __init__
    self.engine = self._init_engine(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 840, in _init_engine
    return engine_class(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 272, in __init__
    super().__init__(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 276, in __init__
    self.model_executor = executor_class(
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/executor_base.py", line 46, in __init__
    self._init_executor()
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/gpu_executor.py", line 39, in _init_executor
    self.driver_worker.load_model()
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 182, in load_model
    self.model_runner.load_model()
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 880, in load_model
    self.model = get_model(model_config=self.model_config,
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/__init__.py", line 19, in get_model
    return loader.load_model(model_config=model_config,
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/loader.py", line 361, in load_model
    quant_method.process_weights_after_loading(module)
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/quantization/fp8.py", line 239, in process_weights_after_loading
    prepare_fp8_layer_for_marlin(layer)
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/quantization/utils/marlin_utils_fp8.py", line 67, in prepare_fp8_layer_for_marlin
    marlin_qweight = ops.gptq_marlin_repack(b_q_weight=pack_fp8_to_int32(
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/quantization/utils/marlin_utils_fp8.py", line 102, in pack_fp8_to_int32
    (byte_tensor[:, 1].to(torch.int32) << 8) |
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 140.00 MiB. GPU 0 has a total capacity of 23.67 GiB of which 37.88 MiB is free. Process 258755 has 23.63 GiB memory in use. Of the allocated memory 23.28 GiB is allocated by PyTorch, and 51.00 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
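
For context, pack_fp8_to_int32 packs groups of four FP8 bytes into int32 words before calling the Marlin repack kernel. A minimal sketch of that idea (inferred only from the single line visible in the traceback, not the actual vLLM implementation) shows why the step needs extra GPU memory on top of the FP8 weights that are already loaded:

```python
import torch

def pack_fp8_to_int32_sketch(fp8_weight: torch.Tensor) -> torch.Tensor:
    """Illustrative only: pack groups of 4 FP8 bytes into int32 words.

    Each .to(torch.int32) below materializes an intermediate tensor roughly
    the size of the whole FP8 weight, so peak memory during the Marlin
    fallback repack is noticeably higher than the final packed result.
    """
    assert fp8_weight.dtype == torch.float8_e4m3fn
    assert fp8_weight.shape[0] % 4 == 0

    # Reinterpret the FP8 storage as raw bytes (both are 1 byte per element).
    byte_tensor = fp8_weight.view(torch.uint8)
    packed = torch.zeros(
        fp8_weight.shape[0] // 4,
        fp8_weight.shape[1],
        dtype=torch.int32,
        device=fp8_weight.device,
    )
    for i in range(4):
        # Each widening cast allocates a new intermediate on the GPU.
        packed |= byte_tensor[i::4, :].to(torch.int32) << (i * 8)
    return packed
```

If the fallback now runs at a point where more memory has already been reserved than it used to be, even these relatively small intermediates can push the process over the 24 GB limit.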

cduk · Aug 22 '24 19:08