vllm
[Bug]: VLLM OOMing unpredictably on prediction
Your current environment
Collecting environment information... PyTorch version: N/A Is debug build: N/A CUDA used to build PyTorch: N/A ROCM used to build PyTorch: N/A
OS: Ubuntu 20.04.6 LTS (x86_64) GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0 Clang version: Could not collect CMake version: version 3.16.3 Libc version: glibc-2.31
Python version: 3.10.9 | packaged by conda-forge | (main, Feb 2 2023, 20:20:04) [GCC 11.3.0] (64-bit runtime) Python platform: Linux-5.15.0-1051-aws-x86_64-with-glibc2.31 Is CUDA available: N/A CUDA runtime version: 12.1.105 CUDA_MODULE_LOADING set to: N/A GPU models and configuration: GPU 0: NVIDIA A10G GPU 1: NVIDIA A10G GPU 2: NVIDIA A10G GPU 3: NVIDIA A10G GPU 4: NVIDIA A10G GPU 5: NVIDIA A10G GPU 6: NVIDIA A10G GPU 7: NVIDIA A10G
Nvidia driver version: 535.104.12 cuDNN version: Could not collect HIP runtime version: N/A MIOpen runtime version: N/A Is XNNPACK available: N/A
CPU: Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Byte Order: Little Endian Address sizes: 48 bits physical, 48 bits virtual CPU(s): 192 On-line CPU(s) list: 0-191 Thread(s) per core: 2 Core(s) per socket: 48 Socket(s): 2 NUMA node(s): 2 Vendor ID: AuthenticAMD CPU family: 23 Model: 49 Model name: AMD EPYC 7R32 Stepping: 0 CPU MHz: 3258.700 BogoMIPS: 5599.97 Hypervisor vendor: KVM Virtualization type: full L1d cache: 3 MiB L1i cache: 3 MiB L2 cache: 48 MiB L3 cache: 384 MiB NUMA node0 CPU(s): 0-47,96-143 NUMA node1 CPU(s): 48-95,144-191 Vulnerability Gather data sampling: Not affected Vulnerability Itlb multihit: Not affected Vulnerability L1tf: Not affected Vulnerability Mds: Not affected Vulnerability Meltdown: Not affected Vulnerability Mmio stale data: Not affected Vulnerability Retbleed: Mitigation; untrained return thunk; SMT enabled with STIBP protection Vulnerability Spec rstack overflow: Mitigation; safe RET Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization Vulnerability Spectre v2: Mitigation; Retpolines, IBPB conditional, STIBP always-on, RSB filling, PBRSB-eIBRS Not affected Vulnerability Srbds: Not affected Vulnerability Tsx async abort: Not affected Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext perfctr_core ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid
Versions of relevant libraries:
[pip3] No relevant packages
[conda] No relevant packages
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: N/A
vLLM Build Flags: CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
      GPU0  GPU1  GPU2  GPU3  GPU4  GPU5  GPU6  GPU7  CPU Affinity  NUMA Affinity  GPU NUMA ID
GPU0   X    PHB   PHB   PHB   PHB   PHB   PHB   PHB   0-191         0-1           N/A
GPU1  PHB    X    PHB   PHB   PHB   PHB   PHB   PHB   0-191         0-1           N/A
GPU2  PHB   PHB    X    PHB   PHB   PHB   PHB   PHB   0-191         0-1           N/A
GPU3  PHB   PHB   PHB    X    PHB   PHB   PHB   PHB   0-191         0-1           N/A
GPU4  PHB   PHB   PHB   PHB    X    PHB   PHB   PHB   0-191         0-1           N/A
GPU5  PHB   PHB   PHB   PHB   PHB    X    PHB   PHB   0-191         0-1           N/A
GPU6  PHB   PHB   PHB   PHB   PHB   PHB    X    PHB   0-191         0-1           N/A
GPU7  PHB   PHB   PHB   PHB   PHB   PHB   PHB    X    0-191         0-1           N/A
Legend:
X    = Self
SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX  = Connection traversing at most a single PCIe bridge
NV#  = Connection traversing a bonded set of # NVLinks
🐛 Describe the bug
Environment: AWS g5.48xlarge EC2 instance, running vLLM 0.3.0 with Python 3.10.9 (details in the environment section above).
Here's the (simplified) code I'm using to load and query the model (in this case, Mixtral-8x7B):
from vllm import LLM, SamplingParams

model = LLM(
    tensor_parallel_size=8,
    model='mistralai_Mixtral-8x7B-Instruct-v0.1',
    dtype='bfloat16',
    quantization=None,
)

vllm_params = SamplingParams(use_beam_search=False, max_tokens=max_tokens, ignore_eos=True)  # ignore_eos makes it so max_tokens are actually generated
model_batch_output = model.generate(prompts=batch_input, sampling_params=vllm_params)
Typically, when it comes to LLMs, if you're able to load your model into VRAM and run GPU-based predictions for a given batch size and a given context length (input_tokens + output_tokens), you can pretty much guarantee that the same model in the same environment won't OOM if you decrease any of those 3 values (batch size, num-input-tokens, num-output-tokens). This is great, because you can figure out your model's limits on a given machine, and then configure your model to stay just under those limits in production.
However, that hasn't been exactly my experience with vLLM, presumably because of the complications introduced by its great features (smart KV caching, continuous batching, etc.).
Here's an example: I have a program that runs the above code over a variety of batch sizes, num-input-tokens, and num-output-tokens (a rough sketch of that sweep is included after the examples below). I initially assumed that if something OOMed at prediction time for some tuple of those parameters, it would also OOM at prediction time if any value in that tuple increased. However, I've found this isn't always the case:
e.g. here, we have:
input-num-tokens=8000 (prompt_len)
output-num-tokens=8000 (output_len)
batch_size=1 and 2 succeed, then it OOMs at 4
However, here (same environment, same code, from the very same program run), we have:
input-num-tokens=9000 (prompt_len)
output-num-tokens=9000 (output_len)
batch_size=1, 2, 4, 8, and 16 all succeed; it doesn't (!) OOM, even though memory requirements have presumably increased
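For reference, here's a minimal sketch of what that sweep program does (not my exact code: the make_prompt helper is just an illustrative stand-in, and I'm assuming the OOM surfaces as a catchable torch.cuda.OutOfMemoryError in the driver process, whereas with tensor-parallel workers it may instead just crash):

# Minimal sketch of the OOM sweep described above.
import torch
from vllm import LLM, SamplingParams

model = LLM(
    tensor_parallel_size=8,
    model='mistralai_Mixtral-8x7B-Instruct-v0.1',
    dtype='bfloat16',
)

def make_prompt(num_tokens: int) -> str:
    # Crude stand-in for building a prompt of roughly num_tokens tokens.
    return 'hello ' * num_tokens

for prompt_len, output_len in [(8000, 8000), (9000, 9000)]:
    for batch_size in [1, 2, 4, 8, 16]:
        vllm_params = SamplingParams(use_beam_search=False, max_tokens=output_len, ignore_eos=True)
        batch_input = [make_prompt(prompt_len)] * batch_size
        try:
            model.generate(prompts=batch_input, sampling_params=vllm_params)
            print(f'OK:  prompt_len={prompt_len} output_len={output_len} batch_size={batch_size}')
        except torch.cuda.OutOfMemoryError:
            print(f'OOM: prompt_len={prompt_len} output_len={output_len} batch_size={batch_size}')
            break  # stop growing the batch for this (prompt_len, output_len) pair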
--> Therefore, some questions:
- Is this kind of experience a known issue?
- Relatedly, are there best practices when it comes to vLLM's GPU parameters? There's gpu_memory_utilization, which seems like it would reduce OOMs when set to 100%, though the vLLM documentation suggests that setting it too high may cause OOMs. This feels very confusing to me! How can allowing vLLM to use more GPU memory cause OOMs, unless the leftover GPU memory is reserved for times when vLLM needs extra / 'unexpected' memory for whatever reason? --> Is that the case, and is there any formula or explanation to help someone figure out when that kind of situation can happen? e.g. if you set gpu_memory_utilization to 50% like Triton does (a small example of what I mean is below), does that make some sort of special guarantee? What is the best method for figuring out how to guarantee that a vLLM server will never OOM? There's very limited explanation of how these kinds of parameters affect things under the hood; it would be fantastic to get a better understanding of this in order to be confident using vLLM.
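For concreteness, the setting I'm asking about is just the constructor argument, something like the following (the 0.90 value is only illustrative; I believe it's the default):

from vllm import LLM

# Illustrative only: caps how much of each GPU's memory vLLM pre-allocates
# (weights + activations + KV-cache pool). Lower values leave more headroom
# but shrink the KV cache, so long prompts/outputs hit capacity sooner.
model = LLM(
    tensor_parallel_size=8,
    model='mistralai_Mixtral-8x7B-Instruct-v0.1',
    dtype='bfloat16',
    gpu_memory_utilization=0.90,  # e.g. 0.50 would mimic the Triton-style split mentioned above
)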
Thank you!
Hi @hillarysanders, are you able to try a newer release of vLLM like 0.3.3? I ask because 0.3.0 is about two months old at this point, and there have been many improvements since then, especially for MoE models.
@mgoin good point, OK I bumped up to 0.3.3, and it did actually improve things a lot (!), although it didn't get rid of all the hard-to-predict OOMs. Here's a summary plot of before (0.3.0) and after (0.3.3) -
Batch sizes much larger than should be possible with a standard (static) batch are now (usually) working, indicating that continuous batching is mostly working properly again (perhaps because of the memory leak bugfix in 0.3.1? But I was seeing the same program hang in 0.3.1, so I'm not sure whether those kinds of issues have also been resolved 🤔).
--> Any idea how I can try to debug why the program is still OOMing unpredictably (and possibly hanging, though I haven't seen that yet in 0.3.3)? (i.e. the orange dot in the second plot)