
[Bug]: VLLM hangs on prediction when preceded by other predictions

Open · hillarysanders opened this issue 11 months ago · 0 comments

Your current environment

Collecting environment information...
PyTorch version: N/A
Is debug build: N/A
CUDA used to build PyTorch: N/A
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.6 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
Clang version: Could not collect
CMake version: version 3.16.3
Libc version: glibc-2.31

Python version: 3.10.9 | packaged by conda-forge | (main, Feb 2 2023, 20:20:04) [GCC 11.3.0] (64-bit runtime)
Python platform: Linux-5.15.0-1051-aws-x86_64-with-glibc2.31
Is CUDA available: N/A
CUDA runtime version: 12.1.105
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration:
GPU 0: NVIDIA A10G
GPU 1: NVIDIA A10G
GPU 2: NVIDIA A10G
GPU 3: NVIDIA A10G
GPU 4: NVIDIA A10G
GPU 5: NVIDIA A10G
GPU 6: NVIDIA A10G
GPU 7: NVIDIA A10G

Nvidia driver version: 535.104.12
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: N/A

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 48 bits physical, 48 bits virtual
CPU(s): 192
On-line CPU(s) list: 0-191
Thread(s) per core: 2
Core(s) per socket: 48
Socket(s): 2
NUMA node(s): 2
Vendor ID: AuthenticAMD
CPU family: 23
Model: 49
Model name: AMD EPYC 7R32
Stepping: 0
CPU MHz: 3258.700
BogoMIPS: 5599.97
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 3 MiB
L1i cache: 3 MiB
L2 cache: 48 MiB
L3 cache: 384 MiB
NUMA node0 CPU(s): 0-47,96-143
NUMA node1 CPU(s): 48-95,144-191
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Retbleed: Mitigation; untrained return thunk; SMT enabled with STIBP protection
Vulnerability Spec rstack overflow: Mitigation; safe RET
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Retpolines, IBPB conditional, STIBP always-on, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext perfctr_core ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid

Versions of relevant libraries:
[pip3] No relevant packages
[conda] No relevant packages
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: N/A
vLLM Build Flags: CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
      GPU0  GPU1  GPU2  GPU3  GPU4  GPU5  GPU6  GPU7  CPU Affinity  NUMA Affinity  GPU NUMA ID
GPU0   X    PHB   PHB   PHB   PHB   PHB   PHB   PHB   0-191         0-1            N/A
GPU1  PHB    X    PHB   PHB   PHB   PHB   PHB   PHB   0-191         0-1            N/A
GPU2  PHB   PHB    X    PHB   PHB   PHB   PHB   PHB   0-191         0-1            N/A
GPU3  PHB   PHB   PHB    X    PHB   PHB   PHB   PHB   0-191         0-1            N/A
GPU4  PHB   PHB   PHB   PHB    X    PHB   PHB   PHB   0-191         0-1            N/A
GPU5  PHB   PHB   PHB   PHB   PHB    X    PHB   PHB   0-191         0-1            N/A
GPU6  PHB   PHB   PHB   PHB   PHB   PHB    X    PHB   0-191         0-1            N/A
GPU7  PHB   PHB   PHB   PHB   PHB   PHB   PHB    X    0-191         0-1            N/A

Legend:

X    = Self
SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX  = Connection traversing at most a single PCIe bridge
NV#  = Connection traversing a bonded set of # NVLinks

🐛 Describe the bug

Environment: AWS g5.48xlarge EC2 instance, running vLLM 0.3.0 with Python 3.10.9 (details in the environment section above).

Here's the (simplified) code I'm using to load and query the model (in this case, Mixtral-8x7B):

from vllm import LLM, SamplingParams

model = LLM(
    model='mistralai_Mixtral-8x7B-Instruct-v0.1',
    tensor_parallel_size=8,
    dtype='bfloat16',
    quantization=None,
)

# ignore_eos makes it so that max_tokens tokens are actually generated
vllm_params = SamplingParams(use_beam_search=False, max_tokens=max_tokens, ignore_eos=True)
# batch_input is a list of prompt strings; in the real code this call goes
# through self.model_pipeline.model.generate(...)
model_batch_output = model.generate(prompts=batch_input, sampling_params=vllm_params)

I've noticed an unfortunately difficult-to-reproduce bug that causes vLLM to hang for hours (not even OOM and crash, just hang 😬).

[screenshot of the output where vLLM hangs]

Specifically, it seems to happen after multiple OOMs have occurred, been caught (try/except) and passed over. For example, if I run the code above with a relatively small amount of data:

batch_size=1, input_num_tokens=3000, output_max_tokens=3000

it runs fine. However, if I first run some more memory-intensive queries that predictably fail (I do this for stress-testing; see the end-to-end sketch below), specifically:

batch_size=8,  input_num_tokens=2000, output_max_tokens=2000   # doesn't OOM
batch_size=16, input_num_tokens=2000, output_max_tokens=2000   # OOMs
batch_size=32, input_num_tokens=2000, output_max_tokens=2000   # OOMs
batch_size=64, input_num_tokens=2000, output_max_tokens=2000   # OOMs

and then run:

batch_size=1, input_num_tokens=3000, output_max_tokens=3000
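
For concreteness, here is a minimal sketch of that stress-test pattern. The make_prompt helper, the exception type caught, and the loop structure are simplified stand-ins for my real code, not an exact copy; the point is that OOMs from the large batches are caught and passed over, and the small follow-up request afterwards is where the hang shows up.

import torch
from vllm import LLM, SamplingParams

def make_prompt(n_tokens):
    # Rough stand-in for building an ~n_tokens prompt
    return "word " * n_tokens

def run(llm, batch_size, input_num_tokens, output_max_tokens):
    prompts = [make_prompt(input_num_tokens)] * batch_size
    params = SamplingParams(max_tokens=output_max_tokens, ignore_eos=True)
    return llm.generate(prompts=prompts, sampling_params=params)

llm = LLM(model='mistralai_Mixtral-8x7B-Instruct-v0.1',
          tensor_parallel_size=8, dtype='bfloat16')

for bs in (8, 16, 32, 64):                # all but the first of these OOM
    try:
        run(llm, bs, 2000, 2000)
    except torch.cuda.OutOfMemoryError:   # in practice the OOM may surface as a different exception type
        pass                              # OOM is caught and passed over

run(llm, 1, 3000, 3000)                   # this small call is where the hang occurs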

I've seen vLLM hang with the error above twice (I ran the same program and it hung in exactly the same place, so it does seem relatively deterministic, but it also depends on a long string of past calls and OOMs, since running just that single model call on its own causes no issues).

Has anyone experienced this kind of issue? I'm wary of relying on vLLM when I've spotted various issues relating to hanging, or inconsistent patterns in OOMing (at prediction time, not model-load time). Sometimes I've seen it OOM with a shorter context window and smaller batch size when a longer context window and larger batch size succeeded on the same machine, which is a bit bewildering (that's a different bug, though, so I won't go into details here).

Mostly: is this kind of (screenshotted) issue known? Does anyone know what could be happening, or can anyone recommend best practices for avoiding it? To me, a program that hangs is even more pernicious than an unexpected OOM 🤔.

Thank you!

hillarysanders · Mar 28 '24 19:03