[Bug]: Qwen72B service (TP=4) gets stuck after running N requests. GPU utilization is at 100% on 3 GPUs and at 0% on 1 GPU. Meanwhile, CPU utilization is at 100%, and many requests are in CLOSE_WAIT status.
Your current environment
```text
Collecting environment information...
PyTorch version: 2.1.2+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.3 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.22.1
Libc version: glibc-2.35

Python version: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-3.10.0-1160.108.1.el7.x86_64-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 12.1.66
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA A100-SXM4-80GB
GPU 1: NVIDIA A100-SXM4-80GB
GPU 2: NVIDIA A100-SXM4-80GB
GPU 3: NVIDIA A100-SXM4-80GB

Nvidia driver version: 535.129.03
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Address sizes:       46 bits physical, 48 bits virtual
Byte Order:          Little Endian
CPU(s):              128
On-line CPU(s) list: 0-127
Vendor ID:           GenuineIntel
Model name:          Intel(R) Xeon(R) Platinum 8369B CPU @ 2.90GHz
CPU family:          6
Model:               106
Thread(s) per core:  2
Core(s) per socket:  32
Socket(s):           2
Stepping:            6
BogoMIPS:            5799.99
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc eagerfpu pni pclmulqdq monitor ssse3 fma cx16 pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd rsb_ctxsw ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 arat avx512vbmi avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq spec_ctrl intel_stibp arch_capabilities
Hypervisor vendor:   KVM
Virtualization type: full
L1d cache:           3 MiB (64 instances)
L1i cache:           2 MiB (64 instances)
L2 cache:            80 MiB (64 instances)
L3 cache:            96 MiB (2 instances)
NUMA node(s):        2
NUMA node0 CPU(s):   0-63
NUMA node1 CPU(s):   64-127
Vulnerability Gather data sampling: Unknown: Dependent on hypervisor status
Vulnerability Itlb multihit:        Not affected
Vulnerability L1tf:                 Not affected
Vulnerability Mds:                  Not affected
Vulnerability Meltdown:             Not affected
Vulnerability Mmio stale data:      Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Vulnerability Retbleed:             Not affected
Vulnerability Spec store bypass:    Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:           Mitigation; Load fences, usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:           Mitigation; Full retpoline, IBPB
Vulnerability Srbds:                Not affected
Vulnerability Tsx async abort:      Not affected

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-cublas-cu12==12.1.3.1
[pip3] nvidia-cuda-cupti-cu12==12.1.105
[pip3] nvidia-cuda-nvrtc-cu12==12.1.105
[pip3] nvidia-cuda-runtime-cu12==12.1.105
[pip3] nvidia-cudnn-cu12==8.9.2.26
[pip3] nvidia-cufft-cu12==11.0.2.54
[pip3] nvidia-curand-cu12==10.3.2.106
[pip3] nvidia-cusolver-cu12==11.4.5.107
[pip3] nvidia-cusparse-cu12==12.1.0.106
[pip3] nvidia-nccl-cu12==2.18.1
[pip3] nvidia-nvjitlink-cu12==12.3.101
[pip3] nvidia-nvtx-cu12==12.1.105
[pip3] pynvml==11.5.0
[pip3] torch==2.1.2
[pip3] transformers==4.38.2
[pip3] transformers-stream-generator==0.0.5
[pip3] triton==2.1.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: N/A
vLLM Build Flags: CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled

GPU Topology:
      GPU0  GPU1  GPU2  GPU3  CPU Affinity  NUMA Affinity  GPU NUMA ID
GPU0   X    NV12  NV12  NV12  0-127         0-1           N/A
GPU1  NV12   X    NV12  NV12  0-127         0-1           N/A
GPU2  NV12  NV12   X    NV12  0-127         0-1           N/A
GPU3  NV12  NV12  NV12   X    0-127         0-1           N/A

Legend:
  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks
```
Model Input Dumps
No response
🐛 Describe the bug
When I start the Qwen72B model service on A100s with TP=4, the service gets stuck after running N requests. GPU utilization sits at 100% on 3 of the GPUs and at 0% on the remaining one. At the same time, CPU utilization is at 100%, and many requests are in CLOSE_WAIT status.
The overall logs show no errors, but one request stays in the running state indefinitely:
```text
INFO 09-11 14:13:08 async_llm_engine.py:554] Received request e88c77477cde4531a4fc53f5e78724d7: prompt: '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user. XXXXXX. <|im_end|>\n<|im_start|>assistant\n', sampling_params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.3, top_p=0.3, top_k=50, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=['<|im_end|>', '<|endoftext|>'], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=5120, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), prompt_token_ids: None, lora_request: None.
INFO 09-11 14:13:08 metrics.py:229] Avg prompt throughput: 12.6 tokens/s, Avg generation throughput: 0.1 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.5%, CPU KV cache usage: 0.0%
INFO 09-11 14:13:13 metrics.py:229] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 4.4 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.5%, CPU KV cache usage: 0.0%
INFO 09-11 14:13:18 metrics.py:229] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 4.4 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.5%, CPU KV cache usage: 0.0%
```
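To quantify the CLOSE_WAIT symptom while the service is stuck, I count the sockets by state. This is a small Linux-only sketch (not part of vLLM; it just parses `/proc/net/tcp`, where CLOSE_WAIT is kernel TCP state `0x08`):

```python
# Count sockets in CLOSE_WAIT by parsing /proc/net/tcp (Linux only).
# CLOSE_WAIT is state 0x08 in the kernel's TCP state enumeration.
CLOSE_WAIT = 0x08

def tcp_states(proc_net_tcp_text: str) -> list:
    """Return one numeric state code per connection line (header skipped)."""
    states = []
    for line in proc_net_tcp_text.splitlines()[1:]:
        fields = line.split()
        if len(fields) > 3:
            states.append(int(fields[3], 16))  # 4th field is the state, in hex
    return states

if __name__ == "__main__":
    with open("/proc/net/tcp") as f:
        states = tcp_states(f.read())
    print("CLOSE_WAIT sockets:", sum(s == CLOSE_WAIT for s in states))
```

The same numbers can be read with `ss -tan state close-wait`; the point is that the count keeps growing while the engine is hung, because the server never closes sockets for requests it no longer processes.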
More info:
Besides, I set the following environment variables to enable more logging:

```shell
export VLLM_LOGGING_LEVEL=DEBUG
export CUDA_LAUNCH_BLOCKING=1
export NCCL_DEBUG=TRACE
export VLLM_TRACE_FUNCTION=1
```

The last entries in the VLLM_TRACE_FUNCTION log are:
```text
2024-09-11 14:13:36.802123 Call to apply in /function/causal_language_modeling_0905/vllm/model_executor/layers/linear.py:110 from forward in /function/causal_language_modeling_0905/vllm/model_executor/layers/linear.py:306
2024-09-11 14:13:36.802355 Return from apply in /function/causal_language_modeling_0905/vllm/model_executor/layers/linear.py:119 to forward in /function/causal_language_modeling_0905/vllm/model_executor/layers/linear.py:306
2024-09-11 14:13:36.802378 Return from forward in /function/causal_language_modeling_0905/vllm/model_executor/layers/linear.py:313 to _call_impl in /usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:1527
2024-09-11 14:13:36.802404 Call to forward in /function/causal_language_modeling_0905/vllm/model_executor/layers/activation.py:31 from _call_impl in /usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:1527
2024-09-11 14:13:36.802437 Call to silu_and_mul in /function/causal_language_modeling_0905/vllm/_custom_ops.py:13 from forward in /function/causal_language_modeling_0905/vllm/model_executor/layers/activation.py:35
2024-09-11 14:13:36.802477 Return from silu_and_mul in /function/causal_language_modeling_0905/vllm/_custom_ops.py:14 to forward in /function/causal_language_modeling_0905/vllm/model_executor/layers/activation.py:35
2024-09-11 14:13:36.802492 Return from forward in /function/causal_language_modeling_0905/vllm/model_executor/layers/activation.py:36 to _call_impl in /usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:1527
2024-09-11 14:13:36.802516 Call to forward in /function/causal_language_modeling_0905/vllm/model_executor/layers/linear.py:715 from _call_impl in /usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:1527
2024-09-11 14:13:36.802532 Call to apply in /function/causal_language_modeling_0905/vllm/model_executor/layers/linear.py:110 from forward in /function/causal_language_modeling_0905/vllm/model_executor/layers/linear.py:726
2024-09-11 14:13:36.802667 Return from apply in /function/causal_language_modeling_0905/vllm/model_executor/layers/linear.py:119 to forward in /function/causal_language_modeling_0905/vllm/model_executor/layers/linear.py:726
2024-09-11 14:13:36.802687 Call to tensor_model_parallel_all_reduce in /function/causal_language_modeling_0905/vllm/distributed/communication_op.py:13 from forward in /function/causal_language_modeling_0905/vllm/model_executor/layers/linear.py:728
2024-09-11 14:13:36.802707 Call to get_tensor_model_parallel_world_size in /function/causal_language_modeling_0905/vllm/distributed/parallel_state.py:212 from tensor_model_parallel_all_reduce in /function/causal_language_modeling_0905/vllm/distributed/communication_op.py:30
2024-09-11 14:13:36.802727 Call to get_tensor_model_parallel_group in /function/causal_language_modeling_0905/vllm/distributed/parallel_state.py:198 from get_tensor_model_parallel_world_size in /function/causal_language_modeling_0905/vllm/distributed/parallel_state.py:215
2024-09-11 14:13:36.802741 Return from get_tensor_model_parallel_group in /function/causal_language_modeling_0905/vllm/distributed/parallel_state.py:202 to get_tensor_model_parallel_world_size in /function/causal_language_modeling_0905/vllm/distributed/parallel_state.py:215
2024-09-11 14:13:36.802762 Return from get_tensor_model_parallel_world_size in /function/causal_language_modeling_0905/vllm/distributed/parallel_state.py:214 to tensor_model_parallel_all_reduce in /function/causal_language_modeling_0905/vllm/distributed/communication_op.py:30
2024-09-11 14:13:36.802777 Call to custom_all_reduce in /function/causal_language_modeling_0905/vllm/distributed/device_communicators/custom_all_reduce.py:116 from tensor_model_parallel_all_reduce in /function/causal_language_modeling_0905/vllm/distributed/communication_op.py:32
2024-09-11 14:13:36.802790 Call to get_handle in /function/causal_language_modeling_0905/vllm/distributed/device_communicators/custom_all_reduce.py:96 from custom_all_reduce in /function/causal_language_modeling_0905/vllm/distributed/device_communicators/custom_all_reduce.py:117
2024-09-11 14:13:36.802803 Return from get_handle in /function/causal_language_modeling_0905/vllm/distributed/device_communicators/custom_all_reduce.py:97 to custom_all_reduce in /function/causal_language_modeling_0905/vllm/distributed/device_communicators/custom_all_reduce.py:117
2024-09-11 14:13:36.802817 Call to is_capturing in /function/causal_language_modeling_0905/vllm/distributed/device_communicators/custom_all_reduce.py:92 from custom_all_reduce in /function/causal_language_modeling_0905/vllm/distributed/device_communicators/custom_all_reduce.py:121
2024-09-11 14:13:36.802830 Return from is_capturing in /function/causal_language_modeling_0905/vllm/distributed/device_communicators/custom_all_reduce.py:93 to custom_all_reduce in /function/causal_language_modeling_0905/vllm/distributed/device_communicators/custom_all_reduce.py:121
2024-09-11 14:13:36.802845 Call to should_custom_ar in /function/causal_language_modeling_0905/vllm/distributed/device_communicators/custom_all_reduce.py:249 from custom_all_reduce in /function/causal_language_modeling_0905/vllm/distributed/device_communicators/custom_all_reduce.py:135
2024-09-11 14:13:36.802861 Return from should_custom_ar in /function/causal_language_modeling_0905/vllm/distributed/device_communicators/custom_all_reduce.py:250 to custom_all_reduce in /function/causal_language_modeling_0905/vllm/distributed/device_communicators/custom_all_reduce.py:135
2024-09-11 14:13:36.802875 Call to all_reduce_unreg in /function/causal_language_modeling_0905/vllm/distributed/device_communicators/custom_all_reduce.py:262 from custom_all_reduce in /function/causal_language_modeling_0905/vllm/distributed/device_communicators/custom_all_reduce.py:136
```
```text
2024-09-11 14:13:36.802500 Return from silu_and_mul in /function/causal_language_modeling_0905/vllm/_custom_ops.py:14 to forward in /function/causal_language_modeling_0905/vllm/model_executor/layers/activation.py:35
2024-09-11 14:13:36.802522 Return from forward in /function/causal_language_modeling_0905/vllm/model_executor/layers/activation.py:36 to _call_impl in /usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:1527
2024-09-11 14:13:36.802560 Call to forward in /function/causal_language_modeling_0905/vllm/model_executor/layers/linear.py:715 from _call_impl in /usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:1527
2024-09-11 14:13:36.802592 Call to apply in /function/causal_language_modeling_0905/vllm/model_executor/layers/linear.py:110 from forward in /function/causal_language_modeling_0905/vllm/model_executor/layers/linear.py:726
2024-09-11 14:13:36.802742 Return from apply in /function/causal_language_modeling_0905/vllm/model_executor/layers/linear.py:119 to forward in /function/causal_language_modeling_0905/vllm/model_executor/layers/linear.py:726
2024-09-11 14:13:36.802767 Call to tensor_model_parallel_all_reduce in /function/causal_language_modeling_0905/vllm/distributed/communication_op.py:13 from forward in /function/causal_language_modeling_0905/vllm/model_executor/layers/linear.py:728
2024-09-11 14:13:36.802794 Call to get_tensor_model_parallel_world_size in /function/causal_language_modeling_0905/vllm/distributed/parallel_state.py:212 from tensor_model_parallel_all_reduce in /function/causal_language_modeling_0905/vllm/distributed/communication_op.py:30
2024-09-11 14:13:36.802813 Call to get_tensor_model_parallel_group in /function/causal_language_modeling_0905/vllm/distributed/parallel_state.py:198 from get_tensor_model_parallel_world_size in /function/causal_language_modeling_0905/vllm/distributed/parallel_state.py:215
2024-09-11 14:13:36.802838 Return from get_tensor_model_parallel_group in /function/causal_language_modeling_0905/vllm/distributed/parallel_state.py:202 to get_tensor_model_parallel_world_size in /function/causal_language_modeling_0905/vllm/distributed/parallel_state.py:215
2024-09-11 14:13:36.802865 Return from get_tensor_model_parallel_world_size in /function/causal_language_modeling_0905/vllm/distributed/parallel_state.py:214 to tensor_model_parallel_all_reduce in /function/causal_language_modeling_0905/vllm/distributed/communication_op.py:30
2024-09-11 14:13:36.802883 Call to custom_all_reduce in /function/causal_language_modeling_0905/vllm/distributed/device_communicators/custom_all_reduce.py:116 from tensor_model_parallel_all_reduce in /function/causal_language_modeling_0905/vllm/distributed/communication_op.py:32
2024-09-11 14:13:36.802901 Call to get_handle in /function/causal_language_modeling_0905/vllm/distributed/device_communicators/custom_all_reduce.py:96 from custom_all_reduce in /function/causal_language_modeling_0905/vllm/distributed/device_communicators/custom_all_reduce.py:117
2024-09-11 14:13:36.802918 Return from get_handle in /function/causal_language_modeling_0905/vllm/distributed/device_communicators/custom_all_reduce.py:97 to custom_all_reduce in /function/causal_language_modeling_0905/vllm/distributed/device_communicators/custom_all_reduce.py:117
2024-09-11 14:13:36.802935 Call to is_capturing in /function/causal_language_modeling_0905/vllm/distributed/device_communicators/custom_all_reduce.py:92 from custom_all_reduce in /function/causal_language_modeling_0905/vllm/distributed/device_communicators/custom_all_reduce.py:121
2024-09-11 14:13:36.802953 Return from is_capturing in /function/causal_language_modeling_0905/vllm/distributed/device_communicators/custom_all_reduce.py:93 to custom_all_reduce in /function/causal_language_modeling_0905/vllm/distributed/device_communicators/custom_all_reduce.py:121
2024-09-11 14:13:36.802971 Call to should_custom_ar in /function/causal_language_modeling_0905/vllm/distributed/device_communicators/custom_all_reduce.py:249 from custom_all_reduce in /function/causal_language_modeling_0905/vllm/distributed/device_communicators/custom_all_reduce.py:135
2024-09-11 14:13:36.802991 Return from should_custom_ar in /function/causal_language_modeling_0905/vllm/distributed/device_communicators/custom_all_reduce.py:250 to custom_all_reduce in /function/causal_language_modeling_0905/vllm/distributed/device_communicators/custom_all_reduce.py:135
2024-09-11 14:13:36.803009 Call to all_reduce_unreg in /function/causal_language_modeling_0905/vllm/distributed/device_communicators/custom_all_reduce.py:262 from custom_all_reduce in /function/causal_language_modeling_0905/vllm/distributed/device_communicators/custom_all_reduce.py:136
```
```text
2024-09-11 14:13:36.801708 Call to custom_all_reduce in /function/causal_language_modeling_0905/vllm/distributed/device_communicators/custom_all_reduce.py:116 from tensor_model_parallel_all_reduce in /function/causal_language_modeling_0905/vllm/distributed/communication_op.py:32
2024-09-11 14:13:36.801722 Call to get_handle in /function/causal_language_modeling_0905/vllm/distributed/device_communicators/custom_all_reduce.py:96 from custom_all_reduce in /function/causal_language_modeling_0905/vllm/distributed/device_communicators/custom_all_reduce.py:117
2024-09-11 14:13:36.801735 Return from get_handle in /function/causal_language_modeling_0905/vllm/distributed/device_communicators/custom_all_reduce.py:97 to custom_all_reduce in /function/causal_language_modeling_0905/vllm/distributed/device_communicators/custom_all_reduce.py:117
2024-09-11 14:13:36.801749 Call to is_capturing in /function/causal_language_modeling_0905/vllm/distributed/device_communicators/custom_all_reduce.py:92 from custom_all_reduce in /function/causal_language_modeling_0905/vllm/distributed/device_communicators/custom_all_reduce.py:121
2024-09-11 14:13:36.801762 Return from is_capturing in /function/causal_language_modeling_0905/vllm/distributed/device_communicators/custom_all_reduce.py:93 to custom_all_reduce in /function/causal_language_modeling_0905/vllm/distributed/device_communicators/custom_all_reduce.py:121
2024-09-11 14:13:36.801776 Call to should_custom_ar in /function/causal_language_modeling_0905/vllm/distributed/device_communicators/custom_all_reduce.py:249 from custom_all_reduce in /function/causal_language_modeling_0905/vllm/distributed/device_communicators/custom_all_reduce.py:135
2024-09-11 14:13:36.801793 Return from should_custom_ar in /function/causal_language_modeling_0905/vllm/distributed/device_communicators/custom_all_reduce.py:250 to custom_all_reduce in /function/causal_language_modeling_0905/vllm/distributed/device_communicators/custom_all_reduce.py:135
2024-09-11 14:13:36.801807 Call to all_reduce_unreg in /function/causal_language_modeling_0905/vllm/distributed/device_communicators/custom_all_reduce.py:262 from custom_all_reduce in /function/causal_language_modeling_0905/vllm/distributed/device_communicators/custom_all_reduce.py:136
2024-09-11 14:13:36.801882 Return from all_reduce_unreg in /function/causal_language_modeling_0905/vllm/distributed/device_communicators/custom_all_reduce.py:266 to custom_all_reduce in /function/causal_language_modeling_0905/vllm/distributed/device_communicators/custom_all_reduce.py:136
2024-09-11 14:13:36.801900 Return from custom_all_reduce in /function/causal_language_modeling_0905/vllm/distributed/device_communicators/custom_all_reduce.py:136 to tensor_model_parallel_all_reduce in /function/causal_language_modeling_0905/vllm/distributed/communication_op.py:32
2024-09-11 14:13:36.801915 Return from tensor_model_parallel_all_reduce in /function/causal_language_modeling_0905/vllm/distributed/communication_op.py:34 to forward in /function/causal_language_modeling_0905/vllm/model_executor/layers/linear.py:728
2024-09-11 14:13:36.801935 Return from forward in /function/causal_language_modeling_0905/vllm/model_executor/layers/linear.py:738 to _call_impl in /usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:1527
2024-09-11 14:13:36.801954 Return from forward in /function/causal_language_modeling_0905/vllm/model_executor/models/qwen2.py:223 to _call_impl in /usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:1527
2024-09-11 14:13:36.801988 Call to forward in /function/causal_language_modeling_0905/vllm/model_executor/layers/layernorm.py:46 from _call_impl in /usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:1527
2024-09-11 14:13:36.802011 Call to fused_add_rms_norm in /function/causal_language_modeling_0905/vllm/_custom_ops.py:109 from forward in /function/causal_language_modeling_0905/vllm/model_executor/layers/layernorm.py:52
2024-09-11 14:13:36.802047 Return from fused_add_rms_norm in /function/causal_language_modeling_0905/vllm/_custom_ops.py:111 to forward in /function/causal_language_modeling_0905/vllm/model_executor/layers/layernorm.py:52
2024-09-11 14:13:36.802064 Return from forward in /function/causal_language_modeling_0905/vllm/model_executor/layers/layernorm.py:58 to _call_impl in /usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:1527
```
```text
2024-09-11 14:13:36.727544 Return from is_prefill in /function/causal_language_modeling_0905/vllm/sequence.py:557 to schedule in /function/causal_language_modeling_0905/vllm/core/scheduler.py:964
2024-09-11 14:13:36.727572 Call to init in /function/causal_language_modeling_0905/vllm/sequence.py:585 from schedule in /function/causal_language_modeling_0905/vllm/core/scheduler.py:965
2024-09-11 14:13:36.727599 Return from init in /function/causal_language_modeling_0905/vllm/sequence.py:615 to schedule in /function/causal_language_modeling_0905/vllm/core/scheduler.py:965
2024-09-11 14:13:36.727618 Call to mark_blocks_as_computed in /function/causal_language_modeling_0905/vllm/core/block_manager_v1.py:622 from schedule in /function/causal_language_modeling_0905/vllm/core/scheduler.py:992
2024-09-11 14:13:36.727632 Return from mark_blocks_as_computed in /function/causal_language_modeling_0905/vllm/core/block_manager_v1.py:623 to schedule in /function/causal_language_modeling_0905/vllm/core/scheduler.py:992
2024-09-11 14:13:36.727646 Return from schedule in /function/causal_language_modeling_0905/vllm/core/scheduler.py:995 to step_async in /function/causal_language_modeling_0905/vllm/engine/async_llm_engine.py:226
2024-09-11 14:13:36.727661 Call to is_empty in /function/causal_language_modeling_0905/vllm/core/scheduler.py:142 from step_async in /function/causal_language_modeling_0905/vllm/engine/async_llm_engine.py:228
2024-09-11 14:13:36.727675 Return from is_empty in /function/causal_language_modeling_0905/vllm/core/scheduler.py:144 to step_async in /function/causal_language_modeling_0905/vllm/engine/async_llm_engine.py:228
2024-09-11 14:13:36.727694 Call to execute_model_async in /function/causal_language_modeling_0905/vllm/executor/distributed_gpu_executor.py:108 from step_async in /function/causal_language_modeling_0905/vllm/engine/async_llm_engine.py:230
2024-09-11 14:13:36.727710 Call to _run_workers_async in /function/causal_language_modeling_0905/vllm/executor/ray_gpu_executor.py:329 from execute_model_async in /function/causal_language_modeling_0905/vllm/executor/distributed_gpu_executor.py:110
2024-09-11 14:13:36.727728 Call to _async_wrapper in /function/causal_language_modeling_0905/vllm/utils.py:216 from _run_workers_async in /function/causal_language_modeling_0905/vllm/executor/ray_gpu_executor.py:346
2024-09-11 14:13:36.727935 Return from _async_wrapper in /function/causal_language_modeling_0905/vllm/utils.py:219 to _run_workers_async in /function/causal_language_modeling_0905/vllm/executor/ray_gpu_executor.py:346
2024-09-11 14:13:36.728567 Return from _run_workers_async in /function/causal_language_modeling_0905/vllm/executor/ray_gpu_executor.py:352 to execute_model_async in /function/causal_language_modeling_0905/vllm/executor/distributed_gpu_executor.py:110
2024-09-11 14:13:36.728642 Return from execute_model_async in /function/causal_language_modeling_0905/vllm/executor/distributed_gpu_executor.py:110 to step_async in /function/causal_language_modeling_0905/vllm/engine/async_llm_engine.py:230
2024-09-11 14:13:36.728732 Return from step_async in /function/causal_language_modeling_0905/vllm/engine/async_llm_engine.py:230 to engine_step in /function/causal_language_modeling_0905/vllm/engine/async_llm_engine.py:500
2024-09-11 14:13:36.728815 Return from engine_step in /function/causal_language_modeling_0905/vllm/engine/async_llm_engine.py:500 to run in /usr/lib/python3.10/asyncio/runners.py:44
```
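The worker traces above end inside the custom all-reduce path (`all_reduce_unreg` in `custom_all_reduce.py`), which suggests the workers are blocked in a collective that one rank never entered. To see where each thread is actually blocked in a hung process, a stack dump helps; `py-spy dump --pid <pid>` works without code changes, or Python's stdlib `faulthandler` can be wired in ahead of time. A minimal sketch (generic diagnostics, not vLLM code; it assumes a Unix host where SIGUSR1 is available):

```python
import faulthandler
import signal
import tempfile

# On Unix, register SIGUSR1 so that `kill -USR1 <engine pid>` makes the
# otherwise unresponsive process print every thread's stack to stderr.
faulthandler.register(signal.SIGUSR1)

def dump_stacks() -> str:
    """Snapshot all thread stacks in-process.

    faulthandler writes at file-descriptor level, so it needs a real file,
    not a StringIO.
    """
    with tempfile.TemporaryFile(mode="w+") as f:
        faulthandler.dump_traceback(file=f, all_threads=True)
        f.seek(0)
        return f.read()

if __name__ == "__main__":
    print(dump_stacks())
```

If the dumps confirm the hang is inside the custom all-reduce kernel, launching vLLM with the `--disable-custom-all-reduce` engine flag (which falls back to NCCL for tensor-parallel all-reduce) may be a useful way to isolate the issue.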
Before submitting a new issue...
- [X] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.