[Bug]: Qwen72B service (TP=4) gets stuck after running N requests. GPU utilization is at 100% on 3 GPUs and at 0% on 1 GPU. Meanwhile, CPU utilization is at 100%, and many requests are in CLOSE_WAIT status.
Your current environment
```text
Collecting environment information...
PyTorch version: 2.1.2+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.3 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.22.1
Libc version: glibc-2.35

Python version: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-3.10.0-1160.108.1.el7.x86_64-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 12.1.66
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA A100-SXM4-80GB
GPU 1: NVIDIA A100-SXM4-80GB
GPU 2: NVIDIA A100-SXM4-80GB
GPU 3: NVIDIA A100-SXM4-80GB

Nvidia driver version: 535.129.03
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Address sizes:       46 bits physical, 48 bits virtual
Byte Order:          Little Endian
CPU(s):              128
On-line CPU(s) list: 0-127
Vendor ID:           GenuineIntel
Model name:          Intel(R) Xeon(R) Platinum 8369B CPU @ 2.90GHz
CPU family:          6
Model:               106
Thread(s) per core:  2
Core(s) per socket:  32
Socket(s):           2
Stepping:            6
BogoMIPS:            5799.99
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc eagerfpu pni pclmulqdq monitor ssse3 fma cx16 pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd rsb_ctxsw ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 arat avx512vbmi avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq spec_ctrl intel_stibp arch_capabilities
Hypervisor vendor:   KVM
Virtualization type: full
L1d cache:           3 MiB (64 instances)
L1i cache:           2 MiB (64 instances)
L2 cache:            80 MiB (64 instances)
L3 cache:            96 MiB (2 instances)
NUMA node(s):        2
NUMA node0 CPU(s):   0-63
NUMA node1 CPU(s):   64-127
Vulnerability Gather data sampling: Unknown: Dependent on hypervisor status
Vulnerability Itlb multihit:        Not affected
Vulnerability L1tf:                 Not affected
Vulnerability Mds:                  Not affected
Vulnerability Meltdown:             Not affected
Vulnerability Mmio stale data:      Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Vulnerability Retbleed:             Not affected
Vulnerability Spec store bypass:    Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:           Mitigation; Load fences, usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:           Mitigation; Full retpoline, IBPB
Vulnerability Srbds:                Not affected
Vulnerability Tsx async abort:      Not affected

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-cublas-cu12==12.1.3.1
[pip3] nvidia-cuda-cupti-cu12==12.1.105
[pip3] nvidia-cuda-nvrtc-cu12==12.1.105
[pip3] nvidia-cuda-runtime-cu12==12.1.105
[pip3] nvidia-cudnn-cu12==8.9.2.26
[pip3] nvidia-cufft-cu12==11.0.2.54
[pip3] nvidia-curand-cu12==10.3.2.106
[pip3] nvidia-cusolver-cu12==11.4.5.107
[pip3] nvidia-cusparse-cu12==12.1.0.106
[pip3] nvidia-nccl-cu12==2.18.1
[pip3] nvidia-nvjitlink-cu12==12.3.101
[pip3] nvidia-nvtx-cu12==12.1.105
[pip3] pynvml==11.5.0
[pip3] torch==2.1.2
[pip3] transformers==4.38.2
[pip3] transformers-stream-generator==0.0.5
[pip3] triton==2.1.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: N/A
vLLM Build Flags: CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled

GPU Topology:
      GPU0  GPU1  GPU2  GPU3  CPU Affinity  NUMA Affinity  GPU NUMA ID
GPU0   X    NV12  NV12  NV12  0-127         0-1           N/A
GPU1  NV12   X    NV12  NV12  0-127         0-1           N/A
GPU2  NV12  NV12   X    NV12  0-127         0-1           N/A
GPU3  NV12  NV12  NV12   X    0-127         0-1           N/A

Legend:
  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks
```
Model Input Dumps
No response
🐛 Describe the bug
When I start the Qwen72B model service on A100s with TP=4, the service gets stuck after running N requests. GPU utilization sits at 100% on 3 of the GPUs and at 0% on the remaining one. At the same time, CPU utilization is at 100%, and many requests are in CLOSE_WAIT status.
The overall logs show no errors, but one request stays in the running state indefinitely:
```text
INFO 09-11 14:13:08 async_llm_engine.py:554] Received request e88c77477cde4531a4fc53f5e78724d7: prompt: '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user. XXXXXX. <|im_end|>\n<|im_start|>assistant\n', sampling_params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.3, top_p=0.3, top_k=50, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=['<|im_end|>', '<|endoftext|>'], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=5120, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), prompt_token_ids: None, lora_request: None.
INFO 09-11 14:13:08 metrics.py:229] Avg prompt throughput: 12.6 tokens/s, Avg generation throughput: 0.1 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.5%, CPU KV cache usage: 0.0%
INFO 09-11 14:13:13 metrics.py:229] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 4.4 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.5%, CPU KV cache usage: 0.0%
INFO 09-11 14:13:18 metrics.py:229] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 4.4 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.5%, CPU KV cache usage: 0.0%
```
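To quantify the CLOSE_WAIT symptom while the service is stuck, I count the sockets by state. This is a small Linux-only sketch (not part of vLLM; it just parses `/proc/net/tcp`, where CLOSE_WAIT is kernel TCP state `0x08`):

```python
# Count sockets in CLOSE_WAIT by parsing /proc/net/tcp (Linux only).
# CLOSE_WAIT is state 0x08 in the kernel's TCP state enumeration.
CLOSE_WAIT = 0x08

def tcp_states(proc_net_tcp_text: str) -> list:
    """Return one numeric state code per connection line (header skipped)."""
    states = []
    for line in proc_net_tcp_text.splitlines()[1:]:
        fields = line.split()
        if len(fields) > 3:
            states.append(int(fields[3], 16))  # 4th field is the state, in hex
    return states

if __name__ == "__main__":
    with open("/proc/net/tcp") as f:
        states = tcp_states(f.read())
    print("CLOSE_WAIT sockets:", sum(s == CLOSE_WAIT for s in states))
```

The same numbers can be read with `ss -tan state close-wait`; the point is that the count keeps growing while the engine is hung, because the server never closes sockets for requests it no longer processes.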
More info:
Besides, I set the following environment variables to enable more logging:

```shell
export VLLM_LOGGING_LEVEL=DEBUG
export CUDA_LAUNCH_BLOCKING=1
export NCCL_DEBUG=TRACE
export VLLM_TRACE_FUNCTION=1
```

The last entries in the VLLM_TRACE_FUNCTION log are:
```text
2024-09-11 14:13:36.802123 Call to apply in /function/causal_language_modeling_0905/vllm/model_executor/layers/linear.py:110 from forward in /function/causal_language_modeling_0905/vllm/model_executor/layers/linear.py:306
2024-09-11 14:13:36.802355 Return from apply in /function/causal_language_modeling_0905/vllm/model_executor/layers/linear.py:119 to forward in /function/causal_language_modeling_0905/vllm/model_executor/layers/linear.py:306
2024-09-11 14:13:36.802378 Return from forward in /function/causal_language_modeling_0905/vllm/model_executor/layers/linear.py:313 to _call_impl in /usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:1527
2024-09-11 14:13:36.802404 Call to forward in /function/causal_language_modeling_0905/vllm/model_executor/layers/activation.py:31 from _call_impl in /usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:1527
2024-09-11 14:13:36.802437 Call to silu_and_mul in /function/causal_language_modeling_0905/vllm/_custom_ops.py:13 from forward in /function/causal_language_modeling_0905/vllm/model_executor/layers/activation.py:35
2024-09-11 14:13:36.802477 Return from silu_and_mul in /function/causal_language_modeling_0905/vllm/_custom_ops.py:14 to forward in /function/causal_language_modeling_0905/vllm/model_executor/layers/activation.py:35
2024-09-11 14:13:36.802492 Return from forward in /function/causal_language_modeling_0905/vllm/model_executor/layers/activation.py:36 to _call_impl in /usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:1527
2024-09-11 14:13:36.802516 Call to forward in /function/causal_language_modeling_0905/vllm/model_executor/layers/linear.py:715 from _call_impl in /usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:1527
2024-09-11 14:13:36.802532 Call to apply in /function/causal_language_modeling_0905/vllm/model_executor/layers/linear.py:110 from forward in /function/causal_language_modeling_0905/vllm/model_executor/layers/linear.py:726
2024-09-11 14:13:36.802667 Return from apply in /function/causal_language_modeling_0905/vllm/model_executor/layers/linear.py:119 to forward in /function/causal_language_modeling_0905/vllm/model_executor/layers/linear.py:726
2024-09-11 14:13:36.802687 Call to tensor_model_parallel_all_reduce in /function/causal_language_modeling_0905/vllm/distributed/communication_op.py:13 from forward in /function/causal_language_modeling_0905/vllm/model_executor/layers/linear.py:728
2024-09-11 14:13:36.802707 Call to get_tensor_model_parallel_world_size in /function/causal_language_modeling_0905/vllm/distributed/parallel_state.py:212 from tensor_model_parallel_all_reduce in /function/causal_language_modeling_0905/vllm/distributed/communication_op.py:30
2024-09-11 14:13:36.802727 Call to get_tensor_model_parallel_group in /function/causal_language_modeling_0905/vllm/distributed/parallel_state.py:198 from get_tensor_model_parallel_world_size in /function/causal_language_modeling_0905/vllm/distributed/parallel_state.py:215
2024-09-11 14:13:36.802741 Return from get_tensor_model_parallel_group in /function/causal_language_modeling_0905/vllm/distributed/parallel_state.py:202 to get_tensor_model_parallel_world_size in /function/causal_language_modeling_0905/vllm/distributed/parallel_state.py:215
2024-09-11 14:13:36.802762 Return from get_tensor_model_parallel_world_size in /function/causal_language_modeling_0905/vllm/distributed/parallel_state.py:214 to tensor_model_parallel_all_reduce in /function/causal_language_modeling_0905/vllm/distributed/communication_op.py:30
2024-09-11 14:13:36.802777 Call to custom_all_reduce in /function/causal_language_modeling_0905/vllm/distributed/device_communicators/custom_all_reduce.py:116 from tensor_model_parallel_all_reduce in /function/causal_language_modeling_0905/vllm/distributed/communication_op.py:32
2024-09-11 14:13:36.802790 Call to get_handle in /function/causal_language_modeling_0905/vllm/distributed/device_communicators/custom_all_reduce.py:96 from custom_all_reduce in /function/causal_language_modeling_0905/vllm/distributed/device_communicators/custom_all_reduce.py:117
2024-09-11 14:13:36.802803 Return from get_handle in /function/causal_language_modeling_0905/vllm/distributed/device_communicators/custom_all_reduce.py:97 to custom_all_reduce in /function/causal_language_modeling_0905/vllm/distributed/device_communicators/custom_all_reduce.py:117
2024-09-11 14:13:36.802817 Call to is_capturing in /function/causal_language_modeling_0905/vllm/distributed/device_communicators/custom_all_reduce.py:92 from custom_all_reduce in /function/causal_language_modeling_0905/vllm/distributed/device_communicators/custom_all_reduce.py:121
2024-09-11 14:13:36.802830 Return from is_capturing in /function/causal_language_modeling_0905/vllm/distributed/device_communicators/custom_all_reduce.py:93 to custom_all_reduce in /function/causal_language_modeling_0905/vllm/distributed/device_communicators/custom_all_reduce.py:121
2024-09-11 14:13:36.802845 Call to should_custom_ar in /function/causal_language_modeling_0905/vllm/distributed/device_communicators/custom_all_reduce.py:249 from custom_all_reduce in /function/causal_language_modeling_0905/vllm/distributed/device_communicators/custom_all_reduce.py:135
2024-09-11 14:13:36.802861 Return from should_custom_ar in /function/causal_language_modeling_0905/vllm/distributed/device_communicators/custom_all_reduce.py:250 to custom_all_reduce in /function/causal_language_modeling_0905/vllm/distributed/device_communicators/custom_all_reduce.py:135
2024-09-11 14:13:36.802875 Call to all_reduce_unreg in /function/causal_language_modeling_0905/vllm/distributed/device_communicators/custom_all_reduce.py:262 from custom_all_reduce in /function/causal_language_modeling_0905/vllm/distributed/device_communicators/custom_all_reduce.py:136
```
```text
2024-09-11 14:13:36.802500 Return from silu_and_mul in /function/causal_language_modeling_0905/vllm/_custom_ops.py:14 to forward in /function/causal_language_modeling_0905/vllm/model_executor/layers/activation.py:35
2024-09-11 14:13:36.802522 Return from forward in /function/causal_language_modeling_0905/vllm/model_executor/layers/activation.py:36 to _call_impl in /usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:1527
2024-09-11 14:13:36.802560 Call to forward in /function/causal_language_modeling_0905/vllm/model_executor/layers/linear.py:715 from _call_impl in /usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:1527
2024-09-11 14:13:36.802592 Call to apply in /function/causal_language_modeling_0905/vllm/model_executor/layers/linear.py:110 from forward in /function/causal_language_modeling_0905/vllm/model_executor/layers/linear.py:726
2024-09-11 14:13:36.802742 Return from apply in /function/causal_language_modeling_0905/vllm/model_executor/layers/linear.py:119 to forward in /function/causal_language_modeling_0905/vllm/model_executor/layers/linear.py:726
2024-09-11 14:13:36.802767 Call to tensor_model_parallel_all_reduce in /function/causal_language_modeling_0905/vllm/distributed/communication_op.py:13 from forward in /function/causal_language_modeling_0905/vllm/model_executor/layers/linear.py:728
2024-09-11 14:13:36.802794 Call to get_tensor_model_parallel_world_size in /function/causal_language_modeling_0905/vllm/distributed/parallel_state.py:212 from tensor_model_parallel_all_reduce in /function/causal_language_modeling_0905/vllm/distributed/communication_op.py:30
2024-09-11 14:13:36.802813 Call to get_tensor_model_parallel_group in /function/causal_language_modeling_0905/vllm/distributed/parallel_state.py:198 from get_tensor_model_parallel_world_size in /function/causal_language_modeling_0905/vllm/distributed/parallel_state.py:215
2024-09-11 14:13:36.802838 Return from get_tensor_model_parallel_group in /function/causal_language_modeling_0905/vllm/distributed/parallel_state.py:202 to get_tensor_model_parallel_world_size in /function/causal_language_modeling_0905/vllm/distributed/parallel_state.py:215
2024-09-11 14:13:36.802865 Return from get_tensor_model_parallel_world_size in /function/causal_language_modeling_0905/vllm/distributed/parallel_state.py:214 to tensor_model_parallel_all_reduce in /function/causal_language_modeling_0905/vllm/distributed/communication_op.py:30
2024-09-11 14:13:36.802883 Call to custom_all_reduce in /function/causal_language_modeling_0905/vllm/distributed/device_communicators/custom_all_reduce.py:116 from tensor_model_parallel_all_reduce in /function/causal_language_modeling_0905/vllm/distributed/communication_op.py:32
2024-09-11 14:13:36.802901 Call to get_handle in /function/causal_language_modeling_0905/vllm/distributed/device_communicators/custom_all_reduce.py:96 from custom_all_reduce in /function/causal_language_modeling_0905/vllm/distributed/device_communicators/custom_all_reduce.py:117
2024-09-11 14:13:36.802918 Return from get_handle in /function/causal_language_modeling_0905/vllm/distributed/device_communicators/custom_all_reduce.py:97 to custom_all_reduce in /function/causal_language_modeling_0905/vllm/distributed/device_communicators/custom_all_reduce.py:117
2024-09-11 14:13:36.802935 Call to is_capturing in /function/causal_language_modeling_0905/vllm/distributed/device_communicators/custom_all_reduce.py:92 from custom_all_reduce in /function/causal_language_modeling_0905/vllm/distributed/device_communicators/custom_all_reduce.py:121
2024-09-11 14:13:36.802953 Return from is_capturing in /function/causal_language_modeling_0905/vllm/distributed/device_communicators/custom_all_reduce.py:93 to custom_all_reduce in /function/causal_language_modeling_0905/vllm/distributed/device_communicators/custom_all_reduce.py:121
2024-09-11 14:13:36.802971 Call to should_custom_ar in /function/causal_language_modeling_0905/vllm/distributed/device_communicators/custom_all_reduce.py:249 from custom_all_reduce in /function/causal_language_modeling_0905/vllm/distributed/device_communicators/custom_all_reduce.py:135
2024-09-11 14:13:36.802991 Return from should_custom_ar in /function/causal_language_modeling_0905/vllm/distributed/device_communicators/custom_all_reduce.py:250 to custom_all_reduce in /function/causal_language_modeling_0905/vllm/distributed/device_communicators/custom_all_reduce.py:135
2024-09-11 14:13:36.803009 Call to all_reduce_unreg in /function/causal_language_modeling_0905/vllm/distributed/device_communicators/custom_all_reduce.py:262 from custom_all_reduce in /function/causal_language_modeling_0905/vllm/distributed/device_communicators/custom_all_reduce.py:136
```
```text
2024-09-11 14:13:36.801708 Call to custom_all_reduce in /function/causal_language_modeling_0905/vllm/distributed/device_communicators/custom_all_reduce.py:116 from tensor_model_parallel_all_reduce in /function/causal_language_modeling_0905/vllm/distributed/communication_op.py:32
2024-09-11 14:13:36.801722 Call to get_handle in /function/causal_language_modeling_0905/vllm/distributed/device_communicators/custom_all_reduce.py:96 from custom_all_reduce in /function/causal_language_modeling_0905/vllm/distributed/device_communicators/custom_all_reduce.py:117
2024-09-11 14:13:36.801735 Return from get_handle in /function/causal_language_modeling_0905/vllm/distributed/device_communicators/custom_all_reduce.py:97 to custom_all_reduce in /function/causal_language_modeling_0905/vllm/distributed/device_communicators/custom_all_reduce.py:117
2024-09-11 14:13:36.801749 Call to is_capturing in /function/causal_language_modeling_0905/vllm/distributed/device_communicators/custom_all_reduce.py:92 from custom_all_reduce in /function/causal_language_modeling_0905/vllm/distributed/device_communicators/custom_all_reduce.py:121
2024-09-11 14:13:36.801762 Return from is_capturing in /function/causal_language_modeling_0905/vllm/distributed/device_communicators/custom_all_reduce.py:93 to custom_all_reduce in /function/causal_language_modeling_0905/vllm/distributed/device_communicators/custom_all_reduce.py:121
2024-09-11 14:13:36.801776 Call to should_custom_ar in /function/causal_language_modeling_0905/vllm/distributed/device_communicators/custom_all_reduce.py:249 from custom_all_reduce in /function/causal_language_modeling_0905/vllm/distributed/device_communicators/custom_all_reduce.py:135
2024-09-11 14:13:36.801793 Return from should_custom_ar in /function/causal_language_modeling_0905/vllm/distributed/device_communicators/custom_all_reduce.py:250 to custom_all_reduce in /function/causal_language_modeling_0905/vllm/distributed/device_communicators/custom_all_reduce.py:135
2024-09-11 14:13:36.801807 Call to all_reduce_unreg in /function/causal_language_modeling_0905/vllm/distributed/device_communicators/custom_all_reduce.py:262 from custom_all_reduce in /function/causal_language_modeling_0905/vllm/distributed/device_communicators/custom_all_reduce.py:136
2024-09-11 14:13:36.801882 Return from all_reduce_unreg in /function/causal_language_modeling_0905/vllm/distributed/device_communicators/custom_all_reduce.py:266 to custom_all_reduce in /function/causal_language_modeling_0905/vllm/distributed/device_communicators/custom_all_reduce.py:136
2024-09-11 14:13:36.801900 Return from custom_all_reduce in /function/causal_language_modeling_0905/vllm/distributed/device_communicators/custom_all_reduce.py:136 to tensor_model_parallel_all_reduce in /function/causal_language_modeling_0905/vllm/distributed/communication_op.py:32
2024-09-11 14:13:36.801915 Return from tensor_model_parallel_all_reduce in /function/causal_language_modeling_0905/vllm/distributed/communication_op.py:34 to forward in /function/causal_language_modeling_0905/vllm/model_executor/layers/linear.py:728
2024-09-11 14:13:36.801935 Return from forward in /function/causal_language_modeling_0905/vllm/model_executor/layers/linear.py:738 to _call_impl in /usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:1527
2024-09-11 14:13:36.801954 Return from forward in /function/causal_language_modeling_0905/vllm/model_executor/models/qwen2.py:223 to _call_impl in /usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:1527
2024-09-11 14:13:36.801988 Call to forward in /function/causal_language_modeling_0905/vllm/model_executor/layers/layernorm.py:46 from _call_impl in /usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:1527
2024-09-11 14:13:36.802011 Call to fused_add_rms_norm in /function/causal_language_modeling_0905/vllm/_custom_ops.py:109 from forward in /function/causal_language_modeling_0905/vllm/model_executor/layers/layernorm.py:52
2024-09-11 14:13:36.802047 Return from fused_add_rms_norm in /function/causal_language_modeling_0905/vllm/_custom_ops.py:111 to forward in /function/causal_language_modeling_0905/vllm/model_executor/layers/layernorm.py:52
2024-09-11 14:13:36.802064 Return from forward in /function/causal_language_modeling_0905/vllm/model_executor/layers/layernorm.py:58 to _call_impl in /usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:1527
```
```text
2024-09-11 14:13:36.727544 Return from is_prefill in /function/causal_language_modeling_0905/vllm/sequence.py:557 to schedule in /function/causal_language_modeling_0905/vllm/core/scheduler.py:964
2024-09-11 14:13:36.727572 Call to init in /function/causal_language_modeling_0905/vllm/sequence.py:585 from schedule in /function/causal_language_modeling_0905/vllm/core/scheduler.py:965
2024-09-11 14:13:36.727599 Return from init in /function/causal_language_modeling_0905/vllm/sequence.py:615 to schedule in /function/causal_language_modeling_0905/vllm/core/scheduler.py:965
2024-09-11 14:13:36.727618 Call to mark_blocks_as_computed in /function/causal_language_modeling_0905/vllm/core/block_manager_v1.py:622 from schedule in /function/causal_language_modeling_0905/vllm/core/scheduler.py:992
2024-09-11 14:13:36.727632 Return from mark_blocks_as_computed in /function/causal_language_modeling_0905/vllm/core/block_manager_v1.py:623 to schedule in /function/causal_language_modeling_0905/vllm/core/scheduler.py:992
2024-09-11 14:13:36.727646 Return from schedule in /function/causal_language_modeling_0905/vllm/core/scheduler.py:995 to step_async in /function/causal_language_modeling_0905/vllm/engine/async_llm_engine.py:226
2024-09-11 14:13:36.727661 Call to is_empty in /function/causal_language_modeling_0905/vllm/core/scheduler.py:142 from step_async in /function/causal_language_modeling_0905/vllm/engine/async_llm_engine.py:228
2024-09-11 14:13:36.727675 Return from is_empty in /function/causal_language_modeling_0905/vllm/core/scheduler.py:144 to step_async in /function/causal_language_modeling_0905/vllm/engine/async_llm_engine.py:228
2024-09-11 14:13:36.727694 Call to execute_model_async in /function/causal_language_modeling_0905/vllm/executor/distributed_gpu_executor.py:108 from step_async in /function/causal_language_modeling_0905/vllm/engine/async_llm_engine.py:230
2024-09-11 14:13:36.727710 Call to _run_workers_async in /function/causal_language_modeling_0905/vllm/executor/ray_gpu_executor.py:329 from execute_model_async in /function/causal_language_modeling_0905/vllm/executor/distributed_gpu_executor.py:110
2024-09-11 14:13:36.727728 Call to _async_wrapper in /function/causal_language_modeling_0905/vllm/utils.py:216 from _run_workers_async in /function/causal_language_modeling_0905/vllm/executor/ray_gpu_executor.py:346
2024-09-11 14:13:36.727935 Return from _async_wrapper in /function/causal_language_modeling_0905/vllm/utils.py:219 to _run_workers_async in /function/causal_language_modeling_0905/vllm/executor/ray_gpu_executor.py:346
2024-09-11 14:13:36.728567 Return from _run_workers_async in /function/causal_language_modeling_0905/vllm/executor/ray_gpu_executor.py:352 to execute_model_async in /function/causal_language_modeling_0905/vllm/executor/distributed_gpu_executor.py:110
2024-09-11 14:13:36.728642 Return from execute_model_async in /function/causal_language_modeling_0905/vllm/executor/distributed_gpu_executor.py:110 to step_async in /function/causal_language_modeling_0905/vllm/engine/async_llm_engine.py:230
2024-09-11 14:13:36.728732 Return from step_async in /function/causal_language_modeling_0905/vllm/engine/async_llm_engine.py:230 to engine_step in /function/causal_language_modeling_0905/vllm/engine/async_llm_engine.py:500
2024-09-11 14:13:36.728815 Return from engine_step in /function/causal_language_modeling_0905/vllm/engine/async_llm_engine.py:500 to run in /usr/lib/python3.10/asyncio/runners.py:44
```
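The worker traces above end inside the custom all-reduce path (`all_reduce_unreg` in `custom_all_reduce.py`), which suggests the workers are blocked in a collective that one rank never entered. To see where each thread is actually blocked in a hung process, a stack dump helps; `py-spy dump --pid <pid>` works without code changes, or Python's stdlib `faulthandler` can be wired in ahead of time. A minimal sketch (generic diagnostics, not vLLM code; it assumes a Unix host where SIGUSR1 is available):

```python
import faulthandler
import signal
import tempfile

# On Unix, register SIGUSR1 so that `kill -USR1 <engine pid>` makes the
# otherwise unresponsive process print every thread's stack to stderr.
faulthandler.register(signal.SIGUSR1)

def dump_stacks() -> str:
    """Snapshot all thread stacks in-process.

    faulthandler writes at file-descriptor level, so it needs a real file,
    not a StringIO.
    """
    with tempfile.TemporaryFile(mode="w+") as f:
        faulthandler.dump_traceback(file=f, all_threads=True)
        f.seek(0)
        return f.read()

if __name__ == "__main__":
    print(dump_stacks())
```

If the dumps confirm the hang is inside the custom all-reduce kernel, launching vLLM with the `--disable-custom-all-reduce` engine flag (which falls back to NCCL for tensor-parallel all-reduce) may be a useful way to isolate the issue.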
Before submitting a new issue...
- [X] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.