[Bug]: `size_k must divisible by BLOCK_SIZE_K` error when using tensor parallelism with AWQ-quantized MoE models

Open ehartford opened this issue 7 months ago • 10 comments

Your current environment

The output of python collect_env.py
PyTorch version: 2.7.0+cu126
Is debug build: False
CUDA used to build PyTorch: 12.6
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.4 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.22.1
Libc version: glibc-2.35

Python version: 3.12.9 | packaged by Anaconda, Inc. | (main, Feb  6 2025, 18:56:27) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.15.0-1053-nvidia-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 12.8.93
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: 
GPU 0: NVIDIA A100-SXM4-80GB
GPU 1: NVIDIA A100-SXM4-80GB
GPU 2: NVIDIA A100-SXM4-80GB
GPU 3: NVIDIA A100-SXM4-80GB
GPU 4: NVIDIA A100-SXM4-80GB
GPU 5: NVIDIA A100-SXM4-80GB
GPU 6: NVIDIA A100-SXM4-80GB
GPU 7: NVIDIA A100-SXM4-80GB

Nvidia driver version: 535.161.08
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Address sizes:                      43 bits physical, 48 bits virtual
Byte Order:                         Little Endian
CPU(s):                             256
On-line CPU(s) list:                0-255
Vendor ID:                          AuthenticAMD
Model name:                         AMD EPYC 7742 64-Core Processor
CPU family:                         23
Model:                              49
Thread(s) per core:                 2
Core(s) per socket:                 64
Socket(s):                          2
Stepping:                           0
Frequency boost:                    enabled
CPU max MHz:                        2250.0000
CPU min MHz:                        1500.0000
BogoMIPS:                           4491.41
Flags:                              fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd amd_ppin arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip rdpid overflow_recov succor smca sme sev sev_es
Virtualization:                     AMD-V
L1d cache:                          4 MiB (128 instances)
L1i cache:                          4 MiB (128 instances)
L2 cache:                           64 MiB (128 instances)
L3 cache:                           512 MiB (32 instances)
NUMA node(s):                       8
NUMA node0 CPU(s):                  0-15,128-143
NUMA node1 CPU(s):                  16-31,144-159
NUMA node2 CPU(s):                  32-47,160-175
NUMA node3 CPU(s):                  48-63,176-191
NUMA node4 CPU(s):                  64-79,192-207
NUMA node5 CPU(s):                  80-95,208-223
NUMA node6 CPU(s):                  96-111,224-239
NUMA node7 CPU(s):                  112-127,240-255
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit:        Not affected
Vulnerability L1tf:                 Not affected
Vulnerability Mds:                  Not affected
Vulnerability Meltdown:             Not affected
Vulnerability Mmio stale data:      Not affected
Vulnerability Retbleed:             Mitigation; untrained return thunk; SMT enabled with STIBP protection
Vulnerability Spec rstack overflow: Mitigation; safe RET
Vulnerability Spec store bypass:    Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:           Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:           Mitigation; Retpolines, IBPB conditional, STIBP always-on, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds:                Not affected
Vulnerability Tsx async abort:      Not affected

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-cublas-cu12==12.6.4.1
[pip3] nvidia-cuda-cupti-cu12==12.6.80
[pip3] nvidia-cuda-nvrtc-cu12==12.6.77
[pip3] nvidia-cuda-runtime-cu12==12.6.77
[pip3] nvidia-cudnn-cu12==9.5.1.17
[pip3] nvidia-cufft-cu12==11.3.0.4
[pip3] nvidia-cufile-cu12==1.11.1.6
[pip3] nvidia-curand-cu12==10.3.7.77
[pip3] nvidia-cusolver-cu12==11.7.1.2
[pip3] nvidia-cusparse-cu12==12.5.4.2
[pip3] nvidia-cusparselt-cu12==0.6.3
[pip3] nvidia-ml-py==12.570.86
[pip3] nvidia-nccl-cu12==2.26.2
[pip3] nvidia-nvjitlink-cu12==12.6.85
[pip3] nvidia-nvtx-cu12==12.6.77
[pip3] pynvml==12.0.0
[pip3] pytorch-triton==3.3.0+git96316ce5
[pip3] pyzmq==26.4.0
[pip3] torch==2.7.0
[pip3] torchaudio==2.7.0
[pip3] torchvision==0.22.0
[pip3] transformers==4.52.0.dev0
[pip3] triton==3.3.0
[conda] numpy                     1.26.4                   pypi_0    pypi
[conda] nvidia-cublas-cu12        12.6.4.1                 pypi_0    pypi
[conda] nvidia-cuda-cupti-cu12    12.6.80                  pypi_0    pypi
[conda] nvidia-cuda-nvrtc-cu12    12.6.77                  pypi_0    pypi
[conda] nvidia-cuda-runtime-cu12  12.6.77                  pypi_0    pypi
[conda] nvidia-cudnn-cu12         9.5.1.17                 pypi_0    pypi
[conda] nvidia-cufft-cu12         11.3.0.4                 pypi_0    pypi
[conda] nvidia-cufile-cu12        1.11.1.6                 pypi_0    pypi
[conda] nvidia-curand-cu12        10.3.7.77                pypi_0    pypi
[conda] nvidia-cusolver-cu12      11.7.1.2                 pypi_0    pypi
[conda] nvidia-cusparse-cu12      12.5.4.2                 pypi_0    pypi
[conda] nvidia-cusparselt-cu12    0.6.3                    pypi_0    pypi
[conda] nvidia-ml-py              12.570.86                pypi_0    pypi
[conda] nvidia-nccl-cu12          2.26.2                   pypi_0    pypi
[conda] nvidia-nvjitlink-cu12     12.6.85                  pypi_0    pypi
[conda] nvidia-nvtx-cu12          12.6.77                  pypi_0    pypi
[conda] pynvml                    12.0.0                   pypi_0    pypi
[conda] pytorch-triton            3.3.0+git96316ce5          pypi_0    pypi
[conda] pyzmq                     26.4.0                   pypi_0    pypi
[conda] torch                     2.7.0                    pypi_0    pypi
[conda] torchaudio                2.7.0                    pypi_0    pypi
[conda] torchvision               0.22.0                   pypi_0    pypi
[conda] transformers              4.52.0.dev0              pypi_0    pypi
[conda] triton                    3.3.0                    pypi_0    pypi
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.8.5.dev443+gb90b0852e (git sha: b90b0852e)
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    NIC0    NIC1    NIC2    NIC3    NIC4    NIC5    NIC6    NIC7    NIC8    NIC9    NIC10   NIC11   CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NV12    NV12    NV12    NV12    NV12    NV12    NV12    PXB     PXB     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     48-63,176-191   3               N/A
GPU1    NV12     X      NV12    NV12    NV12    NV12    NV12    NV12    PXB     PXB     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     48-63,176-191   3               N/A
GPU2    NV12    NV12     X      NV12    NV12    NV12    NV12    NV12    SYS     SYS     PXB     PXB     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     16-31,144-159   1               N/A
GPU3    NV12    NV12    NV12     X      NV12    NV12    NV12    NV12    SYS     SYS     PXB     PXB     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     16-31,144-159   1               N/A
GPU4    NV12    NV12    NV12    NV12     X      NV12    NV12    NV12    SYS     SYS     SYS     SYS     SYS     SYS     PXB     PXB     SYS     SYS     SYS     SYS     112-127,240-255 7               N/A
GPU5    NV12    NV12    NV12    NV12    NV12     X      NV12    NV12    SYS     SYS     SYS     SYS     SYS     SYS     PXB     PXB     SYS     SYS     SYS     SYS     112-127,240-255 7               N/A
GPU6    NV12    NV12    NV12    NV12    NV12    NV12     X      NV12    SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     PXB     PXB     SYS     SYS     80-95,208-223   5               N/A
GPU7    NV12    NV12    NV12    NV12    NV12    NV12    NV12     X      SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     PXB     PXB     SYS     SYS     80-95,208-223   5               N/A
NIC0    PXB     PXB     SYS     SYS     SYS     SYS     SYS     SYS      X      PXB     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS
NIC1    PXB     PXB     SYS     SYS     SYS     SYS     SYS     SYS     PXB      X      SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS
NIC2    SYS     SYS     PXB     PXB     SYS     SYS     SYS     SYS     SYS     SYS      X      PXB     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS
NIC3    SYS     SYS     PXB     PXB     SYS     SYS     SYS     SYS     SYS     SYS     PXB      X      SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS
NIC4    SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS      X      PIX     SYS     SYS     SYS     SYS     SYS     SYS
NIC5    SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     PIX      X      SYS     SYS     SYS     SYS     SYS     SYS
NIC6    SYS     SYS     SYS     SYS     PXB     PXB     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS      X      PXB     SYS     SYS     SYS     SYS
NIC7    SYS     SYS     SYS     SYS     PXB     PXB     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     PXB      X      SYS     SYS     SYS     SYS
NIC8    SYS     SYS     SYS     SYS     SYS     SYS     PXB     PXB     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS      X      PXB     SYS     SYS
NIC9    SYS     SYS     SYS     SYS     SYS     SYS     PXB     PXB     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     PXB      X      SYS     SYS
NIC10   SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS      X      PIX
NIC11   SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     PIX      X 

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_0
  NIC1: mlx5_1
  NIC2: mlx5_2
  NIC3: mlx5_3
  NIC4: mlx5_4
  NIC5: mlx5_5
  NIC6: mlx5_6
  NIC7: mlx5_7
  NIC8: mlx5_8
  NIC9: mlx5_9
  NIC10: mlx5_10
  NIC11: mlx5_11

NCCL_CUMEM_ENABLE=0
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1
CUDA_MODULE_LOADING=LAZY

🐛 Describe the bug

When serving AWQ-quantized MoE models (specifically Qwen3-30B-A3B-AWQ) with tensor parallelism, the behavior depends on the tensor parallel size:

  • TP=1: Works correctly
  • TP=2: Works correctly
  • TP=3, 5, 6, 7: Fails fast with ValueError: Total number of attention heads (32) must be divisible by tensor parallel size
  • TP=4, 8: Starts loading but fails during initialization with:
    RuntimeError: Worker failed with error 'size_k must divisible by BLOCK_SIZE_K', please check the stack trace above for the root cause
    

Reproduction Steps

  1. Clone the latest vLLM repo
  2. Install with pip install -e .
  3. Run the following command:
python -m vllm.entrypoints.api_server \
  --model CognitiveComputations/Qwen3-30B-A3B-AWQ \
  --gpu-memory-utilization 0.9 \
  --max-model-len 32768 \
  --max-num-seqs 64 \
  --tensor-parallel-size 4 \
  --host 127.0.0.1 \
  --port 8080

Expected Behavior

The model should load successfully with tensor parallelism.

Actual Behavior

For TP=4 and TP=8, the error occurs during model initialization in the fused_moe/fused_moe.py code:

File "/raid/workspace/vllm/vllm/model_executor/layers/fused_moe/fused_moe.py", line 526, in invoke_fused_moe_kernel
  ops.moe_wna16_gemm(A, C, B, B_scale, B_zp,
File "/raid/workspace/vllm/vllm/_custom_ops.py", line 1265, in moe_wna16_gemm
  torch.ops._moe_C.moe_wna16_gemm(input, output, b_qweight, b_scales,
RuntimeError: size_k must divisible by BLOCK_SIZE_K

For TP=3, 5, 6, 7, the error is that the total number of attention heads (32) is not divisible by the tensor parallel size.

Environment Information

  • vLLM version: 0.8.5.dev443 (development version)
  • PyTorch version: 2.7.0+cu126
  • CUDA version: 12.6 (used to build PyTorch), 12.8.93 (runtime)
  • Hardware: 8x A100 GPUs
  • Operating System: Linux

Additional Context

  1. Every MoE layer in the model shows this warning before the error occurs:

    WARNING: Layer 'model.layers.*.mlp.experts' is not supported by AWQMoeMarlin. Falling back to Moe WNA16 kernels.
    

    Then the failure happens when trying to use those WNA16 kernels with tensor parallelism.

  2. This issue seems to be specific to AWQ-quantized MoE models with tensor parallelism, as:

    • The model works fine without tensor parallelism
    • Tensor parallelism works with TP=2; TP=4 and TP=8 satisfy the attention-head divisibility requirement but fail later in the MoE kernel
    • The issue occurs specifically in the MoE AWQ quantization code path

Potential Solution Direction

There seems to be an issue with how matrix dimensions are handled when using tensor parallelism with AWQ-MoE models. The error suggests that, once the weights are sharded across tensor-parallel ranks, the per-rank matrix dimensions in the MoE layers are no longer aligned with the block size that the WNA16 kernels require.

Before submitting a new issue...

  • [x] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

ehartford avatar May 03 '25 00:05 ehartford

Hi, regarding the failures you observed with tensor-parallel-size (TP) set to 3, 5, 6, or 7, the error message ValueError: Total number of attention heads (32) must be divisible by tensor parallel size points to the cause.

This specific model (Qwen3-30B-A3B-AWQ) has 32 attention heads, as confirmed by its configuration (num_attention_heads: 32). A known requirement for tensor parallelism implementation in vLLM is that the total number of attention heads must be evenly divisible by the tensor parallel size for the attention mechanism to be parallelized correctly.

Since 32 is not divisible by 3, 5, 6, or 7, the failure you see with these TP sizes is actually expected behavior due to this fundamental constraint of the tensor parallelism implementation for attention layers.
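
As a quick illustration of the constraint above, here is a minimal sketch that lists which tensor parallel sizes pass the head-divisibility check. The helper name `valid_tp_sizes` is hypothetical and is not a vLLM function; only the head count of 32 comes from this thread.

```python
def valid_tp_sizes(num_attention_heads: int, max_tp: int) -> list[int]:
    """Return the TP sizes in 1..max_tp that evenly divide the head count."""
    return [tp for tp in range(1, max_tp + 1) if num_attention_heads % tp == 0]


NUM_HEADS = 32  # num_attention_heads reported for this model in the thread

print(valid_tp_sizes(NUM_HEADS, 8))  # [1, 2, 4, 8]
```

On an 8-GPU node this leaves only TP=1, 2, 4 and 8 as candidates, which matches the fast failures reported for 3, 5, 6 and 7.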

princepride avatar May 03 '25 03:05 princepride

@DarkLight1337 I can confirm I'm also seeing the RuntimeError: size_k must divisible by BLOCK_SIZE_K when running CognitiveComputations/Qwen3-30B-A3B-AWQ with tensor-parallel-size=4 (and likely TP=8 too). The logs show this happens after vLLM falls back to the Moe WNA16 kernels for the MoE layers.

Root Cause:

The error comes directly from a check in the C++ code for moe_wna16_gemm (vllm/csrc/moe/moe_wna16_gemm.cu): TORCH_CHECK(size_k % BLOCK_SIZE_K == 0, ...). This check makes sure the K dimension (size_k) going into the CUDA kernel is a multiple of BLOCK_SIZE_K. The kernel needs this alignment for performance/correctness.

With TP=4, when the model's K dimension (like hidden_size or intermediate_size) gets split across GPUs, the resulting size_k on each worker isn't divisible by BLOCK_SIZE_K for this model. This causes the check to fail during the model warm-up/graph capture (_dummy_run).
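
To make the sharding arithmetic concrete, here is a rough sketch. It assumes a fixed BLOCK_SIZE_K of 128 (the value a later commenter printed) and an illustrative unsharded K of 768; neither number is confirmed for this model in the thread, and the helper names are hypothetical rather than vLLM code.

```python
BLOCK_SIZE_K = 128  # assumption: value printed by a commenter later in this thread
UNSHARDED_K = 768   # assumption: illustrative per-expert K before sharding


def per_rank_k(unsharded_k: int, tp_size: int) -> int:
    """K dimension each worker sees when K is split evenly across TP ranks."""
    if unsharded_k % tp_size != 0:
        raise ValueError("K does not split evenly across ranks")
    return unsharded_k // tp_size


def passes_block_check(size_k: int, block_size_k: int = BLOCK_SIZE_K) -> bool:
    """Same condition as the kernel check: size_k % BLOCK_SIZE_K == 0."""
    return size_k % block_size_k == 0


for tp in (1, 2, 4, 8):
    size_k = per_rank_k(UNSHARDED_K, tp)
    print(f"TP={tp}: size_k={size_k}, aligned={passes_block_check(size_k)}")
```

Under these assumptions TP=1 and TP=2 stay aligned (768 and 384) while TP=4 and TP=8 do not (192 and 96), which is the same pattern as in the original report.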

Proposed Fix:

The way to handle this is to make sure the dimensions are correctly aligned before calling moe_wna16_gemm. This usually means padding the K dimension up to the next multiple of BLOCK_SIZE_K.

The input activation tensor A needs its K dimension padded. This could potentially be done dynamically in Python (invoke_fused_moe_kernel in fused_moe.py) using torch.nn.functional.pad just before calling the operation.

The weight tensors (B, B_scale, B_zp) also need their K dimension aligned. Padding these dynamically every time would be slow. The better approach is likely to handle this during model loading/sharding so the weights on each worker already have the padded K dimension. This might involve changes to the model class or weight loading code.

Just removing the TORCH_CHECK in the C++ code won't work; it would just hide the problem and likely cause GPU errors or incorrect results later.
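
The padding idea can be sketched in plain Python, with lists standing in for tensors. The helpers `round_up`, `pad_k` and `dot` are hypothetical names for illustration, not vLLM or PyTorch APIs. The sketch only shows that zero-padding K leaves an unquantized dot product unchanged; quantized weights would also need consistent padded scales and zero points, which it does not cover.

```python
def round_up(size_k: int, block_size_k: int) -> int:
    """Smallest multiple of block_size_k that is >= size_k."""
    return -(-size_k // block_size_k) * block_size_k


def pad_k(row: list[float], block_size_k: int) -> list[float]:
    """Append zeros so the row length (K) becomes a multiple of block_size_k."""
    return row + [0.0] * (round_up(len(row), block_size_k) - len(row))


def dot(a: list[float], b: list[float]) -> float:
    """Plain dot product over the K dimension."""
    return sum(x * y for x, y in zip(a, b))


activation = [1.0, 2.0, 3.0]  # toy activation row with K=3
weight = [0.5, -1.0, 2.0]     # toy weight column with K=3

padded_activation = pad_k(activation, 4)  # K padded from 3 to 4
padded_weight = pad_k(weight, 4)

# The padded zeros contribute nothing to the result.
print(dot(activation, weight), dot(padded_activation, padded_weight))
```

Because the padded activation entries are zero, the extra K positions add nothing to the output, which is why padding is safer than dropping the check.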

princepride avatar May 03 '25 07:05 princepride

> Hi, regarding the failures you observed with tensor-parallel-size (TP) set to 3, 5, 6, or 7, the error message ValueError: Total number of attention heads (32) must be divisible by tensor parallel size points to the cause.
>
> This specific model (Qwen3-30B-A3B-AWQ) has 32 attention heads, as confirmed by its configuration (num_attention_heads: 32). A known requirement for tensor parallelism implementation in vLLM is that the total number of attention heads must be evenly divisible by the tensor parallel size for the attention mechanism to be parallelized correctly.
>
> Since 32 is not divisible by 3, 5, 6, or 7, the failure you see with these TP sizes is actually expected behavior due to this fundamental constraint of the tensor parallelism implementation for attention layers.

I'm aware. I included that information for completeness.

ehartford avatar May 03 '25 15:05 ehartford

I am testing today with https://huggingface.co/cognitivecomputations/Qwen3-235B-A22B-AWQ and expect to see the same pattern. I will update with the results.

ehartford avatar May 03 '25 15:05 ehartford

Tested with https://huggingface.co/cognitivecomputations/Qwen3-235B-A22B-AWQ and experienced the same issue; a quick Google search got me here. TP=8, with 8x RTX 3090s.

TheAhmadOsman avatar May 04 '25 14:05 TheAhmadOsman

@TheAhmadOsman @ehartford Have you tested the original model, https://huggingface.co/Qwen/Qwen3-235B-A22B?

princepride avatar May 05 '25 05:05 princepride

Also tested with Qwen/Qwen3-235B-A22B-GPTQ-Int4 and experienced the same issue. TP=8, with 8x A6000

braxtynmd avatar May 12 '25 20:05 braxtynmd

Also tested with Qwen/Qwen3-235B-A22B-GPTQ-Int4 and same issue with 8x3090.

Nero10578 avatar May 13 '25 11:05 Nero10578

I have the same issue with https://huggingface.co/Qwen/Qwen3-235B-A22B-GPTQ-Int4 using TP=4

chriswritescode-dev avatar May 14 '25 17:05 chriswritescode-dev

I have the same issue with OPEA/DeepSeek-V2.5-1210-int4-sym-inc using TP=8; TP=4 is OK.

eecspan avatar May 27 '25 03:05 eecspan

I have the same issue with https://huggingface.co/Qwen/Qwen3-235B-A22B-GPTQ-Int4 using TP=8

zhangao0086 avatar Jun 10 '25 03:06 zhangao0086

Same issue using a custom AWQ-quantized model.

JL-Cheng avatar Jun 17 '25 09:06 JL-Cheng

+1 on 4x 3090 with TP=4 (although I was able to run it a few weeks ago with the same settings?)

xmanners avatar Jun 18 '25 04:06 xmanners

Strangely, though, vLLM runs successfully when using enforce-eager.

xmanners avatar Jun 18 '25 04:06 xmanners

I have the same issue with https://huggingface.co/Qwen/Qwen3-235B-A22B-GPTQ-Int4 using TP=8, with 8x L20

MrBlue-1996 avatar Jun 25 '25 07:06 MrBlue-1996

Environment: 8x A800-40G

docker : vllm/vllm-openai:latest

pip3 list |grep vllm
vllm 0.9.1
root@lx-a800-403-j11-16u:/vllm-workspace# nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2025 NVIDIA Corporation
Built on Fri_Feb_21_20:23:50_PST_2025
Cuda compilation tools, release 12.8, V12.8.93
Build cuda_12.8.r12.8/compiler.35583870_0

model: https://modelscope.cn/models/Qwen/Qwen3-235B-A22B-GPTQ-Int4

server:

root@lx-a800-403-j11-16u:/vllm-workspace# VLLM_USE_MODELSCOPE=true vllm serve /data1/llm-models/qwen/Qwen3-235B-A22B-GPTQ-Int4 \
  --enable-reasoning \
  --reasoning-parser deepseek_r1 \
  --gpu-memory-utilization 0.85 \
  -tp 8 \
  --enforce-eager \
  --served-model-name qwq-235b-gptq-int4 \
  --trust-remote-code

The command starts successfully, but posting a request produces an error.

client:

curl http://0.0.0.0:8000/v1/completions -H "Content-Type: application/json" -d '{"model": "qwq-235b-gptq-int4","prompt": "hello","max_tokens": 7,"temperature": 0}'
{"object":"error","message":"EngineCore encountered an issue. See stack trace (above) for the root cause.","type":"Internal Server Error","param":null,"code":500}

Then server:

ERROR 06-26 00:04:52 [core.py:517] raise RuntimeError(
ERROR 06-26 00:04:52 [core.py:517] RuntimeError: Worker failed with error 'size_k must divisible by BLOCK_SIZE_K', please check the stack trace above for the root cause
ERROR 06-26 00:04:52 [async_llm.py:420] AsyncLLM output_handler failed.
ERROR 06-26 00:04:52 [async_llm.py:420] Traceback (most recent call last):
ERROR 06-26 00:04:52 [async_llm.py:420] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 379, in output_handler
ERROR 06-26 00:04:52 [async_llm.py:420] outputs = await engine_core.get_output_async()
ERROR 06-26 00:04:52 [async_llm.py:420] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-26 00:04:52 [async_llm.py:420] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 790, in get_output_async
ERROR 06-26 00:04:52 [async_llm.py:420] raise self._format_exception(outputs) from None
ERROR 06-26 00:04:52 [async_llm.py:420] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
INFO 06-26 00:04:52 [async_llm.py:346] Request cmpl-aaa2110c5e334d49bc6d0b4203b3f46c-0 failed (engine dead).
INFO: 127.0.0.1:49422 - "POST /v1/completions HTTP/1.1" 500 Internal Server Error
INFO: Shutting down
INFO: Waiting for application shutdown.
INFO: Application shutdown complete.
INFO: Finished server process [9394]

xinyihaha avatar Jun 26 '25 07:06 xinyihaha

I have the same issue with https://huggingface.co/Qwen/Qwen3-235B-A22B-GPTQ-Int4 using TP=8. I printed size_k and BLOCK_SIZE_K: size_k = 192, BLOCK_SIZE_K = 128. What needs to change?
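
For reference, the arithmetic behind those two numbers can be worked out as follows. This is only a sketch: the helper names are hypothetical, and whether the kernel can actually run with a smaller block size or a padded K is not established in this thread.

```python
def next_multiple(size_k: int, block_size_k: int) -> int:
    """Smallest multiple of block_size_k that is >= size_k."""
    return -(-size_k // block_size_k) * block_size_k


def largest_pow2_divisor(n: int) -> int:
    """Largest power of two that divides n (its lowest set bit)."""
    return n & -n


size_k, block_size_k = 192, 128  # values printed in the comment above

print(size_k % block_size_k)                # 64, so the check fails
print(next_multiple(size_k, block_size_k))  # 256, the padded K that would pass
print(largest_pow2_divisor(size_k))         # 64, the largest power-of-two block that divides 192
```

So either K would have to grow to 256 per rank, or the block size would have to shrink to 64 or below, for the divisibility condition to hold.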

fp674018495 avatar Jul 07 '25 09:07 fp674018495

For errors caused by excessive tensor parallelism, you can set --enable-expert-parallel. Refer to: https://github.com/vllm-project/vllm/issues/17327
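
A rough sketch of why this flag may sidestep the check: with tensor parallelism each expert's K dimension is sliced across ranks, whereas with expert parallelism each rank holds a subset of whole experts. The expert count and the unsharded K below are assumptions (K is back-calculated from the size_k = 192 reported at TP=8), and the function names are hypothetical, not vLLM code.

```python
BLOCK_SIZE_K = 128  # value printed by a commenter in this thread
NUM_EXPERTS = 128   # assumption: illustrative expert count
EXPERT_K = 192 * 8  # assumption: size_k of 192 at TP=8, split evenly


def shard_tensor_parallel(num_experts: int, expert_k: int, world_size: int) -> tuple[int, int]:
    """TP: every rank keeps all experts, each with K sliced across ranks."""
    return num_experts, expert_k // world_size


def shard_expert_parallel(num_experts: int, expert_k: int, world_size: int) -> tuple[int, int]:
    """EP: every rank keeps a subset of experts, each with its full K."""
    return num_experts // world_size, expert_k


for name, shard in (("TP", shard_tensor_parallel), ("EP", shard_expert_parallel)):
    experts, size_k = shard(NUM_EXPERTS, EXPERT_K, 8)
    aligned = size_k % BLOCK_SIZE_K == 0
    print(f"{name}: experts/rank={experts}, size_k={size_k}, aligned={aligned}")
```

Under these assumptions the TP layout yields size_k = 192 (misaligned) while the EP layout keeps size_k = 1536 (aligned).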

llmadd avatar Jul 16 '25 05:07 llmadd

> For errors caused by excessive tensor parallelism, you can set --enable-expert-parallel. Refer to: #17327

Deploying a model with such settings might reduce inference efficiency.

fp674018495 avatar Jul 18 '25 05:07 fp674018495

> > For errors caused by excessive tensor parallelism, you can set --enable-expert-parallel. Refer to: #17327
>
> Deploying a model with such settings might reduce inference efficiency.

I'm not quite sure. I tried EP or TP 4 PP 2, and it worked. Is the inference efficiency ranked as TP > PP > EP?

llmadd avatar Jul 18 '25 08:07 llmadd

> For errors caused by excessive tensor parallelism, you can set --enable-expert-parallel. Refer to: #17327

it works! thank u!

hediyuan avatar Jul 22 '25 06:07 hediyuan

Setting --enable-expert-parallel causes another problem on older GPUs (<sm80):

ValueError: Marlin does not support weight_bits = uint4. Only types = [] are supported (for group_size = 128, device_capability = 75, zp = True).

alexpong0630 avatar Aug 02 '25 10:08 alexpong0630

Just so you know, we were able to run it successfully, and this --enable-expert-parallel helped us get a step closer.

We're running on a g5.48xlarge with 8x NVIDIA A10G.

$ vllm serve Qwen/Qwen3-235B-A22B-GPTQ-Int4 \
  [...]
  --tensor-parallel-size=8 \
  --enable-expert-parallel \
  --gpu-memory-utilization=0.8

Moep90 avatar Aug 05 '25 07:08 Moep90

> Just so you know, we were able to run it successfully, and this --enable-expert-parallel helped us get a step closer.
>
> We're running on a g5.48xlarge with 8x NVIDIA A10G.
>
> $ vllm serve Qwen/Qwen3-235B-A22B-GPTQ-Int4 \
>   [...]
>   --tensor-parallel-size=8 \
>   --enable-expert-parallel \
>   --gpu-memory-utilization=0.8

Haha, me too – A100 * 8 and Qwen3-235B-A22B-GPTQ-Int4

llmadd avatar Aug 05 '25 08:08 llmadd

> > For errors caused by excessive tensor parallelism, you can set --enable-expert-parallel. Refer to: #17327
>
> Deploying a model with such settings might reduce inference efficiency.

Is there any reference for this?

10jin-yidiandian avatar Aug 28 '25 06:08 10jin-yidiandian

This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!

github-actions[bot] avatar Nov 29 '25 02:11 github-actions[bot]