[Bug]: `size_k must divisible by BLOCK_SIZE_K` error when using tensor parallelism with AWQ-quantized MoE models
Your current environment
The output of python collect_env.py
PyTorch version: 2.7.0+cu126
Is debug build: False
CUDA used to build PyTorch: 12.6
ROCM used to build PyTorch: N/A
OS: Ubuntu 22.04.4 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.22.1
Libc version: glibc-2.35
Python version: 3.12.9 | packaged by Anaconda, Inc. | (main, Feb 6 2025, 18:56:27) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.15.0-1053-nvidia-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 12.8.93
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA A100-SXM4-80GB
GPU 1: NVIDIA A100-SXM4-80GB
GPU 2: NVIDIA A100-SXM4-80GB
GPU 3: NVIDIA A100-SXM4-80GB
GPU 4: NVIDIA A100-SXM4-80GB
GPU 5: NVIDIA A100-SXM4-80GB
GPU 6: NVIDIA A100-SXM4-80GB
GPU 7: NVIDIA A100-SXM4-80GB
Nvidia driver version: 535.161.08
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 43 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 256
On-line CPU(s) list: 0-255
Vendor ID: AuthenticAMD
Model name: AMD EPYC 7742 64-Core Processor
CPU family: 23
Model: 49
Thread(s) per core: 2
Core(s) per socket: 64
Socket(s): 2
Stepping: 0
Frequency boost: enabled
CPU max MHz: 2250.0000
CPU min MHz: 1500.0000
BogoMIPS: 4491.41
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd amd_ppin arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip rdpid overflow_recov succor smca sme sev sev_es
Virtualization: AMD-V
L1d cache: 4 MiB (128 instances)
L1i cache: 4 MiB (128 instances)
L2 cache: 64 MiB (128 instances)
L3 cache: 512 MiB (32 instances)
NUMA node(s): 8
NUMA node0 CPU(s): 0-15,128-143
NUMA node1 CPU(s): 16-31,144-159
NUMA node2 CPU(s): 32-47,160-175
NUMA node3 CPU(s): 48-63,176-191
NUMA node4 CPU(s): 64-79,192-207
NUMA node5 CPU(s): 80-95,208-223
NUMA node6 CPU(s): 96-111,224-239
NUMA node7 CPU(s): 112-127,240-255
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Retbleed: Mitigation; untrained return thunk; SMT enabled with STIBP protection
Vulnerability Spec rstack overflow: Mitigation; safe RET
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Retpolines, IBPB conditional, STIBP always-on, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-cublas-cu12==12.6.4.1
[pip3] nvidia-cuda-cupti-cu12==12.6.80
[pip3] nvidia-cuda-nvrtc-cu12==12.6.77
[pip3] nvidia-cuda-runtime-cu12==12.6.77
[pip3] nvidia-cudnn-cu12==9.5.1.17
[pip3] nvidia-cufft-cu12==11.3.0.4
[pip3] nvidia-cufile-cu12==1.11.1.6
[pip3] nvidia-curand-cu12==10.3.7.77
[pip3] nvidia-cusolver-cu12==11.7.1.2
[pip3] nvidia-cusparse-cu12==12.5.4.2
[pip3] nvidia-cusparselt-cu12==0.6.3
[pip3] nvidia-ml-py==12.570.86
[pip3] nvidia-nccl-cu12==2.26.2
[pip3] nvidia-nvjitlink-cu12==12.6.85
[pip3] nvidia-nvtx-cu12==12.6.77
[pip3] pynvml==12.0.0
[pip3] pytorch-triton==3.3.0+git96316ce5
[pip3] pyzmq==26.4.0
[pip3] torch==2.7.0
[pip3] torchaudio==2.7.0
[pip3] torchvision==0.22.0
[pip3] transformers==4.52.0.dev0
[pip3] triton==3.3.0
[conda] numpy 1.26.4 pypi_0 pypi
[conda] nvidia-cublas-cu12 12.6.4.1 pypi_0 pypi
[conda] nvidia-cuda-cupti-cu12 12.6.80 pypi_0 pypi
[conda] nvidia-cuda-nvrtc-cu12 12.6.77 pypi_0 pypi
[conda] nvidia-cuda-runtime-cu12 12.6.77 pypi_0 pypi
[conda] nvidia-cudnn-cu12 9.5.1.17 pypi_0 pypi
[conda] nvidia-cufft-cu12 11.3.0.4 pypi_0 pypi
[conda] nvidia-cufile-cu12 1.11.1.6 pypi_0 pypi
[conda] nvidia-curand-cu12 10.3.7.77 pypi_0 pypi
[conda] nvidia-cusolver-cu12 11.7.1.2 pypi_0 pypi
[conda] nvidia-cusparse-cu12 12.5.4.2 pypi_0 pypi
[conda] nvidia-cusparselt-cu12 0.6.3 pypi_0 pypi
[conda] nvidia-ml-py 12.570.86 pypi_0 pypi
[conda] nvidia-nccl-cu12 2.26.2 pypi_0 pypi
[conda] nvidia-nvjitlink-cu12 12.6.85 pypi_0 pypi
[conda] nvidia-nvtx-cu12 12.6.77 pypi_0 pypi
[conda] pynvml 12.0.0 pypi_0 pypi
[conda] pytorch-triton 3.3.0+git96316ce5 pypi_0 pypi
[conda] pyzmq 26.4.0 pypi_0 pypi
[conda] torch 2.7.0 pypi_0 pypi
[conda] torchaudio 2.7.0 pypi_0 pypi
[conda] torchvision 0.22.0 pypi_0 pypi
[conda] transformers 4.52.0.dev0 pypi_0 pypi
[conda] triton 3.3.0 pypi_0 pypi
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.8.5.dev443+gb90b0852e (git sha: b90b0852e)
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 NIC0 NIC1 NIC2 NIC3 NIC4 NIC5 NIC6 NIC7 NIC8 NIC9 NIC10 NIC11 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X NV12 NV12 NV12 NV12 NV12 NV12 NV12 PXB PXB SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS 48-63,176-191 3 N/A
GPU1 NV12 X NV12 NV12 NV12 NV12 NV12 NV12 PXB PXB SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS 48-63,176-191 3 N/A
GPU2 NV12 NV12 X NV12 NV12 NV12 NV12 NV12 SYS SYS PXB PXB SYS SYS SYS SYS SYS SYS SYS SYS 16-31,144-159 1 N/A
GPU3 NV12 NV12 NV12 X NV12 NV12 NV12 NV12 SYS SYS PXB PXB SYS SYS SYS SYS SYS SYS SYS SYS 16-31,144-159 1 N/A
GPU4 NV12 NV12 NV12 NV12 X NV12 NV12 NV12 SYS SYS SYS SYS SYS SYS PXB PXB SYS SYS SYS SYS 112-127,240-255 7 N/A
GPU5 NV12 NV12 NV12 NV12 NV12 X NV12 NV12 SYS SYS SYS SYS SYS SYS PXB PXB SYS SYS SYS SYS 112-127,240-255 7 N/A
GPU6 NV12 NV12 NV12 NV12 NV12 NV12 X NV12 SYS SYS SYS SYS SYS SYS SYS SYS PXB PXB SYS SYS 80-95,208-223 5 N/A
GPU7 NV12 NV12 NV12 NV12 NV12 NV12 NV12 X SYS SYS SYS SYS SYS SYS SYS SYS PXB PXB SYS SYS 80-95,208-223 5 N/A
NIC0 PXB PXB SYS SYS SYS SYS SYS SYS X PXB SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS
NIC1 PXB PXB SYS SYS SYS SYS SYS SYS PXB X SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS
NIC2 SYS SYS PXB PXB SYS SYS SYS SYS SYS SYS X PXB SYS SYS SYS SYS SYS SYS SYS SYS
NIC3 SYS SYS PXB PXB SYS SYS SYS SYS SYS SYS PXB X SYS SYS SYS SYS SYS SYS SYS SYS
NIC4 SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS X PIX SYS SYS SYS SYS SYS SYS
NIC5 SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS PIX X SYS SYS SYS SYS SYS SYS
NIC6 SYS SYS SYS SYS PXB PXB SYS SYS SYS SYS SYS SYS SYS SYS X PXB SYS SYS SYS SYS
NIC7 SYS SYS SYS SYS PXB PXB SYS SYS SYS SYS SYS SYS SYS SYS PXB X SYS SYS SYS SYS
NIC8 SYS SYS SYS SYS SYS SYS PXB PXB SYS SYS SYS SYS SYS SYS SYS SYS X PXB SYS SYS
NIC9 SYS SYS SYS SYS SYS SYS PXB PXB SYS SYS SYS SYS SYS SYS SYS SYS PXB X SYS SYS
NIC10 SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS X PIX
NIC11 SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS PIX X
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
NIC Legend:
NIC0: mlx5_0
NIC1: mlx5_1
NIC2: mlx5_2
NIC3: mlx5_3
NIC4: mlx5_4
NIC5: mlx5_5
NIC6: mlx5_6
NIC7: mlx5_7
NIC8: mlx5_8
NIC9: mlx5_9
NIC10: mlx5_10
NIC11: mlx5_11
NCCL_CUMEM_ENABLE=0
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1
CUDA_MODULE_LOADING=LAZY
🐛 Describe the bug
When attempting to use tensor parallelism with AWQ-quantized MoE models (specifically Qwen3-30B-A3B-AWQ), I'm encountering errors whose behavior varies depending on the tensor parallel size:
- TP=1: Works correctly
- TP=2: Works correctly
- TP=3, 5, 6, 7: fails fast with:
  ValueError: Total number of attention heads (32) must be divisible by tensor parallel size
- TP=4, 8: starts loading but fails during initialization with:
  RuntimeError: Worker failed with error 'size_k must divisible by BLOCK_SIZE_K', please check the stack trace above for the root cause
Reproduction Steps
- Clone the latest vLLM repo
- Install with pip install -e .
- Run the following command:
python -m vllm.entrypoints.api_server \
--model CognitiveComputations/Qwen3-30B-A3B-AWQ \
--gpu-memory-utilization 0.9 \
--max-model-len 32768 \
--max-num-seqs 64 \
--tensor-parallel-size 4 \
--host 127.0.0.1 \
--port 8080
Expected Behavior
The model should load successfully with tensor parallelism.
Actual Behavior
For TP=4 and TP=8, the error occurs during model initialization in the fused_moe/fused_moe.py code:
File "/raid/workspace/vllm/vllm/model_executor/layers/fused_moe/fused_moe.py", line 526, in invoke_fused_moe_kernel
ops.moe_wna16_gemm(A, C, B, B_scale, B_zp,
File "/raid/workspace/vllm/vllm/_custom_ops.py", line 1265, in moe_wna16_gemm
torch.ops._moe_C.moe_wna16_gemm(input, output, b_qweight, b_scales,
RuntimeError: size_k must divisible by BLOCK_SIZE_K
For TP=3, 5, 6, 7, the error is that the total number of attention heads (32) is not divisible by the tensor parallel size.
Environment Information
- vLLM version: 0.8.5.dev443 (development version)
- PyTorch version: 2.7.0
- CUDA version: 12.6
- Hardware: 8x A100 GPUs
- Operating System: Linux
Additional Context
- Every MoE layer in the model shows this warning before the error occurs:
  WARNING: Layer 'model.layers.*.mlp.experts' is not supported by AWQMoeMarlin. Falling back to Moe WNA16 kernels.
  The failure then happens when trying to use those WNA16 kernels with tensor parallelism.
- This issue seems to be specific to AWQ-quantized MoE models with tensor parallelism, as:
- The model works fine without tensor parallelism
- Tensor parallelism itself works at TP=2, even though TP=4 and 8 also satisfy the attention-head divisibility constraint
- The issue occurs specifically in the MoE AWQ quantization code path
Potential Solution Direction
There seems to be an issue with how the matrix dimensions are handled when using tensor parallelism with AWQ-MoE models. The error suggests that when using tensor parallelism, the matrix dimensions used in the MoE layers are not properly aligned with the required block size for the AWQ WNA16 kernels.
Before submitting a new issue...
- [x] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
Hi, regarding the failures you observed with tensor-parallel-size (TP) set to 3, 5, 6, or 7, the error message ValueError: Total number of attention heads (32) must be divisible by tensor parallel size points to the cause.
This specific model (Qwen3-30B-A3B-AWQ) has 32 attention heads, as confirmed by its configuration (num_attention_heads: 32). A known requirement of vLLM's tensor parallelism implementation is that the total number of attention heads must be evenly divisible by the tensor parallel size for the attention mechanism to be parallelized correctly.
Since 32 is not divisible by 3, 5, 6, or 7, the failure you see with these TP sizes is actually expected behavior due to this fundamental constraint of the tensor parallelism implementation for attention layers.
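For illustration, the constraint boils down to a simple divisibility check. This is a minimal sketch, not vLLM's actual validation code:

```python
# Minimal sketch of the attention-head divisibility constraint described
# above; illustrative only, not vLLM's actual validation code.
num_attention_heads = 32  # from the Qwen3-30B-A3B-AWQ config

for tp_size in range(1, 9):
    if num_attention_heads % tp_size == 0:
        print(f"TP={tp_size}: passes the attention-head check")
    else:
        print(f"TP={tp_size}: rejected ({num_attention_heads} % {tp_size} != 0)")
# TP=1, 2, 4, 8 pass; TP=3, 5, 6, 7 fail fast before any weights are loaded.
```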
@DarkLight1337 I can confirm I'm also seeing the RuntimeError: size_k must divisible by BLOCK_SIZE_K when running CognitiveComputations/Qwen3-30B-A3B-AWQ with tensor-parallel-size=4 (and likely TP=8 too). The logs show this happens after vLLM falls back to the Moe WNA16 kernels for the MoE layers.
Root Cause:
The error comes directly from a check in the C++ code for moe_wna16_gemm (vllm/csrc/moe/moe_wna16_gemm.cu): TORCH_CHECK(size_k % BLOCK_SIZE_K == 0, ...). This check makes sure the K dimension (size_k) going into the CUDA kernel is a multiple of BLOCK_SIZE_K. The kernel needs this alignment for performance/correctness.
With TP=4, when the model's K dimension (like hidden_size or intermediate_size) gets split across GPUs, the resulting size_k on each worker isn't divisible by BLOCK_SIZE_K for this model. This causes the check to fail during the model warm-up/graph capture (_dummy_run).
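To make the arithmetic concrete, here is a small sketch. The moe_intermediate_size of 768 is taken from the Qwen3-30B-A3B model config, and BLOCK_SIZE_K = 128 matches what a commenter further down the thread reports printing; treat both as assumptions rather than values read from the kernel source.

```python
# Sketch of the per-worker K dimension after tensor-parallel sharding.
# Assumes moe_intermediate_size = 768 (Qwen3-30B-A3B config) and
# BLOCK_SIZE_K = 128 (reported by a commenter below); both are assumptions.
BLOCK_SIZE_K = 128
moe_intermediate_size = 768  # per-expert intermediate size before sharding

for tp_size in (1, 2, 4, 8):
    size_k = moe_intermediate_size // tp_size  # K dim of each worker's shard
    status = "ok" if size_k % BLOCK_SIZE_K == 0 else "fails TORCH_CHECK"
    print(f"TP={tp_size}: size_k={size_k} -> {status}")
# TP=1 -> 768 (ok), TP=2 -> 384 (ok), TP=4 -> 192 (fails), TP=8 -> 96 (fails),
# matching the observed behavior: TP=1/2 work, TP=4/8 crash.
```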
Proposed Fix:
The way to handle this is to make sure the dimensions are correctly aligned before calling moe_wna16_gemm. This usually means padding the K dimension up to the next multiple of BLOCK_SIZE_K.
- The input activation tensor A needs its K dimension padded. This could potentially be done dynamically in Python (invoke_fused_moe_kernel in fused_moe.py) using torch.nn.functional.pad just before calling the operation; see the sketch after this list.
- The weight tensors (B, B_scale, B_zp) also need their K dimension aligned. Padding these dynamically on every call would be slow, so the better approach is likely to handle it during model loading/sharding so the weights on each worker already have the padded K dimension. This might involve changes to the model class or weight-loading code.
- Just removing the TORCH_CHECK in the C++ code won't work; it would only hide the problem and likely cause GPU errors or incorrect results later.
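As a rough illustration of the dynamic-padding idea for the activation side only (a minimal sketch; the harder weight-side padding at load time is not shown, and none of this is vLLM's actual code):

```python
import torch
import torch.nn.functional as F

def pad_k_to_block(a: torch.Tensor, block_size_k: int) -> torch.Tensor:
    """Zero-pad the last (K) dimension of `a` up to the next multiple of
    block_size_k. Illustrative only: a real fix must also pad the quantized
    weights, scales, and zero-points consistently, ideally at load time."""
    size_k = a.shape[-1]
    padded_k = -(-size_k // block_size_k) * block_size_k  # ceil to multiple
    if padded_k == size_k:
        return a  # already aligned, nothing to do
    # F.pad pads dimensions from the last one backwards: (left, right).
    return F.pad(a, (0, padded_k - size_k))

a = torch.randn(64, 192)             # per-worker K = 192, as seen at TP=4
print(pad_k_to_block(a, 128).shape)  # torch.Size([64, 256])
```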
> Hi, regarding the failures you observed with tensor-parallel-size (TP) set to 3, 5, 6, or 7, the error message ValueError: Total number of attention heads (32) must be divisible by tensor parallel size points to the cause.
> Since 32 is not divisible by 3, 5, 6, or 7, the failure you see with these TP sizes is expected behavior.
I'm aware. I included that information for completeness.
I am testing today with https://huggingface.co/cognitivecomputations/Qwen3-235B-A22B-AWQ; I expect to see the same pattern and will update with the results.
Tested with https://huggingface.co/cognitivecomputations/Qwen3-235B-A22B-AWQ and experienced the same issue; a quick Google search got me here. TP=8, with 8x RTX 3090s.
@TheAhmadOsman @ehartford Have you tested the original model https://huggingface.co/Qwen/Qwen3-235B-A22B?
Also tested with Qwen/Qwen3-235B-A22B-GPTQ-Int4 and experienced the same issue. TP=8, with 8x A6000
Also tested with Qwen/Qwen3-235B-A22B-GPTQ-Int4; same issue with 8x 3090s.
I have the same issue with https://huggingface.co/Qwen/Qwen3-235B-A22B-GPTQ-Int4 using TP=4
I have the same issue with OPEA/DeepSeek-V2.5-1210-int4-sym-inc using TP=8; TP=4 is OK.
I have the same issue with https://huggingface.co/Qwen/Qwen3-235B-A22B-GPTQ-Int4 using TP=8
The same issue with a custom AWQ-quantized model.
+1 on 4x 3090 using TP=4 (though I could run it a few weeks ago with the same settings?).
Strangely, when using --enforce-eager, vLLM runs successfully.
I have the same issue with https://huggingface.co/Qwen/Qwen3-235B-A22B-GPTQ-Int4 using TP=8, with 8x L20.
The environment: 8x A800-40G
docker : vllm/vllm-openai:latest
pip3 list | grep vllm
vllm 0.9.1
root@lx-a800-403-j11-16u:/vllm-workspace# nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2025 NVIDIA Corporation
Built on Fri_Feb_21_20:23:50_PST_2025
Cuda compilation tools, release 12.8, V12.8.93
Build cuda_12.8.r12.8/compiler.35583870_0
model:https://modelscope.cn/models/Qwen/Qwen3-235B-A22B-GPTQ-Int4
server:
root@lx-a800-403-j11-16u:/vllm-workspace# VLLM_USE_MODELSCOPE=true vllm serve /data1/llm-models/qwen/Qwen3-235B-A22B-GPTQ-Int4 --enable-reasoning --reasoning-parser deepseek_r1 --gpu-memory-utilization 0.85 -tp 8 --enforce-eager --served-model-name qwq-235b-gptq-int4 --trust-remote-code
The command runs successfully, but posting a request produces an error.
client:
curl http://0.0.0.0:8000/v1/completions -H "Content-Type: application/json" -d '{"model": "qwq-235b-gptq-int4","prompt": "hello","max_tokens": 7,"temperature": 0}'
{"object":"error","message":"EngineCore encountered an issue. See stack trace (above) for the root cause.","type":"Internal Server Error","param":null,"code":500}
Then server:
ERROR 06-26 00:04:52 [core.py:517]     raise RuntimeError(
ERROR 06-26 00:04:52 [core.py:517] RuntimeError: Worker failed with error 'size_k must divisible by BLOCK_SIZE_K', please check the stack trace above for the root cause
ERROR 06-26 00:04:52 [async_llm.py:420] AsyncLLM output_handler failed.
ERROR 06-26 00:04:52 [async_llm.py:420] Traceback (most recent call last):
ERROR 06-26 00:04:52 [async_llm.py:420]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 379, in output_handler
ERROR 06-26 00:04:52 [async_llm.py:420]     outputs = await engine_core.get_output_async()
ERROR 06-26 00:04:52 [async_llm.py:420]     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-26 00:04:52 [async_llm.py:420]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 790, in get_output_async
ERROR 06-26 00:04:52 [async_llm.py:420]     raise self._format_exception(outputs) from None
ERROR 06-26 00:04:52 [async_llm.py:420] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
INFO 06-26 00:04:52 [async_llm.py:346] Request cmpl-aaa2110c5e334d49bc6d0b4203b3f46c-0 failed (engine dead).
INFO:     127.0.0.1:49422 - "POST /v1/completions HTTP/1.1" 500 Internal Server Error
INFO:     Shutting down
INFO:     Waiting for application shutdown.
INFO:     Application shutdown complete.
INFO:     Finished server process [9394]
I have the same issue with https://huggingface.co/Qwen/Qwen3-235B-A22B-GPTQ-Int4 using TP=8. I printed size_k and BLOCK_SIZE_K: size_k = 192, BLOCK_SIZE_K = 128. So what needs to change?
For errors caused by excessive tensor parallelism, you can set --enable-expert-parallel. Refer to: https://github.com/vllm-project/vllm/issues/17327
> For errors caused by excessive tensor parallelism, you can set --enable-expert-parallel. Refer to: #17327
Deploying a model with such settings might reduce the inference efficiency.
> For errors caused by excessive tensor parallelism, you can set --enable-expert-parallel. Refer to: #17327
> Deploying a model with such settings might reduce the inference efficiency.
I'm not quite sure. I tried EP, and also TP=4 with PP=2, and it worked. Is inference efficiency ranked as TP > PP > EP?
> For errors caused by excessive tensor parallelism, you can set --enable-expert-parallel. Refer to: #17327
It works! Thank you!
Setting --enable-expert-parallel causes another problem on older GPUs (< sm80):
ValueError: Marlin does not support weight_bits = uint4. Only types = [] are supported (for group_size = 128, device_capability = 75, zp = True).
Just so you know, we were able to run it successfully, and this --enable-expert-parallel helped us get a step closer.
We're running on a g5.48xlarge with 8x NVIDIA A10G.
$ vllm serve Qwen/Qwen3-235B-A22B-GPTQ-Int4 \
[...]
--tensor-parallel-size=8 \
--enable-expert-parallel \
--gpu-memory-utilization=0.8
> Just so you know, we were able to run it successfully, and this --enable-expert-parallel helped us get a step closer. We're running on a g5.48xlarge with 8x NVIDIA A10G.
Haha, me too – A100 * 8 and Qwen3-235B-A22B-GPTQ-Int4
> For errors caused by excessive tensor parallelism, you can set --enable-expert-parallel. Refer to: #17327
> Deploying a model with such settings might reduce the inference efficiency.
Is there any reference?
This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!