[Bug] Autotuning + trtllm_fp4_block_scale_routed_moe Issue
Hey guys, I am facing an issue when using trtllm_fp4_block_scale_routed_moe with the autotuner.
Issue:
Calling trtllm_fp4_block_scale_routed_moe with autotune enabled results in the following error:
(EngineCore_DP0 pid=2361841) ERROR 11-01 12:49:12 [core.py:800] File "/home/varun-sundar-rabindranath/code/vllm/vllm/model_executor/layers/fused_moe/trtllm_moe.py", line 140, in apply
(EngineCore_DP0 pid=2361841) ERROR 11-01 12:49:12 [core.py:800] trtllm_fp4_block_scale_routed_moe(**kwargs)
(EngineCore_DP0 pid=2361841) ERROR 11-01 12:49:12 [core.py:800] File "/home/varun-sundar-rabindranath/code/vllm/vllm-test/lib/python3.12/site-packages/flashinfer/fused_moe/core.py", line 2061, in trtllm_fp4_block_scale_routed_moe
(EngineCore_DP0 pid=2361841) ERROR 11-01 12:49:12 [core.py:800] return get_trtllm_moe_sm100_module().trtllm_fp4_block_scale_moe(
(EngineCore_DP0 pid=2361841) ERROR 11-01 12:49:12 [core.py:800] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=2361841) ERROR 11-01 12:49:12 [core.py:800] File "/home/varun-sundar-rabindranath/code/vllm/vllm-test/lib/python3.12/site-packages/flashinfer/fused_moe/core.py", line 1540, in trtllm_fp4_block_scale_moe_op
(EngineCore_DP0 pid=2361841) ERROR 11-01 12:49:12 [core.py:800] _, tactic = tuner.choose_one(
(EngineCore_DP0 pid=2361841) ERROR 11-01 12:49:12 [core.py:800] ^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=2361841) ERROR 11-01 12:49:12 [core.py:800] File "/home/varun-sundar-rabindranath/code/vllm/vllm-test/lib/python3.12/site-packages/flashinfer/autotuner.py", line 477, in choose_one
(EngineCore_DP0 pid=2361841) ERROR 11-01 12:49:12 [core.py:800] r(tensors, tactic=-1, do_preparation=True, **kwargs)
(EngineCore_DP0 pid=2361841) ERROR 11-01 12:49:12 [core.py:800] File "/home/varun-sundar-rabindranath/code/vllm/vllm-test/lib/python3.12/site-packages/flashinfer/autotuner.py", line 217, in __call__
(EngineCore_DP0 pid=2361841) ERROR 11-01 12:49:12 [core.py:800] return self.forward(inputs, **kwargs)
(EngineCore_DP0 pid=2361841) ERROR 11-01 12:49:12 [core.py:800] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=2361841) ERROR 11-01 12:49:12 [core.py:800] File "/home/varun-sundar-rabindranath/code/vllm/vllm-test/lib/python3.12/site-packages/flashinfer/fused_moe/core.py", line 1086, in forward
(EngineCore_DP0 pid=2361841) ERROR 11-01 12:49:12 [core.py:800] moe_op.trtllm_fp4_block_scale_moe(
(EngineCore_DP0 pid=2361841) ERROR 11-01 12:49:12 [core.py:800] File "python/tvm_ffi/cython/function.pxi", line 744, in core.Function.__call__
(EngineCore_DP0 pid=2361841) ERROR 11-01 12:49:12 [core.py:800] File "python/tvm_ffi/cython/function.pxi", line 158, in core.TVMFFIPyArgSetterDLPackExchangeAPI_
(EngineCore_DP0 pid=2361841) ERROR 11-01 12:49:12 [core.py:800] RuntimeError: Cannot pack tensors on meta
Note that the stack trace is from vLLM, executed with the command:
VLLM_ALL2ALL_BACKEND="deepep_high_throughput" vllm serve openai/gpt-oss-20b --data-parallel-size 2 --tensor-parallel-size 1 --enable-expert-parallel --no-enable-prefix-caching --port 9010
but I believe it should be reproducible directly with flashinfer.
I did some investigation:
- In trtllm_fp4_block_scale_routed_moe(), the routing logits are set directly to None here: https://github.com/flashinfer-ai/flashinfer/blob/f9cd0345a162f4b19d62a0918ba027a3c59917a7/flashinfer/fused_moe/core.py#L2061
- The routing_logits are then allocated on the meta device here: https://github.com/flashinfer-ai/flashinfer/blob/f9cd0345a162f4b19d62a0918ba027a3c59917a7/flashinfer/fused_moe/core.py#L1530C13-L1530C18
This causes the autotuner to error out with the error above; a minimal illustration of the failing pattern is sketched below.
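To make the failure mode concrete: a tensor allocated on the meta device has no real storage, so any step that needs an actual device pointer (packing the tensor for a kernel launch, exporting it via DLPack, reading its data pointer) raises. A minimal sketch, not FlashInfer's actual code:

import torch

# Sketch only: a "meta" tensor has no storage, so it cannot be handed to a real
# kernel during the autotuner's warm-up/profiling run.
routing_logits = torch.empty((16384, 32), dtype=torch.bfloat16, device="meta")
try:
    routing_logits.data_ptr()  # stand-in for the DLPack packing done by tvm_ffi
except RuntimeError as e:
    print(f"meta tensor has no storage: {e}")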
Could someone please take a look? Thanks 🙌
@varun-sundar-rabindranath What vLLM version (commit) and what FlashInfer version did you use?
How do you install the DeepEP kernels? Which CUDA version are you using? I tried https://github.com/vllm-project/vllm/tree/main/tools/ep_kernels, but it seems some of the build steps need CUDA 12.9 while others need 13.0...
vLLM ToT + flashinfer ToT is runnable without VLLM_ALL2ALL_BACKEND="deepep_high_throughput"
python3 -m vllm.entrypoints.openai.api_server --model openai/gpt-oss-20b --data-parallel-size 2 --tensor-parallel-size 1 --enable-expert-parallel --no-enable-prefix-caching --port 8080 --max-model-len 8192
VLLM_ALL2ALL_BACKEND="deepep_high_throughput" is the only code path that uses the trtllm_fp4_block_scale_routed_moe API from FlashInfer.
@nvjullin
python3 -m vllm.entrypoints.openai.api_server --model openai/gpt-oss-20b --data-parallel-size 2 --tensor-parallel-size 1 --enable-expert-parallel --no-enable-prefix-caching --port 8080 --max-model-len 8192
This works because the All2All implementations used in this case send over the routing logits directly, so we end up using either trtllm_fp4_block_scale_moe or flashinfer_cutlass_fused_moe, neither of which has issues with autotune.
I'll see if I can get a smaller repro example that doesn't use vLLM.
How do you install the DeepEP kernels? Which CUDA version are you using? I tried https://github.com/vllm-project/vllm/tree/main/tools/ep_kernels, but it seems some of the build steps need CUDA 12.9 while others need 13.0...
@elvischenv please take a look at https://github.com/vllm-project/vllm/tree/3758757377b713b6acc997d0ac2c5dd49c332278/tools/ep_kernels
edit: I am using CUDA 13.0 on a B200
@varun-sundar-rabindranath I did use those scripts but encountered errors:
-- Configuring done (1.2s)
-- Generating done (0.4s)
-- Build files have been written to: /workspace/vllm/tools/ep_kernels/ep_kernels_workspace/nvshmem_build
+ cmake --build /workspace/vllm/tools/ep_kernels/ep_kernels_workspace/nvshmem_build/ --target install
[2/132] Building CUDA object src/CMakeFiles/nvshmem.dir/host/init/init.cu.o
FAILED: [code=1] src/CMakeFiles/nvshmem.dir/host/init/init.cu.o
/usr/local/cuda/bin/nvcc -forward-unknown-to-host-compiler -DNVSHMEM_X86_64 -D__STDC_CONSTANT_MACROS -D__STDC_LIMIT_MACROS -I/workspace/vllm/tools/ep_kernels/ep_kernels_workspace/nvshmem_src/src/include -I/workspace/vllm/tools/ep_kernels/ep_kernels_workspace/nvshmem_src/src/include/host/env -I/workspace/vllm/tools/ep_kernels/ep_kernels_workspace/nvshmem_src/src/host/stream/coll -I/workspace/vllm/tools/ep_kernels/ep_kernels_workspace/nvshmem_src/src/host/coll -I/workspace/vllm/tools/ep_kernels/ep_kernels_workspace/nvshmem_src/src/host/topo -isystem /usr/local/cuda/targets/x86_64-linux/include -isystem /usr/local/cuda/targets/x86_64-linux/include/cccl -O3 -DNDEBUG -std=c++11 "--generate-code=arch=compute_70,code=[sm_70]" "--generate-code=arch=compute_80,code=[sm_80]" "--generate-code=arch=compute_90,code=[sm_90]" "--generate-code=arch=compute_100,code=[compute_100,sm_100]" -Xcompiler=-fPIC -O3 --maxrregcount=32 -MD -MT src/CMakeFiles/nvshmem.dir/host/init/init.cu.o -MF src/CMakeFiles/nvshmem.dir/host/init/init.cu.o.d -x cu -rdc=true -c /workspace/vllm/tools/ep_kernels/ep_kernels_workspace/nvshmem_src/src/host/init/init.cu -o src/CMakeFiles/nvshmem.dir/host/init/init.cu.o
nvcc fatal : Unsupported gpu architecture 'compute_70'
The script is still building for the old arch compute_70, which is not supported by CUDA 13.0.
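For what it's worth, here is a quick check (a sketch; assumes nvcc is on PATH) to confirm which virtual architectures this nvcc still accepts:

import subprocess

# Diagnostic sketch: list the virtual architectures nvcc can target.
# With CUDA 13.0, compute_70 is no longer in this list, which is why the
# nvshmem build above fails.
archs = subprocess.run(
    ["nvcc", "--list-gpu-arch"], capture_output=True, text=True, check=True
).stdout.split()
print(archs)
print("compute_70 supported:", "compute_70" in archs)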
@elvischenv how are you invoking the scripts? Like this:
# for hopper
TORCH_CUDA_ARCH_LIST="9.0" bash install_python_libraries.sh
# for blackwell
TORCH_CUDA_ARCH_LIST="10.0" bash install_python_libraries.sh
Also, if you are using uv, can you set PIP_CMD like in https://github.com/vllm-project/vllm/blob/3758757377b713b6acc997d0ac2c5dd49c332278/tools/ep_kernels/install_python_libraries.sh#L15
Yes, I am using TORCH_CUDA_ARCH_LIST="10.0" bash install_python_libraries.sh.
The nvcc fatal : Unsupported gpu architecture 'compute_70' error is not related to uv, right? It should be related to the nvshmem version: the old nvshmem still supports sm70, so it builds for sm70, but CUDA 13.0 no longer supports sm70.
From the script (https://github.com/vllm-project/vllm/blob/14a125a06df7275923fe9748f67e27e449412d1f/tools/ep_kernels/install_python_libraries.sh#L24), it is using nvshmem 3.2.5, which does not support CUDA 13.0: https://docs.nvidia.com/nvshmem/release-notes-install-guide/prior-releases/release-3205.html#compatibility We may want to update to a version that supports CUDA 13.0, such as the latest 3.4.5: https://docs.nvidia.com/nvshmem/release-notes-install-guide/release-notes/release-3405.html#compatibility
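As a quick sanity check of the setup (a sketch; assumes PyTorch is installed in the same environment):

import torch

# Sanity-check sketch: report the CUDA runtime PyTorch was built against and the
# compute capability of the local GPU (a B200 reports (10, 0)).
print("torch CUDA runtime:", torch.version.cuda)
print("device:", torch.cuda.get_device_name(0))
print("compute capability:", torch.cuda.get_device_capability(0))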
@varun-sundar-rabindranath Could you share your installation process in detailed steps? For example, what base container did you use, and what commands did you use to install all the dependencies and vLLM? Thanks!
Hey guys. Sorry about the delay in getting a minimal repro - I have one now, PTAL. Thanks.
import torch
from flashinfer import trtllm_fp4_block_scale_routed_moe
from flashinfer import autotune


def make_kwargs():
    # Reconstruct input tensors from vLLM.
    """
    (EngineCore_DP0 pid=121033) topk_ids : torch.int32 torch.Size([16384, 4]) (4, 1) cuda:0
    (EngineCore_DP0 pid=121033) routing_bias : None
    (EngineCore_DP0 pid=121033) hidden_states : torch.bfloat16 torch.Size([16384, 3072]) (3072, 1) cuda:0
    (EngineCore_DP0 pid=121033) hidden_states_scale : None
    (EngineCore_DP0 pid=121033) gemm1_weights : torch.uint8 torch.Size([16, 6144, 1536]) (9437184, 1536, 1) cuda:0
    (EngineCore_DP0 pid=121033) gemm1_weights_scale : torch.float8_e4m3fn torch.Size([16, 6144, 96]) (589824, 96, 1) cuda:0
    (EngineCore_DP0 pid=121033) gemm1_bias : torch.float32 torch.Size([16, 6144]) (6144, 1) cuda:0
    (EngineCore_DP0 pid=121033) gemm1_alpha : torch.float32 torch.Size([16]) (1,) cuda:0
    (EngineCore_DP0 pid=121033) gemm1_beta : torch.float32 torch.Size([16]) (1,) cuda:0
    (EngineCore_DP0 pid=121033) gemm1_clamp_limit : torch.float32 torch.Size([16]) (1,) cuda:0
    (EngineCore_DP0 pid=121033) gemm2_weights : torch.uint8 torch.Size([16, 3072, 1536]) (4718592, 1536, 1) cuda:0
    (EngineCore_DP0 pid=121033) gemm2_weights_scale : torch.float8_e4m3fn torch.Size([16, 3072, 96]) (294912, 96, 1) cuda:0
    (EngineCore_DP0 pid=121033) gemm2_bias : torch.float32 torch.Size([16, 3072]) (3072, 1) cuda:0
    (EngineCore_DP0 pid=121033) output1_scale_scalar : None
    (EngineCore_DP0 pid=121033) output1_scale_gate_scalar : None
    (EngineCore_DP0 pid=121033) output2_scale_scalar : None
    (EngineCore_DP0 pid=121033) num_experts : 32
    (EngineCore_DP0 pid=121033) top_k : 4
    (EngineCore_DP0 pid=121033) n_group : None
    (EngineCore_DP0 pid=121033) topk_group : None
    (EngineCore_DP0 pid=121033) intermediate_size : 3072
    (EngineCore_DP0 pid=121033) local_expert_offset : 0
    (EngineCore_DP0 pid=121033) local_num_experts : 16
    (EngineCore_DP0 pid=121033) routed_scaling_factor : None
    (EngineCore_DP0 pid=121033) tile_tokens_dim : None
    (EngineCore_DP0 pid=121033) routing_method_type : 1
    (EngineCore_DP0 pid=121033) do_finalize : True
    (EngineCore_DP0 pid=121033) output : torch.bfloat16 torch.Size([16384, 3072]) (3072, 1) cuda:0
    (EngineCore_DP0 pid=121033) tune_max_num_tokens : 1
    """
    M = 16
    TOPK = 4
    _topk_ids = torch.randint(low=0, high=16, size=(M, TOPK), device="cuda")
    _topk_weights = torch.randn((M, TOPK), dtype=torch.bfloat16, device="cuda")
    # Pack the expert id into the upper 16 bits and the bf16 weight bits into the
    # lower 16 bits (the mask avoids sign-extension from negative weights).
    topk_ids = (_topk_ids.to(torch.int32) << 16) | (
        _topk_weights.view(torch.int16).to(torch.int32) & 0xFFFF
    )
    hidden_states = torch.empty((M, 3072), dtype=torch.bfloat16, device="cuda")
    gemm1_weights = torch.empty((16, 6144, 1536), dtype=torch.uint8, device="cuda")
    gemm1_weights_scale = torch.empty((16, 6144, 96), dtype=torch.float8_e4m3fn, device="cuda")
    gemm1_bias = torch.empty((16, 6144), dtype=torch.float32, device="cuda")
    gemm1_alpha = torch.empty((16,), dtype=torch.float32, device="cuda")
    gemm1_beta = torch.empty((16,), dtype=torch.float32, device="cuda")
    gemm1_clamp_limit = torch.empty((16,), dtype=torch.float32, device="cuda")
    gemm2_weights = torch.empty((16, 3072, 1536), dtype=torch.uint8, device="cuda")
    gemm2_weights_scale = torch.empty((16, 3072, 96), dtype=torch.float8_e4m3fn, device="cuda")
    gemm2_bias = torch.empty((16, 3072), dtype=torch.float32, device="cuda")
    output = torch.empty((M, 3072), dtype=torch.bfloat16, device="cuda")

    kwargs = {
        "topk_ids": topk_ids,
        "routing_bias": None,
        "hidden_states": hidden_states,
        "hidden_states_scale": None,
        "gemm1_weights": gemm1_weights,
        "gemm1_weights_scale": gemm1_weights_scale,
        "gemm1_bias": gemm1_bias,
        "gemm1_alpha": gemm1_alpha,
        "gemm1_beta": gemm1_beta,
        "gemm1_clamp_limit": gemm1_clamp_limit,
        "gemm2_weights": gemm2_weights,
        "gemm2_weights_scale": gemm2_weights_scale,
        "gemm2_bias": gemm2_bias,
        "output1_scale_scalar": None,
        "output1_scale_gate_scalar": None,
        "output2_scale_scalar": None,
        "num_experts": 16,
        "top_k": TOPK,
        "n_group": None,
        "topk_group": None,
        "intermediate_size": 3072,
        "local_expert_offset": 0,
        "local_num_experts": 16,
        "routed_scaling_factor": None,
        "tile_tokens_dim": None,
        "routing_method_type": 1,
        "do_finalize": True,
        "output": output,
        "tune_max_num_tokens": 1,
    }
    return kwargs


def main(with_autotune: bool):
    print(f"Running with_autotune={with_autotune} ...")
    with autotune(with_autotune):
        trtllm_fp4_block_scale_routed_moe(**make_kwargs())


if __name__ == '__main__':
    main(with_autotune=False)
    main(with_autotune=True)
You should be able to see the error when running this script in a venv with just flashinfer-python==0.5.2 installed. Thanks 🙌