[Bug] Autotuning + trtllm_fp4_block_scale_routed_moe Issue

Open varun-sundar-rabindranath opened this issue 1 month ago • 10 comments

Hey guys, I am facing an issue with trtllm_fp4_block_scale_routed_moe when the autotuner is enabled.

Issue:

Calling trtllm_fp4_block_scale_routed_moe with autotune enabled results in the following error:

(EngineCore_DP0 pid=2361841) ERROR 11-01 12:49:12 [core.py:800]   File "/home/varun-sundar-rabindranath/code/vllm/vllm/model_executor/layers/fused_moe/trtllm_moe.py", line 140, in apply
(EngineCore_DP0 pid=2361841) ERROR 11-01 12:49:12 [core.py:800]     trtllm_fp4_block_scale_routed_moe(**kwargs)
(EngineCore_DP0 pid=2361841) ERROR 11-01 12:49:12 [core.py:800]   File "/home/varun-sundar-rabindranath/code/vllm/vllm-test/lib/python3.12/site-packages/flashinfer/fused_moe/core.py", line 2061, in trtllm_fp4_block_scale_routed_moe
(EngineCore_DP0 pid=2361841) ERROR 11-01 12:49:12 [core.py:800]     return get_trtllm_moe_sm100_module().trtllm_fp4_block_scale_moe(
(EngineCore_DP0 pid=2361841) ERROR 11-01 12:49:12 [core.py:800]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=2361841) ERROR 11-01 12:49:12 [core.py:800]   File "/home/varun-sundar-rabindranath/code/vllm/vllm-test/lib/python3.12/site-packages/flashinfer/fused_moe/core.py", line 1540, in trtllm_fp4_block_scale_moe_op
(EngineCore_DP0 pid=2361841) ERROR 11-01 12:49:12 [core.py:800]     _, tactic = tuner.choose_one(
(EngineCore_DP0 pid=2361841) ERROR 11-01 12:49:12 [core.py:800]                 ^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=2361841) ERROR 11-01 12:49:12 [core.py:800]   File "/home/varun-sundar-rabindranath/code/vllm/vllm-test/lib/python3.12/site-packages/flashinfer/autotuner.py", line 477, in choose_one
(EngineCore_DP0 pid=2361841) ERROR 11-01 12:49:12 [core.py:800]     r(tensors, tactic=-1, do_preparation=True, **kwargs)
(EngineCore_DP0 pid=2361841) ERROR 11-01 12:49:12 [core.py:800]   File "/home/varun-sundar-rabindranath/code/vllm/vllm-test/lib/python3.12/site-packages/flashinfer/autotuner.py", line 217, in __call__
(EngineCore_DP0 pid=2361841) ERROR 11-01 12:49:12 [core.py:800]     return self.forward(inputs, **kwargs)
(EngineCore_DP0 pid=2361841) ERROR 11-01 12:49:12 [core.py:800]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=2361841) ERROR 11-01 12:49:12 [core.py:800]   File "/home/varun-sundar-rabindranath/code/vllm/vllm-test/lib/python3.12/site-packages/flashinfer/fused_moe/core.py", line 1086, in forward
(EngineCore_DP0 pid=2361841) ERROR 11-01 12:49:12 [core.py:800]     moe_op.trtllm_fp4_block_scale_moe(
(EngineCore_DP0 pid=2361841) ERROR 11-01 12:49:12 [core.py:800]   File "python/tvm_ffi/cython/function.pxi", line 744, in core.Function.__call__
(EngineCore_DP0 pid=2361841) ERROR 11-01 12:49:12 [core.py:800]   File "python/tvm_ffi/cython/function.pxi", line 158, in core.TVMFFIPyArgSetterDLPackExchangeAPI_
(EngineCore_DP0 pid=2361841) ERROR 11-01 12:49:12 [core.py:800] RuntimeError: Cannot pack tensors on meta

Note that the stack trace is from vLLM, executed with the command,

VLLM_ALL2ALL_BACKEND="deepep_high_throughput" vllm serve openai/gpt-oss-20b --data-parallel-size 2 --tensor-parallel-size 1 --enable-expert-parallel --no-enable-prefix-caching --port 9010

but I believe it should be reproducible directly with flashinfer.

I did some investigation:

  • For trtllm_fp4_block_scale_routed_moe(), the routing logits are set directly to None here: https://github.com/flashinfer-ai/flashinfer/blob/f9cd0345a162f4b19d62a0918ba027a3c59917a7/flashinfer/fused_moe/core.py#L2061
  • routing_logits is then assigned a tensor on the meta device here: https://github.com/flashinfer-ai/flashinfer/blob/f9cd0345a162f4b19d62a0918ba027a3c59917a7/flashinfer/fused_moe/core.py#L1530C13-L1530C18. This is what causes the autotuner to error out with the RuntimeError above (see the sketch below).
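For context, here is a minimal standalone sketch (plain PyTorch, not flashinfer code) of why a meta-device tensor breaks this path: a meta tensor only carries shape/dtype metadata and has no storage, so any attempt to materialize or hand its data to a real kernel fails, analogous to the "Cannot pack tensors on meta" error raised at the tvm_ffi boundary.

import torch

# A tensor on the "meta" device has shape/dtype metadata but no backing storage.
routing_logits = torch.empty((128, 32), dtype=torch.bfloat16, device="meta")
print(routing_logits.shape, routing_logits.device)  # inspecting metadata works

try:
    # Materializing the (non-existent) data fails, similar in spirit to the
    # "Cannot pack tensors on meta" error the FFI layer raises above.
    routing_logits.cpu()
except Exception as e:
    print(f"{type(e).__name__}: {e}")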

Could someone please take a look? Thanks 🙌

@varun-sundar-rabindranath What vLLM version (commit) and what FlashInfer version did you use?

nvpohanh avatar Nov 03 '25 02:11 nvpohanh

How did you install the DeepEP kernels? Which CUDA version are you using? I tried https://github.com/vllm-project/vllm/tree/main/tools/ep_kernels, but it seems some of the build steps need CUDA 12.9 while others need 13.0...

elvischenv avatar Nov 03 '25 04:11 elvischenv

vLLM ToT + flashinfer ToT is runnable without VLLM_ALL2ALL_BACKEND="deepep_high_throughput"

python3 -m vllm.entrypoints.openai.api_server --model openai/gpt-oss-20b --data-parallel-size 2 --tensor-parallel-size 1 --enable-expert-parallel --no-enable-prefix-caching --port 8080 --max-model-len 8192

nvjullin avatar Nov 03 '25 09:11 nvjullin

The VLLM_ALL2ALL_BACKEND="deepep_high_throughput" backend is the only code path that uses the trtllm_fp4_block_scale_routed_moe API from flashinfer.

@nvjullin

python3 -m vllm.entrypoints.openai.api_server --model openai/gpt-oss-20b --data-parallel-size 2 --tensor-parallel-size 1 --enable-expert-parallel --no-enable-prefix-caching --port 8080 --max-model-len 8192

This works because the All2Alls used in this case directly send over the routing logits, so we end up using either trtllm_fp4_block_scale_moe or flashinfer_cutlass_fused_moe, which have no issues with autotune.
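For reference, a small standalone sketch of the packed topk_ids layout that the routed entry point consumes, based on the packing used in the repro script later in this thread (expert index in the upper 16 bits, raw bf16 weight bits in the lower 16 bits of an int32); this is an illustration, not flashinfer code:

import torch

num_tokens, top_k, num_experts = 4, 2, 16
expert_ids = torch.randint(0, num_experts, (num_tokens, top_k), dtype=torch.int32)
weights = torch.rand((num_tokens, top_k), dtype=torch.bfloat16)

# Pack: expert index in the upper 16 bits, bf16 weight bits in the lower 16 bits.
weight_bits = weights.view(torch.int16).to(torch.int32) & 0xFFFF  # mask avoids sign extension
packed = (expert_ids << 16) | weight_bits

# Unpack to verify the round trip.
ids_back = packed >> 16
weights_back = (packed & 0xFFFF).to(torch.int16).view(torch.bfloat16)
assert torch.equal(ids_back, expert_ids) and torch.equal(weights_back, weights)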

I'll see if I can get a smaller repro example that doesn't use vLLM.

How did you install the DeepEP kernels? Which CUDA version are you using? I tried https://github.com/vllm-project/vllm/tree/main/tools/ep_kernels, but it seems some of the build steps need CUDA 12.9 while others need 13.0...

@elvischenv please take a look at https://github.com/vllm-project/vllm/tree/3758757377b713b6acc997d0ac2c5dd49c332278/tools/ep_kernels

Edit: I am using CUDA 13.0 on B200.

@varun-sundar-rabindranath I did use those scripts but encountered errors:

-- Configuring done (1.2s)
-- Generating done (0.4s)
-- Build files have been written to: /workspace/vllm/tools/ep_kernels/ep_kernels_workspace/nvshmem_build
+ cmake --build /workspace/vllm/tools/ep_kernels/ep_kernels_workspace/nvshmem_build/ --target install
[2/132] Building CUDA object src/CMakeFiles/nvshmem.dir/host/init/init.cu.o
FAILED: [code=1] src/CMakeFiles/nvshmem.dir/host/init/init.cu.o
/usr/local/cuda/bin/nvcc -forward-unknown-to-host-compiler -DNVSHMEM_X86_64 -D__STDC_CONSTANT_MACROS -D__STDC_LIMIT_MACROS -I/workspace/vllm/tools/ep_kernels/ep_kernels_workspace/nvshmem_src/src/include -I/workspace/vllm/tools/ep_kernels/ep_kernels_workspace/nvshmem_src/src/include/host/env -I/workspace/vllm/tools/ep_kernels/ep_kernels_workspace/nvshmem_src/src/host/stream/coll -I/workspace/vllm/tools/ep_kernels/ep_kernels_workspace/nvshmem_src/src/host/coll -I/workspace/vllm/tools/ep_kernels/ep_kernels_workspace/nvshmem_src/src/host/topo -isystem /usr/local/cuda/targets/x86_64-linux/include -isystem /usr/local/cuda/targets/x86_64-linux/include/cccl -O3 -DNDEBUG -std=c++11 "--generate-code=arch=compute_70,code=[sm_70]" "--generate-code=arch=compute_80,code=[sm_80]" "--generate-code=arch=compute_90,code=[sm_90]" "--generate-code=arch=compute_100,code=[compute_100,sm_100]" -Xcompiler=-fPIC -O3 --maxrregcount=32 -MD -MT src/CMakeFiles/nvshmem.dir/host/init/init.cu.o -MF src/CMakeFiles/nvshmem.dir/host/init/init.cu.o.d -x cu -rdc=true -c /workspace/vllm/tools/ep_kernels/ep_kernels_workspace/nvshmem_src/src/host/init/init.cu -o src/CMakeFiles/nvshmem.dir/host/init/init.cu.o
nvcc fatal   : Unsupported gpu architecture 'compute_70'

The script is still using the old arch compute_70, which is not supported by CUDA 13.0.

elvischenv avatar Nov 04 '25 03:11 elvischenv

@elvischenv how are you invoking the scripts? Like this:

# for hopper
TORCH_CUDA_ARCH_LIST="9.0" bash install_python_libraries.sh
# for blackwell
TORCH_CUDA_ARCH_LIST="10.0" bash install_python_libraries.sh

Also, if you are using uv, can you set the PIP_CMD like in https://github.com/vllm-project/vllm/blob/3758757377b713b6acc997d0ac2c5dd49c332278/tools/ep_kernels/install_python_libraries.sh#L15?

Yes, I am using TORCH_CUDA_ARCH_LIST="10.0" bash install_python_libraries.sh. The nvcc fatal : Unsupported gpu architecture 'compute_70' error is not related to uv, right? It should be related to the nvshmem version: the old nvshmem still supports sm70, so it builds for sm70, but CUDA 13.0 doesn't support sm70.

https://github.com/vllm-project/vllm/blob/14a125a06df7275923fe9748f67e27e449412d1f/tools/ep_kernels/install_python_libraries.sh#L24

From the script, it is using nvshmem 3.2.5, which doesn't support CUDA 13.0 (https://docs.nvidia.com/nvshmem/release-notes-install-guide/prior-releases/release-3205.html#compatibility). We may want to update to a version that supports CUDA 13.0, like the latest 3.4.5 (https://docs.nvidia.com/nvshmem/release-notes-install-guide/release-notes/release-3405.html#compatibility).

elvischenv avatar Nov 04 '25 04:11 elvischenv

@varun-sundar-rabindranath Could you share your installation process in detailed steps? For example, what base container did you use, and what commands did you use to install all the dependencies and vLLM? Thanks!

nvpohanh avatar Nov 05 '25 08:11 nvpohanh

Hey guys. Sorry about the delay in getting a minimal repro - I have one now, PTAL. Thanks.

import flashinfer
import torch

from flashinfer import trtllm_fp4_block_scale_routed_moe
from flashinfer import autotune

def make_kwargs():

    # Reconstruct input tensors from vLLM.
    """
    (EngineCore_DP0 pid=121033) topk_ids : torch.int32 torch.Size([16384, 4]) (4, 1) cuda:0
    (EngineCore_DP0 pid=121033) routing_bias : None
    (EngineCore_DP0 pid=121033) hidden_states : torch.bfloat16 torch.Size([16384, 3072]) (3072, 1) cuda:0
    (EngineCore_DP0 pid=121033) hidden_states_scale : None
    (EngineCore_DP0 pid=121033) gemm1_weights : torch.uint8 torch.Size([16, 6144, 1536]) (9437184, 1536, 1) cuda:0
    (EngineCore_DP0 pid=121033) gemm1_weights_scale : torch.float8_e4m3fn torch.Size([16, 6144, 96]) (589824, 96, 1) cuda:0
    (EngineCore_DP0 pid=121033) gemm1_bias : torch.float32 torch.Size([16, 6144]) (6144, 1) cuda:0
    (EngineCore_DP0 pid=121033) gemm1_alpha : torch.float32 torch.Size([16]) (1,) cuda:0
    (EngineCore_DP0 pid=121033) gemm1_beta : torch.float32 torch.Size([16]) (1,) cuda:0
    (EngineCore_DP0 pid=121033) gemm1_clamp_limit : torch.float32 torch.Size([16]) (1,) cuda:0
    (EngineCore_DP0 pid=121033) gemm2_weights : torch.uint8 torch.Size([16, 3072, 1536]) (4718592, 1536, 1) cuda:0
    (EngineCore_DP0 pid=121033) gemm2_weights_scale : torch.float8_e4m3fn torch.Size([16, 3072, 96]) (294912, 96, 1) cuda:0
    (EngineCore_DP0 pid=121033) gemm2_bias : torch.float32 torch.Size([16, 3072]) (3072, 1) cuda:0
    (EngineCore_DP0 pid=121033) output1_scale_scalar : None
    (EngineCore_DP0 pid=121033) output1_scale_gate_scalar : None
    (EngineCore_DP0 pid=121033) output2_scale_scalar : None
    (EngineCore_DP0 pid=121033) num_experts : 32
    (EngineCore_DP0 pid=121033) top_k : 4
    (EngineCore_DP0 pid=121033) n_group : None
    (EngineCore_DP0 pid=121033) topk_group : None
    (EngineCore_DP0 pid=121033) intermediate_size : 3072
    (EngineCore_DP0 pid=121033) local_expert_offset : 0
    (EngineCore_DP0 pid=121033) local_num_experts : 16
    (EngineCore_DP0 pid=121033) routed_scaling_factor : None
    (EngineCore_DP0 pid=121033) tile_tokens_dim : None
    (EngineCore_DP0 pid=121033) routing_method_type : 1
    (EngineCore_DP0 pid=121033) do_finalize : True
    (EngineCore_DP0 pid=121033) output : torch.bfloat16 torch.Size([16384, 3072]) (3072, 1) cuda:0
    (EngineCore_DP0 pid=121033) tune_max_num_tokens : 1
    """

    M = 16
    TOPK = 4
    _topk_ids = torch.randint(low=0,
                             high=16,
                             size=(M, TOPK),
                             device="cuda")
    _topk_weights = torch.randn((M, TOPK),
                               dtype=torch.bfloat16,
                               device="cuda")

    # Pack each (expert id, weight) pair into a single int32:
    # expert id in the upper 16 bits, raw bf16 weight bits in the lower 16 bits.
    topk_ids = (_topk_ids.to(torch.int32) << 16) | _topk_weights.to(torch.bfloat16).view(torch.int16)
    hidden_states = torch.empty((M, 3072),
                                dtype=torch.bfloat16,
                                device="cuda")
    gemm1_weights = torch.empty((16, 6144, 1536),
                                dtype=torch.uint8,
                                device="cuda")
    gemm1_weights_scale = torch.empty((16, 6144, 96),
                                      dtype=torch.float8_e4m3fn,
                                      device="cuda") 
    gemm1_bias = torch.empty((16, 6144),
                             dtype=torch.float32,
                             device = "cuda") 
    gemm1_alpha = torch.empty((16,), dtype=torch.float32, device="cuda")
    gemm1_beta = torch.empty((16,), dtype=torch.float32, device="cuda") 
    gemm1_clamp_limit = torch.empty((16,), dtype=torch.float32, device="cuda")
    gemm2_weights = torch.empty((16, 3072, 1536), device="cuda", dtype=torch.uint8) 
    gemm2_weights_scale = torch.empty((16, 3072, 96),
                                      dtype=torch.float8_e4m3fn,
                                      device="cuda")
    gemm2_bias = torch.empty((16, 3072),
                             dtype=torch.float32,
                             device="cuda")
    output = torch.empty((M, 3072),
                         dtype = torch.bfloat16,
                         device = "cuda")

    kwargs = {
        "topk_ids": topk_ids,
        "routing_bias": None,
        "hidden_states": hidden_states,
        "hidden_states_scale": None,
        "gemm1_weights": gemm1_weights,
        "gemm1_weights_scale": gemm1_weights_scale,
        "gemm1_bias": gemm1_bias,
        "gemm1_alpha": gemm1_alpha,
        "gemm1_beta": gemm1_beta,
        "gemm1_clamp_limit": gemm1_clamp_limit,
        "gemm2_weights": gemm2_weights,
        "gemm2_weights_scale": gemm2_weights_scale,
        "gemm2_bias": gemm2_bias,
        "output1_scale_scalar": None,
        "output1_scale_gate_scalar": None,
        "output2_scale_scalar": None,
        "num_experts": 16,
        "top_k": TOPK,
        "n_group": None,
        "topk_group": None,
        "intermediate_size": 3072,
        "local_expert_offset": 0,
        "local_num_experts": 16,
        "routed_scaling_factor": None,
        "tile_tokens_dim": None,
        "routing_method_type": 1,
        "do_finalize": True,
        "output": output,
        "tune_max_num_tokens": 1,
    }
    return kwargs

def main(with_autotune: bool):
    print(f"Running with_autotune={with_autotune} ...")
    with autotune(with_autotune):
        trtllm_fp4_block_scale_routed_moe(**make_kwargs())

if __name__ == '__main__':
    main(with_autotune=False)
    main(with_autotune=True)

You should be able to see the error when running this script in a venv with just flashinfer-python==0.5.2 installed. Thanks 🙌