
[Bug] Running DeepSeek-R1 on B200

[Open] MaoZiming opened this issue 1 month ago • 2 comments

Checklist

  • [x] I searched related issues but found no solution.
  • [x] The bug persists in the latest version.
  • [x] Issues without environment info and a minimal reproducible demo are hard to resolve and may receive no feedback.
  • [ ] If this is not a bug report but a general question, please start a discussion at https://github.com/sgl-project/sglang/discussions. Otherwise, it will be closed.
  • [x] Please use English. Otherwise, it will be closed.

Describe the bug

Hitting a TypeError from get_trtllm_moe_sm100_module().trtllm_fp8_block_scale_moe when running DeepSeek-R1-0528:

  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1784, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 2847, in forward
    hidden_states = self.mlp(
                    ^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1784, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 763, in forward
    return self.forward_normal_dual_stream(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 797, in forward_normal_dual_stream
    final_hidden_states = self.experts(hidden_states, topk_output)
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1784, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/layer.py", line 1016, in forward
    final_hidden_states = self.quant_method.apply_with_router_logits(
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/srt/layers/quantization/fp8.py", line 1225, in apply_with_router_logits
    return trtllm_fp8_block_scale_moe(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/flashinfer/fused_moe/core.py", line 1801, in trtllm_fp8_block_scale_moe
    return get_trtllm_moe_sm100_module().trtllm_fp8_block_scale_moe(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/flashinfer/fused_moe/core.py", line 1379, in trtllm_fp8_block_scale_moe_op
    moe_op.trtllm_fp8_block_scale_moe(
  File "python/tvm_ffi/cython/function.pxi", line 901, in core.Function.__call__
TypeError: Mismatched type on argument #17 when calling: `trtllm_fp8_block_scale_moe(0: DLTensor*, 1: Optional<DLTensor*>, 2: DLTensor*, 3: DLTensor*, 4: DLTensor*, 5: DLTensor*, 6: DLTensor*, 7: DLTensor*, 8: DLTensor*, 9: int, 10: int, 11: Optional<int>, 12: Optional<int>, 13: int, 14: int, 15: int, 16: Optional<float>, 17: int, 18: bool, 19: int, 20: bool, 21: Array<int>) -> void`. Expected `int` but got `None`
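
For context, the failure happens at the tvm_ffi boundary rather than inside the kernel: the compiled op declares argument #17 as a non-optional int, and the Python wrapper forwards None for it. A minimal sketch of the same failure mode (routing_method_type is a hypothetical name for that argument; the real call site is in flashinfer/fused_moe/core.py):

# Stand-in for the FFI boundary: a non-optional `int` argument rejects None.
def trtllm_fp8_block_scale_moe_sketch(routing_method_type: int) -> None:
    if not isinstance(routing_method_type, int):
        raise TypeError(
            f"Expected `int` but got `{routing_method_type!r}` "
            "(mismatched type on argument #17)"
        )

trtllm_fp8_block_scale_moe_sketch(3)     # OK: a concrete routing-method value
trtllm_fp8_block_scale_moe_sketch(None)  # raises TypeError, matching the log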

Reproduction

Environment: 2x AWS P6 instances, each with 8x B200 GPUs.

Dockerfile:

FROM lmsysorg/sglang:latest
# Build prerequisites
RUN apt-get update && apt-get install -y \
    linux-headers-generic build-essential devscripts debhelper check libsubunit-dev \
    fakeroot pkg-config dkms autoconf automake libtool m4 libnuma-dev
# GDRCopy userspace library (CUDA 12.8 build)
RUN set -eux; cd /tmp \
    && wget -q "https://developer.download.nvidia.com/compute/redist/gdrcopy/CUDA%2012.8/ubuntu22_04/x64/libgdrapi_2.5.1-1_amd64.Ubuntu22_04.deb" \
    && dpkg -i libgdrapi_2.5.1-1_amd64.Ubuntu22_04.deb || apt-get -y -f install \
    && rm -f libgdrapi_2.5.1-1_amd64.Ubuntu22_04.deb
# AWS EFA installer (userspace only, kernel module skipped)
RUN cd $HOME \
    && curl -O https://efa-installer.amazonaws.com/aws-efa-installer-1.43.2.tar.gz \
    && tar -xf aws-efa-installer-1.43.2.tar.gz \
    && cd aws-efa-installer \
    && ./efa_installer.sh -y --skip-kmod -g --no-verify
# aws-ofi-nccl plugin for NCCL over EFA
RUN cd /opt \
    && git clone --depth=1 https://github.com/aws/aws-ofi-nccl.git \
    && cd aws-ofi-nccl && ./autogen.sh \
    && ./configure --with-libfabric=/opt/amazon/efa --with-cuda=/usr/local/cuda --disable-tests \
    && make -C src -j && make -C src install
# nccl-tests, built for SM100 (B200)
RUN cd $HOME \
    && git clone https://github.com/NVIDIA/nccl-tests.git \
    && cd nccl-tests \
    && export LD_LIBRARY_PATH=/opt/amazon/efa/lib:$LD_LIBRARY_PATH \
    && make -j MPI=1 MPI_HOME=/opt/amazon/openmpi NCCL_HOME=/opt/nccl/build CUDA_HOME=/usr/local/cuda NVCC_GENCODE="-gencode=arch=compute_100,code=sm_100"

# Defaults for NCCL over EFA
ENV NCCL_NET=OFI FI_PROVIDER=efa FI_EFA_USE_DEVICE_RDMA=1 FI_EFA_ENABLE_SHM_TRANSFER=1 NCCL_DEBUG=INFO \
    LD_LIBRARY_PATH=/usr/local/lib:/opt/nccl/build/lib:/opt/amazon/efa/lib:/usr/local/cuda/lib64:${LD_LIBRARY_PATH}
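
The report does not include the launch command; a sketch of a typical single-node invocation from this image (the model path and tensor-parallel size here are assumptions, not taken from the report):

docker build -t sglang-b200 .
docker run --gpus all --ipc=host --network host sglang-b200 \
    python3 -m sglang.launch_server \
    --model-path deepseek-ai/DeepSeek-R1-0528 \
    --tp 8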

Environment

root@ip-10-1-18-53:/workspace/uccl/ep/deep_ep_wrapper# python3 -m sglang.check_env

Python: 3.12.12 (main, Oct 10 2025, 08:52:57) [GCC 11.4.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7: NVIDIA B200
GPU 0,1,2,3,4,5,6,7 Compute Capability: 10.0
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.9, V12.9.86
CUDA Driver Version: 580.95.05
PyTorch: 2.8.0+cu129
sglang: 0.5.5.post3
sgl_kernel: 0.3.17.post1
flashinfer_python: 0.5.2
flashinfer_cubin: 0.5.2
flashinfer_jit_cache: Module Not Found
triton: 3.4.0
transformers: 4.57.1
torchao: 0.9.0
numpy: 2.3.5
aiohttp: 3.13.2
fastapi: 0.121.2
hf_transfer: 0.1.9
huggingface_hub: 0.36.0
interegular: 0.3.3
modelscope: 1.31.0
orjson: 3.11.4
outlines: 0.1.11
packaging: 25.0
psutil: 7.1.3
pydantic: 2.12.4
python-multipart: 0.0.20
pyzmq: 27.1.0
uvicorn: 0.38.0
uvloop: 0.21.0
vllm: Module Not Found
xgrammar: 0.1.25
openai: 2.6.1
tiktoken: 0.12.0
anthropic: 0.73.0
litellm: Module Not Found
decord2: 2.0.0
NVIDIA Topology: 
        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    CPU Affinity    NUMA Affinity       GPU NUMA ID
GPU0     X      NV18    NV18    NV18    NV18    NV18    NV18    NV18    0-47    0          N/A
GPU1    NV18     X      NV18    NV18    NV18    NV18    NV18    NV18    0-47    0          N/A
GPU2    NV18    NV18     X      NV18    NV18    NV18    NV18    NV18    0-47    0          N/A
GPU3    NV18    NV18    NV18     X      NV18    NV18    NV18    NV18    0-47    0          N/A
GPU4    NV18    NV18    NV18    NV18     X      NV18    NV18    NV18    48-95   1          N/A
GPU5    NV18    NV18    NV18    NV18    NV18     X      NV18    NV18    48-95   1          N/A
GPU6    NV18    NV18    NV18    NV18    NV18    NV18     X      NV18    48-95   1          N/A
GPU7    NV18    NV18    NV18    NV18    NV18    NV18    NV18     X      48-95   1          N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

Hypervisor vendor: KVM
ulimit soft: 1048576

MaoZiming · Nov 20 '25

The current nightly image, FROM lmsysorg/sglang:nightly-dev-cu13-20251116-de7eaa7c, still gives the same error:

  File "/usr/local/lib/python3.12/dist-packages/flashinfer/fused_moe/core.py", line 1379, in trtllm_fp8_block_scale_moe_op
    moe_op.trtllm_fp8_block_scale_moe(
  File "python/tvm_ffi/cython/function.pxi", line 901, in core.Function.__call__
TypeError: Mismatched type on argument #17 when calling: `trtllm_fp8_block_scale_moe(0: DLTensor*, 1: Optional<DLTensor*>, 2: DLTensor*, 3: DLTensor*, 4: DLTensor*, 5: DLTensor*, 6: DLTensor*, 7: DLTensor*, 8: DLTensor*, 9: int, 10: int, 11: Optional<int>, 12: Optional<int>, 13: int, 14: int, 15: int, 16: Optional<float>, 17: int, 18: bool, 19: int, 20: bool, 21: Array<int>) -> void`. Expected `int` but got `None`

MaoZiming · Nov 20 '25

Can you check whether your sglang build includes this change?

https://github.com/sgl-project/sglang/commit/e389f91decdad61653edc57c765ef6041506e4a2 (Nov 17)
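
One way to check from inside the container, assuming the source checkout at /sgl-workspace/sglang seen in the trace above:

# Exits 0 only if the fix commit is reachable from the checked-out HEAD
git -C /sgl-workspace/sglang merge-base --is-ancestor \
    e389f91decdad61653edc57c765ef6041506e4a2 HEAD \
    && echo "fix included" || echo "fix missing"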

kaixih · Nov 21 '25

@kaixih After including this change, it works for me. Thanks!

xxrjun · Nov 24 '25