[Bug] Running DeepSeek-R1 on B200
Checklist
- [x] I searched related issues but found no solution.
- [x] The bug persists in the latest version.
- [x] Issues without environment info and a minimal reproducible demo are hard to resolve and may receive no feedback.
- [ ] If this is not a bug report but a general question, please start a discussion at https://github.com/sgl-project/sglang/discussions. Otherwise, it will be closed.
- [x] Please use English. Otherwise, it will be closed.
Describe the bug
Hitting a TypeError in get_trtllm_moe_sm100_module().trtllm_fp8_block_scale_moe when running DeepSeek-R1-0528:
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1784, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 2847, in forward
hidden_states = self.mlp(
^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1784, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 763, in forward
return self.forward_normal_dual_stream(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 797, in forward_normal_dual_stream
final_hidden_states = self.experts(hidden_states, topk_output)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1784, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/sgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/layer.py", line 1016, in forward
final_hidden_states = self.quant_method.apply_with_router_logits(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/sgl-workspace/sglang/python/sglang/srt/layers/quantization/fp8.py", line 1225, in apply_with_router_logits
return trtllm_fp8_block_scale_moe(
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/flashinfer/fused_moe/core.py", line 1801, in trtllm_fp8_block_scale_moe
return get_trtllm_moe_sm100_module().trtllm_fp8_block_scale_moe(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/flashinfer/fused_moe/core.py", line 1379, in trtllm_fp8_block_scale_moe_op
moe_op.trtllm_fp8_block_scale_moe(
File "python/tvm_ffi/cython/function.pxi", line 901, in core.Function.__call__
TypeError: Mismatched type on argument #17 when calling: `trtllm_fp8_block_scale_moe(0: DLTensor*, 1: Optional<DLTensor*>, 2: DLTensor*, 3: DLTensor*, 4: DLTensor*, 5: DLTensor*, 6: DLTensor*, 7: DLTensor*, 8: DLTensor*, 9: int, 10: int, 11: Optional<int>, 12: Optional<int>, 13: int, 14: int, 15: int, 16: Optional<float>, 17: int, 18: bool, 19: int, 20: bool, 21: Array<int>) -> void`. Expected `int` but got `None`
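The failure is at the FFI boundary: argument #17 of flashinfer's trtllm_fp8_block_scale_moe is declared as a plain int, but the SGLang caller passes None, which points at a mismatch between the installed sglang caller and the flashinfer signature. A quick way to inspect both sides (a sketch; the file paths come from the traceback above, and the pip package names are assumptions):
# Versions of the two packages involved in the mismatched call
pip show sglang flashinfer-python | grep -E '^(Name|Version)'
# Caller side: how fp8.py builds the argument list
grep -n -A 20 "trtllm_fp8_block_scale_moe(" /sgl-workspace/sglang/python/sglang/srt/layers/quantization/fp8.py
# Callee side: the signature flashinfer expects
grep -n -B 5 "def trtllm_fp8_block_scale_moe" /usr/local/lib/python3.12/dist-packages/flashinfer/fused_moe/core.py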
Reproduction
Environment: 2 nodes with 8x B200 each (AWS P6 instances).
Dockerfile:
FROM lmsysorg/sglang:latest
RUN apt-get update
RUN apt-get install -y linux-headers-generic
RUN apt -y install build-essential devscripts debhelper check libsubunit-dev fakeroot pkg-config dkms autoconf automake libtool m4 libnuma-dev
RUN set -eux; cd /tmp \
&& wget -q "https://developer.download.nvidia.com/compute/redist/gdrcopy/CUDA%2012.8/ubuntu22_04/x64/libgdrapi_2.5.1-1_amd64.Ubuntu22_04.deb" \
&& dpkg -i libgdrapi_2.5.1-1_amd64.Ubuntu22_04.deb || apt-get -y -f install \
&& rm -f libgdrapi_2.5.1-1_amd64.Ubuntu22_04.deb
RUN cd $HOME \
&& curl -O https://efa-installer.amazonaws.com/aws-efa-installer-1.43.2.tar.gz \
&& tar -xf aws-efa-installer-1.43.2.tar.gz \
&& cd aws-efa-installer \
&& ./efa_installer.sh -y --skip-kmod -g --no-verify
RUN cd /opt \
&& git clone --depth=1 https://github.com/aws/aws-ofi-nccl.git \
&& cd aws-ofi-nccl && ./autogen.sh \
&& ./configure --with-libfabric=/opt/amazon/efa --with-cuda=/usr/local/cuda --disable-tests \
&& make -C src -j && make -C src install
RUN cd $HOME \
&& git clone https://github.com/NVIDIA/nccl-tests.git \
&& cd nccl-tests \
&& export LD_LIBRARY_PATH=/opt/amazon/efa/lib:$LD_LIBRARY_PATH \
&& make -j MPI=1 MPI_HOME=/opt/amazon/openmpi NCCL_HOME=/opt/nccl/build CUDA_HOME=/usr/local/cuda NVCC_GENCODE="-gencode=arch=compute_100,code=sm_100"
# Build aws-ofi-nccl for NCCL over EFA
RUN cd /opt \
&& rm -rf aws-ofi-nccl \
&& git clone --depth=1 https://github.com/aws/aws-ofi-nccl.git \
&& cd aws-ofi-nccl && ./autogen.sh \
&& ./configure --with-libfabric=/opt/amazon/efa --with-cuda=/usr/local/cuda --disable-tests \
&& make -C src -j && make -C src install
# Defaults for NCCL over EFA
ENV NCCL_NET=OFI FI_PROVIDER=efa FI_EFA_USE_DEVICE_RDMA=1 FI_EFA_ENABLE_SHM_TRANSFER=1 NCCL_DEBUG=INFO \
LD_LIBRARY_PATH=/usr/local/lib:/opt/nccl/build/lib:/opt/amazon/efa/lib:/usr/local/cuda/lib64:${LD_LIBRARY_PATH}
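The exact serve command is not included above; a representative multi-node launch for DeepSeek-R1-0528 on 2x8 B200 would look roughly like the following (a sketch only — the model path, addresses, and ports are assumptions, not the reporter's actual command):
# Node 0 (rank 0); repeat on node 1 with --node-rank 1
python3 -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-R1-0528 \
  --tp 16 \
  --nnodes 2 --node-rank 0 \
  --dist-init-addr 10.1.18.53:5000 \
  --trust-remote-code \
  --host 0.0.0.0 --port 30000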
Environment
root@ip-10-1-18-53:/workspace/uccl/ep/deep_ep_wrapper# python3 -m sglang.check_env
Python: 3.12.12 (main, Oct 10 2025, 08:52:57) [GCC 11.4.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7: NVIDIA B200
GPU 0,1,2,3,4,5,6,7 Compute Capability: 10.0
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.9, V12.9.86
CUDA Driver Version: 580.95.05
PyTorch: 2.8.0+cu129
sglang: 0.5.5.post3
sgl_kernel: 0.3.17.post1
flashinfer_python: 0.5.2
flashinfer_cubin: 0.5.2
flashinfer_jit_cache: Module Not Found
triton: 3.4.0
transformers: 4.57.1
torchao: 0.9.0
numpy: 2.3.5
aiohttp: 3.13.2
fastapi: 0.121.2
hf_transfer: 0.1.9
huggingface_hub: 0.36.0
interegular: 0.3.3
modelscope: 1.31.0
orjson: 3.11.4
outlines: 0.1.11
packaging: 25.0
psutil: 7.1.3
pydantic: 2.12.4
python-multipart: 0.0.20
pyzmq: 27.1.0
uvicorn: 0.38.0
uvloop: 0.21.0
vllm: Module Not Found
xgrammar: 0.1.25
openai: 2.6.1
tiktoken: 0.12.0
anthropic: 0.73.0
litellm: Module Not Found
decord2: 2.0.0
NVIDIA Topology:
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X NV18 NV18 NV18 NV18 NV18 NV18 NV18 0-47 0 N/A
GPU1 NV18 X NV18 NV18 NV18 NV18 NV18 NV18 0-47 0 N/A
GPU2 NV18 NV18 X NV18 NV18 NV18 NV18 NV18 0-47 0 N/A
GPU3 NV18 NV18 NV18 X NV18 NV18 NV18 NV18 0-47 0 N/A
GPU4 NV18 NV18 NV18 NV18 X NV18 NV18 NV18 48-95 1 N/A
GPU5 NV18 NV18 NV18 NV18 NV18 X NV18 NV18 48-95 1 N/A
GPU6 NV18 NV18 NV18 NV18 NV18 NV18 X NV18 48-95 1 N/A
GPU7 NV18 NV18 NV18 NV18 NV18 NV18 NV18 X 48-95 1 N/A
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
Hypervisor vendor: KVM
ulimit soft: 1048576
The current nightly, FROM lmsysorg/sglang:nightly-dev-cu13-20251116-de7eaa7c, still gives the same issue:
File "/usr/local/lib/python3.12/dist-packages/flashinfer/fused_moe/core.py", line 1379, in trtllm_fp8_block_scale_moe_op
moe_op.trtllm_fp8_block_scale_moe(
File "python/tvm_ffi/cython/function.pxi", line 901, in core.Function.__call__
TypeError: Mismatched type on argument #17 when calling: `trtllm_fp8_block_scale_moe(0: DLTensor*, 1: Optional<DLTensor*>, 2: DLTensor*, 3: DLTensor*, 4: DLTensor*, 5: DLTensor*, 6: DLTensor*, 7: DLTensor*, 8: DLTensor*, 9: int, 10: int, 11: Optional<int>, 12: Optional<int>, 13: int, 14: int, 15: int, 16: Optional<float>, 17: int, 18: bool, 19: int, 20: bool, 21: Array<int>) -> void`. Expected `int` but got `None`
Can you check whether your sglang build includes this change?
https://github.com/sgl-project/sglang/commit/e389f91decdad61653edc57c765ef6041506e4a2 (Nov 17)
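One way to check is to look for that commit in the source tree the container actually runs (a sketch; the checkout path is taken from the traceback and assumes the install is a git checkout):
cd /sgl-workspace/sglang
# Exits 0 if the fix commit is an ancestor of the checked-out revision
git merge-base --is-ancestor e389f91decdad61653edc57c765ef6041506e4a2 HEAD \
  && echo "fix included" || echo "fix missing"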
@kaixih After including this change, it works for me. Thanks!