[MOE] enable efficient moe_alignment multi-block execution (3x~6x)
Motivation
The new algorithm is an adaptation of and follow-up to moe-align-with-multiple-blocks-execution in PR#3137 and PR#2970.
| GPU Type | base (master) bench | opt bench | opt correctness |
|---|---|---|---|
| A100 | | | |
The overall acceleration on the A100 chip can be up to 6x, and the adaptation easily achieves 3x overall acceleration in all critical cases:
| GPU | opt bench (all cases) | opt bench (snapshot) |
|---|---|---|
| A100 | | |
| MI100 (gfx908) | | |
Modifications
- moe_align_kernel.cu : implemented the new algorithm, which enables efficient multi-block execution and supports expert counts other than 256 (a minimal multi-block sketch is given after this list).
- include/utils.h : AMD-GPU-friendly helpers; the algorithm provides even better acceleration on AMD GPUs than on NVIDIA chips.
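To make the multi-block idea concrete, here is a minimal, hypothetical sketch rather than the actual sgl-kernel implementation: every block accumulates a partial per-expert histogram, a grid-wide sync publishes the counts, and the per-expert offsets are padded to the MoE block size and prefix-summed. All names, sizes, and the launch configuration below are illustrative only.

```cpp
#include <cooperative_groups.h>
#include <cstdio>
#include <cuda_runtime.h>
#include <vector>

namespace cg = cooperative_groups;

__global__ void moe_align_sketch(const int* topk_ids, int num_tokens,
                                 int num_experts, int block_size,
                                 int* expert_counts, int* cumsum) {
  cg::grid_group grid = cg::this_grid();
  int tid = blockIdx.x * blockDim.x + threadIdx.x;
  int stride = gridDim.x * blockDim.x;

  // Phase 1: grid-stride histogram; every block contributes partial counts.
  for (int i = tid; i < num_tokens; i += stride) {
    atomicAdd(&expert_counts[topk_ids[i]], 1);
  }

  grid.sync();  // legal only under a cooperative launch

  // Phase 2: pad each expert's count to block_size and prefix-sum the offsets.
  if (tid == 0) {
    cumsum[0] = 0;
    for (int e = 0; e < num_experts; ++e) {
      int padded = (expert_counts[e] + block_size - 1) / block_size * block_size;
      cumsum[e + 1] = cumsum[e] + padded;
    }
  }
}

int main() {
  int num_tokens = 1 << 14, num_experts = 256, block_size = 128;
  int *d_ids, *d_counts, *d_cumsum;
  cudaMalloc(&d_ids, num_tokens * sizeof(int));
  cudaMalloc(&d_counts, num_experts * sizeof(int));
  cudaMalloc(&d_cumsum, (num_experts + 1) * sizeof(int));
  cudaMemset(d_counts, 0, num_experts * sizeof(int));  // counts must start at zero

  std::vector<int> h_ids(num_tokens);
  for (int i = 0; i < num_tokens; ++i) h_ids[i] = i % num_experts;
  cudaMemcpy(d_ids, h_ids.data(), num_tokens * sizeof(int), cudaMemcpyHostToDevice);

  dim3 block(256), grid_dim(8);  // multiple blocks; the grid must fit on the device
  void* args[] = {&d_ids, &num_tokens, &num_experts, &block_size, &d_counts, &d_cumsum};
  cudaLaunchCooperativeKernel((void*)moe_align_sketch, grid_dim, block, args);
  cudaDeviceSynchronize();

  int total = 0;
  cudaMemcpy(&total, d_cumsum + num_experts, sizeof(int), cudaMemcpyDeviceToHost);
  printf("padded total tokens: %d\n", total);
  return 0;
}
```

The actual kernel does more (e.g., it also writes the sorted token ids and uses shared memory for the partial counts); the sketch only illustrates the cooperative multi-block structure.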
Checklist
- [x] Format your code according to the Code Formatting with Pre-Commit.
- [x] Add unit tests as outlined in the Running Unit Tests.
- [ ] Update documentation / docstrings / example tutorials as needed, according to Writing Documentation.
- [x] Provide throughput / latency benchmark results and accuracy evaluation results as needed, according to Benchmark and Profiling and Accuracy Results.
- [ ] For reviewers: If you haven't made any contributions to this PR and are only assisting with merging the main branch, please remove yourself as a co-author when merging the PR.
- [x] Please feel free to join our Slack channel at https://slack.sglang.ai to discuss your PR.
Great work!
Thank you! I am working on the ROCm (MI210) platform. Will update soon.
@BBuf could you help test on the H200 platform? Thank you!
Can you verify the A800 environment? @yiakwy-xpu-ml-framework-team
The only difference between A800 and A100 is perhaps the NVLink speed? Since we don't use NVLink for this op, I think it should work as well.
I will try it
I ran into this issue on MI300X when running "cd sgl-kernel; python setup_rocm.py install": /sgl-workspace/sglang/sgl-kernel/src/sgl-kernel/csrc/moe_align_kernel.hip:315:5: error: no matching function for call to 'hipLaunchCooperativeKernel'
Hi @andyluo7! The function has been tested on ROCm 6.3 and works well (see the benchmark info). Are you working on the branch "yiakwy-xpu-ml-framework-team:optimize_moe_align_v3"?
As for the signature of "hipLaunchCooperativeKernel", I added support in the header file "include/util.h":
```cpp
#if defined(__HIP_PLATFORM_AMD__)
#include <hip/hip_cooperative_groups.h>
#include <hip/hip_runtime.h>

// Shim: give the CUDA-spelled call a HIP implementation so the same source
// compiles on AMD; it forwards to hipLaunchCooperativeKernel with no dynamic
// shared memory on the default stream.
static __inline__ __host__ __device__ hipError_t cudaLaunchCooperativeKernel(const void* f, dim3 gridDim,
                                                                              dim3 blockDimX, void** kernelParams) {
  return hipLaunchCooperativeKernel(f, gridDim, blockDimX, kernelParams, 0, hipStreamDefault);
}
#endif
```
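For completeness, a hypothetical call site might look like the snippet below (the kernel and launch configuration are made up for illustration). On NVIDIA the four-argument call resolves against the runtime's defaulted sharedMem/stream parameters; on AMD, the shim above provides the matching four-argument signature.

```cpp
#include <cooperative_groups.h>
#include <cuda_runtime.h>

// Hypothetical cooperative kernel, only here to show the launch path.
__global__ void dummy_coop_kernel(int* out) {
  cooperative_groups::this_grid().sync();
  if (blockIdx.x == 0 && threadIdx.x == 0) *out = static_cast<int>(gridDim.x);
}

cudaError_t launch_example(int* d_out, int num_blocks) {
  void* args[] = {&d_out};
  dim3 grid(num_blocks), block(256);
  // Same spelling on both platforms: NVIDIA resolves the defaulted
  // sharedMem/stream arguments; AMD resolves the four-argument shim above.
  return cudaLaunchCooperativeKernel((void*)dummy_coop_kernel, grid, block, args);
}
```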
It should not be a problem. The AMD code was verified on an AMD platform; I then submitted the changes from another machine. Let me check if anything is missing.
@yiakwy-xpu-ml-framework-team I rebuilt the kernel using the new code, and the following error occurred when starting:
[2025-02-19 02:32:32 TP29] Scheduler hit an exception: Traceback (most recent call last):
File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 237, in __init__
self.capture()
File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 304, in capture
) = self.capture_one_batch_size(bs, forward)
File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 387, in capture_one_batch_size
run_once()
File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 380, in run_once
logits_output = forward(input_ids, forward_batch.positions, forward_batch)
File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/eval_frame.py", line 465, in _fn
return fn(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/external_utils.py", line 40, in inner
return fn(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 874, in forward
hidden_states = self.model(input_ids, positions, forward_batch)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 835, in forward
hidden_states, residual = layer(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 766, in forward
if not forward_batch.forward_mode.is_idle():
File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 773, in torch_dynamo_resume_in_forward_at_766
hidden_states = self.self_attn(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 524, in forward
forward_batch.forward_mode.is_extend()
File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 531, in torch_dynamo_resume_in_forward_at_524
return self.forward_absorb(positions, hidden_states, forward_batch)
File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 623, in forward_absorb
attn_output = self.attn_mqa(q_input, k_input, v_input, forward_batch)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "/sgl-workspace/sglang/python/sglang/srt/layers/radix_attention.py", line 67, in forward
return forward_batch.attn_backend.forward(
File "/sgl-workspace/sglang/python/sglang/srt/layers/attention/__init__.py", line 66, in forward
if forward_batch.forward_mode.is_decode():
File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/convert_frame.py", line 1269, in __call__
return self._torchdynamo_orig_callable(
File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/convert_frame.py", line 1064, in __call__
result = self._inner_convert(
File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/convert_frame.py", line 526, in __call__
return _compile(
File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/convert_frame.py", line 952, in _compile
raise InternalTorchDynamoError(
File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/convert_frame.py", line 924, in _compile
guarded_code = compile_inner(code, one_graph, hooks, transform)
File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/convert_frame.py", line 666, in compile_inner
return _compile_inner(code, one_graph, hooks, transform)
File "/usr/local/lib/python3.10/dist-packages/torch/_utils_internal.py", line 87, in wrapper_function
return function(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/convert_frame.py", line 699, in _compile_inner
out_code = transform_code_object(code, transform)
File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/bytecode_transformation.py", line 1322, in transform_code_object
transformations(instructions, code_options)
File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/convert_frame.py", line 231, in _fn
torch.cuda.set_rng_state(cuda_rng_state)
File "/usr/local/lib/python3.10/dist-packages/torch/cuda/random.py", line 77, in set_rng_state
_lazy_call(cb)
File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 249, in _lazy_call
callable()
File "/usr/local/lib/python3.10/dist-packages/torch/cuda/random.py", line 75, in cb
default_generator.set_state(new_state_copy)
torch._dynamo.exc.InternalTorchDynamoError: RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
You can suppress this exception and fall back to eager by setting:
import torch._dynamo
torch._dynamo.config.suppress_errors = True
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 1816, in run_scheduler_process
scheduler = Scheduler(server_args, port_args, gpu_id, tp_rank, dp_rank)
File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 240, in __init__
self.tp_worker = TpWorkerClass(
File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker.py", line 68, in __init__
self.model_runner = ModelRunner(
File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 224, in __init__
self.init_cuda_graphs()
File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 741, in init_cuda_graphs
self.cuda_graph_runner = CudaGraphRunner(self)
File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 239, in __init__
raise Exception(
Exception: Capture cuda graph failed: RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
You can suppress this exception and fall back to eager by setting:
import torch._dynamo
torch._dynamo.config.suppress_errors = True
Possible solutions:
1. disable cuda graph by --disable-cuda-graph
2. set --mem-fraction-static to a smaller value (e.g., 0.8 or 0.7)
3. disable torch compile by not using --enable-torch-compile
4. set --cuda-graph-max-bs to a smaller value (e.g., 32)
Open an issue on GitHub https://github.com/sgl-project/sglang/issues/new/choose
@HaiShaw Could you please take a look?
@lambert0312 Let me double-check on both platforms today. What chips did you use?
A800 thanks @yiakwy-xpu-ml-framework-team
@zhaochenyang20 I have reverted the commit to 6b08bf5. Once the review is done, we can rebase onto the main branch to resolve conflicts.
Please let me do the rebase merge later to simplify the commit messages and resolve potential conflicts! But really, thank you for trying it out.
@yiakwy-xpu-ml-framework-team Does this version support A800?
Reproduction
AMD (pass)
build
cd sgl-kernel && python setup_rocm.py install
Snapshot 1:
Snapshot 2:
Snapshot 3:
test
cd benchmark/kernels/fused_moe_triton && python benchmark_deepseekv3_moe_align_blocks.py --verify
benchmark
cd benchmark/kernels/fused_moe_triton && python benchmark_deepseekv3_moe_align_blocks.py --save_path ./
Note: when building on MI100, one should modify the arch to "gfx908".
Note: to run on AMD correctly, you must install PyTorch correctly. Here is a reference of the requirements to build a Docker image with ROCm 6.3:
torch @ https://repo.radeon.com/rocm/manylinux/rocm-rel-6.3/torch-2.3.0%2Brocm6.3.0-cp312-cp312-linux_x86_64.whl
torchvision @ https://repo.radeon.com/rocm/manylinux/rocm-rel-6.3/torchvision-0.19.0%2Brocm6.3.0-cp312-cp312-linux_x86_64.whl
torchaudio @ https://repo.radeon.com/rocm/manylinux/rocm-rel-6.3/torchaudio-2.3.0%2Brocm6.3.0-cp312-cp312-linux_x86_64.whl
Details can be found in the Tools-dockerhub project. It is a one-stop repo for everything needed to get ROCm set up.
FYI @andyluo7, please let me know if the issue persists.
NV (pass)
GPU: A800
build
cd sgl-kernel && pip install -e . -v
- NOTE: on the CUDA platform you should NOT use "python setup.py install" (the resulting binary is not correct); the issue will be fixed later. You must install with pip so that the correct dependencies are installed.
Snapshot 1:
Snapshot 2:
Snapshot 3:
FYI @lambert0312, I am not sure whether your problem is related to CUDA graph or some other functions outside the scope of this PR.
Could you send me reproduction code, and can you make sure the problem is caused by the kernel in this PR?
@yiakwy-xpu-ml-framework-team Thanks for the reply. I will try it again according to the steps tomorrow. Logically speaking, it will be built using your patch.
@lambert0312 Thank you. Could you kindly delete the quote in your last reply to keep the message neat?
Here is my suggestion in case you face it again:
- make sure the code runs smoothly without this PR, and help me understand how to reproduce the issue if it persists; info on num_tokens, num_experts, bs, and seq would be helpful;
- to maximize performance, the kernel uses more registers and more shared memory (usually released after execution, since the compiler reduces the liveness of intermediates); chances are that the CUDA graph capture needs to be reduced (a small sketch for inspecting this follows the list);
- the input data may not be well aligned (to a power of 16, for example), because our tests only cover data whose sizes are powers of 2;
- if the issue persists, I will provide an even better kernel for this purpose that uses fewer registers.
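As a generic way to check the second bullet (this is a sketch, not code already in this PR; `some_kernel` is a placeholder for the actual alignment kernel symbol), the compiled register and shared-memory footprint can be queried directly from the runtime:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// `some_kernel` is a placeholder; substitute the actual alignment kernel.
__global__ void some_kernel() {}

int main() {
  cudaFuncAttributes attr{};
  cudaFuncGetAttributes(&attr, (const void*)some_kernel);
  printf("registers per thread: %d, static shared memory: %zu bytes, max threads per block: %d\n",
         attr.numRegs, attr.sharedSizeBytes, attr.maxThreadsPerBlock);
  return 0;
}
```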
@yiakwy-xpu-ml-framework-team Hey Yi, thanks so much for the help. Could you tell others how to run your branch to profile it? We can ask community users to do this.
I will test it on H800.
Here is my result. @yiakwy-xpu-ml-framework-team @zhaochenyang20
| GPU | opt bench (all cases) | opt bench (snapshot) |
|---|---|---|
| H100 | | |
Great! @yiakwy-xpu-ml-framework-team I won't rebase your PR. Please rebase it on your side, and if @BBuf agrees with the PR, I can merge it.
I will test it on H200.
@fzyzcjy Thanks so much. And @BBuf, do you think we can merge it after clear profiling?
Yeah, and we should also run an accuracy test on DeepSeek V3; refer to:
python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-V3 --tp 8 --trust-remote-code
python3 benchmark/gsm8k/bench_sglang.py --num-questions 2000 --parallel 2000 --num-shots 8
Here is my result. @yiakwy-xpu-ml-framework-team @zhaochenyang20
| GPU | opt bench (all cases) | opt bench (snapshot) |
|---|---|---|
| H200 | | |
Fix unit test:
Issue: cumsum_buffer is allocated uninitialized (empty) in the latest unit test.
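For context, the general shape of the guard is sketched below; the names are illustrative, and the actual unit-test change may simply switch the allocation to a zeroed tensor. The point is that the prefix-sum buffer must be zeroed before the kernel accumulates into it.

```cpp
#include <cuda_runtime.h>

// Illustrative only: zero the (num_experts + 1) prefix-sum slots before the
// alignment kernel accumulates into them, so an uninitialized (empty)
// allocation cannot leak garbage into the offsets.
void prepare_cumsum_buffer(int* d_cumsum, int num_experts, cudaStream_t stream) {
  cudaMemsetAsync(d_cumsum, 0, (num_experts + 1) * sizeof(int), stream);
}
```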
@BBuf everything is cool on my side. Could you give a final look and approve it?
Thanks to @fsygd for reporting: the kernel has problems when used together with CUDA graph in end-to-end mode (verified on Qwen MoE and DeepSeek V3 MoE) by forcing fused MoE to pick up the "sgl_moe_align_block_size" implementation.
The bug is tricky, since it can only be reproduced inside CUDA graph mode, and the debug messages are consequently too large to read (with AMD_LOG_LEVEL=4, 8 GB of messages are produced). On NVIDIA GPUs, the issue can be temporarily resolved by disabling CUDA graph.
The kernel is fast because it makes aggressive use of shared memory and an HBM buffer for block-wise reduction (intermediates are usually released after execution by the compiler to reduce their liveness). Hence the resource budget may be limited when multiple kernels execute concurrently.
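One generic way to sanity-check that budget (a sketch, not code from this PR; `some_kernel` stands in for the alignment kernel) is to ask the runtime how many blocks can be co-resident, since a cooperative grid that no longer fits on the device fails in exactly this hard-to-trace way.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// `some_kernel` is a stand-in for the alignment kernel.
__global__ void some_kernel() {}

int main() {
  int device = 0, sm_count = 0, blocks_per_sm = 0;
  cudaGetDevice(&device);
  cudaDeviceGetAttribute(&sm_count, cudaDevAttrMultiProcessorCount, device);

  // Blocks of this kernel that can be co-resident per SM at a 256-thread block,
  // given its register / shared-memory footprint.
  cudaOccupancyMaxActiveBlocksPerMultiprocessor(
      &blocks_per_sm, (const void*)some_kernel, /*blockSize=*/256, /*dynamicSMemSize=*/0);

  // A cooperative launch must fit entirely on the device; a larger grid (or extra
  // pressure from concurrently captured work) makes the launch fail.
  printf("max cooperative grid: %d blocks\n", sm_count * blocks_per_sm);
  return 0;
}
```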
```python
# cuda graph capture with multiple copies
def capture(self):
    with graph_capture() as graph_capture_context:
        self.stream = graph_capture_context.stream
        capture_range = (
            tqdm.tqdm(self.capture_bs)
            if get_tensor_model_parallel_rank() == 0
            else self.capture_bs
        )
        for bs in capture_range:
            print("[CudaGraph::capture] bs : {}/{}".format(bs, self.capture_bs[-1]))
            with patch_model(
                self.model_runner.model,
                bs in self.compile_bs,
                num_tokens=bs * self.num_tokens_per_bs,
                tp_group=self.model_runner.tp_group,
            ) as forward:
                (
                    graph,
                    output_buffers,
                ) = self.capture_one_batch_size(bs, forward)
                self.graphs[bs] = graph
                self.output_buffers[bs] = output_buffers

                # Save gemlite cache after each capture
                save_gemlite_cache()
```
Note that CUDA graph keeps multiple copies of kernels (multiplying code size) to deal with different batch sizes in the decoding stage. Hence the KV cache will be dramatically increased (20 GB on an NVIDIA GPU, 70 GB on an AMD GPU).
I am debugging it on MI300X and other GPUs with special debugging agents and trying to find an optimal solution. Once it is done with the expected e2e results (and some improvements), I will share it with you.
FYI @lambert0312 @andyluo7 @BBuf @zhaochenyang20
@yiakwy-xpu-ml-framework-team Yi, thanks so much!