[MOE] enable efficient moe_alignment multi-block execution (3x~6x)
Motivation
The new algorithm is an adaptation of and follow-up to moe-align-with-multiple-blocks-execution in PR#3137 and PR#2970.
| GPU Type | base (master) bench | opt bench | opt correctness |
|---|---|---|---|
| A100 | | | |
The overall acceleration on the A100 chip can be up to 6x, and the adaptation easily achieves 3x overall acceleration in all critical cases:
| GPU | opt bench (all cases) | opt bench (snapshot) |
|---|---|---|
| A100 | | |
| MI100 (gfx908) | | |
Modifications
- moe_align_kernel.cu : implemented the new algorithm, which enables efficient multi-block execution and supports expert counts other than 256 (a minimal multi-block sketch is given after this list).
- include/utils.h : AMD-GPU-friendly helpers; the algorithm provides even better acceleration on AMD GPUs than on NVIDIA chips.
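To make the multi-block idea concrete, here is a minimal, hypothetical sketch rather than the actual sgl-kernel implementation: every block accumulates a partial per-expert histogram, a grid-wide sync publishes the counts, and the per-expert offsets are padded to the MoE block size and prefix-summed. All names, sizes, and the launch configuration below are illustrative only.

```cpp
#include <cooperative_groups.h>
#include <cstdio>
#include <cuda_runtime.h>
#include <vector>

namespace cg = cooperative_groups;

__global__ void moe_align_sketch(const int* topk_ids, int num_tokens,
                                 int num_experts, int block_size,
                                 int* expert_counts, int* cumsum) {
  cg::grid_group grid = cg::this_grid();
  int tid = blockIdx.x * blockDim.x + threadIdx.x;
  int stride = gridDim.x * blockDim.x;

  // Phase 1: grid-stride histogram; every block contributes partial counts.
  for (int i = tid; i < num_tokens; i += stride) {
    atomicAdd(&expert_counts[topk_ids[i]], 1);
  }

  grid.sync();  // legal only under a cooperative launch

  // Phase 2: pad each expert's count to block_size and prefix-sum the offsets.
  if (tid == 0) {
    cumsum[0] = 0;
    for (int e = 0; e < num_experts; ++e) {
      int padded = (expert_counts[e] + block_size - 1) / block_size * block_size;
      cumsum[e + 1] = cumsum[e] + padded;
    }
  }
}

int main() {
  int num_tokens = 1 << 14, num_experts = 256, block_size = 128;
  int *d_ids, *d_counts, *d_cumsum;
  cudaMalloc(&d_ids, num_tokens * sizeof(int));
  cudaMalloc(&d_counts, num_experts * sizeof(int));
  cudaMalloc(&d_cumsum, (num_experts + 1) * sizeof(int));
  cudaMemset(d_counts, 0, num_experts * sizeof(int));  // counts must start at zero

  std::vector<int> h_ids(num_tokens);
  for (int i = 0; i < num_tokens; ++i) h_ids[i] = i % num_experts;
  cudaMemcpy(d_ids, h_ids.data(), num_tokens * sizeof(int), cudaMemcpyHostToDevice);

  dim3 block(256), grid_dim(8);  // multiple blocks; the grid must fit on the device
  void* args[] = {&d_ids, &num_tokens, &num_experts, &block_size, &d_counts, &d_cumsum};
  cudaLaunchCooperativeKernel((void*)moe_align_sketch, grid_dim, block, args);
  cudaDeviceSynchronize();

  int total = 0;
  cudaMemcpy(&total, d_cumsum + num_experts, sizeof(int), cudaMemcpyDeviceToHost);
  printf("padded total tokens: %d\n", total);
  return 0;
}
```

The actual kernel does more (e.g., it also writes the sorted token ids and uses shared memory for the partial counts); the sketch only illustrates the cooperative multi-block structure.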
Checklist
- [x] Format your code according to the Code Formatting with Pre-Commit.
- [x] Add unit tests as outlined in the Running Unit Tests.
- [ ] Update documentation / docstrings / example tutorials as needed, according to Writing Documentation.
- [x] Provide throughput / latency benchmark results and accuracy evaluation results as needed, according to Benchmark and Profiling and Accuracy Results.
- [ ] For reviewers: If you haven't made any contributions to this PR and are only assisting with merging the main branch, please remove yourself as a co-author when merging the PR.
- [x] Please feel free to join our Slack channel at https://slack.sglang.ai to discuss your PR.
Great work!
Thank you! I am working on the ROCm (MI210) platform. Will update soon.
@BBuf could you help test on the H200 platform? Thank you!
Can you verify the A800 environment? @yiakwy-xpu-ml-framework-team
The only difference between A800 and A100 is perhaps the NVLink speed? Since we don't use NVLink for this op, I think it should work as well.
I will try it
I ran into this issue on MI300X when running "cd sgl-kernel; python setup_rocm.py install": /sgl-workspace/sglang/sgl-kernel/src/sgl-kernel/csrc/moe_align_kernel.hip:315:5: error: no matching function for call to 'hipLaunchCooperativeKernel'
Hi @andyluo7! The function has been tested on ROCm 6.3 and works well (see the benchmark info). Are you working on the branch "yiakwy-xpu-ml-framework-team:optimize_moe_align_v3"?
As for the signature of "hipLaunchCooperativeKernel", I added support in the header file "include/util.h":
```cpp
#if defined(__HIP_PLATFORM_AMD__)
#include <hip/hip_cooperative_groups.h>
#include <hip/hip_runtime.h>

// Shim: give the CUDA-spelled call a HIP implementation so the same source
// compiles on AMD; it forwards to hipLaunchCooperativeKernel with no dynamic
// shared memory on the default stream.
static __inline__ __host__ __device__ hipError_t cudaLaunchCooperativeKernel(const void* f, dim3 gridDim,
                                                                              dim3 blockDimX, void** kernelParams) {
  return hipLaunchCooperativeKernel(f, gridDim, blockDimX, kernelParams, 0, hipStreamDefault);
}
#endif
```
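For completeness, a hypothetical call site might look like the snippet below (the kernel and launch configuration are made up for illustration). On NVIDIA the four-argument call resolves against the runtime's defaulted sharedMem/stream parameters; on AMD, the shim above provides the matching four-argument signature.

```cpp
#include <cooperative_groups.h>
#include <cuda_runtime.h>

// Hypothetical cooperative kernel, only here to show the launch path.
__global__ void dummy_coop_kernel(int* out) {
  cooperative_groups::this_grid().sync();
  if (blockIdx.x == 0 && threadIdx.x == 0) *out = static_cast<int>(gridDim.x);
}

cudaError_t launch_example(int* d_out, int num_blocks) {
  void* args[] = {&d_out};
  dim3 grid(num_blocks), block(256);
  // Same spelling on both platforms: NVIDIA resolves the defaulted
  // sharedMem/stream arguments; AMD resolves the four-argument shim above.
  return cudaLaunchCooperativeKernel((void*)dummy_coop_kernel, grid, block, args);
}
```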
It should not be a problem. The AMD code was verified on an AMD platform; I then submitted the changes from another machine. Let me check if anything is missing.
@yiakwy-xpu-ml-framework-team I rebuilt the kernel using the new code, and the following error occurred when starting:
[2025-02-19 02:32:32 TP29] Scheduler hit an exception: Traceback (most recent call last):
File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 237, in __init__
self.capture()
File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 304, in capture
) = self.capture_one_batch_size(bs, forward)
File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 387, in capture_one_batch_size
run_once()
File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 380, in run_once
logits_output = forward(input_ids, forward_batch.positions, forward_batch)
File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/eval_frame.py", line 465, in _fn
return fn(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/external_utils.py", line 40, in inner
return fn(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 874, in forward
hidden_states = self.model(input_ids, positions, forward_batch)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 835, in forward
hidden_states, residual = layer(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 766, in forward
if not forward_batch.forward_mode.is_idle():
File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 773, in torch_dynamo_resume_in_forward_at_766
hidden_states = self.self_attn(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 524, in forward
forward_batch.forward_mode.is_extend()
File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 531, in torch_dynamo_resume_in_forward_at_524
return self.forward_absorb(positions, hidden_states, forward_batch)
File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 623, in forward_absorb
attn_output = self.attn_mqa(q_input, k_input, v_input, forward_batch)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "/sgl-workspace/sglang/python/sglang/srt/layers/radix_attention.py", line 67, in forward
return forward_batch.attn_backend.forward(
File "/sgl-workspace/sglang/python/sglang/srt/layers/attention/__init__.py", line 66, in forward
if forward_batch.forward_mode.is_decode():
File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/convert_frame.py", line 1269, in __call__
return self._torchdynamo_orig_callable(
File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/convert_frame.py", line 1064, in __call__
result = self._inner_convert(
File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/convert_frame.py", line 526, in __call__
return _compile(
File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/convert_frame.py", line 952, in _compile
raise InternalTorchDynamoError(
File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/convert_frame.py", line 924, in _compile
guarded_code = compile_inner(code, one_graph, hooks, transform)
File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/convert_frame.py", line 666, in compile_inner
return _compile_inner(code, one_graph, hooks, transform)
File "/usr/local/lib/python3.10/dist-packages/torch/_utils_internal.py", line 87, in wrapper_function
return function(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/convert_frame.py", line 699, in _compile_inner
out_code = transform_code_object(code, transform)
File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/bytecode_transformation.py", line 1322, in transform_code_object
transformations(instructions, code_options)
File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/convert_frame.py", line 231, in _fn
torch.cuda.set_rng_state(cuda_rng_state)
File "/usr/local/lib/python3.10/dist-packages/torch/cuda/random.py", line 77, in set_rng_state
_lazy_call(cb)
File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 249, in _lazy_call
callable()
File "/usr/local/lib/python3.10/dist-packages/torch/cuda/random.py", line 75, in cb
default_generator.set_state(new_state_copy)
torch._dynamo.exc.InternalTorchDynamoError: RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
You can suppress this exception and fall back to eager by setting:
import torch._dynamo
torch._dynamo.config.suppress_errors = True
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 1816, in run_scheduler_process
scheduler = Scheduler(server_args, port_args, gpu_id, tp_rank, dp_rank)
File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 240, in __init__
self.tp_worker = TpWorkerClass(
File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker.py", line 68, in __init__
self.model_runner = ModelRunner(
File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 224, in __init__
self.init_cuda_graphs()
File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 741, in init_cuda_graphs
self.cuda_graph_runner = CudaGraphRunner(self)
File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 239, in __init__
raise Exception(
Exception: Capture cuda graph failed: RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
You can suppress this exception and fall back to eager by setting:
import torch._dynamo
torch._dynamo.config.suppress_errors = True
Possible solutions:
1. disable cuda graph by --disable-cuda-graph
2. set --mem-fraction-static to a smaller value (e.g., 0.8 or 0.7)
3. disable torch compile by not using --enable-torch-compile
4. set --cuda-graph-max-bs to a smaller value (e.g., 32)
Open an issue on GitHub https://github.com/sgl-project/sglang/issues/new/choose
@HaiShaw Could you please take a look?
@lambert0312 Let me double-check on both platforms today. What chips did you use?
A800 thanks @yiakwy-xpu-ml-framework-team
@zhaochenyang20 I have reverted the commit to 6b08bf5. Once the review is done, we can rebase onto the main branch to resolve conflicts.
Please let me do the rebase merge later to simplify the commit messages and resolve potential conflicts! But really, thank you for trying it out.
@yiakwy-xpu-ml-framework-team Does this version support A800?
Reproduction
AMD (pass)
build
cd sgl-kernel && python setup_rocm.py install
Snapshot 1:
Snapshot 2:
Snapshot 3:
test
cd benchmark/kernels/fused_moe_triton && python benchmark_deepseekv3_moe_align_blocks.py --verify
benchmark
cd benchmark/kernels/fused_moe_triton && python benchmark_deepseekv3_moe_align_blocks.py --save_path ./
Note: when building on MI100, one should modify the arch to "gfx908".
Note: to run on AMD correctly, you must install PyTorch correctly. Here is a reference of the requirements to build a Docker image with ROCm 6.3:
torch @ https://repo.radeon.com/rocm/manylinux/rocm-rel-6.3/torch-2.3.0%2Brocm6.3.0-cp312-cp312-linux_x86_64.whl
torchvision @ https://repo.radeon.com/rocm/manylinux/rocm-rel-6.3/torchvision-0.19.0%2Brocm6.3.0-cp312-cp312-linux_x86_64.whl
torchaudio @ https://repo.radeon.com/rocm/manylinux/rocm-rel-6.3/torchaudio-2.3.0%2Brocm6.3.0-cp312-cp312-linux_x86_64.whl
Details can be found in the Tools-dockerhub project. It is a one-stop repo for everything needed to get ROCm set up.
FYI @andyluo7, please let me know if the issue persists.
NV (pass)
GPU: A800
build
cd sgl-kernel && pip install -e . -v
- NOTE: on the CUDA platform you should NOT use "python setup.py install" (the resulting binary is not correct); the issue will be fixed later. You must install with pip so that the correct dependencies are installed.
Snapshot 1:
Snapshot 2:
Snapshot 3:
FYI @lambert0312, I am not sure whether your problem is related to CUDA graph or some other functions outside the scope of this PR.
Could you send me reproduction code, and can you make sure the problem is caused by the kernel in this PR?
@yiakwy-xpu-ml-framework-team Thanks for the reply. I will try it again according to the steps tomorrow. Logically speaking, it will be built using your patch.
@lambert0312 Thank you. Could you kindly delete the quote in your last reply to keep the message neat?
Here is my suggestion in case you face it again:
- make sure the code runs smoothly without this PR, and help me understand how to reproduce the issue if it persists; info on num_tokens, num_experts, bs, and seq would be helpful;
- to maximize performance, the kernel uses more registers and more shared memory (usually released after execution, since the compiler reduces the liveness of intermediates); chances are that the CUDA graph capture needs to be reduced (a small sketch for inspecting this follows the list);
- the input data may not be well aligned (to a power of 16, for example), because our tests only cover data whose sizes are powers of 2;
- if the issue persists, I will provide an even better kernel for this purpose that uses fewer registers.
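As a generic way to check the second bullet (this is a sketch, not code already in this PR; `some_kernel` is a placeholder for the actual alignment kernel symbol), the compiled register and shared-memory footprint can be queried directly from the runtime:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// `some_kernel` is a placeholder; substitute the actual alignment kernel.
__global__ void some_kernel() {}

int main() {
  cudaFuncAttributes attr{};
  cudaFuncGetAttributes(&attr, (const void*)some_kernel);
  printf("registers per thread: %d, static shared memory: %zu bytes, max threads per block: %d\n",
         attr.numRegs, attr.sharedSizeBytes, attr.maxThreadsPerBlock);
  return 0;
}
```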
@yiakwy-xpu-ml-framework-team Hey Yi, thanks so much for the help. Could you tell others how to run your branch to profile it? We can ask community users to do this.
I will test it on H800.
Here is my result. @yiakwy-xpu-ml-framework-team @zhaochenyang20
| GPU | opt bench (all cases) | opt bench (snapshot) |
|---|---|---|
| H100 | | |
Great! @yiakwy-xpu-ml-framework-team I won't rebase your PR. Please rebase it on your side, and if @BBuf agrees with the PR, I can merge it.
I will test it on H200.
@fzyzcjy Thanks so much. And @BBuf, do you think we can merge it after clear profiling?
Yeah, and we should also run an accuracy test on DeepSeek V3; refer to:
python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-V3 --tp 8 --trust-remote-code
python3 benchmark/gsm8k/bench_sglang.py --num-questions 2000 --parallel 2000 --num-shots 8
Here is my result. @yiakwy-xpu-ml-framework-team @zhaochenyang20
| GPU | opt bench (all cases) | opt bench (snapshot) |
|---|---|---|
| H200 | | |
Fix unit test:
Issue: cumsum_buffer is allocated uninitialized (empty) in the latest unit test.
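For context, the general shape of the guard is sketched below; the names are illustrative, and the actual unit-test change may simply switch the allocation to a zeroed tensor. The point is that the prefix-sum buffer must be zeroed before the kernel accumulates into it.

```cpp
#include <cuda_runtime.h>

// Illustrative only: zero the (num_experts + 1) prefix-sum slots before the
// alignment kernel accumulates into them, so an uninitialized (empty)
// allocation cannot leak garbage into the offsets.
void prepare_cumsum_buffer(int* d_cumsum, int num_experts, cudaStream_t stream) {
  cudaMemsetAsync(d_cumsum, 0, (num_experts + 1) * sizeof(int), stream);
}
```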
@BBuf everything is cool on my side. Could you give a final look and approve it?
Thanks to @fsygd for reporting: the kernel has problems when used together with CUDA graph in end-to-end mode (verified on Qwen MoE and DeepSeek V3 MoE) by forcing fused MoE to pick up the "sgl_moe_align_block_size" implementation.
The bug is tricky, since it can only be reproduced inside CUDA graph mode, and the debug messages are consequently too large to read (with AMD_LOG_LEVEL=4, 8 GB of messages are produced). On NVIDIA GPUs, the issue can be temporarily resolved by disabling CUDA graph.
The kernel is fast because it makes aggressive use of shared memory and an HBM buffer for block-wise reduction (intermediates are usually released after execution by the compiler to reduce their liveness). Hence the resource budget may be limited when multiple kernels execute concurrently.
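One generic way to sanity-check that budget (a sketch, not code from this PR; `some_kernel` stands in for the alignment kernel) is to ask the runtime how many blocks can be co-resident, since a cooperative grid that no longer fits on the device fails in exactly this hard-to-trace way.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// `some_kernel` is a stand-in for the alignment kernel.
__global__ void some_kernel() {}

int main() {
  int device = 0, sm_count = 0, blocks_per_sm = 0;
  cudaGetDevice(&device);
  cudaDeviceGetAttribute(&sm_count, cudaDevAttrMultiProcessorCount, device);

  // Blocks of this kernel that can be co-resident per SM at a 256-thread block,
  // given its register / shared-memory footprint.
  cudaOccupancyMaxActiveBlocksPerMultiprocessor(
      &blocks_per_sm, (const void*)some_kernel, /*blockSize=*/256, /*dynamicSMemSize=*/0);

  // A cooperative launch must fit entirely on the device; a larger grid (or extra
  // pressure from concurrently captured work) makes the launch fail.
  printf("max cooperative grid: %d blocks\n", sm_count * blocks_per_sm);
  return 0;
}
```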
```python
# cuda graph capture with multiple copies
def capture(self):
    with graph_capture() as graph_capture_context:
        self.stream = graph_capture_context.stream
        capture_range = (
            tqdm.tqdm(self.capture_bs)
            if get_tensor_model_parallel_rank() == 0
            else self.capture_bs
        )
        for bs in capture_range:
            print("[CudaGraph::capture] bs : {}/{}".format(bs, self.capture_bs[-1]))
            with patch_model(
                self.model_runner.model,
                bs in self.compile_bs,
                num_tokens=bs * self.num_tokens_per_bs,
                tp_group=self.model_runner.tp_group,
            ) as forward:
                (
                    graph,
                    output_buffers,
                ) = self.capture_one_batch_size(bs, forward)
                self.graphs[bs] = graph
                self.output_buffers[bs] = output_buffers

                # Save gemlite cache after each capture
                save_gemlite_cache()
```
Note that CUDA graph keeps multiple copies of kernels (multiplying code size) to deal with different batch sizes in the decoding stage. Hence the KV cache will be dramatically increased (20 GB on an NVIDIA GPU, 70 GB on an AMD GPU).
I am debugging it on MI300X and other GPUs with special debugging agents and trying to find an optimal solution. Once it is done with the expected e2e results (and some improvements), I will share it with you.
FYI @lambert0312 @andyluo7 @BBuf @zhaochenyang20
@yiakwy-xpu-ml-framework-team Yi, thanks so much!