sglang
Fix channel-wise INT8 moe config tuning error
Motivation
When tuning the configuration for a channel-wise INT8 MoE model, the following error occurs:
ray.exceptions.RayTaskError(AssertionError): ray::BenchmarkWorker.tune() (pid=723, ip=172.16.65.5, actor_id=1015c7ff7374564f63308ee801000000, repr=<tuning_fused_moe_triton.BenchmarkWorker object at 0x7f4f7c419c60>)
File "/sgl-workspace/sglang/benchmark/kernels/fused_moe_triton/tuning_fused_moe_triton.py", line 307, in tune
kernel_time = benchmark_config(
File "/sgl-workspace/sglang/benchmark/kernels/fused_moe_triton/tuning_fused_moe_triton.py", line 146, in benchmark_config
run()
File "/sgl-workspace/sglang/benchmark/kernels/fused_moe_triton/tuning_fused_moe_triton.py", line 127, in run
fused_moe(
File "/sgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/fused_moe.py", line 1613, in fused_moe
return fused_experts(
File "/sgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/fused_moe.py", line 1258, in fused_experts
torch.ops.sglang.inplace_fused_experts(
File "/usr/local/lib/python3.10/dist-packages/torch/_ops.py", line 1123, in __call__
return self._op(*args, **(kwargs or {}))
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_device.py", line 104, in __torch_function__
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/_ops.py", line 1123, in __call__
return self._op(*args, **(kwargs or {}))
File "/sgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/fused_moe.py", line 1097, in inplace_fused_experts
fused_experts_impl(
File "/sgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/fused_moe.py", line 1435, in fused_experts_impl
invoke_fused_moe_kernel(
File "/sgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/fused_moe.py", line 786, in invoke_fused_moe_kernel
assert (
AssertionError: int8 quantization only supports channel-wise quantization except for block-wise quantization
Modifications
- Modify `tuning_fused_moe_triton.py` and add the corresponding `per_channel_quant` parameter
- Update instructions and necessary function comments
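
For reviewers, the constraint behind the assertion can be sketched in isolation. This is a minimal illustration, not the code in `fused_moe.py`: the helper name `check_int8_quant_mode` and its return labels are hypothetical, and the argument names `use_int8_w8a8` and `block_shape` are assumptions. Only `per_channel_quant` and the assertion message are taken from this PR.

```python
from typing import List, Optional


def check_int8_quant_mode(
    use_int8_w8a8: bool,
    per_channel_quant: bool,
    block_shape: Optional[List[int]] = None,
) -> str:
    """Hypothetical helper mirroring the rule stated in the assertion message."""
    if not use_int8_w8a8:
        # The constraint only applies to the INT8 path.
        return "not-int8"
    if block_shape is not None:
        # Block-wise quantization is the one exception named in the message.
        return "block-wise"
    # Without a block shape, INT8 must be quantized channel-wise.
    assert per_channel_quant, (
        "int8 quantization only supports channel-wise quantization "
        "except for block-wise quantization"
    )
    return "channel-wise"


if __name__ == "__main__":
    # With the flag passed through, the INT8 path is accepted.
    print(check_int8_quant_mode(True, per_channel_quant=True))
    # Without it (and with no block shape), the assertion fires.
    try:
        check_int8_quant_mode(True, per_channel_quant=False)
    except AssertionError as exc:
        print(f"AssertionError: {exc}")
```

Under these assumptions, a tuning run that never sets `per_channel_quant` behaves like the second call, which is consistent with the traceback in the Motivation section.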
Checklist
- [x] Format your code according to the Code Formatting with Pre-Commit.
- [ ] Add unit tests as outlined in the Running Unit Tests.
- [x] Update documentation / docstrings / example tutorials as needed, according to Writing Documentation.
- [ ] Provide throughput / latency benchmark results and accuracy evaluation results as needed, according to Benchmark and Profiling and Accuracy Results.
- [ ] For reviewers: If you haven't made any contributions to this PR and are only assisting with merging the main branch, please remove yourself as a co-author when merging the PR.
- [ ] Please feel free to join our Slack channel at https://slack.sglang.ai to discuss your PR.
@merrymercy @zhyncs Do you have time to take a look?
@zhyncs Do you have time to take a look?
@zhyncs Could you check whether there are any other problems? If not, please merge it. Thank you.
Is this PR ready to merge? @zhyncs @BBuf