
Fix channel-wise INT8 moe config tuning error

Open lambert0312 opened this issue 1 year ago • 1 comments

Motivation

Tuning the fused MoE Triton kernel configuration for a channel-wise INT8 MoE model fails with the following error:

ray.exceptions.RayTaskError(AssertionError): ray::BenchmarkWorker.tune() (pid=723, ip=172.16.65.5, actor_id=1015c7ff7374564f63308ee801000000, repr=<tuning_fused_moe_triton.BenchmarkWorker object at 0x7f4f7c419c60>)
  File "/sgl-workspace/sglang/benchmark/kernels/fused_moe_triton/tuning_fused_moe_triton.py", line 307, in tune
    kernel_time = benchmark_config(
  File "/sgl-workspace/sglang/benchmark/kernels/fused_moe_triton/tuning_fused_moe_triton.py", line 146, in benchmark_config
    run()
  File "/sgl-workspace/sglang/benchmark/kernels/fused_moe_triton/tuning_fused_moe_triton.py", line 127, in run
    fused_moe(
  File "/sgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/fused_moe.py", line 1613, in fused_moe
    return fused_experts(
  File "/sgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/fused_moe.py", line 1258, in fused_experts
    torch.ops.sglang.inplace_fused_experts(
  File "/usr/local/lib/python3.10/dist-packages/torch/_ops.py", line 1123, in __call__
    return self._op(*args, **(kwargs or {}))
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_device.py", line 104, in __torch_function__
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/_ops.py", line 1123, in __call__
    return self._op(*args, **(kwargs or {}))
  File "/sgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/fused_moe.py", line 1097, in inplace_fused_experts
    fused_experts_impl(
  File "/sgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/fused_moe.py", line 1435, in fused_experts_impl
    invoke_fused_moe_kernel(
  File "/sgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/fused_moe.py", line 786, in invoke_fused_moe_kernel
    assert (
AssertionError: int8 quantization only supports channel-wise quantization except for block-wise quantization

Modifications

  • Modify tuning_fused_moe_triton.py to add the corresponding per_channel_quant parameter
  • Update the instructions and the relevant function comments
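
For context, the failure mode in the traceback can be sketched roughly as follows. This is a minimal, self-contained illustration, not the actual sglang code: the function name `select_int8_activation_quant` and its return values are hypothetical, and only the flag name `per_channel_quant` and the assertion message come from this PR. It assumes the INT8 path accepts either block-wise quantization (a block shape is given) or channel-wise quantization (the flag is set), and asserts otherwise.

```python
from typing import List, Optional


def select_int8_activation_quant(
    use_int8_w8a8: bool,
    per_channel_quant: bool,
    block_shape: Optional[List[int]] = None,
) -> str:
    """Hypothetical stand-in for the check that fails in the traceback.

    Returns a label for the quantization mode that would be used.
    """
    if not use_int8_w8a8:
        # Not an INT8 run, so the check does not apply.
        return "none"
    if block_shape is not None:
        # Block-wise quantization is allowed without the channel-wise flag.
        return "block-wise"
    # Without a block shape, INT8 must be channel-wise. A tuning script that
    # never sets per_channel_quant would fail here.
    assert per_channel_quant, (
        "int8 quantization only supports channel-wise quantization "
        "except for block-wise quantization"
    )
    return "channel-wise"


# With the flag set, the channel-wise path is selected.
mode = select_int8_activation_quant(use_int8_w8a8=True, per_channel_quant=True)
```

Under these assumptions, calling the function with `use_int8_w8a8=True`, `per_channel_quant=False`, and no block shape raises the same `AssertionError` message seen above, which is why the tuning script needs a way to set the flag.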

Checklist

  • [x] Format your code according to the Code Formatting with Pre-Commit.
  • [ ] Add unit tests as outlined in the Running Unit Tests.
  • [x] Update documentation / docstrings / example tutorials as needed, according to Writing Documentation.
  • [ ] Provide throughput / latency benchmark results and accuracy evaluation results as needed, according to Benchmark and Profiling and Accuracy Results.
  • [ ] For reviewers: If you haven't made any contributions to this PR and are only assisting with merging the main branch, please remove yourself as a co-author when merging the PR.
  • [ ] Please feel free to join our Slack channel at https://slack.sglang.ai to discuss your PR.

lambert0312 avatar Apr 29 '25 06:04 lambert0312

@merrymercy @zhyncs Do you have time to take a look?

lambert0312 avatar May 06 '25 00:05 lambert0312

@zhyncs Do you have time to take a look?

lambert0312 avatar Jul 03 '25 23:07 lambert0312

@zhyncs Could you check whether there are any remaining problems? If not, please merge this PR. Thank you.

lambert0312 avatar Jul 22 '25 00:07 lambert0312

Is this PR ready to merge? @zhyncs @BBuf

lambert0312 avatar Jul 29 '25 08:07 lambert0312