Cheng Wan
## Motivation

This code triggers an AssertionError:

```python
import torch
from sglang.srt.layers.moe.fused_moe_triton.fused_moe import fused_moe

N = 64 * 1024 + 10
E = 8
H = 1024
I = 4096
x...
```
## Motivation

This PR partially addresses #3633.

## Modifications

We reuse the memory of `intermediate_cache1` to create `intermediate_cache3`; a sketch of the pattern follows the test script. Here is the test script:

```python
import torch
from sglang.srt.layers.moe.fused_moe_triton.fused_moe import ...
```
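A minimal sketch of the reuse pattern, not the actual fused_moe code: the shapes below are illustrative assumptions, and the trick is only safe because `intermediate_cache1` is fully consumed (by the activation between the two grouped GEMMs) before `intermediate_cache3` is written.

```python
import torch

# Assumed shapes: cache1 is (M, topk, N) and cache3 is (M, topk, K) with
# K <= N, so cache3 fits inside cache1's existing storage.
M, topk, N, K = 16, 2, 8192, 4096

intermediate_cache1 = torch.empty(M, topk, N)

# Instead of allocating a second buffer, view a prefix of cache1's storage.
intermediate_cache3 = (
    intermediate_cache1.flatten()[: M * topk * K].view(M, topk, K)
)

# Both tensors share one allocation: no extra memory is used for cache3.
assert intermediate_cache3.data_ptr() == intermediate_cache1.data_ptr()
```

The saving is the full size of `intermediate_cache3`, at the cost of a lifetime constraint: nothing may read `intermediate_cache1` after `intermediate_cache3` has been written.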
## Motivation

There were two issues related to `--moe-dense-tp-size=1` with DP attention:

- The CUDA graph runner overestimated the buffer size required by DP attention, which may prevent the execution...