Ke Bao
Maybe you can refer to https://github.com/triton-lang/triton/issues/4172 and https://github.com/InternLM/lmdeploy/pull/1621#issuecomment-2179731554.
/tag-and-rerun-ci
> Does MoE-EP have any support? I have implemented MoE-EP.

@xiaobochen123 We are going to implement it with a DP + EP approach for throughput gains. Currently, DP attention is...
@zhyncs @merrymercy LGTM, could you help review and merge?
@hariag could you share the commands for 8*H200?
Could you try to add `--disable-overlap-schedule` and test it again?
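For reference, a minimal sketch of what I mean (assuming a typical `sglang.launch_server` invocation; the model path and TP size are placeholders, keep your existing arguments):

```bash
# Hypothetical example: same launch command as before, plus the flag that
# disables the overlapped scheduler, so we can rule it out as the cause.
python -m sglang.launch_server \
  --model-path <your-model-path> \
  --tp 8 \
  --disable-overlap-schedule
```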
> I attached the server side log, please check it.
> [debug.log](https://github.com/user-attachments/files/18856437/debug.log)

I checked the log; it looks like an issue with `sgl_kernels.fp8_blockwise_scaled_mm`. cc: @zhyncs @yizhang2077
@Lzhang-hub Did you try the latest main branch?
@lshmouse @ToughK DP attention is aimed at improving throughput for large batch sizes (>128). Its latency is higher than TP.
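As a rough sketch, this is how it would be turned on server-side (flag names assume the current `sglang.launch_server` CLI; the model path and parallel sizes below are placeholders):

```bash
# Hypothetical example: enable DP attention for large-batch throughput.
# Plain TP has lower latency, so this mainly helps at batch sizes >128.
python -m sglang.launch_server \
  --model-path <your-model-path> \
  --tp 8 \
  --dp 8 \
  --enable-dp-attention
```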
Could you add DP attention to the benchmarks?