Baizhou Zhang comments

Results: 79 comments of Baizhou Zhang

[Feat] Support FlashMLA backend with MTP and FP8 KV cache

It's recommended to compare performance with MTP enabled and disabled using the following script:

```bash
python3 python/sglang/test/send_one.py
```

Please paste the results.
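To make the two runs easy to compare when pasting results, a small helper like the one below may be useful. It is only a sketch: it assumes you have written down the output-token count and end-to-end latency of each run yourself, and the names `RunResult`, `speedup`, and `format_comparison` are hypothetical. It does not parse the output of `send_one.py`, and it contains no measured numbers.

```python
from dataclasses import dataclass


@dataclass
class RunResult:
    """One manually recorded run (hypothetical structure, not an sglang type)."""

    label: str          # e.g. "MTP off" or "MTP on"
    output_tokens: int  # number of generated tokens in the run
    latency_s: float    # end-to-end latency in seconds

    @property
    def tokens_per_second(self) -> float:
        if self.latency_s <= 0:
            raise ValueError("latency_s must be positive")
        return self.output_tokens / self.latency_s


def speedup(baseline: RunResult, candidate: RunResult) -> float:
    """Ratio of candidate throughput to baseline throughput."""
    return candidate.tokens_per_second / baseline.tokens_per_second


def format_comparison(baseline: RunResult, candidate: RunResult) -> str:
    """Render a short plain-text summary suitable for pasting into a comment."""
    lines = [
        f"{r.label}: {r.tokens_per_second:.2f} tok/s "
        f"({r.output_tokens} tokens in {r.latency_s:.2f} s)"
        for r in (baseline, candidate)
    ]
    lines.append(f"speedup: {speedup(baseline, candidate):.2f}x")
    return "\n".join(lines)
```

Fill in `RunResult` with the values observed from each run (MTP disabled as the baseline, MTP enabled as the candidate) and paste the string returned by `format_comparison`.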

[Feat] Support FlashMLA backend with MTP and FP8 KV cache

@quinnrong94 The added tests do not pass CI. Please have a look.

[Feat] Support FlashMLA backend with MTP and FP8 KV cache

Hi @quinnrong94, can you take a look at this CI failure? https://github.com/sgl-project/sglang/actions/runs/14996032913/job/42130798605?pr=6109

[Feat] Support FlashMLA backend with MTP and FP8 KV cache

> > Hi @quinnrong94, can you take a look at this CI failure? https://github.com/sgl-project/sglang/actions/runs/14996032913/job/42130798605?pr=6109
>
> Hi @Fridge003, I saw the FlashMLA test failed in CI, I wonder if...

[Feat] Support FlashMLA backend with MTP and FP8 KV cache

For future PRs:
- Do some profiling and check whether there is any bubble caused by synchronization between CPU & GPU
- Support speculative-num-steps > 1
- Support topk >...

[Bug] When deploying the two - machine DeepSeek R1 on the H200, the worker machines will get stuck at the stage of loading weights.

cc @zhyncs

[Bug] When deploying the two - machine DeepSeek R1 on the H200, the worker machines will get stuck at the stage of loading weights.

Hi @isky-cd, #3424 seems to be fixed by PR #3709. Could you please pull the latest branch and check whether the bug is resolved?

[build issue] cuda12.8 torch 2.7 compatiable issue

[Feature] lora serving performance

[Bug] MLA slower than default for small context long outputs and generating bad output reproducibly