Stefan He comments

Results: 32 comments by Stefan He

[Feat] Enable PDL automatically on Hopper architecture

LFG!

benchmark is low on B200

@tridao I wonder when the FA3 Blackwell version will come out? Looking forward to it!

veRL-SGLang slower than expected (GH200)

![Image](https://github.com/user-attachments/assets/08b77197-bb47-4a99-9a2e-9af7a6d09c5c) @EduardDurech Hi Eduard, thanks for your detailed profiling. I've done some profiling on our side by running Qwen 7B GRPO using almost the same setup as veRL's recipe....

veRL-SGLang slower than expected (GH200)

@EduardDurech Hi Eduard, tbh I don't have an insightful update, but just to share what I did. Some interesting findings: - In CUDA 12.6, sgl and vllm are on par -...

veRL-SGLang slower than expected (GH200)

> * It is weird that gen is slower in veRL though than standalone, no? SGLang is roughly twice the throughput for me in normal inference [veRL-SGLang slower than expected...

feat: mtp support dp-attention

> With DP attention, MTP, and CUDA graph enabled, I found that performance dropped sharply; analysis showed this was because the acceptance rate dropped sharply. This caused the...

[Bug] FA3 KV-Cache-Fp8

@pengcuo Currently the FA3 API only supports headdim

[Bug] The performance of the FA3 attention backend on Hopper is not up to expect.

> me too... > > # Environment > > sglang image == 0.4.6.post2.cu124 > model == Qwen3/Qwen3-235B-A22B > FA3 attention backend > > > # Result > > sglang Mean...

Can Flash Attention 3 run on A100？

Hi @tridao, really appreciate your work! I'm curious about what "FA3 Ampere" refers to. As I understand it, most of FA3's improvements come from Hopper GPU features. So how does...

[Core feature] Support cache overwrite flag at task level

@kumare3 Thanks for the reply. Regarding the `cache_version` solution, it would work for ad hoc/experimental workflow code (tho it requires some level of understanding of how caching works in Flyte, I know...