[sgl-kernel] Support PDL for activations
Motivation
Depends on https://github.com/sgl-project/sglang/pull/5981; upgrades FlashInfer in sgl-kernel to enable PDL (Programmatic Dependent Launch) in the activation functions. Will update once compilation is verified.
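For context, enabling PDL means launching the activation kernel with the programmatic stream serialization attribute so it can begin launching before the previous kernel on the stream has fully drained. The snippet below is a minimal sketch of such a launch; the kernel body, argument list, and the `launch_silu_and_mul` helper are assumptions for illustration, not the actual sgl-kernel or FlashInfer code (PDL requires CUDA 12+ and Hopper, sm_90).

```cpp
#include <cuda_runtime.h>

// Hypothetical elementwise SiLU-and-mul kernel, used only to illustrate the launch path.
__global__ void silu_and_mul_kernel(float* out, const float* gate, const float* up, int d) {
  const int token = blockIdx.x;
  for (int j = threadIdx.x; j < d; j += blockDim.x) {
    const float g = gate[token * d + j];
    out[token * d + j] = (g / (1.0f + __expf(-g))) * up[token * d + j];  // silu(gate) * up
  }
}

// Sketch of a PDL-aware launch: when enable_pdl is true, this kernel may begin
// launching while the preceding kernel on `stream` is still draining.
void launch_silu_and_mul(float* out, const float* gate, const float* up,
                         int num_tokens, int d, cudaStream_t stream, bool enable_pdl) {
  cudaLaunchConfig_t config = {};
  config.gridDim = dim3(num_tokens);
  config.blockDim = dim3(256);
  config.stream = stream;

  cudaLaunchAttribute attr;
  attr.id = cudaLaunchAttributeProgrammaticStreamSerialization;
  attr.val.programmaticStreamSerializationAllowed = enable_pdl ? 1 : 0;
  config.attrs = &attr;
  config.numAttrs = 1;

  cudaLaunchKernelEx(&config, silu_and_mul_kernel, out, gate, up, d);
}
```

The attribute only permits the early launch; the device-side synchronization discussed later in the thread is what keeps the data dependency on the producer kernel correct.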
Modifications
Checklist
- [ ] Format your code according to the Code Formatting with Pre-Commit.
- [ ] Add unit tests as outlined in the Running Unit Tests.
- [ ] Update documentation / docstrings / example tutorials as needed, according to Writing Documentation.
- [ ] Provide throughput / latency benchmark results and accuracy evaluation results as needed, according to Benchmark and Profiling and Accuracy Results.
- [ ] For reviewers: If you haven't made any contributions to this PR and are only assisting with merging the main branch, please remove yourself as a co-author when merging the PR.
- [ ] Please feel free to join our Slack channel at https://slack.sglang.ai to discuss your PR.
I see, so this modification helps some activation kernels use PDL? @Edenzzzz
Yes, combined with PDL in the norm kernels, this should further reduce launch overhead.
Build successful on H100
Please resolve the conflict
After building with torch 2.6 + CUDA 12.4, I hit an error:
ImportError: /usr/local/lib/python3.10/dist-packages/sgl_kernel/common_ops.abi3.so: undefined symbol: _Z18ep_moe_pre_reorderN2at6TensorES0_S0_S0_S0_lllb
@yinfan98 any suggestions?
btw why is CMAKE_BUILD_PARALLEL_LEVEL not used for make build? It takes a very long time
Need ninja
Reply gemini review plz
Hi @yinfan98, I noticed your request to reply with a review. I can provide a code review for the pull request, but I need to be explicitly invoked to do so using the command /gemini review in a new issue comment. Please use that command if you'd like me to perform a review.
I triggered the CI.
@yinfan98 Any ideas about the undefined symbol?
@Edenzzzz Is this problem solved? Can you check which kernel triggered this bug?
It seems different every time I recompile.
Have you tried pulling the latest sglang main branch and compiling again?
@Fridge003 @yinfan98 I believe this is ready to merge
python3 -m sglang.launch_server --model meta-llama/Llama-3.1-8B-Instruct --disable-cuda-graph
python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-input 4000 --random-output 200 --request-rate 8 --num-prompt 480
Before PR
After PR
Can you please paste some profiling results comparing enabling and disabling PDL?
They're in the results above, labeled "Before PR" and "After PR".
I mean a visualization of the kernel timeline.
The kernel runtime varies a lot between calls, but with PDL there's no inter-kernel gap, presumably because the next kernel can launch as soon as some blocks of the previous kernel start exiting.
With PDL
Without PDL
Just curious, is there any possibility of overlapping it with prior kernels?
I think they are already overlapped, at least for the local-variable initialization part; see the sketch below.
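To make the overlap concrete: with PDL, everything a block executes before cudaGridDependencySynchronize() may run while the previous kernel is still finishing, and cudaTriggerProgrammaticLaunchCompletion() in turn lets the next dependent kernel start early. The kernel below is an illustrative sketch (a hypothetical SiLU kernel, not the sgl-kernel source; the device-side calls need sm_90).

```cpp
#include <cuda_runtime.h>

// Illustrative SiLU kernel showing where the PDL device-side calls typically go.
__global__ void silu_kernel_pdl(float* out, const float* in, int d) {
  // Prologue work that does not touch the producer's output (index math,
  // register/shared-memory setup) can overlap with the previous kernel.
  const int token = blockIdx.x;

  // Wait until the producer kernel has flushed the data this block will read.
  cudaGridDependencySynchronize();

  for (int j = threadIdx.x; j < d; j += blockDim.x) {
    const float x = in[token * d + j];
    out[token * d + j] = x / (1.0f + __expf(-x));  // silu(x)
  }

  // Signal that the next dependent kernel may begin launching.
  cudaTriggerProgrammaticLaunchCompletion();
}
```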
Can you please post some accuracy tests for the kernels you improved?
@Fridge003 FlashInfer mistakenly cast pos_ids.data_ptr() to int32_t, which caused the RoPE precision error. Now it passes locally.
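As a side note, here is a tiny host-side illustration of that class of bug (made-up data, not the FlashInfer code): if the position ids are stored as int64 but the data pointer is cast to int32_t*, the kernel reads 32-bit halves of the 64-bit ids instead of the ids themselves.

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

// Hypothetical illustration of the pointer-width bug described above:
// a buffer of 64-bit position ids read through a 32-bit pointer.
int main() {
  std::vector<int64_t> pos_ids = {0, 1, 2, 1000000};

  // Wrong: each "position" is now one 32-bit half of some 64-bit id.
  const int32_t* misread = reinterpret_cast<const int32_t*>(pos_ids.data());

  for (size_t i = 0; i < pos_ids.size(); ++i) {
    std::printf("correct=%lld  misread=%d\n",
                static_cast<long long>(pos_ids[i]), misread[i]);
  }
  return 0;
}
```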
@yinfan98 Hi, could you merge this if it looks good? Thanks.
Hi @zhyncs, could you please merge this?
@Fridge003 Could you trigger the CI again?
/gemini review
@Fridge003 Have you seen the DeepEP failure elsewhere? It looks like a kernel launch failure, but I didn't change any launch parameters.
/tag-and-rerun-ci again
To my understanding, this PR is still needed because we currently use the C++ interface of the activations from FlashInfer, right? For other kernels from FlashInfer, I found that PDL is determined automatically.
Yes, this basically migrates the launch logic from FlashInfer while keeping ROCm support in the other path; a rough sketch of the split is below.
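For readers following along, the split being described might look roughly like this; the function names, the USE_ROCM guard, and the launcher carried over from the earlier sketch are assumptions, not the actual sgl-kernel source.

```cpp
#include <cuda_runtime.h>

// Hypothetical declarations reused from the launch sketch under "Motivation".
__global__ void silu_and_mul_kernel(float* out, const float* gate, const float* up, int d);
void launch_silu_and_mul(float* out, const float* gate, const float* up,
                         int num_tokens, int d, cudaStream_t stream, bool enable_pdl);

// Assumed shape of the dispatch: the ROCm build keeps the plain launch
// (HIP has no PDL equivalent), the CUDA build routes through the PDL-aware launcher.
void silu_and_mul(float* out, const float* gate, const float* up,
                  int num_tokens, int d, cudaStream_t stream, bool enable_pdl) {
#ifdef USE_ROCM
  (void)enable_pdl;  // PDL attribute is unavailable on ROCm.
  silu_and_mul_kernel<<<num_tokens, 256, 0, stream>>>(out, gate, up, d);
#else
  launch_silu_and_mul(out, gate, up, num_tokens, d, stream, enable_pdl);
#endif
}
```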


