
[sgl-kernel] Support PDL for activations

Open Edenzzzz opened this issue 7 months ago • 30 comments

Motivation

Depends on https://github.com/sgl-project/sglang/pull/5981. This PR upgrades FlashInfer in sgl-kernel to enable PDL (Programmatic Dependent Launch) in the activation functions. Will update once compilation is verified.
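
For context, this is roughly how a PDL launch works on Hopper-class GPUs. The sketch below uses hypothetical names (silu_kernel, launch_silu_pdl); the actual sgl-kernel/FlashInfer plumbing differs. The host launches the dependent kernel with the programmatic-stream-serialization attribute, and the kernel synchronizes on the previous grid before touching its inputs.

```cpp
#include <cuda_runtime.h>

// Hypothetical activation kernel illustrating the PDL device-side pattern.
__global__ void silu_kernel(const float* in, float* out, int n) {
    // Work that does not depend on the previous kernel's output
    // (index math, local-variable init) can run before the sync.
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Block until the upstream kernel has signaled that its dependent
    // writes are complete; only then is it safe to read `in`.
    cudaGridDependencySynchronize();

    if (i < n) {
        float x = in[i];
        out[i] = x / (1.0f + __expf(-x));  // SiLU: x * sigmoid(x)
    }
}

// Host side: launch with programmatic dependent launch enabled so this
// grid can begin while the previous grid in the stream is still draining.
void launch_silu_pdl(const float* in, float* out, int n, cudaStream_t stream) {
    cudaLaunchConfig_t cfg = {};
    cfg.gridDim  = dim3((n + 255) / 256);
    cfg.blockDim = dim3(256);
    cfg.stream   = stream;

    cudaLaunchAttribute attr = {};
    attr.id = cudaLaunchAttributeProgrammaticStreamSerialization;
    attr.val.programmaticStreamSerializationAllowed = 1;
    cfg.attrs    = &attr;
    cfg.numAttrs = 1;

    cudaLaunchKernelEx(&cfg, silu_kernel, in, out, n);
}
```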

Modifications

Checklist

  • [ ] Format your code according to the Code Formatting with Pre-Commit.
  • [ ] Add unit tests as outlined in the Running Unit Tests.
  • [ ] Update documentation / docstrings / example tutorials as needed, according to Writing Documentation.
  • [ ] Provide throughput / latency benchmark results and accuracy evaluation results as needed, according to Benchmark and Profiling and Accuracy Results.
  • [ ] For reviewers: If you haven't made any contributions to this PR and are only assisting with merging the main branch, please remove yourself as a co-author when merging the PR.
  • [ ] Please feel free to join our Slack channel at https://slack.sglang.ai to discuss your PR.

Edenzzzz avatar May 29 '25 02:05 Edenzzzz

I see, so this modification helps some activation kernels use PDL? @Edenzzzz

FlamingoPg avatar May 29 '25 06:05 FlamingoPg

I see, so this modification helps some activation kernels use PDL? @Edenzzzz

Yes, combined with PDL in the norm kernels this should further reduce launch overhead.

Edenzzzz avatar May 29 '25 16:05 Edenzzzz

Build successful on H100

Edenzzzz avatar May 29 '25 18:05 Edenzzzz

Please resolve the conflict

Fridge003 avatar Jun 02 '25 04:06 Fridge003

After building with torch 2.6 + CUDA 12.4, I hit an error: ImportError: /usr/local/lib/python3.10/dist-packages/sgl_kernel/common_ops.abi3.so: undefined symbol: _Z18ep_moe_pre_reorderN2at6TensorES0_S0_S0_S0_lllb. @yinfan98 any suggestions?

Edenzzzz avatar Jun 02 '25 18:06 Edenzzzz
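
As an aside, a mangled symbol like this can be decoded to see which host-side binding failed to link. A minimal C++ sketch using the Itanium ABI demangler, equivalent to piping the symbol through c++filt (the symbol string is the one from the ImportError above):

```cpp
#include <cxxabi.h>
#include <cstdio>
#include <cstdlib>

int main() {
    // The undefined symbol from the ImportError above.
    const char* mangled = "_Z18ep_moe_pre_reorderN2at6TensorES0_S0_S0_S0_lllb";
    int status = 0;
    char* name = abi::__cxa_demangle(mangled, nullptr, nullptr, &status);
    if (status == 0) {
        // Prints: ep_moe_pre_reorder(at::Tensor, at::Tensor, at::Tensor,
        //         at::Tensor, at::Tensor, long, long, long, bool)
        std::printf("%s\n", name);
    }
    std::free(name);
    return 0;
}
```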

btw, why isn't CMAKE_BUILD_PARALLEL_LEVEL used for the make build? It takes a very long time.

Edenzzzz avatar Jun 02 '25 20:06 Edenzzzz

btw, why isn't CMAKE_BUILD_PARALLEL_LEVEL used for the make build? It takes a very long time.

You need Ninja; plain make doesn't pick up CMAKE_BUILD_PARALLEL_LEVEL.

FlamingoPg avatar Jun 04 '25 05:06 FlamingoPg

Please reply to the Gemini review.

FlamingoPg avatar Jun 04 '25 06:06 FlamingoPg

Hi @yinfan98, I noticed your request to reply with a review. I can provide a code review for the pull request, but I need to be explicitly invoked to do so using the command /gemini review in a new issue comment. Please use that command if you'd like me to perform a review.

gemini-code-assist[bot] avatar Jun 04 '25 06:06 gemini-code-assist[bot]

I triggered the CI.

FlamingoPg avatar Jun 04 '25 06:06 FlamingoPg

@yinfan98 Any ideas about the undefined symbol?

Edenzzzz avatar Jun 04 '25 20:06 Edenzzzz

@yinfan98 Any ideas about the undefined symbol?

@Edenzzzz Is this problem solved? Can you check which kernel triggered this bug?

Fridge003 avatar Jun 06 '25 18:06 Fridge003

@yinfan98 Any ideas about the undefined symbol?

@Edenzzzz Is this problem solved? Can you check which kernel triggered this bug?

It seems different every time I recompile: [screenshot]

Edenzzzz avatar Jun 06 '25 19:06 Edenzzzz

@yinfan98 Any ideas about the undefined symbol?

@Edenzzzz Is this problem solved? Can you check which kernel triggered this bug?

It seems different every time I recompile: [screenshot]

Have you tried pulling the latest sglang main branch and compiling again?

Fridge003 avatar Jun 06 '25 19:06 Fridge003

@Fridge003 @yinfan98 I believe this is ready to merge. Benchmarked with:

python3 -m sglang.launch_server --model meta-llama/Llama-3.1-8B-Instruct --disable-cuda-graph
python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-input 4000 --random-output 200 --request-rate 8 --num-prompt 480

Before PR

[benchmark results screenshot]

After PR

[benchmark results screenshot]

Edenzzzz avatar Jun 07 '25 17:06 Edenzzzz

Can you please paste some profiling results comparing PDL enabled vs. disabled?

Fridge003 avatar Jun 07 '25 17:06 Fridge003

It's above, under "Before PR".

Edenzzzz avatar Jun 07 '25 18:06 Edenzzzz

It's above, under "Before PR".

I mean a visualization of the kernel timeline.

Fridge003 avatar Jun 07 '25 18:06 Fridge003

The kernel runtime varies a lot between calls, but with PDL there's no inter-kernel gap; I guess that's because the next kernel can launch as soon as some blocks in the previous kernel start exiting.

With PDL

[kernel timeline screenshot]

Without PDL

[kernel timeline screenshot]

Edenzzzz avatar Jun 07 '25 22:06 Edenzzzz

The kernel runtime varies a lot between calls, but with PDL there's no inter-kernel gap; I guess that's because the next kernel can launch as soon as some blocks in the previous kernel start exiting.

With PDL

[kernel timeline screenshot]

Without PDL

[kernel timeline screenshot]

Just curious, is there any possibility of overlapping it with prior kernels?

Fridge003 avatar Jun 08 '25 00:06 Fridge003

I think they are already overlapped, at least for the local-variable initialization part.

Edenzzzz avatar Jun 08 '25 01:06 Edenzzzz
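
For context on why the initialization part can overlap: with PDL, the upstream kernel can signal completion of its dependent writes early, and everything the downstream kernel executes before its grid-dependency sync runs concurrently with the tail of the upstream grid. A minimal sketch with hypothetical kernels (not the actual sgl-kernel code):

```cpp
#include <cuda_runtime.h>

// Upstream kernel (e.g. a norm): finishes its global-memory writes, then
// signals so the programmatically launched successor may pass its sync point.
__global__ void norm_kernel(float* buf, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] *= 0.5f;  // stand-in for the real normalization

    // Tell the runtime this grid's dependent work is complete; anything
    // after this point would overlap with the successor kernel.
    cudaTriggerProgrammaticLaunchCompletion();
}

// Downstream kernel (e.g. an activation): the prologue before the sync
// overlaps with the upstream grid; only the load of `buf` must wait.
__global__ void act_kernel(const float* buf, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // overlaps upstream
    cudaGridDependencySynchronize();                // wait for norm writes
    if (i < n) out[i] = fmaxf(buf[i], 0.0f);        // stand-in activation
}
```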

Can you please post some accuracy tests for the kernels you improved?

Fridge003 avatar Jun 08 '25 01:06 Fridge003

@Fridge003 FlashInfer mistakenly cast pos_ids.data_ptr() to int32_t, which caused the RoPE precision error. The tests now pass locally.

Edenzzzz avatar Jun 08 '25 02:06 Edenzzzz
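
For illustration, the class of bug described above looks roughly like this (a hypothetical snippet, not the actual FlashInfer code): reading an int64 position-ids tensor through an int32_t pointer silently reinterprets each 8-byte id as two 4-byte values.

```cpp
#include <ATen/ATen.h>
#include <cstdint>

void rope_positions_example(const at::Tensor& pos_ids) {
    // pos_ids is an int64 (torch.long) tensor in the failing test.

    // Buggy: treats each 64-bit position id as two 32-bit halves, so the
    // kernel alternates between reading the low word and the high word
    // (usually 0), corrupting the rotary embedding positions.
    const int32_t* bad = static_cast<const int32_t*>(pos_ids.data_ptr());

    // Correct: the pointer type must match the tensor dtype
    // (the templated accessor also checks the dtype at runtime).
    const int64_t* good = pos_ids.data_ptr<int64_t>();

    (void)bad;
    (void)good;
}
```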

Hi @yinfan98, could you merge this if it looks good? Thanks.

Edenzzzz avatar Jun 29 '25 05:06 Edenzzzz

Hi @zhyncs, could you please merge this?

Edenzzzz avatar Jul 12 '25 17:07 Edenzzzz

@Fridge003 Could you trigger the CI again?

Edenzzzz avatar Jul 27 '25 17:07 Edenzzzz

/gemini review

Edenzzzz avatar Jul 31 '25 15:07 Edenzzzz

@Fridge003 Have you seen the DeepEP failure elsewhere? It looks like a kernel launch failure, but I didn't change any launch parameters.

Edenzzzz avatar Nov 04 '25 21:11 Edenzzzz

/tag-and-rerun-ci again

b8zhong avatar Nov 07 '25 19:11 b8zhong

To my understanding, this PR is still needed because we currently use the C++ interface for activations from FlashInfer, right? I found that for other FlashInfer kernels, PDL is determined automatically.

Yes, this basically migrates the launch logic from FlashInfer while supporting ROCm in the other path.

Edenzzzz avatar Nov 08 '25 02:11 Edenzzzz
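
To make that last point concrete, the migrated launch logic presumably ends up shaped something like this (a hedged sketch with hypothetical names; the real sgl-kernel dispatch differs): PDL is used only on the CUDA path when the device supports it, while the ROCm path keeps a plain launch.

```cpp
#include <cuda_runtime.h>

// Hypothetical host-side dispatch: PDL on supported CUDA devices,
// plain launch otherwise (including the ROCm build).
template <typename... Args>
void launch_activation(void (*kernel)(Args...), dim3 grid, dim3 block,
                       cudaStream_t stream, Args... args) {
#ifndef USE_ROCM
    int device = 0, major = 0;
    cudaGetDevice(&device);
    cudaDeviceGetAttribute(&major, cudaDevAttrComputeCapabilityMajor, device);
    if (major >= 9) {  // Hopper and newer support PDL
        cudaLaunchConfig_t cfg = {};
        cfg.gridDim  = grid;
        cfg.blockDim = block;
        cfg.stream   = stream;
        cudaLaunchAttribute attr = {};
        attr.id = cudaLaunchAttributeProgrammaticStreamSerialization;
        attr.val.programmaticStreamSerializationAllowed = 1;
        cfg.attrs    = &attr;
        cfg.numAttrs = 1;
        cudaLaunchKernelEx(&cfg, kernel, args...);
        return;
    }
#endif
    // Fallback: ordinary launch (ROCm via hipify, or pre-Hopper CUDA).
    kernel<<<grid, block, 0, stream>>>(args...);
}
```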