
[sgl-kernel] Support PDL for activations

Open Edenzzzz opened this issue 7 months ago • 30 comments

Motivation

Depends on https://github.com/sgl-project/sglang/pull/5981. This PR upgrades FlashInfer in sgl-kernel to enable PDL (Programmatic Dependent Launch) in the activation functions. Will update once compilation is verified.
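
For context, this is roughly how a PDL launch works on Hopper-class GPUs. The sketch below uses hypothetical names (silu_kernel, launch_silu_pdl); the actual sgl-kernel/FlashInfer plumbing differs. The host launches the dependent kernel with the programmatic-stream-serialization attribute, and the kernel synchronizes on the previous grid before touching its inputs.

```cpp
#include <cuda_runtime.h>

// Hypothetical activation kernel illustrating the PDL device-side pattern.
__global__ void silu_kernel(const float* in, float* out, int n) {
    // Work that does not depend on the previous kernel's output
    // (index math, local-variable init) can run before the sync.
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Block until the upstream kernel has signaled that its dependent
    // writes are complete; only then is it safe to read `in`.
    cudaGridDependencySynchronize();

    if (i < n) {
        float x = in[i];
        out[i] = x / (1.0f + __expf(-x));  // SiLU: x * sigmoid(x)
    }
}

// Host side: launch with programmatic dependent launch enabled so this
// grid can begin while the previous grid in the stream is still draining.
void launch_silu_pdl(const float* in, float* out, int n, cudaStream_t stream) {
    cudaLaunchConfig_t cfg = {};
    cfg.gridDim  = dim3((n + 255) / 256);
    cfg.blockDim = dim3(256);
    cfg.stream   = stream;

    cudaLaunchAttribute attr = {};
    attr.id = cudaLaunchAttributeProgrammaticStreamSerialization;
    attr.val.programmaticStreamSerializationAllowed = 1;
    cfg.attrs    = &attr;
    cfg.numAttrs = 1;

    cudaLaunchKernelEx(&cfg, silu_kernel, in, out, n);
}
```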

Modifications

Checklist

  • [ ] Format your code according to the Code Formatting with Pre-Commit.
  • [ ] Add unit tests as outlined in the Running Unit Tests.
  • [ ] Update documentation / docstrings / example tutorials as needed, according to Writing Documentation.
  • [ ] Provide throughput / latency benchmark results and accuracy evaluation results as needed, according to Benchmark and Profiling and Accuracy Results.
  • [ ] For reviewers: If you haven't made any contributions to this PR and are only assisting with merging the main branch, please remove yourself as a co-author when merging the PR.
  • [ ] Please feel free to join our Slack channel at https://slack.sglang.ai to discuss your PR.

Edenzzzz avatar May 29 '25 02:05 Edenzzzz

I see, so this modification helps some activation kernels use PDL? @Edenzzzz

FlamingoPg avatar May 29 '25 06:05 FlamingoPg

I see, so this modification helps some activation kernels use PDL? @Edenzzzz

Yes, combined with PDL in the norm kernels this should further reduce launch overhead.

Edenzzzz avatar May 29 '25 16:05 Edenzzzz

Build successful on H100

Edenzzzz avatar May 29 '25 18:05 Edenzzzz

Please resolve the conflict

Fridge003 avatar Jun 02 '25 04:06 Fridge003

After building with torch 2.6 + CUDA 12.4, I hit an error: ImportError: /usr/local/lib/python3.10/dist-packages/sgl_kernel/common_ops.abi3.so: undefined symbol: _Z18ep_moe_pre_reorderN2at6TensorES0_S0_S0_S0_lllb. @yinfan98 any suggestions?

Edenzzzz avatar Jun 02 '25 18:06 Edenzzzz
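
As an aside, a mangled symbol like this can be decoded to see which host-side binding failed to link. A minimal C++ sketch using the Itanium ABI demangler, equivalent to piping the symbol through c++filt (the symbol string is the one from the ImportError above):

```cpp
#include <cxxabi.h>
#include <cstdio>
#include <cstdlib>

int main() {
    // The undefined symbol from the ImportError above.
    const char* mangled = "_Z18ep_moe_pre_reorderN2at6TensorES0_S0_S0_S0_lllb";
    int status = 0;
    char* name = abi::__cxa_demangle(mangled, nullptr, nullptr, &status);
    if (status == 0) {
        // Prints: ep_moe_pre_reorder(at::Tensor, at::Tensor, at::Tensor,
        //         at::Tensor, at::Tensor, long, long, long, bool)
        std::printf("%s\n", name);
    }
    std::free(name);
    return 0;
}
```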

btw, why isn't CMAKE_BUILD_PARALLEL_LEVEL used for the make build? It takes a very long time.

Edenzzzz avatar Jun 02 '25 20:06 Edenzzzz

btw, why isn't CMAKE_BUILD_PARALLEL_LEVEL used for the make build? It takes a very long time.

You need Ninja; plain make doesn't pick up CMAKE_BUILD_PARALLEL_LEVEL.

FlamingoPg avatar Jun 04 '25 05:06 FlamingoPg

Please reply to the Gemini review.

FlamingoPg avatar Jun 04 '25 06:06 FlamingoPg

Hi @yinfan98, I noticed your request to reply with a review. I can provide a code review for the pull request, but I need to be explicitly invoked to do so using the command /gemini review in a new issue comment. Please use that command if you'd like me to perform a review.

gemini-code-assist[bot] avatar Jun 04 '25 06:06 gemini-code-assist[bot]

I triggered the CI.

FlamingoPg avatar Jun 04 '25 06:06 FlamingoPg

@yinfan98 Any ideas about the undefined symbol?

Edenzzzz avatar Jun 04 '25 20:06 Edenzzzz

@yinfan98 Any ideas about the undefined symbol?

@Edenzzzz Is this problem solved? Can you check which kernel triggered this bug?

Fridge003 avatar Jun 06 '25 18:06 Fridge003

@yinfan98 Any ideas about the undefined symbol?

@Edenzzzz Is this problem solved? Can you check which kernel triggered this bug?

It seems different every time I recompile: [screenshot]

Edenzzzz avatar Jun 06 '25 19:06 Edenzzzz

@yinfan98 Any ideas about the undefined symbol?

@Edenzzzz Is this problem solved? Can you check which kernel triggered this bug?

It seems different every time I recompile: [screenshot]

Have you tried pulling the latest sglang main branch and compiling again?

Fridge003 avatar Jun 06 '25 19:06 Fridge003

@Fridge003 @yinfan98 I believe this is ready to merge. Benchmarked with:

python3 -m sglang.launch_server --model meta-llama/Llama-3.1-8B-Instruct --disable-cuda-graph
python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-input 4000 --random-output 200 --request-rate 8 --num-prompt 480

Before PR

[benchmark results screenshot]

After PR

[benchmark results screenshot]

Edenzzzz avatar Jun 07 '25 17:06 Edenzzzz

Can you please paste some profiling results comparing PDL enabled vs. disabled?

Fridge003 avatar Jun 07 '25 17:06 Fridge003

It's above, under "Before PR".

Edenzzzz avatar Jun 07 '25 18:06 Edenzzzz

It's above, under "Before PR".

I mean a visualization of the kernel timeline.

Fridge003 avatar Jun 07 '25 18:06 Fridge003

The kernel runtime varies a lot between calls, but with PDL there's no inter-kernel gap; I guess that's because the next kernel can launch as soon as some blocks in the previous kernel start exiting.

With PDL

[kernel timeline screenshot]

Without PDL

[kernel timeline screenshot]

Edenzzzz avatar Jun 07 '25 22:06 Edenzzzz

The kernel runtime varies a lot between calls, but with PDL there's no inter-kernel gap; I guess that's because the next kernel can launch as soon as some blocks in the previous kernel start exiting.

With PDL

[kernel timeline screenshot]

Without PDL

[kernel timeline screenshot]

Just curious, is there any possibility of overlapping it with prior kernels?

Fridge003 avatar Jun 08 '25 00:06 Fridge003

I think they are already overlapped, at least for the local-variable initialization part.

Edenzzzz avatar Jun 08 '25 01:06 Edenzzzz
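
For context on why the initialization part can overlap: with PDL, the upstream kernel can signal completion of its dependent writes early, and everything the downstream kernel executes before its grid-dependency sync runs concurrently with the tail of the upstream grid. A minimal sketch with hypothetical kernels (not the actual sgl-kernel code):

```cpp
#include <cuda_runtime.h>

// Upstream kernel (e.g. a norm): finishes its global-memory writes, then
// signals so the programmatically launched successor may pass its sync point.
__global__ void norm_kernel(float* buf, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] *= 0.5f;  // stand-in for the real normalization

    // Tell the runtime this grid's dependent work is complete; anything
    // after this point would overlap with the successor kernel.
    cudaTriggerProgrammaticLaunchCompletion();
}

// Downstream kernel (e.g. an activation): the prologue before the sync
// overlaps with the upstream grid; only the load of `buf` must wait.
__global__ void act_kernel(const float* buf, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // overlaps upstream
    cudaGridDependencySynchronize();                // wait for norm writes
    if (i < n) out[i] = fmaxf(buf[i], 0.0f);        // stand-in activation
}
```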

Can you please post some accuracy tests for the kernels you improved?

Fridge003 avatar Jun 08 '25 01:06 Fridge003

@Fridge003 FlashInfer mistakenly cast pos_ids.data_ptr() to int32_t, which caused the RoPE precision error. The tests now pass locally.

Edenzzzz avatar Jun 08 '25 02:06 Edenzzzz
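
For illustration, the class of bug described above looks roughly like this (a hypothetical snippet, not the actual FlashInfer code): reading an int64 position-ids tensor through an int32_t pointer silently reinterprets each 8-byte id as two 4-byte values.

```cpp
#include <ATen/ATen.h>
#include <cstdint>

void rope_positions_example(const at::Tensor& pos_ids) {
    // pos_ids is an int64 (torch.long) tensor in the failing test.

    // Buggy: treats each 64-bit position id as two 32-bit halves, so the
    // kernel alternates between reading the low word and the high word
    // (usually 0), corrupting the rotary embedding positions.
    const int32_t* bad = static_cast<const int32_t*>(pos_ids.data_ptr());

    // Correct: the pointer type must match the tensor dtype
    // (the templated accessor also checks the dtype at runtime).
    const int64_t* good = pos_ids.data_ptr<int64_t>();

    (void)bad;
    (void)good;
}
```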

Hi @yinfan98, could you merge this if it looks good? Thanks.

Edenzzzz avatar Jun 29 '25 05:06 Edenzzzz

Hi @zhyncs, could you please merge this?

Edenzzzz avatar Jul 12 '25 17:07 Edenzzzz

@Fridge003 Could you trigger the CI again?

Edenzzzz avatar Jul 27 '25 17:07 Edenzzzz

/gemini review

Edenzzzz avatar Jul 31 '25 15:07 Edenzzzz

@Fridge003 Have you seen the DeepEP failure elsewhere? It looks like a kernel launch failure, but I didn't change any launch parameters.

Edenzzzz avatar Nov 04 '25 21:11 Edenzzzz

/tag-and-rerun-ci again

b8zhong avatar Nov 07 '25 19:11 b8zhong

To my understanding, this PR is still needed because we currently use the C++ interface for activations from FlashInfer, right? I found that for other FlashInfer kernels, PDL is determined automatically.

Yes, this basically migrates the launch logic from FlashInfer while supporting ROCm in the other path.

Edenzzzz avatar Nov 08 '25 02:11 Edenzzzz
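
To make that last point concrete, the migrated launch logic presumably ends up shaped something like this (a hedged sketch with hypothetical names; the real sgl-kernel dispatch differs): PDL is used only on the CUDA path when the device supports it, while the ROCm path keeps a plain launch.

```cpp
#include <cuda_runtime.h>

// Hypothetical host-side dispatch: PDL on supported CUDA devices,
// plain launch otherwise (including the ROCm build).
template <typename... Args>
void launch_activation(void (*kernel)(Args...), dim3 grid, dim3 block,
                       cudaStream_t stream, Args... args) {
#ifndef USE_ROCM
    int device = 0, major = 0;
    cudaGetDevice(&device);
    cudaDeviceGetAttribute(&major, cudaDevAttrComputeCapabilityMajor, device);
    if (major >= 9) {  // Hopper and newer support PDL
        cudaLaunchConfig_t cfg = {};
        cfg.gridDim  = grid;
        cfg.blockDim = block;
        cfg.stream   = stream;
        cudaLaunchAttribute attr = {};
        attr.id = cudaLaunchAttributeProgrammaticStreamSerialization;
        attr.val.programmaticStreamSerializationAllowed = 1;
        cfg.attrs    = &attr;
        cfg.numAttrs = 1;
        cudaLaunchKernelEx(&cfg, kernel, args...);
        return;
    }
#endif
    // Fallback: ordinary launch (ROCm via hipify, or pre-Hopper CUDA).
    kernel<<<grid, block, 0, stream>>>(args...);
}
```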