Fix allgather ops inside CUDA graphs
Fixes #3424
TL;DR: During CUDA graph capture, sglang routes collectives through pynccl or its custom allreduce instead of the PyTorch ProcessGroup. The allgather calls in the sglang code base, however, invoke PyTorch's allgather directly rather than going through the sglang abstraction that selects the right backend. Allgather is used in DP attention and to gather logits across the TP dimension. The fix is to perform the allgather via the abstraction so that the same NCCL communicator is never used both inside and outside graph capture (see the sketch under Modifications).
Motivation
Modifications
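Not the literal diff, but a minimal sketch of the idea. It assumes the vLLM-derived distributed layer in `sglang.srt.distributed`, where `get_tp_group()` returns a group coordinator whose `all_gather()` dispatches to pynccl or the custom op while a CUDA graph is being captured; the helper name `gather_logits_across_tp` is made up for illustration and names/signatures may differ in the actual code.

```python
import torch

from sglang.srt.distributed import (
    get_tensor_model_parallel_world_size,
    get_tp_group,
)


def gather_logits_across_tp(local_logits: torch.Tensor) -> torch.Tensor:
    """Gather vocab-parallel logits across the TP dimension (illustrative helper)."""
    tp_size = get_tensor_model_parallel_world_size()
    if tp_size == 1:
        return local_logits

    # Before (problematic): calling torch.distributed directly always uses the
    # ProcessGroup's NCCL communicator, which must not also be used while a
    # CUDA graph is being captured.
    #
    #   chunks = [torch.empty_like(local_logits) for _ in range(tp_size)]
    #   torch.distributed.all_gather(chunks, local_logits)
    #   return torch.cat(chunks, dim=-1)

    # After: go through the sglang group abstraction, which picks pynccl
    # (or a custom op) during graph capture and falls back to the
    # ProcessGroup otherwise.
    return get_tp_group().all_gather(local_logits, dim=-1)
```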
Checklist
- [ ] Format your code according to the Code Formatting with Pre-Commit.
- [ ] Add unit tests as outlined in the Running Unit Tests.
- [ ] Update documentation / docstrings / example tutorials as needed, according to Writing Documentation.
- [ ] Provide throughput / latency benchmark results and accuracy evaluation results as needed, according to Benchmark and Profiling and Accuracy Results.
- [ ] For reviewers: If you haven't made any contributions to this PR and are only assisting with merging the main branch, please remove yourself as a co-author when merging the PR.
- [ ] Please feel free to join our Slack channel at https://slack.sglang.ai to discuss your PR.