Fix allgather ops inside CUDA graphs
Fixes #3424
TL;DR: During CUDA graph capture, sglang routes collectives through pynccl or its custom allreduce instead of the PyTorch ProcessGroup. The allgather calls in the sglang code base, however, invoke PyTorch's allgather directly rather than going through the sglang abstraction that selects the right backend. Allgather is used in DP attention and to gather logits across the TP dimension. The fix is to perform the allgather via the abstraction so that the same NCCL communicator is never used both inside and outside graph capture (see the sketch under Modifications).
Motivation
Modifications
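Not the literal diff, but a minimal sketch of the idea. It assumes the vLLM-derived distributed layer in `sglang.srt.distributed`, where `get_tp_group()` returns a group coordinator whose `all_gather()` dispatches to pynccl or the custom op while a CUDA graph is being captured; the helper name `gather_logits_across_tp` is made up for illustration and names/signatures may differ in the actual code.

```python
import torch

from sglang.srt.distributed import (
    get_tensor_model_parallel_world_size,
    get_tp_group,
)


def gather_logits_across_tp(local_logits: torch.Tensor) -> torch.Tensor:
    """Gather vocab-parallel logits across the TP dimension (illustrative helper)."""
    tp_size = get_tensor_model_parallel_world_size()
    if tp_size == 1:
        return local_logits

    # Before (problematic): calling torch.distributed directly always uses the
    # ProcessGroup's NCCL communicator, which must not also be used while a
    # CUDA graph is being captured.
    #
    #   chunks = [torch.empty_like(local_logits) for _ in range(tp_size)]
    #   torch.distributed.all_gather(chunks, local_logits)
    #   return torch.cat(chunks, dim=-1)

    # After: go through the sglang group abstraction, which picks pynccl
    # (or a custom op) during graph capture and falls back to the
    # ProcessGroup otherwise.
    return get_tp_group().all_gather(local_logits, dim=-1)
```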
Checklist
- [ ] Format your code according to the Code Formatting with Pre-Commit.
- [ ] Add unit tests as outlined in the Running Unit Tests.
- [ ] Update documentation / docstrings / example tutorials as needed, according to Writing Documentation.
- [ ] Provide throughput / latency benchmark results and accuracy evaluation results as needed, according to Benchmark and Profiling and Accuracy Results.
- [ ] For reviewers: If you haven't made any contributions to this PR and are only assisting with merging the main branch, please remove yourself as a co-author when merging the PR.
- [ ] Please feel free to join our Slack channel at https://slack.sglang.ai to discuss your PR.