[bugfix] fix prefill TBO crashes when attn_tp_size > 1
## Motivation
Issues #12757 and #10863 indicate that the current TBO implementation still has several bugs. Although PRs such as #11423 and #13082 address part of the problem, some configurations remain unfixed.
For example, a Qwen3-235B configuration with attn_tp_size = 8 continues to fail after PR #11423:
```bash
python -m sglang.launch_server \
  --model-path /path/to/Qwen3-235B-A22B-FP8 \
  --tp-size 8 --context-length 32768 \
  --host 0.0.0.0 --port 8000 --random-seed 42 \
  --max-prefill-tokens 20480 --mem-fraction-static 0.83 \
  --max-running-requests 128 --disable-radix-cache \
  --chunked-prefill-size -1 --ep-size 8 \
  --moe-a2a-backend deepep --deepep-mode auto \
  --attention-backend fa3 --enable-two-batch-overlap
```
## Modifications
This PR builds on PR #11423 and additionally fixes:
- A prefill TBO failure when `--deepep-mode=auto` is used.
- Crashes occurring when `attn_tp_size > 1` and the input length is not divisible by `attn_tp_size` (see the sketch below).
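For intuition, here is a minimal sketch of attn_tp-aligned splitting. It is not the actual SGLang implementation; the helper name `split_tokens_aligned` and the exact rounding policy are assumptions for illustration only.

```python
# Hypothetical sketch, not SGLang's actual code: split a prefill batch's
# token count into two TBO chunks whose boundary is a multiple of
# attn_tp_size, so every attention TP rank receives an equal slice.

def split_tokens_aligned(num_tokens: int, attn_tp_size: int) -> tuple[int, int]:
    half = num_tokens // 2
    # Round the split point down to the nearest attn_tp_size multiple.
    first = (half // attn_tp_size) * attn_tp_size
    if first == 0 and num_tokens > 0:
        # Degenerate tiny batch: give chunk 0 at least one aligned slice.
        first = min(attn_tp_size, num_tokens)
    return first, num_tokens - first

assert split_tokens_aligned(20480, 8) == (10240, 10240)
assert split_tokens_aligned(1001, 8) == (496, 505)  # 496 % 8 == 0
```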
## Accuracy Tests
## Benchmarking and Profiling
## Checklist
- [ ] Format your code according to the Format code with pre-commit.
- [ ] Add unit tests according to the Run and add unit tests.
- [ ] Update documentation according to Write documentations.
- [ ] Provide accuracy and speed benchmark results according to Test the accuracy and Benchmark the speed.
- [ ] Follow the SGLang code style guidance.
- [ ] Work with maintainers to merge your PR. See the PR Merge Process.
## Summary of Changes
Hello @yuhyao, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request provides crucial bug fixes for the Two-Batch Overlap (TBO) prefill mechanism, specifically addressing crashes that occur when attention tensor parallelism (attn_tp_size) is greater than one. The changes involve a strategic refactoring of TBO batch preparation, a more robust approach to splitting token sequences that respects tensor parallelism boundaries, and a streamlined process for managing data parallelism buffers. These modifications are designed to enhance the stability and correctness of the TBO implementation in distributed environments, resolving previously identified issues.
## Highlights
- **Refactored TBO Preparation**: The logic for preparing Two-Batch Overlap (TBO) batches has been refactored, moving the `TboForwardBatchPreparer.prepare` call to a more appropriate stage within the batch processing lifecycle.
- **Attention Tensor Parallelism Alignment**: Token splitting mechanisms for TBO have been updated to ensure that split points are aligned with `attn_tp_size` (the attention tensor parallelism size), preventing crashes in distributed configurations.
- **Simplified Two-Chunk Split Condition**: The condition for enabling two-chunk splitting in TBO has been simplified to always return true, removing the previous token-distribution threshold checks.
- **Consistent Data Parallelism Buffer Management**: The `set_dp_buffer_len` function is now unconditionally called, ensuring consistent management of data parallelism buffer lengths.
Fixed a bug by changing the padding logic from "pad first, then split aligned to attn_tp_size" to "split first, then pad each TBO child."
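To make the distinction concrete, here is a hedged sketch of the two strategies. The helper names are hypothetical, and the real code operates on forward batches rather than bare token counts.

```python
# Hypothetical illustration of the padding-order change, not SGLang code.

def pad_to_multiple(n: int, m: int) -> int:
    """Round n up to the nearest multiple of m."""
    return (n + m - 1) // m * m


def pad_then_split(num_tokens: int, attn_tp_size: int) -> tuple[int, int]:
    # Old approach: pad the full batch, then cut it in half. The halves
    # are not guaranteed to be multiples of attn_tp_size.
    padded = pad_to_multiple(num_tokens, attn_tp_size)
    half = padded // 2
    return half, padded - half


def split_then_pad(num_tokens: int, attn_tp_size: int) -> tuple[int, int]:
    # New approach: split first, then pad each TBO child independently,
    # so both children are divisible by attn_tp_size.
    half = num_tokens // 2
    return (
        pad_to_multiple(half, attn_tp_size),
        pad_to_multiple(num_tokens - half, attn_tp_size),
    )


# With 1012 tokens and attn_tp_size=8, pad-then-split yields (508, 508),
# and 508 % 8 != 0, so each child is misaligned across attention TP
# ranks; split-then-pad yields (512, 512), both divisible by 8.
print(pad_then_split(1012, 8))   # (508, 508) -> misaligned children
print(split_then_pad(1012, 8))   # (512, 512) -> both aligned
```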
Could you add TBO to the DeepEP tests (Test 10-19 and Test 40-49) in https://github.com/sgl-project/sglang/blob/main/test/manual/ep/test_hybrid_dp_ep_tp_mtp.py to verify your fix?
/tag-and-rerun-ci
@yuhyao Thank you for your excellent contribution. We do not plan to support TBO for allreduce / allgather-based dispatching.