
[bugfix] fix prefill TBO crashes when attn_tp_size > 1

Open • yuhyao opened this pull request 1 month ago • 1 comment

Motivation

Issues #12757 and #10863 indicate that the current TBO implementation still has several bugs. Although PRs such as #11423 and #13082 address part of the problem, some configurations remain unfixed.

For example, a Qwen3-235B configuration with attn_tp_size = 8 continues to fail even after PR #11423:

python -m sglang.launch_server --model-path /path/to/Qwen3-235B-A22B-FP8 --tp-size 8 --context-length 32768 --host 0.0.0.0 --port 8000 --random-seed 42 --max-prefill-tokens 20480 --mem-fraction-static 0.83 --max-running-requests 128 --disable-radix-cache --chunked-prefill-size -1 --ep-size 8 --moe-a2a-backend deepep --deepep-mode auto --attention-backend fa3 --enable-two-batch-overlap

Modifications

This PR builds on PR #11423 and additionally fixes:

  • A prefill TBO failure when --deepep-mode=auto is used.
  • Crashes occurring when attn_tp_size > 1 and the input length is not divisible by attn_tp_size (see the sketch below this list).
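
To make the divisibility failure concrete, here is a minimal sketch of an attn_tp_size-aligned split; the function name and the "split near the middle, rounded down to an alignment boundary" policy are assumptions for illustration, not sglang's actual code.

```python
# Hypothetical sketch, not sglang internals: pick a TBO split point that
# is a multiple of attn_tp_size, assuming num_tokens >= attn_tp_size.

def aligned_tbo_split(num_tokens: int, attn_tp_size: int) -> int:
    half = num_tokens // 2
    # Round the midpoint down to an attn_tp_size boundary so the first
    # chunk shards evenly across attention TP ranks.
    aligned = (half // attn_tp_size) * attn_tp_size
    # Keep the first chunk non-empty.
    return max(aligned, attn_tp_size)

# e.g. 10 tokens with attn_tp_size = 4 split at 4, leaving chunks of
# lengths 4 and 6; the second chunk still needs per-child padding, which
# is what the fix described in this PR adds.
```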

Accuracy Tests

Benchmarking and Profiling

Checklist

yuhyao · Nov 21 '25

Summary of Changes

Hello @yuhyao, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request provides crucial bug fixes for the Two-Batch Overlap (TBO) prefill mechanism, specifically addressing crashes that occur when attention tensor parallelism (attn_tp_size) is greater than one. The changes involve a strategic refactoring of TBO batch preparation, a more robust approach to splitting token sequences that respects tensor parallelism boundaries, and a streamlined process for managing data parallelism buffers. These modifications are designed to enhance the stability and correctness of the TBO implementation in distributed environments, resolving previously identified issues.

Highlights

  • Refactored TBO Preparation: The logic for preparing Two-Batch Overlap (TBO) batches has been refactored, moving the TboForwardBatchPreparer.prepare call to a more appropriate stage within the batch processing lifecycle.
  • Attention Tensor Parallelism Alignment: Token splitting mechanisms for TBO have been updated to ensure that the split points are aligned with the attn_tp_size (attention tensor parallelism size), preventing crashes in distributed configurations.
  • Simplified Two-Chunk Split Condition: The condition for enabling two-chunk splitting in TBO has been simplified to always return true, removing previous token distribution threshold checks.
  • Consistent Data Parallelism Buffer Management: The set_dp_buffer_len function is now unconditionally called, ensuring consistent management of data parallelism buffer lengths (see the sketch after this list).
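
A loose sketch of the last two highlights follows; every name and signature below is a placeholder invented for illustration, not sglang's real API.

```python
# Placeholder names throughout; this only mirrors the two highlights above.

def should_use_two_chunk_split(num_tokens: int) -> bool:
    # Previously gated on token-distribution thresholds; per the highlight,
    # the condition now always enables the two-chunk split.
    return True

def set_dp_buffer_len(length: int) -> None:
    # Stub standing in for the DP buffer-length bookkeeping mentioned above.
    print(f"dp_buffer_len set to {length}")

def prepare_batch(num_tokens: int) -> None:
    # The buffer length is now set unconditionally on every prepare path,
    # keeping data-parallel ranks consistent regardless of batch shape.
    set_dp_buffer_len(num_tokens)
    if should_use_two_chunk_split(num_tokens):
        pass  # proceed with the two-chunk TBO split
```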

gemini-code-assist[bot] · Nov 21 '25

Fixed a bug by changing the padding logic from "pad first, then split aligned to attn_tp_size" to "split first, then pad each TBO child."
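
One way to picture that change, as a hypothetical before/after sketch (the helper, padding id, and split policy are invented for illustration, not sglang code):

```python
PAD_ID = 0  # stand-in padding token id for this sketch

def pad_to_multiple(tokens: list[int], multiple: int) -> list[int]:
    pad = (-len(tokens)) % multiple
    return tokens + [PAD_ID] * pad

# Old order per the comment above: pad the whole input first, then split
# at an attn_tp_size-aligned point.
def pad_then_split(tokens, attn_tp_size, split_at):
    padded = pad_to_multiple(tokens, attn_tp_size)
    return padded[:split_at], padded[split_at:]

# New order: split the raw input first, then pad each TBO child
# independently, so each child's length is a multiple of attn_tp_size
# by construction.
def split_then_pad(tokens, attn_tp_size, split_at):
    left, right = tokens[:split_at], tokens[split_at:]
    return pad_to_multiple(left, attn_tp_size), pad_to_multiple(right, attn_tp_size)
```

For example, with 10 tokens, attn_tp_size = 4, and a split at 5, split_then_pad pads each child from length 5 to length 8, so both children divide evenly across attention TP ranks.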

yuhyao · Nov 24 '25

Could you add TBO to the deepep tests (Tests 10-19 and 40-49) in https://github.com/sgl-project/sglang/blob/main/test/manual/ep/test_hybrid_dp_ep_tp_mtp.py to verify your fix?

ch-wan · Nov 25 '25

/tag-and-rerun-ci

ch-wan · Nov 30 '25

@yuhyao Thank you for your excellent contribution. We do not plan to support TBO for allreduce / allgather-based dispatching.

ch-wan · Nov 30 '25