[Fix]: fix assert errors in high-concurrency scenarios during PD.

Open zeroorhero opened this issue 7 months ago • 0 comments

Motivation

Modifications

Checklist

[x] Format your code according to the Code Formatting with Pre-Commit.
[ ] Add unit tests as outlined in the Running Unit Tests.
[ ] Update documentation / docstrings / example tutorials as needed, according to Writing Documentation.
[ ] Provide throughput / latency benchmark results and accuracy evaluation results as needed, according to Benchmark and Profiling and Accuracy Results.
[ ] For reviewers: If you haven't made any contributions to this PR and are only assisting with merging the main branch, please remove yourself as a co-author when merging the PR.
[ ] Please feel free to join our Slack channel at https://slack.sglang.ai to discuss your PR.

BUG Reproduction

In high-concurrency PD disaggregation scenarios, the decode node may hit assertion errors, as reported in https://github.com/sgl-project/sglang/issues/6133.

Root Cause Analysis

The logs show the message "Decode out of memory happened". The call stack is as follows:

scheduler.py::update_running_batch()
       # Selects requests to evict KV cache based on current VRAM
    -> schedule_batch.py::retract_decode()
           # Clears metadata of selected requests
        -> schedule_batch.py::reset_for_retract()
               # Re-adds requests to disagg_decode_prealloc_queue
            -> scheduler.py::_extend_requests_to_queue()

After a request is retracted and its KV cache evicted, the system must rebuild that KV cache through a new prefill. However, schedule_batch.py::reset_for_retract() does not fully clear the request's metadata, so the stale state triggers assertion failures during the subsequent KV cache transfer.
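
The failure mode can be illustrated with a minimal, self-contained sketch. This is not SGLang's code: `ToyRequest`, `kv_slot`, `transfer_done`, and `start_kv_transfer` are hypothetical names chosen for illustration, and the description above does not say which fields the real `reset_for_retract()` leaves uncleared. The sketch only shows how a reset that misses one piece of per-request state can trip an invariant check when the request is queued again.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyRequest:
    """Hypothetical stand-in for a decode request (not SGLang's Req class)."""

    rid: str
    kv_slot: Optional[int] = None  # hypothetical: where the KV cache lives
    transfer_done: bool = False    # hypothetical: per-transfer bookkeeping

    def reset_for_retract_incomplete(self) -> None:
        # Releases the KV slot but forgets the transfer bookkeeping.
        self.kv_slot = None

    def reset_for_retract_complete(self) -> None:
        # Clears every field that the transfer path later checks.
        self.kv_slot = None
        self.transfer_done = False


def start_kv_transfer(req: ToyRequest, slot: int) -> None:
    """Toy transfer step: a re-queued request must look like a fresh one."""
    assert req.kv_slot is None, "stale KV slot"
    assert not req.transfer_done, "stale transfer state"
    req.kv_slot = slot
    req.transfer_done = True


if __name__ == "__main__":
    req = ToyRequest("r1")
    start_kv_transfer(req, slot=0)      # first transfer succeeds
    req.reset_for_retract_incomplete()  # retract under memory pressure
    try:
        start_kv_transfer(req, slot=1)  # re-queued request hits the assert
    except AssertionError as exc:
        print(f"assertion failed: {exc}")
```

With `reset_for_retract_complete()` in place of the incomplete variant, the second `start_kv_transfer` call passes, which is the behavior this fix aims for.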

May 12 '25 12:05 zeroorhero