[Fix]: Fix assertion errors in high-concurrency PD disaggregation scenarios.
Motivation
Modifications
Checklist
- [x] Format your code according to the Code Formatting with Pre-Commit.
- [ ] Add unit tests as outlined in the Running Unit Tests.
- [ ] Update documentation / docstrings / example tutorials as needed, according to Writing Documentation.
- [ ] Provide throughput / latency benchmark results and accuracy evaluation results as needed, according to Benchmark and Profiling and Accuracy Results.
- [ ] For reviewers: If you haven't made any contributions to this PR and are only assisting with merging the main branch, please remove yourself as a co-author when merging the PR.
- [ ] Please feel free to join our Slack channel at https://slack.sglang.ai to discuss your PR.
BUG Reproduction
In high-concurrency PD disaggregation scenarios, the decode node may encounter assertion errors, as reported in this issue: https://github.com/sgl-project/sglang/issues/6133.
Root Cause Analysis
Logs show that the decode node ran out of memory. The call stack is as follows:
scheduler.py::update_running_batch()
# Selects requests to evict KV cache based on current VRAM
-> schedule_batch.py::retract_decode()
# Clears metadata of selected requests
-> schedule_batch.py::reset_for_retract()
# Re-adds requests to disagg_decode_prealloc_queue
-> scheduler.py::_extend_requests_to_queue()
After a request is retracted and its KV cache evicted, the system must recompute that KV cache through a prefill operation. However, schedule_batch.py::reset_for_retract() did not fully clean up the request's metadata, which led to assertion failures during the subsequent KV cache transfer.
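
The failure mode can be illustrated with a small, self-contained sketch. It is not sglang code: `ToyRequest`, its fields (`kv_slot`, `transfer_index`), and the helper functions are hypothetical names chosen for illustration, and the real fields involved in the fix may differ. The sketch only shows how a reset that leaves one piece of per-request state behind can trip an assertion once the request is re-queued and goes through transfer setup a second time.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ToyRequest:
    """Hypothetical stand-in for a decode-side request (not sglang's Req)."""
    rid: str
    kv_slot: Optional[int] = None         # hypothetical: where the KV cache lives
    transfer_index: Optional[int] = None  # hypothetical: per-transfer bookkeeping

    def reset_incomplete(self) -> None:
        # Buggy variant: releases the KV slot but keeps stale transfer state.
        self.kv_slot = None

    def reset_complete(self) -> None:
        # Fixed variant: clears everything the next transfer expects to be unset.
        self.kv_slot = None
        self.transfer_index = None


def prealloc_for_transfer(req: ToyRequest, slot: int, index: int) -> None:
    # Transfer setup assumes a fresh request; stale metadata fails the assert.
    assert req.transfer_index is None, f"stale transfer metadata on {req.rid}"
    req.kv_slot = slot
    req.transfer_index = index


def retract_and_requeue(req: ToyRequest, queue: List[ToyRequest], complete: bool) -> None:
    # Mirrors the call chain above: reset the request, then re-add it to the queue.
    if complete:
        req.reset_complete()
    else:
        req.reset_incomplete()
    queue.append(req)


def survives_retraction(complete: bool) -> bool:
    """Return True if a retracted request can be set up for transfer again."""
    queue: List[ToyRequest] = []
    req = ToyRequest("req-0")
    prealloc_for_transfer(req, slot=0, index=0)   # first transfer
    retract_and_requeue(req, queue, complete)     # decode OOM forces retraction
    try:
        prealloc_for_transfer(queue.pop(), slot=1, index=1)  # second transfer
    except AssertionError:
        return False
    return True
```

In this toy model, `survives_retraction(True)` succeeds and `survives_retraction(False)` hits the assertion. The failure only appears once a request has been retracted, which in the scenario above happens when the decode node runs out of memory under high concurrency.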