
[1/2] support mla dp: custom alltoall


Motivation

This PR is part of the DP MLA work tracked in #5001.

About DP MLA: on an 8×H20 (96 GB) node, weight memory usage is 87.19 GB with `--dp-size 4 --enable-dp-attention`, leaving too little memory for the KV cache.

This optimization is similar to data-parallel attention, but it applies data parallelism to the MLA core only, instead of the entire attention module. Unlike data-parallel attention, it does not additionally increase the memory occupied by weights; it significantly reduces the KV cache size and enables larger batch sizes, at the cost of only 1–3 ms of additional decode latency.
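As a rough illustration of the data flow, here is a minimal single-process sketch. The shapes, the stand-in projection weight, and the helper names below are assumptions for illustration only, not sglang APIs; in the real layer the chunking step is an all_to_all across DP ranks.

```python
import torch

dp_size, num_tokens, hidden, kv_lora_rank = 4, 8, 64, 16

def mla_layer(x: torch.Tensor, dp_rank: int) -> torch.Tensor:
    # Tensor-parallel projection down to the compressed KV latent; these
    # weights stay shared exactly as before, so no extra weight memory.
    w_down = torch.randn(hidden, kv_lora_rank)  # stand-in projection weight
    latent = x @ w_down                         # [num_tokens, kv_lora_rank]

    # An all_to_all (simulated here by chunking) scatters tokens so that each
    # DP rank's MLA core, and its slice of the KV cache, sees only
    # num_tokens / dp_size tokens.
    shard = latent.chunk(dp_size, dim=0)[dp_rank]

    # Stand-in for the MLA core running data-parallel on the local shard.
    core_out = shard * 2.0

    # In the real layer a reverse all_to_all gathers the shards back before
    # the tensor-parallel output projection; here we return the local shard.
    return core_out

print(mla_layer(torch.randn(num_tokens, hidden), dp_rank=0).shape)  # [2, 16]
```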

On an 8×H20 (96 GB) node with data-parallel MLA enabled, we achieve up to 1.85× (dp=4) to 2.34× (dp=8) decoding throughput improvement over the previous version, and KV cache capacity increases by 3.3× (dp=4) and 6.6× (dp=8).

Modifications

Implement a custom all_to_all that supports dynamic input/output splitting (`split_sizes`), built on sglang's custom-allreduce IPC mechanism; it is currently limited to single-node use.
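For reference, a minimal sketch of the semantics the custom kernel implements, written here with `torch.distributed.all_to_all_single` (this assumes an already-initialized process group; the PR replaces this collective with an IPC-based path over sglang's custom-allreduce buffers rather than using NCCL):

```python
import torch
import torch.distributed as dist

def dynamic_all_to_all(x: torch.Tensor,
                       input_splits: list[int],
                       output_splits: list[int]) -> torch.Tensor:
    """Exchange variable-sized row chunks between ranks.

    x: [sum(input_splits), hidden]; the i-th contiguous chunk of
    input_splits[i] rows is sent to rank i. The result has
    sum(output_splits) rows, where chunk i was received from rank i.
    """
    out = x.new_empty(sum(output_splits), x.shape[1])
    dist.all_to_all_single(
        out, x,
        output_split_sizes=output_splits,
        input_split_sizes=input_splits,
    )
    return out
```

Since the split sizes change every decode step (each DP rank holds a different number of tokens), the exchange cannot be a fixed-shape collective; building it on the existing single-node allreduce IPC buffers keeps the per-call overhead low on the decode path.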

Checklist

  • [x] Format your code according to the Code Formatting with Pre-Commit.
  • [x] Add unit tests as outlined in the Running Unit Tests.
  • [ ] Update documentation / docstrings / example tutorials as needed, according to Writing Documentation.
  • [ ] Provide throughput / latency benchmark results and accuracy evaluation results as needed, according to Benchmark and Profiling and Accuracy Results.
  • [ ] For reviewers: If you haven't made any contributions to this PR and are only assisting with merging the main branch, please remove yourself as a co-author when merging the PR.
  • [ ] Please feel free to join our Slack channel at https://slack.sglang.ai to discuss your PR.

xu-yfei · Apr 02 '25