[V1] EP + DP Attention
Based on https://github.com/vllm-project/vllm/pull/13591
DP+EP is implemented via collective ops in the fused_moe layer's forward pass.
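For context, here is a rough sketch of the general DP + EP pattern this describes, not the PR's actual fused_moe code: each DP rank gathers tokens from the other DP ranks, runs only the experts it owns locally, and all-reduces the partial outputs before taking back its own slice. The function name, top-1 routing, and the assumption that every rank contributes the same number of tokens are simplifications for illustration.

```python
# Illustrative sketch only -- not the fused_moe code from this PR.
# Assumes torch.distributed is already initialized with one process per GPU,
# experts sharded across ranks (EP), and each rank holding its own batch (DP).
import torch
import torch.distributed as dist


def naive_dp_ep_moe_forward(hidden_states: torch.Tensor,
                            local_experts: torch.nn.ModuleList,
                            router: torch.nn.Module) -> torch.Tensor:
    world_size = dist.get_world_size()
    rank = dist.get_rank()

    # 1. Gather this DP rank's tokens together with every other DP rank's
    #    tokens, so each rank sees all tokens that may route to its experts.
    gathered = [torch.empty_like(hidden_states) for _ in range(world_size)]
    dist.all_gather(gathered, hidden_states)
    all_tokens = torch.cat(gathered, dim=0)

    # 2. Route all tokens (top-1 here for simplicity); each rank only runs
    #    the experts it owns and leaves zeros for remote experts.
    expert_ids = router(all_tokens).argmax(dim=-1)
    out = torch.zeros_like(all_tokens)
    num_local = len(local_experts)
    for local_idx, expert in enumerate(local_experts):
        global_idx = rank * num_local + local_idx
        mask = expert_ids == global_idx
        if mask.any():
            out[mask] = expert(all_tokens[mask])

    # 3. Sum the partial expert outputs across ranks, then keep only the
    #    slice corresponding to this DP rank's original batch.
    dist.all_reduce(out)
    num_tokens = hidden_states.shape[0]
    return out[rank * num_tokens:(rank + 1) * num_tokens]
```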
This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @tlrmchlsmth.
https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork
GSM8k results with this PR (plus @njhill's #13923 merged in) on neuralmagic/Mixtral-8x7B-Instruct-v0.1-FP8:
lm_eval --model local-completions --tasks gsm8k --model_args model=neuralmagic/Mixtral-8x7B-Instruct-v0.1-FP8,base_url=http://127.0.0.1:8192/v1/completions,num_concurrent=5,max_retries=3,tokenized_requests=False --limit 100
This PR:
VLLM_USE_V1=1 VLLM_TEST_ENABLE_EP=1 vllm serve neuralmagic/Mixtral-8x7B-Instruct-v0.1-FP8 --tensor_parallel_size=2 --data_parallel_size=2 --port 8192 --enforce-eager
|Tasks|Version| Filter |n-shot| Metric | |Value| |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k| 3|flexible-extract| 5|exact_match|↑ | 0.61|± | 0.049|
| | |strict-match | 5|exact_match|↑ | 0.61|± | 0.049|
Main:
vllm serve neuralmagic/Mixtral-8x7B-Instruct-v0.1-FP8 --tensor_parallel_size=2 --port 8192
|Tasks|Version| Filter |n-shot| Metric | |Value| |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k| 3|flexible-extract| 5|exact_match|↑ | 0.62|± |0.0488|
| | |strict-match | 5|exact_match|↑ | 0.61|± |0.0490|
Must disable CUDA graphs by default when using DP+EP before landing, as it will deadlock otherwise
let's do it in https://github.com/vllm-project/vllm/blob/989f4f430cd74a14d539d8b59b9d239301f1bdcd/vllm/platforms/cuda.py#L111 ?
Nice, TIL about check_and_update_config, thanks!
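For readers following along, here is a rough sketch of what the suggestion above could look like inside the existing check_and_update_config hook in vllm/platforms/cuda.py. The specific config attribute names used here (data_parallel_size, enable_expert_parallel, enforce_eager) are assumptions for illustration, not the final diff.

```python
# Sketch of forcing eager mode when DP + EP is enabled; not the actual change.
import logging

logger = logging.getLogger(__name__)


class CudaPlatformSketch:
    """Stand-in for the platform class in vllm/platforms/cuda.py."""

    @classmethod
    def check_and_update_config(cls, vllm_config) -> None:
        parallel_config = vllm_config.parallel_config
        model_config = vllm_config.model_config

        # DP + EP currently deadlocks under CUDA graph capture, so force
        # eager mode until CUDA graph support for DP + EP lands.
        if (parallel_config.data_parallel_size > 1
                and getattr(parallel_config, "enable_expert_parallel", False)
                and model_config is not None
                and not model_config.enforce_eager):
            logger.warning("DP + EP does not support CUDA graphs yet; "
                           "forcing enforce_eager=True.")
            model_config.enforce_eager = True
```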
Will we follow DeepSeek's setup here and use only EP & DP?
- Prefilling Phase [Routed Expert EP32, MLA/Shared Expert DP32]: Each deployment unit spans 4 nodes with 32 redundant routed experts, where each GPU handles 9 routed experts and 1 shared expert.
- Decoding Phase [Routed Expert EP144, MLA/Shared Expert DP144]: Each deployment unit spans 18 nodes with 32 redundant routed experts, where each GPU manages 2 routed experts and 1 shared expert.
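As a quick sanity check of the per-GPU numbers in that quote (assuming DeepSeek-V3's 256 routed experts plus the 32 redundant experts, and 8 GPUs per node):

```python
# Sanity check of the per-GPU expert counts quoted above, assuming
# DeepSeek-V3's 256 routed experts, 32 redundant experts, 8 GPUs per node.
ROUTED_EXPERTS = 256
REDUNDANT_EXPERTS = 32
GPUS_PER_NODE = 8

for phase, nodes in (("prefill", 4), ("decode", 18)):
    gpus = nodes * GPUS_PER_NODE                        # EP32 / EP144
    per_gpu = (ROUTED_EXPERTS + REDUNDANT_EXPERTS) // gpus
    print(f"{phase}: {gpus} GPUs, {per_gpu} routed experts "
          f"+ 1 shared expert per GPU")
# prefill: 32 GPUs, 9 routed experts + 1 shared expert per GPU
# decode: 144 GPUs, 2 routed experts + 1 shared expert per GPU
```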
can we use "EP/TP MoE + DP Attention" on V0 ?
No, DP is only added in V1.
Got it ~
Can I assume that most new features and optimizations will only be available in V1 and will not be backported to V0?
most new features and optimizations will only be available in V1 and will not be backported to V0
Yes.
Does it support multi-node deployment, or can it only be deployed on a single machine?
@tlrmchlsmth
@v-lmn Yes, it supports multi-node.
As with all things, there are some caveats:
- I haven't tested multi-node myself
- The server being added in https://github.com/vllm-project/vllm/pull/13923 does not support multi-node yet, but will in a subsequent PR
- The collective ops are suboptimal, especially in the multi-node case; we haven't integrated DeepEP yet
@v-lmn @tlrmchlsmth I have tested multi-node and it works. BTW, do we plan to support CUDA graphs? I notice that SGLang supports CUDA graphs with attention DP.
@ZeldaHuang Nice, thanks for testing and letting me know! Yes, we'll support CUDA Graphs in a future PR.
It looks like vLLM's EP hasn't yet implemented overlap of communication (dispatch + combine) with computation? This seems very important for MoE models like DeepSeek. Does vLLM have plans for this work?
Yes, this is in progress!
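For anyone curious what dispatch/combine-compute overlap can look like in principle, here is a generic two-stream micro-batching sketch; it is not vLLM's planned design, and dispatch, expert_compute, and combine are placeholder callables standing in for the real all-to-all and expert kernels.

```python
# Generic illustration of comm/compute overlap via micro-batching and a
# dedicated CUDA stream; not vLLM's planned implementation.
import torch


def overlapped_moe(micro_batches, dispatch, expert_compute, combine):
    if not micro_batches:
        return []
    comm_stream = torch.cuda.Stream()
    dispatched = [None] * len(micro_batches)
    outputs = []

    # Kick off the first micro-batch's dispatch on the communication stream.
    with torch.cuda.stream(comm_stream):
        dispatched[0] = dispatch(micro_batches[0])

    for i in range(len(micro_batches)):
        # Compute stream waits only for work already enqueued on the
        # communication stream, i.e. micro-batch i's dispatch.
        torch.cuda.current_stream().wait_stream(comm_stream)

        # Start the next micro-batch's dispatch so it overlaps with compute.
        if i + 1 < len(micro_batches):
            with torch.cuda.stream(comm_stream):
                dispatched[i + 1] = dispatch(micro_batches[i + 1])

        hidden = expert_compute(dispatched[i])  # overlaps with dispatch i+1
        outputs.append(combine(hidden))
    return outputs
```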
Does the current version support sharding the experts (EP) across the full world size?
How can I use DP for the attention module and EP for the expert module? I have 2 nodes and 16 GPUs, and I can't find how to run this in the documentation. @youkaichao @tlrmchlsmth
Hi @tlrmchlsmth, could you provide any insights on what needs to be addressed to support CUDA Graphs? I've observed that at this location, CUDA Graphs are disabled when DP (Data Parallelism) is enabled.