
[V1] EP + DP Attention

Open tlrmchlsmth opened this pull request 9 months ago • 4 comments

Based on https://github.com/vllm-project/vllm/pull/13591

DP+EP implemented via collective ops in the fused_moe layer's forward pass.
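For intuition, here is a minimal sketch (not this PR's actual code) of how DP+EP can be expressed with collective ops in a fused-MoE forward pass: each DP rank all-gathers its tokens, applies only the experts it owns, and reduce-scatters the partial outputs back. The helper `apply_local_experts`, the `dp_group` handle, and the equal-token-count assumption are all hypothetical.

```python
# Minimal DP+EP sketch via collective ops (illustrative only).
# Assumes every DP rank holds the same number of tokens (real code would pad).
import torch
import torch.distributed as dist

def moe_forward_dp_ep(hidden: torch.Tensor,        # [num_tokens, hidden_size]
                      topk_ids: torch.Tensor,      # [num_tokens, top_k]
                      topk_weights: torch.Tensor,  # [num_tokens, top_k]
                      apply_local_experts,         # runs only this rank's experts
                      dp_group) -> torch.Tensor:
    world = dist.get_world_size(dp_group)

    # 1. Gather every rank's tokens and routing info so this rank can serve
    #    all tokens routed to its local experts.
    gathered_hidden = [torch.empty_like(hidden) for _ in range(world)]
    gathered_ids = [torch.empty_like(topk_ids) for _ in range(world)]
    gathered_weights = [torch.empty_like(topk_weights) for _ in range(world)]
    dist.all_gather(gathered_hidden, hidden, group=dp_group)
    dist.all_gather(gathered_ids, topk_ids, group=dp_group)
    dist.all_gather(gathered_weights, topk_weights, group=dp_group)

    all_hidden = torch.cat(gathered_hidden)
    all_ids = torch.cat(gathered_ids)
    all_weights = torch.cat(gathered_weights)

    # 2. Compute contributions only for locally-owned experts; tokens routed
    #    to experts on other ranks contribute zeros here.
    partial_out = apply_local_experts(all_hidden, all_ids, all_weights)

    # 3. Sum the partial outputs across expert shards and hand each DP rank
    #    back the slice for its own tokens.
    out = torch.empty_like(hidden)
    dist.reduce_scatter_tensor(out, partial_out, group=dp_group)
    return out
```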

tlrmchlsmth avatar Feb 26 '25 22:02 tlrmchlsmth

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run the fastcheck CI, which runs a small, essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

github-actions[bot] avatar Feb 26 '25 22:02 github-actions[bot]

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @tlrmchlsmth.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify[bot] avatar Feb 28 '25 08:02 mergify[bot]

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @tlrmchlsmth.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify[bot] avatar Mar 03 '25 02:03 mergify[bot]

GSM8k results with this PR plus @njhill's #13923 merged in, running neuralmagic/Mixtral-8x7B-Instruct-v0.1-FP8:

lm_eval --model local-completions --tasks gsm8k --model_args model=neuralmagic/Mixtral-8x7B-Instruct-v0.1-FP8,base_url=http://127.0.0.1:8192/v1/completions,num_concurrent=5,max_retries=3,tokenized_requests=False --limit 100

This PR:

VLLM_USE_V1=1 VLLM_TEST_ENABLE_EP=1 vllm serve neuralmagic/Mixtral-8x7B-Instruct-v0.1-FP8 --tensor_parallel_size=2 --data_parallel_size=2 --port 8192 --enforce-eager

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  | 0.61|±  | 0.049|
|     |       |strict-match    |     5|exact_match|↑  | 0.61|±  | 0.049|

Main:

vllm serve neuralmagic/Mixtral-8x7B-Instruct-v0.1-FP8 --tensor_parallel_size=2 --port 8192

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  | 0.62|±  |0.0488|
|     |       |strict-match    |     5|exact_match|↑  | 0.61|±  |0.0490|

tlrmchlsmth avatar Mar 04 '25 01:03 tlrmchlsmth

Must disable CUDA graphs by default when using DP+EP before landing, as it will deadlock otherwise

let's do it in https://github.com/vllm-project/vllm/blob/989f4f430cd74a14d539d8b59b9d239301f1bdcd/vllm/platforms/cuda.py#L111 ?

youkaichao avatar Mar 04 '25 03:03 youkaichao

Must disable CUDA graphs by default when using DP+EP before landing, as it will deadlock otherwise

let's do it in https://github.com/vllm-project/vllm/blob/989f4f430cd74a14d539d8b59b9d239301f1bdcd/vllm/platforms/cuda.py#L111 ?

Nice, TIL about check_and_update_config, thanks!
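For illustration, a rough sketch of the kind of guard being discussed for check_and_update_config in vllm/platforms/cuda.py; the exact shape of the check is an assumption here, not the merged change:

```python
# Hypothetical guard: force eager mode when data parallelism is enabled,
# since CUDA graph replay can deadlock when DP ranks must issue the EP
# collectives in lockstep.
class CudaPlatform:
    @classmethod
    def check_and_update_config(cls, vllm_config) -> None:
        parallel_config = vllm_config.parallel_config
        model_config = vllm_config.model_config
        if parallel_config.data_parallel_size > 1 and not model_config.enforce_eager:
            model_config.enforce_eager = True
```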

tlrmchlsmth avatar Mar 04 '25 03:03 tlrmchlsmth

Will we follow DeepSeek's approach here, using only EP & DP?

  • Prefilling Phase [Routed Expert EP32, MLA/Shared Expert DP32]: Each deployment unit spans 4 nodes with 32 redundant routed experts, where each GPU handles 9 routed experts and 1 shared expert.
  • Decoding Phase [Routed Expert EP144, MLA/Shared Expert DP144]: Each deployment unit spans 18 nodes with 32 redundant routed experts, where each GPU manages 2 routed experts and 1 shared expert.

DeepTecher avatar Mar 05 '25 03:03 DeepTecher

can we use "EP/TP MoE + DP Attention" on V0 ?

DefTruth avatar Mar 05 '25 05:03 DefTruth

can we use "EP/TP MoE + DP Attention" on V0 ?

No, DP is only added in V1.

youkaichao avatar Mar 05 '25 06:03 youkaichao

can we use "EP/TP MoE + DP Attention" on V0 ?

No, DP is only added in V1.

Got it ~

DefTruth avatar Mar 05 '25 06:03 DefTruth

can we use "EP/TP MoE + DP Attention" on V0 ?

No, DP is only added in V1.

Can I assume that most of the latest features and optimizations will only be available in V1, and will not be backported to V0?

WhoisZihan avatar Mar 05 '25 07:03 WhoisZihan

most of the latest features and optimizations will only be available in V1, and will not be backported to V0

yes.

youkaichao avatar Mar 05 '25 07:03 youkaichao

Does it support multi-node deployment, or can it only be deployed on a single machine?

v-lmn avatar Mar 05 '25 08:03 v-lmn

Does it support multi-node deployment, or can it only be deployed on a single machine?

@tlrmchlsmth

v-lmn avatar Mar 06 '25 01:03 v-lmn

@v-lmn yes it supports multi-node.

As with all things there are some caveats:

  • I haven't tested multi-node myself
  • The server being added in https://github.com/vllm-project/vllm/pull/13923 does not support multi-node yet, but will in a subsequent PR
  • The collective ops are suboptimal, especially in the multi-node case; we haven't integrated DeepEP yet.

tlrmchlsmth avatar Mar 06 '25 02:03 tlrmchlsmth

@v-lmn @tlrmchlsmth I have tested multi-node, and it works. BTW, do we plan to support CUDA graphs? I notice that SGLang supports CUDA graphs with attention DP.

ZeldaHuang avatar Mar 06 '25 02:03 ZeldaHuang

@ZeldaHuang Nice, thanks for testing and letting me know! Yes we'll support CUDA Graphs in a future PR

tlrmchlsmth avatar Mar 06 '25 02:03 tlrmchlsmth

It looks like vLLM's EP doesn't yet overlap communication (dispatch + combine) with computation? This is very important for MoE models like DeepSeek. Does vLLM have plans for this work?

xiuxin121 avatar Mar 26 '25 06:03 xiuxin121

It looks like vLLM's EP doesn't yet overlap communication (dispatch + combine) with computation? This is very important for MoE models like DeepSeek. Does vLLM have plans for this work?

Yes, this is in progress!
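For context, a minimal, hypothetical sketch of one way to overlap EP dispatch with expert compute by double-buffering chunks through an async all-to-all. This is not vLLM's implementation (that work was in progress at the time); `run_experts`, the even token split, and the omitted combine step are all assumptions.

```python
# Illustrative-only overlap of dispatch (all-to-all) with expert compute:
# chunk i+1's dispatch is in flight while chunk i's experts execute.
import torch
import torch.distributed as dist

def overlapped_dispatch_and_compute(tokens: torch.Tensor, run_experts, ep_group):
    chunks = list(tokens.chunk(2))
    dispatched = [torch.empty_like(c) for c in chunks]

    # Kick off the all-to-all dispatch for the first chunk.
    work = dist.all_to_all_single(dispatched[0], chunks[0],
                                  group=ep_group, async_op=True)
    outputs = []
    for i in range(len(chunks)):
        work.wait()
        if i + 1 < len(chunks):
            # Start dispatching the next chunk while this chunk's experts run.
            work = dist.all_to_all_single(dispatched[i + 1], chunks[i + 1],
                                          group=ep_group, async_op=True)
        outputs.append(run_experts(dispatched[i]))

    # The combine step (the reverse all-to-all) is omitted for brevity; it
    # would be pipelined the same way in the opposite direction.
    return torch.cat(outputs)
```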

tlrmchlsmth avatar Mar 27 '25 01:03 tlrmchlsmth

Does the current version support EP sharding across the full world size?

zsnoob avatar Mar 29 '25 23:03 zsnoob

How can I use DP for the attention module and EP for the expert module? I have 2 nodes with 16 GPUs total, and I can't find how to run this in the documentation. @youkaichao @tlrmchlsmth

nannaer avatar Apr 23 '25 06:04 nannaer

@ZeldaHuang Nice, thanks for testing and letting me know! Yes we'll support CUDA Graphs in a future PR

Hi @tlrmchlsmth, could you provide any insight into what needs to be addressed to support CUDA Graphs? I've observed that in this location, CUDA Graphs are disabled when DP (data parallelism) is enabled.

ZhongYingMatrix avatar Apr 23 '25 08:04 ZhongYingMatrix