Ada Lovelace support

Open xutizhou opened this issue 10 months ago • 8 comments

Hi team, thank you for your excellent work. I wonder whether this repo could support Ada Lovelace architecture GPUs such as the L20.

Thanks

xutizhou avatar Feb 25 '25 03:02 xutizhou

First of all, I'm not a member of the team.

In my understanding, as long as your cluster has RDMA (usually IB NICs and the corresponding software stack) and NVLink between GPUs, and the environment meets the NVSHMEM requirements, it may be usable.
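For anyone who wants to check the NVLink part on their own node, here is a minimal sketch (not part of DeepEP) that parses the connectivity matrix printed by `nvidia-smi topo -m` and reports whether every GPU pair is linked by NVLink (entries such as `NV18`) rather than PCIe (entries such as `PIX`, `PXB`, `PHB`, `NODE` or `SYS`). The helper names `gpu_links`, `all_pairs_nvlink` and `read_topology` are hypothetical, and `PCIE_SAMPLE` is a made-up two-GPU matrix for illustration, not output captured from an L20 machine.

```python
import re
import subprocess

ANSI = re.compile(r"\x1b\[[0-9;]*m")   # nvidia-smi may underline the header row
GPU_LABEL = re.compile(r"^GPU\d+$")

# Hypothetical sample: two GPUs connected only through PCIe, plus one NIC.
PCIE_SAMPLE = (
    "\x1b[4m\tGPU0\tGPU1\tNIC0\tCPU Affinity\tNUMA Affinity\x1b[0m\n"
    "GPU0\t X \tPIX\tSYS\t0-31\t0\n"
    "GPU1\tPIX\t X \tSYS\t0-31\t0\n"
    "NIC0\tSYS\tSYS\t X \n"
)


def read_topology():
    """Return the raw text of `nvidia-smi topo -m` (needs an NVIDIA driver)."""
    result = subprocess.run(
        ["nvidia-smi", "topo", "-m"], capture_output=True, text=True, check=True
    )
    return result.stdout


def gpu_links(topo_text):
    """Map (row_gpu, col_gpu) to the link type between them, e.g. 'PIX' or 'NV18'."""
    rows = []
    for raw in topo_text.splitlines():
        tokens = ANSI.sub("", raw).split()
        # The header and every GPU row start with a GPU label once split.
        if tokens and GPU_LABEL.match(tokens[0]):
            rows.append(tokens)
    if not rows:
        return {}
    header, body = rows[0], rows[1:]
    gpus = [t for t in header if GPU_LABEL.match(t)]  # GPU columns come first
    links = {}
    for tokens in body:
        src = tokens[0]
        for dst, link in zip(gpus, tokens[1:]):
            if src != dst:  # skip the 'X' self entry on the diagonal
                links[(src, dst)] = link
    return links


def all_pairs_nvlink(links):
    """True only if there is at least one GPU pair and every pair uses NVLink."""
    return bool(links) and all(link.startswith("NV") for link in links.values())


print(all_pairs_nvlink(gpu_links(PCIE_SAMPLE)))
```

On a real machine you would pass `read_topology()` instead of the sample. This only covers the NVLink condition; it says nothing about RDMA or the NVSHMEM requirements.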

NorthSecond avatar Feb 25 '25 04:02 NorthSecond

Could you please confirm whether Ada Lovelace GPUs support GPUDirect RDMA (GDR) and InfiniBand GPUDirect Async (IBGDA)? If so, DeepEP should also be able to run on this architecture.

haswelliris avatar Feb 25 '25 09:02 haswelliris

It does not work. NVSHMEM does not rely on NVLink. I tried it on one node with 8 L20 cards and it would not run successfully: after running for a while it reports an error, and it seems that one of the kernels fails during execution. Can Lyric Zhao give me some hints?

Image

wangzhen2271 avatar Mar 11 '25 10:03 wangzhen2271

Hi, have you successfully deployed DeepEP on L20?

Xiaofei-fei avatar Jun 10 '25 06:06 Xiaofei-fei

Hi @Xiaofei-fei, I ran into the same issue as you. Have you deployed it successfully?

MengYu10151 avatar Aug 19 '25 04:08 MengYu10151

We have resolved most of the issues in intranode mode, and DeepEP can now run together with SGLang, but some problems are still being worked on.

Xiaofei-fei avatar Aug 19 '25 06:08 Xiaofei-fei

By the way, I noticed your technical talk on deploying DeepEP on PCIe GPUs, and I am very interested in the idea of merging the low-latency and normal-mode kernels. Could you please share a contact so that we can discuss the technical details further?

Xiaofei-fei avatar Sep 08 '25 09:09 Xiaofei-fei

We really appreciate your attention to our work! We have already submitted a PR to support normal mode without NVLink: https://github.com/deepseek-ai/DeepEP/pull/375. You can contact me via WeChat at misty10151. Thanks :)

MengYu10151 avatar Sep 08 '25 10:09 MengYu10151