yanminjia

14 comments by yanminjia

Adaptive routing is not enabled on our RoCE network. In fact, adaptive routing is not always available on RoCE networks. Additionally, based on my understanding, adaptive routing is used for...

I think congestion would happen if traffic from multiple rdma_ranks goes to the same target rdma_rank at the same time, unless some kind of traffic scheduling mechanism were implemented in...

> We tested an early version of DeepEP (without atomic operations) on a RoCE network. When using the default ECMP routing, we also observed huge performance degradation. However, after switching...

> Hello, > > We also tested `test_internode.py` over a RoCE network with 4 H800 servers, each equipped with 8 GPUs and 8 dual-port CX7 RNICs. We got similar throughput, about...

> > We tested an early version of DeepEP (without atomic operations) on a RoCE network. When using the default ECMP routing, we also observed huge performance degradation. However, after...

> > Have you observed any congestion in your NIC counters? Yes, indeed. If multiple senders transfer data to a single receiver at the same time, on the leaf switch,...
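To put rough numbers on the many-senders-to-one-receiver case described above, here is a minimal back-of-the-envelope sketch. It is not taken from DeepEP or from any switch vendor's tooling; the function names, the 400 Gb/s link rate, and the 1 MB buffer are all hypothetical values chosen only for illustration.

```python
def incast_load(num_senders: int, sender_gbps: float, receiver_gbps: float) -> dict:
    """Offered load on the single switch port facing the receiver.

    All inputs are hypothetical illustration values, not measurements.
    """
    if num_senders < 0 or sender_gbps < 0 or receiver_gbps <= 0:
        raise ValueError("rates must be non-negative and receiver rate positive")
    offered = num_senders * sender_gbps
    # Traffic that cannot be drained by the egress port and must be buffered.
    excess = max(0.0, offered - receiver_gbps)
    return {
        "offered_gbps": offered,
        "excess_gbps": excess,
        "oversubscription": offered / receiver_gbps,
    }


def buffer_fill_time_us(buffer_bytes: int, excess_gbps: float) -> float:
    """Time in microseconds until a buffer of the given size fills at the excess rate."""
    if excess_gbps <= 0:
        return float("inf")  # no excess traffic, the buffer never fills
    return buffer_bytes * 8 / (excess_gbps * 1e9) * 1e6


if __name__ == "__main__":
    # Assumed scenario: 4 senders at line rate into one 400 Gb/s receiver port.
    load = incast_load(num_senders=4, sender_gbps=400.0, receiver_gbps=400.0)
    print(load)
    # Assumed 1 MB of buffer available to that port.
    print(buffer_fill_time_us(1_000_000, load["excess_gbps"]))
```

Under these assumed numbers the port is oversubscribed 4x and the buffer fills in a few microseconds, which is why some form of flow control or traffic scheduling would be needed to avoid drops.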

Please note that RoCE is quite different from IB with respect to flow control. I'd be quite surprised if you didn't see any traffic congestion when running DeepEP over RoCE. Basically,...

Hi @sphish, by the way, do you have test results on a RoCE network? I'm a little confused about why PFC is configured if no congestion is caused by DeepEP...

Many thanks. I've added you to my WeChat contact list. :)

> > You can share the Chinese version in issues (a new issue is also OK) as well (or a forked repo link or blog link) 👍🏻 > > OK, I will...