AWS EFA workaround

Open kumpera opened this issue 3 years ago • 4 comments

Until proper EFA support lands, it would be great to detect EFA and disable IB in those cases to improve the out-of-the-box experience of using TensorPipe.

We're incorporating a workaround on PyTorch: https://github.com/pytorch/pytorch/pull/77363
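For illustration only, here is a minimal sketch of what "detect EFA" could look like on Linux. It is not taken from the linked PR. It assumes that RDMA devices are listed under `/sys/class/infiniband` and that an EFA device's `device/driver` symlink points at a kernel driver named `efa`. The helper names `find_efa_devices` and `should_skip_ib` are hypothetical.

```python
import os


def find_efa_devices(sysfs_root="/sys/class/infiniband"):
    """Return the names of RDMA devices that appear to be bound to the 'efa' driver.

    Assumption: each entry under sysfs_root has a 'device/driver' symlink whose
    target's last path component is the kernel driver name.
    """
    if not os.path.isdir(sysfs_root):
        # No RDMA devices are exposed at all (or this is not Linux).
        return []
    found = []
    for name in sorted(os.listdir(sysfs_root)):
        driver_link = os.path.join(sysfs_root, name, "device", "driver")
        if not os.path.islink(driver_link):
            continue
        driver = os.path.basename(os.readlink(driver_link))
        if driver == "efa":
            found.append(name)
    return found


def should_skip_ib(sysfs_root="/sys/class/infiniband"):
    """Hypothetical policy: skip the IB transport if any EFA device is present."""
    return len(find_efa_devices(sysfs_root)) > 0
```

A caller could check `should_skip_ib()` once at startup and leave the IB-based transports and channels out of its configuration when it returns `True`. How that is wired into TensorPipe or PyTorch is outside the scope of this sketch.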

kumpera avatar May 12 '22 19:05 kumpera

What does EFA support currently look like? cc @lw

cadedaniel avatar May 25 '22 01:05 cadedaniel

What does EFA support currently look like? cc @lw

I see this issue: https://github.com/pytorch/pytorch/issues/65022. It looks like TensorPipe fails on EFA because EFA doesn't implement all of the InfiniBand features.

Is there any idea how much work it would be to add EFA support to TensorPipe?

cadedaniel avatar May 25 '22 21:05 cadedaniel

The best case scenario is that the SRQ (shared receive queue), which I mentioned in the other issue, is the only feature gap of EFA, as it should be possible to detect that it's missing and implement a workaround in that case. If so, then the current backends might just work with minimal changes.
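As a rough sketch of the "detect that it's missing" idea: libibverbs reports a `max_srq` field in `ibv_device_attr` (filled in by `ibv_query_device`), and a value of 0 would mean the device offers no SRQs. The snippet below only models the decision and takes that value as a plain integer. The `ReceiveQueueMode` enum and the per-connection fallback are hypothetical, and whether EFA actually reports 0 here is an assumption I haven't verified.

```python
from enum import Enum


class ReceiveQueueMode(Enum):
    SHARED = "shared"                  # one SRQ shared by all queue pairs
    PER_CONNECTION = "per_connection"  # hypothetical fallback: one receive queue per queue pair


def choose_receive_queue_mode(max_srq: int) -> ReceiveQueueMode:
    """Pick a receive-queue strategy from the device's reported SRQ limit.

    max_srq is assumed to come from ibv_device_attr.max_srq.
    """
    if max_srq > 0:
        return ReceiveQueueMode.SHARED
    return ReceiveQueueMode.PER_CONNECTION
```

With this shape, the rest of a backend could stay the same and only the receive-buffer posting logic would branch on the chosen mode.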

However, AFAIK Amazon doesn't officially support using EFA via the libibverbs API. Hence, even if it might work now, we might run into issues in the future, or we might not be able to access all capabilities. The most robust solution would be to implement new backends, which would most likely share the exact design of today's IB backends but use the libfabric API (which I haven't looked into yet).

lw avatar May 30 '22 09:05 lw

Alright, thanks @lw! I may be interested in contributing a backend which works on EFA, will reach out if so.

cadedaniel avatar May 31 '22 21:05 cadedaniel