AWS EFA workaround
Until proper EFA support lands, it would be great to detect EFA and disable IB in those cases, to improve the out-of-the-box experience of using TensorPipe.
We're incorporating a workaround on PyTorch: https://github.com/pytorch/pytorch/pull/77363
What does EFA support currently look like? cc @lw
I see this issue: https://github.com/pytorch/pytorch/issues/65022. It looks like TensorPipe fails on EFA because EFA doesn't implement all of the InfiniBand features.
Is there any idea how much work it would be to add EFA support to TensorPipe?
The best-case scenario is that the SRQ (shared receive queue), which I mentioned in the other issue, is the only feature gap of EFA: it should be possible to detect that it's missing and implement a workaround in that case. If so, the current backends might just work with minimal changes.
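As a sketch of the detection side: `ibv_query_device()` fills in an `ibv_device_attr` struct whose `max_srq` field gives the number of SRQs the device supports, so a value of 0 would indicate the gap described above. The decision helper and enum below are hypothetical names for illustration, not existing TensorPipe code:

```cpp
#include <cstdint>

// Hypothetical choice between the two receive-queue strategies a backend
// could use.
enum class RecvQueueMode {
  SharedSrq,        // one SRQ shared across connections (current IB design)
  PerConnectionRq,  // fallback: an ordinary receive queue per connection
};

// maxSrq is assumed to come from the max_srq field of ibv_device_attr,
// as filled in by ibv_query_device(). Zero means the device exposes no
// shared receive queues at all, which is the gap suspected for EFA.
RecvQueueMode pickRecvQueueMode(uint32_t maxSrq) {
  return maxSrq == 0 ? RecvQueueMode::PerConnectionRq
                     : RecvQueueMode::SharedSrq;
}
```

The workaround itself (emulating SRQ semantics with per-connection receive queues) would of course be the larger part of the change; this only shows where the capability check could hook in.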
However, AFAIK Amazon doesn't officially support using EFA via the libibverbs API; hence, even if it works now, we might hit issues in the future, or we might not be able to access all capabilities. The most robust solution would be to implement new backends, which would most likely share the exact design of today's IB backends but use the libfabric API (which I haven't looked into and haven't learned yet).
Alright, thanks @lw! I may be interested in contributing a backend which works on EFA, will reach out if so.