yu lin
> Support for fault tolerance and elasticity

This is a quasi-fault tolerance since NCCL communication must always be recreated when an error occurs. However, it's still worth implementing because recreating...
/lgtm /approve
Thanks @johnugeorge, I will discuss the arena roadmap with contributors.
Thank you for your PR, but it's a breaking change that will prevent installations of the older version of arena from being upgraded to the latest version.
Considering that there are currently multiple versions of the mpi-operator (mpijob v1 in training-operator and mpijob v2 in mpi-operator), further observation is required for some time.
@samzong We have this plan, and we are also considering supporting Ray.
cc @cheyang @xieydd
> Dependencies and Go version also need to be upgraded.

Agree.
> is this issue https://github.com/kubeflow/arena/issues/51 plan to supported this year? we can give some contribute, if you need.

@chacha923 Thanks! Feel free to pick this up if you have enough time to work on it.