yu lin

Results: 18 comments by yu lin

> Support for fault tolerance and elasticity

This is a quasi-fault tolerance since NCCL communication must always be recreated when an error occurs. However, it's still worth implementing because recreating...

Thanks @johnugeorge, I will discuss the arena roadmap with contributors.

Thank you for your PR, but it's a breaking change that will prevent installations of the older version of arena from being upgraded to the latest version.

Since there are currently multiple versions of the mpi-operator (mpijob v1 in training-operator and mpijob v2 in mpi-operator), we need to observe the situation for some time.

@samzong We have this plan, and we are also considering supporting Ray.

> Dependencies and Go version also need to be upgraded.

Agree.

> Is this issue https://github.com/kubeflow/arena/issues/51 planned to be supported this year? We can contribute if you need.

@chacha923 Thanks! Please go ahead if you have enough time to work on this.