Xin Wang comments

Results: 15 comments of Xin Wang

Training job restart enhancement

One example is NCCL communication getting stuck due to a GPU failure. Eventually it will time out and the kubelet will react to that, but that might be a long wait, depending on the...

Training job restart enhancement

@andreyvelich In short, we want fast recovery before the NCCL timeout, instead of relying on the NCCL timeout to trigger our recovery/error handler. This is mainly because we saw other signals...

Training job restart enhancement

> What kind of signals do you monitor to detect such failures? Do you track GPU utilization via the NVIDIA DCGM exporter or something else? We have our own...

Training job restart enhancement

Any further input on the above issue?

Training job restart enhancement

Sure, besides https://github.com/kubeflow/training-operator/blob/1f336d01af2c1e305bd6e660e079ffea107a51a9/docs/proposals/2170-kubeflow-training-v2/README.md#user-roles-diagram, are there any other Google Docs I could read to catch up on the latest Kubeflow Training V2 discussion?