Xin Wang

Results 15 comments of Xin Wang

One example is NCCL communication stuck due to GPU failure, eventually it will timeout and kueblet will react on that, but that might be a long wait depends on the...

@andreyvelich In short, we want fast recovery before NCCL time out, instead of using NCCL time out to trigger our recovery/error handler. This is mainly because we saw other signals...

> What kind of signals do you monitor to detect such failures ? Do you track GPU utilization via Nvidia DCGM exporter or something else ? We have our own...

Any further input for above issue?

Sure, besides https://github.com/kubeflow/training-operator/blob/1f336d01af2c1e305bd6e660e079ffea107a51a9/docs/proposals/2170-kubeflow-training-v2/README.md#user-roles-diagram, is there any other google docs I could read to catch up the latest Kubeflow Training V2 discussion?