Xin Wang
One example is NCCL communication getting stuck due to a GPU failure. Eventually it will time out and the kubelet will react to that, but that might be a long wait depending on the...
@andreyvelich In short, we want fast recovery before the NCCL timeout, instead of relying on the NCCL timeout to trigger our recovery/error handler. This is mainly because we saw other signals...
> What kind of signals do you monitor to detect such failures? Do you track GPU utilization via the NVIDIA DCGM exporter or something else?

We have our own...
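
To make the idea above concrete, here is a minimal sketch of a watchdog that reacts to an external health signal before the NCCL timeout would fire. Everything in it is an assumption for illustration: `watch_for_failure` and `is_healthy` are hypothetical names, the health check stands in for whatever signal source is actually used, and nothing here calls NCCL, DCGM, or any Kubernetes API.

```python
import time
from typing import Callable


def watch_for_failure(
    is_healthy: Callable[[], bool],
    nccl_timeout_s: float,
    poll_interval_s: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll a health signal until it fails or the timeout window elapses.

    Returns "health_signal" if the (hypothetical) health check reports a
    failure first, so recovery can start early. Returns "nccl_timeout" if
    the whole window passes with no failure reported, which is the slow
    path where only the NCCL timeout would have caught the problem.
    """
    deadline = clock() + nccl_timeout_s
    while clock() < deadline:
        if not is_healthy():
            # Failure detected early: trigger recovery without waiting.
            return "health_signal"
        sleep(poll_interval_s)
    return "nccl_timeout"
```

The clock and sleep functions are injectable only so the loop can be exercised without real waiting. In practice the poll interval would be chosen to be much shorter than the NCCL timeout, otherwise the watchdog gives no advantage.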
Any further input on the above issue?
Sure. Besides https://github.com/kubeflow/training-operator/blob/1f336d01af2c1e305bd6e660e079ffea107a51a9/docs/proposals/2170-kubeflow-training-v2/README.md#user-roles-diagram, are there any other Google Docs I could read to catch up on the latest Kubeflow Training V2 discussion?