One question about "GPU Hardware Failure Detection"
Inside one RayCluster, if one GPU (a Ray worker node) has a hardware issue, how is the cluster recovered? Is only the affected GPU worker replaced, or is the whole RayCluster replaced and marked as failed?
Could you provide any details about "GPU Hardware Failure Detection: Proactive detection of GPU hardware issues"?
@gegenhua913 thanks for your interest. GPU hardware failure detection is currently a separate project, and we have not fully integrated it with upper-layer applications.
Technically, a GPU failure should cause the vLLM instance on that GPU to fail, which leaves the cluster with a failed node. The RayCluster controller then picks up the change, reconciles the cluster, and replaces the failed node with a new one. Honestly, though, I have not experienced this issue yet and have not fully tested this workflow.
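To make the expected behavior concrete, here is a toy Python model of that reconcile step. It is only a sketch of the idea, not the RayCluster controller's actual code: the `Worker` and `Cluster` classes, the `healthy` flag, and the `reconcile` function are all hypothetical names made up for illustration. It assumes the controller replaces only the failed worker rather than the whole cluster.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class Worker:
    """Hypothetical stand-in for one GPU worker node."""
    name: str
    healthy: bool = True


@dataclass
class Cluster:
    """Hypothetical stand-in for a cluster with a desired worker count."""
    desired_workers: int
    workers: list = field(default_factory=list)
    _ids: count = field(default_factory=count, repr=False)

    def new_worker(self) -> Worker:
        # Each replacement gets a fresh name so it is distinguishable
        # from the worker it replaces.
        return Worker(name=f"worker-{next(self._ids)}")


def reconcile(cluster: Cluster) -> list:
    """Remove unhealthy workers and create replacements until the
    desired count is met. Returns the names of removed workers."""
    removed = [w.name for w in cluster.workers if not w.healthy]
    cluster.workers = [w for w in cluster.workers if w.healthy]
    while len(cluster.workers) < cluster.desired_workers:
        cluster.workers.append(cluster.new_worker())
    return removed
```

In this model, marking one worker as unhealthy and calling `reconcile` again removes just that worker and adds one new worker, leaving the healthy ones untouched. Whether the real workflow behaves the same way still needs to be tested.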
Do you have any concerns on the failure recovery side?
Thanks for replying.