One question about "GPU Hardware Failure Detection"

Open gegenhua913 opened this issue 7 months ago • 2 comments

Within a single RayCluster, if one GPU (a Ray worker node) has a hardware issue, how should the cluster be recovered?

- Replace only the affected worker (GPU)?
- Replace the whole RayCluster?
- Mark the cluster as failed?

Could you provide more details about "GPU Hardware Failure Detection: Proactive detection of GPU hardware issues"?

May 22 '25 08:05 gegenhua913

@gegenhua913 Thanks for your interest. GPU hardware failure detection is currently a separate project, and we have not yet fully integrated it with the upper-layer applications.

Technically, a GPU failure should cause the vLLM instance on that GPU to fail, which leaves a failed node in the cluster. The RayCluster controller should then pick up the change, reconcile the cluster, and replace the failed node with a new one. Honestly, though, I have not experienced this issue yet and have not fully tested this workflow.
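To make the reconcile idea concrete, here is a toy Python sketch of one reconcile pass. It is only an illustration under assumptions, not the actual RayCluster controller code: `Worker`, `Cluster`, `reconcile`, and the `healthy` flag are all hypothetical names, and a real controller would watch pod status rather than flip a boolean.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class Worker:
    # Hypothetical stand-in for a Ray worker node backed by one GPU.
    name: str
    healthy: bool = True


@dataclass
class Cluster:
    # Hypothetical stand-in for a RayCluster spec plus its observed workers.
    desired_workers: int
    workers: list = field(default_factory=list)


_ids = count(1)  # simple name generator for new workers


def reconcile(cluster: Cluster) -> Cluster:
    """One reconcile pass: drop failed workers, then add replacements
    until the observed worker count matches the desired count."""
    cluster.workers = [w for w in cluster.workers if w.healthy]
    while len(cluster.workers) < cluster.desired_workers:
        cluster.workers.append(Worker(name=f"worker-{next(_ids)}"))
    return cluster


# Bring up a cluster with three workers.
cluster = reconcile(Cluster(desired_workers=3))

# Simulate a GPU failure that takes down the vLLM instance on the second worker.
cluster.workers[1].healthy = False

# The next reconcile pass replaces only the failed worker.
reconcile(cluster)
print([w.name for w in cluster.workers])
```

In this sketch only the failed worker is replaced and the rest of the cluster is left alone, which is the behavior I would expect from the workflow, but again I have not verified it on real hardware.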

Do you have any concerns on the failure recovery side?

May 23 '25 21:05 Jeffwan

Thanks for replying.

May 24 '25 07:05 gegenhua913