One question about "GPU Hardware Failure Detection"

Open gegenhua913 opened this issue 7 months ago • 2 comments

Inside one Ray cluster, if one GPU (a Ray worker node) has a hardware issue, how should the cluster be recovered?
- Just replace the failed GPU worker?
- Replace the entire Ray cluster and mark the current one as failed?

Could you provide more details about "GPU Hardware Failure Detection: Proactive detection of GPU hardware issues"?

gegenhua913 avatar May 22 '25 08:05 gegenhua913

@gegenhua913 thanks for your interest. GPU hardware failure detection is currently a separate project, and we have not fully integrated it with the upper-layer applications.

Technically, a GPU failure should cause the vLLM instance on that GPU to fail, which leaves a failed node in the cluster. The RayCluster controller should then pick up the change, reconcile the cluster, and replace the failed node with a new one. Honestly, though, I have not experienced this issue yet and have not fully tested this workflow.
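To make the expected flow concrete, here is a minimal, purely illustrative sketch of a reconcile pass in Python. It is not the RayCluster controller's actual code and does not call any Ray, KubeRay, or AIBrix API. `Worker`, `Cluster`, and `reconcile` are hypothetical names, and the `healthy` flag stands in for whatever signal the controller really receives when a vLLM instance fails.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Worker:
    """Hypothetical stand-in for one Ray worker node (one GPU)."""
    name: str
    healthy: bool = True


@dataclass
class Cluster:
    """Hypothetical stand-in for a Ray cluster with a desired worker count."""
    desired_workers: int
    workers: List[Worker] = field(default_factory=list)
    next_id: int = 1  # used only to give replacement workers unique names


def reconcile(cluster: Cluster) -> List[str]:
    """Run one reconcile pass and return the names of the workers it removed.

    Assumed behavior: unhealthy workers are dropped, then new workers are
    created until the cluster is back at its desired size.
    """
    removed = [w.name for w in cluster.workers if not w.healthy]
    cluster.workers = [w for w in cluster.workers if w.healthy]

    while len(cluster.workers) < cluster.desired_workers:
        cluster.workers.append(Worker(name=f"worker-{cluster.next_id}"))
        cluster.next_id += 1

    return removed


if __name__ == "__main__":
    cluster = Cluster(desired_workers=3)
    reconcile(cluster)                   # initial pass creates three workers
    cluster.workers[1].healthy = False   # simulate a GPU failure on one worker
    print("removed:", reconcile(cluster))
    print("workers:", [w.name for w in cluster.workers])
```

In this sketch only the failed worker is replaced and the rest of the cluster is left untouched. Whether the real controller behaves this way after a GPU fault is exactly the part that has not been verified.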

Do you have any concerns on the failure recovery side?

Jeffwan avatar May 23 '25 21:05 Jeffwan

Thanks for replying.

gegenhua913 avatar May 24 '25 07:05 gegenhua913