Question about GPU Hardware Failure Detection feature

Open emeraldbay opened this issue 7 months ago • 5 comments

Hi:

The release notes mention that aibrix supports 'GPU Hardware Failure Detection: Proactive detection of GPU hardware issues'. There is a separate package, https://github.com/aibrix/ai-accelerator-tool/tree/main/pkg/diagnose, but it seems to be a basic nvidia-smi based check, and it is not clear how it integrates with the main aibrix package. Could anyone comment on this? Did I miss anything, or is this feature still being planned?

Thanks

emeraldbay avatar May 18 '25 06:05 emeraldbay

Could anyone help comment on this? @Jeffwan Thanks

emeraldbay avatar May 19 '25 00:05 emeraldbay

@emeraldbay Thanks for reaching out. We have not kicked off the integration work yet. We planned to support a request migration flow but do not have the bandwidth at the moment. Are you looking for something specific? I would love to hear your requirements and see whether there is anything we can help with.

Jeffwan avatar May 20 '25 23:05 Jeffwan

Thanks for the info. https://github.com/aibrix/ai-accelerator-tool seems very basic and we are not interested in using it. When you plan the diagnose integration interface, it would be better if Aibrix allowed third-party diagnose tools/signals. BTW, is the diagnose tool integration planned in the v0.4.0 roadmap? I did not see it in https://github.com/vllm-project/aibrix/issues/1098

emeraldbay avatar May 21 '25 15:05 emeraldbay

@emeraldbay It really depends on the requirements from users. If there is enough interest or enough suggestions from the user side, we will prioritize the work. It seems you expect a general interface that can be easily extended to different hardware? And probably also some user scenarios to fully unlock the diagnose tool's potential?

Jeffwan avatar May 21 '25 16:05 Jeffwan

The hardware will still be Nvidia GPUs, but we are looking for alternatives to https://github.com/aibrix/ai-accelerator-tool since its logic is very basic. Yes, a general interface will help, and we are also trying to understand how you plan to integrate hardware failure signals with the autoscaler.

emeraldbay avatar May 21 '25 16:05 emeraldbay