Question about the GPU Hardware Failure Detection feature
Hi:
The release notes mention that AIBrix supports 'GPU Hardware Failure Detection: Proactive detection of GPU hardware issues'. There is a separate package, https://github.com/aibrix/ai-accelerator-tool/tree/main/pkg/diagnose, but it seems to be a basic nvidia-smi-based check, and it is not clear how it integrates with the main AIBrix package. Could anyone comment on this? Did I miss anything, or is this feature still being planned?
Thanks
Could anyone help comment on this? @Jeffwan Thanks
@emeraldbay Thanks for reaching out. We have not kicked off the integration work yet. We planned to support a request migration flow but do not have the bandwidth at the moment. Are you looking for something specific? I would love to hear your requirements and see whether there is anything we can help with.
Thanks for the info. https://github.com/aibrix/ai-accelerator-tool seems very basic, and we are not interested in using it. When you plan the diagnose integration interface, it would be better if AIBrix allowed third-party diagnose tools/signals to be plugged in. BTW, is the diagnose tool integration planned for the v0.4.0 roadmap? I did not see it in https://github.com/vllm-project/aibrix/issues/1098
@emeraldbay It really depends on user requirements. If there is enough interest or there are enough suggestions from the user side, we will prioritize the work. It seems you expect a general interface that can be easily extended to different hardware? And probably also some user scenarios to fully unlock the diagnose tool's potential?
The hardware will still be NVIDIA GPUs, but we are looking for alternatives to https://github.com/aibrix/ai-accelerator-tool since its logic is very basic. Yes, a general interface would help, and we are also trying to understand how you plan to integrate hardware failure signals with the autoscaler.
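To make the request more concrete, here is a minimal sketch of what a pluggable diagnose interface could look like. Every name in it (`DiagnoseSource`, `GpuHealthSignal`, `Severity`, `StaticSource`, `unschedulable_nodes`, `usable_replicas`) is hypothetical and chosen only for illustration; none of it is an existing AIBrix or ai-accelerator-tool API, and how the autoscaler would actually consume such signals is still an open question in this thread.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable, Protocol


class Severity(Enum):
    """Hypothetical severity levels a diagnose tool might report."""
    OK = 0
    DEGRADED = 1
    FAILED = 2


@dataclass(frozen=True)
class GpuHealthSignal:
    """Hypothetical signal shape: one health observation for one GPU."""
    node: str
    gpu_index: int
    severity: Severity
    reason: str = ""


class DiagnoseSource(Protocol):
    """Anything that can report GPU health; a third-party tool would implement this."""

    def poll(self) -> Iterable[GpuHealthSignal]: ...


class StaticSource:
    """Stand-in for a real third-party tool; it just returns canned signals."""

    def __init__(self, signals: Iterable[GpuHealthSignal]) -> None:
        self._signals = list(signals)

    def poll(self) -> Iterable[GpuHealthSignal]:
        return list(self._signals)


def unschedulable_nodes(
    sources: Iterable[DiagnoseSource],
    threshold: Severity = Severity.FAILED,
) -> set[str]:
    """Collect nodes with at least one GPU at or above the severity threshold."""
    bad: set[str] = set()
    for source in sources:
        for signal in source.poll():
            if signal.severity.value >= threshold.value:
                bad.add(signal.node)
    return bad


def usable_replicas(
    replica_nodes: Iterable[str],
    sources: Iterable[DiagnoseSource],
) -> int:
    """Count replicas whose node has no failed GPU (one replica per list entry)."""
    bad = unschedulable_nodes(sources)
    return sum(1 for node in replica_nodes if node not in bad)
```

With something like this, an autoscaler could compare `usable_replicas(...)` against the desired replica count and scale up to replace replicas on failed nodes, while the diagnose logic itself stays in whichever tool the user plugs in.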