Xin Wang
> I guess we are not running the tests from #3293 on this PR. We would need to do that to ensure package has been installed properly. I have enabled...
Hi, what is the latest on this PR?
Could anyone comment on this? @Jeffwan Thanks.
Thanks for the info. https://github.com/aibrix/ai-accelerator-tool seems very basic, and we are not interested in using it. It would be better if, when you plan the diagnosis integration interface, Aibrix could allow...
The hardware will still be NVIDIA GPUs, but we are looking for alternatives to https://github.com/aibrix/ai-accelerator-tool since its logic is very basic. Yes, a general interface would help, and we also try...
Thanks. Could you please provide some context on the difference between Kubeflow training operator v2 and JobSet? Is JobSet expected to eventually replace Kubeflow training operator in terms of training...
Thanks. @tenzen-y Could you please comment on the questions above?
Any update on this?
When a GPU failure happens, the training job might just get stuck, and the pod does not exit with failure.
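To illustrate the stuck-without-failing case, here is a minimal sketch of an in-process watchdog, assuming the training loop can report progress once per step. `ProgressWatchdog`, `beat()`, and the timeout values are hypothetical names for illustration only and are not part of any operator or tool mentioned here.

```python
import os
import threading
import time


class ProgressWatchdog:
    """Hypothetical helper: runs on_stall if beat() is not called for timeout_s seconds."""

    def __init__(self, timeout_s, on_stall=None, poll_s=1.0):
        self.timeout_s = timeout_s
        # Assumed default action: hard-exit with a non-zero code so the
        # container terminates with failure instead of hanging.
        self.on_stall = on_stall or (lambda: os._exit(1))
        self.poll_s = poll_s
        self._last_beat = time.monotonic()
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def beat(self):
        """Record progress; call this once per training step."""
        self._last_beat = time.monotonic()

    def stalled(self):
        """True if no progress has been recorded within timeout_s."""
        return time.monotonic() - self._last_beat > self.timeout_s

    def start(self):
        self._thread.start()

    def stop(self):
        self._stop.set()

    def _run(self):
        # Event.wait returns False on timeout and True once stop() is called.
        while not self._stop.wait(self.poll_s):
            if self.stalled():
                self.on_stall()
                return
```

In a training loop this would be used roughly as `wd = ProgressWatchdog(timeout_s=600); wd.start()` followed by `wd.beat()` after each step. The default uses `os._exit` rather than `sys.exit` because `sys.exit` called from a background thread only ends that thread, which would leave a hung main thread running.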