Xin Wang

15 comments by Xin Wang

> I guess we are not running the tests from #3293 on this PR. We would need to do that to ensure the package has been installed properly. I have enabled...

Could anyone help comment on this? @Jeffwan Thanks

Thanks for the info. https://github.com/aibrix/ai-accelerator-tool seems very basic, and we are not interested in using it. It would be better if, when you plan the diagnosis integration interface, Aibrix could allow...

The hardware will still be NVIDIA GPUs, but we are looking for alternatives to https://github.com/aibrix/ai-accelerator-tool since its logic is very basic. Yes, a general interface will help, and we also try...

Thanks. Could you please provide some context on the difference between Kubeflow Training Operator v2 and JobSet? Is JobSet expected to eventually replace Kubeflow Training Operator in terms of training...

Thanks. @tenzen-y Could you please comment on the questions above?

When a GPU failure happens, the training job might just get stuck, and the pod does not exit with a failure.