[Train] Decouple device-related modules and add Huawei NPU support to Ray Train
Why are these changes needed?
We are looking to expand the hardware coverage of Ray Train by adding support for Huawei Ascend NPUs.
However, as the number of supported hardware types grows, device-specific modifications have been scattered throughout the code, which hurts future compatibility and maintainability.
To address this, we have extracted the device-related modules from Ray Train and consolidated them into `accelerator_utils`. This keeps the device-specific code independent of the core logic, which improves maintainability.
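To make the intended structure concrete, here is a minimal sketch of what such a per-accelerator abstraction could look like, assuming a PyTorch-based worker. `DeviceManager` is a name that comes up later in this thread, but the method names and the CUDA backend below are illustrative assumptions, not the exact API of this PR.

```python
# Hypothetical sketch of the decoupling: each accelerator backend (CUDA,
# Ascend NPU, HPU, ...) implements one interface, so core Ray Train code
# never branches on device type. Method names are illustrative assumptions.
from abc import ABC, abstractmethod
from typing import List

import torch


class DeviceManager(ABC):
    """Per-accelerator hooks used by a training worker."""

    @abstractmethod
    def is_available(self) -> bool:
        """Return True if this accelerator is usable in the current process."""

    @abstractmethod
    def get_devices(self) -> List[torch.device]:
        """Return the devices assigned to this worker."""

    @abstractmethod
    def set_device(self, device: torch.device) -> None:
        """Make `device` current for subsequent operations."""


class CUDADeviceManager(DeviceManager):
    """Example backend; an NPU or HPU manager would mirror this shape."""

    def is_available(self) -> bool:
        return torch.cuda.is_available()

    def get_devices(self) -> List[torch.device]:
        return [torch.device("cuda", i) for i in range(torch.cuda.device_count())]

    def set_device(self, device: torch.device) -> None:
        torch.cuda.set_device(device)
```

Under this layout, a new accelerator backend would implement the same interface in its own module under `accelerator_utils`, so adding hardware support never touches the core training loop.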
Related issue number
Checks
- [ ] I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - [ ] Unit tests
  - [ ] Release tests
  - [ ] This PR is not tested :(
This is our plan to enhance support for third-party devices in Ray Train, which will also help expand device compatibility in RLlib. Looking forward to your feedback @woshiyyya . We kindly invite developers working with HPU, NPU, and AMD GPUs to stay informed and engaged with these updates. @kira-lin @matthewdeng @vickytsang @nemo9cby @Bye-legumes
Thanks for the contribution! We'll review it soon :)
Hi @woshiyyya, may I ask if the modifications made above have met your expectations? If you have any other concerns, we would be happy to provide further information.
@liuxsh9 Sure! I'll take another look in the next few days!
@woshiyyya can you follow up? I'm marking it as P2 for now, @liuxsh9, as I'm not seeing any users blocked on this; do tell me if I'm wrong, though, and we can adjust the priority here.
Hi @anyscalesam @liuxsh9 , this PR involves major changes to Ray Train's device management, which need to be fully tested to ensure stability. We will discuss next steps with the Train team and give an update soon.
@woshiyyya This looks like a nice accelerator abstraction, and we plan to refactor the Intel GPU backend as well. A follow-up PR will build on this. Thanks for the design and implementation, @liuxsh9!
@liuxsh9 thanks for the update! I left some more comments.
Also, @harborn @kira-lin can you help review the HPU part of this PR? Thank you!
@woshiyyya Thanks for the review! We've addressed the feedback and made changes. Please take another look.
Thanks for the review! We have made the following major adjustments:
- Carefully controlled the scope of `DeviceManager`, eliminating unnecessary abstractions and parameter passing (see the sketch after this list).
- Reviewed and cleaned up unnecessary functions.
- Added unit tests.
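As a rough illustration of that scoping change, here is a minimal sketch, assuming the `DeviceManager` abstraction sketched earlier in this thread; `get_torch_device` and `train_step` are hypothetical helpers, not this PR's actual API.

```python
# A minimal sketch of the "controlled scope" idea: the device is resolved
# once inside the worker, so device handling never leaks into user-facing
# signatures. Helper names here are hypothetical, not the PR's API.
import torch


def get_torch_device() -> torch.device:
    # In the refactored layout, each backend's availability check would live
    # in its own module under accelerator_utils; the training loop only ever
    # sees the resulting torch.device.
    if torch.cuda.is_available():
        return torch.device("cuda", torch.cuda.current_device())
    return torch.device("cpu")


def train_step(model: torch.nn.Module, batch: torch.Tensor) -> torch.Tensor:
    # The device is resolved locally rather than threaded through as a
    # parameter, which is what "eliminating unnecessary parameter passing"
    # refers to.
    device = get_torch_device()
    return model.to(device)(batch.to(device))
```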
Please take a look. @justinvyu
Starting a release test sanity check here: https://buildkite.com/ray-project/release/builds/21289
Hi @woshiyyya , I think I've addressed all the concerns you raised in the previous review. However, the PR is blocked on your approval. Could you please take another look and let me know if everything looks good now?