[Train] Decouple device-related modules and add Huawei NPU support to Ray Train
Why are these changes needed?
We are looking to expand the hardware coverage of Ray Train by adding support for Huawei Ascend NPUs.
However, as the number of supported hardware types grows, device-specific modifications have been scattered throughout the code, which hurts future compatibility and maintainability.
To address this, we have extracted the device-related modules from Ray Train and consolidated them into `accelerator_utils`. This keeps the device-specific code independent of the core logic, which improves maintainability.
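To make the intended structure concrete, here is a minimal sketch of what such a per-accelerator abstraction could look like, assuming a PyTorch-based worker. `DeviceManager` is a name that comes up later in this thread, but the method names and the CUDA backend below are illustrative assumptions, not the exact API of this PR.

```python
# Hypothetical sketch of the decoupling: each accelerator backend (CUDA,
# Ascend NPU, HPU, ...) implements one interface, so core Ray Train code
# never branches on device type. Method names are illustrative assumptions.
from abc import ABC, abstractmethod
from typing import List

import torch


class DeviceManager(ABC):
    """Per-accelerator hooks used by a training worker."""

    @abstractmethod
    def is_available(self) -> bool:
        """Return True if this accelerator is usable in the current process."""

    @abstractmethod
    def get_devices(self) -> List[torch.device]:
        """Return the devices assigned to this worker."""

    @abstractmethod
    def set_device(self, device: torch.device) -> None:
        """Make `device` current for subsequent operations."""


class CUDADeviceManager(DeviceManager):
    """Example backend; an NPU or HPU manager would mirror this shape."""

    def is_available(self) -> bool:
        return torch.cuda.is_available()

    def get_devices(self) -> List[torch.device]:
        return [torch.device("cuda", i) for i in range(torch.cuda.device_count())]

    def set_device(self, device: torch.device) -> None:
        torch.cuda.set_device(device)
```

Under this layout, a new accelerator backend would implement the same interface in its own module under `accelerator_utils`, so adding hardware support never touches the core training loop.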
Related issue number
Checks
- [ ] I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - [ ] Unit tests
  - [ ] Release tests
  - [ ] This PR is not tested :(
This is our plan to enhance support for third-party devices in Ray Train, which will also help expand device compatibility in RLlib. Looking forward to your feedback @woshiyyya . We kindly invite developers working with HPU, NPU, and AMD GPUs to stay informed and engaged with these updates. @kira-lin @matthewdeng @vickytsang @nemo9cby @Bye-legumes
Thanks for the contribution! We'll review it soon :)
Hi @woshiyyya, may I ask if the modifications made above have met your expectations? If you have any other concerns, we would be happy to provide further information.
@liuxsh9 Sure! I'll take another look in the next few days!
@woshiyyya can you follow up? I'm marking it as P2 for now, @liuxsh9, as I'm not seeing any users blocked on this; do tell me if I'm wrong, though, and we can adjust the priority here.
Hi @anyscalesam @liuxsh9 , this PR involves major changes to Ray Train's device management, which need to be fully tested to ensure stability. We will discuss next steps with the Train team and give an update soon.
@woshiyyya This looks like a nice accelerator abstraction, and we plan to refactor the Intel GPU backend as well. A follow-up PR will build on this. Thanks for the design and implementation, @liuxsh9!
@liuxsh9 thanks for the update! I left some more comments.
Also, @harborn @kira-lin can you help review the HPU part of this PR? Thank you!
@woshiyyya Thanks for the review! We've addressed the feedback and made changes. Please take another look.
Thanks for the review! We have made the following major adjustments:
- Carefully controlled the scope of `DeviceManager`, eliminating unnecessary abstractions and parameter passing (see the sketch after this list).
- Reviewed and cleaned up unnecessary functions.
- Added unit tests.
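As a rough illustration of that scoping change, here is a minimal sketch, assuming the `DeviceManager` abstraction sketched earlier in this thread; `get_torch_device` and `train_step` are hypothetical helpers, not this PR's actual API.

```python
# A minimal sketch of the "controlled scope" idea: the device is resolved
# once inside the worker, so device handling never leaks into user-facing
# signatures. Helper names here are hypothetical, not the PR's API.
import torch


def get_torch_device() -> torch.device:
    # In the refactored layout, each backend's availability check would live
    # in its own module under accelerator_utils; the training loop only ever
    # sees the resulting torch.device.
    if torch.cuda.is_available():
        return torch.device("cuda", torch.cuda.current_device())
    return torch.device("cpu")


def train_step(model: torch.nn.Module, batch: torch.Tensor) -> torch.Tensor:
    # The device is resolved locally rather than threaded through as a
    # parameter, which is what "eliminating unnecessary parameter passing"
    # refers to.
    device = get_torch_device()
    return model.to(device)(batch.to(device))
```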
Please take a look. @justinvyu
Starting a release test sanity check here: https://buildkite.com/ray-project/release/builds/21289
Hi @woshiyyya , I think I've addressed all the concerns you raised in the previous review. However, the PR is blocked on your approval. Could you please take another look and let me know if everything looks good now?