[Feature package] Full feature support with Ascend NPU
## Background
Ascend is a full-stack AI computing infrastructure for industry applications and services based on Huawei Ascend processors and software. For more information about Ascend, see Ascend Community.
CANN (Compute Architecture for Neural Networks), developed by Huawei, is a heterogeneous computing architecture for AI.
PyTorch has officially announced support for Ascend NPU (through the PrivateUse1 dispatch key); please see the PrivateUse1 tutorial here.
## Previous work
NPU accelerator support has already been merged (see #3595, #3831), making it possible to use the NPU as a backend accelerator for basic training and inference tasks. However, achieving full support requires implementing more features.
## Sub tasks
Here is a list of features that need to be implemented or tested.
| status | title | assigned to |
|---|---|---|
| Done | Unit tests | @RUAN-ZX |
| Done | FP16 | @minchao-sun, @wuhhu |
| Done | BF16 | @minchao-sun, @wuhhu |
| Done | Gradient Accumulation | @minchao-sun, @wuhhu |
| Done | Data Parallelism | @minchao-sun, @wuhhu |
| Done | Pipeline Parallelism | @RUAN-ZX |
| Done | Zero1 | @misstek |
| Done | Zero2 | @misstek |
| Done | Zero3 | @misstek |
| Done | Activation Checkpointing | @CurryRice233 |
| Done | Fused Adam | @CurryRice233 |
| Done | Mixture of Experts (MoE) | @wangshuai09 |
| Processing | RLHF | @wangshuai09 @CurryRice233 |
| Done | ZeRO Offload | @hipudding |
| Processing | ZeRO Infinity | @misstek |
| Done | 1-bit Adam | @RUAN-ZX |
| Done | 1-bit LAMB | @RUAN-ZX |
| Done | 0/1 Adam | @minchao-sun |
| Processing | Curriculum Learning | @minchao-sun |
| Processing | Layer Dropping | @minchao-sun |
I see that the NPU FusedAdam is implemented with torch_npu.npu_apply_adam_w. When implementing new features in the future, does the NPU team intend to provide support through torch_npu, or might kernels also be implemented in DeepSpeed itself?
> I see that the NPU FusedAdam is implemented with torch_npu.npu_apply_adam_w. When implementing new features in the future, does the NPU team intend to provide support through torch_npu, or might kernels also be implemented in DeepSpeed itself?
The NPU supports both modes. Personally, I prefer the first one, where users can directly invoke the torch_npu interface without worrying about the underlying implementation.
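To illustrate the "prefer the vendor interface, fall back otherwise" idea, here is a minimal sketch. Only `torch_npu.npu_apply_adam_w` is taken from the discussion above; the capability probe and the reference AdamW step are illustrative assumptions, not DeepSpeed's actual implementation, and the fallback operates on plain Python lists just to keep the math visible.

```python
# Sketch (assumption): detect whether the vendor fused kernel is available,
# and keep a plain-Python AdamW step as the fallback path. In a real
# FusedAdam optimizer both paths would operate on device tensors.
import math

try:
    import torch_npu  # vendor package; present only on Ascend machines
    HAS_NPU_FUSED = hasattr(torch_npu, "npu_apply_adam_w")
except ImportError:
    HAS_NPU_FUSED = False


def adamw_step(params, grads, exp_avg, exp_avg_sq, step,
               lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, weight_decay=0.01):
    """Reference (unfused) AdamW update on Python lists -- fallback path."""
    bias_c1 = 1 - beta1 ** step          # first-moment bias correction
    bias_c2 = 1 - beta2 ** step          # second-moment bias correction
    for i, g in enumerate(grads):
        exp_avg[i] = beta1 * exp_avg[i] + (1 - beta1) * g
        exp_avg_sq[i] = beta2 * exp_avg_sq[i] + (1 - beta2) * g * g
        denom = math.sqrt(exp_avg_sq[i] / bias_c2) + eps
        params[i] -= lr * weight_decay * params[i]      # decoupled weight decay
        params[i] -= lr * (exp_avg[i] / bias_c1) / denom
    return params
```

A FusedAdam built on the first mode would simply dispatch to `torch_npu.npu_apply_adam_w` when `HAS_NPU_FUSED` is true and to a reference step like `adamw_step` otherwise, so the optimizer code never needs to know how the vendor kernel is implemented.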