[Feature package] Full feature support with Ascend NPU
## Background
Ascend is a full-stack AI computing infrastructure for industry applications and services based on Huawei Ascend processors and software. For more information about Ascend, see Ascend Community.
CANN (Compute Architecture for Neural Networks), developed by Huawei, is a heterogeneous computing architecture for AI.
PyTorch has officially announced support for Ascend NPU (through the PrivateUse1 dispatch key); please see the PrivateUse1 tutorial here.
## Previous work
NPU accelerator support has already been merged (see #3595, #3831), making it possible to use the NPU as a backend accelerator for basic training and inference tasks. However, achieving full support requires implementing more features.
## Sub tasks
Here is a list of features that need to be implemented or tested.
| status | title | assigned to |
|---|---|---|
| Done | Unit tests | @RUAN-ZX |
| Done | FP16 | @minchao-sun, @wuhhu |
| Done | BF16 | @minchao-sun, @wuhhu |
| Done | Gradient Accumulation | @minchao-sun, @wuhhu |
| Done | Data Parallelism | @minchao-sun, @wuhhu |
| Done | Pipeline Parallelism | @RUAN-ZX |
| Done | Zero1 | @misstek |
| Done | Zero2 | @misstek |
| Done | Zero3 | @misstek |
| Done | Activation Checkpointing | @CurryRice233 |
| Done | Fused Adam | @CurryRice233 |
| Done | Mixture of Experts (MoE) | @wangshuai09 |
| Processing | RLHF | @wangshuai09 @CurryRice233 |
| Done | ZeRO Offload | @hipudding |
| Processing | ZeRO Infinity | @misstek |
| Done | 1-bit Adam | @RUAN-ZX |
| Done | 1-bit LAMB | @RUAN-ZX |
| Done | 0/1 Adam | @minchao-sun |
| Processing | Curriculum Learning | @minchao-sun |
| Processing | Layer Dropping | @minchao-sun |
I see that the NPU FusedAdam is implemented with torch_npu.npu_apply_adam_w. When implementing new features in the future, does the NPU team intend to provide support through torch_npu, or might kernels also be implemented in DeepSpeed itself?
> I see that the NPU FusedAdam is implemented with torch_npu.npu_apply_adam_w. When implementing new features in the future, does the NPU team intend to provide support through torch_npu, or might kernels also be implemented in DeepSpeed itself?
The NPU supports both modes. Personally, I prefer the first one, where users can directly invoke the torch_npu interface without worrying about the underlying implementation.
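To illustrate the "prefer the vendor interface, fall back otherwise" idea, here is a minimal sketch. Only `torch_npu.npu_apply_adam_w` is taken from the discussion above; the capability probe and the reference AdamW step are illustrative assumptions, not DeepSpeed's actual implementation, and the fallback operates on plain Python lists just to keep the math visible.

```python
# Sketch (assumption): detect whether the vendor fused kernel is available,
# and keep a plain-Python AdamW step as the fallback path. In a real
# FusedAdam optimizer both paths would operate on device tensors.
import math

try:
    import torch_npu  # vendor package; present only on Ascend machines
    HAS_NPU_FUSED = hasattr(torch_npu, "npu_apply_adam_w")
except ImportError:
    HAS_NPU_FUSED = False


def adamw_step(params, grads, exp_avg, exp_avg_sq, step,
               lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, weight_decay=0.01):
    """Reference (unfused) AdamW update on Python lists -- fallback path."""
    bias_c1 = 1 - beta1 ** step          # first-moment bias correction
    bias_c2 = 1 - beta2 ** step          # second-moment bias correction
    for i, g in enumerate(grads):
        exp_avg[i] = beta1 * exp_avg[i] + (1 - beta1) * g
        exp_avg_sq[i] = beta2 * exp_avg_sq[i] + (1 - beta2) * g * g
        denom = math.sqrt(exp_avg_sq[i] / bias_c2) + eps
        params[i] -= lr * weight_decay * params[i]      # decoupled weight decay
        params[i] -= lr * (exp_avg[i] / bias_c1) / denom
    return params
```

A FusedAdam built on the first mode would simply dispatch to `torch_npu.npu_apply_adam_w` when `HAS_NPU_FUSED` is true and to a reference step like `adamw_step` otherwise, so the optimizer code never needs to know how the vendor kernel is implemented.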