
[Hardware] Add support for Huawei Ascend NPU

Chendong98 opened this issue 9 months ago · 10 comments

  1. Single Controller:

    • Change placement group resources from GPU to NPU
    • Integrate Huawei’s HCCL collective communication library
  2. Megatron:

    • Adapt Megatron to the Huawei Ascend NPU using MindSpeed, and upgrade Megatron to version 0.6.0 to comply with MindSpeed’s requirements.
    • Adapt Megatron-Core 0.6.0’s ParamAndGradBuffer for synchronizing the weights between Megatron-LM and vLLM.
    • Replace operators in ParallelLlamaModel, including RMSNorm, flash attention, RoPE, and pad/unpad.
  3. vLLM:

    • Use this PR for vLLM Ascend support.
    • Add the SPMD version of vLLM 0.6.4.post1
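The placement-group change in step 1 can be sketched in a few lines. This is a hypothetical illustration only: it assumes Ray exposes Ascend devices under an `"NPU"` custom resource key, and the helper name is made up, not verl’s actual code.

```python
# Hypothetical sketch of step 1: requesting "NPU" instead of "GPU" in the
# Ray placement-group bundles. The resource key and helper name are
# assumptions, not verl's actual code.
def make_bundles(world_size: int, device_key: str = "NPU") -> list:
    """Build one bundle per worker, asking for `device_key` rather than GPU."""
    return [{"CPU": 1, device_key: 1} for _ in range(world_size)]

bundles = make_bundles(4)
# With Ray available, these bundles would then be consumed as something like:
#   pg = ray.util.placement_group(bundles, strategy="PACK")
print(bundles[0])  # {'CPU': 1, 'NPU': 1}
```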

Chendong98 avatar Feb 04 '25 17:02 Chendong98

Just wondering, does the FSDP backend work with NPU?

vermouth1992 avatar Feb 05 '25 02:02 vermouth1992

Just wondering, does the FSDP backend work with NPU?

The FSDP backend can work with NPU, but there are two issues to be addressed:

  1. torch.logsumexp does not support bf16 on NPU;
  2. FlashAttention-2 should be disabled in Transformers.
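The usual remedies for these two issues might look roughly like the sketch below. This is illustrative only, not verl’s actual patch: upcasting to fp32 before `logsumexp` and loading the model with Transformers’ `attn_implementation="eager"` are the common workarounds, and the pure-Python `logsumexp` merely stands in for the higher-precision computation.

```python
import math

# Issue 1: torch.logsumexp lacks bf16 support on NPU, so the usual remedy
# is to upcast first, e.g. logits.float().logsumexp(dim=-1).
# Issue 2: FlashAttention-2 can be disabled when loading the model, e.g.
#   AutoModelForCausalLM.from_pretrained(..., attn_implementation="eager")
# Both lines above are illustrative assumptions, not verl's actual patch.

def logsumexp(xs):
    """Numerically stable log-sum-exp, as computed in full precision."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

print(round(logsumexp([1.0, 2.0, 3.0]), 4))  # 3.4076
```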

Chendong98 avatar Feb 06 '25 07:02 Chendong98

got ModuleNotFoundError: No module named 'flash_attn' error

huangk10 avatar Feb 10 '25 04:02 huangk10

The code throws this error after commenting out the flash_attn import:

    work = group.broadcast([tensor], opts)
    RuntimeError: create:build/CMakeFiles/torch_npu.dir/compiler_depend.ts:91 HCCL function error: HcclCommInitRootInfo(numRanks, &rootInfo, rank, &(comm->hcclComm_)), error code is 2
    [ERROR] 2025-02-10-19:56:18 (PID:1057704, Device:0, RankID:1) ERR02200 DIST call hccl api failed.
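For anyone hitting a similar HCCL init failure: one generic thing to rule out before `init_process_group(backend="hccl")` is an incomplete distributed rendezvous configuration. The sketch below is an illustrative pre-flight check only; the names are assumptions, not verl’s code, and error code 2 may well have other causes.

```python
import os

# Illustrative pre-flight check before torch.distributed.init_process_group
# with the "hccl" backend (via torch_npu). Names here are assumptions,
# not verl's actual code.
REQUIRED = ("MASTER_ADDR", "MASTER_PORT", "RANK", "WORLD_SIZE")

def missing_dist_env(env=None):
    """Return the rendezvous variables that are not set."""
    env = os.environ if env is None else env
    return [k for k in REQUIRED if k not in env]

# With everything set, initialization would then proceed as:
#   torch.distributed.init_process_group(backend="hccl")
print(missing_dist_env({"MASTER_ADDR": "127.0.0.1", "MASTER_PORT": "29500"}))
```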

huangk10 avatar Feb 10 '25 11:02 huangk10

The code throws this error after commenting out the flash_attn import:

    work = group.broadcast([tensor], opts)
    RuntimeError: create:build/CMakeFiles/torch_npu.dir/compiler_depend.ts:91 HCCL function error: HcclCommInitRootInfo(numRanks, &rootInfo, rank, &(comm->hcclComm_)), error code is 2
    [ERROR] 2025-02-10-19:56:18 (PID:1057704, Device:0, RankID:1) ERR02200 DIST call hccl api failed.

Thank you for your feedback! We've addressed the issue you mentioned in the latest commit.

Chendong98 avatar Feb 10 '25 15:02 Chendong98

The code throws this error after commenting out the flash_attn import:

    work = group.broadcast([tensor], opts)
    RuntimeError: create:build/CMakeFiles/torch_npu.dir/compiler_depend.ts:91 HCCL function error: HcclCommInitRootInfo(numRanks, &rootInfo, rank, &(comm->hcclComm_)), error code is 2
    [ERROR] 2025-02-10-19:56:18 (PID:1057704, Device:0, RankID:1) ERR02200 DIST call hccl api failed.

Thank you for your feedback! We've addressed the issue you mentioned in the latest commit.

It works.

huangk10 avatar Feb 11 '25 00:02 huangk10

Dude, do you work at Huawei? I’d like to get in touch with you~

Viper403 avatar Feb 11 '25 06:02 Viper403

Dude, do you work at Huawei? I’d like to get in touch with you~

Yes, I work at Huawei, and my email is [email protected]

Chendong98 avatar Feb 11 '25 16:02 Chendong98

CLA assistant check
All committers have signed the CLA.

CLAassistant avatar Feb 26 '25 00:02 CLAassistant

This helps a lot. When will this feature be merged?

dawnranger avatar Feb 27 '25 02:02 dawnranger