verl
[Hardware] Add support for Huawei Ascend NPU
Single Controller:
- Change placement group resources from GPU to NPU.
- Modify the controller to integrate Huawei's HCCL communication library.
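The GPU-to-NPU placement-group change boils down to requesting a different Ray resource key per bundle. A minimal sketch of the idea, not verl's actual code; the `"NPU"` resource-key name and the helper itself are assumptions:

```python
def make_bundle(num_devices: int, use_npu: bool = False) -> dict:
    """Build one Ray placement-group bundle (hypothetical helper).

    On Ascend machines the accelerator is requested under an "NPU"
    resource key instead of "GPU" (key name assumed, not taken from verl).
    """
    device_key = "NPU" if use_npu else "GPU"
    return {"CPU": num_devices, device_key: num_devices}

# Usage with Ray would look roughly like:
#   ray.util.placement_group([make_bundle(8, use_npu=True)], strategy="PACK")
```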
Megatron:
- Adapt Megatron to Huawei Ascend NPU using MindSpeed, and upgrade Megatron to version 0.6.0 to comply with MindSpeed's requirements.
- Adapt Megatron-Core 0.6.0's `ParamAndGradBuffer` when synchronizing the weights between Megatron-LM and vLLM.
- Replace operators in `ParallelLlamaModel`, including `RMSNorm`, flash attention, `RoPE`, and `pad`/`unpad`.
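Operator replacement of this kind usually keeps a reference implementation and swaps in a fused NPU kernel where one is available. A sketch of the reference `RMSNorm` side only; the fused counterpart (e.g. `torch_npu.npu_rms_norm`) is named here as an assumption to verify against torch_npu's documentation:

```python
import torch

class RMSNormFallback(torch.nn.Module):
    """Plain-PyTorch RMSNorm; on Ascend a fused kernel would replace
    the forward body (fused-op name assumed, not taken from the PR)."""

    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = torch.nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Normalize by the root-mean-square over the last dimension.
        var = x.pow(2).mean(-1, keepdim=True)
        return x * torch.rsqrt(var + self.eps) * self.weight
```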
vLLM:
- Use this PR for vLLM Ascend support.
- Add the SPMD version of vLLM 0.6.4.post1.
Just wonder does FSDP backend work with NPU?
The FSDP backend can work with NPU, but there are two issues to be addressed:
- `torch.logsumexp` does not support `bf16` on NPU.
- FlashAttention-2 should be disabled in Transformers.
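Both issues have straightforward workarounds. A sketch, assuming the fix is an fp32 upcast around the unsupported op (the helper name is hypothetical, not from verl):

```python
import torch

def logsumexp_bf16_safe(x: torch.Tensor, dim: int = -1) -> torch.Tensor:
    """Work around missing bf16 support for torch.logsumexp on NPU:
    upcast to fp32 for the reduction, then cast the result back."""
    if x.dtype == torch.bfloat16:
        return torch.logsumexp(x.float(), dim=dim).to(torch.bfloat16)
    return torch.logsumexp(x, dim=dim)

# Disabling FlashAttention-2 in Transformers is a load-time flag:
#   AutoModelForCausalLM.from_pretrained(name, attn_implementation="eager")
```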
Got a `ModuleNotFoundError: No module named 'flash_attn'` error.
The code throws this error after commenting out the flash_attn import:

```
work = group.broadcast([tensor], opts)
RuntimeError: create:build/CMakeFiles/torch_npu.dir/compiler_depend.ts:91 HCCL function error: HcclCommInitRootInfo(numRanks, &rootInfo, rank, &(comm->hcclComm_)), error code is 2
[ERROR] 2025-02-10-19:56:18 (PID:1057704, Device:0, RankID:1) ERR02200 DIST call hccl api failed.
```
Thank you for your feedback! We've addressed the issue you mentioned in the latest commit.
It works.
Dude, do you work at Huawei? Hope to get in touch with you~
Yes, I work at Huawei, and my email is [email protected]
This helps a lot. When will this feature be merged?