[Hardware] Support AMD (ROCm kernel)
This codebase supports AMD GPUs [Bug-free]
- [Done] Fix the (AMD) torch and Ray issues
- [Done] Add the conditions to enable (AMD) torch in the codebase (see the detection sketch after this list)
- [Done] AMD ROCm Dockerfile
- [Done] Throughput metric - tokens/sec/GPU
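For context, ROCm builds of PyTorch report a HIP version through `torch.version.hip` while still exposing the GPU through the usual `torch.cuda` namespace, so a single condition can cover both vendors. Below is a minimal sketch of how such a check can look; the helper name `is_rocm_build` is illustrative, not verl's actual code:

```python
import torch

def is_rocm_build() -> bool:
    """True when this PyTorch was compiled against AMD ROCm (HIP), not CUDA."""
    # ROCm builds set torch.version.hip to a version string; CUDA builds leave it None.
    return getattr(torch.version, "hip", None) is not None

# On ROCm, HIP is surfaced through the torch.cuda namespace, so the same
# device queries work on both AMD and NVIDIA hardware.
if torch.cuda.is_available():
    backend = "ROCm/HIP" if is_rocm_build() else "CUDA"
    print(f"{torch.cuda.device_count()} GPU(s) visible via the {backend} backend")
```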
[Pass test cases] run_qwen2-7b_seq_balance.sh, run_qwen2-7b_rm_seq_balance.sh
- [Done] Convergence test
- [Done] Throughput test (see the tokens/sec/GPU sketch below)
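The throughput numbers above are reported as tokens/sec/GPU, i.e. total tokens processed divided by wall-clock time and GPU count. A minimal sketch of the computation, with illustrative variable names rather than verl's actual ones:

```python
import time
import torch

def tokens_per_sec_per_gpu(total_tokens: int, elapsed_sec: float, num_gpus: int) -> float:
    """Throughput normalized by wall-clock time and the number of GPUs."""
    return total_tokens / (elapsed_sec * num_gpus)

start = time.perf_counter()
# ... run one training or generation step here ...
total_tokens = 8 * 4096  # e.g. batch_size * sequence_length (illustrative)
elapsed = time.perf_counter() - start
num_gpus = max(torch.cuda.device_count(), 1)
print(f"{tokens_per_sec_per_gpu(total_tokens, elapsed, num_gpus):.1f} tokens/sec/GPU")
```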
Tutorial:
- [Done] AMD-doc
Review modifications:
- [Done] Unified .gitignore
- [Done] Generalize the AMD-specific parts (i.e., replace `.cuda()` with `.to(torch.cuda.current_device())`; see the sketch after this list)
- [Done] Ensure throughput and accuracy remain the same on NVIDIA H100 GPUs with this version of the codebase.
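The generalization above replaces the CUDA-only shorthand with a call that resolves the current accelerator at runtime; because ROCm's HIP backend is exposed through the `torch.cuda` namespace, the same line works on both AMD and NVIDIA GPUs. A minimal sketch:

```python
import torch

x = torch.randn(4, 4)

if torch.cuda.is_available():
    # Before (CUDA-only shorthand):
    # x = x.cuda()

    # After (device-agnostic; works on ROCm because HIP is exposed via torch.cuda):
    x = x.to(torch.cuda.current_device())
```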
Special thanks for the collaboration and help from: (SGLang) @zhaochenyang20, (VeRL) @PeterSH6, (AMD) @yushengsu-thu, @vickytsang, @xiaodoyu, (AnyScale) @hongpeng-guo, @kevin85421
Nice work!
@yushengsu-thu there are merge conflicts that need to be resolved
@PeterSH6 I've resolved the conflicts, applied the recommended changes, and updated the branch to the latest upstream VeRL. Please let me know if there are still issues.
The PR LGTM. The CI machine for AMD is not ready yet. I think we can merge it first after making sure that these modifications will not break other hardware? cc: @vermouth1992
LGTM.