[Hardware] Support AMD (Rocm kernel)

Open yushengsu-thu opened this issue 9 months ago • 6 comments

This code base supports AMD GPUs

[Bugs Free]

  • [Done] Fix (AMD) torch and Ray issue
  • [Done] Add the conditions to enable (AMD) torch in the codebase
  • [Done] AMD rocm Dockerfile
  • [Done] Throughput metric - Tokens/Sec/GPU
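For reference, a minimal sketch of how a Tokens/Sec/GPU figure can be computed. The function name, its parameters, and the formula (tokens processed in one step, divided by step wall-clock time and by GPU count) are assumptions for illustration and are not taken from this PR's implementation.

```python
def tokens_per_sec_per_gpu(total_tokens: int, step_seconds: float, num_gpus: int) -> float:
    """Hypothetical helper: per-GPU token throughput for one training step.

    total_tokens : tokens processed across all GPUs during the step
    step_seconds : wall-clock duration of the step
    num_gpus     : number of GPUs that took part in the step
    """
    if step_seconds <= 0:
        raise ValueError("step_seconds must be positive")
    if num_gpus <= 0:
        raise ValueError("num_gpus must be positive")
    return total_tokens / (step_seconds * num_gpus)


# Example: 1,048,576 tokens processed in 32 s on 8 GPUs.
throughput = tokens_per_sec_per_gpu(1_048_576, 32.0, 8)
print(f"{throughput:.1f} tokens/sec/GPU")  # 4096.0 tokens/sec/GPU
```

Normalizing by GPU count keeps the number comparable between runs on different cluster sizes, which is useful when checking AMD results against NV H100 results.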

[Pass test cases] run_qwen2-7b_seq_balance.sh, run_qwen2-7b_rm_seq_balance.sh

  • [Done] Convergence test done
  • [Done] Throughput test done

Tutorial:

  • [Done] AMD-doc

Review Modification:

  • [Done] Unified .gitignore
  • [Done] Generalize the AMD-specific parts so they apply to all hardware [i.e. .cuda() --> .to(torch.cuda.current_device())]
  • [Done] Ensure throughput and accuracy remain the same on NV H100 GPUs with this version of the codebase.
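A rough sketch of the explicit-device placement pattern mentioned above. `resolve_device` and `move_batch` are hypothetical helpers written for illustration and are not functions from the VeRL codebase. The `torch` import is guarded only so that the snippet also runs on a machine without PyTorch installed.

```python
import importlib.util


def resolve_device() -> str:
    """Hypothetical helper: device string for the current accelerator, else "cpu"."""
    # Guarded import so the sketch also runs where PyTorch is not installed.
    if importlib.util.find_spec("torch") is None:
        return "cpu"
    import torch

    # ROCm builds of PyTorch expose AMD GPUs through the same torch.cuda API,
    # so this branch covers both NVIDIA and AMD devices.
    if torch.cuda.is_available():
        return f"cuda:{torch.cuda.current_device()}"
    return "cpu"


def move_batch(batch: dict, device: str) -> dict:
    """Hypothetical helper: move values that have a .to() method, keep the rest."""
    return {
        key: (value.to(device) if hasattr(value, "to") else value)
        for key, value in batch.items()
    }


device = resolve_device()
batch = move_batch({"step": 1, "names": ["a", "b"]}, device)
print(device, batch)
```

Resolving the device once and passing it to `.to(...)` keeps the placement explicit in one spot, in place of `.cuda()` calls scattered through the code.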

Special thanks for the collaboration and help from: (SGLang) @zhaochenyang20 (VeRL) @PeterSH6 (AMD) @yushengsu-thu @vickytsang @xiaodoyu (AnyScale) @hongpeng-guo @kevin85421

yushengsu-thu avatar Feb 24 '25 06:02 yushengsu-thu

Nice work!

zhaochenyang20 avatar Feb 24 '25 23:02 zhaochenyang20

CLA assistant check
All committers have signed the CLA.

CLAassistant avatar Feb 26 '25 00:02 CLAassistant

@yushengsu-thu the merge conflicts need to be resolved

zhaochenyang20 avatar Feb 26 '25 02:02 zhaochenyang20

@PeterSH6 I've resolved the conflicts, applied the recommended changes, and updated the branch to the latest upstream VeRL. Please let me know if there are still any issues.

This code base supports AMD GPUs

[Bugs Free]

  • [Done] Fix (AMD) torch and Ray issue
  • [Done] Add the conditions to enable (AMD) torch in the codebase
  • [Done] AMD rocm Dockerfile
  • [Done] Throughput metric - Tokens/Sec/GPU

[Pass test cases] run_qwen2-7b_seq_balance.sh, run_qwen2-7b_rm_seq_balance.sh

  • [Done] Convergence test done
  • [Done] Throughput test done

Tutorial:

  • [Done] AMD-doc

yushengsu-thu avatar Mar 02 '25 08:03 yushengsu-thu

The PR LGTM. The CI machine for AMD is not ready yet. I think we can merge it first, as long as we make sure that these modifications will not break other hardware. cc: @vermouth1992

PeterSH6 avatar Mar 05 '25 08:03 PeterSH6

LGTM.

vermouth1992 avatar Mar 05 '25 08:03 vermouth1992