[Hardware] Support AMD (ROCm kernel)
This codebase supports AMD GPUs [Bug-free]
- [Done] Fix the (AMD) torch and Ray issues
- [Done] Add the conditions to enable (AMD) torch in the codebase (see the detection sketch after this list)
- [Done] AMD ROCm Dockerfile
- [Done] Throughput metric - tokens/sec/GPU
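For context, ROCm builds of PyTorch report a HIP version through `torch.version.hip` while still exposing the GPU through the usual `torch.cuda` namespace, so a single condition can cover both vendors. Below is a minimal sketch of how such a check can look; the helper name `is_rocm_build` is illustrative, not verl's actual code:

```python
import torch

def is_rocm_build() -> bool:
    """True when this PyTorch was compiled against AMD ROCm (HIP), not CUDA."""
    # ROCm builds set torch.version.hip to a version string; CUDA builds leave it None.
    return getattr(torch.version, "hip", None) is not None

# On ROCm, HIP is surfaced through the torch.cuda namespace, so the same
# device queries work on both AMD and NVIDIA hardware.
if torch.cuda.is_available():
    backend = "ROCm/HIP" if is_rocm_build() else "CUDA"
    print(f"{torch.cuda.device_count()} GPU(s) visible via the {backend} backend")
```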
[Pass test cases] run_qwen2-7b_seq_balance.sh, run_qwen2-7b_rm_seq_balance.sh
- [Done] Convergence test
- [Done] Throughput test (see the tokens/sec/GPU sketch below)
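The throughput numbers above are reported as tokens/sec/GPU, i.e. total tokens processed divided by wall-clock time and GPU count. A minimal sketch of the computation, with illustrative variable names rather than verl's actual ones:

```python
import time
import torch

def tokens_per_sec_per_gpu(total_tokens: int, elapsed_sec: float, num_gpus: int) -> float:
    """Throughput normalized by wall-clock time and the number of GPUs."""
    return total_tokens / (elapsed_sec * num_gpus)

start = time.perf_counter()
# ... run one training or generation step here ...
total_tokens = 8 * 4096  # e.g. batch_size * sequence_length (illustrative)
elapsed = time.perf_counter() - start
num_gpus = max(torch.cuda.device_count(), 1)
print(f"{tokens_per_sec_per_gpu(total_tokens, elapsed, num_gpus):.1f} tokens/sec/GPU")
```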
Tutorial:
- [Done] AMD-doc
Review modifications:
- [Done] Unified .gitignore
- [Done] Generalize the AMD-specific parts (i.e., replace `.cuda()` with `.to(torch.cuda.current_device())`; see the sketch after this list)
- [Done] Ensure throughput and accuracy remain the same on NVIDIA H100 GPUs with this version of the codebase.
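The generalization above replaces the CUDA-only shorthand with a call that resolves the current accelerator at runtime; because ROCm's HIP backend is exposed through the `torch.cuda` namespace, the same line works on both AMD and NVIDIA GPUs. A minimal sketch:

```python
import torch

x = torch.randn(4, 4)

if torch.cuda.is_available():
    # Before (CUDA-only shorthand):
    # x = x.cuda()

    # After (device-agnostic; works on ROCm because HIP is exposed via torch.cuda):
    x = x.to(torch.cuda.current_device())
```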
Special thanks for the collaboration and help from: (SGLang) @zhaochenyang20, (VeRL) @PeterSH6, (AMD) @yushengsu-thu, @vickytsang, @xiaodoyu, (AnyScale) @hongpeng-guo, @kevin85421
Nice work!
@yushengsu-thu there are merge conflicts that need to be resolved
@PeterSH6 I've resolved the conflicts, applied the recommended changes, and updated the branch to the latest upstream VeRL. Please let me know if there are still issues.
The PR LGTM. The CI machine for AMD is not ready yet. I think we can merge it first after making sure that these modifications will not break other hardware? cc: @vermouth1992
LGTM.