[runtime env]: Integrating ROCm Systems Profiler to Ray worker process
Why are these changes needed?
Reference https://github.com/ray-project/ray/pull/39998
ROCm Systems Profiler is an AMD profiling tool for AMD based GPUs and CPUs.
This PR adds ROCm Systems Profiler integration with Ray using runtime_env. Similar to nsight, Currently rocprof-sys-python can't profile the GPU usage from Ray tasks/actors since the processes that can be traced by rocprof-sys-python must be driver processes and it's subprocesses, whereas Ray tasks/actors are run by worker process. Thus, we added rocprof-sys-python native to runtime_env in order to modify the worker process to run with rocprof-sys-python which can produce the report for each worker processes once it exits.
Unlike nsight, since the options of ROCm Systems Profiler can be controlled by both environment variables and CLI arguments, and they are not overlapping, we introduced "env" and "args" to the flags to make this possible.
Refactoring the code so that we share the same code between nsight and ROCm Systems Profiler is also possible, but since this will introduce breaking changes to the flags design, this will need approval from the maintainers to proceed.
Related issue number
Checks
- [X] I've signed off every commit(by using the -s flag, i.e.,
git commit -s) in this PR. - [X] I've run
scripts/format.shto lint the changes in this PR. - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I added a
method in Tune, I've added it in
doc/source/tune/api/under the corresponding.rstfile.
- [ ] I've added any new APIs to the API Reference. For example, if I added a
method in Tune, I've added it in
- [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
- [ ] Unit tests
- [ ] Release tests
- [ ] This PR is not tested :(
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.
- If you'd like to keep this open, just leave any comment, and the stale label will be removed.
This is a good addition to get ray working on AMD GPUs, so I would hope to get this reviewed and merged soon.
@vickytsang could you help review this one?
@vickytsang could you help review this one?
Yes I will take a look
LGTM. However, in rocm version>=6.3 rocm/omnitrace has been rebranded/replaced by rocm/rocprofiler-systems (see documentation). I strongly recommend updating this to work with rocprofiler-systems for better support.
Thanks for reviewing @vickytsang ! Unfortunately, my environment still sticks to ROCm 6.2.4, and I would need help updating this to work with rocprofiler-systems.
Thanks for reviewing @vickytsang ! Unfortunately, my environment still sticks to ROCm 6.2.4, and I would need help updating this to work with rocprofiler-systems.
rocprofiler-systems works with rocm6.2. It would be great if you can make the update in this PR so the wider ROCm community can make use of this change. I can assist with any issues you might have.
Oh okay, sure, I didn't know that, then I'll check it out when I have time!
Briefly checked the document, and looks like a verbatim replace can work here. I will test this more comprehensively later, but feel free to let me know if you find any issues meanwhile.
LGTM. @HollowMan6 thank you for contributing to the ROCm ecosystem
This pull request has been automatically marked as stale because it has not had any activity for 14 days. It will be closed in another 14 days if no further activity occurs. Thank you for your contributions.
You can always ask for help on our discussion forum or Ray's public slack channel.
If you'd like to keep this open, just leave any comment, and the stale label will be removed.
To keep this open. cc: @jjyao
@jjyao can we merge this one?
This pull request has been automatically marked as stale because it has not had any activity for 14 days. It will be closed in another 14 days if no further activity occurs. Thank you for your contributions.
You can always ask for help on our discussion forum or Ray's public slack channel.
If you'd like to keep this open, just leave any comment, and the stale label will be removed.
Keep this Open
@HollowMan6 I've rebased latest master and kicked off CI tests
The linter was failing due to a minor issue. I pushed a commit to fix it. In the future, you can use pre-commit:
pip install -U pre-commit
pre-commit install # from inside Ray directory
pre-commit run --all-files # it'll also run when you push changes by default
@edoakes Looks like the CI has passed, but the auto-merge doesn't work here, maybe it needs a manual merge.
Thanks @HollowMan6 !