ray icon indicating copy to clipboard operation
ray copied to clipboard

[runtime env]: Integrating ROCm Systems Profiler to Ray worker process

Open HollowMan6 opened this issue 1 year ago • 12 comments

Why are these changes needed?

Reference https://github.com/ray-project/ray/pull/39998

ROCm Systems Profiler is an AMD profiling tool for AMD based GPUs and CPUs.

This PR adds ROCm Systems Profiler integration with Ray using runtime_env. Similar to nsight, Currently rocprof-sys-python can't profile the GPU usage from Ray tasks/actors since the processes that can be traced by rocprof-sys-python must be driver processes and it's subprocesses, whereas Ray tasks/actors are run by worker process. Thus, we added rocprof-sys-python native to runtime_env in order to modify the worker process to run with rocprof-sys-python which can produce the report for each worker processes once it exits.

Unlike nsight, since the options of ROCm Systems Profiler can be controlled by both environment variables and CLI arguments, and they are not overlapping, we introduced "env" and "args" to the flags to make this possible.

Refactoring the code so that we share the same code between nsight and ROCm Systems Profiler is also possible, but since this will introduce breaking changes to the flags design, this will need approval from the maintainers to proceed.

Related issue number

Checks

  • [X] I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • [X] I've run scripts/format.sh to lint the changes in this PR.
  • [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
    • [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.
  • [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • [ ] Unit tests
    • [ ] Release tests
    • [ ] This PR is not tested :(

HollowMan6 avatar Nov 03 '24 16:11 HollowMan6

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

  • If you'd like to keep this open, just leave any comment, and the stale label will be removed.

stale[bot] avatar Feb 01 '25 00:02 stale[bot]

This is a good addition to get ray working on AMD GPUs, so I would hope to get this reviewed and merged soon.

HollowMan6 avatar Feb 02 '25 08:02 HollowMan6

@vickytsang could you help review this one?

jjyao avatar May 13 '25 20:05 jjyao

@vickytsang could you help review this one?

Yes I will take a look

vickytsang avatar May 13 '25 22:05 vickytsang

LGTM. However, in rocm version>=6.3 rocm/omnitrace has been rebranded/replaced by rocm/rocprofiler-systems (see documentation). I strongly recommend updating this to work with rocprofiler-systems for better support.

vickytsang avatar May 14 '25 15:05 vickytsang

Thanks for reviewing @vickytsang ! Unfortunately, my environment still sticks to ROCm 6.2.4, and I would need help updating this to work with rocprofiler-systems.

HollowMan6 avatar May 14 '25 16:05 HollowMan6

Thanks for reviewing @vickytsang ! Unfortunately, my environment still sticks to ROCm 6.2.4, and I would need help updating this to work with rocprofiler-systems.

rocprofiler-systems works with rocm6.2. It would be great if you can make the update in this PR so the wider ROCm community can make use of this change. I can assist with any issues you might have.

vickytsang avatar May 14 '25 16:05 vickytsang

Oh okay, sure, I didn't know that, then I'll check it out when I have time!

HollowMan6 avatar May 14 '25 17:05 HollowMan6

Briefly checked the document, and looks like a verbatim replace can work here. I will test this more comprehensively later, but feel free to let me know if you find any issues meanwhile.

HollowMan6 avatar May 14 '25 22:05 HollowMan6

LGTM. @HollowMan6 thank you for contributing to the ROCm ecosystem

vickytsang avatar May 15 '25 15:05 vickytsang

This pull request has been automatically marked as stale because it has not had any activity for 14 days. It will be closed in another 14 days if no further activity occurs. Thank you for your contributions.

You can always ask for help on our discussion forum or Ray's public slack channel.

If you'd like to keep this open, just leave any comment, and the stale label will be removed.

github-actions[bot] avatar Jun 17 '25 00:06 github-actions[bot]

To keep this open. cc: @jjyao

HollowMan6 avatar Jun 17 '25 04:06 HollowMan6

@jjyao can we merge this one?

vickytsang avatar Jun 19 '25 20:06 vickytsang

This pull request has been automatically marked as stale because it has not had any activity for 14 days. It will be closed in another 14 days if no further activity occurs. Thank you for your contributions.

You can always ask for help on our discussion forum or Ray's public slack channel.

If you'd like to keep this open, just leave any comment, and the stale label will be removed.

github-actions[bot] avatar Jul 04 '25 00:07 github-actions[bot]

Keep this Open

HollowMan6 avatar Jul 06 '25 08:07 HollowMan6

@HollowMan6 I've rebased latest master and kicked off CI tests

edoakes avatar Jul 08 '25 17:07 edoakes

The linter was failing due to a minor issue. I pushed a commit to fix it. In the future, you can use pre-commit:

pip install -U pre-commit
pre-commit install # from inside Ray directory
pre-commit run --all-files # it'll also run when you push changes by default

edoakes avatar Jul 09 '25 13:07 edoakes

@edoakes Looks like the CI has passed, but the auto-merge doesn't work here, maybe it needs a manual merge.

HollowMan6 avatar Jul 10 '25 07:07 HollowMan6

Thanks @HollowMan6 !

edoakes avatar Jul 10 '25 15:07 edoakes