Fix [Not supported] status for get_compute_process_info_by_pid

Open vstempen opened this issue 1 year ago • 2 comments

On some systems, rocm-smi --showpids reports "get_compute_process_info_by_pid, Not supported on the given system", and the process row reads: [PID] [PROCESS NAME] 1 UNKNOWN UNKNOWN UNKNOWN

get_compute_process_info_by_pid fails because, by design, the cu_occupancy debugfs method is not provided on some graphics cards and GFX revisions.

Proposing a change that returns a success status when only the cu_occupancy debugfs method is not found, and that provides a cu_occupancy invalidation value so that only this parameter is marked as UNKNOWN.
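
To illustrate the proposed behavior, here is a minimal sketch in Python. It is not the rocm_smi_lib code: `Status`, `ProcessInfo`, `CU_OCCUPANCY_INVALID`, `get_process_info`, `format_field`, and the three reader callbacks are hypothetical names invented for this example. It assumes that only a missing cu_occupancy source is tolerated, while a missing VRAM or SDMA source still yields "not supported".

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional, Tuple


class Status(Enum):
    SUCCESS = 0
    NOT_SUPPORTED = 1


# Hypothetical sentinel meaning "cu_occupancy could not be read".
CU_OCCUPANCY_INVALID = 0xFFFFFFFF


@dataclass
class ProcessInfo:
    pid: int
    vram_usage: int
    sdma_usage: int
    cu_occupancy: int


# A reader returns an int, or None when its data source does not exist.
Reader = Callable[[int], Optional[int]]


def get_process_info(
    pid: int,
    read_vram: Reader,
    read_sdma: Reader,
    read_cu_occupancy: Reader,
) -> Tuple[Status, Optional[ProcessInfo]]:
    vram = read_vram(pid)
    sdma = read_sdma(pid)
    # Assumption: these two sources remain mandatory.
    if vram is None or sdma is None:
        return Status.NOT_SUPPORTED, None

    cu = read_cu_occupancy(pid)
    # A missing cu_occupancy source is not an error: store the
    # invalidation value and still report success.
    if cu is None:
        cu = CU_OCCUPANCY_INVALID
    return Status.SUCCESS, ProcessInfo(pid, vram, sdma, cu)


def format_field(value: int) -> str:
    # Only the invalidated field is printed as UNKNOWN.
    return "UNKNOWN" if value == CU_OCCUPANCY_INVALID else str(value)
```

With this shape, a caller that prints the process table would show real values for the fields that were read and UNKNOWN only for cu_occupancy, instead of failing the whole query.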

vstempen avatar Jan 24 '24 20:01 vstempen

Thanks for the change @vstempen !

Just FYI: all our changes go through internal Gerrit and then get published to GitHub. GitHub PRs are OK but might be less visible.

dmitrii-galantsev avatar Jan 24 '24 22:01 dmitrii-galantsev

Merged internally; it should make it to the develop branch within the next day. @bill-shuzhou-liu is asking: "is this only applied to cu, or also applied to sdma and vram?"

dmitrii-galantsev avatar Feb 13 '24 23:02 dmitrii-galantsev

@dmitrii-galantsev Is this fix available in the latest ROCm 6.1.1? Thanks!

ppanchad-amd avatar May 15 '24 18:05 ppanchad-amd

Merged in 677433b. @ppanchad-amd Not sure. Please get the rocm-smi version with rocm-smi --version and see if its commit is ahead of the one linked above.

dmitrii-galantsev avatar Jul 08 '24 16:07 dmitrii-galantsev

I still see this error on ROCm 6.2.

yx-lamini avatar Aug 24 '24 01:08 yx-lamini

@yx-lamini would you be able to provide more details regarding your system configuration so we can reproduce the issue? Thanks!

tcgu-amd avatar Aug 27 '24 14:08 tcgu-amd

> @yx-lamini would you be able to provide more details regarding your system configuration so we can reproduce the issue? Thanks!

Yes, of course. What do you need? I am running rocm-smi on an MI300 8-GPU server with the vanilla ROCm 6.2.0 runtime installed.

yx-lamini avatar Aug 28 '24 22:08 yx-lamini

@yx-lamini I saw your comment here https://github.com/ROCm/ROCm/issues/2595. Is the problem you are experiencing related to that issue? (If so, I will close this PR and track the problem on the other issue). Thanks!

tcgu-amd avatar Aug 29 '24 20:08 tcgu-amd

> @yx-lamini I saw your comment here ROCm/ROCm#2595. Is the problem you are experiencing related to that issue? (If so, I will close this PR and track the problem on the other issue). Thanks!

Yes, that works. Sorry for spamming between multiple places.

yx-lamini avatar Aug 29 '24 22:08 yx-lamini