rocm-smi fails during initialization if old AMD GPUs are present
I have been using the now-deprecated rocm-smi for a while to monitor the status of my GPUs. I have a FirePro S10000 (Tahiti), which works with amdgpu but either lacks some of the hardware interfaces expected from newer GPUs (for example voltages, clocks, power draw/cap and gpu_busy_percent) or exposes them at a different path. The deprecated rocm-smi showed a warning about being unable to read gpu_busy_percent, but otherwise worked.
The new rocm-smi version, sadly, fails to deal with this at all and errors out during initialization:
> /opt/rocm/bin/rocm-smi
rsmi_init() failed
Exception caught: rsmi_init.
ERROR:root:ROCm SMI returned 8 (the expected value is 0)
I have already narrowed this initialization problem down to an attempt to read /sys/class/hwmon/hwmon2/in0_label, which does not exist in the hwmon directories of the Tahiti GPUs. As a result, the program tries to look up "" in kVoltSensorNameMap, which throws an exception (Map::at).
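To illustrate the failure mode and one possible way around it, here is a minimal sketch. It is not the library's actual code: `VoltSensor`, `kSensorByLabel`, `ReadFirstLine` and `LookupVoltSensor` are hypothetical names, and the map contents are a stand-in for kVoltSensorNameMap, whose real definition I have not reproduced here. The idea is to use `find` instead of `at`, so that a missing or empty label is reported as "sensor not available" rather than thrown as an exception.

```cpp
#include <fstream>
#include <map>
#include <string>

// Hypothetical stand-in for the library's voltage sensor type.
enum class VoltSensor { kVddGfx };

// Hypothetical stand-in for kVoltSensorNameMap; the real map's contents
// and value type are assumptions made for this sketch.
const std::map<std::string, VoltSensor> kSensorByLabel = {
    {"vddgfx", VoltSensor::kVddGfx},
};

// Reads the first line of a file. Returns an empty string if the file
// does not exist or cannot be read (as with in0_label on Tahiti).
std::string ReadFirstLine(const std::string& path) {
  std::ifstream file(path);
  std::string line;
  if (file) std::getline(file, line);
  return line;
}

// Looks up a sensor by label without throwing. std::map::at would throw
// std::out_of_range for an unknown key such as "", while find lets the
// caller treat that case as "this device has no such sensor".
bool LookupVoltSensor(const std::string& label, VoltSensor* out) {
  auto it = kSensorByLabel.find(label);
  if (it == kSensorByLabel.end()) return false;
  *out = it->second;
  return true;
}
```

With something along these lines, initialization could skip the voltage sensor for a device whose in0_label is missing and carry on with the remaining devices.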
Even with that issue worked around, these GPUs don't provide a frequency table (as far as I know), which triggers an assertion failure:
» ./rocm-smi
======================= ROCm System Management Interface =======================
================================= Concise Info =================================
python3: [..]/src/rocm_smi_lib-rocm-4.1.0/src/rocm_smi.cc:895: rsmi_status_t get_frequencies(amd::smi::DevInfoTypes, uint32_t, rsmi_frequencies_t*, uint32_t*): Assertion `f->frequency[i-1] <= f->frequency[i]' failed.
[1] 69803 abort (core dumped) ./rocm-smi
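The assertion above checks that the frequency list is in non-decreasing order. As a rough sketch of an alternative (again hypothetical, not the library's code: `FreqStatus` and `CheckFrequencies` are names made up for illustration), the same check could return a status that the caller maps to "not supported" instead of aborting the whole process:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical result type for this sketch.
enum class FreqStatus { kOk, kEmpty, kUnsorted };

// Performs the same ordering check as the failing assertion
// (frequency[i-1] <= frequency[i]), but reports the outcome to the
// caller instead of aborting. An empty list is reported separately,
// since a GPU without a frequency table may yield no entries at all.
FreqStatus CheckFrequencies(const std::vector<std::uint64_t>& freqs) {
  if (freqs.empty()) return FreqStatus::kEmpty;
  for (std::size_t i = 1; i < freqs.size(); ++i) {
    if (freqs[i - 1] > freqs[i]) return FreqStatus::kUnsorted;
  }
  return FreqStatus::kOk;
}
```

A caller could then skip the clock columns for a device that returns anything other than `kOk` and still print the rest of the table.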
I don't expect rocm-smi to support these old GPUs, but it would be good if it still worked when old GPUs are present. Let me know if you need more information.
Relevant part of lspci:
0a:00.0 PCI bridge: PLX Technology, Inc. PEX 8747 48-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch (rev ba)
0b:08.0 PCI bridge: PLX Technology, Inc. PEX 8747 48-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch (rev ba)
0b:10.0 PCI bridge: PLX Technology, Inc. PEX 8747 48-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch (rev ba)
0c:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Tahiti PRO GL [FirePro Series]
0c:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Tahiti HDMI Audio [Radeon HD 7870 XT / 7950/7970]
0d:00.0 Display controller: Advanced Micro Devices, Inc. [AMD/ATI] Tahiti PRO GL [FirePro Series]
0e:00.0 PCI bridge: Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 XL Upstream Port of PCI Express Switch (rev c1)
0f:00.0 PCI bridge: Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 XL Downstream Port of PCI Express Switch
10:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Navi 21 [Radeon RX 6800/6800 XT / 6900 XT] (rev c1)
10:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Device ab28
Hardware monitor files of Tahiti:
» ls /sys/class/drm/card0/device/hwmon/hwmon2/
device fan1_input freq1_input freq2_input name pwm1 pwm1_max subsystem temp1_crit_hyst temp1_label
fan1_enable fan1_target freq1_label freq2_label power pwm1_enable pwm1_min temp1_crit temp1_input uevent
Hardware monitor files of Navi21:
» ls /sys/class/drm/card2/device/hwmon/hwmon4
device fan1_target in0_input power1_cap pwm1_max temp1_emergency temp2_emergency temp3_emergency
fan1_enable freq1_input in0_label power1_cap_max pwm1_min temp1_input temp2_input temp3_input
fan1_input freq1_label name power1_cap_min subsystem temp1_label temp2_label temp3_label
fan1_max freq2_input power pwm1 temp1_crit temp2_crit temp3_crit uevent
fan1_min freq2_label power1_average pwm1_enable temp1_crit_hyst temp2_crit_hyst temp3_crit_hyst