fix(ui): enable rocm-smi support by correcting flags and parsing
ai-toolkit runs on systems with AMD GPUs, but when it does, the dashboard displays an error about 'nvidia-smi'.
This patch removes the hard-coded dependency on 'nvidia-smi', allowing ai-toolkit to operate with either 'nvidia-smi' or 'rocm-smi'. It checks for 'nvidia-smi' first and then for 'rocm-smi', which may cause an issue if both are installed, but it solves a need today.
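A rough sketch of that detection order (the function name is illustrative, not the patch's actual code):

```ts
import { execFileSync } from "node:child_process";

// Illustrative sketch: return the first SMI CLI that runs, preferring
// nvidia-smi over rocm-smi as described above.
export function detectSmiTool(): "nvidia-smi" | "rocm-smi" | null {
  for (const tool of ["nvidia-smi", "rocm-smi"] as const) {
    try {
      // Both tools print help and exit 0 when installed and runnable.
      execFileSync(tool, ["--help"], { stdio: "ignore" });
      return tool;
    } catch {
      // Not installed or not on PATH; try the next candidate.
    }
  }
  return null;
}
```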
Thinking this might be a path issue (where is your rocm-smi installed?), I will update the logic to check for rocm-smi in this order (sketched below):
- Check the output of 'which rocm-smi'
- Check for $ROCM_PATH/bin/rocm-smi
- Check for /usr/bin/rocm-smi
- Check for /opt/rocm/bin/rocm-smi
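Roughly, that lookup could look like this (the paths come from the list above; the helper name is an assumption for illustration, not the exact code in the patch):

```ts
import { execFileSync } from "node:child_process";
import { existsSync } from "node:fs";
import * as path from "node:path";

// Illustrative sketch of the lookup order above.
export function findRocmSmi(): string | null {
  // 1. Whatever `which rocm-smi` resolves to on PATH.
  try {
    const found = execFileSync("which", ["rocm-smi"], { encoding: "utf8" }).trim();
    if (found) return found;
  } catch {
    // `which` found nothing; fall through to the fixed locations.
  }
  // 2-4. Fixed install locations, checked in order.
  const candidates = [
    process.env.ROCM_PATH ? path.join(process.env.ROCM_PATH, "bin", "rocm-smi") : null,
    "/usr/bin/rocm-smi",
    "/opt/rocm/bin/rocm-smi",
  ];
  for (const candidate of candidates) {
    if (candidate && existsSync(candidate)) return candidate;
  }
  return null;
}
```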
There was also a parsing issue when handling the JSON output of rocm-smi that created a phantom device.
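The patch doesn't spell out where the phantom device came from, but a common pitfall is that the top-level JSON object from rocm-smi can contain entries that aren't GPUs (for example a "system" section in some versions), so a naive "one key = one device" parse invents an extra device. A hedged sketch of a defensive parse:

```ts
// Illustrative only: keep top-level keys that look like "cardN" and ignore
// everything else, so non-device sections can't become phantom GPUs.
export function parseRocmSmiDevices(jsonText: string): Record<string, unknown>[] {
  const parsed = JSON.parse(jsonText) as Record<string, unknown>;
  return Object.keys(parsed)
    .filter((key) => /^card\d+$/.test(key)) // drop non-device entries
    .sort((a, b) => Number(a.slice(4)) - Number(b.slice(4))) // stable card order
    .map((key) => parsed[key] as Record<string, unknown>);
}
```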
Forgive me.. I had only installed ROCm to the extent needed to run ComfyUI with ZLUDA, and then I checked whether ROCm was installed properly. Now I'll set up a separate Python 3.12 venv for PyTorch, install ROCm there, and try it out..
Now tested with a venv that has ROCm 7.1.1 installed.
But.. it doesn't work well..
I'm sorry, I didn't notice you're testing on a Windows system. rocm-smi is only available on Linux or WSL. It might be possible to use hipinfo.exe on Windows to enumerate the devices, but I don't think it reports dynamic performance statistics for power/utilization/memory, so those stats would show "0".
I don't currently have a way of testing this, though. For Windows, using "Get-Counter" to read dynamic performance counters could be the way to go.
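To illustrate the idea (not something this PR implements; the counter name varies by driver and may not exist on every system), a Node-side call to Get-Counter might look like:

```ts
import { execFileSync } from "node:child_process";

// Hedged sketch: sum the per-engine 3D utilization samples reported by the
// Windows "GPU Engine" performance counters via PowerShell's Get-Counter.
export function windowsGpuUtilization(): number | null {
  try {
    const out = execFileSync(
      "powershell",
      [
        "-NoProfile",
        "-Command",
        "(Get-Counter '\\GPU Engine(*engtype_3D)\\Utilization Percentage')" +
          ".CounterSamples | Measure-Object -Property CookedValue -Sum | " +
          "Select-Object -ExpandProperty Sum",
      ],
      { encoding: "utf8" }
    );
    const value = parseFloat(out.trim());
    return Number.isFinite(value) ? value : null;
  } catch {
    return null; // PowerShell missing or the counter isn't available
  }
}
```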
This now uses amd-smi by default, with a fallback to rocm-smi. Where amd-smi doesn't fully support a GPU (e.g. the Strix iGPU), we use the sysfs hwmon metrics instead. This also allows us to show the "VRAM" and "GTT" (shared memory) used by an APU.
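For the VRAM/GTT part, the standard amdgpu sysfs counters are one way to read those numbers directly; the sketch below is an assumption about that fallback (the file names are the usual amdgpu sysfs entries next to the hwmon directory, not necessarily what the patch reads):

```ts
import { readFileSync } from "node:fs";

// Illustrative sketch: read the amdgpu memory counters for one card.
export function readAmdgpuMemory(cardIndex = 0) {
  const base = `/sys/class/drm/card${cardIndex}/device`;
  const readBytes = (file: string): number | null => {
    try {
      return Number(readFileSync(`${base}/${file}`, "utf8").trim());
    } catch {
      return null; // file absent (not an amdgpu device, or older kernel)
    }
  };
  return {
    vramUsed: readBytes("mem_info_vram_used"),   // dedicated VRAM, bytes
    vramTotal: readBytes("mem_info_vram_total"),
    gttUsed: readBytes("mem_info_gtt_used"),     // GTT (shared system memory), bytes
    gttTotal: readBytes("mem_info_gtt_total"),
  };
}
```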