
fix(ui): enable rocm-smi support by correcting flags and parsing

Open Soddentrough opened this issue 2 weeks ago • 11 comments

ai-toolkit runs on systems with AMD GPUs, but the dashboard displays an error about 'nvidia-smi' when it does.

This patch removes the hard-coded dependency on 'nvidia-smi', allowing ai-toolkit to operate with either 'nvidia-smi' or 'rocm-smi'. It checks for 'nvidia-smi' first and then for 'rocm-smi'. That ordering may cause an issue if both are installed, but it solves a need today.
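The probe order described above can be sketched as follows. This is a minimal Python illustration, not the patch's actual code; the function name `detect_smi_tool` and the injectable `which` parameter are hypothetical and exist only to make the sketch easy to test.

```python
import shutil
from typing import Callable, Optional

# Probe order described in the post: NVIDIA first, then AMD.
SMI_TOOLS = ("nvidia-smi", "rocm-smi")


def detect_smi_tool(
    which: Callable[[str], Optional[str]] = shutil.which,
) -> Optional[str]:
    """Return the name of the first SMI tool found on PATH, or None."""
    for tool in SMI_TOOLS:
        if which(tool) is not None:
            return tool
    return None
```

With this order, a machine that has both tools installed always reports `nvidia-smi`, which is the caveat the description mentions.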

Soddentrough avatar Dec 07 '25 09:12 Soddentrough

[screenshots: rocmcommit01, rocmcommit02]

Thank you for proceeding with the modification! But it's not running properly. I look forward to seeing good results in the future!

dkspwndj avatar Dec 07 '25 11:12 dkspwndj


Thinking this might be a path issue (where is your rocm-smi installed?). I will update the logic to check for rocm-smi in this order:

  1. which rocm-smi
  2. Check for $ROCM_PATH/bin/rocm-smi
  3. Check for /usr/bin/rocm-smi
  4. Check for /opt/rocm/bin/rocm-smi
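As a rough sketch of that lookup order (in Python, for illustration only; the helper name `find_rocm_smi` and its injectable parameters are hypothetical and not taken from the patch):

```python
import os
import shutil
from typing import Callable, Mapping, Optional


def _is_executable(path: str) -> bool:
    """True if path is a regular file the current user may execute."""
    return os.path.isfile(path) and os.access(path, os.X_OK)


def find_rocm_smi(
    env: Mapping[str, str] = os.environ,
    which: Callable[[str], Optional[str]] = shutil.which,
    is_executable: Callable[[str], bool] = _is_executable,
) -> Optional[str]:
    """Resolve rocm-smi using the order listed in the comment above."""
    # 1. Equivalent of `which rocm-smi`: search PATH.
    found = which("rocm-smi")
    if found:
        return found

    candidates = []
    # 2. $ROCM_PATH/bin/rocm-smi, only if ROCM_PATH is set.
    rocm_path = env.get("ROCM_PATH")
    if rocm_path:
        candidates.append(os.path.join(rocm_path, "bin", "rocm-smi"))
    # 3. and 4. Fixed fallback locations.
    candidates += ["/usr/bin/rocm-smi", "/opt/rocm/bin/rocm-smi"]

    for candidate in candidates:
        if is_executable(candidate):
            return candidate
    return None
```

Returning `None` instead of raising lets the caller decide whether to fall back to another tool or show a "no GPU tool found" message.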

Soddentrough avatar Dec 07 '25 22:12 Soddentrough


There was a parsing issue when handling the JSON output of rocm-smi, which created a phantom device.
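The thread does not say what produced the phantom device. One plausible cause, assumed here purely for illustration, is that the JSON object contains a top-level key that is not a GPU (for example a `system` entry next to the `cardN` entries) and the parser counted every key as a device. The sketch below keeps only `cardN` keys; the sample payload and its field values are made up and are not real rocm-smi output.

```python
import json
import re
from typing import Dict

# Assumption: real devices appear under keys such as "card0", "card1".
CARD_KEY = re.compile(r"^card\d+$")


def parse_devices(raw_json: str) -> Dict[str, dict]:
    """Return only the entries that look like GPU devices."""
    data = json.loads(raw_json)
    return {
        key: value
        for key, value in data.items()
        if CARD_KEY.match(key) and isinstance(value, dict)
    }


# Illustrative payload only: one GPU plus one non-device entry.
SAMPLE = '{"card0": {"GPU use (%)": "12"}, "system": {"Driver version": "x"}}'

devices = parse_devices(SAMPLE)
print(list(devices))
```

Filtering on the key pattern, instead of iterating over every top-level entry, means extra metadata in the output cannot show up as an additional GPU.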

Soddentrough avatar Dec 08 '25 06:12 Soddentrough

Forgive me. I had only installed ROCm to the extent needed for ComfyUI with ZLUDA, and only then figured out that I should check whether ROCm was installed properly. Now I'll make a separate PyTorch 3.12 folder, install ROCm there, and try it out.

dkspwndj avatar Dec 08 '25 14:12 dkspwndj

I have now tested in a venv with ROCm 7.1.1 installed, but it still does not work well.

[screenshot: aitoookiterr]

dkspwndj avatar Dec 08 '25 14:12 dkspwndj

I'm sorry, I didn't notice you're testing on a Windows system. rocm-smi is only available on Linux or WSL. It might be possible to use hipinfo.exe on Windows to enumerate the devices, but I don't think it has dynamic performance statistics for power/utilization/memory, so stats would show "0".

I don't currently have a way of testing this, though. For Windows, I think using "Get-Counter" for dynamic performance counters could be the way to go.

Soddentrough avatar Dec 08 '25 20:12 Soddentrough


This now uses amd-smi by default, with a fallback to rocm-smi. Where amd-smi doesn't fully support a GPU (e.g., a Strix iGPU), we use the sysfs hwmon metrics. This also allows us to show "VRAM" and "GTT" (shared memory) used by an APU.
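For the "VRAM" and "GTT" part, a sysfs fallback might look something like the Python sketch below. It is not the patch's code. It assumes the amdgpu driver's `mem_info_vram_*` and `mem_info_gtt_*` attributes (values in bytes) under the card's device directory, and the default path `/sys/class/drm/card0/device` is an assumption since the card index varies per system. The helper names are hypothetical. Power and temperature would come from the separate hwmon directory, which is not shown here.

```python
import os
from typing import Dict, Optional

# Assumed amdgpu sysfs attribute names; each file holds a byte count.
MEM_FILES = {
    "vram_used": "mem_info_vram_used",
    "vram_total": "mem_info_vram_total",
    "gtt_used": "mem_info_gtt_used",
    "gtt_total": "mem_info_gtt_total",
}


def read_int(path: str) -> Optional[int]:
    """Read a single integer from a sysfs-style file, or None on failure."""
    try:
        with open(path) as handle:
            return int(handle.read().strip())
    except (OSError, ValueError):
        return None


def read_apu_memory(
    device_dir: str = "/sys/class/drm/card0/device",
) -> Dict[str, Optional[int]]:
    """Collect VRAM and GTT counters; missing files yield None."""
    return {
        label: read_int(os.path.join(device_dir, filename))
        for label, filename in MEM_FILES.items()
    }
```

Reporting `None` for a missing file, instead of `0`, lets the dashboard distinguish "metric unavailable" from "nothing in use".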

Soddentrough avatar Dec 08 '25 22:12 Soddentrough