Support for AMD ROCm devices

Open Junyi-99 opened this issue 11 months ago • 5 comments

Issue Type

  • Feature implementation

Description

I've implemented ROCm support in nvitop, enabling it to run on AMD GPUs. The feature has been tested on MI50, MI100, and MI210 machines, and nvitop keeps full functionality on NVIDIA GPUs.

Motivation and Context

Really need nvitop on AMD GPUs.

#74

Testing

Tested on:

  • MI50
  • MI100
  • MI210

Images / Videos

MI100 (top: nvitop, bottom-left: rocm-smi, bottom-right: PyTorch code)

Junyi-99 avatar Mar 11 '24 13:03 Junyi-99

Hi @Junyi-99, thanks for the contribution! Is there any PyPI package that provides the ROCm-SMI bindings like nvidia-ml-py for the NVIDIA NVML library? Maybe we should ship the ROCm support with:

pip3 install nvitop[rocm]
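
One way a package could gate an optional backend behind an extra such as `nvitop[rocm]` is to probe for the binding module at runtime. The sketch below is only an illustration, not nvitop's implementation: `available_backends` and `BACKEND_MODULES` are hypothetical names, and it assumes the NVIDIA binding is importable as `pynvml` (shipped by nvidia-ml-py) and an AMD binding as `amdsmi`.

```python
import importlib.util

# Assumed module names: "pynvml" ships with nvidia-ml-py; "amdsmi" is the
# AMD SMI Python package. The mapping itself is hypothetical.
BACKEND_MODULES = {"nvidia": "pynvml", "amd": "amdsmi"}


def available_backends(modules=BACKEND_MODULES):
    """Map each backend name to True if its binding module can be found."""
    return {
        name: importlib.util.find_spec(module) is not None
        for name, module in modules.items()
    }


if __name__ == "__main__":
    for name, found in available_backends().items():
        print(f"{name}: {'available' if found else 'not installed'}")
```

Probing with `find_spec` avoids importing the binding (and loading its shared library) just to learn whether the extra was installed.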

XuehaiPan avatar Mar 15 '24 15:03 XuehaiPan

Oh, I think it's a very good suggestion to ship through nvitop[rocm]. Currently, there is a ROCm binding, but it is not that functional.

Junyi-99 avatar Mar 15 '24 16:03 Junyi-99

+1. I'd love to have this support. How is the development going?

hartmark avatar Aug 08 '24 16:08 hartmark

+1 It would be great to have this for MI300X

kswain55 avatar Aug 08 '24 19:08 kswain55

I'm trying this now with HF AutoTrain on an AMD Radeon 7900XT (Navi31, gfx1100), installed with pip install git+https://github.com/XuehaiPan/nvitop.git

I still receive these errors:

Your installed package `nvidia-ml-py` is corrupted. Skip patch functions `nvmlDeviceGet{Compute,Graphics,MPSCompute}RunningProcesses`. You may get incorrect or incomplete results. Please consider reinstall package `nvidia-ml-py` via `pip3 install --force-reinstall nvidia-ml-py nvitop`.
Your installed package `nvidia-ml-py` is corrupted. Skip patch functions `nvmlDeviceGetMemoryInfo`. You may get incorrect or incomplete results. Please consider reinstall package `nvidia-ml-py` via `pip3 install --force-reinstall nvidia-ml-py nvitop`.

unclemusclez avatar Aug 22 '24 03:08 unclemusclez

@Junyi-99 Would it be possible to use the rocmsmi repo as a submodule instead? Are there any modifications beyond formatting?

Also please note that we're working on migration to AMDSMI and it would be much better long-term to use that :). ROCMSMI will eventually be deprecated.

In fact RDC migrated to amdsmi somewhat recently.

Cheers! -- Dev from SMI team at AMD.

dmitrii-galantsev avatar Sep 03 '24 21:09 dmitrii-galantsev

> Is there any PyPI package that provides the ROCm-SMI bindings like nvidia-ml-py for the NVIDIA NVML library?

@XuehaiPan This is planned for amdsmi :)

dmitrii-galantsev avatar Sep 03 '24 21:09 dmitrii-galantsev

Some more info:

  • You can build and install amdsmi python package fairly easily.
# if on ubuntu get dependencies:
# sudo apt install git python3 python3-pip cmake clang build-essential pkg-config libdrm-dev
git clone https://github.com/ROCm/amdsmi &&
cd amdsmi &&
cmake -B build &&
make -C build -j $(nproc) &&
cd build/py-interface/python_package &&
python3 -m pip install .

Now you should be able to use the api: https://github.com/ROCm/amdsmi/tree/amd-staging/py-interface#usage

  • amd-smi process returns some useful info. Here is me running rocm-validation-suite in the background on dual NV21s:
$ amd-smi process
GPU: 0
    PROCESS_INFO:
        NAME: rvs
        PID: 468813
        MEMORY_USAGE:
            GTT_MEM: 2.1 MB
            CPU_MEM: 253.1 MB
            VRAM_MEM: 1.1 GB
        MEM_USAGE: 1.4 GB
        USAGE:
            GFX: 0 ns
            ENC: 0 ns

GPU: 1
    PROCESS_INFO:
        NAME: rvs
        PID: 468813
        MEMORY_USAGE:
            GTT_MEM: 2.1 MB
            CPU_MEM: 253.1 MB
            VRAM_MEM: 1.1 GB
        MEM_USAGE: 1.4 GB
        USAGE:
            GFX: 0 ns
            ENC: 0 ns
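
A tool that wanted to consume this text output could parse it into per-GPU process records. The following is a minimal sketch that assumes the layout is exactly as in the sample above (a `GPU:` line, then one `PROCESS_INFO:` block per process with `KEY: value` leaves); `parse_amd_smi_process` is a hypothetical helper, not part of amdsmi or nvitop, and the real output format may differ between versions.

```python
SAMPLE = """\
GPU: 0
    PROCESS_INFO:
        NAME: rvs
        PID: 468813
        MEMORY_USAGE:
            GTT_MEM: 2.1 MB
            CPU_MEM: 253.1 MB
            VRAM_MEM: 1.1 GB
        MEM_USAGE: 1.4 GB
        USAGE:
            GFX: 0 ns
            ENC: 0 ns
"""


def parse_amd_smi_process(text):
    """Parse `amd-smi process` style text into a list of per-GPU dicts."""
    gpus = []
    process = None
    for raw_line in text.splitlines():
        key, sep, value = raw_line.strip().partition(":")
        if not sep:
            continue  # blank lines, shell prompt lines, etc.
        key, value = key.strip(), value.strip()
        if key == "GPU":
            gpus.append({"gpu": int(value), "processes": []})
            process = None
        elif key == "PROCESS_INFO" and gpus:
            process = {}
            gpus[-1]["processes"].append(process)
        elif value and process is not None:
            # Leaf entries are stored flat; section headers have no value.
            process[key] = value
    return gpus


if __name__ == "__main__":
    for gpu in parse_amd_smi_process(SAMPLE):
        for proc in gpu["processes"]:
            print(gpu["gpu"], proc["NAME"], proc["PID"], proc["VRAM_MEM"])
```

Values are kept as strings (including units) because the sketch makes no assumption about how the units should be normalized.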

dmitrii-galantsev avatar Sep 03 '24 22:09 dmitrii-galantsev

Does this work on WSL2?

unclemusclez avatar Sep 03 '24 22:09 unclemusclez

@unclemusclez AFAIK, no. SMI needs access to the amdgpu driver. As a rule of thumb, if /sys/class/drm/card*/device/gpu_metrics exists, SMI will work.
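
The rule of thumb can be checked with a short script. This is a minimal sketch: `find_gpu_metrics` is a hypothetical helper name, and a match only indicates that the sysfs file is present, not that every SMI feature will work.

```python
import glob


def find_gpu_metrics(pattern="/sys/class/drm/card*/device/gpu_metrics"):
    """Return the sorted gpu_metrics paths that match the sysfs pattern."""
    return sorted(glob.glob(pattern))


if __name__ == "__main__":
    paths = find_gpu_metrics()
    if paths:
        print("gpu_metrics found; SMI should work:")
        for path in paths:
            print(f"  {path}")
    else:
        print("No gpu_metrics file found; SMI will likely not work here.")
```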

dmitrii-galantsev avatar Sep 04 '24 14:09 dmitrii-galantsev

@dmitrii-galantsev I'll try it this weekend.

Junyi-99 avatar Sep 05 '24 04:09 Junyi-99