[Issue]: Why is E-SMI required if amdsmi is "a successor to rocm_smi_lib and esmi_ib_library"?

Open garrettbyrd opened this issue 8 months ago • 1 comments

Problem Description

If amdsmi is being marketed as "a successor to rocm_smi_lib and esmi_ib_library", why is E-SMI still required to get CPU information via amdsmi? Is this a wording issue in the README, or is it planned that amdsmi will become a standalone library?

Here is an example.

#include <iostream>
#include <unistd.h>
#include <amd_smi/amdsmi.h>

int main() {
    amdsmi_status_t status = amdsmi_init(AMDSMI_INIT_AMD_GPUS);
    if (status != AMDSMI_STATUS_SUCCESS) {
       std::cerr << "Failed to initialize AMD SMI library" << std::endl;
       return -1;
    }

    uint32_t socket_count = 0;
    status = amdsmi_get_socket_handles(&socket_count, nullptr);
    std::cout << "Socket Total: " << socket_count << std::endl;

    status = amdsmi_shut_down();

    return 0;
}

Using the line amdsmi_status_t status = amdsmi_init(AMDSMI_INIT_AMD_GPUS); works as expected. Output:

Socket Total: 2

However, the equivalent call for CPUs fails at initialization:

#include <iostream>
#include <unistd.h>
#include <amd_smi/amdsmi.h>

int main() {
    amdsmi_status_t status = amdsmi_init(AMDSMI_INIT_AMD_CPUS);
    if (status != AMDSMI_STATUS_SUCCESS) {
       std::cerr << "Failed to initialize AMD SMI library" << std::endl;
       return -1;
    }

    uint32_t socket_count = 0;
    status = amdsmi_get_socket_handles(&socket_count, nullptr);
    std::cout << "Socket Total: " << socket_count << std::endl;

    status = amdsmi_shut_down();

    return 0;
}

Output:

        ESMI Not initialized, drivers not found 
Failed to initialize AMD SMI library

I get a similar error when trying to use AMDSMI_INIT_AMD_APUS on MI300A APUs.
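Until the E-SMI dependency is clarified, a caller can at least degrade gracefully instead of aborting when CPU initialization fails. The following is a minimal sketch of that fallback logic, not something taken from amdsmi: first_usable_init_flag is a hypothetical helper, and its init parameter stands in for amdsmi_init so the logic can be exercised on machines without the library. It assumes a return value of 0 means success, which matches the if (!ret) check used in the documentation example.

```cpp
#include <cstdint>
#include <functional>
#include <optional>
#include <vector>

// Hypothetical helper (not part of amdsmi): try each init flag in order and
// return the first one for which `init` reports success.
// `init` is a stand-in for amdsmi_init; a status of 0 is assumed to mean
// success, as AMDSMI_STATUS_SUCCESS is treated in the snippets above.
std::optional<std::uint64_t> first_usable_init_flag(
    const std::function<int(std::uint64_t)>& init,
    const std::vector<std::uint64_t>& flags_in_preference_order) {
    for (std::uint64_t flags : flags_in_preference_order) {
        if (init(flags) == 0) {
            return flags;  // this flag initialized successfully
        }
    }
    return std::nullopt;  // none of the requested flags could be initialized
}
```

With the real library, the call would presumably wrap amdsmi_init in a lambda that casts its status to int and pass AMDSMI_INIT_AMD_CPUS followed by AMDSMI_INIT_AMD_GPUS as the preference order. On the system described here, that would be expected to skip the CPU flag and settle on the GPU flag.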

Relatedly, the second example on this page fails to compile, and the documentation gives no indication that esmi is required for it.

Example:

#include <iostream>
#include <vector>
#include "amd_smi/amdsmi.h"

int main(int argc, char **argv) {
    amdsmi_status_t ret;
    uint32_t socket_count = 0;

    // Initialize amdsmi for AMD CPUs
    ret = amdsmi_init(AMDSMI_INIT_AMD_CPUS);

    ret = amdsmi_get_socket_handles(&socket_count, nullptr);

    // Allocate the memory for the sockets
    std::vector<amdsmi_socket_handle> sockets(socket_count);

    // Get the sockets of the system
    ret = amdsmi_get_socket_handles(&socket_count, &sockets[0]);

    std::cout << "Total Socket: " << socket_count << std::endl;

    // For each socket, get cpus
    for (uint32_t i = 0; i < socket_count; i++) {
        uint32_t cpu_count = 0;

        // Set processor type as AMDSMI_PROCESSOR_TYPE_AMD_CPU
        processor_type_t processor_type = AMDSMI_PROCESSOR_TYPE_AMD_CPU;
        ret = amdsmi_get_processor_handles_by_type(sockets[i], processor_type, nullptr, &cpu_count);

        // Allocate the memory for the cpus
        std::vector<amdsmi_processor_handle> plist(cpu_count);

        // Get the cpus for each socket
        ret = amdsmi_get_processor_handles_by_type(sockets[i], processor_type, &plist[0], &cpu_count);

        for (uint32_t index = 0; index < plist.size(); index++) {
            uint32_t socket_power;
            std::cout<<"CPU "<<index<<"\t"<< std::endl;
            std::cout<<"Power (Watts): ";

            ret = amdsmi_get_cpu_socket_power(plist[index], &socket_power);
            if(ret != AMDSMI_STATUS_SUCCESS)
                std::cout<<"Failed to get cpu socket power"<<"["<<index<<"] , Err["<<ret<<"] "<< std::endl;

            if (!ret) {
                std::cout<<static_cast<double>(socket_power)/1000<<std::endl;
            }
            std::cout<<std::endl;
        }
    }

    // Clean up resources allocated at amdsmi_init
    ret = amdsmi_shut_down();

    return 0;
}

Output:

$ hipcc example.cpp -o example -I/opt/rocm-6.3.1/include -L/opt/rocm-6.3.1/lib -lamd_smi
example.cpp:28:15: error: use of undeclared identifier 'amdsmi_get_processor_handles_by_type'
   28 |         ret = amdsmi_get_processor_handles_by_type(sockets[i], processor_type, nullptr, &cpu_count);
      |               ^
example.cpp:34:15: error: use of undeclared identifier 'amdsmi_get_processor_handles_by_type'
   34 |         ret = amdsmi_get_processor_handles_by_type(sockets[i], processor_type, &plist[0], &cpu_count);
      |               ^
example.cpp:41:19: error: use of undeclared identifier 'amdsmi_get_cpu_socket_power'
   41 |             ret = amdsmi_get_cpu_socket_power(plist[index], &socket_power);
      |                   ^
3 errors generated when compiling for gfx90a.
failed to execute:/opt/rocm-6.3.1/lib/llvm/bin/clang++  --offload-arch=gfx90a --offload-arch=gfx90a -O3 --driver-mode=g++ -O3 --hip-link  -x hip example.cpp -o "example" -I/opt/rocm-6.3.1/include -L/opt/rocm-6.3.1/lib -lamd_smi

Again, these are all related to the esmi requirement.

Relevant line from amd_smi.cc

Operating System

Rocky Linux 9.5

CPU

2 x AMD EPYC 7313 (64) @ 3.73 GHz

GPU

AMD Instinct MI210

ROCm Version

ROCm 6.3.1

ROCm Component

amdsmi

Steps to Reproduce

No response

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

No response

garrettbyrd, Apr 21 '25 16:04

It might be a wording issue. AFAIK, esmi is required for CPU/APU support. Not sure about future plans.

I asked @marifamd to have a look at this issue.

dmitrii-galantsev, Apr 21 '25 22:04