Incorrect Free Memory Reporting for Intel Arc(TM) A770 Graphics

Open avimanyu786 opened this issue 1 year ago • 3 comments

Description

There is an inconsistency in the reported GPU free memory between the Intel Compute Runtime and tools such as xpu-smi. When using the Intel Compute Runtime on Intel Arc(TM) A770 Graphics, the reported free memory value is incorrect, consistently showing the same value as the total memory, even when memory is being consumed. This issue was observed in both Python (dpctl) and a standalone C++ executable.

Steps to Reproduce

  1. Set up an environment with the Intel Compute Runtime and xpu-smi installed.
  2. Save the following C++ code as, say, mem.cpp:
#include <iostream>
#include <vector>
#include <string>
#include <sycl/sycl.hpp>

int main(void) {
    sycl::queue q{sycl::default_selector_v};

    const sycl::device &dev = q.get_device();
    const std::string &dev_name = dev.get_info<sycl::info::device::name>();
    const std::string &driver_ver = dev.get_info<sycl::info::device::driver_version>();

    std::cout << "Device: " << dev_name << " ["  << driver_ver << "]" << std::endl;

    auto global_mem_size = dev.get_info<sycl::info::device::global_mem_size>();

    std::cout << "Global device memory size: " << global_mem_size << " bytes" << std::endl;

    if (dev.has(sycl::aspect::ext_intel_free_memory)) {
        auto free_memory = dev.get_info<sycl::ext::intel::info::device::free_memory>();
        std::cout << "Free memory: " << free_memory << " bytes" << std::endl;
        std::cout << "Implied memory in use: " << global_mem_size - free_memory << " bytes" << std::endl;
    } else {
        std::cout << "Free memory descriptor is not available" << std::endl;
    }

    return 0;
}
  3. Compile the code to obtain the binary:
icpx -fsycl mem.cpp -o mem.x
  4. Execute the compiled binary with the environment variable ZES_ENABLE_SYSMAN set to 1:
export ZES_ENABLE_SYSMAN=1
./mem.x
  5. Compare the output with the results from xpu-smi:
xpu-smi stats -d 0

Observed Behavior

The C++ code consistently reports the same value for global_mem_size and free_memory, implying 0 bytes of used memory, even when memory is being consumed by the GPU. In contrast, xpu-smi correctly reports non-zero GPU memory usage.
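
The symptom described above can be expressed as a small check on the two values the reproducer prints. The following is a minimal sketch using only the C++ standard library; the function names `saturating_used_bytes` and `shows_reported_symptom` are hypothetical helpers introduced for illustration and are not part of SYCL or the Intel Compute Runtime. The first helper also guards the `global_mem_size - free_memory` subtraction, which would wrap around in unsigned arithmetic if a runtime ever reported more free memory than total memory.

```cpp
#include <cstdint>

// Hypothetical helper (not part of SYCL or the compute runtime):
// returns total - free, clamped to 0 when free exceeds total so the
// unsigned subtraction cannot wrap around.
std::uint64_t saturating_used_bytes(std::uint64_t total_bytes,
                                    std::uint64_t free_bytes) {
    return free_bytes > total_bytes ? 0 : total_bytes - free_bytes;
}

// Hypothetical helper: true when a reading matches the symptom in this
// report, i.e. free memory equals total memory (implied usage of zero).
bool shows_reported_symptom(std::uint64_t total_bytes,
                            std::uint64_t free_bytes) {
    return free_bytes == total_bytes;
}
```

In the reproducer, these could be called with `global_mem_size` and `free_memory` in place of the raw subtraction.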

Expected Behavior

The free_memory value reported by the Intel Compute Runtime should reflect the actual free memory, showing a decrease when GPU memory is used, consistent with the output from xpu-smi.
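
One way to make "consistent with the output from xpu-smi" concrete is to compare the usage implied by the runtime against the tool's figure within a tolerance. The sketch below assumes the tool's used-memory value is expressed in MiB (an assumption; check the units in your xpu-smi output). The names `bytes_to_mib` and `agrees_with_tool` and the tolerance parameter are hypothetical, chosen for illustration only.

```cpp
#include <cstdint>

// Hypothetical helper: convert a byte count to whole MiB, rounding down.
std::uint64_t bytes_to_mib(std::uint64_t bytes) {
    return bytes / (1024ull * 1024ull);
}

// Hypothetical check: does the usage implied by the runtime
// (total - free) agree with a tool-reported usage in MiB, within
// tolerance_mib? The MiB unit for the tool value is an assumption.
bool agrees_with_tool(std::uint64_t total_bytes, std::uint64_t free_bytes,
                      std::uint64_t tool_used_mib,
                      std::uint64_t tolerance_mib) {
    // Clamp to 0 so a free value above total cannot wrap around.
    const std::uint64_t used_bytes =
        free_bytes > total_bytes ? 0 : total_bytes - free_bytes;
    const std::uint64_t used_mib = bytes_to_mib(used_bytes);
    const std::uint64_t diff = used_mib > tool_used_mib
                                   ? used_mib - tool_used_mib
                                   : tool_used_mib - used_mib;
    return diff <= tolerance_mib;
}
```

A small tolerance is useful because the two readings are taken at different moments and other processes may allocate or release GPU memory in between.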

Environment Details

  • OS: HiveOS (Based on Ubuntu 20.04 and 22.04)
  • GPU: Intel(R) Arc(TM) A770 Graphics
  • GPU driver versions tested:
    • 1.3.27642
    • 1.3.29735
  • Intel Compute Runtime: Relevant versions for the above drivers
  • Compiler: Intel DPC++/C++ Compiler (icpx)

Additional Information

This issue is tracked in the dpctl repository here. The problem appears to stem from the GPU driver or the Intel Compute Runtime itself, as confirmed by running a standalone C++ executable.

Please let me know if further information or testing is required. Thank you for investigating this issue.

avimanyu786, Jul 31 '24 18:07

For added context, xpu-smi fetches the value of XPUM_STATS_MEMORY_USED to report the used GPU memory. I found this by searching for "GPU Memory Used" in the https://github.com/intel/xpumanager repository.

avimanyu786, Aug 01 '24 05:08

The XPUM xpumd daemon (which provides the data for the xpu-smi CLI tool) uses the compute-runtime L0 Sysman API, not SYCL: https://spec.oneapi.io/level-zero/latest/sysman/api.html#zesmemorygetstate

eero-t, Aug 19 '24 18:08

The link has moved: https://oneapi-src.github.io/level-zero-spec/level-zero/latest/sysman/api.html#zesmemorygetstate

avimanyu786, Oct 30 '24 07:10

Hi @avimanyu786

We’d like to know if this issue is still affecting you. If so, please provide an update or any additional information. Otherwise, we’ll close this issue after 30 days of inactivity. Your feedback is appreciated!

kgibala, Oct 15 '25 09:10

Thanks for reaching out. I will check and update soon.

avimanyu786, Oct 15 '25 09:10

I tried to reproduce the issue on a new machine, and now the ext_intel_free_memory reporting appears to be working correctly. I ran both alloc_touch and alloc_chunks tests inside a clean intel/oneapi-basekit:latest container with the Arc A770, and the free memory values before and after allocation are being reported accurately. The results show consistent Free0, Free1, and implied usage values, matching the expected memory consumption.
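
The alloc_touch and alloc_chunks tests are not shown in this thread, so the following is only a sketch of how before-and-after readings such as Free0 and Free1 could be compared against the size of an allocation. The function name `drop_matches_allocation` and the slack parameter are hypothetical; the slack is there because a driver may round allocations up to a page or chunk granularity, so the observed drop need not equal the requested size exactly.

```cpp
#include <cstdint>

// Hypothetical check: did free memory drop by roughly alloc_bytes between
// two readings? free_before and free_after play the role of the Free0 and
// Free1 values mentioned above. slack_bytes absorbs allocator rounding.
bool drop_matches_allocation(std::uint64_t free_before,
                             std::uint64_t free_after,
                             std::uint64_t alloc_bytes,
                             std::uint64_t slack_bytes) {
    // Free memory rising after an allocation is never a match.
    if (free_after > free_before) {
        return false;
    }
    const std::uint64_t drop = free_before - free_after;
    const std::uint64_t diff =
        drop > alloc_bytes ? drop - alloc_bytes : alloc_bytes - drop;
    return diff <= slack_bytes;
}
```

With the behavior originally reported (free memory never changing), the drop is zero and the check fails for any allocation larger than the slack.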

This indicates that the Level Zero runtime and driver stack are functioning as intended in the current setup, and the earlier reporting issue may have been environment-specific or related to mismatched library versions. Closing the issue.

avimanyu786, Oct 16 '25 12:10