sysmon failed on B580
Reproduce
- OS: Ubuntu 24.04
- kernel: 6.13.4
- compute runtime: https://github.com/intel/compute-runtime/releases/tag/25.05.32567.17
- dGPU: B580
dpkg -l | grep intel
ii intel-gsc 0.9.5-112~u24.04 amd64 Intel(R) Graphics System Controller Firmware
ii intel-igc-core-2 2.7.11 amd64 Intel(R) Graphics Compiler for OpenCL(TM)
ii intel-igc-opencl-2 2.7.11 amd64 Intel(R) Graphics Compiler for OpenCL(TM)
ii intel-level-zero-gpu 1.6.32567.17 amd64 Intel(R) Graphics Compute Runtime for oneAPI Level Zero.
ii intel-media-va-driver:amd64 24.1.0+dfsg1-1 amd64 VAAPI driver for the Intel GEN8+ Graphics family
ii intel-metrics-discovery 1.13.179-1077~24.04 amd64 Intel(R) Metrics Discovery Application Programming Interface --
ii intel-metrics-library 1.0.182-1077~24.04 amd64 Intel(R) Metrics Library for MDAPI (Intel(R) Metrics Discovery
ii intel-microcode 3.20250211.0ubuntu0.24.04.1 amd64 Processor microcode firmware for Intel CPUs
ii intel-ocloc 24.52.32224.14-1077~24.04 amd64 Tool for managing Intel Compute GPU device binary format
ii intel-opencl-icd 25.05.32567.17 amd64 Intel graphics compute runtime for OpenCL
ii libchewing3:amd64 0.6.0-1build1 amd64 intelligent phonetic input method library
ii libchewing3-data 0.6.0-1build1 all intelligent phonetic input method library - data files
ii libdrm-intel1:amd64 2.4.124+git2501180500.a7eb2c~oibaf~n amd64 Userspace interface to intel-specific kernel DRM services -- runtime
ii xserver-xorg-video-intel 2:2.99.917+git20210115-1build1 amd64 X.Org X server -- Intel i8xx, i9xx display driver
sycl-ls
[level_zero:gpu][level_zero:0] Intel(R) oneAPI Unified Runtime over Level-Zero, Intel(R) Arc(TM) B580 Graphics 20.1.0 [1.6.32567.170000]
[opencl:gpu][opencl:0] Intel(R) OpenCL Graphics, Intel(R) Arc(TM) B580 Graphics OpenCL 3.0 NEO [25.05.32567.17]
Build
cd pti-gpu/tools/sysmon
mkdir build && cd build
cmake ..
make
Log
./sysmon -p
=====================================================================================
sysmon: /home/az/workspace/pti-gpu/tools/sysmon/main.cc:117: void PrintShorInfo(ze_driver_handle_t, zes_device_handle_t, uint32_t): Assertion `status == ZE_RESULT_SUCCESS' failed.
[2] 14188 IOT instruction (core dumped) ./sysmon -p
Thanks, I was able to reproduce on Ubuntu 24.10 too. Looking into it.
@mschilling0 any updates? I think it comes from Level Zero (L0) or the driver.
Yes, I filed an issue with their team, and they seem to have accepted it. I will follow up and ask for an update.
No updates yet, other than that they have done an initial triage and seem to have accepted it. I will post any updates here. It may have to work its way through their processes.
The zesDeviceGetProperties API is expected to return ZE_RESULT_ERROR_UNINITIALIZED (error code 0x78000001) when it is given a core handle. sysmon passes a core handle here because it uses zeInit-based sysman initialization.
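Because the assertion in PrintShorInfo only reports that `status` was not `ZE_RESULT_SUCCESS`, the actual code (0x78000001) does not appear in the log above. A minimal sketch of a check macro that prints the failing call and its result code before aborting; `CHECK_ZE` is a hypothetical name, not something sysmon or Level Zero provides. It assumes only that `ZE_RESULT_SUCCESS` is 0, so it accepts a `ze_result_t` or any integral value and compiles without the Level Zero headers.

```cpp
#include <cstdio>
#include <cstdlib>

// Hypothetical replacement for a bare assert(status == ZE_RESULT_SUCCESS):
// prints the failing call and its result in hex, e.g. 0x78000001
// (ZE_RESULT_ERROR_UNINITIALIZED), then aborts. Only assumes that
// ZE_RESULT_SUCCESS is 0.
#define CHECK_ZE(call)                                                  \
  do {                                                                  \
    const unsigned ze_status_ = static_cast<unsigned>(call);            \
    if (ze_status_ != 0u) {                                             \
      std::fprintf(stderr, "%s:%d: %s returned 0x%08x\n", __FILE__,     \
                   __LINE__, #call, ze_status_);                        \
      std::abort();                                                     \
    }                                                                   \
  } while (0)
```

In sysmon this would wrap calls such as `CHECK_ZE(zesDeviceGetProperties(device, &props));`. Unlike `assert`, the check is not compiled out when `NDEBUG` is defined, so release builds still report the failure.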
On BMG systems, sysman should be initialized with zesInit-based initialization, which creates a separate Sysman handle. That Sysman handle should be passed to zesDeviceGetProperties to fetch the correct values.
Regarding BMG specifically: the Xe KMD is enabled on BMG, and on the Xe KMD only zesInit-based sysman initialization is supported. Legacy sysman initialization is not supported. https://github.com/intel/compute-runtime/blob/master/programmers-guide/SYSMAN.md
@pratikbariintel Hi, sysmon currently initializes with zeInit: https://github.com/intel/pti-gpu/blob/00e4bbc736a64811195b94243b83d30383309396/tools/sysmon/main.cc#L1252.
I tried replacing zesDeviceGetProperties with zeDeviceGetProperties, and zeDeviceGetProperties returns the correct properties.
So the next question is why zesDeviceGetProperties and the other zes* APIs fail.
From what you pasted above, do you mean the zes_device_handle_t should be obtained as in the example in that guide?
zeDeviceGetProperties is a core API, and it expects a core handle (ze_device_handle_t), so the call succeeds. However, zesDeviceGetProperties is a Sysman API, and it expects a Sysman handle (zes_device_handle_t), because the core handles and Sysman handles have been separated. All zes* APIs are Sysman APIs and therefore require Sysman handles. The flow for using the zes APIs is zesInit, zesDriverGet, and zesDeviceGet.
@pratikbariintel Got it. I misunderstood the API based on the spec, which declares:
/// @brief Handle of device object
typedef ze_device_handle_t zes_device_handle_t;
https://github.com/oneapi-src/level-zero/blob/3c938e21d827af014971d69dfd66759c2444e4d0/include/zes_api.h#L34C13-L34C48
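The typedef explains why the mix-up compiles cleanly: `zes_device_handle_t` is only an alias, so passing a core handle to a Sysman API is not a type error. The sketch below copies the shape of the two declarations into a hypothetical `mirror` namespace so it compiles without the Level Zero headers; `AcceptsSysmanHandle` is a hypothetical stand-in for any `zes*` function.

```cpp
#include <type_traits>

namespace mirror {
// Same shape as the declarations in ze_api.h and zes_api.h.
struct _ze_device_handle_t;
typedef struct _ze_device_handle_t *ze_device_handle_t;
typedef ze_device_handle_t zes_device_handle_t;

// Hypothetical stand-in for a Sysman API that expects a Sysman handle.
constexpr bool AcceptsSysmanHandle(zes_device_handle_t) { return true; }

// Stands in for a handle obtained from zeDeviceGet (a core handle).
constexpr ze_device_handle_t kCoreHandle = nullptr;
}  // namespace mirror

// Both names denote one type, so the compiler cannot tell the handles apart...
static_assert(std::is_same<mirror::ze_device_handle_t,
                           mirror::zes_device_handle_t>::value,
              "zes_device_handle_t is only an alias");
// ...and a core handle is accepted where a Sysman handle is expected. On the
// Xe KMD the mismatch only shows up at run time, as ZE_RESULT_ERROR_UNINITIALIZED.
static_assert(mirror::AcceptsSysmanHandle(mirror::kCoreHandle),
              "core handle compiles");
```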
On B580, to access sysman functions:
Now: call zesInit() + zesDriverGet() + zesDeviceGet(). The sysman device handles returned by zesDeviceGet should be used for all subsequent sysman APIs. This is the recommended option.
Later: we have also added support for using core device handles with zesDevice*** APIs after successful sysman initialization through zesInit(). This support has not yet reached the public driver and may be available in approximately a month's time.
zesDeviceEnumEngineGroups still does not return the correct engines on B580 when called with a zes_device_handle_t.
@AshwinKumarKulkarni @pratikbariintel
@alanzhai219 Support for enumerating engine handles on the Xe driver was added recently. It will be available in the new driver release in 3-4 weeks.
@pratikbariintel @AshwinKumarKulkarni So, from this discussion, I think sysman still needs to be fixed. Ideally, we need to keep compatibility with devices older than BMG.
Is there an API call where we can determine if the device should use legacy mode or zesInit / zes_handle mode?
Should we just use a failed return code? Check /proc for xe (Linux)?
At present there is no separate API to check for legacy mode versus the new mode. Please refer to this for Sysman initialization: https://github.com/intel/compute-runtime/blob/master/programmers-guide/SYSMAN.md#support-and-limitations
The snippet below may help decide between zesInit and legacy initialization; please check whether it is useful.
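// Core initialization
ze_result_t result = zeInit(0);
CHECK_RESULT_FOR_SUCCESS(result);
uint32_t driver_count = 1;
ze_driver_handle_t ze_driver = nullptr;
result = zeDriverGet(&driver_count, &ze_driver);  // first driver only
CHECK_RESULT_FOR_SUCCESS(result);
uint32_t device_count = 1;
ze_device_handle_t ze_device = nullptr;
result = zeDeviceGet(ze_driver, &device_count, &ze_device);  // first device only
CHECK_RESULT_FOR_SUCCESS(result);
// Check GPU platform: chain the IP version extension into the properties query
ze_device_ip_version_ext_t ip_version_ext = {};
ip_version_ext.stype = ZE_STRUCTURE_TYPE_DEVICE_IP_VERSION_EXT;
ip_version_ext.pNext = nullptr;
ze_device_properties_t properties = {};
properties.stype = ZE_STRUCTURE_TYPE_DEVICE_PROPERTIES;
properties.pNext = &ip_version_ext;
result = zeDeviceGetProperties(ze_device, &properties);
CHECK_RESULT_FOR_SUCCESS(result);
// Decision
bool use_zes_init = false;
if (properties.type == ZE_DEVICE_TYPE_GPU && properties.vendorId == 0x8086) {
  // ip_version_ext was filled in through the pNext chain
  if (ip_version_ext.ipVersion >= 0x05004000) {  // BMG's IP version
    // Go with zesInit-based sysman init (recommended);
    // legacy init is not supported on Xe KMD
    use_zes_init = true;
  } else {
    // Go with legacy sysman init
  }
}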
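The `0x05004000` threshold can be read as an encoded IP version. Assuming the bit layout compute-runtime uses for `ipVersion` (architecture in bits 22-31, release in bits 14-21, revision in bits 0-5), it decodes to 20.1.0, which matches the `20.1.0` that sycl-ls reports for the B580 earlier in this issue. `EncodeIpVersion` and `ShouldUseZesInit` are hypothetical names used only to show the arithmetic and the decision as a pure function.

```cpp
#include <cstdint>

// Assumed layout of ze_device_ip_version_ext_t::ipVersion on Intel GPUs:
// architecture in bits 22-31, release in bits 14-21, revision in bits 0-5.
constexpr std::uint32_t EncodeIpVersion(std::uint32_t architecture,
                                        std::uint32_t release,
                                        std::uint32_t revision) {
  return ((architecture & 0x3FFu) << 22) | ((release & 0xFFu) << 14) |
         (revision & 0x3Fu);
}

// 20.1.0 (B580 / BMG) encodes to the threshold used in the snippet above.
constexpr std::uint32_t kBmgIpVersion = EncodeIpVersion(20, 1, 0);
static_assert(kBmgIpVersion == 0x05004000u, "20.1.0 == 0x05004000");

// Hypothetical helper: true when the zesInit path should be taken, i.e. an
// Intel GPU (vendorId 0x8086) at or above the BMG IP version.
constexpr bool ShouldUseZesInit(bool is_gpu, std::uint32_t vendor_id,
                                std::uint32_t ip_version) {
  return is_gpu && vendor_id == 0x8086u && ip_version >= kBmgIpVersion;
}
```

Because the architecture occupies the highest bits, a plain integer comparison orders versions by architecture, then release, then revision, which is why a single `>=` is enough.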
These infrastructure-related APIs should be ready before the new hardware is released. @pratikbariintel @AshwinKumarKulkarni
Thanks! I will try it out next week.