compute-runtime
Assert with Xe KMD when using -DNEO_ENABLE_XE_DRM_DETECTION=TRUE
Problem
Compute-runtime's Xe KMD support does not actually work with Xe KMD; it aborts with an assert.
Details
With the kernel built from the Xe repo's default "drm-xe-next" branch (yesterday's HEAD commit): https://gitlab.freedesktop.org/drm/xe/kernel
With Xe driver enabled:
# grep _XE[^A-Z] /boot/drm_xe.config
CONFIG_DRM_XE=m
CONFIG_DRM_XE_FORCE_PROBE=""
CONFIG_DRM_XE_JOB_TIMEOUT_MAX=10000
CONFIG_DRM_XE_JOB_TIMEOUT_MIN=1
CONFIG_DRM_XE_TIMESLICE_MAX=10000000
CONFIG_DRM_XE_TIMESLICE_MIN=1
CONFIG_DRM_XE_PREEMPT_TIMEOUT=640000
CONFIG_DRM_XE_PREEMPT_TIMEOUT_MAX=10000000
CONFIG_DRM_XE_PREEMPT_TIMEOUT_MIN=1
CONFIG_DRM_XE_ENABLE_SCHEDTIMEOUT_LIMIT=y
Booting a TGL device with it enabled:
# dmesg | grep xe[^a-z]
[ 0.000000] Command line: BOOT_IMAGE=/boot/drm_xe rootwait fsck.repair=yes i915.force_probe=!9a60 xe.force_probe=9a60 ro
[ 0.038111] Kernel command line: BOOT_IMAGE=/boot/drm_xe rootwait fsck.repair=yes i915.force_probe=!9a60 xe.force_probe=9a60 ro
[ 3.068875] xe 0000:00:02.0: vgaarb: deactivate vga console
[ 3.198711] xe 0000:00:02.0: [drm] Using GuC firmware from i915/tgl_guc_70.bin version 70.13.1
[ 3.202558] xe 0000:00:02.0: [drm] Using HuC firmware from i915/tgl_huc.bin version 7.9.3
[ 3.204943] xe REG[0x2340-0x235f]: allow read access
[ 3.204946] xe REG[0x7010-0x7017]: allow rw access
[ 3.204947] xe REG[0x7018-0x701f]: allow rw access
[ 3.204974] xe REG[0x223a8-0x223af]: allow read access
[ 3.204993] xe REG[0x1c03a8-0x1c03af]: allow read access
[ 3.205011] xe REG[0x1d03a8-0x1d03af]: allow read access
[ 3.205030] xe REG[0x1c83a8-0x1c83af]: allow read access
[ 3.212040] [drm] Initialized xe 1.1.0 20201103 for 0000:00:02.0 on minor 0
[ 4.462524] xe 0000:00:02.0: [drm] GT0: suspended
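For anyone reproducing this, a minimal sketch for double-checking which driver owns the GPU, assuming the usual sysfs layout and handling only the single-ID form of `force_probe` used on the command line above (comma-separated ID lists are not handled). The function names and the optional sysfs-root parameter are made up for illustration.

```shell
# Print the name of the kernel driver bound to a PCI device, or "none".
# $1: PCI address, e.g. 0000:00:02.0 as seen in the dmesg output
# $2: sysfs root, defaults to /sys (hypothetical parameter, eases testing)
bound_driver() {
    link="${2:-/sys}/bus/pci/devices/$1/driver"
    if [ -L "$link" ]; then
        basename "$(readlink "$link")"
    else
        echo none
    fi
}

# Report how a kernel command line treats one device ID for one driver:
# "forced" for <driver>.force_probe=<id>, "blocked" for <driver>.force_probe=!<id>,
# "unset" when neither form is present.
force_probe_state() {
    driver=$1; id=$2; cmdline=$3
    for arg in $cmdline; do
        case "$arg" in
            "$driver.force_probe=!$id") echo blocked; return ;;
            "$driver.force_probe=$id")  echo forced;  return ;;
        esac
    done
    echo unset
}

cmdline='BOOT_IMAGE=/boot/drm_xe rootwait fsck.repair=yes i915.force_probe=!9a60 xe.force_probe=9a60 ro'
force_probe_state i915 9a60 "$cmdline"   # prints "blocked"
force_probe_state xe 9a60 "$cmdline"     # prints "forced"
```

On the booted system, `bound_driver 0000:00:02.0` should then name the driver that actually claimed the device.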
And using a compute stack built from the following versions:
- GMMlib: intel-gmmlib-22.3.16
- SPIRV-SDK: vulkan-sdk-1.3.268.0/vulkan-sdk-1.3.268.0 (headers/tools)
- SPIRV-LLVM: libllvmspirvlib-14-dev:amd64:14.0.0-3ubuntu1 (Ubuntu package)
- OpenCL-Clang: libopencl-clang-14-dev:amd64:14.0.0-4 (Ubuntu package)
- VC-intrinsics: v0.16.0
- Graphics Compiler: igc-1.0.15985.0 (IGC)
- Level-Zero API: v1.15.8
- compute-runtime: 23.48.27912.9
Using options enabling Xe KMD support:
ARG ZELLO_LOC=../level_zero/tools/test/black_box_tests/zello_sysman.cpp
RUN cd compute-runtime && mkdir build && cd build && \
cmake -LH -Wno-dev -G Ninja \
-DCMAKE_INSTALL_PREFIX=${INSTALL_DIR} -DCMAKE_BUILD_TYPE=Release \
-DSUPPORT_GEN8=0 -DSUPPORT_GEN9=1 -DSUPPORT_GEN11=0 \
-DSUPPORT_TGLLP=1 -DSUPPORT_DG1=1 -DSUPPORT_XE_HP_SDV=1 \
-DSUPPORT_DG2=1 -DSUPPORT_PVC=1 \
-DNEO_ENABLE_i915_PRELIM_DETECTION=TRUE \
-DNEO_ENABLE_XE_DRM_DETECTION=TRUE \
-DNEO_DISABLE_LD_GOLD=1 \
-DDO_NOT_RUN_AUB_TESTS=1 -DDONT_CARE_OF_VIRTUALS=1 \
../ && \
ninja && ninja install && \
g++ -O2 -Wall -o ${INSTALL_DIR}/bin/zello_sysman $ZELLO_LOC -lze_loader -locloc
Compute-runtime and its zello_sysman tool just abort with an assert:
# docker run -it --rm --user root --network none --cap-drop ALL --device /dev/dri:/dev/dri:rw registry/compute-tester:latest zello_sysman
ZES_ENABLE_SYSMAN environment variable Not Set
Setting the environment variable ZES_ENABLE_SYSMAN
ZES_ENABLE_SYSMAN environment variable Set
Abort was called at 311 line in file:
/source/compute-runtime/shared/source/os_interface/linux/xe/ioctl_helper_xe.cpp
OpenCL programs also hit the same assert, which is here in the repo code: https://github.com/intel/compute-runtime/blob/23.48.27912.9/shared/source/os_interface/linux/xe/ioctl_helper_xe.cpp#L311
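The link above is just the release tag, the repo-relative path and the line number from the abort message glued together. A small sketch of that mapping, assuming the checkout directory is named `compute-runtime` as in the paths printed here; `abort_permalink` is a made-up helper name:

```shell
# Build a GitHub permalink for an "Abort was called at N line in file:" message.
# $1: release tag the driver was built from
# $2: line number from the message
# $3: source path as printed in the message
abort_permalink() {
    tag=$1; line=$2; path=$3
    # Strip everything up to and including the checkout directory name
    # (assumed to be "compute-runtime").
    rel=${path#*compute-runtime/}
    echo "https://github.com/intel/compute-runtime/blob/${tag}/${rel}#L${line}"
}

abort_permalink 23.48.27912.9 311 \
    /source/compute-runtime/shared/source/os_interface/linux/xe/ioctl_helper_xe.cpp
```

Pinning the link to the tag rather than master keeps the line number meaningful as the file changes.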
Strace shows this memory region check issue happening at driver init time:
# ... strace -f -k zello_sysman
...
write(1, "Abort was called at 311 line in "..., 38Abort was called at 311 line in file:
) = 38
> /usr/lib/x86_64-linux-gnu/libc.so.6(__write+0x14) [0x10bf34]
...
> /usr/lib/x86_64-linux-gnu/libc.so.6(__printf_chk+0xab) [0x12d63b]
> /usr/local/lib/libze_intel_gpu.so.1.3.27912(zeKernelSuggestGroupSizeTracing+0x10e822) [0x3463b2]
...
> /usr/local/lib/libze_intel_gpu.so.1.3.27912(zeKernelSuggestGroupSizeTracing+0x36949d) [0x5a102d]
> /usr/local/lib/libze_intel_gpu.so.1.3.27912(zetGetMetricGroupExpProcAddrTable+0x22e86) [0x11b506]
> /usr/local/lib/libze_intel_gpu.so.1.3.27912(zetGetMetricGroupExpProcAddrTable+0x227af) [0x11ae2f]
> /usr/lib/x86_64-linux-gnu/libc.so.6(pthread_mutexattr_settype+0x107) [0x94817]
> /usr/local/lib/libze_intel_gpu.so.1.3.27912(zetGetMetricGroupExpProcAddrTable+0x22a38) [0x11b0b8]
> /usr/local/lib/libze_tracing_layer.so.1.15.8(zeGetFabricVertexExpProcAddrTable+0xdc5) [0xe835]
> /usr/local/lib/libze_loader.so.1.15.8(loader::context_t::init_driver(loader::driver_t, unsigned int)+0x61d) [0x1f9bd]
> /usr/local/lib/libze_loader.so.1.15.8(loader::context_t::check_drivers(unsigned int)+0x126) [0x219e6]
> /usr/local/lib/libze_loader.so.1.15.8(ze_lib::context_t::~context_t()+0xc0) [0x1a170]
> /usr/local/lib/libze_loader.so.1.15.8(loader::createLoaderContext()+0x174) [0x117a4]
> /usr/lib/x86_64-linux-gnu/libc.so.6(pthread_mutexattr_settype+0x107) [0x94817]
> /usr/local/lib/libze_loader.so.1.15.8(zeInit+0x73) [0x11853]
> /usr/local/bin/zello_sysman() [0xa658]
On Arc, I've also seen a segfault instead of the assert, but it was not reproducible. Strace showed it happening with the same backtrace as the assert.
With OpenCL, strace shows the line 311 assert being reached through a different route than in the zello_sysman L0 backend backtrace above:
ioctl(4, _IOC(_IOC_READ|_IOC_WRITE, 0x64, 0x40, 0x28), 0x7ffe75cd81f0) = 0
> /usr/lib/x86_64-linux-gnu/libc.so.6(ioctl+0x3f) [0x111f3f]
> /usr/local/lib/intel-opencl/libigdrcl.so(GTPin_Init+0x505bba) [0x5ccb0a]
> /usr/local/lib/intel-opencl/libigdrcl.so(GTPin_Init+0x51d7b7) [0x5e4707]
> /usr/local/lib/intel-opencl/libigdrcl.so(GTPin_Init+0x51a22f) [0x5e117f]
> /usr/local/lib/intel-opencl/libigdrcl.so(GTPin_Init+0x4fc8f5) [0x5c3845]
> /usr/local/lib/intel-opencl/libigdrcl.so(GTPin_Init+0x1be9b) [0xe2deb]
> /usr/local/lib/intel-opencl/libigdrcl.so(GTPin_Init+0x5105b0) [0x5d7500]
> /usr/local/lib/intel-opencl/libigdrcl.so(GTPin_Init+0x464df7) [0x52bd47]
> /usr/local/lib/intel-opencl/libigdrcl.so() [0x9b121]
> /usr/local/lib/intel-opencl/libigdrcl.so() [0x9b2ae]
> /usr/local/lib/intel-opencl/libigdrcl.so(GTPin_Init+0x46504d) [0x52bf9d]
> /usr/local/lib/intel-opencl/libigdrcl.so(clGetExtensionFunctionAddress+0x5a6b) [0xbf7fb]
> /usr/local/lib/intel-opencl/libigdrcl.so(clIcdGetPlatformIDsKHR+0x27) [0xbfe27]
> /usr/lib/x86_64-linux-gnu/libOpenCL.so.1.0.0() [0x7f64]
> /usr/lib/x86_64-linux-gnu/libOpenCL.so.1.0.0(clGetPlatformIDs+0xbb) [0x8f6b]
> /usr/bin/clinfo() [0x97cc]
> /usr/lib/x86_64-linux-gnu/libc.so.6(__libc_init_first+0x90) [0x23a90]
> /usr/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x89) [0x23b49]
> /usr/bin/clinfo() [0xc645]
newfstatat(1, "", {st_mode=S_IFCHR|0620, st_rdev=makedev(0x88, 0), ...}, AT_EMPTY_PATH) = 0
> /usr/lib/x86_64-linux-gnu/libc.so.6(fstatat+0xe) [0x10b42e]
> /usr/lib/x86_64-linux-gnu/libc.so.6(_IO_file_doallocate+0x63) [0x78603]
> /usr/lib/x86_64-linux-gnu/libc.so.6(_IO_doallocbuf+0x50) [0x885b0]
> /usr/lib/x86_64-linux-gnu/libc.so.6(_IO_file_overflow+0x180) [0x87510]
> /usr/lib/x86_64-linux-gnu/libc.so.6(_IO_file_xsputn+0x105) [0x85ce5]
> /usr/lib/x86_64-linux-gnu/libc.so.6(parse_printf_format+0x969) [0x56929]
> /usr/lib/x86_64-linux-gnu/libc.so.6(parse_printf_format+0x605) [0x565c5]
> /usr/lib/x86_64-linux-gnu/libc.so.6(parse_printf_format+0xbcc) [0x56b8c]
> /usr/lib/x86_64-linux-gnu/libc.so.6(_IO_vfprintf+0x24e5) [0x5ece5]
> /usr/lib/x86_64-linux-gnu/libc.so.6(_IO_vfprintf+0x4341) [0x60b41]
> /usr/lib/x86_64-linux-gnu/libc.so.6(__printf_chk+0xab) [0x12d63b]
> /usr/local/lib/intel-opencl/libigdrcl.so() [0x9b582]
> /usr/local/lib/intel-opencl/libigdrcl.so(GTPin_Init+0x51a555) [0x5e14a5]
> /usr/local/lib/intel-opencl/libigdrcl.so(GTPin_Init+0x4fc8f5) [0x5c3845]
> /usr/local/lib/intel-opencl/libigdrcl.so(GTPin_Init+0x1be9b) [0xe2deb]
> /usr/local/lib/intel-opencl/libigdrcl.so(GTPin_Init+0x5105b0) [0x5d7500]
> /usr/local/lib/intel-opencl/libigdrcl.so(GTPin_Init+0x464df7) [0x52bd47]
> /usr/local/lib/intel-opencl/libigdrcl.so() [0x9b121]
> /usr/local/lib/intel-opencl/libigdrcl.so() [0x9b2ae]
> /usr/local/lib/intel-opencl/libigdrcl.so(GTPin_Init+0x46504d) [0x52bf9d]
> /usr/local/lib/intel-opencl/libigdrcl.so(clGetExtensionFunctionAddress+0x5a6b) [0xbf7fb]
> /usr/local/lib/intel-opencl/libigdrcl.so(clIcdGetPlatformIDsKHR+0x27) [0xbfe27]
> /usr/lib/x86_64-linux-gnu/libOpenCL.so.1.0.0() [0x7f64]
> /usr/lib/x86_64-linux-gnu/libOpenCL.so.1.0.0(clGetPlatformIDs+0xbb) [0x8f6b]
> /usr/bin/clinfo() [0x97cc]
> /usr/lib/x86_64-linux-gnu/libc.so.6(__libc_init_first+0x90) [0x23a90]
> /usr/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x89) [0x23b49]
> /usr/bin/clinfo() [0xc645]
write(1, "Abort was called at 311 line in "..., 38Abort was called at 311 line in file:
) = 38
Mesa driver works fine with this (last night) Xe KMD git version.
I also tried the older (Dec 21st) Xe KMD version recommended for media-driver in https://github.com/intel/media-driver/issues/1761
But the compute-runtime 23.48.27912.9 tag, and the one from the earlier 23.43.27642.21 series (using an older Xe uAPI, I think), still fail at init with it:
$ NEOReadDebugKeys=1 PrintDebugSettings=1 PrintDebugMessages=1 zello_sysman
ZES_ENABLE_SYSMAN environment variable Not Set
Setting the environment variable ZES_ENABLE_SYSMAN
ZES_ENABLE_SYSMAN environment variable Set
Non-default value of debug variable: PrintDebugSettings = 1
Non-default value of debug variable: PrintDebugMessages = 1
IoctlHelperXe::IoctlHelperXe
IoctlHelperXe::initialize
-> IoctlHelperXe::getIoctlRequestValue 0xe
-> IoctlHelperXe::getIoctlRequestValue 0xe
DRM_XE_QUERY_CONFIG_REV_AND_DEVICE_ID 0x19a60
REV_ID 0x1
DEVICE_ID 0x9a60
DRM_XE_QUERY_CONFIG_FLAGS 0
DRM_XE_QUERY_CONFIG_FLAG_HAS_VRAM OFF
DRM_XE_QUERY_CONFIG_MIN_ALIGNMENT 0x1000
DRM_XE_QUERY_CONFIG_VA_BITS 0x30
-> IoctlHelperXe::getIoctlRequestValue 0xe
-> IoctlHelperXe::getIoctlRequestValue 0xe
-> IoctlHelperXe::getIoctlRequestValue 0xe
-> IoctlHelperXe::getIoctlRequestValue 0xe
-> IoctlHelperXe::getIoctlRequestValue 0xe
-> IoctlHelperXe::getIoctlRequestValue 0xe
-> IoctlHelperXe::getDrmParamValue 0x26 QueryHwconfigTable
=> IoctlHelperXe::ioctl 0xe
-> IoctlHelperXe::ioctl Query id=0x26 f=0x0 len=0 r=0
INFO: System Info query failed!
-> IoctlHelperXe::getDrmParamValue 0x1b ParamHasExecSoftpin
=> IoctlHelperXe::ioctl 0x3
-> IoctlHelperXe::ioctl Getparam 0x1b/0x1 r=0
=> IoctlHelperXe::ioctl 0xd
-> IoctlHelperXe::ioctl GemContextSetparam r=0
-> IoctlHelperXe::getIoctlRequestValue 0xe
-> IoctlHelperXe::getIoctlRequestValue 0xe
-> IoctlHelperXe::getIoctlRequestValue 0xe
-> IoctlHelperXe::getIoctlRequestValue 0xe
Abort was called at 311 line in file:
/home/nobody/source/compute-runtime/shared/source/os_interface/linux/xe/ioctl_helper_xe.cpp
Aborted (core dumped)
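For reading the debug output above: the REV_AND_DEVICE_ID value is a packed field. A minimal sketch of the split, assuming the device ID sits in the low 16 bits and the revision in the next 8 bits (consistent with the logged values, but the 8-bit width is an assumption here); `decode_rev_and_device_id` is a made-up name:

```shell
# Split a packed DRM_XE_QUERY_CONFIG_REV_AND_DEVICE_ID value.
# Assumed layout: bits 0-15 = device ID, bits 16-23 = revision.
decode_rev_and_device_id() {
    value=$(( $1 ))
    printf 'REV_ID 0x%x\n' $(( (value >> 16) & 0xff ))
    printf 'DEVICE_ID 0x%x\n' $(( value & 0xffff ))
}

decode_rev_and_device_id 0x19a60
# prints:
# REV_ID 0x1
# DEVICE_ID 0x9a60
```

So the query itself returns sane values for this TGL (9a60) device; the abort comes later in init.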
The latest Mesa tag works with Xe KMD HEAD, and the linked media-driver bug states the working combo for media.
So, which Xe KMD version does compute-runtime need?
As the latest "compute-runtime" tag (23.52.28202.14) included some Xe KMD uAPI support updates (08f7e7be18f17a8977a9c380faa6addee9d8cf83), I built the latest of everything and tried it with the latest Xe KMD drm-xe-next upstreaming tag, drm-xe-next-fixes-2024-01-16.
Although the latest Mesa (release) and media-driver (master) now both work with that Xe KMD tag (without any additional patches), "compute-runtime" still aborts:
# strace -f -k clinfo
...
write(1, "Abort was called at 509 line in "..., 38Abort was called at 509 line in file:
) = 38
> /usr/lib/x86_64-linux-gnu/libc.so.6(__write+0x14) [0x10bf34]
...
> /usr/lib/x86_64-linux-gnu/libc.so.6(__printf_chk+0xab) [0x12d63b]
> /usr/local/lib/intel-opencl/libigdrcl.so() [0x9e552]
> /usr/local/lib/intel-opencl/libigdrcl.so(GTPin_Init+0x52805f) [0x5f24cf]
> /usr/local/lib/intel-opencl/libigdrcl.so(GTPin_Init+0x52c71c) [0x5f6b8c]
> /usr/local/lib/intel-opencl/libigdrcl.so(GTPin_Init+0x50ccce) [0x5d713e]
> /usr/local/lib/intel-opencl/libigdrcl.so(GTPin_Init+0x1f1c6) [0xe9636]
> /usr/local/lib/intel-opencl/libigdrcl.so(GTPin_Init+0x520290) [0x5ea700]
> /usr/local/lib/intel-opencl/libigdrcl.so(GTPin_Init+0x4743c7) [0x53e837]
> /usr/local/lib/intel-opencl/libigdrcl.so() [0x9e0f1]
> /usr/local/lib/intel-opencl/libigdrcl.so() [0x9e27e]
> /usr/local/lib/intel-opencl/libigdrcl.so(GTPin_Init+0x47461d) [0x53ea8d]
> /usr/local/lib/intel-opencl/libigdrcl.so(clGetExtensionFunctionAddress+0x5a7b) [0xc2d2b]
> /usr/local/lib/intel-opencl/libigdrcl.so(clIcdGetPlatformIDsKHR+0x27) [0xc3357]
> /usr/lib/x86_64-linux-gnu/libOpenCL.so.1.0.0() [0x7f64]
> /usr/lib/x86_64-linux-gnu/libOpenCL.so.1.0.0(clGetPlatformIDs+0xbb) [0x8f6b]
> /usr/bin/clinfo() [0x97cc]
Which Xe KMD version, patches etc. is compute-runtime supposed to work with? And which compute-runtime version, patches etc. should I use?
Hi @eero-t Could you try to build NEO as of https://github.com/intel/compute-runtime/commit/278ced35dc2d69323a9e2bd754e648fcdab62520 ?
@JablonskiMateusz That commit seems to be only in the master branch, not yet in any of the tagged versions:
$ git branch --contains 278ced3
* master
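The tag side can be checked directly as well, since `git tag --contains` lists every tag that includes a commit. A small sketch for a local checkout; `tags_containing` is a made-up wrapper name:

```shell
# Print the tags that contain a commit, or "untagged" if no tag does.
# $1: path to a git checkout (e.g. compute-runtime)
# $2: commit hash
tags_containing() {
    # Fail instead of reporting "untagged" when git itself errors out
    # (unknown commit, not a repository, ...).
    tags=$(git -C "$1" tag --contains "$2") || return 1
    if [ -n "$tags" ]; then
        echo "$tags"
    else
        echo untagged
    fi
}
```

Running `tags_containing compute-runtime 278ced3` after each new release then shows when the fix has reached a tagged version.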
Similarly to media-driver, a master build of compute-runtime does work with Xe KMD!
Actually, both of the drivers work with both of the KMD versions from f.d.o:
- drm-tip (drm integration) repo HEAD, and
- drm/xe/kernel (Xe devel) repo drm-xe-next-fixes-2024-01-16 tag
However, while basic CL stuff seems to work, all Sysman metric queries return ZE_RESULT_ERROR_UNINITIALIZED (according to zello_sysman), at least on TGL iGPU.
Is there something I need to use to get at least some Sysman metrics to work, or is Xe KMD still lacking all metric support?
PS. I think this ticket should be open until:
- some tagged commit includes all the necessary Xe KMD support commit(s), and
- there's a README stating the Xe KMD commit/tag needed by that support [1]
[1] corresponding media-driver README: https://github.com/intel/media-driver/blob/master/media_softlet/linux/common/os/xe/include/README.md
> However, while basic CL stuff seems to work, all Sysman metric queries return ZE_RESULT_ERROR_UNINITIALIZED (according to zello_sysman), at least on TGL iGPU.
> Is there something I need to use to get at least some Sysman metrics to work, or is Xe KMD still lacking all metric support?
With the ZELLO_SYSMAN_USE_ZESINIT=1 env var, zello_sysman reports frequency metrics for TGL iGPU with xe KMD.
(I.e. Sysman supports xe KMD only when zesInit() is used for initializing it instead of zeInit().)
However, when querying engine metrics, there's a segfault:
# ZELLO_SYSMAN_USE_ZESINIT=1 strace -f zello_sysman -e
...
write(1, " ---- Engine tests ---- \n", 26 ---- Engine tests ----
) = 26
futex(0x5650fe3c3eb8, FUTEX_WAKE_PRIVATE, 2147483647) = 0
openat(AT_FDCWD, "/sys/class/drm/card0/device/vendor", O_RDONLY) = 3
read(3, "0x8086\n", 8191) = 7
close(3) = 0
openat(AT_FDCWD, "/sys/module/i915/agama_version", O_RDONLY) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/sys/module/i915/srcversion", O_RDONLY) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/sys/class/drm/card0/device/subsystem_vendor", O_RDONLY) = 3
read(3, "0x8086\n", 8191) = 7
close(3) = 0
write(1, "Device UUID: 0 0 0 0 0 0 0 0 0 0"..., 46Device UUID: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
) = 46
--- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_MAPERR, si_addr=NULL} ---
+++ killed by SIGSEGV (core dumped) +++
Those 2 metric types are the only ones compute-runtime supports for iGPUs, but once that segfault is fixed, I'll also try the other xe-provided Sysman metrics on some dGPU.
@eero-t We are looking into this and will update you when a fix is ready.
> However, when querying engine metrics, there's a segfault:
The segfault on engine metrics query is specific to "zello_sysman" (built from the same 2024-02-09 master branch sources as the driver itself).
There's no crash with my own zesInit()-using program with Xe KMD; engine metrics just do not work: https://github.com/intel/compute-runtime/issues/707
Tried latest Xe KMD (6.8.0-rc3) tags:
- https://gitlab.freedesktop.org/drm/xe/kernel/-/tags/drm-xe-next-2024-02-25
- https://gitlab.freedesktop.org/drm/xe/kernel/-/tags/drm-xe-fixes-2024-02-29
Because the latest "24.05.28454.10" release is still missing the required https://github.com/intel/compute-runtime/commit/278ced35dc2d69323a9e2bd754e648fcdab62520 commit, I again built the latest compute-runtime master.
In quick testing, the driver build seemed to work OK with the "drm-xe-next-2024-02-25" one, except for the missing engine metrics regression, which also happens with i915, and the zello_sysman crash discussed above.
As to the "drm-xe-fixes-2024-02-29" Xe KMD, the OpenCL read/write/copy tester hung on both TGL iGPU and Arc. When stracing the tester, it was either using 100% CPU by constantly calling sched_yield() (TGL), or nanosleeping (Arc). For now, I'm assuming the driver is not even supposed to work with that Xe KMD version...
with new release it is fixed, please close
@saik-intel I haven't yet had time to verify the latest release's functionality. I'll try to do it before the end of the week.
Closing. In quick testing (zello_sysman + cl-mem), the latest release works with both the Xe KMD repo "drm-xe-next-2024-02-25" tag and last night's "drm-tip" HEAD kernels.