oneMKL
`gemm` throws exception on PVC
Summary
I'm trying to use gemm on PVC, but it keeps throwing an exception. Please let me know where I'm going wrong.
I am attempting to use gemm and execute on a 4oam PVC system on ORTCE. I am getting an exception thrown with both production icpx and with the most recent version of intel/llvm, both compiled with production oneMKL.
A minimal reproducer is attached below.
sycl::queue q(sycl::default_selector_v);
T* a_d = sycl::malloc_device<T>(m * k, q);
T* b_d = sycl::malloc_device<T>(k * n, q);
T* c_d = sycl::malloc_device<T>(m * n, q);
std::vector<T> a_l(m*k);
std::vector<T> b_l(k*n);
std::vector<T> c_l(m*n, 0);
for (std::size_t i = 0; i < m*k; i++) {
  a_l[i] = drand48();
}
for (std::size_t i = 0; i < k*n; i++) {
  b_l[i] = drand48();
}
q.memcpy(a_d, a_l.data(), m*k*sizeof(T)).wait();
q.memcpy(b_d, b_l.data(), k*n*sizeof(T)).wait();
q.memcpy(c_d, c_l.data(), m*n*sizeof(T)).wait();
std::cout << "Running MKL gemm..." << std::endl;
auto event = oneapi::mkl::blas::row_major::gemm(q,
                                                oneapi::mkl::transpose::nontrans,
                                                oneapi::mkl::transpose::nontrans,
                                                m, n, k,
                                                T(1),
                                                a_d, k,
                                                b_d, n,
                                                T(1),
                                                c_d, n);
event.wait();
This throws the following exception:
(base) bbrock@sdp4452:~/src/issues/oneMKL_gemm$ ./gemm
Running MKL gemm...
terminate called after throwing an instance of 'sycl::_V1::exception'
what(): Level-Zero error:700000041879048196
On device: 'Intel(R) Graphics [0x0bd5]'
in kernel: oneapi::mkl::blas::sgemm_itcopy
Aborted (core dumped)
As far as I can tell, I am allocating enough memory, and all of the pointers I'm passing in are USM device pointers, which should be accessible on the device associated with the queue passed to oneMKL.
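To make the "enough memory" check above concrete, here is a minimal sketch of the size arithmetic for a row-major gemm. The struct `GemmSizes` and the function `row_major_gemm_sizes` are hypothetical helpers, not part of oneMKL, and the rules encoded here reflect my reading of the oneMKL row-major layout requirements (each stored row spans one leading dimension), so treat them as an assumption rather than the library's own validation.

```cpp
#include <cstdint>

// Hypothetical helper (not a oneMKL API): minimum element counts and
// leading-dimension validity for a row-major gemm call.
struct GemmSizes {
    std::int64_t a_elems;  // minimum number of elements in A's allocation
    std::int64_t b_elems;  // minimum number of elements in B's allocation
    std::int64_t c_elems;  // minimum number of elements in C's allocation
    bool ld_ok;            // true if lda, ldb and ldc are all large enough
};

inline GemmSizes row_major_gemm_sizes(bool trans_a, bool trans_b,
                                      std::int64_t m, std::int64_t n, std::int64_t k,
                                      std::int64_t lda, std::int64_t ldb, std::int64_t ldc) {
    // Stored shape of each operand: op(A) is m x k and op(B) is k x n,
    // so a transposed operand is stored with rows and columns swapped.
    const std::int64_t a_rows = trans_a ? k : m;
    const std::int64_t a_cols = trans_a ? m : k;
    const std::int64_t b_rows = trans_b ? n : k;
    const std::int64_t b_cols = trans_b ? k : n;

    GemmSizes s;
    // In row-major storage every stored row occupies `ld` elements.
    s.a_elems = lda * a_rows;
    s.b_elems = ldb * b_rows;
    s.c_elems = ldc * m;  // C is always m x n
    // The leading dimension must cover at least one full stored row.
    s.ld_ok = (lda >= a_cols) && (ldb >= b_cols) && (ldc >= n);
    return s;
}
```

With the reproducer's arguments (no transposes, lda = k, ldb = n, ldc = n) this yields m*k, k*n and m*n elements, which matches the three `malloc_device` calls, so the allocation sizes do not look like the cause.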
Version
I am using production oneMKL 2023.1.0.
Environment
I am running this on a machine with four PVC GPUs.
(base) bbrock@sdp125071:~/src/distributed-ranges/examples/shp$ sycl-ls
[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2 [2023.15.3.0.20_160000]
[opencl:cpu:1] Intel(R) OpenCL, Intel (R) Xeon (R) CPU Max 9480 OpenCL 3.0 (Build 0) [2023.15.3.0.20_160000]
[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Graphics [0x0bd5] 1.3 [1.3.24595]
[ext_oneapi_level_zero:gpu:1] Intel(R) Level-Zero, Intel(R) Graphics [0x0bd5] 1.3 [1.3.24595]
[ext_oneapi_level_zero:gpu:2] Intel(R) Level-Zero, Intel(R) Graphics [0x0bd5] 1.3 [1.3.24595]
[ext_oneapi_level_zero:gpu:3] Intel(R) Level-Zero, Intel(R) Graphics [0x0bd5] 1.3 [1.3.24595]
I am using production oneMKL 2023.1.0.
I am getting this error with both the most recent commit of intel/llvm and with production icpx.
(base) bbrock@sdp125071:~/src/distributed-ranges/examples/shp$ icpx --version
Intel(R) oneAPI DPC++/C++ Compiler 2023.1.0 (2023.1.0.20230320)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /opt/intel/oneapi/compiler/2023.1.0/linux/bin-llvm
Configuration file: /opt/intel/oneapi/compiler/2023.1.0/linux/bin-llvm/../bin/icpx.cfg
Steps to reproduce
(base) bbrock@sdp125071:~/src/issues/oneMKL_gemm$ ./gemm
MESA: warning: Driver does not support the 0xbd5 PCI ID.
MESA: warning: Driver does not support the 0xbd5 PCI ID.
MESA: warning: Driver does not support the 0xbd5 PCI ID.
MESA: warning: Driver does not support the 0xbd5 PCI ID.
Running MKL gemm...
terminate called after throwing an instance of 'sycl::_V1::exception'
what(): Level-Zero error:700000041879048196
On device: 'Intel(R) Graphics [0x0bd5]'
in kernel: oneapi::mkl::blas::sgemm_itcopy
Aborted (core dumped)
Observed behavior
Throws an exception as above.
Expected behavior
I expect the kernel to execute successfully.
The gemm_usm example included with production oneMKL also throws the same error on PVC.
(base) bbrock@sdp125071:~/src/issues/oneMKL_gemm$ ./gemm_usm
########################################################################
# General Matrix-Matrix Multiplication using Unified Shared Memory Example:
#
# C = alpha * A * B + beta * C
#
# where A, B and C are general dense matrices and alpha, beta are
# floating point type precision scalars.
#
# Using apis:
# gemm
#
# Supported floating point type precisions:
# float
# double
#
########################################################################
Running tests on GPU.
Running with single precision real data type:
Caught synchronous SYCL exception during GEMM:
Level-Zero error:700000041879048196
On device: 'Intel(R) Graphics [0x0bd5]'
in kernel: oneapi::mkl::blas::sgemm_incopy
OpenCL status: 1
GEMM parameters:
transA = trans, transB = nontrans
m = 45, n = 98, k = 67
lda = 103, ldB = 105, ldC = 106
alpha = 2, beta = 3
Outputting 2x2 block of A,B,C matrices:
A = [ 0.340188, 0.260249, ...
[ -0.105617, 0.0125354, ...
[ ...
B = [ -0.326421, -0.192968, ...
[ 0.363891, 0.251295, ...
[ ...
C = [ 0.400017, 0.310497, ...
[ 0.00257462, -0.0560381, ...
[ ...
# Identical errors are thrown for double and complex as well
I've added the example to my minimal reproducer tarball here: oneMKL_gemm_example.tar.gz
This is actually running fine on Borealis, so I think this might be a configuration issue with ORTCE. I will get in touch with the people who run the cluster.
@BenBrock Thanks for the logs and the update on Borealis. The error you see typically occurs when oneMKL cannot detect the GPU architecture (PVC) and falls back to an alternative code path, which is not functional on PVC. That explains why you see the issue only on a specific machine. As you mentioned, this is probably a configuration issue on ORTCE. Please let us know what you find.
@mmeterel Could you elaborate on what kind of misconfiguration causes this? I'm working on making oneAPI.jl support PVC hardware, however we're seeing a similar issue:
terminate called after throwing an instance of 'sycl::_V1::exception'
what(): Level-Zero error:700000041879048196
On device: 'Intel(R) Data Center GPU Max 1550'
in kernel: oneapi::mkl::blas::sgemm_itcopy
From worker 16:
[85716] signal (6.-6): Aborted
in expression starting at none:1
pthread_kill at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
raise at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
abort at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
__verbose_terminate_handler at /workspace/srcdir/gcc-13.2.0/libstdc++-v3/libsupc++/vterminate.cc:95
__terminate at /workspace/srcdir/gcc-13.2.0/libstdc++-v3/libsupc++/eh_terminate.cc:48
terminate at /workspace/srcdir/gcc-13.2.0/libstdc++-v3/libsupc++/eh_terminate.cc:58
__cxa_throw at /workspace/srcdir/gcc-13.2.0/libstdc++-v3/libsupc++/eh_throw.cc:98
_ZN6oneapi3mkl3gpu13build_programEPiPN4sycl3_V15queueEPvS7_iPKcS9_mcS9_Pb at /home/sdp/.julia/scratchspaces/8f75cd03-7ff8-4ecb-9b8f-daf728133b1b/conda/lib/libmkl_sycl_blas.so.4 (unknown line)
_ZN6oneapi3mkl3gpuL22mkl_gpu_get_kernel_extEPiPN4sycl3_V15queueEiPKcS8_mcS8_S8_S8_mPKvmbb at /home/sdp/.julia/scratchspaces/8f75cd03-7ff8-4ecb-9b8f-daf728133b1b/conda/lib/libmkl_sycl_blas.so.4 (unknown line)
_ZN6oneapi3mkl3gpu24mkl_gpu_get_spirv_kernelEPiPN4sycl3_V15queueEiPK22mkl_gpu_spirv_kernel_tPKcSB_ at /home/sdp/.julia/scratchspaces/8f75cd03-7ff8-4ecb-9b8f-daf728133b1b/conda/lib/libmkl_sycl_blas.so.4 (unknown line)
_ZN6oneapi3mkl3gpu40mkl_blas_gpu_sgemm_copybased_driver_syclEPiPN4sycl3_V15queueEPNS1_14blas_arg_usm_tEP20mkl_gpu_event_list_t at /home/sdp/.julia/scratchspaces/8f75cd03-7ff8-4ecb-9b8f-daf728133b1b/conda/lib/libmkl_sycl_blas.so.4 (unknown line)
_ZN6oneapi3mkl3gpu30mkl_blas_gpu_sgemm_driver_syclEPiPN4sycl3_V15queueEPNS1_14blas_arg_usm_tEP20mkl_gpu_event_list_t at /home/sdp/.julia/scratchspaces/8f75cd03-7ff8-4ecb-9b8f-daf728133b1b/conda/lib/libmkl_sycl_blas.so.4 (unknown line)
_ZN6oneapi3mkl3gpu19sgemm_sycl_internalEPN4sycl3_V15queueE10MKL_LAYOUT13MKL_TRANSPOSES7_lllNS0_16value_or_pointerIfEEPKflSB_lS9_PflNS0_4blas12compute_modeERKSt6vectorINS3_5eventESaISG_EElll at /home/sdp/.julia/scratchspaces/8f75cd03-7ff8-4ecb-9b8f-daf728133b1b/conda/lib/libmkl_sycl_blas.so.4 (unknown line)
_ZN6oneapi3mkl3gpu10sgemm_syclEPN4sycl3_V15queueE10MKL_LAYOUT13MKL_TRANSPOSES7_lllNS0_16value_or_pointerIfEEPKflSB_lS9_PflNS0_4blas12compute_modeERKSt6vectorINS3_5eventESaISG_EElll at /home/sdp/.julia/scratchspaces/8f75cd03-7ff8-4ecb-9b8f-daf728133b1b/conda/lib/libmkl_sycl_blas.so.4 (unknown line)
_ZN6oneapi3mkl4blas5sgemmERN4sycl3_V15queueE10MKL_LAYOUTNS0_9transposeES7_lllNS0_16value_or_pointerIfEEPKflSB_lS9_PflNS1_12compute_modeERKSt6vectorINS3_5eventESaISF_EE at /home/sdp/.julia/scratchspaces/8f75cd03-7ff8-4ecb-9b8f-daf728133b1b/conda/lib/libmkl_sycl_blas.so.4 (unknown line)
_ZN6oneapi3mkl4blas12column_major4gemmERN4sycl3_V15queueENS0_9transposeES7_lllNS0_16value_or_pointerIfEEPKflSB_lS9_PflNS1_12compute_modeERKSt6vectorINS4_5eventESaISF_EE at /home/sdp/.julia/scratchspaces/8f75cd03-7ff8-4ecb-9b8f-daf728133b1b/conda/lib/libmkl_sycl_blas.so.4 (unknown line)
onemklSgemm at /home/sdp/.julia/scratchspaces/8f75cd03-7ff8-4ecb-9b8f-daf728133b1b/deps/lib/liboneapi_support.so (unknown line)
As you can see, this code is being called from a oneMKL wrapper library (liboneapi_support) to work around the lack of a C API in oneMKL. We build and distribute this library ourselves, along with the required MKL and SYCL libraries, downloaded from Conda:
- https://github.com/JuliaGPU/oneAPI.jl/blob/2df5d3f06e97d33fa3c1bcc2398afec7075d0a45/deps/CMakeLists.txt
- https://github.com/JuliaPackaging/Yggdrasil/blob/77c11e9e797db54e68a8cfd83eb9b0d38830e80f/O/oneAPI_Support/build_tarballs.jl#L109-L140
We're probably doing something wrong here, because the MWE provided above works fine when using the system MKL (from oneAPI 2024.0, same as what we use for building liboneapi_support). I'm doing this on IDC, using a Max 1550.
@maleadt It is hard to tell what is going wrong from the logs you sent. Can you please clarify your last paragraph? In your working configuration, are you using DPCPP compiler and oneMKL bits from the same 2024.0 base tool kit release? If yes, what is different in your non-working version? (Compiler? oneMKL?)
Also, what is the driver version you are using? (You can share the results of sycl-ls)
Hi @maleadt - thanks for your work on oneAPI.jl! Intel oneMKL product currently requires the OpenCL GPU runtime even when the Level-Zero backend is used. Could you please install it and see if that resolves the issue?
> In your working configuration, are you using DPCPP compiler and oneMKL bits from the same 2024.0 base tool kit release?
I'm using the tools and libraries that are provisioned by the image on IDC, which according to the website seems to be: Ubuntu 22.04 LTS (Jammy Jellyfish) v20240129, oneAPI base kit 2024.0.1, oneAPI HPC kit 2024.0.1 and oneAPI render kit 2024.0.0
> If yes, what is different in your non-working version? (Compiler? oneMKL?)
I'm using 2024.0.0 from Conda for my wrapper library. That library however isn't built on-device, it's built on a buildbot, and redistributed together with the necessary MKL/SYCL/OpenCL dependencies.
> Intel oneMKL product currently requires the OpenCL GPU runtime even when the Level-Zero backend is used.
We already redistribute the things that our MKL wrapper library depends on, including libopencl, see https://github.com/JuliaPackaging/Yggdrasil/blob/77c11e9e797db54e68a8cfd83eb9b0d38830e80f/O/oneAPI_Support/build_tarballs.jl#L116-L119. This has been working perfectly on other architectures, except PVC. We aim for the redistributable wrapper library to be fully stand-alone, so that users don't have to install anything to get oneAPI.jl to work.
Adding @mkrainiuk to this discussion as she is more familiar with the distribution of oneMKL (interfaces)
@maleadt: Can you run the program with LD_DEBUG=libs for both the failing and the working versions? The problem is likely due to a different OpenCL library being invoked at runtime.
Also adding @kballeda to the thread.
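As a complement to LD_DEBUG=libs, the set of loaded libraries can also be inspected from inside the process. The following is a minimal sketch assuming a glibc-based Linux system, where `dl_iterate_phdr` from `<link.h>` is available. The function names `loaded_shared_objects` and `filter_paths` are hypothetical and exist only for this illustration; unlike LD_DEBUG, this shows only what is mapped at the moment of the call, not the search order.

```cpp
#include <link.h>   // dl_iterate_phdr, struct dl_phdr_info (glibc)
#include <cstddef>
#include <string>
#include <vector>

// Callback invoked once per loaded object; records its path if it has one.
static int collect_object_name(struct dl_phdr_info* info, std::size_t, void* data) {
    auto* names = static_cast<std::vector<std::string>*>(data);
    if (info->dlpi_name != nullptr && info->dlpi_name[0] != '\0') {
        names->emplace_back(info->dlpi_name);
    }
    return 0;  // returning non-zero would stop the iteration early
}

// Paths of the shared objects currently mapped into this process.
inline std::vector<std::string> loaded_shared_objects() {
    std::vector<std::string> names;
    dl_iterate_phdr(collect_object_name, &names);
    return names;
}

// Keep only the paths that contain at least one of the given substrings,
// for example {"libOpenCL", "libze_"}.
inline std::vector<std::string> filter_paths(const std::vector<std::string>& paths,
                                             const std::vector<std::string>& needles) {
    std::vector<std::string> kept;
    for (const std::string& path : paths) {
        for (const std::string& needle : needles) {
            if (path.find(needle) != std::string::npos) {
                kept.push_back(path);
                break;  // avoid adding the same path twice
            }
        }
    }
    return kept;
}
```

Calling `filter_paths(loaded_shared_objects(), {"libOpenCL", "libze_"})` right before the gemm call in both the working and the failing setup would show which copy of each library was actually picked up.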
Here you are: https://gist.github.com/maleadt/55d9069b5c63e381858dbe64d9f690d3. At first sight, everything looks OK there, and all oneMKL-related resources are loaded from the artifacts directory (i.e. there's no pollution by system libraries).
The C++ log has a 'calling init' entry for the following library that does not appear in the Julia log: 128030: calling init: /lib/x86_64-linux-gnu/libze_intel_gpu.so.1. Could that be the problem?
libze_intel_gpu is there on the Julia side too, but it's loaded earlier (when oneAPI.jl loads):
calling init: /home/sdp/.julia/artifacts/2a52b1197a324e3df923175b2035c42899f069f2/lib/libze_intel_gpu.so.1
It turns out the issue was with my libOpenCL.so, which I took from intel-opencl-rt on Conda. It somehow did not support or detect my PVC hardware, and even resulted in clinfo reporting 0 platforms. After switching to Khronos' ICD loader, MKL works fine.
That said, this error as reported before is inscrutable and should be improved to something actionable.
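The clinfo symptom described above (0 platforms) can also be checked programmatically before calling into oneMKL. The sketch below assumes a Linux system where the OpenCL loader is installed as `libOpenCL.so.1`; it loads it with `dlopen` and calls the standard `clGetPlatformIDs` entry point. The function names `opencl_platform_count` and `describe_opencl_status` and the message texts are hypothetical, and this thread only shows that a loader reporting no platforms coincided with the failure in one setup, so treat this as a diagnostic aid rather than a definitive test.

```cpp
#include <dlfcn.h>
#include <cstdint>
#include <string>

// Returns the number of OpenCL platforms visible through the given loader,
// or -1 if the loader cannot be opened or queried.
inline long opencl_platform_count(const char* loader = "libOpenCL.so.1") {
    void* handle = dlopen(loader, RTLD_NOW | RTLD_LOCAL);
    if (handle == nullptr) {
        return -1;  // loader not found or not loadable
    }
    // Same ABI as: cl_int clGetPlatformIDs(cl_uint, cl_platform_id*, cl_uint*)
    using GetPlatformIDs = std::int32_t (*)(std::uint32_t, void**, std::uint32_t*);
    auto get_ids = reinterpret_cast<GetPlatformIDs>(dlsym(handle, "clGetPlatformIDs"));

    long result = -1;
    if (get_ids != nullptr) {
        std::uint32_t count = 0;
        const std::int32_t status = get_ids(0, nullptr, &count);
        if (status == 0) {             // CL_SUCCESS
            result = static_cast<long>(count);
        } else if (status == -1001) {  // CL_PLATFORM_NOT_FOUND_KHR (ICD loader)
            result = 0;
        }
    }
    dlclose(handle);
    return result;
}

// Turns the count into a message a user can act on (wording is illustrative).
inline std::string describe_opencl_status(long platforms) {
    if (platforms < 0) {
        return "OpenCL loader could not be queried";
    }
    if (platforms == 0) {
        return "No OpenCL platform detected";
    }
    return "OpenCL platforms detected: " + std::to_string(platforms);
}
```

Printing `describe_opencl_status(opencl_platform_count())` at startup would have surfaced the broken loader directly instead of the Level-Zero error code. On glibc versions older than 2.34 this needs to be linked with `-ldl`.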
@maleadt Thanks for the update, and glad to see you found the problem. I should have thought of suggesting a clinfo check! Sorry about that.
IMHO, when the user does not supply the right OpenCL library, oneMKL-GEMM could still give correct functionality but issue a warning about low performance. Does it sound reasonable?
> oneMKL-GEMM could still give correct functionality but issue a warning about low performance. Does it sound reasonable?
Yes, that sounds great. Even a fatal error would be a good option, as long as it comes with an error message that would help diagnose the issue (No OpenCL device detected, or whatever).
I would vote for correct functionality + warning. :) Is it ok to close this issue?