cuBLAS tests failed after aligning with SYCL 2020 specification
Summary
After the cuBLAS backend was aligned with the SYCL 2020 specification, a large number of cuBLAS tests failed.
Environment
- OS: Linux
- Hardware: NVIDIA TITAN RTX
- Backend library version: CUDA 10.2
- Compiler version: sycl-nightly 20220210
Steps to reproduce
Clone the branch: https://github.com/dnhsieh-intel/oneMKL/tree/SYCL_2020_cuBLAS
$ mkdir build && cd build
$ cmake .. -DENABLE_CUBLAS_BACKEND=True -DENABLE_MKLCPU_BACKEND=False -DENABLE_MKLGPU_BACKEND=False -DREF_BLAS_ROOT=<reference_blas_install_prefix> -DTARGET_DOMAINS=blas
$ cmake --build .
$ ctest --output-on-failure
Observed behavior
Some cuBLAS tests that use dynamic libraries reported segmentation faults in column-major cases. Output of the cuBLAS unit tests: cublas_unit_tests.txt (81% tests passed, 321 tests failed out of 1668)
Examples of failed tests:
Start 1: BLAS/RT/Nrm2TestSuite/Nrm2Tests.RealSinglePrecision/Column_Major_TITAN_RTX
1/1668 Test #1: BLAS/RT/Nrm2TestSuite/Nrm2Tests.RealSinglePrecision/Column_Major_TITAN_RTX ..................................***Exception: SegFault 1.42 sec
Start 155: BLAS/RT/CopyUsmTestSuite/CopyUsmTests.RealSinglePrecision/Column_Major_TITAN_RTX
155/1668 Test #155: BLAS/RT/CopyUsmTestSuite/CopyUsmTests.RealSinglePrecision/Column_Major_TITAN_RTX ............................***Exception: SegFault 0.58 sec
Run this program with --terse_output to change the way it prints its output.
Note: Google Test filter = CopyUsmTestSuite/CopyUsmTests.RealSinglePrecision/Column_Major_TITAN_RTX
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from CopyUsmTestSuite/CopyUsmTests
[ RUN ] CopyUsmTestSuite/CopyUsmTests.RealSinglePrecision/Column_Major_TITAN_RTX
relative error = 1.59858 absolute error = 0.543816 limit = 0.000161767
Difference in entry 0: DPC++ -0.203628 vs. Reference 0.340188
relative error = 2.13229 absolute error = 0.225206 limit = 0.000161767
Difference in entry 1: DPC++ -0.330823 vs. Reference -0.105617
relative error = 2.63611 absolute error = 0.746282 limit = 0.000161767
...
Expected behavior
All cuBLAS tests pass.
@npmiller Would it be possible for you to take a look at this issue when you have a chance? Given your experience with PR #136, Mesut (@mmeterel) suggested that asking you for direction would be the most efficient approach.
Hello @dnhsieh-intel, I've had a look at this and I can reproduce the issues: I'm not getting segfaults, but I am getting incorrect results.
I don't think it's related to #136, but there does seem to be something strange here, maybe with the USM transfers or maybe a race condition in DPC++; I'm not sure yet.
I did notice that the tests wait on the SYCL events; waiting on the SYCL queue instead seems to fix the issue. This also means that the _ct version of the tests works fine, but the _rt version is indeed flaky.
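For illustration, here is a minimal sketch of the two wait patterns (hypothetical variable names; nrm2 stands in for any of the failing USM calls):

```cpp
sycl::event done = oneapi::mkl::blas::column_major::nrm2(queue, n, x, incx, result);

// What the _rt tests do today: this may return before the cuBLAS work is done.
done.wait();

// Waiting on the whole queue instead appears to avoid the failures.
queue.wait();
```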
It seems to be very similar to what @densamoilov saw in oneDNN here:
- https://github.com/oneapi-src/oneDNN/pull/1285#discussion_r816507853
But a clean build doesn't fix it in this scenario, so I suspect there might actually be an underlying issue.
Okay, so I have figured out what is going on: with the way interop_task and host_task are currently defined, they are required to be synchronous, meaning that starting asynchronous work in the lambda and not waiting for it is undefined behavior.
If you look at the proposal specification for interop_task, it says:
It is the user's responsibility to ensure that all operations performed inside the interop_task are finished before returning from it.
And this is also the case for host_task; the two are meant to have the exact same semantics.
However, the two are implemented a bit differently in DPC++, and it looks like interop_task currently ends up adding a wait on the CUDA stream. That's what we're seeing here: under the hood, waiting on the event of an interop_task waits on the CUDA stream, so the cuBLAS operation finishes. For a host_task it doesn't, and that's why we're seeing incorrect results: the cuBLAS operation has not completed yet.
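To make that concrete, here is a rough sketch of the pattern in question (not the actual oneMKL wrapper code; `handle` and `stream` stand for the cuBLAS handle and the CUDA stream tied to the queue, and how they are obtained is elided):

```cpp
#include <CL/sycl.hpp>
#include <cublas_v2.h>

void nrm2_sketch(sycl::queue& queue, cublasHandle_t handle, cudaStream_t stream,
                 int n, const float* x, int incx, float* result) {
    queue.submit([&](sycl::handler& cgh) {
        cgh.host_task([=]() {
            cublasSetStream(handle, stream);
            // cublasSnrm2 merely enqueues the reduction on `stream` and returns.
            cublasSnrm2(handle, n, x, incx, result);
            // Per the proposal, this work must be finished before the lambda
            // returns, i.e. strictly we would need:
            //   cudaStreamSynchronize(stream);
            // interop_task happens to wait on the CUDA stream under the hood;
            // host_task does not, hence the incorrect results.
        });
    });
}
```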
So both oneMKL and DPC++ seem to be making an incorrect assumption about how interop_task is supposed to work, even though the current "incorrect" implementation is likely more efficient than the specified behavior.
It looks to me like the way oneMKL is using these is more in line with hipSYCL's custom operations, but the two are quite different. The lambdas of host_task and interop_task are essentially "host kernels": they execute at the exact same time a kernel would, and once the lambda has finished executing all of the work is meant to be done, so waiting on a host_task event signals that the lambda has finished executing. With custom operations, by contrast, the lambda is executed when the command is enqueued and may only contain asynchronous work, and waiting on a custom-operation event signals that the asynchronous work has finished executing.
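As a rough side-by-side, based on my reading of hipSYCL's documented enqueue-custom-operation extension (the helper functions are placeholders):

```cpp
// host_task: a "host kernel". The lambda runs when its dependencies are met,
// and its event completes when the lambda returns, so everything inside must
// be synchronous.
cgh.host_task([=]() {
    do_synchronous_work();  // placeholder
});

// hipSYCL custom operation: the lambda runs at submit time and may only
// enqueue asynchronous work; the event completes when that work finishes.
cgh.hipSYCL_enqueue_custom_operation([=](sycl::interop_handle& ih) {
    enqueue_async_work_on_native_stream(ih);  // placeholder
});
```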
@npmiller Thank you so much for the investigation! @andrewtbarker Could you take a look please?
I've prepared a pull request into @dnhsieh-intel 's branch as a starting point for discussion.
In my view there are two issues here: (1) functional failures (wrong answers) and (2) segfaults. For now I am only focused on (1).
My solution in the linked PR is to try to recreate the previous behavior of interop_task within our cublas wrappers using host_task. Whether or not this is the correct or optimal behavior based on the spec, I think this should be similar to what we had in this project before transitioning away from interop_task. @npmiller do you agree? Is there a better way to accomplish this?
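In outline, and reusing the assumed names from the sketch above rather than the exact code in the PR, the idea is to block inside the host_task so that its event really signals completion of the cuBLAS work:

```cpp
cgh.host_task([=]() {
    cublasSetStream(handle, stream);
    cublasSnrm2(handle, n, x, incx, result);
    // Blocking here makes the task synchronous again, so the host_task's
    // event does not complete until the cuBLAS work has finished.
    cudaStreamSynchronize(stream);
});
```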
Meaning that starting asynchronous work in the lambda and not waiting for it is undefined behavior. And this is also the case for host_task; the two are meant to have the exact same semantics.
There is nothing preventing starting some work from such tasks in SYCL. It is just normal C++ and obviously it requires some planning about what is launched. For example starting a std::thread and never joining it later would be UB.
In my view there are two issues here: (1) functional failures (wrong answers) and (2) segfaults. For now I am only focused on (1).
@andrewtbarker do you still see the segfaults with your PR? I suspect it might fix both issues.
I've checked out your PR, and I do think that should fix the issue. I would be curious to know if there are performance implications of using host_task plus this patch as opposed to interop_task like before.
There is nothing preventing starting some work from such tasks in SYCL. It is just normal C++ and obviously it requires some planning about what is launched. For example starting a std::thread and never joining it later would be UB.
@keryell you're right, it is absolutely possible. It's just a lot trickier to handle than interop_task, because with host_task the SYCL runtime has no idea about the asynchronous work; as far as the runtime is concerned, once the host_task lambda has finished, all tasks depending on it can start.
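For example, with placeholder helpers (and `queue` and `stream` as before), the hazard looks like this:

```cpp
sycl::event e = queue.submit([&](sycl::handler& cgh) {
    cgh.host_task([=]() {
        launch_async_work(stream);  // placeholder: enqueues on the stream, returns at once
    });
});

queue.submit([&](sycl::handler& cgh) {
    cgh.depends_on(e);  // satisfied as soon as the lambda has returned...
    cgh.host_task([=]() {
        read_results();  // placeholder: ...so this may race with the CUDA work
    });
});
```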
I do continue to see segfaults in some cases on my branch - they happen at the very end of the program after tests have reported "passed". I haven't yet investigated further.
I have traced the segfaults to the destruction of the cublas handle on line 74 of cublas_scope_handle.cpp:
CUBLAS_ERROR_FUNC(cublasDestroy, err1, handle);
It fails with the latest compiler I have access to. This doesn't get called twice (so if it's a double delete it happens somewhere else) and gdb does not think the pointer is null.
I see, that's interesting, it could have to do with the context stuff in https://github.com/oneapi-src/oneMKL/pull/136 still.
However, I'm still unable to reproduce it locally with the latest DPC++ compiler, so I'm not too sure what's going on. What version of the compiler are you using? There were also some issues regarding host tasks and contexts in the SYCL runtime, but that was quite a while ago.
Here's the version string for the particular clang++ that cmake is picking up. I've tried a couple other versions with the same result, this is the most recent:
clang version 15.0.0 (https://github.com/intel/llvm cae91b2b13af3915f4b6f990216146adde973903)
Can you share the script or command line you use to invoke cmake?
Sure, I'm doing the following:
export PATH=/path/to/llvm/build/bin:$PATH
export LD_LIBRARY_PATH=/path/to/llvm/build/lib
cmake -GNinja .. -DENABLE_CUBLAS_BACKEND=True -DENABLE_MKLCPU_BACKEND=False -DENABLE_MKLGPU_BACKEND=False -DREF_BLAS_ROOT=/usr -DTARGET_DOMAINS=blas
ninja
ctest
And everything passes:
100% tests passed, 0 tests failed out of 1668
I've rebuilt with the revision you provided (cae91b2b13af3915f4b6f990216146adde973903), and I'm still not getting any segfault; not sure if there's something wrong in my setup or if I'm just being lucky.
Are the segfaults flaky? Do they happen every time? Even when running the tests individually?
The segfaults happen every time, even when running tests individually. I spent a little time trying to make a small reproducer but didn't make much progress. As far as I can tell, every cuBLAS RT test fails; the ones reported as passing in CI are actually just skipped, nothing runs.
@npmiller Which CUDA version are you using? 11.x? We have 10.2 installed. Do you think that would make a difference?
I've prepared a pull request into @dnhsieh-intel 's branch as a starting point for discussion.
In my view there are two issues here: (1) functional failures (wrong answers) and (2) segfaults. For now I am only focused on (1).
My solution in the linked PR is to try to recreate the previous behavior of interop_task within our cublas wrappers using host_task. Whether or not this is the correct or optimal behavior based on the spec, I think this should be similar to what we had in this project before transitioning away from interop_task. @npmiller do you agree? Is there a better way to accomplish this?
@andrewtbarker Recently, @aelizaro merged https://github.com/oneapi-src/oneMKL/pull/169, which brought oneMKL cuBLAS backend performance on par with native cuBLAS. Can we check whether this commit has any impact on performance? @aelizaro might have benchmarks we can reuse.
I ran a few basic performance tests, just using the reproducer from #106, and I can't distinguish between develop branch and my branch in performance:
| m=n=k | my partial fix (gflops) | develop (gflops) |
|---|---|---|
| 1024 | 138.639 | 138.637 |
| 4096 | 1600.91 | 1600.84 |
I ran a few basic performance tests, just using the reproducer from #106, and I can't distinguish between develop branch and my branch in performance:
| m=n=k | my partial fix (gflops) | develop (gflops) |
|---|---|---|
| 1024 | 138.639 | 138.637 |
| 4096 | 1600.91 | 1600.84 |
@andrewtbarker Thanks for confirming.
@npmiller Which CUDA version are you using? 11.x? We have 10.2 installed. Do you think that would make a difference?
I am running with 11.6. I've tried switching to 10.2, but I'm still not seeing any segfaults there, so I'm not sure it's related.
I have a smaller reproducer; unfortunately it's not very informative to me. Here's the source:
#include <iostream>
#include "oneapi/mkl.hpp"
int main(int argc, char * argv[]) {
    std::cout << "Basic segfault reproducer, issue #175." << std::endl;
    auto dev = sycl::device(sycl::gpu_selector());
    sycl::queue queue(dev);
    sycl::context cxt = queue.get_context();
    const int N = 10;
    const int incx = 1;
    float * result_p = (float*) sycl::aligned_alloc_shared(64, sizeof(float), dev, cxt);
    auto ua = sycl::usm_allocator<float, sycl::usm::alloc::shared, 64>(cxt, dev);
    std::vector<float, decltype(ua)> x(ua);
    for (int i = 0; i < N; ++i) {
        x.push_back(i * 1.72);
    }
    std::cout << " before actual call\n";
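    // Note: the event returned by nrm2 below is never waited on, and result_p
    // is freed as soon as the call returns on the host.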
    auto done = oneapi::mkl::blas::column_major::nrm2(queue, N, x.data(), incx, result_p);
    std::cout << " after actual call\n";
    sycl::free(result_p, cxt);
    std::cout << "done.\n";
    return 0;
}
One compilation and two separate link lines:
clang++ -g -Wno-deprecated-declarations -I${ONEMKL}/include -isystem ${LLVM_DIR}/compiler/include/sycl -isystem /usr/local/cuda/include -fPIC -fsycl -o app.o -c app.cpp
clang++ -g -fPIC -fsycl -L${ONEMKL}/lib -lonemkl app.o -o app1.exe
clang++ -g -fPIC -fsycl /usr/lib/x86_64-linux-gnu/libcublas.so -L${ONEMKL}/lib -lonemkl app.o -o app2.exe
For me, app1.exe crashes with a segfault at the very end of execution, while app2.exe runs to completion with no errors. This suggests that on our system we are somehow picking up a wrong standard library and some symbol is resolving incorrectly. It probably has more to do with a local configuration than with the actual code or build system.
That's pretty strange. I tried the reproducer and your instructions, but I'm still not getting any segfault. You might be right that it's some sort of local configuration issue.
Is there any "stack" trace with cuda-gdb or whatever that works with cuBLAS?
Closing this issue as it cannot be reproduced now. It appears to be related to the machine environment and/or compiler packages.