oneMKL
oneMKL copied to clipboard
Add new batch_gemm types
Description
This adds new data types for the gemm_batch operation, to better be in line with the oneMKL spec. The types added are <half, half, float, float>, <int8, int8, float, float>, and <int8, int8, int32, float>.
New testing is added for these data types. Tests where the scalar type does not match the input type require a higher tolerance as the reference calculation is being performed at a much higher precision.
Test logs: rocblas_test_log.txt cublas_test_log.txt
I have been unable to test the mkl backends as I was running into some problems regarding duplicate definitions between the mkl headers and the openBlas/CBlas headers.
Fixes # (GitHub issue) https://github.com/oneapi-src/oneMKL/issues/446
Checklist
All Submissions
- [x] Do all unit tests pass locally? Attach a log.
- [x]Have you formatted the code using clang-format?
New interfaces
- [x] Have you provided motivation for adding a new feature as part of RFC and it was accepted? # (RFC)
New features
- [x] Have you provided motivation for adding a new feature?
- [x] Have you added relevant tests?
@Rbiessy, cc
@AidanBeltonS Thanks for the PR. Before going through the review in more detail, what is your plan for this issue? Why openBLAS come into picture here? I would prefer to have all applicable backends working before adding these new APIs.
"I have been unable to test the mkl backends as I was running into some problems regarding duplicate definitions between the mkl headers and the openBlas/CBlas headers."
@AidanBeltonS Thanks for the PR. Before going through the review in more detail, what is your plan for this issue? Why openBLAS come into picture here? I would prefer to have all applicable backends working before adding these new APIs.
"I have been unable to test the mkl backends as I was running into some problems regarding duplicate definitions between the mkl headers and the openBlas/CBlas headers."
Hey @mmeterel, I checked with Aidan about the issue with the MKL backends. The duplicate definitions seemed to be an issue with the setup or build commands used. We ran into another issue with undefined references with iamax and iamin functions using buffers using 2024.1 oneAPI base toolkit. Just a few example:
/usr/bin/ld: lib/libonemkl_blas_mklcpu.so.0: undefined reference to `oneapi::mkl::blas::row_major::iamax(sycl::_V1::queue&, long, sycl::_V1::buffer<std::complex<double>, 1, sycl::_V1::detail::aligned_allocator<std::complex<double> >, void>&, long, sycl::_V1::buffer<long, 1, sycl::_V1::detail::aligned_allocator<long>, void>&)'
/usr/bin/ld: lib/libonemkl_blas_mklcpu.so.0: undefined reference to `oneapi::mkl::blas::column_major::iamin(sycl::_V1::queue&, long, sycl::_V1::buffer<double, 1, sycl::_V1::detail::aligned_allocator<double>, void>&, long, sycl::_V1::buffer<long, 1, sycl::_V1::detail::aligned_allocator<long>, void>&)'
/usr/bin/ld: lib/libonemkl_blas_mklcpu.so.0: undefined reference to `oneapi::mkl::blas::column_major::iamin(sycl::_V1::queue&, long, sycl::_V1::buffer<std::complex<double>, 1, sycl::_V1::detail::aligned_allocator<std::complex<double> >, void>&, long, sycl::_V1::buffer<long, 1, sycl::_V1::detail::aligned_allocator<long>, void>&)'
/usr/bin/ld: lib/libonemkl_blas_mklcpu.so.0: undefined reference to `oneapi::mkl::blas::column_major::iamax(sycl::_V1::queue&, long, sycl::_V1::buffer<float, 1, sycl::_V1::detail::aligned_allocator<float>, void>&, long, sycl::_V1::buffer<long, 1, sycl::_V1::detail::aligned_allocator<long>, void>&)'
Looking at libmkl_sycl_blas.so.4 in 2024.1 these functions expect an index_base as a last argument but is not there in oneMKL:
$ readelf -Wa /path/to/mkl/latest/lib/libmkl_sycl_blas.so.4 | c++filt -t | grep "row_major::iamax(sycl::_V1::queue&, long, sycl::_V1::buffer<std::complex<double>"
1302: 0000000002b573e0 9 FUNC GLOBAL DEFAULT 11 oneapi::mkl::blas::row_major::iamax(sycl::_V1::queue&, long, sycl::_V1::buffer<std::complex<double>, 1, sycl::_V1::detail::aligned_allocator<std::complex<double> >, void>&, long, sycl::_V1::buffer<int, 1, sycl::_V1::detail::aligned_allocator<int>, void>&, oneapi::mkl::index_base)
8510: 0000000002b573d0 9 FUNC GLOBAL DEFAULT 11 oneapi::mkl::blas::row_major::iamax(sycl::_V1::queue&, long, sycl::_V1::buffer<std::complex<double>, 1, sycl::_V1::detail::aligned_allocator<std::complex<double> >, void>&, long, sycl::_V1::buffer<long, 1, sycl::_V1::detail::aligned_allocator<long>, void>&, oneapi::mkl::index_base)
We can use 2024.0 for the tests for now. Aidan is running more tests.
@AidanBeltonS Thanks for the PR. Before going through the review in more detail, what is your plan for this issue? Why openBLAS come into picture here? I would prefer to have all applicable backends working before adding these new APIs. "I have been unable to test the mkl backends as I was running into some problems regarding duplicate definitions between the mkl headers and the openBlas/CBlas headers."
Hey @mmeterel, I checked with Aidan about the issue with the MKL backends. The duplicate definitions seemed to be an issue with the setup or build commands used. We ran into another issue with undefined references with
iamaxandiaminfunctions using buffers using 2024.1 oneAPI base toolkit. Just a few example:/usr/bin/ld: lib/libonemkl_blas_mklcpu.so.0: undefined reference to `oneapi::mkl::blas::row_major::iamax(sycl::_V1::queue&, long, sycl::_V1::buffer<std::complex<double>, 1, sycl::_V1::detail::aligned_allocator<std::complex<double> >, void>&, long, sycl::_V1::buffer<long, 1, sycl::_V1::detail::aligned_allocator<long>, void>&)' /usr/bin/ld: lib/libonemkl_blas_mklcpu.so.0: undefined reference to `oneapi::mkl::blas::column_major::iamin(sycl::_V1::queue&, long, sycl::_V1::buffer<double, 1, sycl::_V1::detail::aligned_allocator<double>, void>&, long, sycl::_V1::buffer<long, 1, sycl::_V1::detail::aligned_allocator<long>, void>&)' /usr/bin/ld: lib/libonemkl_blas_mklcpu.so.0: undefined reference to `oneapi::mkl::blas::column_major::iamin(sycl::_V1::queue&, long, sycl::_V1::buffer<std::complex<double>, 1, sycl::_V1::detail::aligned_allocator<std::complex<double> >, void>&, long, sycl::_V1::buffer<long, 1, sycl::_V1::detail::aligned_allocator<long>, void>&)' /usr/bin/ld: lib/libonemkl_blas_mklcpu.so.0: undefined reference to `oneapi::mkl::blas::column_major::iamax(sycl::_V1::queue&, long, sycl::_V1::buffer<float, 1, sycl::_V1::detail::aligned_allocator<float>, void>&, long, sycl::_V1::buffer<long, 1, sycl::_V1::detail::aligned_allocator<long>, void>&)'Looking at
libmkl_sycl_blas.so.4in 2024.1 these functions expect anindex_baseas a last argument but is not there in oneMKL:$ readelf -Wa /path/to/mkl/latest/lib/libmkl_sycl_blas.so.4 | c++filt -t | grep "row_major::iamax(sycl::_V1::queue&, long, sycl::_V1::buffer<std::complex<double>" 1302: 0000000002b573e0 9 FUNC GLOBAL DEFAULT 11 oneapi::mkl::blas::row_major::iamax(sycl::_V1::queue&, long, sycl::_V1::buffer<std::complex<double>, 1, sycl::_V1::detail::aligned_allocator<std::complex<double> >, void>&, long, sycl::_V1::buffer<int, 1, sycl::_V1::detail::aligned_allocator<int>, void>&, oneapi::mkl::index_base) 8510: 0000000002b573d0 9 FUNC GLOBAL DEFAULT 11 oneapi::mkl::blas::row_major::iamax(sycl::_V1::queue&, long, sycl::_V1::buffer<std::complex<double>, 1, sycl::_V1::detail::aligned_allocator<std::complex<double> >, void>&, long, sycl::_V1::buffer<long, 1, sycl::_V1::detail::aligned_allocator<long>, void>&, oneapi::mkl::index_base)We can use 2024.0 for the tests for now. Aidan is running more tests.
@Rbiessy @AidanBeltonS AFAIK, there should not be any issues with missing symbols with 2024.1. This version has been in CI for a while now. I would suspect it can be a rebase issue on your branch. We should make it functional with 2024.1.
@andrewtbarker Will you be able to help with this review?
@andrewtbarker Will you be able to help with this review?
Sure, I will take a look.
@Rbiessy @AidanBeltonS AFAIK, there should not be any issues with missing symbols with 2024.1. This version has been in CI for a while now. I would suspect it can be a rebase issue on your branch. We should make it functional with 2024.1.
Yes, this should have been fixed in #445 . If not we should fix it.
Have you tested the PR with hipSYCL/AdaptiveSYCL? Can you please add the logs?
No I have not tested HIPsycl. I have attached the other backend tests below. Netlib and portblas are passing fine. MKL has some failing tests due to tolerating which I am investigating further. It seems it deviates more from the reference implementation in some cases. mkl_test_log.txt netlib_test_log.txt port_blas_test_logs.txt
MKL tests error: mkl_test_log.txt
MKL has some failing tests due to tolerating which I am investigating further.
It looks like dotc and dotu have segfaults in your tests. Initially I think this is unlikely to be due to your PR but have you looked at this at all?
MKL has some failing tests due to tolerating which I am investigating further.
It looks like
dotcanddotuhave segfaults in your tests. Initially I think this is unlikely to be due to your PR but have you looked at this at all?
The failures it Dot are due to error
[ RUN ] DotTestSuite/DotTests.RealDoubleSinglePrecision/Row_Major_Intel_R__Data_Center_GPU_Max_1100
relative error = 1.83849e-08 absolute error = 1.24863e-07 limit = 3.01315e-13
Difference in result: DPC++ 6.79159 vs. Reference 6.79159
/home/aidanbelton/source/oneMKL/tests/unit_tests/blas/level1/dot.cpp:157: Failure
Expected equality of these values:
res
Which is: 0
1
[ FAILED ] DotTestSuite/DotTests.RealDoubleSinglePrecision/Row_Major_Intel_R__Data_Center_GPU_Max_1100, where GetParam() = (0x560f5e0, 1-byte object <00>) (1 ms)
DotU is an odd one, it does not appear to be related to my changes however
[ RUN ] DotuTestSuite/DotuTests.ComplexSinglePrecision/Row_Major_Intel_R__Data_Center_GPU_Max_1100
Caught synchronous SYCL exception during DOTU:
The program was built for 1 devices
Build program log for 'Intel(R) Data Center GPU Max 1100':
-11 (PI_ERROR_BUILD_PROGRAM_FAILURE) -11 (PI_ERROR_BUILD_PROGRAM_FAILURE)
OpenCL status: sycl:7
unknown file: Failure
C++ exception with description "Enqueue process failed. -59 (PI_ERROR_INVALID_OPERATION)" thrown in the test body.
[ FAILED ] DotuTestSuite/DotuTests.ComplexSinglePrecision/Row_Major_Intel_R__Data_Center_GPU_Max_1100, where GetParam() = (0x560f5e0, 1-byte object <00>) (0 ms)
I have resolved all but one issue with GemmBatch's tests. The CPU MKL implementation has significant amounts of error compared to the GPU. I believe there may be a fundamental difference in the precision of the calculation for the CPU. One possible fix would be to increase the tolerance significantly just for the CPU. Im not a fan of this approach as it is a bit of a brute force solution. Does anyone have any recommendations on how they would like to see this handled?
[ RUN ] GemmBatchUsmTestSuite/GemmBatchUsmTests.RealIntRealScalarPrecision/Column_Major_Intel_R__Xeon_R__Gold_5418Y
relative error = 0.000911658 absolute error = 0.00168478 limit = 0.000333786
Difference in entry (58,119): DPC++ 1.84973 vs. Reference 1.84804
relative error = 0.000812301 absolute error = 0.00121021 limit = 0.000333786
Difference in entry (0,124): DPC++ 1.49107 vs. Reference 1.48986
relative error = 0.000534697 absolute error = 0.000857353 limit = 0.000333786
Difference in entry (17,144): DPC++ 1.60258 vs. Reference 1.60344
relative error = 0.000527185 absolute error = 0.00049144 limit = 0.000333786
Difference in entry (52,186): DPC++ -0.932689 vs. Reference -0.932197
/home/aidanbelton/source/oneMKL/tests/unit_tests/blas/batch/gemm_batch_usm.cpp:408: Failure
Expected equality of these values:
res
Which is: 0
1
[ FAILED ] GemmBatchUsmTestSuite/GemmBatchUsmTests.RealIntRealScalarPrecision/Column_Major_Intel_R__Xeon_R__Gold_5418Y, where GetParam() = (0x56845d0, 1-byte object <01>) (331 ms)
No I have not tested HIPsycl. I have attached the other backend tests below. Netlib and portblas are passing fine. MKL has some failing tests due to tolerating which I am investigating further. It seems it deviates more from the reference implementation in some cases. mkl_test_log.txt netlib_test_log.txt port_blas_test_logs.txt
MKL tests error: mkl_test_log.txt
Can you please test hipSYCL backend as well?
Does anyone have any recommendations on how they would like to see this handled?
If, as we suspect, the CPU backend is doing accumulation in double while the GPU backend does it in float, one option would be changing what reference gemm from tests/unit_tests/blas/include/reference_blas_templates.hpp we call (might need to add a reference gemm in that file).
What is the status here? As I see it we have three outstanding items:
- AdaptiveCpp testing
- Test names (my most recent review, minor)
- Failure in
RealIntRealScalarPrecision
(1) may be a larger issue with CI that in my opinion can be dealt with separately in another PR. (2) is minor and should be easy to fix. I hope (3) is also minor but I'm not sure, is there any progress understanding it?
Hi @andrewtbarker, I have updated the status by email as it was easier to discuss issues with testing AdaptiveCpp on the CI. In short there are a few issues @AidanBeltonS will need to look at once he is back from Holiday next week!
What is the status here? As I see it we have three outstanding items:
1. AdaptiveCpp testing 2. Test names (my most recent review, minor) 3. Failure in `RealIntRealScalarPrecision`(1) may be a larger issue with CI that in my opinion can be dealt with separately in another PR. (2) is minor and should be easy to fix. I hope (3) is also minor but I'm not sure, is there any progress understanding it?
I have addressed items 2. and 3. To resolve 3 I am scaling the tolerance by the possible input range from int8 matricies. i.e. 256 I have yet to test this with AdaptiveCpp, Ill start looking at that shortly
I have disabled the (int8, int8, float, float) dtype combinations on the CuBlas backend due to failing tests on AdaptiveCPP. This is because I do not currently have bandwidth to solve this problem. The root goal of this PR was to get the (half, half, float, float) combination for llama.cpp, and that is working.
I think it would be easier for another person to pick up this work in the future, as the interface, testing, and structure that is needed already exists.
I think it would be easier for another person to pick up this work in the future, as the interface, testing, and structure that is needed already exists.
Can you file an issue documenting this so we can track it?
I think it would be easier for another person to pick up this work in the future, as the interface, testing, and structure that is needed already exists.
Can you file an issue documenting this so we can track it?
Sure, I have created #506
I have also disabled the in8, float combination for MKLCPU/GPU as I found similar precision issues. #506 will be updated to reflect this
I have confirmed the tests pass with AdaptiveCpp on AMD and Nvidia HW.