trafficstars

Description

This adds new data types for the gemm_batch operation, to better be in line with the oneMKL spec. The types added are <half, half, float, float>, <int8, int8, float, float>, and <int8, int8, int32, float>.

New testing is added for these data types. Tests where the scalar type does not match the input type require a higher tolerance as the reference calculation is being performed at a much higher precision.

Test logs: rocblas_test_log.txt cublas_test_log.txt

I have been unable to test the mkl backends as I was running into some problems regarding duplicate definitions between the mkl headers and the openBlas/CBlas headers.

Fixes # (GitHub issue) https://github.com/oneapi-src/oneMKL/issues/446

Checklist

All Submissions

[x] Do all unit tests pass locally? Attach a log.
[x]Have you formatted the code using clang-format?

New interfaces

[x] Have you provided motivation for adding a new feature as part of RFC and it was accepted? # (RFC)

New features

[x] Have you provided motivation for adding a new feature?
[x] Have you added relevant tests?

Apr 03 '24 14:04 AidanBeltonS

@Rbiessy, cc

Apr 03 '24 14:04 AidanBeltonS

@AidanBeltonS Thanks for the PR. Before going through the review in more detail, what is your plan for this issue? Why openBLAS come into picture here? I would prefer to have all applicable backends working before adding these new APIs.

"I have been unable to test the mkl backends as I was running into some problems regarding duplicate definitions between the mkl headers and the openBlas/CBlas headers."

Apr 03 '24 16:04 mmeterel

@AidanBeltonS Thanks for the PR. Before going through the review in more detail, what is your plan for this issue? Why openBLAS come into picture here? I would prefer to have all applicable backends working before adding these new APIs.

"I have been unable to test the mkl backends as I was running into some problems regarding duplicate definitions between the mkl headers and the openBlas/CBlas headers."

Hey @mmeterel, I checked with Aidan about the issue with the MKL backends. The duplicate definitions seemed to be an issue with the setup or build commands used. We ran into another issue with undefined references with iamax and iamin functions using buffers using 2024.1 oneAPI base toolkit. Just a few example:

/usr/bin/ld: lib/libonemkl_blas_mklcpu.so.0: undefined reference to `oneapi::mkl::blas::row_major::iamax(sycl::_V1::queue&, long, sycl::_V1::buffer<std::complex<double>, 1, sycl::_V1::detail::aligned_allocator<std::complex<double> >, void>&, long, sycl::_V1::buffer<long, 1, sycl::_V1::detail::aligned_allocator<long>, void>&)'
/usr/bin/ld: lib/libonemkl_blas_mklcpu.so.0: undefined reference to `oneapi::mkl::blas::column_major::iamin(sycl::_V1::queue&, long, sycl::_V1::buffer<double, 1, sycl::_V1::detail::aligned_allocator<double>, void>&, long, sycl::_V1::buffer<long, 1, sycl::_V1::detail::aligned_allocator<long>, void>&)'
/usr/bin/ld: lib/libonemkl_blas_mklcpu.so.0: undefined reference to `oneapi::mkl::blas::column_major::iamin(sycl::_V1::queue&, long, sycl::_V1::buffer<std::complex<double>, 1, sycl::_V1::detail::aligned_allocator<std::complex<double> >, void>&, long, sycl::_V1::buffer<long, 1, sycl::_V1::detail::aligned_allocator<long>, void>&)'
/usr/bin/ld: lib/libonemkl_blas_mklcpu.so.0: undefined reference to `oneapi::mkl::blas::column_major::iamax(sycl::_V1::queue&, long, sycl::_V1::buffer<float, 1, sycl::_V1::detail::aligned_allocator<float>, void>&, long, sycl::_V1::buffer<long, 1, sycl::_V1::detail::aligned_allocator<long>, void>&)'

Looking at libmkl_sycl_blas.so.4 in 2024.1 these functions expect an index_base as a last argument but is not there in oneMKL:

$ readelf -Wa /path/to/mkl/latest/lib/libmkl_sycl_blas.so.4 | c++filt -t | grep "row_major::iamax(sycl::_V1::queue&, long, sycl::_V1::buffer<std::complex<double>"
  1302: 0000000002b573e0     9 FUNC    GLOBAL DEFAULT   11 oneapi::mkl::blas::row_major::iamax(sycl::_V1::queue&, long, sycl::_V1::buffer<std::complex<double>, 1, sycl::_V1::detail::aligned_allocator<std::complex<double> >, void>&, long, sycl::_V1::buffer<int, 1, sycl::_V1::detail::aligned_allocator<int>, void>&, oneapi::mkl::index_base)
  8510: 0000000002b573d0     9 FUNC    GLOBAL DEFAULT   11 oneapi::mkl::blas::row_major::iamax(sycl::_V1::queue&, long, sycl::_V1::buffer<std::complex<double>, 1, sycl::_V1::detail::aligned_allocator<std::complex<double> >, void>&, long, sycl::_V1::buffer<long, 1, sycl::_V1::detail::aligned_allocator<long>, void>&, oneapi::mkl::index_base)

We can use 2024.0 for the tests for now. Aidan is running more tests.

Apr 04 '24 12:04 Rbiessy

@AidanBeltonS Thanks for the PR. Before going through the review in more detail, what is your plan for this issue? Why openBLAS come into picture here? I would prefer to have all applicable backends working before adding these new APIs. "I have been unable to test the mkl backends as I was running into some problems regarding duplicate definitions between the mkl headers and the openBlas/CBlas headers."

Hey @mmeterel, I checked with Aidan about the issue with the MKL backends. The duplicate definitions seemed to be an issue with the setup or build commands used. We ran into another issue with undefined references with iamax and iamin functions using buffers using 2024.1 oneAPI base toolkit. Just a few example:
/usr/bin/ld: lib/libonemkl_blas_mklcpu.so.0: undefined reference to `oneapi::mkl::blas::row_major::iamax(sycl::_V1::queue&, long, sycl::_V1::buffer<std::complex<double>, 1, sycl::_V1::detail::aligned_allocator<std::complex<double> >, void>&, long, sycl::_V1::buffer<long, 1, sycl::_V1::detail::aligned_allocator<long>, void>&)'
/usr/bin/ld: lib/libonemkl_blas_mklcpu.so.0: undefined reference to `oneapi::mkl::blas::column_major::iamin(sycl::_V1::queue&, long, sycl::_V1::buffer<double, 1, sycl::_V1::detail::aligned_allocator<double>, void>&, long, sycl::_V1::buffer<long, 1, sycl::_V1::detail::aligned_allocator<long>, void>&)'
/usr/bin/ld: lib/libonemkl_blas_mklcpu.so.0: undefined reference to `oneapi::mkl::blas::column_major::iamin(sycl::_V1::queue&, long, sycl::_V1::buffer<std::complex<double>, 1, sycl::_V1::detail::aligned_allocator<std::complex<double> >, void>&, long, sycl::_V1::buffer<long, 1, sycl::_V1::detail::aligned_allocator<long>, void>&)'
/usr/bin/ld: lib/libonemkl_blas_mklcpu.so.0: undefined reference to `oneapi::mkl::blas::column_major::iamax(sycl::_V1::queue&, long, sycl::_V1::buffer<float, 1, sycl::_V1::detail::aligned_allocator<float>, void>&, long, sycl::_V1::buffer<long, 1, sycl::_V1::detail::aligned_allocator<long>, void>&)'
Looking at libmkl_sycl_blas.so.4 in 2024.1 these functions expect an index_base as a last argument but is not there in oneMKL:
$ readelf -Wa /path/to/mkl/latest/lib/libmkl_sycl_blas.so.4 | c++filt -t | grep "row_major::iamax(sycl::_V1::queue&, long, sycl::_V1::buffer<std::complex<double>"
  1302: 0000000002b573e0     9 FUNC    GLOBAL DEFAULT   11 oneapi::mkl::blas::row_major::iamax(sycl::_V1::queue&, long, sycl::_V1::buffer<std::complex<double>, 1, sycl::_V1::detail::aligned_allocator<std::complex<double> >, void>&, long, sycl::_V1::buffer<int, 1, sycl::_V1::detail::aligned_allocator<int>, void>&, oneapi::mkl::index_base)
  8510: 0000000002b573d0     9 FUNC    GLOBAL DEFAULT   11 oneapi::mkl::blas::row_major::iamax(sycl::_V1::queue&, long, sycl::_V1::buffer<std::complex<double>, 1, sycl::_V1::detail::aligned_allocator<std::complex<double> >, void>&, long, sycl::_V1::buffer<long, 1, sycl::_V1::detail::aligned_allocator<long>, void>&, oneapi::mkl::index_base)
We can use 2024.0 for the tests for now. Aidan is running more tests.

@Rbiessy @AidanBeltonS AFAIK, there should not be any issues with missing symbols with 2024.1. This version has been in CI for a while now. I would suspect it can be a rebase issue on your branch. We should make it functional with 2024.1.

Apr 04 '24 15:04 mmeterel

@andrewtbarker Will you be able to help with this review?

Apr 04 '24 16:04 mmeterel

@andrewtbarker Will you be able to help with this review?

Sure, I will take a look.

Apr 04 '24 16:04 andrewtbarker

@Rbiessy @AidanBeltonS AFAIK, there should not be any issues with missing symbols with 2024.1. This version has been in CI for a while now. I would suspect it can be a rebase issue on your branch. We should make it functional with 2024.1.

Yes, this should have been fixed in #445 . If not we should fix it.

Apr 04 '24 22:04 andrewtbarker

Have you tested the PR with hipSYCL/AdaptiveSYCL? Can you please add the logs?

Apr 04 '24 22:04 mmeterel

No I have not tested HIPsycl. I have attached the other backend tests below. Netlib and portblas are passing fine. MKL has some failing tests due to tolerating which I am investigating further. It seems it deviates more from the reference implementation in some cases. mkl_test_log.txt netlib_test_log.txt port_blas_test_logs.txt

MKL tests error: mkl_test_log.txt

Apr 08 '24 15:04 AidanBeltonS

MKL has some failing tests due to tolerating which I am investigating further.

It looks like dotc and dotu have segfaults in your tests. Initially I think this is unlikely to be due to your PR but have you looked at this at all?

Apr 10 '24 17:04 andrewtbarker

MKL has some failing tests due to tolerating which I am investigating further.

It looks like dotc and dotu have segfaults in your tests. Initially I think this is unlikely to be due to your PR but have you looked at this at all?

The failures it Dot are due to error

[ RUN      ] DotTestSuite/DotTests.RealDoubleSinglePrecision/Row_Major_Intel_R__Data_Center_GPU_Max_1100
relative error = 1.83849e-08 absolute error = 1.24863e-07 limit = 3.01315e-13
Difference in result: DPC++ 6.79159 vs. Reference 6.79159
/home/aidanbelton/source/oneMKL/tests/unit_tests/blas/level1/dot.cpp:157: Failure
Expected equality of these values:
  res
    Which is: 0
  1
[  FAILED  ] DotTestSuite/DotTests.RealDoubleSinglePrecision/Row_Major_Intel_R__Data_Center_GPU_Max_1100, where GetParam() = (0x560f5e0, 1-byte object <00>) (1 ms)

DotU is an odd one, it does not appear to be related to my changes however

[ RUN      ] DotuTestSuite/DotuTests.ComplexSinglePrecision/Row_Major_Intel_R__Data_Center_GPU_Max_1100
Caught synchronous SYCL exception during DOTU:
The program was built for 1 devices
Build program log for 'Intel(R) Data Center GPU Max 1100':
 -11 (PI_ERROR_BUILD_PROGRAM_FAILURE) -11 (PI_ERROR_BUILD_PROGRAM_FAILURE)
OpenCL status: sycl:7
unknown file: Failure
C++ exception with description "Enqueue process failed. -59 (PI_ERROR_INVALID_OPERATION)" thrown in the test body.
[  FAILED  ] DotuTestSuite/DotuTests.ComplexSinglePrecision/Row_Major_Intel_R__Data_Center_GPU_Max_1100, where GetParam() = (0x560f5e0, 1-byte object <00>) (0 ms)

Apr 12 '24 13:04 AidanBeltonS

I have resolved all but one issue with GemmBatch's tests. The CPU MKL implementation has significant amounts of error compared to the GPU. I believe there may be a fundamental difference in the precision of the calculation for the CPU. One possible fix would be to increase the tolerance significantly just for the CPU. Im not a fan of this approach as it is a bit of a brute force solution. Does anyone have any recommendations on how they would like to see this handled?

[ RUN      ] GemmBatchUsmTestSuite/GemmBatchUsmTests.RealIntRealScalarPrecision/Column_Major_Intel_R__Xeon_R__Gold_5418Y
relative error = 0.000911658 absolute error = 0.00168478 limit = 0.000333786
Difference in entry (58,119): DPC++ 1.84973 vs. Reference 1.84804
relative error = 0.000812301 absolute error = 0.00121021 limit = 0.000333786
Difference in entry (0,124): DPC++ 1.49107 vs. Reference 1.48986
relative error = 0.000534697 absolute error = 0.000857353 limit = 0.000333786
Difference in entry (17,144): DPC++ 1.60258 vs. Reference 1.60344
relative error = 0.000527185 absolute error = 0.00049144 limit = 0.000333786
Difference in entry (52,186): DPC++ -0.932689 vs. Reference -0.932197
/home/aidanbelton/source/oneMKL/tests/unit_tests/blas/batch/gemm_batch_usm.cpp:408: Failure
Expected equality of these values:
  res
    Which is: 0
  1
[  FAILED  ] GemmBatchUsmTestSuite/GemmBatchUsmTests.RealIntRealScalarPrecision/Column_Major_Intel_R__Xeon_R__Gold_5418Y, where GetParam() = (0x56845d0, 1-byte object <01>) (331 ms)

Apr 12 '24 13:04 AidanBeltonS

No I have not tested HIPsycl. I have attached the other backend tests below. Netlib and portblas are passing fine. MKL has some failing tests due to tolerating which I am investigating further. It seems it deviates more from the reference implementation in some cases. mkl_test_log.txt netlib_test_log.txt port_blas_test_logs.txt

MKL tests error: mkl_test_log.txt

Can you please test hipSYCL backend as well?

Apr 12 '24 16:04 mmeterel

Does anyone have any recommendations on how they would like to see this handled?

If, as we suspect, the CPU backend is doing accumulation in double while the GPU backend does it in float, one option would be changing what reference gemm from tests/unit_tests/blas/include/reference_blas_templates.hpp we call (might need to add a reference gemm in that file).

Apr 12 '24 16:04 andrewtbarker

What is the status here? As I see it we have three outstanding items:

AdaptiveCpp testing
Test names (my most recent review, minor)
Failure in RealIntRealScalarPrecision

(1) may be a larger issue with CI that in my opinion can be dealt with separately in another PR. (2) is minor and should be easy to fix. I hope (3) is also minor but I'm not sure, is there any progress understanding it?

May 01 '24 23:05 andrewtbarker

Hi @andrewtbarker, I have updated the status by email as it was easier to discuss issues with testing AdaptiveCpp on the CI. In short there are a few issues @AidanBeltonS will need to look at once he is back from Holiday next week!

May 02 '24 10:05 Rbiessy

What is the status here? As I see it we have three outstanding items:
1. AdaptiveCpp testing

2. Test names (my most recent review, minor)

3. Failure in `RealIntRealScalarPrecision`
(1) may be a larger issue with CI that in my opinion can be dealt with separately in another PR. (2) is minor and should be easy to fix. I hope (3) is also minor but I'm not sure, is there any progress understanding it?

I have addressed items 2. and 3. To resolve 3 I am scaling the tolerance by the possible input range from int8 matricies. i.e. 256 I have yet to test this with AdaptiveCpp, Ill start looking at that shortly

May 13 '24 15:05 AidanBeltonS

I have disabled the (int8, int8, float, float) dtype combinations on the CuBlas backend due to failing tests on AdaptiveCPP. This is because I do not currently have bandwidth to solve this problem. The root goal of this PR was to get the (half, half, float, float) combination for llama.cpp, and that is working.

I think it would be easier for another person to pick up this work in the future, as the interface, testing, and structure that is needed already exists.

May 31 '24 09:05 AidanBeltonS

I think it would be easier for another person to pick up this work in the future, as the interface, testing, and structure that is needed already exists.

Can you file an issue documenting this so we can track it?

May 31 '24 15:05 andrewtbarker

I think it would be easier for another person to pick up this work in the future, as the interface, testing, and structure that is needed already exists.

Can you file an issue documenting this so we can track it?

Sure, I have created #506

May 31 '24 15:05 AidanBeltonS

I have also disabled the in8, float combination for MKLCPU/GPU as I found similar precision issues. #506 will be updated to reflect this

Jun 05 '24 11:06 AidanBeltonS

I have confirmed the tests pass with AdaptiveCpp on AMD and Nvidia HW.

Jun 24 '24 08:06 Rbiessy

oneMKL
oneMKL copied to clipboard

Add new batch_gemm types

Description

Checklist

All Submissions

New interfaces

New features

oneMKL oneMKL copied to clipboard

Add new batch_gemm types

Description

Checklist

All Submissions

New interfaces

New features

oneMKL
oneMKL copied to clipboard