Full-BLAS support
This issue tracks full BLAS support and will be updated as new functionality is added. Only the single-precision BLAS calls are listed here, since Kokkos Kernels is scalar-type agnostic.
BLAS 1
| BLAS Call | Kokkos Kernels Call | Reference | TPL BLAS | TPL CUBLAS | TPL ROCBLAS | TPL oneMKL | Complex Special |
|---|---|---|---|---|---|---|---|
| SROTG | rotg(a, b, c, s) | done | done | done | done | -- | -- |
| SROTMG | rotmg(d1, d2, x1, y1, param) | done | done | done | done | -- | NC |
| SROT | rot(X, Y, c, s) | done | done | done | done | -- | -- |
| SROTM | rotm(X, Y, param) | done | done | done | -- | -- | NC |
| SSWAP | swap(X, Y) | done | done | done | done | -- | N/A |
| SSCAL | scal(y,a,x) | done | -- | -- | -- | -- | N/A |
| CSSCAL | scal(y,a,x) | done | -- | -- | -- | -- | OC |
| SCOPY | deep_copy(y,x) | done | -- | -- | -- | -- | N/A |
| SAXPY | axpby(a,x,b,y) | done | -- | -- | -- | -- | -- |
| SDOT* | dot(x,y) | done | -- | -- | -- | -- | -- |
| SDSDOT* | dot(x,y) | done | -- | -- | -- | -- | NC |
| CDOTU | -- | -- | -- | -- | -- | -- | OC |
| CDOTC* | dot(x,y) | done | -- | -- | -- | -- | OC |
| SNRM2 | nrm2(x) | done | -- | -- | -- | -- | NC |
| SCNRM2 | nrm2(x) | done | -- | -- | -- | -- | OC |
| SASUM | asum(x) | done | done | done | done | done | -- |
| ISAMAX | iamax(x) | done | -- | -- | -- | -- | -- |
*Kokkos Kernels dot() behaves slightly differently depending on whether the result is returned by value or written to an output Kokkos::View. In the former case the dot product is always accumulated in double; in the latter it is accumulated in a scalar of the same type as the value_type of the output view.
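To make the footnote concrete, here is a minimal sketch in plain C++; it is not Kokkos Kernels code. `dot_accumulate_double` and `dot_accumulate_as` are hypothetical helpers that only mimic the two accumulation behaviors described above, and `std::vector<float>` stands in for a Kokkos::View.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the "return value" form of dot():
// the sum is accumulated in double regardless of the input scalar type.
inline double dot_accumulate_double(const std::vector<float>& x,
                                    const std::vector<float>& y) {
  assert(x.size() == y.size());
  double sum = 0.0;
  for (std::size_t i = 0; i < x.size(); ++i) {
    sum += static_cast<double>(x[i]) * static_cast<double>(y[i]);
  }
  return sum;
}

// Hypothetical stand-in for the "output View" form of dot():
// the sum is accumulated in the scalar type of the output (Result).
template <class Result>
void dot_accumulate_as(Result& result, const std::vector<float>& x,
                       const std::vector<float>& y) {
  assert(x.size() == y.size());
  Result sum = Result(0);
  for (std::size_t i = 0; i < x.size(); ++i) {
    sum += static_cast<Result>(x[i]) * static_cast<Result>(y[i]);
  }
  result = sum;
}
```

For example, with x = {1e8f, 1.0f, -1e8f} and y = {1.0f, 1.0f, 1.0f}, the double accumulation should return 1 on an IEEE-754 platform, while accumulating into a float output loses the small term (1e8f + 1.0f rounds back to 1e8f) and yields 0. The choice of call form can therefore change the result for ill-conditioned inputs.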
BLAS 2
Note: for complex types, BLAS provides Hermitian calls in place of the symmetric ones.
| BLAS Call | std::blas | Kokkos Kernels Call | Reference | TPL BLAS | TPL CUBLAS | TPL ROCBLAS | TPL oneMKL | Complex Special |
|---|---|---|---|---|---|---|---|---|
| SGEMV | y | gemv(trans,a,A,x,b,y) | done | -- | -- | -- | -- | -- |
| SGBMV | n | -- | -- | -- | -- | -- | -- | -- |
| SSYMV | y | -- | -- | -- | -- | -- | -- | -- |
| SSBMV | n | -- | -- | -- | -- | -- | -- | -- |
| SSPMV | y | -- | -- | -- | -- | -- | -- | -- |
| STRMV | y | derive from trmm | done | -- | -- | -- | -- | -- |
| STBMV | n | -- | -- | -- | -- | -- | -- | -- |
| STPMV | y | -- | -- | -- | -- | -- | -- | -- |
| STRSV | y | derive from trmv | done | -- | -- | -- | -- | -- |
| STBSV | n | -- | -- | -- | -- | -- | -- | -- |
| STPSV | y | -- | -- | -- | -- | -- | -- | -- |
| SGER | y | ger(trans,a,x,y,A) | done | X | X | X | -- | NC |
| CGERU | y | ger(trans,a,x,y,A) | done | X | X | X | -- | OC |
| CGERC | y | ger(trans,a,x,y,A) | done | X | X | X | -- | OC |
| SSYR | y | syr(trans,uplo,a,x,A) | done | X | X | X | -- | -- |
| SSPR | y | -- | -- | -- | -- | -- | -- | -- |
| SSYR2 | y | syr2(trans,uplo,a,x,y,A) | done | X | X | X | -- | -- |
| SSPR2 | y | -- | -- | -- | -- | -- | -- | -- |
BLAS 3
| BLAS Call | std::blas | Kokkos Kernels Call | Reference | TPL BLAS | TPL CUBLAS | Complex Special |
|---|---|---|---|---|---|---|
| SGEMM | y | gemm(transA,transB,a,A,B,b,C) | -- | -- | -- | -- |
| SSYMM | y | -- | -- | -- | -- | -- |
| SSYRK | y | -- | -- | -- | -- | -- |
| SSYR2K | y | -- | -- | -- | -- | -- |
| CHEMM | y | -- | -- | -- | -- | OC |
| CHERK | y | -- | -- | -- | -- | OC |
| CHER2K | y | -- | -- | -- | -- | OC |
| STRMM | y | trmm(side, uplo, trans, diag, a, A, B) | -- | -- | -- | -- |
| STRSM | y | trsm(side, uplo, trans, diag, a, A, B) | -- | -- | -- | -- |
For the QMCPACK miniapp, it would be very useful to have gemv and ger. Typical problem sizes are matrices and vectors with dimensions of roughly 300 to 5000.
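For reference, the operations requested here can be sketched in plain C++ using the standard BLAS definitions: gemv computes y = b*y + a*op(A)*x, and real ger computes A = A + a*x*y^T. This is a serial illustration only, not the Kokkos Kernels implementation. `Matrix`, `gemv_ref`, and `ger_ref` are hypothetical names, the row-major layout is an assumption, and the sketch covers only real double-precision data with 'N' and 'T' modes.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical dense row-major matrix, standing in for a rank-2 Kokkos::View.
struct Matrix {
  std::size_t rows, cols;
  std::vector<double> data;
  Matrix(std::size_t r, std::size_t c) : rows(r), cols(c), data(r * c, 0.0) {}
  double& operator()(std::size_t i, std::size_t j) { return data[i * cols + j]; }
  double operator()(std::size_t i, std::size_t j) const { return data[i * cols + j]; }
};

// y = b*y + a*op(A)*x, where op(A) = A for trans 'N' and A^T for trans 'T'.
void gemv_ref(char trans, double a, const Matrix& A,
              const std::vector<double>& x, double b, std::vector<double>& y) {
  const bool t = (trans == 'T');
  const std::size_t m = t ? A.cols : A.rows;  // length of y
  const std::size_t n = t ? A.rows : A.cols;  // length of x
  assert(x.size() == n && y.size() == m);
  for (std::size_t i = 0; i < m; ++i) {
    double sum = 0.0;
    for (std::size_t j = 0; j < n; ++j) {
      sum += (t ? A(j, i) : A(i, j)) * x[j];
    }
    y[i] = b * y[i] + a * sum;
  }
}

// Real rank-1 update: A = A + a * x * y^T.
void ger_ref(double a, const std::vector<double>& x,
             const std::vector<double>& y, Matrix& A) {
  assert(x.size() == A.rows && y.size() == A.cols);
  for (std::size_t i = 0; i < A.rows; ++i) {
    for (std::size_t j = 0; j < A.cols; ++j) {
      A(i, j) += a * x[i] * y[j];
    }
  }
}
```

At the sizes mentioned above (roughly 300 to 5000), both operations do O(n^2) work on O(n^2) data, so they are typically memory-bandwidth bound. A reference loop like this is useful mainly for checking results against an optimized or TPL-backed version.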