LvArray
Add tensorOps benchmarks
Also examine using `std::fma`.
What did you have in mind for `std::fma` for device kernels? CUDA has an `fma` as well, just like `cos` and whatnot. I'm not sure it would be beneficial, but it's worth checking out.
I suspect that it may force the compiler to recognize the `fma` operation when it might otherwise miss it? We are getting all sorts of DFMA instructions in our CUDA PTX, but I was pretty careful about checking that we get them where we expect them.
Yeah, but it could be slower: https://stackoverflow.com/questions/34265982/automatically-generate-fma-instructions-in-msvc
For things like A_i B_i (a dot product) it is very applicable. But applying it to something like
```cpp
dstSymMatrix[ 3 ] = matrixA[ 1 ][ 0 ] * symMatrixB[ 0 ] * matrixA[ 2 ][ 0 ] +
                    matrixA[ 1 ][ 0 ] * symMatrixB[ 5 ] * matrixA[ 2 ][ 1 ] +
                    matrixA[ 1 ][ 0 ] * symMatrixB[ 4 ] * matrixA[ 2 ][ 2 ] +
                    matrixA[ 1 ][ 1 ] * symMatrixB[ 5 ] * matrixA[ 2 ][ 0 ] +
                    matrixA[ 1 ][ 1 ] * symMatrixB[ 1 ] * matrixA[ 2 ][ 1 ] +
                    matrixA[ 1 ][ 1 ] * symMatrixB[ 3 ] * matrixA[ 2 ][ 2 ] +
                    matrixA[ 1 ][ 2 ] * symMatrixB[ 4 ] * matrixA[ 2 ][ 0 ] +
                    matrixA[ 1 ][ 2 ] * symMatrixB[ 3 ] * matrixA[ 2 ][ 1 ] +
                    matrixA[ 1 ][ 2 ] * symMatrixB[ 2 ] * matrixA[ 2 ][ 2 ];
```
might harm performance even if `std::fma` is fast, because it limits the rearranging the compiler can do. Without `fma` I count 26 fp operations (18 multiplies and 8 additions).
```cpp
dstSymMatrix[ 3 ] = matrixA[ 1 ][ 0 ] * ( symMatrixB[ 0 ] * matrixA[ 2 ][ 0 ] +
                                          symMatrixB[ 5 ] * matrixA[ 2 ][ 1 ] +
                                          symMatrixB[ 4 ] * matrixA[ 2 ][ 2 ] ) +
                    matrixA[ 1 ][ 1 ] * ( symMatrixB[ 5 ] * matrixA[ 2 ][ 0 ] +
                                          symMatrixB[ 1 ] * matrixA[ 2 ][ 1 ] +
                                          symMatrixB[ 3 ] * matrixA[ 2 ][ 2 ] ) +
                    matrixA[ 1 ][ 2 ] * ( symMatrixB[ 4 ] * matrixA[ 2 ][ 0 ] +
                                          symMatrixB[ 3 ] * matrixA[ 2 ][ 1 ] +
                                          symMatrixB[ 2 ] * matrixA[ 2 ][ 2 ] );
```
Rearranging and using `fma`, I count 12: each parenthesized inner sum is one multiply plus two fmas, and the outer combination is another multiply plus two fmas.