LvArray
Add tensorOps benchmarks
Also examine using `std::fma`.
What did you have in mind for `std::fma` for device kernels? CUDA has an `fma` as well, just like `cos` and whatnot. I'm not sure it would be beneficial, but it's worth checking out.
I suspect that it may force the compiler to recognize the `fma` operation when it might otherwise miss it? We are getting all sorts of DFMA instructions in our CUDA PTX, but I was pretty careful about checking that we get them where we expect them.
Yeah, but it could be slower: https://stackoverflow.com/questions/34265982/automatically-generate-fma-instructions-in-msvc
For things like A_i B_i (a dot product) it is very applicable. But applying it to something like
```cpp
dstSymMatrix[ 3 ] = matrixA[ 1 ][ 0 ] * symMatrixB[ 0 ] * matrixA[ 2 ][ 0 ] +
                    matrixA[ 1 ][ 0 ] * symMatrixB[ 5 ] * matrixA[ 2 ][ 1 ] +
                    matrixA[ 1 ][ 0 ] * symMatrixB[ 4 ] * matrixA[ 2 ][ 2 ] +
                    matrixA[ 1 ][ 1 ] * symMatrixB[ 5 ] * matrixA[ 2 ][ 0 ] +
                    matrixA[ 1 ][ 1 ] * symMatrixB[ 1 ] * matrixA[ 2 ][ 1 ] +
                    matrixA[ 1 ][ 1 ] * symMatrixB[ 3 ] * matrixA[ 2 ][ 2 ] +
                    matrixA[ 1 ][ 2 ] * symMatrixB[ 4 ] * matrixA[ 2 ][ 0 ] +
                    matrixA[ 1 ][ 2 ] * symMatrixB[ 3 ] * matrixA[ 2 ][ 1 ] +
                    matrixA[ 1 ][ 2 ] * symMatrixB[ 2 ] * matrixA[ 2 ][ 2 ];
```
might harm performance even if `std::fma` is fast, because it limits the rearranging the compiler can do. Without `fma` I count 26 fp operations (18 multiplies and 8 additions).
```cpp
dstSymMatrix[ 3 ] = matrixA[ 1 ][ 0 ] * ( symMatrixB[ 0 ] * matrixA[ 2 ][ 0 ] +
                                          symMatrixB[ 5 ] * matrixA[ 2 ][ 1 ] +
                                          symMatrixB[ 4 ] * matrixA[ 2 ][ 2 ] ) +
                    matrixA[ 1 ][ 1 ] * ( symMatrixB[ 5 ] * matrixA[ 2 ][ 0 ] +
                                          symMatrixB[ 1 ] * matrixA[ 2 ][ 1 ] +
                                          symMatrixB[ 3 ] * matrixA[ 2 ][ 2 ] ) +
                    matrixA[ 1 ][ 2 ] * ( symMatrixB[ 4 ] * matrixA[ 2 ][ 0 ] +
                                          symMatrixB[ 3 ] * matrixA[ 2 ][ 1 ] +
                                          symMatrixB[ 2 ] * matrixA[ 2 ][ 2 ] );
```
Rearranging and using `fma`, I count 12: each parenthesized inner sum is one multiply plus two fmas, and the outer combination is another multiply plus two fmas.