Integer matrices
Would you consider also implementing matrix multiplication for integer matrices, or do you want to keep this purely floating point?
It's pretty far from what we are focusing on, but maybe it's simple to plug into the existing code?
Experiment for fun https://github.com/bluss/matrixmultiply/compare/i32-gemm-experiment?expand=1
@SuperFluffy do you have any good docs on integer gemm? It seems a bit fraught, with wraparound problems especially for large matrices; there must be good reasons it's not often implemented.
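The wraparound concern is easy to make concrete. A minimal sketch (illustrative names, not matrixmultiply code): accumulating an `i8` dot product in `i8` wraps almost immediately, while widening each product to `i32` gives the expected result.

```rust
// Hypothetical illustration: naive i8 accumulation wraps around long
// before the matrices get large, so integer gemm must widen internally.
fn dot_i8_wrapping(a: &[i8], b: &[i8]) -> i8 {
    a.iter()
        .zip(b)
        .fold(0i8, |acc, (&x, &y)| acc.wrapping_add(x.wrapping_mul(y)))
}

fn dot_i8_widened(a: &[i8], b: &[i8]) -> i32 {
    a.iter().zip(b).map(|(&x, &y)| x as i32 * y as i32).sum()
}

fn main() {
    let a = [100i8; 4];
    let b = [100i8; 4];
    // Each product 100 * 100 = 10_000 already wraps in i8 (to 16).
    println!("wrapping: {}", dot_i8_wrapping(&a, &b)); // 64
    println!("widened:  {}", dot_i8_widened(&a, &b)); // 40000
}
```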
@bluss Here is the doc for the `cblas_gemm_*` functions: https://software.intel.com/en-us/mkl-developer-reference-c-cblas-gemm-1#2A58B860-609A-44CC-9812-E47BD01810CC The implementation details are at the bottom.
One of the few documents talking about it is this here: http://www.netlib.org/utk/people/JackDongarra/WEB-PAGES/Batched-BLAS-2017/talk12-gurney.pdf
~~Two relevant implementation details (both from page 11/15):~~

- ~~They implement only `GEMM_S16S16S32` and `GEMM_S16S16S16`, with `S16 = i16` and `S32 = i32`, respectively.~~
- ~~Internal summation is done with at least 16 bits (that's probably quite important!).~~
They note:

> Only saturation variants are implemented

And then on page 13/15:

> Saturate instead of overflowing or underflowing
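The difference between the two overflow policies can be shown on a single accumulation step. A minimal Rust sketch (illustrative values, not MKL code):

```rust
// Sketch of the two overflow policies the docs distinguish:
// wrapping (two's-complement overflow) vs. saturating (clamp at the
// type's min/max), applied to one accumulation step in i16.
fn main() {
    let acc: i16 = 30_000;
    let product: i16 = 10_000;

    let wrapped = acc.wrapping_add(product); // wraps past i16::MAX
    let saturated = acc.saturating_add(product); // clamps at i16::MAX

    println!("wrapped:   {}", wrapped); // -25536
    println!("saturated: {}", saturated); // 32767
}
```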
The arraymancer library for nim has implemented integer gemm here: https://github.com/mratsim/Arraymancer/commit/654c89e59088304159d7ad5c4d712d862fbfe395. Discussions can be found here: https://github.com/mratsim/Arraymancer/issues/25, https://github.com/mratsim/Arraymancer/issues/6. They also have integer gemv here: https://github.com/mratsim/Arraymancer/commit/a5e79d9625c5c056445ebceb7f487f4dc26b6b2e
EDIT: Intel MKL implements cblas_gemm_s8u8s32 and cblas_gemm_s16s16s32.
Note: that's a `u8` in the first function!
Oh saturation! Good to know. Thanks for the details!
Note the comment at the bottom of the API doc (emphasis mine):

> After computing these four multiplication terms separately, they are summed from left to right. The results from the matrix-matrix product and the C matrix are scaled with alpha and beta floating-point values respectively using double-precision arithmetic. Before storing the results to the output c array, the floating-point values are rounded to the nearest integers. In the event of overflow or underflow, the results depend on the architecture. The results are either unsaturated (wrapped) or saturated to maximum or minimum representable integer values for the data type of the output matrix.
>
> When using cblas_gemm_s8u8s32 with row-major layout, the data types of A and B must be swapped. That is, you must provide an 8-bit unsigned integer array for matrix A and an 8-bit signed integer array for matrix B.
>
> Intermediate integer computations in cblas_gemm_s8u8s32 on 64-bit Intel® Advanced Vector Extensions 2 (Intel® AVX2) and Intel® Advanced Vector Extensions 512 (Intel® AVX-512) architectures without Vector Neural Network Instructions (VNNI) extensions can saturate. This is because only 16 bits are available for the accumulation of intermediate results. You can avoid integer saturation by maintaining all integer elements of A or B matrices under 8 bits.
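The 16-bit limitation is plausible given how AVX2 handles `u8 * i8` products: instructions like `pmaddubsw` multiply and sum adjacent pairs into saturating `i16` lanes before anything is widened. The arithmetic below checks why one worst-case product fits in `i16` but a pair does not, and why keeping one operand under 8 bits restores headroom (my reading of the doc, not verified against hardware):

```rust
// Why a pair-wise 16-bit intermediate accumulation can saturate:
// the worst-case u8 * i8 product is 255 * 127 = 32_385.
fn main() {
    let p: i32 = 255 * 127;
    assert!(p <= i16::MAX as i32); // one product fits in i16
    assert!(2 * p > i16::MAX as i32); // a pair (64_770) does not
    // Keeping the u8 operand under 8 bits (<= 127) shrinks the
    // worst-case pair to 2 * 127 * 127 = 32_258, which fits again.
    assert!(2 * 127 * 127 <= i16::MAX as i32);
    println!("pair-sum headroom checks passed");
}
```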
Also, I edited my comment above: Intel only supports `s8u8s32` (`i8`, `u8`(!), `i32`) and `s16s16s32` (`i16`, `i16`, `i32`).
What a bunch of hacks upon hacks
I have found mention of integer gemm in the context of BLIS, but it looks like nothing came of it: https://groups.google.com/forum/#!topic/blis-devel/qA00lB2yGY0
Would it be possible to make just the fallback implementation available for more types as a first step?
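A type-generic fallback could look roughly like this. This is a hedged sketch with hypothetical names and trait bounds, not matrixmultiply's actual API; the key idea is accumulating in a wider type `W` even when the inputs `T` are narrow (alpha = 1, beta = 0, row-major, for brevity):

```rust
// Hypothetical naive fallback kernel: C = A * B with accumulation
// widened from T to W to sidestep the overflow issues discussed above.
fn gemm_fallback<T, W>(m: usize, k: usize, n: usize, a: &[T], b: &[T], c: &mut [W])
where
    T: Copy + Into<W>,
    W: Copy + Default + core::ops::Add<Output = W> + core::ops::Mul<Output = W>,
{
    for i in 0..m {
        for j in 0..n {
            let mut acc = W::default(); // zero for integer types
            for l in 0..k {
                acc = acc + a[i * k + l].into() * b[l * n + j].into();
            }
            c[i * n + j] = acc;
        }
    }
}

fn main() {
    // 2x2 example: [[1, 2], [3, 4]] * [[5, 6], [7, 8]]
    let a = [1i8, 2, 3, 4];
    let b = [5i8, 6, 7, 8];
    let mut c = [0i32; 4];
    gemm_fallback::<i8, i32>(2, 2, 2, &a, &b, &mut c);
    println!("{:?}", c); // [19, 22, 43, 50]
}
```

The `T: Into<W>` bound covers the lossless widenings (`i8 -> i32`, `i16 -> i32`, etc.) that the MKL variants above also rely on.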