Integer matrices
Would you consider also implementing matrix multiplication for integer matrices, or do you want to keep this purely floating point?
It's pretty far from what we are focusing on, but maybe it's simple to plug into the existing code?
Experiment for fun https://github.com/bluss/matrixmultiply/compare/i32-gemm-experiment?expand=1
@SuperFluffy do you have any good docs on integer gemm? It seems a bit fraught, with wraparound problems especially for large matrices; there must be good reasons it's not often implemented.
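The wraparound concern is easy to make concrete. A minimal sketch (illustrative names, not matrixmultiply code): accumulating an `i8` dot product in `i8` wraps almost immediately, while widening each product to `i32` gives the expected result.

```rust
// Hypothetical illustration: naive i8 accumulation wraps around long
// before the matrices get large, so integer gemm must widen internally.
fn dot_i8_wrapping(a: &[i8], b: &[i8]) -> i8 {
    a.iter()
        .zip(b)
        .fold(0i8, |acc, (&x, &y)| acc.wrapping_add(x.wrapping_mul(y)))
}

fn dot_i8_widened(a: &[i8], b: &[i8]) -> i32 {
    a.iter().zip(b).map(|(&x, &y)| x as i32 * y as i32).sum()
}

fn main() {
    let a = [100i8; 4];
    let b = [100i8; 4];
    // Each product 100 * 100 = 10_000 already wraps in i8 (to 16).
    println!("wrapping: {}", dot_i8_wrapping(&a, &b)); // 64
    println!("widened:  {}", dot_i8_widened(&a, &b)); // 40000
}
```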
@bluss Here is the doc for the `cblas_gemm_*` functions: https://software.intel.com/en-us/mkl-developer-reference-c-cblas-gemm-1#2A58B860-609A-44CC-9812-E47BD01810CC The implementation details are at the bottom.
One of the few documents talking about it is this here: http://www.netlib.org/utk/people/JackDongarra/WEB-PAGES/Batched-BLAS-2017/talk12-gurney.pdf
~~Two relevant implementation details (both from page 11/15):~~

- ~~They implement only `GEMM_S16S16S32` and `GEMM_S16S16S16`, with `S16 = i16` and `S32 = i32`, respectively.~~
- ~~Internal summation is done with at least 16 bits (that's probably quite important!).~~
They note:

> Only saturation variants are implemented

And then on page 13/15:

> Saturate instead of overflowing or underflowing
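The difference between the two overflow policies can be shown on a single accumulation step. A minimal Rust sketch (illustrative values, not MKL code):

```rust
// Sketch of the two overflow policies the docs distinguish:
// wrapping (two's-complement overflow) vs. saturating (clamp at the
// type's min/max), applied to one accumulation step in i16.
fn main() {
    let acc: i16 = 30_000;
    let product: i16 = 10_000;

    let wrapped = acc.wrapping_add(product); // wraps past i16::MAX
    let saturated = acc.saturating_add(product); // clamps at i16::MAX

    println!("wrapped:   {}", wrapped); // -25536
    println!("saturated: {}", saturated); // 32767
}
```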
The arraymancer library for nim has implemented integer gemm here: https://github.com/mratsim/Arraymancer/commit/654c89e59088304159d7ad5c4d712d862fbfe395. Discussions can be found here: https://github.com/mratsim/Arraymancer/issues/25, https://github.com/mratsim/Arraymancer/issues/6. They also have integer gemv here: https://github.com/mratsim/Arraymancer/commit/a5e79d9625c5c056445ebceb7f487f4dc26b6b2e
EDIT: Intel MKL implements cblas_gemm_s8u8s32 and cblas_gemm_s16s16s32.
Note: that's a `u8` in the first function!
Oh saturation! Good to know. Thanks for the details!
Note the comment at the bottom of the API doc (emphasis mine):

> After computing these four multiplication terms separately, they are summed from left to right. The results from the matrix-matrix product and the C matrix are scaled with alpha and beta floating-point values respectively using double-precision arithmetic. Before storing the results to the output c array, the floating-point values are rounded to the nearest integers. In the event of overflow or underflow, the results depend on the architecture. The results are either unsaturated (wrapped) or saturated to maximum or minimum representable integer values for the data type of the output matrix.
>
> When using cblas_gemm_s8u8s32 with row-major layout, the data types of A and B must be swapped. That is, you must provide an 8-bit unsigned integer array for matrix A and an 8-bit signed integer array for matrix B.
>
> Intermediate integer computations in cblas_gemm_s8u8s32 on 64-bit Intel® Advanced Vector Extensions 2 (Intel® AVX2) and Intel® Advanced Vector Extensions 512 (Intel® AVX-512) architectures without Vector Neural Network Instructions (VNNI) extensions can saturate. This is because only 16 bits are available for the accumulation of intermediate results. You can avoid integer saturation by maintaining all integer elements of A or B matrices under 8 bits.
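The 16-bit limitation is plausible given how AVX2 handles `u8 * i8` products: instructions like `pmaddubsw` multiply and sum adjacent pairs into saturating `i16` lanes before anything is widened. The arithmetic below checks why one worst-case product fits in `i16` but a pair does not, and why keeping one operand under 8 bits restores headroom (my reading of the doc, not verified against hardware):

```rust
// Why a pair-wise 16-bit intermediate accumulation can saturate:
// the worst-case u8 * i8 product is 255 * 127 = 32_385.
fn main() {
    let p: i32 = 255 * 127;
    assert!(p <= i16::MAX as i32); // one product fits in i16
    assert!(2 * p > i16::MAX as i32); // a pair (64_770) does not
    // Keeping the u8 operand under 8 bits (<= 127) shrinks the
    // worst-case pair to 2 * 127 * 127 = 32_258, which fits again.
    assert!(2 * 127 * 127 <= i16::MAX as i32);
    println!("pair-sum headroom checks passed");
}
```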
Also, I edited my comment above: Intel only supports `s8u8s32` (`i8`, `u8`(!), `i32`) and `s16s16s32` (`i16`, `i16`, `i32`).
What a bunch of hacks upon hacks
I have found mention of integer gemm in the context of BLIS, but it looks like nothing came of it: https://groups.google.com/forum/#!topic/blis-devel/qA00lB2yGY0
Would it be possible to make just the fallback implementation available for more types as a first step?
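A type-generic fallback could look roughly like this. This is a hedged sketch with hypothetical names and trait bounds, not matrixmultiply's actual API; the key idea is accumulating in a wider type `W` even when the inputs `T` are narrow (alpha = 1, beta = 0, row-major, for brevity):

```rust
// Hypothetical naive fallback kernel: C = A * B with accumulation
// widened from T to W to sidestep the overflow issues discussed above.
fn gemm_fallback<T, W>(m: usize, k: usize, n: usize, a: &[T], b: &[T], c: &mut [W])
where
    T: Copy + Into<W>,
    W: Copy + Default + core::ops::Add<Output = W> + core::ops::Mul<Output = W>,
{
    for i in 0..m {
        for j in 0..n {
            let mut acc = W::default(); // zero for integer types
            for l in 0..k {
                acc = acc + a[i * k + l].into() * b[l * n + j].into();
            }
            c[i * n + j] = acc;
        }
    }
}

fn main() {
    // 2x2 example: [[1, 2], [3, 4]] * [[5, 6], [7, 8]]
    let a = [1i8, 2, 3, 4];
    let b = [5i8, 6, 7, 8];
    let mut c = [0i32; 4];
    gemm_fallback::<i8, i32>(2, 2, 2, &a, &b, &mut c);
    println!("{:?}", c); // [19, 22, 43, 50]
}
```

The `T: Into<W>` bound covers the lossless widenings (`i8 -> i32`, `i16 -> i32`, etc.) that the MKL variants above also rely on.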