intgemm
Hacky nonmult8 for VNNI
It's not a purr-fect implementation, but it is a start... This patch implements the following:
- PrepareB for arbitrary-column matrices on all architectures. The last non-multiple-of-eight columns are prepared and compressed as a small, independent width-by-8 matrix; zeroed blocks of `register_width` are stripped. Unfortunately, this is not done in place in the current implementation and involves memory copying, which can be improved in the future. I am using some inlined functions that don't have `CPU_ATTR` set, as I was lazy; I hope that inlining means they are generated with the proper ISA limitations. Regardless, so far only the VNNI multiply is implemented anyway. (A padding sketch follows this list.)
- AVX512VNNI multiplication of matrices with an arbitrary number of columns, plus tests. The multiplication proceeds as normal until it reaches the last non-multiple-of-eight columns, then handles those in a separate loop (see the example and loop sketch below).
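For the first item, roughly this kind of padding is involved. This is a hedged sketch with a hypothetical helper name, not the actual intgemm code; it only illustrates copying the trailing columns into a zero-padded width-by-8 block so the regular 8-column preparation path can handle them.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical sketch (not the real PrepareB): zero-pad the trailing
// B_cols % 8 columns of a row-major width x B_cols matrix into a full
// width x 8 block. As noted above, this is not in place and costs a copy;
// all-zero register_width blocks are stripped afterwards.
std::vector<float> PadTrailingColumns(const float *B, std::size_t width,
                                      std::size_t B_cols) {
  const std::size_t rem = B_cols % 8;          // trailing columns, 1..7
  std::vector<float> padded(width * 8, 0.0f);  // zero-filled width x 8 block
  for (std::size_t r = 0; r < width; ++r)
    std::memcpy(&padded[r * 8], &B[r * B_cols + (B_cols - rem)],
                rem * sizeof(float));
  return padded;  // feed to the regular 8-column preparation path
}
```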
Example: if A is a 2x64 matrix and B is 64x9, we first multiply 2x64 times 64x8, and then 2x64 times 64x1 to produce the last column.
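The control flow of that split looks roughly like the scalar sketch below. This is only an illustration of the loop structure; the real code uses register-tiled AVX512VNNI kernels, and the names here are mine.

```cpp
#include <cstddef>
#include <cstdint>

// Scalar sketch of the column split: the main loop covers the largest
// multiple-of-8 prefix of B's columns; a second loop finishes the
// remaining (< 8) columns one at a time. A is A_rows x width, B is
// width x B_cols, both row-major; C receives A_rows x B_cols.
void MultiplySketch(const int8_t *A, const int8_t *B, int32_t *C,
                    std::size_t A_rows, std::size_t width, std::size_t B_cols) {
  auto dot = [&](std::size_t r, std::size_t c) {
    int32_t sum = 0;
    for (std::size_t k = 0; k < width; ++k)
      sum += static_cast<int32_t>(A[r * width + k]) *
             static_cast<int32_t>(B[k * B_cols + c]);
    C[r * B_cols + c] = sum;
  };
  const std::size_t main_cols = B_cols & ~std::size_t(7);  // multiple-of-8 prefix
  for (std::size_t r = 0; r < A_rows; ++r) {
    for (std::size_t c = 0; c < main_cols; c += 8)   // regular 8-wide tiles
      for (std::size_t j = 0; j < 8; ++j) dot(r, c + j);
    for (std::size_t c = main_cols; c < B_cols; ++c) // leftover columns
      dot(r, c);
  }
}
```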
Unfortunately, now that we can have matrices with a non-multiple-of-eight number of columns, we no longer write the output columns consecutively, so the writes hit unaligned addresses and we segfault. For this reason I have replaced the store routine with storeu.
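For reference, the difference is just the aligned versus unaligned store intrinsic; with irregular column counts the destination is no longer guaranteed to be 64-byte aligned, and the aligned form faults:

```cpp
#include <immintrin.h>

// _mm512_store_si512 requires a 64-byte-aligned destination and faults
// otherwise; _mm512_storeu_si512 accepts any address, at little to no cost
// on recent CPUs. (Illustrative function; not a symbol from the patch.)
void StoreResult(void *dst, __m512i total) {
  // _mm512_store_si512(dst, total);  // old: segfaults on unaligned dst
  _mm512_storeu_si512(dst, total);    // new: safe for any alignment
}
```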
Preliminary performance benchmarks with the built-in benchmark at https://github.com/kpu/intgemm/blob/6228d016ecc63470d2dbb76bd4ab7b0abe097993/benchmarks/biasmultiply.cc#L267 to check for performance regressions. (These do not include irregularly shaped non-multiple-of-8 matrices.)

This branch (n=1):
```
taskset --cpu-list 0 ./biasmultiply
1000 iterations of SSSE3 without bias took: 2.31014 seconds.
1000 iterations of SSSE3 took: 2.39446 seconds.
1000 iterations of Shifted SSSE3 took: 1.98965 seconds.
1000 iterations of AVX2 without bias took: 1.33628 seconds.
1000 iterations of AVX2 took: 1.33306 seconds.
1000 iterations of Shifted AVX2 took: 1.20668 seconds.
1000 iterations of AVX512 without bias took: 1.01728 seconds.
1000 iterations of AVX512 took: 1.04101 seconds.
1000 iterations of Shifted AVX512 took: 0.779364 seconds.
1000 iterations of AVX512VNNI without bias took: 0.754878 seconds.
1000 iterations of AVX512VNNI took: 0.771353 seconds.
1000 iterations of Shifted AVX512VNNI took: 0.539761 seconds.
```
Master (n=1):

```
taskset --cpu-list 0 ./biasmultiply
1000 iterations of SSSE3 without bias took: 2.31003 seconds.
1000 iterations of SSSE3 took: 2.37843 seconds.
1000 iterations of Shifted SSSE3 took: 1.97674 seconds.
1000 iterations of AVX2 without bias took: 1.28795 seconds.
1000 iterations of AVX2 took: 1.33322 seconds.
1000 iterations of Shifted AVX2 took: 1.20815 seconds.
1000 iterations of AVX512 without bias took: 1.01804 seconds.
1000 iterations of AVX512 took: 1.06707 seconds.
1000 iterations of Shifted AVX512 took: 0.779698 seconds.
1000 iterations of AVX512VNNI without bias took: 0.776488 seconds.
1000 iterations of AVX512VNNI took: 0.772831 seconds.
1000 iterations of Shifted AVX512VNNI took: 0.653334 seconds.
```
Speed seems to be even better, but I don't trust that; maybe some instruction reordering makes the benchmark perform better. I will have to test it in a real-world situation later on.