intgemm
Hacky nonmult8 for VNNI
It's not a purr-fect implementation, but it is a start... This patch implements the following:
- PrepareB for arbitrary-column matrices on all architectures. The last non-multiple-of-eight columns are prepared and compressed as a small, independent width-by-8 matrix; zeroed blocks of `register_width` are stripped. Unfortunately, this is not done in place in the current implementation and involves memory copying, which can be improved in the future. I am using some inlined functions that don't have `CPU_ATTR` set, as I was lazy; I hope that inlining means they are generated with the proper ISA limitations. Regardless, so far only the VNNI multiply is implemented anyway. (A padding sketch follows this list.)
- AVX512VNNI multiplication of matrices with an arbitrary number of columns, plus tests. The multiplication proceeds as normal until it reaches the last non-multiple-of-eight columns, then handles those in a separate loop (see the example and loop sketch below).
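For the first item, roughly this kind of padding is involved. This is a hedged sketch with a hypothetical helper name, not the actual intgemm code; it only illustrates copying the trailing columns into a zero-padded width-by-8 block so the regular 8-column preparation path can handle them.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical sketch (not the real PrepareB): zero-pad the trailing
// B_cols % 8 columns of a row-major width x B_cols matrix into a full
// width x 8 block. As noted above, this is not in place and costs a copy;
// all-zero register_width blocks are stripped afterwards.
std::vector<float> PadTrailingColumns(const float *B, std::size_t width,
                                      std::size_t B_cols) {
  const std::size_t rem = B_cols % 8;          // trailing columns, 1..7
  std::vector<float> padded(width * 8, 0.0f);  // zero-filled width x 8 block
  for (std::size_t r = 0; r < width; ++r)
    std::memcpy(&padded[r * 8], &B[r * B_cols + (B_cols - rem)],
                rem * sizeof(float));
  return padded;  // feed to the regular 8-column preparation path
}
```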
Example: if A is a 2x64 matrix and B is 64x9, we first multiply 2x64 times 64x8, and then 2x64 times 64x1 to produce the last column.
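The control flow of that split looks roughly like the scalar sketch below. This is only an illustration of the loop structure; the real code uses register-tiled AVX512VNNI kernels, and the names here are mine.

```cpp
#include <cstddef>
#include <cstdint>

// Scalar sketch of the column split: the main loop covers the largest
// multiple-of-8 prefix of B's columns; a second loop finishes the
// remaining (< 8) columns one at a time. A is A_rows x width, B is
// width x B_cols, both row-major; C receives A_rows x B_cols.
void MultiplySketch(const int8_t *A, const int8_t *B, int32_t *C,
                    std::size_t A_rows, std::size_t width, std::size_t B_cols) {
  auto dot = [&](std::size_t r, std::size_t c) {
    int32_t sum = 0;
    for (std::size_t k = 0; k < width; ++k)
      sum += static_cast<int32_t>(A[r * width + k]) *
             static_cast<int32_t>(B[k * B_cols + c]);
    C[r * B_cols + c] = sum;
  };
  const std::size_t main_cols = B_cols & ~std::size_t(7);  // multiple-of-8 prefix
  for (std::size_t r = 0; r < A_rows; ++r) {
    for (std::size_t c = 0; c < main_cols; c += 8)   // regular 8-wide tiles
      for (std::size_t j = 0; j < 8; ++j) dot(r, c + j);
    for (std::size_t c = main_cols; c < B_cols; ++c) // leftover columns
      dot(r, c);
  }
}
```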
Unfortunately, now that we can have matrices with a non-multiple-of-eight number of columns, we no longer write the output columns consecutively, so the writes hit unaligned addresses and we segfault. For this reason I have replaced the store routine with storeu.
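For reference, the difference is just the aligned versus unaligned store intrinsic; with irregular column counts the destination is no longer guaranteed to be 64-byte aligned, and the aligned form faults:

```cpp
#include <immintrin.h>

// _mm512_store_si512 requires a 64-byte-aligned destination and faults
// otherwise; _mm512_storeu_si512 accepts any address, at little to no cost
// on recent CPUs. (Illustrative function; not a symbol from the patch.)
void StoreResult(void *dst, __m512i total) {
  // _mm512_store_si512(dst, total);  // old: segfaults on unaligned dst
  _mm512_storeu_si512(dst, total);    // new: safe for any alignment
}
```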
Preliminary performance benchmarks with the built-in benchmark at https://github.com/kpu/intgemm/blob/6228d016ecc63470d2dbb76bd4ab7b0abe097993/benchmarks/biasmultiply.cc#L267 to check for performance regressions. (These do not include irregularly shaped non-multiple-of-8 matrices.)

This branch (n=1):
```
taskset --cpu-list 0 ./biasmultiply
1000 iterations of SSSE3 without bias took: 2.31014 seconds.
1000 iterations of SSSE3 took: 2.39446 seconds.
1000 iterations of Shifted SSSE3 took: 1.98965 seconds.
1000 iterations of AVX2 without bias took: 1.33628 seconds.
1000 iterations of AVX2 took: 1.33306 seconds.
1000 iterations of Shifted AVX2 took: 1.20668 seconds.
1000 iterations of AVX512 without bias took: 1.01728 seconds.
1000 iterations of AVX512 took: 1.04101 seconds.
1000 iterations of Shifted AVX512 took: 0.779364 seconds.
1000 iterations of AVX512VNNI without bias took: 0.754878 seconds.
1000 iterations of AVX512VNNI took: 0.771353 seconds.
1000 iterations of Shifted AVX512VNNI took: 0.539761 seconds.
```
Master (n=1):

```
taskset --cpu-list 0 ./biasmultiply
1000 iterations of SSSE3 without bias took: 2.31003 seconds.
1000 iterations of SSSE3 took: 2.37843 seconds.
1000 iterations of Shifted SSSE3 took: 1.97674 seconds.
1000 iterations of AVX2 without bias took: 1.28795 seconds.
1000 iterations of AVX2 took: 1.33322 seconds.
1000 iterations of Shifted AVX2 took: 1.20815 seconds.
1000 iterations of AVX512 without bias took: 1.01804 seconds.
1000 iterations of AVX512 took: 1.06707 seconds.
1000 iterations of Shifted AVX512 took: 0.779698 seconds.
1000 iterations of AVX512VNNI without bias took: 0.776488 seconds.
1000 iterations of AVX512VNNI took: 0.772831 seconds.
1000 iterations of Shifted AVX512VNNI took: 0.653334 seconds.
```
Speed seems to be even better, but I don't trust that; maybe some instruction reordering makes the benchmark perform better. I will have to test it in a real-world situation later on.