
Hacky nonmult8 for VNNI

XapaJIaMnu opened this issue 3 years ago

It's not a purrfect implementation, but it is a start... This patch implements the following:

  • PrepareB for matrices with an arbitrary number of columns, for all architectures. The last non-multiple-of-eight columns are prepared and compressed as a small independent width-by-8 matrix, and zeroed blocks of register_width are stripped. Unfortunately, this is not done in place in the current implementation and involves memory copying; this can be improved in the future. I am using some inlined functions that don't have CPU_ATTR set, as I was lazy. I hope that inlining means they will be generated with the proper ISA limitations. Regardless, so far only the VNNI multiply is implemented anyway.
  • AVX512VNNI multiplication of matrices with an arbitrary number of columns, plus tests. The multiplication proceeds as normal until it reaches the last non-multiple-of-eight columns, which are then handled in a separate loop.

Example: if A is a 2x64 matrix and B is 64x9, we first multiply 2x64 by 64x8, and then 2x64 by 64x1 to produce the last column.

Unfortunately, now that matrices can have a non-multiple-of-eight number of columns, the columns are no longer written consecutively, so the stores hit unaligned memory addresses and we segfault. For this reason I have replaced the store routine with storeu.

Preliminary performance benchmarks with the built-in benchmark https://github.com/kpu/intgemm/blob/6228d016ecc63470d2dbb76bd4ab7b0abe097993/benchmarks/biasmultiply.cc#L267 to check for performance regressions. (These do not include irregularly shaped non-multiple-of-8 matrices.)

This branch (n=1)

taskset --cpu-list 0 ./biasmultiply
1000 iterations of SSSE3 without bias took: 2.31014 seconds.
1000 iterations of SSSE3 took: 2.39446 seconds.
1000 iterations of Shifted SSSE3 took: 1.98965 seconds.
1000 iterations of AVX2 without bias took: 1.33628 seconds.
1000 iterations of AVX2 took: 1.33306 seconds.
1000 iterations of Shifted AVX2 took: 1.20668 seconds.
1000 iterations of AVX512 without bias took: 1.01728 seconds.
1000 iterations of AVX512 took: 1.04101 seconds.
1000 iterations of Shifted AVX512 took: 0.779364 seconds.
1000 iterations of AVX512VNNI without bias took: 0.754878 seconds.
1000 iterations of AVX512VNNI took: 0.771353 seconds.
1000 iterations of Shifted AVX512VNNI took: 0.539761 seconds.

Master (n=1)

taskset --cpu-list 0 ./biasmultiply
1000 iterations of SSSE3 without bias took: 2.31003 seconds.
1000 iterations of SSSE3 took: 2.37843 seconds.
1000 iterations of Shifted SSSE3 took: 1.97674 seconds.
1000 iterations of AVX2 without bias took: 1.28795 seconds.
1000 iterations of AVX2 took: 1.33322 seconds.
1000 iterations of Shifted AVX2 took: 1.20815 seconds.
1000 iterations of AVX512 without bias took: 1.01804 seconds.
1000 iterations of AVX512 took: 1.06707 seconds.
1000 iterations of Shifted AVX512 took: 0.779698 seconds.
1000 iterations of AVX512VNNI without bias took: 0.776488 seconds.
1000 iterations of AVX512VNNI took: 0.772831 seconds.
1000 iterations of Shifted AVX512VNNI took: 0.653334 seconds.

Speed even seems to be slightly better, but I don't trust that; maybe some of the instruction reordering makes the benchmark perform better. I will have to test it in a real-world situation later on.

XapaJIaMnu · Jul 17 '21 00:07