[RVV] Add qu8-gemm/qu8-igemm kernels for rvv
- Adding these QU8 kernels for completeness, although I recognize that QU8 will be deprecated at some point.
- Small change to qs8-gemm/rvv.in for qu8 support, however qs8/qd8 generated kernels remain unchanged.
- Unit tests pass. I used a local fix for https://github.com/google/XNNPACK/issues/8096, as described there, to run the tests properly. I decided not to include that fix here, since landing it should really require re-running all gemm/igemm tests for all architectures.
- Performance of 4x4v and 7x4v is roughly equivalent on the test platform (K1) at ~15 G/s, so I've used 1x4v and 4x4v for the production config.
- In future I'll take a look at why the generated RVV gemm/igemm tests use the (mostly) duplicate CreateTests2, as it's not clear to me why that is required.
@dsharlet and @fbarchard please review when you are able. Thank you.
looks good overall.
You call this m4 but the store is m1? In the gemm config you set NR to `4 * hardware_config->vlenb / sizeof(int32_t)`. The elements are 8-bit and you store with m1, so it seems like it should be `nr = hardware_config->vlenb / sizeof(uint8_t)`.
4x4 - typically on large convolutions it helps to be taller: the overhead of reading the left-hand side as bytes is amortized by applying the same weights to more rows. I suggest running the gemm benchmark, looking for a large, slow convolution, filtering on that, and finding which gemm tile size is fastest.
The quantization you could do as uint8, like Arm does. Typically 8-bit data needs fewer registers than float, and/or min/max may be faster at 8 bits than in float.
The gemm-config doesn't specify a packw kernel? Weird... we are using the reference code. The 8-bit packing for qu8 would be similar to 'x8', which can produce qs8 or x8; it doesn't care about sign and is used for f32_qc8w. qu8 is not often used compared to qs8. qd8 does have a packw kernel, but it's using a strided load, which is slow. If that shows up high in profiles, it may be worth an optimization pass.
As you see, our qu8, requiring a kernel zero point, typically needs a 16-bit implementation, which is not ideal. Another way to do it is an 8-bit multiply of the kernel value and an 8-bit multiply of the zero point. The zero point only needs to be applied per input, not per weight, so if you support a wider NR, the zero point cost amortizes. You can also keep another set of accumulators. This method lends itself to hardware 8-bit matrix multiply or dot product. But it only helps qu8; qs8 doesn't have the kernel zero point.

On Arm we convert the kernel values to 16-bit because lanes only work on 16-bit values. But ideally you multiply 8-bit weights by inputs to get 16-bit, then do a widening accumulate. Arm supports padal, which adds pairs of 16-bit values, sums them into 32-bit, then accumulates with a 32-bit accumulator. Does RISC-V have a padal equivalent? If so, the 16-bit-to-32-bit step could make this a c2 (KR=2) kernel, processing 2 blocks at a time.