The purpose of different implementations of qs8/qu8-f32-vcvt kernels.
Hi @oliIMG, in PR https://github.com/google/XNNPACK/pull/7127 I see several versions of the qs8/qu8-f32-vcvt kernels: qs8-f32-vcvt-rvv-u1v.c, qs8-f32-vcvt-rvv-u2v.c, qu8-f32-vcvt-rvv-u1v.c and qu8-f32-vcvt-rvv-u2v.c. These appear to be the m1 and m2 RVV implementations of the kernels.
Some other kernels have four RVV implementation versions (based on the scalar version), in the forms m1, m2, m4 and m8.
I would like to know the purpose of these different implementations and how users can select the best version.
In general, a kernel with 'u4v' in its name uses 'm4' (LMUL=4) for the source. Kernels such as float binary ops can implement all four variations: m1, m2, m4 and m8. The fastest of these variations can then be enabled in the config files, such as src/configs/gemm-config.c. Which variation is fastest depends on the hardware, so once benchmarks have been run on hardware from different vendors, a switch statement on uarch can be added to select different kernels for different hardware.
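To illustrate the kind of uarch switch described above, here is a minimal sketch in portable C. Everything in it is hypothetical: the `example_uarch` enum, the `example_vcvt_config` struct, the function names and the vendor-to-kernel mapping are placeholders rather than XNNPACK's actual API or measured results, and the two "kernels" are scalar stand-ins for the u1v and u2v RVV files.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical microarchitecture IDs; NOT XNNPACK's real enum. */
enum example_uarch {
  example_uarch_unknown = 0,
  example_uarch_vendor_a,
  example_uarch_vendor_b,
};

/* Hypothetical kernel signature: dequantize `batch` int8 values to float. */
typedef void (*example_qs8_f32_vcvt_fn)(size_t batch, const int8_t* input,
                                        float* output, int32_t zero_point,
                                        float scale);

/* Scalar stand-in for the u1v (source LMUL=1) kernel. */
void example_qs8_f32_vcvt_u1v(size_t batch, const int8_t* input, float* output,
                              int32_t zero_point, float scale) {
  assert(input != NULL && output != NULL);
  for (size_t i = 0; i < batch; i++) {
    output[i] = (float) ((int32_t) input[i] - zero_point) * scale;
  }
}

/* Scalar stand-in for the u2v (source LMUL=2) kernel; same result, and in a
 * real RVV build only the number of elements per iteration would differ. */
void example_qs8_f32_vcvt_u2v(size_t batch, const int8_t* input, float* output,
                              int32_t zero_point, float scale) {
  example_qs8_f32_vcvt_u1v(batch, input, output, zero_point, scale);
}

/* Hypothetical config entry: the chosen kernel plus its source LMUL. */
struct example_vcvt_config {
  example_qs8_f32_vcvt_fn ukernel;
  int src_lmul;
};

/* Pick a variant per uarch. The mapping below is made up; a real one would
 * come from benchmarks on each vendor's hardware. */
struct example_vcvt_config example_select_qs8_f32_vcvt(enum example_uarch uarch) {
  struct example_vcvt_config config;
  switch (uarch) {
    case example_uarch_vendor_b:
      config.ukernel = example_qs8_f32_vcvt_u1v;
      config.src_lmul = 1;
      break;
    case example_uarch_vendor_a:
    default:
      config.ukernel = example_qs8_f32_vcvt_u2v;
      config.src_lmul = 2;
      break;
  }
  return config;
}
```

The caller only sees the function pointer, so adding a new variant or changing the per-vendor choice would not affect code that runs the kernel.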
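With 8- or 16-bit datatypes, the intermediates are often widened, which limits the variations to m1, m2 and maybe m4. For example, converting 8-bit input to f32 makes each element four times wider, so an m2 source already produces an m8 result, the largest register group RVV allows.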
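The limit can be worked out with simple arithmetic. The sketch below uses hypothetical helper names and considers only integer LMUL values (1, 2, 4, 8), ignoring fractional LMUL. It computes the LMUL of the widened intermediate and the largest source LMUL that still fits in an m8 register group.

```c
#include <assert.h>
#include <stdbool.h>

/* LMUL of the intermediate when elements of src_bits are widened to dst_bits
 * and the source is held at src_lmul. */
int widened_lmul(int src_lmul, int src_bits, int dst_bits) {
  assert(src_bits > 0 && dst_bits % src_bits == 0);
  return src_lmul * (dst_bits / src_bits);
}

/* An RVV register group spans at most 8 registers (LMUL <= 8). */
bool lmul_is_usable(int lmul) {
  return lmul >= 1 && lmul <= 8;
}

/* Largest integer source LMUL whose widened intermediate still fits. */
int max_source_lmul(int src_bits, int dst_bits) {
  int best = 0;
  for (int lmul = 1; lmul <= 8; lmul *= 2) {
    if (lmul_is_usable(widened_lmul(lmul, src_bits, dst_bits))) {
      best = lmul;
    }
  }
  return best;
}
```

Under these assumptions, `max_source_lmul(8, 32)` gives 2, which matches the qs8/qu8-f32-vcvt kernels stopping at u2v, while `max_source_lmul(16, 32)` gives 4 and `max_source_lmul(32, 32)` gives 8.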