cpu: aarch64: enable jit conv for 128
Description
Draft: this needs some optimization tweaks, and the deconv tests currently fail due to check_zero_padding.
Naively enable 128 by copying the equivalent 512 and 256 invocations. Note that is_1stconv is hard-coded to true for sve_128, which leaves some performance on the table.
More optimization is needed, but this already speeds up some cases, specifically backward. In other cases it was slower than Arm Compute Library (ACL), so unlike its 512 and 256 counterparts, it has been placed below the ACL implementations in the CPU implementation list.
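
To illustrate why the position in the list matters, here is a minimal sketch of a first-match implementation list. It is not oneDNN code: `ImplEntry`, `impl_list`, `pick_impl`, the entry names, and the `acl_supports_problem` flag are all hypothetical, and the sketch assumes that the first applicable entry wins and that wider SVE entries are gated on the vector length only.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for one entry of an implementation list.
// None of these names are oneDNN APIs.
struct ImplEntry {
    std::string name;
    // Returns true if this implementation can handle the problem.
    bool (*is_applicable)(int sve_bits, bool acl_supports_problem);
};

// Entries are tried top to bottom; the first applicable one wins.
// The 128-bit JIT entry sits below ACL, mirroring the ordering
// described in this PR (an assumption about the final layout).
const std::vector<ImplEntry> &impl_list() {
    static const std::vector<ImplEntry> list = {
            {"jit:sve_512", [](int bits, bool) { return bits >= 512; }},
            {"jit:sve_256", [](int bits, bool) { return bits >= 256; }},
            {"acl", [](int, bool acl_ok) { return acl_ok; }},
            {"jit:sve_128", [](int bits, bool) { return bits >= 128; }},
            {"ref", [](int, bool) { return true; }},
    };
    return list;
}

// Walk the list and return the name of the first applicable entry.
std::string pick_impl(int sve_bits, bool acl_supports_problem) {
    // Sketch-only sanity check: vector length is 0 (no SVE) or a
    // multiple of 128 bits.
    assert(sve_bits >= 0 && sve_bits % 128 == 0);
    for (const auto &entry : impl_list()) {
        if (entry.is_applicable(sve_bits, acl_supports_problem))
            return entry.name;
    }
    return "none"; // unreachable while "ref" accepts everything
}
```

With this ordering, `pick_impl(128, true)` selects the ACL entry, and the 128-bit JIT entry is only reached when ACL declines the problem (`pick_impl(128, false)`). Moving the entry above ACL would flip that preference, which is the decision left open pending performance data.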
Checklist
General
- [ ] Do all unit and benchdnn tests (`make test` and `make test_benchdnn_*`) pass locally for each commit?
- [ ] Have you formatted the code using clang-format?
Performance improvements
- [ ] Have you submitted performance data that demonstrates performance improvements?
@kasturedeeksha is this a reasonable approach? (It's a draft, so I know there may be some code quality issues)
Should help #2165
@jondea Yes, this approach can be used to extend the work to 128-bit support. We just need to verify that all tests pass and check whether anything more needs to be done. The priority in cpu_convolution_list can be decided based on performance.