cpu: aarch64: add ASIMD softmax JIT implementation
Description
This commit introduces an f32 ASIMD softmax JIT implementation using the exp eltwise injector added in #4376. It also improves performance for the existing sve_* implementations, primarily by increasing the unrolling factor unroll_regs_ and by skipping the multiplication with the dequantization / requantization factors src_scales / dst_scales when they have their default values. For jit:asimd and jit:sve_128, the exp function is also effectively inlined by setting preserve_vmm = false; jit:sve_256 did not benefit from this change.
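As a rough illustration of the scale-skipping and unrolling ideas above, here is a scalar C++ sketch of one softmax row. It is not the JIT kernel: `softmax_row`, `src_factor` and `dst_factor` are hypothetical names, a default value of 1.0f is assumed, and where exactly oneDNN applies its scale factors is an assumption of this sketch.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Scalar reference for one softmax row (illustrative only, not the JIT code).
// src_factor / dst_factor are hypothetical stand-ins for the scale factors;
// 1.0f is assumed to be the default, in which case the multiply is skipped.
std::vector<float> softmax_row(const std::vector<float> &src,
        float src_factor = 1.0f, float dst_factor = 1.0f) {
    assert(!src.empty());
    const std::size_t n = src.size();
    const bool apply_src = src_factor != 1.0f;
    const bool apply_dst = dst_factor != 1.0f;

    std::vector<float> x(src);
    if (apply_src)
        for (float &v : x) v *= src_factor;

    // Pass 1: row maximum, unrolled over 4 independent accumulators.
    float m[4] = {x[0], x[0], x[0], x[0]};
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4)
        for (std::size_t u = 0; u < 4; ++u) m[u] = std::max(m[u], x[i + u]);
    for (; i < n; ++i) m[0] = std::max(m[0], x[i]);
    const float max_val
            = std::max(std::max(m[0], m[1]), std::max(m[2], m[3]));

    // Pass 2: exponentiate the shifted values and accumulate their sum.
    float sum = 0.0f;
    for (float &v : x) {
        v = std::exp(v - max_val);
        sum += v;
    }

    // Pass 3: normalize; the output factor is folded in only when needed.
    const float inv = 1.0f / sum;
    const float out = apply_dst ? inv * dst_factor : inv;
    for (float &v : x) v *= out;
    return x;
}
```

With default factors, the only per-element multiply is the normalization in pass 3, which is the saving described above.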
As the previous softmax implementation heavily relied on predicated instructions, jit_softmax_base_t was refactored to only include common logic for SVE and non-SVE implementations alike. At the same time, two different derived constructs were added to handle ISA-specific work: jit_softmax_sve_t and jit_softmax_asimd_t.
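The split between common and ISA-specific logic might look roughly like the toy C++ model below. Everything in it is hypothetical: the `*_sketch_t` structs only echo the names in this description, the row-maximum reduction is chosen purely for illustration, and the tail handling shown for each ISA is an assumption rather than the actual kernel code.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Toy model only: struct names echo the description, bodies are hypothetical.
constexpr std::size_t kLanes = 4; // f32 lanes in a 128-bit vector

struct softmax_base_sketch_t {
    virtual ~softmax_base_sketch_t() = default;

    // ISA-agnostic part: full vectors in the main loop, remainder delegated.
    float row_max(const std::vector<float> &row) const {
        assert(!row.empty());
        float acc = -std::numeric_limits<float>::infinity();
        std::size_t i = 0;
        for (; i + kLanes <= row.size(); i += kLanes)
            for (std::size_t l = 0; l < kLanes; ++l)
                acc = std::max(acc, row[i + l]);
        if (i < row.size())
            acc = std::max(acc, tail_max(row.data() + i, row.size() - i));
        return acc;
    }

protected:
    // ISA-specific part: reduce the last n < kLanes elements.
    virtual float tail_max(const float *p, std::size_t n) const = 0;
};

// SVE-like: one masked vector step; inactive lanes hold a neutral value.
struct softmax_sve_sketch_t : softmax_base_sketch_t {
protected:
    float tail_max(const float *p, std::size_t n) const override {
        float lanes[kLanes];
        for (std::size_t l = 0; l < kLanes; ++l)
            lanes[l] = l < n ? p[l] : -std::numeric_limits<float>::infinity();
        return *std::max_element(lanes, lanes + kLanes);
    }
};

// ASIMD-like: no predicates, so the remainder is handled element by element.
struct softmax_asimd_sketch_t : softmax_base_sketch_t {
protected:
    float tail_max(const float *p, std::size_t n) const override {
        float acc = p[0];
        for (std::size_t l = 1; l < n; ++l) acc = std::max(acc, p[l]);
        return acc;
    }
};
```

Both derived sketches return the same result for any row; only the way the remainder is processed differs, which is the kind of difference the refactoring isolates.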
In addition, the JIT eltwise injector was changed to support storing/loading preserved vectors on non-SVE targets.
Performance improvements (f32)
c6g
| Shape | Threads | jit:asimd (ms) | acl (ms) | Speedup |
|---|---|---|---|---|
| 1539x387 | 1 | 1.21689 | 1.5615 | 1.28 |
| 1539x387 | 4 | 0.306583 | 0.394197 | 1.29 |
| 1539x387 | 16 | 0.078976 | 0.103172 | 1.31 |
| 1539x387 | 64 | 0.02816 | 0.04522 | 1.61 |
| 1024x4096 | 1 | 8.12552 | 10.4083 | 1.28 |
| 1024x4096 | 4 | 2.05314 | 2.62449 | 1.28 |
| 1024x4096 | 16 | 0.526042 | 0.678114 | 1.29 |
| 1024x4096 | 64 | 0.13881 | 0.182793 | 1.32 |
| 4096x4096 | 1 | 32.5925 | 41.3373 | 1.27 |
| 4096x4096 | 4 | 8.19186 | 10.3651 | 1.27 |
| 4096x4096 | 16 | 2.0928 | 2.66398 | 1.27 |
| 4096x4096 | 64 | 0.734764 | 0.937735 | 1.28 |
c7g
| Shape | Threads | jit:sve_256 after (ms) | jit:sve_256 before (ms) | Speedup |
|---|---|---|---|---|
| 1539x387 | 1 | 0.58647 | 0.748606 | 1.28 |
| 1539x387 | 4 | 0.150092 | 0.189787 | 1.26 |
| 1539x387 | 16 | 0.03906 | 0.049228 | 1.26 |
| 1539x387 | 64 | 0.018721 | 0.021218 | 1.13 |
| 1024x4096 | 1 | 3.94334 | 5.12185 | 1.30 |
| 1024x4096 | 4 | 0.991868 | 1.30929 | 1.32 |
| 1024x4096 | 16 | 0.24468 | 0.329952 | 1.35 |
| 1024x4096 | 64 | 0.084429 | 0.108232 | 1.28 |
| 4096x4096 | 1 | 15.9669 | 20.4236 | 1.28 |
| 4096x4096 | 4 | 4.08712 | 5.56156 | 1.36 |
| 4096x4096 | 16 | 1.08677 | 1.43602 | 1.32 |
| 4096x4096 | 64 | 0.369658 | 0.432615 | 1.17 |
c8g
| Shape | Threads | jit:sve_128 after (ms) | jit:sve_128 before (ms) | Speedup |
|---|---|---|---|---|
| 1539x387 | 1 | 0.669235 | 0.863312 | 1.29 |
| 1539x387 | 4 | 0.168464 | 0.217245 | 1.29 |
| 1539x387 | 16 | 0.043956 | 0.055711 | 1.27 |
| 1539x387 | 64 | 0.018259 | 0.023519 | 1.29 |
| 1024x4096 | 1 | 4.95383 | 6.07039 | 1.23 |
| 1024x4096 | 4 | 1.17104 | 1.50691 | 1.29 |
| 1024x4096 | 16 | 0.295833 | 0.367653 | 1.24 |
| 1024x4096 | 64 | 0.09172 | 0.130347 | 1.42 |
| 4096x4096 | 1 | 20.0518 | 24.4886 | 1.22 |
| 4096x4096 | 4 | 5.11177 | 6.25783 | 1.22 |
| 4096x4096 | 16 | 1.3261 | 1.58102 | 1.19 |
| 4096x4096 | 64 | 0.341221 | 0.478697 | 1.40 |
As this change is pretty big, do you think it would be possible to neatly split it into two commits: one for the SVE optimizations and one for the ASIMD implementation? The SVE changes could perhaps even be a separate PR.
I've now split up the changes into 3 separate commits:
- cpu: aarch64: refactor jit_uni_softmax:
  - keeps ISA-agnostic logic in jit_softmax_base_t, while all SVE-specific code is moved into a new construct jit_softmax_sve_t.
  - most of the changes are due to indentation differences.
- cpu: aarch64: add ASIMD softmax JIT implementation:
  - adds ASIMD kernel, but also improves SVE kernels as the unroll factor change is done directly in the common base struct jit_softmax_base_t.
- cpu: aarch64: improve SVE JIT softmax performance:
  - adapts some of the ASIMD performance gains for the SVE kernels too, in particular SVE 128 as they share the same vector length.
I will move the final commit to a follow-up PR if you think that's best. I've only left all 3 together for now because the c7g/c8g speedups would be less noticeable at a glance if the SVE improvements in commits 2 and 3 were split up, compared to being all together in a single table like this.