Description

This commit introduces an f32 ASIMD softmax JIT implementation using the exp eltwise injector added in #4376. It also improves performance of the existing sve_* implementations, primarily by increasing the unrolling factor unroll_regs_ and by skipping the multiplication by the dequantization / requantization factors src_scales / dst_scales when they are left at their default values. For jit:asimd and jit:sve_128, the exp function is also effectively inlined by setting preserve_vmm = false; jit:sve_256 did not benefit from this change.
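To make the two main optimizations concrete, here is a minimal scalar sketch of one softmax row. It is not the JIT kernel: the function name softmax_row_sketch, the unroll factor of 4, and the choice to apply both scales to the normalized output are assumptions made for illustration only. It shows independent accumulators in the max pass (the idea behind a larger unroll_regs_) and a scale multiply that is skipped when a factor is the default 1.0.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical scalar reference, not the oneDNN JIT kernel.
// Assumption: both scales are applied to the normalized output.
std::vector<float> softmax_row_sketch(const std::vector<float> &src,
        float src_scale = 1.0f, float dst_scale = 1.0f) {
    assert(!src.empty());
    const std::size_t n = src.size();

    // Decide once, outside the loops, whether a multiply is needed at all.
    const bool with_src_scale = src_scale != 1.0f;
    const bool with_dst_scale = dst_scale != 1.0f;

    // Pass 1: row maximum with `unroll` independent accumulators, which
    // shortens the dependency chain compared to a single running maximum.
    constexpr std::size_t unroll = 4; // illustrative value only
    float acc[unroll];
    std::fill(acc, acc + unroll, src[0]);
    std::size_t i = 0;
    for (; i + unroll <= n; i += unroll)
        for (std::size_t u = 0; u < unroll; ++u)
            acc[u] = std::max(acc[u], src[i + u]);
    for (; i < n; ++i) // leftover elements
        acc[0] = std::max(acc[0], src[i]);
    const float max_val = *std::max_element(acc, acc + unroll);

    // Pass 2: exponentials of the shifted inputs and their sum.
    std::vector<float> dst(n);
    float sum = 0.0f;
    for (std::size_t k = 0; k < n; ++k) {
        dst[k] = std::exp(src[k] - max_val);
        sum += dst[k];
    }

    // Pass 3: normalize; multiply by a scale only if it is non-default.
    const float inv_sum = 1.0f / sum;
    for (float &v : dst) {
        v *= inv_sum;
        if (with_src_scale) v *= src_scale;
        if (with_dst_scale) v *= dst_scale;
    }
    return dst;
}
```

With default arguments the final loop performs a single multiply per element, which is the saving the description refers to.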

As the previous softmax implementation relied heavily on predicated instructions, which ASIMD does not provide, jit_softmax_base_t was refactored to contain only the logic common to SVE and non-SVE implementations. Two derived structs, jit_softmax_sve_t and jit_softmax_asimd_t, were added to handle the ISA-specific work.
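The shape of this split can be sketched in plain C++. Every name below (softmax_base_sketch_t and its derived types, row_max, tail_max, simd_w) is hypothetical and only illustrates the idea; the real structs emit machine code rather than compute values, and how the actual ASIMD kernel handles tails is not stated here. The base holds the loop over full vectors, while each derived struct decides how to process a tail shorter than one vector: a lane mask stands in for an SVE predicate, and a scalar loop stands in for a non-predicated fallback.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the ISA-agnostic base.
struct softmax_base_sketch_t {
    virtual ~softmax_base_sketch_t() = default;

    // Common logic: process full vectors, then delegate the tail.
    float row_max(const std::vector<float> &row) const {
        assert(!row.empty());
        const std::size_t vlen = simd_w();
        float m = row[0];
        std::size_t i = 0;
        for (; i + vlen <= row.size(); i += vlen)
            for (std::size_t l = 0; l < vlen; ++l)
                m = std::max(m, row[i + l]);
        return tail_max(row, i, m);
    }

    virtual std::size_t simd_w() const = 0;
    virtual float tail_max(
            const std::vector<float> &row, std::size_t start, float m) const = 0;
};

// SVE-like: one masked "vector" step, emulating a predicate register.
struct softmax_sve_sketch_t : softmax_base_sketch_t {
    std::size_t simd_w() const override { return 8; } // 256-bit, f32 lanes
    float tail_max(const std::vector<float> &row, std::size_t start,
            float m) const override {
        for (std::size_t l = 0; l < simd_w(); ++l) {
            const bool lane_active = start + l < row.size();
            if (lane_active) m = std::max(m, row[start + l]);
        }
        return m;
    }
};

// ASIMD-like: no predicates, so leftovers are handled one element at a time.
struct softmax_asimd_sketch_t : softmax_base_sketch_t {
    std::size_t simd_w() const override { return 4; } // 128-bit, f32 lanes
    float tail_max(const std::vector<float> &row, std::size_t start,
            float m) const override {
        for (std::size_t i = start; i < row.size(); ++i)
            m = std::max(m, row[i]);
        return m;
    }
};
```

Both derived sketches return the same result for any row; only the tail strategy differs.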

In addition, the JIT eltwise injector was changed to support storing/loading preserved vectors on non-SVE targets.

Performance improvements (f32)

c6g

Shape	Threads	jit:asimd (ms)	acl (ms)	Speedup
1539x387	1	1.21689	1.5615	1.28
1539x387	4	0.306583	0.394197	1.29
1539x387	16	0.078976	0.103172	1.31
1539x387	64	0.02816	0.04522	1.61
1024x4096	1	8.12552	10.4083	1.28
1024x4096	4	2.05314	2.62449	1.28
1024x4096	16	0.526042	0.678114	1.29
1024x4096	64	0.13881	0.182793	1.32
4096x4096	1	32.5925	41.3373	1.27
4096x4096	4	8.19186	10.3651	1.27
4096x4096	16	2.0928	2.66398	1.27
4096x4096	64	0.734764	0.937735	1.28

c7g

Shape	Threads	jit:sve_256 after (ms)	jit:sve_256 before (ms)	Speedup
1539x387	1	0.58647	0.748606	1.28
1539x387	4	0.150092	0.189787	1.26
1539x387	16	0.03906	0.049228	1.26
1539x387	64	0.018721	0.021218	1.13
1024x4096	1	3.94334	5.12185	1.30
1024x4096	4	0.991868	1.30929	1.32
1024x4096	16	0.24468	0.329952	1.35
1024x4096	64	0.084429	0.108232	1.28
4096x4096	1	15.9669	20.4236	1.28
4096x4096	4	4.08712	5.56156	1.36
4096x4096	16	1.08677	1.43602	1.32
4096x4096	64	0.369658	0.432615	1.17

c8g

Shape	Threads	jit:sve_128 after (ms)	jit:sve_128 before (ms)	Speedup
1539x387	1	0.669235	0.863312	1.29
1539x387	4	0.168464	0.217245	1.29
1539x387	16	0.043956	0.055711	1.27
1539x387	64	0.018259	0.023519	1.29
1024x4096	1	4.95383	6.07039	1.23
1024x4096	4	1.17104	1.50691	1.29
1024x4096	16	0.295833	0.367653	1.24
1024x4096	64	0.09172	0.130347	1.42
4096x4096	1	20.0518	24.4886	1.22
4096x4096	4	5.11177	6.25783	1.22
4096x4096	16	1.3261	1.58102	1.19
4096x4096	64	0.341221	0.478697	1.40
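The Speedup columns in the tables above are consistent with the baseline time divided by the new time, rounded to two decimals. A small sketch of that arithmetic, where the helper name speedup_sketch is an assumption for illustration:

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: ratio of baseline time to new time, rounded to two
// decimals as in the tables. Both times must use the same unit.
double speedup_sketch(double baseline_ms, double new_ms) {
    assert(new_ms > 0.0);
    return std::round(baseline_ms / new_ms * 100.0) / 100.0;
}
```

For example, the first c6g row has acl at 1.5615 ms and jit:asimd at 1.21689 ms, so the ratio is about 1.283.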

Dec 09 '25 13:12 Anndrey24

As this change is pretty big, do you think it would be possible to neatly split it into two commits: one for the SVE optimizations and one for the ASIMD implementation? The SVE changes could perhaps even go into a separate PR.

Dec 09 '25 13:12 michalowski-arm

I've now split up the changes into 3 separate commits:

cpu: aarch64: refactor jit_uni_softmax:
- keeps ISA-agnostic logic in jit_softmax_base_t, while all SVE-specific code is moved into a new construct jit_softmax_sve_t.
- most of the changes are due to indentation differences.
cpu: aarch64: add ASIMD softmax JIT implementation:
- adds ASIMD kernel, but also improves SVE kernels as the unroll factor change is done directly in the common base struct jit_softmax_base_t.
cpu: aarch64: improve SVE JIT softmax performance:
- adapts some of the ASIMD performance gains for the SVE kernels too, in particular SVE 128, as it shares the same vector length as ASIMD.

I will move the final commit to a follow-up PR if you think that's best. I've left all 3 together for now only because the c7g/c8g speedups would be less noticeable at a glance if the SVE improvements in commits 2 and 3 were split up, compared to being all together in a single table like this.

Dec 09 '25 17:12 Anndrey24

cpu: aarch64: add ASIMD softmax JIT implementation
