[AArch64] Generate sqdmlal
| | |
| --- | --- |
| Bugzilla Link | 50653 |
| Version | trunk |
| OS | Linux |
| CC | @Arnaud-de-Grandmaison-ARM, @DMG862, @smithp35 |
Extended Description
Raising this missed optimisation opportunity in case someone finds this interesting.
For this input:
```c
#include "arm_neon.h"

int32_t t_vqdmlalh_lane_s16 (int32_t a, int16_t b, int16x4_t c) {
  return vqdmlalh_lane_s16 (a, b, c, 0);
}
```
We are not generating this multiply-accumulate variant that gcc generates:
```asm
t_vqdmlalh_lane_s16:
        dup     v2.4h, w1
        fmov    s1, w0
        sqdmlal s1, h2, v0.h[0]
        fmov    w0, s1
        ret
```
We get this instead:
```asm
t_vqdmlalh_lane_s16:                    // @t_vqdmlalh_lane_s16
        fmov    s1, w1
        sqdmull v0.4s, v1.4h, v0.4h
        fmov    s1, w0
        sqadd   s0, s1, s0
        fmov    w0, s0
        ret
```
See also https://godbolt.org/z/41nMxM5q1
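For reference, SQDMLAL performs a signed saturating doubling multiply-add: the product of the two halfword operands is doubled and saturated, then saturating-added into the 32-bit accumulator. Below is a minimal scalar model of what t_vqdmlalh_lane_s16 computes (a sketch based on the architectural definition; the helper names are illustrative, not from this report):

```c
#include <stdint.h>

/* Saturate a 64-bit value to the signed 32-bit range. */
static int32_t sat_s32(int64_t v) {
  if (v > INT32_MAX) return INT32_MAX;
  if (v < INT32_MIN) return INT32_MIN;
  return (int32_t)v;
}

/* Reference model of vqdmlalh_lane_s16(a, b, c, lane):
 * double the b * c[lane] product with saturation (sqdmull),
 * then saturating-add it into the accumulator (sqadd). */
int32_t ref_vqdmlalh_lane_s16(int32_t a, int16_t b, const int16_t c[4],
                              int lane) {
  int64_t dbl = 2 * (int64_t)b * (int64_t)c[lane];
  return sat_s32((int64_t)a + sat_s32(dbl));
}
```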
I am looking into this.
There are equivalent missed optimization opportunities with the following intrinsics too:
- vqdmlalh_s16
- vqdmlslh_s16
- vqdmlalh_laneq_s16
- vqdmlslh_lane_s16
- vqdmlslh_laneq_s16
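For illustration, here is one reproducer per intrinsic (a sketch; the function names are mine, not from the report). Each should be selectable as a single sqdmlal/sqdmlsl but currently is not:

```c
#include "arm_neon.h"

int32_t t_vqdmlalh_s16(int32_t a, int16_t b, int16_t c) {
  return vqdmlalh_s16(a, b, c);
}

int32_t t_vqdmlslh_s16(int32_t a, int16_t b, int16_t c) {
  return vqdmlslh_s16(a, b, c);
}

int32_t t_vqdmlalh_laneq_s16(int32_t a, int16_t b, int16x8_t c) {
  return vqdmlalh_laneq_s16(a, b, c, 0);
}

int32_t t_vqdmlslh_lane_s16(int32_t a, int16_t b, int16x4_t c) {
  return vqdmlslh_lane_s16(a, b, c, 0);
}

int32_t t_vqdmlslh_laneq_s16(int32_t a, int16_t b, int16x8_t c) {
  return vqdmlslh_laneq_s16(a, b, c, 0);
}
```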
I found a fix for this issue.
I also found that sqdmlal/sqdmlsl instructions are in fact generated from the following intrinsics, as long as the lane number passed is not 0:
- vqdmlalh_lane_s16
- vqdmlalh_laneq_s16
- vqdmlslh_lane_s16
- vqdmlslh_laneq_s16
For example, for this C code:
```c
int32_t u_vqdmlalh_lane_s16(int32_t a, int16_t b, int16x4_t v) {
  return vqdmlalh_lane_s16(a, b, v, 1);
}
```
Clang generates this AArch64 assembly code:
```asm
u_vqdmlalh_lane_s16:                    // @u_vqdmlalh_lane_s16
        fmov    s1, w1
        fmov    s2, w0
        sqdmlal v2.4s, v1.4h, v0.h[1]
        fmov    w0, s2
        ret
```
See https://godbolt.org/z/fYM6G1TcM for all my experiments.
When the lane number is not 0, the generated DAG has a different shape, which is matched by this TableGen definition:
https://github.com/llvm/llvm-project/blob/f8d976171f2a1b7bf9268929f77904973edb0378/llvm/lib/Target/AArch64/AArch64InstrFormats.td#L8850-L8862
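A plausible reading (my hypothesis, not confirmed in the report): an extract of lane 0 is canonicalized away during DAG combining, so the vector_extract node that this indexed pattern expects is no longer present. Writing the lane-0 extract by hand gives a semantically equivalent function that, under that hypothesis, should hit the same suboptimal sqdmull + sqadd sequence:

```c
#include "arm_neon.h"

/* Equivalent formulation of t_vqdmlalh_lane_s16 with an explicit
 * lane-0 extract (function name is illustrative, not from the report). */
int32_t x_vqdmlalh_lane0_s16(int32_t a, int16_t b, int16x4_t c) {
  return vqdmlalh_s16(a, b, vget_lane_s16(c, 0));
}
```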
IIUC there's a candidate patch: https://reviews.llvm.org/D131700