Support for saturating doubling multiply add
There are saturating doubling multiply-add instructions in NEON and SVE - vqdmla and svqdmla - which are equivalent to hn::SaturatedAdd(hn::Mul(2, hn::Mul(a_real, b_imag)), hn::Mul(2, hn::Mul(a_imag, b_real))). In testing, we have seen a significant performance decrease using highway vs NEON and highway vs SVE. I was wondering if it would be possible to add a similar instruction to highway so we are able to leverage these qdmla instructions.
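For reference, here is a scalar sketch of the per-lane semantics, assuming 16-bit inputs widened into a 32-bit accumulator as in the widening NEON variant; the helper names below are purely illustrative, not any library's API:

#include <cstdint>
#include <limits>

// Illustrative scalar model of one lane (not a library API): the product is
// widened, doubled with saturation, then accumulated with saturation.
static int32_t SatToI32(int64_t x) {
  const int64_t lo = std::numeric_limits<int32_t>::min();
  const int64_t hi = std::numeric_limits<int32_t>::max();
  return static_cast<int32_t>(x < lo ? lo : (x > hi ? hi : x));
}

int32_t SaturatingDoublingMulAdd(int32_t acc, int16_t a, int16_t b) {
  const int64_t doubled = 2 * static_cast<int64_t>(a) * static_cast<int64_t>(b);
  return SatToI32(static_cast<int64_t>(acc) + SatToI32(doubled));
}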
It sounds reasonable to want access to this instruction. I'm curious about the purpose of the extra 2x mul in their definition?
Because NEON differs from SVE in that it takes the upper/lower half vs even/odd lanes, I think we'd want to define the op like ReorderWidenMulAccumulate, which is allowed to return any lane order. This is fine for dot products, but I wonder if it's still OK for your use case, or whether we have some other idea how to bridge the gap between NEON and SVE?
The use case we've tested is a vector multiplication, so the order needs to be preserved. You raise a good point on the difference between NEON and SVE. We could potentially return two vectors - the lower and upper half of the result - which sounds like a good compromise, since I'd imagine you'd need the whole vector for most use cases anyway.
Makes sense. ReorderWidenMulAccumulate also returns a second value using an output param, so there is precedent.
So the proposed op would call both vqdmlal and vqdmlal_high on NEON, and svqdmlalb + svqdmlalt on SVE2?
I'm still curious about the 2x factor in the instruction - is it also for Q1.x fixed-point (similar to MulFixedPoint15)?
So the proposed op would call both vqdmlal and vqdmlal_high on NEON, and svqdmlalb + svqdmlalt on SVE2?
Yes, that is correct. I'm still trying to find out why there is a 2x factor myself; I'll give an update once I have an answer.
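Concretely, the NEON half of that mapping could look something like the following AArch64 sketch (illustrative only, not the proposed Highway op):

#include <arm_neon.h>

// Illustrative AArch64 sketch of that NEON mapping: the lower four i16 lanes
// go through vqdmlal_s16 and the upper four through vqdmlal_high_s16, each
// accumulating saturate(2*a*b) into an i32x4 accumulator.
void QdmlalFull(int32x4_t& acc_lo, int32x4_t& acc_hi, int16x8_t a, int16x8_t b) {
  acc_lo = vqdmlal_s16(acc_lo, vget_low_s16(a), vget_low_s16(b));
  acc_hi = vqdmlal_high_s16(acc_hi, a, b);
}

On SVE2 the equivalent would pair svqdmlalb (even lanes) with svqdmlalt (odd lanes), which is where the lane-order question above comes from.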
So it looks like the doubling is just a side effect of the way the multiplication is done. The multiplication operates on Q0.15 values, which produces a Q0.30 product; doubling (a left shift by one, with saturation) places the result in Q0.31 format, and this is equivalent to a doubling saturating multiply-add of two 16-bit integers.
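To make that concrete with plain scalar arithmetic (the numbers below are purely illustrative):

#include <cstdint>
#include <cstdio>

int main() {
  // 0.5 in Q0.15 is 0.5 * 2^15 = 16384.
  const int16_t a = 16384, b = 16384;
  // The raw integer product is in Q0.30: 16384 * 16384 = 268435456 = 0.25 * 2^30.
  const int64_t q30 = static_cast<int64_t>(a) * b;
  // Doubling rescales it to Q0.31: 536870912 = 0.25 * 2^31.
  const int64_t q31 = 2 * q30;
  printf("Q0.30: %lld  Q0.31: %lld\n", (long long)q30, (long long)q31);
  return 0;
}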
Got it, thanks. In that case adding FixedPoint to the op name may be helpful.
AVX3_DL is also capable of carrying out the saturating doubling multiply-add using the _mm*_dpwssds_epi32 intrinsics.
Here is how the vqdmla op can be implemented on AVX3_DL for I32 vectors that have 1 to 4 lanes:
template <class VI32, HWY_IF_V_SIZE_LE_V(VI32, 16)>
HWY_API VI32 Vqdmla(VI32 a, Vec<Rebind<int16_t, DFromV<VI32>>> b,
                    Vec<Rebind<int16_t, DFromV<VI32>>> c) {
  const Rebind<int16_t, DFromV<VI32>> di16;
  const Twice<decltype(di16)> dt_i16;
  // Duplicate each i16 lane into an adjacent pair so the pairwise dot product
  // computes b[i]*c[i] + b[i]*c[i] = 2*b[i]*c[i] per i32 lane.
  const auto dup_b = InterleaveWholeLower(dt_i16, ResizeBitCast(dt_i16, b),
                                          ResizeBitCast(dt_i16, b));
  const auto dup_c = InterleaveWholeLower(dt_i16, ResizeBitCast(dt_i16, c),
                                          ResizeBitCast(dt_i16, c));
  // VPDPWSSDS: per-lane i16 pair dot product accumulated into a with
  // signed saturation.
  return VI32{_mm_dpwssds_epi32(a.raw, dup_b.raw, dup_c.raw)};
}
The vqdmla op can also be carried out on PPC8/PPC9/PPC10 using vec_msums. Here is how it can be implemented on those targets:
template <class VI32, HWY_IF_V_SIZE_LE_V(VI32, 16)>
HWY_API VI32 Vqdmla(VI32 a, Vec<Rebind<int16_t, DFromV<VI32>>> b,
                    Vec<Rebind<int16_t, DFromV<VI32>>> c) {
  const Rebind<int16_t, DFromV<VI32>> di16;
  const Twice<decltype(di16)> dt_i16;
  // Same duplication trick as the AVX3_DL version: each i16 lane is paired
  // with itself so the multiply-sum yields 2*b[i]*c[i] per i32 lane.
  const auto dup_b = InterleaveWholeLower(dt_i16, ResizeBitCast(dt_i16, b),
                                          ResizeBitCast(dt_i16, b));
  const auto dup_c = InterleaveWholeLower(dt_i16, ResizeBitCast(dt_i16, c),
                                          ResizeBitCast(dt_i16, c));
  // vec_msums (vmsumshs): i16 pair dot product plus the corresponding i32
  // lane of a, with signed saturation.
  return VI32{vec_msums(dup_b.raw, dup_c.raw, a.raw)};
}
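To show how the helpers above would be invoked, here is a hedged usage sketch; the tag choices are mine and this assumes the code sits inside the usual HWY_NAMESPACE block:

// Hypothetical usage of the Vqdmla helper above. Each i32 lane of the result
// holds acc[i] + saturate(2 * b[i] * c[i]).
void ExampleUse() {
  const FixedTag<int32_t, 4> d32;
  const Rebind<int16_t, decltype(d32)> d16;
  const auto acc = Zero(d32);
  const auto b = Set(d16, int16_t{16384});  // 0.5 in Q0.15
  const auto c = Set(d16, int16_t{16384});  // 0.5 in Q0.15
  const auto result = Vqdmla(acc, b, c);    // each lane: 0.25 in Q0.31 = 536870912
  (void)result;
}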
Very nice! @Ryo-not-rio FYI John's solution defines the op in terms of NEON/x86, so for SVE we have two extra Zip. Does that work for you?
@jan-wassenberg yes, that is perfect! I've added a comment on the PR requesting support for 8-bit and 16-bit elements as well.