highway icon indicating copy to clipboard operation
highway copied to clipboard

Support for saturating doubling multiply add

Open Ryo-not-rio opened this issue 10 months ago • 9 comments

There is a saturating doubling multiply add instructions in NEON and SVE - vqdmla, svqdmla which are equivalent to hn::SaturatedAdd(hn::Mul(2, hn::Mul(a_real, b_imag)),hn::Mul(2, hn::Mul(a_imag, b_real)));. In testing, we have seen a significant performance decrease using highway vs NEON and highway vs SVE. I was wondering if it would be possible to add a similar instruction to highway so we are able to leverage these qdmla instructions.

Ryo-not-rio avatar Apr 03 '24 13:04 Ryo-not-rio

It sounds reasonable to want access to this instruction. I'm curious about the purpose of the extra 2x mul in their definition?

Because NEON differs from SVE in that it takes the upper/lower half, vs even/odd lanes, I think we'd want to define the op like ReorderWidenMulAccumulate, which is allowed to return any lane order. This is fine for dot products but I wonder if it's still OK for your use case, or whether we have some other idea how to bridge the gap between NEON and SVE?

jan-wassenberg avatar Apr 03 '24 14:04 jan-wassenberg

The use case we've tested out is a vector multiplication so the order needs to be preserved. You raise a good point on the difference between NEON and SVE, we could potentially return two vectors - the lower and upper half of the result - this sounds like a good compromise since I'd imagine you'd need the whole vector for most use cases anyway

Ryo-not-rio avatar Apr 03 '24 15:04 Ryo-not-rio

Makes sense. ReorderWidenMulAccumulate also returns a second value using an output param, so there is precedent. So the proposed op would call both vqdmlal and vqdmlal_high on NEON, and svqdmlalb + svqdmlalt on SVE2?

I'm still curious why the 2x factor in the instruction, is it also for Q1.x fixed-point? (similar to MulFixedPoint15)

jan-wassenberg avatar Apr 03 '24 15:04 jan-wassenberg

So the proposed op would call both vqdmlal and vqdmlal_high on NEON, and svqdmlalb + svqdmlalt on SVE2?

yes that is correct, I'm still trying to find out why there is a 2x factor myself, I'll give an update once I have an answer

Ryo-not-rio avatar Apr 03 '24 15:04 Ryo-not-rio

So it looks like the doubling is just a side effect of the way the multiplication is done. The multiplication is done in Q0.15 format which is saturated to Q0.31 format and this is equivalent to doubling saturating multiply-add of two 16 bit integers

Ryo-not-rio avatar Apr 03 '24 16:04 Ryo-not-rio

Got it, thanks. In that case adding FixedPoint to the op name may be helpful.

jan-wassenberg avatar Apr 03 '24 19:04 jan-wassenberg

AVX3_DL is also capable of carrying out the saturated doubling multiply add using the _mm*_dpwssds_epi32 intrinsics.

Here is how the vqdmla op can be implemented on AVX3_DL for I32 vectors that have 1 to 4 lanes:

template <class VI32, HWY_IF_V_SIZE_LE_V(VI32, 16)>
HWY_API VI32 Vqdmla(VI32 a, Vec<Rebind<int16_t, DFromV<VI32>>> b, Vec<Rebind<int16_t, DFromV<VI32>>> c) {
  const Rebind<int16_t, DFromV<VI32>> di16;
  const Twice<decltype(di16)> dt_i16;

  const auto dup_b = InterleaveWholeLower(dt_i16, ResizeBitCast(dt_i16, b), ResizeBitCast(dt_i16, b));
  const auto dup_c = InterleaveWholeLower(dt_i16, ResizeBitCast(dt_i16, c), ResizeBitCast(dt_i16, c));

  return VI32{_mm_dpwssds_epi32(a.raw, dup_b.raw, dup_c.raw)};
}

The vqdmla op can be carried out on PPC8/PPC9/PPC10 using vec_msums. Here is how the vqdmla op can be implemented on PPC8/PPC9/PPC10:

template <class VI32, HWY_IF_V_SIZE_LE_V(VI32, 16)>
HWY_API VI32 Vqdmla(VI32 a, Vec<Rebind<int16_t, DFromV<VI32>>> b, Vec<Rebind<int16_t, DFromV<VI32>>> c) {
  const Rebind<int16_t, DFromV<VI32>> di16;
  const Twice<decltype(di16)> dt_i16;

  const auto dup_b = InterleaveWholeLower(dt_i16, ResizeBitCast(dt_i16, b), ResizeBitCast(dt_i16, b));
  const auto dup_c = InterleaveWholeLower(dt_i16, ResizeBitCast(dt_i16, c), ResizeBitCast(dt_i16, c));

  return VI32{vec_msums(dup_b.raw, dup_c.raw, a.raw)};
}

johnplatts avatar Apr 03 '24 20:04 johnplatts

Very nice! @Ryo-not-rio FYI John's solution defines the op in terms of NEON/x86, so for SVE we have two extra Zip. Does that work for you?

jan-wassenberg avatar Apr 04 '24 08:04 jan-wassenberg

@jan-wassenberg yes, that is perfect! I've added a comment on the PR requesting support for 8 bit and 16 bit elements as well

Ryo-not-rio avatar Apr 05 '24 08:04 Ryo-not-rio