simde icon indicating copy to clipboard operation
simde copied to clipboard

simde_mm256_fmadd_pd as two 128 bit FMA operations?

Open AlexK-BD opened this issue 1 year ago • 4 comments

simde_mm256_fmadd_pd is defined as follows:

simde_mm256_fmadd_pd (simde__m256d a, simde__m256d b, simde__m256d c) {
  #if defined(SIMDE_X86_FMA_NATIVE)
    return _mm256_fmadd_pd(a, b, c);
  #else
    return simde_mm256_add_pd(simde_mm256_mul_pd(a, b), c);
  #endif
}

When building for a target that doesn't have native 256 bit FMA support, why not use two 128 bit FMA operations on the two halves of the input?

If that's possible, I would be happy to attempt a patch adding that support. I wanted to check if there's some behavioral reason that two Neon 128 bit FMA operations wouldn't be appropriate here.

AlexK-BD avatar Jul 16 '24 14:07 AlexK-BD

When building for a target that doesn't have native 256 bit FMA support, why not use two 128 bit FMA operations on the two halves of the input?

When in doubt, check the compiler output, yeah?

And then double check the timings for the 128 bit FMA operations versus the alternatives.

Yes, an investigation into this is welcome!

mr-c avatar Jul 16 '24 14:07 mr-c

I investigated further and found that fnmadd was producing some pretty bad disassembly; PR here #1197

fmadd seems to compile reasonably with -O2 on gcc 10, as-is. I haven't checked fmsub or fnmsub or any single precision variants yet.

AlexK-BD avatar Jul 16 '24 16:07 AlexK-BD

I wanted to add an additional comment here that I've run into some additional issues handling FMAs, specifically on the Windows/MSVC platform and compiling AVX2+ code down to SSE.

The various fallbacks in fma.h vary, but they mostly try to preserve using an FMA op if possible, which makes sense when porting from AVX+ level x86 to neon/webassembly/etc. On MSVC in particular this leads to really bad codegen however, where a single simde__m256 leads to scalar splay-out and individually running each scalar.

When porting from AVX+ (which implies FMA on x86) to SSE (which does not), the primary fallback should crack the FMA apart into two 128bit FMAs, which then should crack apart into mul+add. I've performed this fixup locally for my purposes, and I'd like to contribute this work back if adding fallbacks like this are kosher for the project.

Remnant44 avatar Sep 12 '24 20:09 Remnant44

@Remnant44 , thank you for investigating. Yes, that contribution would be welcome!

mr-c avatar Sep 13 '24 06:09 mr-c

This appears to have been fixed in (https://github.com/simd-everywhere/simde/pull/1197) https://github.com/simd-everywhere/simde/commit/bd0532032487df1ea501ae8dc3f8e32abd5309b1 by @AlexK-BD ; thanks again!

mr-c avatar Feb 01 '25 17:02 mr-c