simde
simde copied to clipboard
simde_mm256_fmadd_pd as two 128 bit FMA operations?
simde_mm256_fmadd_pd is defined as follows:
simde_mm256_fmadd_pd (simde__m256d a, simde__m256d b, simde__m256d c) {
#if defined(SIMDE_X86_FMA_NATIVE)
return _mm256_fmadd_pd(a, b, c);
#else
return simde_mm256_add_pd(simde_mm256_mul_pd(a, b), c);
#endif
}
When building for a target that doesn't have native 256 bit FMA support, why not use two 128 bit FMA operations on the two halves of the input?
If that's possible, I would be happy to attempt a patch adding that support. I wanted to check if there's some behavioral reason that two Neon 128 bit FMA operations wouldn't be appropriate here.
When building for a target that doesn't have native 256 bit FMA support, why not use two 128 bit FMA operations on the two halves of the input?
When in doubt, check the compiler output, yeah?
And then double check the timings for the 128 bit FMA operations versus the alternatives.
Yes, an investigation into this is welcome!
I investigated further and found that fnmadd was producing some pretty bad disassembly; PR here #1197
fmadd seems to compile reasonably with -O2 on gcc 10, as-is. I haven't checked fmsub or fnmsub or any single precision variants yet.
I wanted to add an additional comment here that I've run into some additional issues handling FMAs, specifically on the Windows/MSVC platform and compiling AVX2+ code down to SSE.
The various fallbacks in fma.h vary, but they mostly try to preserve using an FMA op if possible, which makes sense when porting from AVX+ level x86 to neon/webassembly/etc. On MSVC in particular this leads to really bad codegen however, where a single simde__m256 leads to scalar splay-out and individually running each scalar.
When porting from AVX+ (which implies FMA on x86) to SSE (which does not), the primary fallback should crack the FMA apart into two 128bit FMAs, which then should crack apart into mul+add. I've performed this fixup locally for my purposes, and I'd like to contribute this work back if adding fallbacks like this are kosher for the project.
@Remnant44 , thank you for investigating. Yes, that contribution would be welcome!
This appears to have been fixed in (https://github.com/simd-everywhere/simde/pull/1197) https://github.com/simd-everywhere/simde/commit/bd0532032487df1ea501ae8dc3f8e32abd5309b1 by @AlexK-BD ; thanks again!