sleef Provide APIs for static dispatch

In many cases it is best to skip the dynamic dispatch and instead just call a function chosen at compile time.

For example, sometimes the code is compiled for a specific machine (often the one it's being compiled on) and the compiled code will never be distributed to other machines.

Other times people will already be doing dynamic dispatch higher up in the call stack, and SLEEF's dynamic dispatch is just wasted cycles.

AFAICT, currently the only way to accommodate this is to have code like

#if defined(__AVX2__)
  res = Sleef_acosf4_u35avx2128(value);
#elif defined(__SSE4_1__)
  res = Sleef_acosf4_u35sse4(value);
#elif defined(__SSE2__)
  res = Sleef_acosf4_u35sse2(value);
#endif

This is especially horrible for code which is meant to run on multiple architectures. For example, here is some code I just wrote:

    #if defined(SIMDE_MATH_SLEEF_ENABLE) && defined(SIMDE_X86_AVX2_NATIVE)
      r_.n = SIMDE_X86_SVML_SLEEF_LIBM(acosf4,avx2128)(a);
    #elif defined(SIMDE_MATH_SLEEF_ENABLE) && defined(SIMDE_X86_SSE4_1_NATIVE)
      r_.n = SIMDE_X86_SVML_SLEEF_LIBM(acosf4,sse4)(a);
    #elif defined(SIMDE_MATH_SLEEF_ENABLE) && defined(SIMDE_X86_SSE2_NATIVE)
      r_.n = SIMDE_X86_SVML_SLEEF_LIBM(acosf4,sse2)(a);
    #elif defined(SIMDE_MATH_SLEEF_ENABLE) && defined(SIMDE_ARM_NEON_A32V7_NATIVE) && (__ARM_NEON_FP >= 6)
      r_.neon_f32 = SIMDE_X86_SVML_SLEEF_LIBM(acosf4,neonvfpv4)(a);
    #elif defined(SIMDE_MATH_SLEEF_ENABLE) && defined(SIMDE_ARM_NEON_A32V7_NATIVE)
      r_.neon_f32 = SIMDE_X86_SVML_SLEEF_LIBM(acosf4,neon)(a);
    #elif defined(SIMDE_MATH_SLEEF_ENABLE) && defined(SIMDE_ARM_NEON_A64V8_NATIVE) && defined(__ARM_FEATURE_FMA)
      r_.neon_f32 = SIMDE_X86_SVML_SLEEF_LIBM(acosf4,advsimd)(a);
    #elif defined(SIMDE_MATH_SLEEF_ENABLE) && defined(SIMDE_ARM_NEON_A64V8_NATIVE)
      r_.neon_f32 = SIMDE_X86_SVML_SLEEF_LIBM(acosf4,advsimdnofma)(a);
    #elif defined(SIMDE_MATH_SLEEF_ENABLE) && defined(SIMDE_POWER_ALTIVEC_P7) && defined(__FP_FAST_FMAF128)
      r_.altivec_f32 = SIMDE_X86_SVML_SLEEF_LIBM(acosf4,vsx)(a);
    #elif defined(SIMDE_MATH_SLEEF_ENABLE) && defined(SIMDE_POWER_ALTIVEC_P7)
      r_.altivec_f32 = SIMDE_X86_SVML_SLEEF_LIBM(acosf4,vsxnofma)(a);
    #else
      SIMDE_VECTORIZE
      for (size_t i = 0 ; i < (sizeof(r_.f32) / sizeof(r_.f32[0])) ; i++) {
        r_.f32[i] = simde_math_acosf(a_.f32[i]);
      }
    #endif

(The SIMDE_X86_SVML_SLEEF_LIBM macro switches between u10 and u35 depending on another macro; the other macros are probably obvious.)
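
For readers unfamiliar with this kind of macro, here is a minimal sketch of how a name-building macro like that could work using token pasting. This is not SIMDe's actual definition: `DEMO_USE_U10`, `DEMO_PRECISION`, `DEMO_SLEEF_LIBM` and the helper functions are all hypothetical names made up for illustration. The expanded name is stringified so the result can be inspected without linking against SLEEF.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical switch (not SIMDe's real one): define DEMO_USE_U10 to
   request the 1.0 ULP variants instead of the 3.5 ULP ones. */
#if defined(DEMO_USE_U10)
  #define DEMO_PRECISION u10
#else
  #define DEMO_PRECISION u35
#endif

/* Two-level paste so DEMO_PRECISION is expanded before ## is applied. */
#define DEMO_PASTE5_(a, b, c, d, e) a##b##c##d##e
#define DEMO_PASTE5(a, b, c, d, e) DEMO_PASTE5_(a, b, c, d, e)

/* DEMO_SLEEF_LIBM(acosf4, sse2) names Sleef_acosf4_u35sse2 by default. */
#define DEMO_SLEEF_LIBM(func, ext) \
  DEMO_PASTE5(Sleef_, func, _, DEMO_PRECISION, ext)

/* Stringify the expanded name so it can be checked without SLEEF. */
#define DEMO_STR_(x) #x
#define DEMO_STR(x) DEMO_STR_(x)

/* Returns the function name the macro produces for acosf4 on SSE2. */
const char *demo_acosf4_sse2_name(void) {
  return DEMO_STR(DEMO_SLEEF_LIBM(acosf4, sse2));
}

/* Returns 1 if the generated name equals `expected`, 0 otherwise. */
int demo_name_matches(const char *expected) {
  assert(expected != NULL);
  return strcmp(demo_acosf4_sse2_name(), expected) == 0;
}
```

In real code the macro result would be called directly, as in `DEMO_SLEEF_LIBM(acosf4, sse2)(a)`, which is the shape used in the snippet above.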

It would be nice if SLEEF provided macros for this, something like

#if defined(__AVX2__)
  #define Sleef_acosf4_u35static Sleef_acosf4_u35avx2128
#elif defined(__SSE4_1__)
  #define Sleef_acosf4_u35static Sleef_acosf4_u35sse4
#elif defined(__SSE2__)
  #define Sleef_acosf4_u35static Sleef_acosf4_u35sse2
#endif

That way people could just call Sleef_acosf4_u35static, similarly to how you can call Sleef_acosf4_u35 for the dynamic dispatch version, but without the unnecessary overhead.
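
One detail the macros above leave open is what happens when none of the feature macros is defined: the `static` name would not exist at all and the call would fail to compile. A rough sketch of the same pattern with a final `#else` branch follows. The `demo_*` functions are hypothetical stand-ins that only report which branch was picked; they are not SLEEF functions, and whether a fallback should point at the dispatcher, a scalar version, or nothing at all is a design choice for SLEEF to make.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-ins for per-extension kernels; each reports its name. */
const char *demo_acos_avx2(void)    { return "avx2"; }
const char *demo_acos_sse4(void)    { return "sse4"; }
const char *demo_acos_sse2(void)    { return "sse2"; }
const char *demo_acos_generic(void) { return "generic"; }

/* Most specific extension first; the #else keeps the name defined on
   targets where none of the x86 macros is set. */
#if defined(__AVX2__)
  #define demo_acos_static demo_acos_avx2
#elif defined(__SSE4_1__)
  #define demo_acos_static demo_acos_sse4
#elif defined(__SSE2__)
  #define demo_acos_static demo_acos_sse2
#else
  #define demo_acos_static demo_acos_generic
#endif

/* The preprocessor has already picked the callee, so this is a plain
   direct call with no runtime CPU check. */
const char *demo_selected_backend(void) {
  return demo_acos_static();
}

/* Returns 1 if `name` is one of the backends defined above. */
int demo_is_known_backend(const char *name) {
  assert(name != NULL);
  return strcmp(name, "avx2") == 0 || strcmp(name, "sse4") == 0 ||
         strcmp(name, "sse2") == 0 || strcmp(name, "generic") == 0;
}
```

Which backend `demo_selected_backend()` reports depends entirely on the flags the file is compiled with (for example `-mavx2` or `-march=native` on GCC and Clang).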

FWIW, if this sounds like something you'd be interested in I'd probably be willing to supply a patch, though a bit of guidance on how to make it work would be appreciated since I'm not really familiar with the SLEEF internals.

Jul 22 '20 04:07 nemequ

I was thinking about the same thing. SLEEF needs a flexible mechanism to manage the relationship between the baseline and the dispatchable CPU features, which would also simplify the runtime dispatching. Something similar to what OpenCV/NumPy have, but that will require a lot of work (^_^).

Jul 22 '20 12:07 seiko2plus

Regardless of what SLEEF does on the dynamic dispatch side, I think something like this will be necessary for the static dispatch part. I want something that happens 100% at compile time.

Jul 22 '20 18:07 nemequ

@shibatch, sorry to pressure you like this, but…

We have a Google Summer of Code student (@himanshi18037) who is interested in working on this. She is the one who added the SVML implementation to SIMDe, and has been working on adding a SLEEF implementation to it which currently uses the dynamic dispatch functions.

GSoC doesn't last much longer, so she would need to start on this ASAP. If you're interested in this idea, please let me know. I should be able to help her with most of the code so there shouldn't be too much for you to do, but I'd really appreciate it if you could provide a basic outline of what you'd like to see the patch look like (mostly just where the code should go, a basic idea of how to structure it, and how to get it into sleef.h).

Jul 26 '20 17:07 nemequ

@nemequ Hello,

I am interested in that idea, but planning may take some time and experiments. I think it is not so straightforward, since SLEEF now has a lot of functionality.

Jul 28 '20 00:07 shibatch

I'm trying to see how the functions are generated; I see the rename.h header for the dynamic dispatch versions, but where do the names like Sleef_acosf4_u35sse2 get generated? It would be great if we could automate this…

Another option might be a small script to parse the header. The function naming convention is consistent, so it wouldn't be too hard to parse the names for each variant to figure out what the correct preprocessor flags to test would be. We could use that to generate a second header (sleef-static.h maybe?), or possibly just append to the existing header.
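
A rough sketch of that parsing step, written as a plain C function rather than in a scripting language. It only knows the three x86 suffixes and compiler macros that appear in the `#if` chains earlier in this thread; the `demo_*` names and the `static` suffix on the generated macro are assumptions for illustration, not anything SLEEF provides today.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Suffix -> compiler macro, copied from the #if chain in this thread
   (x86 only; a real tool would need the full list of extensions). */
struct demo_ext { const char *suffix; const char *macro; };
static const struct demo_ext demo_exts[] = {
  { "avx2128", "__AVX2__"   },
  { "sse4",    "__SSE4_1__" },
  { "sse2",    "__SSE2__"   },
};

/* For a name such as "Sleef_acosf4_u35sse2", writes
   "#define Sleef_acosf4_u35static Sleef_acosf4_u35sse2" into `out` and
   returns the macro that should guard it. Returns NULL (leaving `out`
   untouched) if the name does not end in a known suffix. */
const char *demo_emit_define(const char *name, char *out, size_t outlen) {
  assert(name != NULL && out != NULL);
  size_t nlen = strlen(name);
  for (size_t i = 0; i < sizeof(demo_exts) / sizeof(demo_exts[0]); i++) {
    size_t slen = strlen(demo_exts[i].suffix);
    if (nlen > slen &&
        strcmp(name + nlen - slen, demo_exts[i].suffix) == 0) {
      /* %.*s prints the name with the extension suffix cut off. */
      snprintf(out, outlen, "#define %.*sstatic %s",
               (int)(nlen - slen), name, name);
      return demo_exts[i].macro;
    }
  }
  return NULL;
}
```

A generator built on this would still have to group the variants of each function and emit them in most-specific-first order inside one `#if`/`#elif` chain; that part is left out here.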

Jul 28 '20 20:07 nemequ

mkrename.c generates header files for renaming functions with each vector extension. This process is (kind of) automated.

Before introducing scripts, please consider how long that scripting language will be supported. I don't want the scripts to become obsolete, which is why I have preferred generating the many files with plain C programs.

Jul 28 '20 23:07 shibatch