
Share code between implementations

easyaspi314 opened this issue 1 year ago

One thing about SIMD implementations is that there is often a direct equivalent of each intrinsic: basically every ISA has add, sub, shift right, and so on.

Therefore, there could be a "common" folder containing the more common intrinsics. Then we can just reuse it in the platform-specific intrinsic polyfills and avoid copy-paste errors and missed optimizations.
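To make the idea concrete, here is a minimal sketch of what that reuse could look like: the platform-specific polyfill (e.g. for SSE2's `_mm_add_epi32`) becomes a thin forwarder to a shared common-layer function. All of the names and types below are hypothetical stand-ins, not SIMDe's actual API:

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical common-layer vector type (scalar fallback representation). */
typedef struct { int32_t values[4]; } simde_common_int32x4_t;

/* Shared implementation, written once in the "common" folder. */
static simde_common_int32x4_t
simde_common_add_s32x4(simde_common_int32x4_t a, simde_common_int32x4_t b) {
  simde_common_int32x4_t r;
  for (size_t i = 0; i < 4; i++) r.values[i] = a.values[i] + b.values[i];
  return r;
}

/* The platform polyfill is then just a thin wrapper over the common
 * function, instead of duplicating the fallback loop per ISA. */
static simde_common_int32x4_t
simde_mm_add_epi32_sketch(simde_common_int32x4_t a, simde_common_int32x4_t b) {
  return simde_common_add_s32x4(a, b);
}
```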

I would just use an extension of NEON types since NEON has the strongest type system.
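The appeal of NEON-style typing is that element width and lane count live in the type itself (unlike x86's untyped `__m128i`), so mismatched operands are a compile error rather than a silent reinterpretation. A minimal sketch of that idea, using hypothetical wrapper names:

```c
#include <stdint.h>
#include <stddef.h>

/* Distinct wrapper types per element width, NEON-style. The compiler
 * rejects passing an s16x8 where an s32x4 is expected. Names are
 * illustrative, not SIMDe's actual types. */
typedef struct { int32_t values[4]; } simde_int32x4_sketch_t;
typedef struct { int16_t values[8]; } simde_int16x8_sketch_t;

static simde_int32x4_sketch_t
simde_add_s32x4_sketch(simde_int32x4_sketch_t a, simde_int32x4_sketch_t b) {
  simde_int32x4_sketch_t r;
  for (size_t i = 0; i < 4; i++) r.values[i] = a.values[i] + b.values[i];
  return r;
}

static simde_int16x8_sketch_t
simde_add_s16x8_sketch(simde_int16x8_sketch_t a, simde_int16x8_sketch_t b) {
  simde_int16x8_sketch_t r;
  for (size_t i = 0; i < 8; i++)
    r.values[i] = (int16_t) (a.values[i] + b.values[i]);
  return r;
}
```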

This also lets us elegantly handle differing native vector sizes by divide and conquer:

  1. If a native vector size matches the intrinsic's vector size, use the native intrinsics.
  2. If a native vector size is smaller than the intrinsic's vector size (or we are scalar-only), split in half and use the next size down. This already seems to be done in AVX.
  3. If the minimum native vector size is larger than the intrinsic's vector size, widen, use the native intrinsic size, then narrow. This only seems to be necessary for 64-bit vectors. See #1025 for some research on this logic.
// Basic element specific scalar code
SIMDE_FUNCTION_ATTRIBUTES
int32_t simde_add_s32(int32_t a, int32_t b) {
  return a + b;
}
/* forward declare */
SIMDE_FUNCTION_ATTRIBUTES
simde_int32x4_t simde_add_s32x4(simde_int32x4_t a, simde_int32x4_t b);

SIMDE_FUNCTION_ATTRIBUTES
simde_int32x2_t simde_add_s32x2(simde_int32x2_t a, simde_int32x2_t b) {
  #if SIMDE_MIN_VECTOR_SIZE_GE(128)
     // see #1025 
     return simde_fast_narrow_s32x4(
       simde_add_s32x4(
         simde_fast_widen_s32x2(a),
         simde_fast_widen_s32x2(b)
      )
    );
  #else 
    simde_int32x2_private a_ = simde_int32x2_to_private(a);
    simde_int32x2_private b_ = simde_int32x2_to_private(b);
    simde_int32x2_private r_;
    #if defined(SIMDE_ARM_NEON_A32V7_NATIVE)
      r_.neon_i32 = vadd_s32(a_.neon_i32, b_.neon_i32);
    #else 
      SIMDE_VECTORIZE
      for (size_t i = 0; i < sizeof(a_.values) / sizeof(a_.values[0]); i++) {
        r_.values[i] = simde_add_s32(a_.values[i], b_.values[i]);
      }
    #endif 
    return simde_int32x2_from_private(r_);
  #endif
}

SIMDE_FUNCTION_ATTRIBUTES
simde_int32x4_t simde_add_s32x4(simde_int32x4_t a, simde_int32x4_t b) {
    simde_int32x4_private a_ = simde_int32x4_to_private(a);
    simde_int32x4_private b_ = simde_int32x4_to_private(b);
    simde_int32x4_private r_;

    #if SIMDE_MIN_VECTOR_SIZE_LT(128)
      r_.s32x2[0] = simde_add_s32x2(a_.s32x2[0], b_.s32x2[0]);
      r_.s32x2[1] = simde_add_s32x2(a_.s32x2[1], b_.s32x2[1]);
    #else
      // all the 128-bit vector stuff here
    #endif

    return simde_int32x4_from_private(r_);
}

// repeat for s32x8, s32x16
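The wider sizes would follow case 2 of the divide-and-conquer scheme: with no native 256-bit vectors available, s32x8 splits into two s32x4 halves and recurses one size down. A self-contained sketch, with all names hypothetical:

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical common-layer types (scalar fallback representation). */
typedef struct { int32_t values[4]; } simde_common_int32x4_t;
typedef struct { simde_common_int32x4_t halves[2]; } simde_common_int32x8_t;

static simde_common_int32x4_t
simde_common_add_s32x4(simde_common_int32x4_t a, simde_common_int32x4_t b) {
  simde_common_int32x4_t r;
  for (size_t i = 0; i < 4; i++) r.values[i] = a.values[i] + b.values[i];
  return r;
}

/* Case 2: the 256-bit op is just two 128-bit ops, written once. */
static simde_common_int32x8_t
simde_common_add_s32x8(simde_common_int32x8_t a, simde_common_int32x8_t b) {
  simde_common_int32x8_t r;
  r.halves[0] = simde_common_add_s32x4(a.halves[0], b.halves[0]);
  r.halves[1] = simde_common_add_s32x4(a.halves[1], b.halves[1]);
  return r;
}
```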

I am mostly proposing this because removing MMX will need a massive rewrite anyway, so if any large changes are to be made, this would be the best time to do it, and we might as well reap the benefits of widening 64-bit vectors on all 128-bit-only platforms.

Obviously this can be a gradual change.

easyaspi314 avatar Jun 17 '23 22:06 easyaspi314