cglm

Improve glm_quat_conjugate

Open gottfriedleibniz opened this issue 2 years ago • 3 comments

The current implementation of quat_conjugate is quite slow when compiled with SSE. For reference, here is clang's output.

Included in this link are alternate implementations, one of which can be easily extended to WASM and Neon, e.g.,

  float32x4_t mask = glmm_float32x4_init(-1.0f, -1.0f, -1.0f, 1.0f);
  glmm_store(dest, vmulq_f32(glmm_load(q), mask));

gottfriedleibniz avatar Aug 04 '23 15:08 gottfriedleibniz

@gottfriedleibniz nice suggestion, thanks.

To avoid the mul overhead (in case there is no special optimization for multiplying by -1), it would be nice to do this without a mul, as in your godbolt implementations:

extern
void glm_quat_conjugate_simd(versor q, versor dest) {
#if 0
  __m128i mask = _mm_set_epi32(0, GLMM_NEGZEROf, GLMM_NEGZEROf, GLMM_NEGZEROf);
  glmm_store(dest, _mm_xor_ps(glmm_load(q), _mm_castsi128_ps(mask)));
#else
  __m128 mask = _mm_set_ps(1.0f, -1.0f, -1.0f, -1.0f);
  glmm_store(dest, _mm_mul_ps(glmm_load(q), mask));
#endif
}
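
The `#if 0` branch above relies on the fact that XOR-ing an IEEE-754 float with `0x80000000` flips only its sign bit, which for ordinary values matches multiplying by -1. A minimal scalar sketch of that idea, assuming a 32-bit IEEE-754 `float`; `xor_sign` and `quat_conjugate_xor` are hypothetical names used only for illustration and are not part of cglm:

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not a cglm function): XOR the bit pattern of x
   with mask. 0x80000000u flips the sign bit, 0u leaves x unchanged.
   Assumes float is a 32-bit IEEE-754 value. */
static float xor_sign(float x, uint32_t mask) {
  uint32_t bits;
  memcpy(&bits, &x, sizeof bits);  /* copy bits out without aliasing issues */
  bits ^= mask;
  memcpy(&x, &bits, sizeof x);
  return x;
}

/* Scalar model of the XOR path: negate x, y, z and keep w. */
static void quat_conjugate_xor(const float q[4], float dest[4]) {
  static const uint32_t mask[4] = {
    0x80000000u, 0x80000000u, 0x80000000u, 0u
  };
  for (int i = 0; i < 4; i++)
    dest[i] = xor_sign(q[i], mask[i]);
}
```

The SIMD versions do the same thing on all four lanes at once, so no floating-point multiply is issued.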

With GLMM__SIGNMASKf or glmm_float32x4_SIGNMASK_NNNP defined for SSE, NEON and WASM, we could write:

CGLM_INLINE
void
glm_quat_conjugate(versor q, versor dest) {
#if defined(CGLM_SIMD)
  glmm_store(dest, glmm_xor(glmm_load(q), glmm_float32x4_SIGNMASK_NNNP));
#else
  dest[0] = -q[0];
  dest[1] = -q[1];
  dest[2] = -q[2];
  dest[3] =  q[3];
#endif
}

Currently there is no glmm_xor for WASM; improving the glmm_ API to include it would make things easier.
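
As a rough sketch of what a WASM XOR wrapper might look like, assuming the `wasm_simd128.h` intrinsics `wasm_v128_xor`, `wasm_v128_load`, `wasm_v128_store` and `wasm_f32x4_const`: the names `sketch_glmm_xor` and `sketch_quat_conjugate` are hypothetical and are not existing cglm identifiers. A scalar fallback is included so the snippet also builds without WASM SIMD.

```c
#if defined(__wasm_simd128__)
#include <wasm_simd128.h>

/* Hypothetical wrapper: what a glmm_xor for WASM might map to. */
#define sketch_glmm_xor(a, b) wasm_v128_xor((a), (b))

static void sketch_quat_conjugate(const float q[4], float dest[4]) {
  /* -0.0f has only the sign bit set, so XOR with it flips the sign;
     the 0.0f lane leaves w untouched. */
  const v128_t mask = wasm_f32x4_const(-0.0f, -0.0f, -0.0f, 0.0f);
  wasm_v128_store(dest, sketch_glmm_xor(wasm_v128_load(q), mask));
}
#else
/* Scalar fallback for builds without WASM SIMD. */
static void sketch_quat_conjugate(const float q[4], float dest[4]) {
  dest[0] = -q[0];
  dest[1] = -q[1];
  dest[2] = -q[2];
  dest[3] =  q[3];
}
#endif
```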

recp avatar Aug 05 '23 09:08 recp

Seems good.

Although, I wouldn't be surprised if the scalar equivalent is faster or as-fast on ARM64 (and maybe ARMv7). Some timing here would be nice, but can be done later.

gottfriedleibniz avatar Aug 05 '23 12:08 gottfriedleibniz

Thanks,

> Although, I wouldn't be surprised if the scalar equivalent is faster or as-fast on ARM64 (and maybe ARMv7). Some timing here would be nice, but can be done later.

Sure, SIMD can be skipped if there is no benefit on ARM (or maybe on other platforms too). As you said, a benchmark could be done ASAP.

recp avatar Aug 05 '23 15:08 recp
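
The timing comparison discussed in this thread could be sketched roughly as below. This is only an illustration under stated assumptions: `conj_scalar`, `conj_xor` and `time_conj` are hypothetical names, the XOR variant is a scalar stand-in for the SIMD path, and `clock()` plus a function-pointer call is a coarse measurement. A real benchmark would call cglm's own functions on the target hardware with a proper harness.

```c
#include <stdint.h>
#include <string.h>
#include <time.h>

/* Scalar reference, same logic as the non-SIMD branch in the thread. */
static void conj_scalar(const float *q, float *dest) {
  dest[0] = -q[0];
  dest[1] = -q[1];
  dest[2] = -q[2];
  dest[3] =  q[3];
}

/* Sign-bit XOR variant (scalar stand-in for the SIMD path).
   Assumes float is a 32-bit IEEE-754 value. */
static void conj_xor(const float *q, float *dest) {
  static const uint32_t mask[4] = {
    0x80000000u, 0x80000000u, 0x80000000u, 0u
  };
  uint32_t bits[4];
  memcpy(bits, q, sizeof bits);
  for (int i = 0; i < 4; i++)
    bits[i] ^= mask[i];
  memcpy(dest, bits, sizeof bits);
}

typedef void (*conj_fn)(const float *q, float *dest);

/* Runs fn iters times and returns elapsed CPU seconds. Each result is
   fed back as the next input and summed into *sink so the compiler
   cannot discard the loop. */
static double time_conj(conj_fn fn, long iters, float *sink) {
  float q[4] = {0.1f, 0.2f, 0.3f, 0.9f};
  float d[4];
  clock_t start = clock();
  for (long i = 0; i < iters; i++) {
    fn(q, d);
    memcpy(q, d, sizeof q);
  }
  clock_t end = clock();
  *sink = q[0] + q[1] + q[2] + q[3];
  return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Calling `time_conj(conj_scalar, n, &sink)` and `time_conj(conj_xor, n, &sink)` with the same `n` on each target gives a first rough comparison; the absolute numbers depend heavily on compiler flags and inlining.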