cglm
Improve glm_quat_conjugate
The current implementation of quat_conjugate is quite slow when compiled with SSE. For reference, here is clang's output.
Included in this link are alternate implementations, one of which can be easily extended to WASM and Neon, e.g.,
float32x4_t mask = glmm_float32x4_init(-1.0f, -1.0f, -1.0f, 1.0f);
glmm_store(dest, vmulq_f32(glmm_load(q), mask));
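For reference, a rough WASM counterpart of the same approach (a sketch only, assuming wasm_simd128.h together with cglm's versor type and the glmm_load/glmm_store wrappers of the WASM backend) could be:

#include <wasm_simd128.h>

/* sketch: multiply x, y, z by -1 and keep w, mirroring the NEON snippet above */
void quat_conjugate_wasm(versor q, versor dest) {
  v128_t mask = wasm_f32x4_make(-1.0f, -1.0f, -1.0f, 1.0f);
  glmm_store(dest, wasm_f32x4_mul(glmm_load(q), mask));
}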
@gottfriedleibniz nice suggestion, thanks.
To avoid the mul overhead (if there is no special optimization for -1), it would be nice to do this without mul, as in your implementations on godbolt:
extern
void glm_quat_conjugate_simd(versor q, versor dest) {
#if 0
  __m128i mask = _mm_set_epi32(0, GLMM_NEGZEROf, GLMM_NEGZEROf, GLMM_NEGZEROf);
  glmm_store(dest, _mm_xor_ps(glmm_load(q), _mm_castsi128_ps(mask)));
#else
  __m128 mask = _mm_set_ps(1.0f, -1.0f, -1.0f, -1.0f);
  glmm_store(dest, _mm_mul_ps(glmm_load(q), mask));
#endif
}
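The xor path works because negating a float only flips its sign bit, so a single bitwise xor with 0x80000000 in the x, y, z lanes does the whole job. An equivalent NEON sketch (illustrative names, not existing cglm API) would be:

#include <stdint.h>
#include <arm_neon.h>

/* sketch: flip the sign bits of x, y, z via xor; leave w untouched */
void quat_conjugate_neon_xor(const float *q, float *dest) {
  static const uint32_t signmask[4] = { 0x80000000u, 0x80000000u, 0x80000000u, 0u };
  uint32x4_t  mask = vld1q_u32(signmask);
  float32x4_t v    = vld1q_f32(q);
  vst1q_f32(dest, vreinterpretq_f32_u32(veorq_u32(vreinterpretq_u32_f32(v), mask)));
}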
By defining GLMM__SIGNMASKf or glmm_float32x4_SIGNMASK_NNNP for SSE, NEON and WASM, we could write it as:
CGLM_INLINE
void
glm_quat_conjugate(versor q, versor dest) {
#if defined(CGLM_SIMD)
  glmm_store(dest, glmm_xor(glmm_load(q), glmm_float32x4_SIGNMASK_NNNP));
#else
  dest[0] = -q[0];
  dest[1] = -q[1];
  dest[2] = -q[2];
  dest[3] = q[3];
#endif
}
Currently there is no glmm_xor for WASM; improving the glmm_ API would make things like this easier.
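As a possible direction (a sketch only, not existing cglm API), a WASM glmm_xor could simply wrap wasm_v128_xor, and the NNNP sign mask could be built the same way as the SSE one:

#include <stdint.h>
#include <wasm_simd128.h>

/* hypothetical additions to the WASM backend */
#define glmm_xor(a, b) wasm_v128_xor(a, b)

/* sign bit set on the x, y, z lanes, clear on w (Neg, Neg, Neg, Pos) */
#define glmm_float32x4_SIGNMASK_NNNP                                         \
  wasm_i32x4_make((int32_t)0x80000000u, (int32_t)0x80000000u,                \
                  (int32_t)0x80000000u, 0)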
Seems good.
Although, I wouldn't be surprised if the scalar equivalent is as fast or faster on ARM64 (and maybe ARMv7). Some timing here would be nice, but that can be done later.
Thanks,
> Although, I wouldn't be surprised if the scalar equivalent is as fast or faster on ARM64 (and maybe ARMv7). Some timing here would be nice, but that can be done later.
Sure, SIMD can be ignored if there is no benefit on ARM (or maybe on other platforms too); as you said, a benchmark could be done asap.
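For the timing, a minimal harness along these lines could be used to compare the scalar and SIMD variants built separately (a sketch only; it assumes POSIX clock_gettime and the cglm headers, and the iteration count is arbitrary):

#include <stdio.h>
#include <time.h>
#include <cglm/cglm.h>

int main(void) {
  versor q = {1.0f, 2.0f, 3.0f, 4.0f}, dest;
  struct timespec t0, t1;
  clock_gettime(CLOCK_MONOTONIC, &t0);
  for (int i = 0; i < 100000000; i++) {
    glm_quat_conjugate(q, dest);
    q[0] += dest[3] * 1e-9f; /* data dependency so the loop is not optimized away */
  }
  clock_gettime(CLOCK_MONOTONIC, &t1);
  printf("elapsed: %.3f s (q[0] = %f)\n",
         (double)(t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9,
         q[0]);
  return 0;
}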