sleef Reciprocal square-root algorithms

All architectures currently targeted by the library have instructions that perform a single Newton-Raphson (NR) iteration of a reciprocal square root (and some also provide an exact computation):

x86, x86_64: rsqrtps (SSE), vrsqrtps (AVX), vrsqrt14p{s,d} (AVX-512F), vrsqrt28p{s,d} (AVX-512ER)
arm32, arm64: frsqrte (NEON), frsqrts (NEON)
ppc: frsqrte (ALTIVEC 32-bit floats, VSX supports 64-bit floats)
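
A minimal sketch of how a low-precision estimate is typically refined with NR steps, written in portable scalar C. The names rsqrt_estimate, rsqrt_nr_step and rsqrt_refined are hypothetical and not part of SLEEF. The integer bit trick only stands in for a hardware estimate instruction so that the sketch runs on any architecture; it is less accurate than the instructions listed above.

```c
#include <stdint.h>
#include <string.h>

/* Stand-in for a hardware estimate instruction (e.g. rsqrtps or frsqrte):
   the classic integer bit trick, accurate to a few percent. Used here only
   so the sketch runs on any architecture. Requires finite x > 0. */
static float rsqrt_estimate(float x) {
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);          /* reinterpret float as integer */
    bits = 0x5f3759dfu - (bits >> 1);
    float y;
    memcpy(&y, &bits, sizeof y);             /* reinterpret integer as float */
    return y;
}

/* One Newton-Raphson step on f(y) = 1/(y*y) - x:
   y_next = y * (1.5 - 0.5 * x * y * y).
   Each step roughly doubles the number of correct bits. */
static float rsqrt_nr_step(float x, float y) {
    return y * (1.5f - 0.5f * x * y * y);
}

/* Hypothetical helper: estimate followed by `steps` refinement steps. */
static float rsqrt_refined(float x, int steps) {
    float y = rsqrt_estimate(x);
    for (int i = 0; i < steps; ++i)
        y = rsqrt_nr_step(x, y);
    return y;
}
```

For example, rsqrt_refined(4.0f, 0) returns only the raw estimate, while rsqrt_refined(4.0f, 2) is much closer to the exact value 0.5. A vectorized version would presumably replace rsqrt_estimate with the architecture's estimate intrinsic and keep the same refinement formula.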

The Intel C Compiler (icc) provides many reciprocal square-root intrinsics with different levels of precision (in bits):

_mm_invsqrt_pd, _mm256_invsqrt_pd, _mm512_invsqrt_pd
_mm_invsqrt_ps, _mm256_invsqrt_ps, _mm512_invsqrt_ps
_mm_rsqrt_ps, _mm256_rsqrt_ps, _mm512_rsqrt23_ps, _mm512_rsqrt28_ps
_mm_rsqrt_pd, _mm256_rsqrt_pd, _mm512_rsqrt14_pd, _mm512_rsqrt28_pd

The SVML library specifies these here (https://software.intel.com/en-us/ipp-dev-reference-invsqrt):

ippsInvSqrt_32f_A11 (14ULPs??), ippsInvSqrt_32f_A21 (4ULPs), ippsInvSqrt_32f_A24 (1ULP), ippsInvSqrt_64f_A26 (6.7E+7 ULPs), ippsInvSqrt_64f_A50 (4 ULPs), ippsInvSqrt_64f_A53 (1ULP)

Clang does not provide most of these, and does not implement the invsqrt intrinsics.

These intrinsics are tricky to implement efficiently and correctly, yet have extensive hardware support, and are very useful (e.g. to normalize vectors). I think it would make sense to provide an API for reciprocal square roots with different levels of precision, just like SVML does.

Aug 15 '18 07:08 gnzlbg

Would you like to try implementing that feature? You are welcome to contribute. I will advise you on how to implement it correctly.

Aug 15 '18 08:08 shibatch

I don't have hardware to test ARM and PPC (only qemu), and don't have much experience on this front, but I can give it a shot.

Aug 15 '18 09:08 gnzlbg

Okay, then please tentatively write those functions with intrinsics. You can approximate the error in ULP by reinterpreting the floating-point values as integers and calculating the difference between the correct and approximate values.

Aug 15 '18 09:08 shibatch