Reciprocal square-root algorithms

Open gnzlbg opened this issue 7 years ago • 3 comments

All architectures currently targeted by the library have instructions that compute a reciprocal square-root estimate or perform a single Newton-Raphson (NR) iteration toward one (and some also offer an exact computation):

  • x86, x86_64: rsqrtps (SSE), vrsqrtps (AVX), vrsqrt14p{s,d} (AVX-512F), vrsqrt23p{s,d} (AVX-512F)
  • arm32, arm64: frsqrte (NEON), frsqrts (NEON)
  • ppc: frsqrte (ALTIVEC 32-bit floats, VSX supports 64-bit floats)
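As a rough, portable illustration of the "estimate plus NR iteration" pattern these instructions support, here is a minimal scalar sketch. It is not SLEEF code: `rsqrt_estimate`, `rsqrt_nr_step`, and `rsqrt_approx` are hypothetical names, and the well-known integer bit trick only stands in for a hardware estimate such as rsqrtps or frsqrte. It assumes positive, normal, finite inputs.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Crude initial estimate of 1/sqrt(x). This stands in for a hardware
   estimate instruction; it is only meaningful for positive, normal x. */
static float rsqrt_estimate(float x) {
    uint32_t i;
    float y;
    memcpy(&i, &x, sizeof i);        /* reinterpret the bits safely */
    i = 0x5f3759dfu - (i >> 1);      /* classic "magic constant" guess */
    memcpy(&y, &i, sizeof y);
    return y;
}

/* One Newton-Raphson step for f(y) = 1/y^2 - x:
   y' = y * (1.5 - 0.5 * x * y * y). */
static float rsqrt_nr_step(float x, float y) {
    return y * (1.5f - 0.5f * x * y * y);
}

/* Hypothetical tiered entry point: each extra step roughly doubles
   the number of correct bits until float precision is reached. */
static float rsqrt_approx(float x, int steps) {
    assert(x > 0.0f);                /* sketch does not handle 0, negatives, NaN */
    float y = rsqrt_estimate(x);
    for (int k = 0; k < steps; ++k)
        y = rsqrt_nr_step(x, y);
    return y;
}
```

With a real hardware estimate as the starting point, fewer steps would be needed for the same accuracy; the `steps` parameter is one possible way to expose different precision levels.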

The Intel C Compiler (icc) provides many reciprocal square-root intrinsics with different levels of precision (in bits):

  • _mm_invsqrt_pd, _mm256_invsqrt_pd, _mm512_invsqrt_pd
  • _mm_invsqrt_ps, _mm256_invsqrt_ps, _mm512_invsqrt_ps
  • _mm_rsqrt_ps, _mm256_rsqrt_ps, _mm512_rsqrt23_ps, _mm512_rsqrt28_ps
  • _mm_rsqrt_pd, _mm256_rsqrt_pd, _mm512_rsqrt14_pd, _mm512_rsqrt28_pd

The SVML library specifies these here (https://software.intel.com/en-us/ipp-dev-reference-invsqrt):

  • ippsInvSqrt_32f_A11 (14 ULPs??), ippsInvSqrt_32f_A21 (4 ULPs), ippsInvSqrt_32f_A24 (1 ULP)
  • ippsInvSqrt_64f_A26 (6.7E+7 ULPs), ippsInvSqrt_64f_A50 (4 ULPs), ippsInvSqrt_64f_A53 (1 ULP)

Clang does not provide most of these, and does not implement the invsqrt intrinsics.

These intrinsics are tricky to implement efficiently and correctly, yet have extensive hardware support, and are very useful (e.g. to normalize vectors). I think it would make sense to provide an API for reciprocal square roots with different levels of precision, just like SVML does.

gnzlbg, Aug 15 '18 07:08

Would you like to try implementing that feature? You are welcome to contribute. I will advise you on how to implement them correctly.

shibatch, Aug 15 '18 08:08

I don't have hardware to test ARM and PPC (only qemu), and I don't have much experience on this front, but I can give it a shot.

gnzlbg, Aug 15 '18 09:08

Okay, then please tentatively write those functions with intrinsics. You can approximate the error in ULP by reinterpreting the bits of each floating-point value as an integer and calculating the difference between the correct and approximate values.
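A minimal sketch of that ULP estimate for single precision could look like the following. The names `float_to_ordered` and `ulp_distance` are hypothetical, not part of SLEEF, and the sketch assumes neither input is NaN or infinite. The "correct" value is assumed to come from a higher-precision reference rounded to float.

```c
#include <stdint.h>
#include <string.h>

/* Map a float's bit pattern to an integer that increases with the
   float's value, so that adjacent floats differ by exactly 1. */
static int64_t float_to_ordered(float f) {
    int32_t i;
    memcpy(&i, &f, sizeof i);        /* reinterpret the bits safely */
    /* Negative floats are stored as sign-magnitude; mirror them onto
       the negative side so -0.0 and +0.0 both map to 0. */
    return (i < 0) ? (int64_t)INT32_MIN - (int64_t)i : (int64_t)i;
}

/* Number of representable floats between the two values, which
   approximates the error of `approx` in ULPs. */
static int64_t ulp_distance(float correct, float approx) {
    int64_t d = float_to_ordered(correct) - float_to_ordered(approx);
    return d < 0 ? -d : d;
}
```

For example, `ulp_distance(1.0f, 0x1.000002p0f)` is 1, because the second argument is the next float above 1.0. This is only an approximation of the true ULP error, since the reference itself has already been rounded to float.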

shibatch, Aug 15 '18 09:08