volk
New AVX512F implementation
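I think this is a better implementation of the reciprocal kernel: it uses the AVX512F _mm512_rcp14_ps intrinsic, which handles special values (signed zero, inf, NaN) correctly. The result is approximate, accurate to a relative tolerance < 6.2e-5. On a 7950X3D the unaligned AVX512 kernel is about 30% faster than generic, and the aligned one about 23% faster (see the profile below).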
magnus@r7950x3d:~/src/kazam/volk/build$ volk_profile -R reci
RUN_VOLK_TESTS: volk_32f_reciprocal_32f(131071,1987)
generic completed in 20.7839 ms
a_sse completed in 41.2548 ms
a_avx completed in 20.6385 ms
a_avx512 completed in 16.861 ms
u_sse completed in 41.301 ms
u_avx completed in 20.7819 ms
u_avx512 completed in 15.9916 ms
Best aligned arch: u_avx512
Best unaligned arch: u_avx512
Writing /home/magnus/.volk/volk_config...
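For readers who have not opened the diff, here is a minimal sketch of the idea behind the kernel, not the actual VOLK source. The function name reciprocal_32f_sketch is hypothetical, and the code assumes unaligned loads plus a scalar tail. When the compiler does not enable AVX512F, the whole loop falls back to plain division.

```c
#include <assert.h>
#include <stddef.h>
#ifdef __AVX512F__
#include <immintrin.h>
#endif

/* Hypothetical name for illustration; not the actual VOLK kernel. */
static void reciprocal_32f_sketch(float* out, const float* in, size_t n)
{
    assert(n == 0 || (in != NULL && out != NULL));
    size_t i = 0;
#ifdef __AVX512F__
    /* 16 floats per iteration using the approximate reciprocal
     * (relative error below 2^-14 according to the Intel documentation). */
    for (; i + 16 <= n; i += 16) {
        __m512 x = _mm512_loadu_ps(in + i);
        _mm512_storeu_ps(out + i, _mm512_rcp14_ps(x));
    }
#endif
    /* Scalar tail, and the full fallback when AVX512F is not enabled. */
    for (; i < n; i++) {
        out[i] = 1.0f / in[i];
    }
}
```

Calling reciprocal_32f_sketch(out, in, n) on any buffer length works because the scalar loop picks up the last n % 16 elements. The tail uses exact division, so only the vectorized part carries the approximation error.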
I ran all kernels for special values:
magnus@r7950x3d:~/src/kazam/scratch$ ./a.out
x:
-0.0000e+00 0.0000e+00 inf -inf nan -nan 1.0000e-30 1.0000e+30
generic:
-inf inf 0.0000e+00 -0.0000e+00 nan -nan 1.0000e+30 1.0000e-30
a_sse:
-inf inf 0.0000e+00 -0.0000e+00 nan -nan 1.0000e+30 1.0000e-30
a_avx:
-inf inf 0.0000e+00 -0.0000e+00 nan -nan 1.0000e+30 1.0000e-30
a_avx512:
-inf inf 0.0000e+00 -0.0000e+00 nan -nan 9.9999e+29 9.9999e-31
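Signed zero, inf, and NaN are handled properly by all kernels shown. The a_avx512 results for 1e-30 and 1e+30 differ from the exact values only in the last printed digit, which is within the stated tolerance.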
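The comparison above was done by eye. A check along the following lines could automate it. This is only a sketch: the helper name reciprocal_ok is hypothetical, it assumes IEEE 754 float semantics, and it ignores the sign of NaN.

```c
#include <assert.h>
#include <math.h>
#include <stdbool.h>

/* Hypothetical helper: true if r is an acceptable reciprocal of x. */
static bool reciprocal_ok(float x, float r, float rel_tol)
{
    assert(rel_tol >= 0.0f);
    float ref = 1.0f / x; /* reference result, assuming IEEE 754 division */
    if (isnan(ref)) {
        return isnan(r); /* NaN in, NaN out; the NaN sign is not checked */
    }
    if (isinf(ref) || ref == 0.0f) {
        /* Infinities and zeros must match exactly, including the sign. */
        return r == ref && (!signbit(r) == !signbit(ref));
    }
    /* Finite, nonzero results are compared with a relative tolerance. */
    return fabsf(r - ref) <= rel_tol * fabsf(ref);
}
```

With rel_tol set to 6.2e-5f, the a_avx512 value 9.9999e+29 passes as the reciprocal of 1.0000e-30, while a result of +inf for an input of -0.0 is rejected because the sign differs.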
No objections. The broken build should be fixed now with #761. Merging...