
No central location for AVX math

Open · tdimitri opened this issue on Oct 20 '20 · 6 comments

In numpy, I notice intrinsics being checked in by xiegengxin.

However, it is difficult to tell which routines are best, because similar routines are also checked into this project.

Then further, when tests are added -- like this one for testing np.log strided -- they do not test the multithreaded version.

Further, C macro expansion code like the excerpt below is not easily portable and is harder to follow. Is it only written this way because template code is not allowed? And is that the case because we have not broken the math out into an external package we link against? That seems to me like the primary problem to resolve; otherwise code like this will keep being written.

static NPY_INLINE NPY_GCC_OPT_3 NPY_GCC_TARGET_AVX512F @vtype@
avx512_cabsolute_@vsub@(const @vtype@ x1,
                        const @vtype@ x2,
                        const __m512i re_indices,
                        const __m512i im_indices)
{
    @vtype@ inf = _mm512_set1_@vsub@(@INF@);
    @vtype@ nan = _mm512_set1_@vsub@(@NAN@);
    @vtype@ x1_abs = avx512_abs_@vsub@(x1);
    @vtype@ x2_abs = avx512_abs_@vsub@(x2);
    @vtype@ re = _mm512_permutex2var_@vsub@(x1_abs, re_indices, x2_abs);
    @vtype@ im = _mm512_permutex2var_@vsub@(x1_abs, im_indices, x2_abs);
    /* ... excerpt truncated ... */
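
For contrast, here is a sketch of roughly the same routine in C++ template style. This is hypothetical code, not numpy's or pnumpy's actual source: the vabs/vpermute2/vhypot helper names are made up, the inf/NaN special-casing of the original is dropped, and the final hypot step is a plausible completion since the excerpt above is truncated.

#include <immintrin.h>

// Plain overloads replace the @vsub@ suffix; the compiler picks the width.
static inline __m512 vabs(__m512 v)
{
    // Clear the sign bit with AVX512F integer ops.
    const __m512i mask = _mm512_set1_epi32(0x7fffffff);
    return _mm512_castsi512_ps(
        _mm512_and_si512(_mm512_castps_si512(v), mask));
}
static inline __m512 vpermute2(__m512 a, __m512i idx, __m512 b)
{
    return _mm512_permutex2var_ps(a, idx, b);
}
static inline __m512 vhypot(__m512 re, __m512 im)
{
    return _mm512_sqrt_ps(_mm512_fmadd_ps(re, re, _mm512_mul_ps(im, im)));
}
// (__m512d overloads would sit alongside these for the double path.)

template <typename VType>
static inline VType avx512_cabsolute(const VType x1, const VType x2,
                                     const __m512i re_indices,
                                     const __m512i im_indices)
{
    VType x1_abs = vabs(x1);
    VType x2_abs = vabs(x2);
    VType re = vpermute2(x1_abs, re_indices, x2_abs);
    VType im = vpermute2(x1_abs, im_indices, x2_abs);
    return vhypot(re, im);
}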

Thus it becomes difficult to track best of breed, because this code is being written to two locations.

Further, new Intel intrinsics have been added to immintrin.h, where Intel/Microsoft/etc. provide the speedups for functions like np.sin(). This is, again, in contrast to the code being checked into numpy/core/src/umath/simd.inc.src:

// Math functions
extern __m128  _mm_sin_ps(__m128);
extern __m128d _mm_sin_pd(__m128d);
extern __m256  _mm256_sin_ps(__m256);
extern __m256d _mm256_sin_pd(__m256d);
extern __m128  _mm_cos_ps(__m128);
extern __m128d _mm_cos_pd(__m128d);
extern __m256  _mm256_cos_ps(__m256);
extern __m256d _mm256_cos_pd(__m256d);
extern __m128  _mm_sincos_ps(__m128  * /*cos_res*/, __m128);
extern __m128d _mm_sincos_pd(__m128d * /*cos_res*/, __m128d);
extern __m256  _mm256_sincos_ps(__m256  * /*cos_res*/, __m256);
extern __m256d _mm256_sincos_pd(__m256d * /*cos_res*/, __m256d);
extern __m128  _mm_tan_ps(__m128);
extern __m128d _mm_tan_pd(__m128d);
extern __m256  _mm256_tan_ps(__m256);
extern __m256d _mm256_tan_pd(__m256d);
extern __m128  _mm_asin_ps(__m128);
extern __m128d _mm_asin_pd(__m128d);
extern __m256  _mm256_asin_ps(__m256);
extern __m256d _mm256_asin_pd(__m256d);
extern __m128  _mm_acos_ps(__m128);
extern __m128d _mm_acos_pd(__m128d);
extern __m256  _mm256_acos_ps(__m256);
extern __m256d _mm256_acos_pd(__m256d);
extern __m128  _mm_atan_ps(__m128);
extern __m128d _mm_atan_pd(__m128d);
extern __m256  _mm256_atan_ps(__m256);
extern __m256d _mm256_atan_pd(__m256d);
extern __m128  _mm_atan2_ps(__m128, __m128);
extern __m128d _mm_atan2_pd(__m128d, __m128d);
extern __m256  _mm256_atan2_ps(__m256, __m256);
extern __m256d _mm256_atan2_pd(__m256d, __m256d);
extern __m128  _mm_sind_ps(__m128);
extern __m128d _mm_sind_pd(__m128d);
extern __m256  _mm256_sind_ps(__m256);
extern __m256d _mm256_sind_pd(__m256d);
extern __m128  _mm_cosd_ps(__m128);
extern __m128d _mm_cosd_pd(__m128d);
extern __m256  _mm256_cosd_ps(__m256);
extern __m256d _mm256_cosd_pd(__m256d);
extern __m128  _mm_tand_ps(__m128);
extern __m128d _mm_tand_pd(__m128d);
extern __m256  _mm256_tand_ps(__m256);
extern __m256d _mm256_tand_pd(__m256d);
extern __m128  _mm_sinh_ps(__m128);
extern __m128d _mm_sinh_pd(__m128d);
extern __m256  _mm256_sinh_ps(__m256);
extern __m256d _mm256_sinh_pd(__m256d);
extern __m128  _mm_cosh_ps(__m128);
extern __m128d _mm_cosh_pd(__m128d);
extern __m256  _mm256_cosh_ps(__m256);
extern __m256d _mm256_cosh_pd(__m256d);
extern __m128  _mm_tanh_ps(__m128);
extern __m128d _mm_tanh_pd(__m128d);
extern __m256  _mm256_tanh_ps(__m256);
extern __m256d _mm256_tanh_pd(__m256d);
extern __m128  _mm_asinh_ps(__m128);
extern __m128d _mm_asinh_pd(__m128d);
extern __m256  _mm256_asinh_ps(__m256);
extern __m256d _mm256_asinh_pd(__m256d);
extern __m128  _mm_acosh_ps(__m128);
extern __m128d _mm_acosh_pd(__m128d);
extern __m256  _mm256_acosh_ps(__m256);
extern __m256d _mm256_acosh_pd(__m256d);
extern __m128  _mm_atanh_ps(__m128);
extern __m128d _mm_atanh_pd(__m128d);
extern __m256  _mm256_atanh_ps(__m256);
extern __m256d _mm256_atanh_pd(__m256d);
extern __m128  _mm_log_ps(__m128);
extern __m128d _mm_log_pd(__m128d);
extern __m256  _mm256_log_ps(__m256);
extern __m256d _mm256_log_pd(__m256d);
extern __m128  _mm_log1p_ps(__m128);
extern __m128d _mm_log1p_pd(__m128d);
extern __m256  _mm256_log1p_ps(__m256);
extern __m256d _mm256_log1p_pd(__m256d);
extern __m128  _mm_log10_ps(__m128);
extern __m128d _mm_log10_pd(__m128d);
extern __m256  _mm256_log10_ps(__m256);
extern __m256d _mm256_log10_pd(__m256d);
extern __m128  _mm_log2_ps(__m128);
extern __m128d _mm_log2_pd(__m128d);
extern __m256  _mm256_log2_ps(__m256);
extern __m256d _mm256_log2_pd(__m256d);
extern __m128  _mm_logb_ps(__m128);
extern __m128d _mm_logb_pd(__m128d);
extern __m256  _mm256_logb_ps(__m256);
extern __m256d _mm256_logb_pd(__m256d);
extern __m128  _mm_exp_ps(__m128);
extern __m128d _mm_exp_pd(__m128d);
extern __m256  _mm256_exp_ps(__m256);
extern __m256d _mm256_exp_pd(__m256d);
extern __m128  _mm_exp10_ps(__m128);
extern __m128d _mm_exp10_pd(__m128d);
extern __m256  _mm256_exp10_ps(__m256);
extern __m256d _mm256_exp10_pd(__m256d);
extern __m128  _mm_exp2_ps(__m128);
extern __m128d _mm_exp2_pd(__m128d);
extern __m256  _mm256_exp2_ps(__m256);
extern __m256d _mm256_exp2_pd(__m256d);
extern __m128  _mm_expm1_ps(__m128);
extern __m128d _mm_expm1_pd(__m128d);
extern __m256  _mm256_expm1_ps(__m256);
extern __m256d _mm256_expm1_pd(__m256d);
extern __m128  _mm_pow_ps(__m128, __m128);
extern __m128d _mm_pow_pd(__m128d, __m128d);
extern __m256  _mm256_pow_ps(__m256, __m256);
extern __m256d _mm256_pow_pd(__m256d, __m256d);
extern __m128  _mm_trunc_ps(__m128);
extern __m128d _mm_trunc_pd(__m128d);
extern __m256  _mm256_trunc_ps(__m256);
extern __m256d _mm256_trunc_pd(__m256d);
extern __m128  _mm_svml_floor_ps(__m128);
extern __m128d _mm_svml_floor_pd(__m128d);
extern __m256  _mm256_svml_floor_ps(__m256);
extern __m256d _mm256_svml_floor_pd(__m256d);
extern __m128  _mm_svml_ceil_ps(__m128);
extern __m128d _mm_svml_ceil_pd(__m128d);
extern __m256  _mm256_svml_ceil_ps(__m256);
extern __m256d _mm256_svml_ceil_pd(__m256d);
extern __m128  _mm_svml_round_ps(__m128);
extern __m128d _mm_svml_round_pd(__m128d);
extern __m256  _mm256_svml_round_ps(__m256);
extern __m256d _mm256_svml_round_pd(__m256d);
extern __m128  _mm_fmod_ps(__m128, __m128);
extern __m128d _mm_fmod_pd(__m128d, __m128d);
extern __m256  _mm256_fmod_ps(__m256, __m256);
extern __m256d _mm256_fmod_pd(__m256d, __m256d);
extern __m128  _mm_svml_sqrt_ps(__m128);
extern __m128d _mm_svml_sqrt_pd(__m128d);
extern __m256  _mm256_svml_sqrt_ps(__m256);
extern __m256d _mm256_svml_sqrt_pd(__m256d);
extern __m128  _mm_invsqrt_ps(__m128);
extern __m128d _mm_invsqrt_pd(__m128d);
extern __m256  _mm256_invsqrt_ps(__m256);
extern __m256d _mm256_invsqrt_pd(__m256d);
extern __m128  _mm_cbrt_ps(__m128);
extern __m128d _mm_cbrt_pd(__m128d);
extern __m256  _mm256_cbrt_ps(__m256);
extern __m256d _mm256_cbrt_pd(__m256d);
extern __m128  _mm_invcbrt_ps(__m128);
extern __m128d _mm_invcbrt_pd(__m128d);
extern __m256  _mm256_invcbrt_ps(__m256);
extern __m256d _mm256_invcbrt_pd(__m256d);
extern __m128  _mm_hypot_ps(__m128, __m128);
extern __m128d _mm_hypot_pd(__m128d, __m128d);
extern __m256  _mm256_hypot_ps(__m256, __m256);
extern __m256d _mm256_hypot_pd(__m256d, __m256d);
extern __m128  _mm_cdfnorm_ps(__m128);
extern __m128d _mm_cdfnorm_pd(__m128d);
extern __m256  _mm256_cdfnorm_ps(__m256);
extern __m256d _mm256_cdfnorm_pd(__m256d);
extern __m128  _mm_cdfnorminv_ps(__m128);
extern __m128d _mm_cdfnorminv_pd(__m128d);
extern __m256  _mm256_cdfnorminv_ps(__m256);
extern __m256d _mm256_cdfnorminv_pd(__m256d);
extern __m128  _mm_cexp_ps(__m128);
extern __m256  _mm256_cexp_ps(__m256);
extern __m128  _mm_clog_ps(__m128);
extern __m256  _mm256_clog_ps(__m256);
extern __m128  _mm_csqrt_ps(__m128);
extern __m256  _mm256_csqrt_ps(__m256);
extern __m128  _mm_erf_ps(__m128);
extern __m128d _mm_erf_pd(__m128d);
extern __m256  _mm256_erf_ps(__m256);
extern __m256d _mm256_erf_pd(__m256d);
extern __m128  _mm_erfc_ps(__m128);
extern __m128d _mm_erfc_pd(__m128d);
extern __m256  _mm256_erfc_ps(__m256);
extern __m256d _mm256_erfc_pd(__m256d);
extern __m128  _mm_erfcinv_ps(__m128);
extern __m128d _mm_erfcinv_pd(__m128d);
extern __m256  _mm256_erfcinv_ps(__m256);
extern __m256d _mm256_erfcinv_pd(__m256d);
extern __m128  _mm_erfinv_ps(__m128);
extern __m128d _mm_erfinv_pd(__m128d);
extern __m256  _mm256_erfinv_ps(__m256);
extern __m256d _mm256_erfinv_pd(__m256d);


tdimitri avatar Oct 20 '20 13:10 tdimitri

Did you look through the Universal SIMD NEP and the work being done to make this not specific to x86? The idea is that abstract intrinsics will be replaced (at compile time, by macros) with platform-specific ones.
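
A minimal sketch of that idea, with macro names modeled on the npyv_ convention in numpy's universal SIMD work (the exact spellings here are illustrative):

/* Abstract lanes resolve to platform intrinsics at compile time, so a
   kernel is written once against the npyv_* names. */
#if defined(NPY_HAVE_AVX2)
    #include <immintrin.h>
    #define npyv_f32       __m256
    #define npyv_load_f32  _mm256_loadu_ps
    #define npyv_add_f32   _mm256_add_ps
#elif defined(NPY_HAVE_NEON)
    #include <arm_neon.h>
    #define npyv_f32       float32x4_t
    #define npyv_load_f32  vld1q_f32
    #define npyv_add_f32   vaddq_f32
#endif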

mattip avatar Oct 20 '20 13:10 mattip

As long as that package is external to numpy and Python, and is a lib that can be loaded and used universally, then I agree. The vector math routines are universal across platforms and languages and should be decoupled from numpy. Then others can participate -- for instance, contributing routines like BooleanSum for all platforms/languages: if Rust wants a fast BooleanSum, or the SDS file loader needs one for its boolean mask, and so on.
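
A scalar reference for that kind of routine (the name and signature are hypothetical; a shared library would supply tuned SIMD versions per platform):

#include <cstddef>
#include <cstdint>

// Count the set bytes in a boolean mask.
std::size_t BooleanSum(const std::uint8_t* mask, std::size_t n)
{
    std::size_t total = 0;
    for (std::size_t i = 0; i < n; ++i)
        total += (mask[i] != 0);  // the hot loop a SIMD library would vectorize
    return total;
}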

tdimitri avatar Oct 20 '20 13:10 tdimitri

This is dependent on action from NumPy. It would be good to open an issue there and link that back here.

mattip avatar Oct 26 '20 14:10 mattip

By the way, last week I took the time to show a complete compare-and-contrast example of the log operation written in numpy with C macro expansion vs. C++ templates.

I am biased, but I think the C++ template version is more portable and easier to read and maintain. The cpp file ops_log.cpp stands on its own.

templated log example

numpy log example

And then another routine here: open source example
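
A condensed sketch of that standalone shape (VTraits here is a hypothetical traits type supplying the vector type, lane count, and operations):

#include <cmath>
#include <cstddef>

// One template body per operation, instantiated per type -- no .src
// macro expansion pass needed.
template <typename T, typename VTraits>
void vec_log(const T* in, T* out, std::size_t n)
{
    using V = typename VTraits::vec;
    std::size_t i = 0;
    for (; i + VTraits::lanes <= n; i += VTraits::lanes) {
        V x = VTraits::load(in + i);          // full vector chunks
        VTraits::store(out + i, VTraits::log(x));
    }
    for (; i < n; ++i)
        out[i] = std::log(in[i]);             // scalar tail
}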

Then both the Intel and Microsoft compilers have _mm256_log_ps() intrinsics (which appear to run faster than the existing one in numpy).

gcc has __m256d _ZGVdN4v_log(__m256d x); built in (and maybe more -- this one appears to run slower). clang might have one as well, but I have not bothered to look.
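
A hedged usage sketch (not numpy code): _mm256_log_ps comes from the SVML-style declarations quoted above and exists under MSVC/ICC; on gcc the libmvec symbol has to be declared by hand under its mangled vector-ABI name.

#include <immintrin.h>
#include <cmath>
#include <cstddef>

#if defined(__GNUC__) && !defined(__INTEL_COMPILER)
extern "C" __m256d _ZGVdN4v_log(__m256d);  // glibc libmvec, AVX2 variant
#endif

void log_array_f32(const float* in, float* out, std::size_t n)
{
    std::size_t i = 0;
#if defined(_MSC_VER) || defined(__INTEL_COMPILER)
    for (; i + 8 <= n; i += 8)                // 8 floats per __m256
        _mm256_storeu_ps(out + i,
                         _mm256_log_ps(_mm256_loadu_ps(in + i)));
#endif
    for (; i < n; ++i)
        out[i] = std::log(in[i]);             // scalar tail / fallback
}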

That is at least 5 different versions of AVX log functions...which one is the best of breed?

Then further, the numpy implementation (which is well optimized by a skilled developer who knew what they were doing)... even that well-written code is missing an optimization:

 exponent = @isa@_get_exponent(x);
 x = @isa@_get_mantissa(x);

The calls to get_exponent and get_mantissa are very similar and can be fused into one routine that returns both the exponent and the mantissa.

numpy get_exponent / templated get_exponent
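
A hedged sketch of the fused form (AVX2 bit manipulation; the names are hypothetical, and it assumes normal positive inputs -- zero/denormal/sign handling is omitted):

#include <immintrin.h>

struct ExpMant { __m256 exponent; __m256 mantissa; };

static inline ExpMant avx2_get_exp_mant(__m256 x)
{
    const __m256i bits = _mm256_castps_si256(x);
    // Exponent: isolate the biased field, shift it down, subtract bias 127.
    __m256i e = _mm256_sub_epi32(
        _mm256_srli_epi32(
            _mm256_and_si256(bits, _mm256_set1_epi32(0x7f800000)), 23),
        _mm256_set1_epi32(127));
    // Mantissa: keep the fraction bits and splice in 1.0f's exponent bits,
    // yielding a value in [1, 2).
    __m256i m = _mm256_or_si256(
        _mm256_and_si256(bits, _mm256_set1_epi32(0x007fffff)),
        _mm256_set1_epi32(0x3f800000));
    return { _mm256_cvtepi32_ps(e), _mm256_castsi256_ps(m) };
}

One pass over the bits yields both results, instead of masking and shifting twice.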

It is easier to spot this optimization in the templated code because it is easier to read and follow. My C++ editor can tag all the functions, showing me both template and intrinsic information, and compiler errors are easier to understand and fix. By contrast, my editor has little clue what is going on in the macro expansion C code.

Is anyone going to further optimize the log operation in numpy src/umath/simd.inc.src? If they do, that optimization is only for numpy, as opposed to the rest of the world, and that developer is writing in a macro expansion style that is specific to numpy. It would be nice if numpy switched over to the templated style and to an external lib that the world could collaborate on and use.

tdimitri avatar Oct 26 '20 15:10 tdimitri

xref numpy/numpy#17698

mattip avatar Nov 03 '20 11:11 mattip

In particular there is this breakout of the Universal SIMD functions. Maybe we could be its first users?

mattip avatar Dec 15 '20 14:12 mattip