No central location for AVX math
In numpy, I notice numpy intrinsics being checked in by xiegengxin.
However, it is difficult to tell which routines are best, because similar routines are also checked into this project.
Further, when tests are added (like this one for testing np.log strided), they do not test the multithreaded version.
Further, C macro expansion code like the snippet below is not easily portable and is harder to follow. Is it written this way only because template code is not allowed? And is that, in turn, because the math routines have not been broken out into an external package that numpy links against? That seems to me like the primary problem to resolve; otherwise, code like this will keep being written.
static NPY_INLINE NPY_GCC_OPT_3 NPY_GCC_TARGET_AVX512F @vtype@
avx512_cabsolute_@vsub@(const @vtype@ x1,
const @vtype@ x2,
const __m512i re_indices,
const __m512i im_indices)
{
@vtype@ inf = _mm512_set1_@vsub@(@INF@);
@vtype@ nan = _mm512_set1_@vsub@(@NAN@);
@vtype@ x1_abs = avx512_abs_@vsub@(x1);
@vtype@ x2_abs = avx512_abs_@vsub@(x2);
@vtype@ re = _mm512_permutex2var_@vsub@(x1_abs, re_indices, x2_abs);
@vtype@ im = _mm512_permutex2var_@vsub@(x1_abs, im_indices, x2_abs);
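For contrast, here is a minimal sketch of the same helper written once as a C++ template, with the @vtype@/@vsub@ substitutions replaced by a small traits struct. This is an illustration only, not numpy source: the trait names are made up, and the inf/nan handling of the original is omitted.

#include <immintrin.h>

// Hedged sketch (illustration only): map float/double onto the AVX-512 vector
// type and the handful of intrinsics the helper needs.
template <typename T> struct v512;
template <> struct v512<float> {
    using vec = __m512;
    static vec abs(vec x)                               { return _mm512_abs_ps(x); }
    static vec permutex2var(vec a, __m512i idx, vec b)  { return _mm512_permutex2var_ps(a, idx, b); }
    static vec mul(vec a, vec b)                        { return _mm512_mul_ps(a, b); }
    static vec add(vec a, vec b)                        { return _mm512_add_ps(a, b); }
    static vec sqrt(vec x)                              { return _mm512_sqrt_ps(x); }
};
template <> struct v512<double> {
    using vec = __m512d;
    static vec abs(vec x)                               { return _mm512_abs_pd(x); }
    static vec permutex2var(vec a, __m512i idx, vec b)  { return _mm512_permutex2var_pd(a, idx, b); }
    static vec mul(vec a, vec b)                        { return _mm512_mul_pd(a, b); }
    static vec add(vec a, vec b)                        { return _mm512_add_pd(a, b); }
    static vec sqrt(vec x)                              { return _mm512_sqrt_pd(x); }
};

// One definition covers both precisions; instantiated as
// avx512_cabsolute<float>(...) or avx512_cabsolute<double>(...).
template <typename T>
typename v512<T>::vec avx512_cabsolute(typename v512<T>::vec x1,
                                       typename v512<T>::vec x2,
                                       __m512i re_indices,
                                       __m512i im_indices)
{
    using V = v512<T>;
    typename V::vec x1_abs = V::abs(x1);
    typename V::vec x2_abs = V::abs(x2);
    typename V::vec re = V::permutex2var(x1_abs, re_indices, x2_abs);
    typename V::vec im = V::permutex2var(x1_abs, im_indices, x2_abs);
    // Plain sqrt(re^2 + im^2); the original's inf/nan special cases are elided.
    return V::sqrt(V::add(V::mul(re, re), V::mul(im, im)));
}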
Thus it becomes difficult to track best of breed, because this code is being written in two locations.
Further, new Intel intrinsics have been added to immintrin.h, where Intel, Microsoft, etc. provide the speedups for functions like np.sin(). This, again, is in contrast to the code being checked into numpy/core/src/umath/simd.inc.src:
// Math functions
extern __m128 _mm_sin_ps(__m128);
extern __m128d _mm_sin_pd(__m128d);
extern __m256 _mm256_sin_ps(__m256);
extern __m256d _mm256_sin_pd(__m256d);
extern __m128 _mm_cos_ps(__m128);
extern __m128d _mm_cos_pd(__m128d);
extern __m256 _mm256_cos_ps(__m256);
extern __m256d _mm256_cos_pd(__m256d);
extern __m128 _mm_sincos_ps(__m128 * /*cos_res*/, __m128);
extern __m128d _mm_sincos_pd(__m128d * /*cos_res*/, __m128d);
extern __m256 _mm256_sincos_ps(__m256 * /*cos_res*/, __m256);
extern __m256d _mm256_sincos_pd(__m256d * /*cos_res*/, __m256d);
extern __m128 _mm_tan_ps(__m128);
extern __m128d _mm_tan_pd(__m128d);
extern __m256 _mm256_tan_ps(__m256);
extern __m256d _mm256_tan_pd(__m256d);
extern __m128 _mm_asin_ps(__m128);
extern __m128d _mm_asin_pd(__m128d);
extern __m256 _mm256_asin_ps(__m256);
extern __m256d _mm256_asin_pd(__m256d);
extern __m128 _mm_acos_ps(__m128);
extern __m128d _mm_acos_pd(__m128d);
extern __m256 _mm256_acos_ps(__m256);
extern __m256d _mm256_acos_pd(__m256d);
extern __m128 _mm_atan_ps(__m128);
extern __m128d _mm_atan_pd(__m128d);
extern __m256 _mm256_atan_ps(__m256);
extern __m256d _mm256_atan_pd(__m256d);
extern __m128 _mm_atan2_ps(__m128, __m128);
extern __m128d _mm_atan2_pd(__m128d, __m128d);
extern __m256 _mm256_atan2_ps(__m256, __m256);
extern __m256d _mm256_atan2_pd(__m256d, __m256d);
extern __m128 _mm_sind_ps(__m128);
extern __m128d _mm_sind_pd(__m128d);
extern __m256 _mm256_sind_ps(__m256);
extern __m256d _mm256_sind_pd(__m256d);
extern __m128 _mm_cosd_ps(__m128);
extern __m128d _mm_cosd_pd(__m128d);
extern __m256 _mm256_cosd_ps(__m256);
extern __m256d _mm256_cosd_pd(__m256d);
extern __m128 _mm_tand_ps(__m128);
extern __m128d _mm_tand_pd(__m128d);
extern __m256 _mm256_tand_ps(__m256);
extern __m256d _mm256_tand_pd(__m256d);
extern __m128 _mm_sinh_ps(__m128);
extern __m128d _mm_sinh_pd(__m128d);
extern __m256 _mm256_sinh_ps(__m256);
extern __m256d _mm256_sinh_pd(__m256d);
extern __m128 _mm_cosh_ps(__m128);
extern __m128d _mm_cosh_pd(__m128d);
extern __m256 _mm256_cosh_ps(__m256);
extern __m256d _mm256_cosh_pd(__m256d);
extern __m128 _mm_tanh_ps(__m128);
extern __m128d _mm_tanh_pd(__m128d);
extern __m256 _mm256_tanh_ps(__m256);
extern __m256d _mm256_tanh_pd(__m256d);
extern __m128 _mm_asinh_ps(__m128);
extern __m128d _mm_asinh_pd(__m128d);
extern __m256 _mm256_asinh_ps(__m256);
extern __m256d _mm256_asinh_pd(__m256d);
extern __m128 _mm_acosh_ps(__m128);
extern __m128d _mm_acosh_pd(__m128d);
extern __m256 _mm256_acosh_ps(__m256);
extern __m256d _mm256_acosh_pd(__m256d);
extern __m128 _mm_atanh_ps(__m128);
extern __m128d _mm_atanh_pd(__m128d);
extern __m256 _mm256_atanh_ps(__m256);
extern __m256d _mm256_atanh_pd(__m256d);
extern __m128 _mm_log_ps(__m128);
extern __m128d _mm_log_pd(__m128d);
extern __m256 _mm256_log_ps(__m256);
extern __m256d _mm256_log_pd(__m256d);
extern __m128 _mm_log1p_ps(__m128);
extern __m128d _mm_log1p_pd(__m128d);
extern __m256 _mm256_log1p_ps(__m256);
extern __m256d _mm256_log1p_pd(__m256d);
extern __m128 _mm_log10_ps(__m128);
extern __m128d _mm_log10_pd(__m128d);
extern __m256 _mm256_log10_ps(__m256);
extern __m256d _mm256_log10_pd(__m256d);
extern __m128 _mm_log2_ps(__m128);
extern __m128d _mm_log2_pd(__m128d);
extern __m256 _mm256_log2_ps(__m256);
extern __m256d _mm256_log2_pd(__m256d);
extern __m128 _mm_logb_ps(__m128);
extern __m128d _mm_logb_pd(__m128d);
extern __m256 _mm256_logb_ps(__m256);
extern __m256d _mm256_logb_pd(__m256d);
extern __m128 _mm_exp_ps(__m128);
extern __m128d _mm_exp_pd(__m128d);
extern __m256 _mm256_exp_ps(__m256);
extern __m256d _mm256_exp_pd(__m256d);
extern __m128 _mm_exp10_ps(__m128);
extern __m128d _mm_exp10_pd(__m128d);
extern __m256 _mm256_exp10_ps(__m256);
extern __m256d _mm256_exp10_pd(__m256d);
extern __m128 _mm_exp2_ps(__m128);
extern __m128d _mm_exp2_pd(__m128d);
extern __m256 _mm256_exp2_ps(__m256);
extern __m256d _mm256_exp2_pd(__m256d);
extern __m128 _mm_expm1_ps(__m128);
extern __m128d _mm_expm1_pd(__m128d);
extern __m256 _mm256_expm1_ps(__m256);
extern __m256d _mm256_expm1_pd(__m256d);
extern __m128 _mm_pow_ps(__m128, __m128);
extern __m128d _mm_pow_pd(__m128d, __m128d);
extern __m256 _mm256_pow_ps(__m256, __m256);
extern __m256d _mm256_pow_pd(__m256d, __m256d);
extern __m128 _mm_trunc_ps(__m128);
extern __m128d _mm_trunc_pd(__m128d);
extern __m256 _mm256_trunc_ps(__m256);
extern __m256d _mm256_trunc_pd(__m256d);
extern __m128 _mm_svml_floor_ps(__m128);
extern __m128d _mm_svml_floor_pd(__m128d);
extern __m256 _mm256_svml_floor_ps(__m256);
extern __m256d _mm256_svml_floor_pd(__m256d);
extern __m128 _mm_svml_ceil_ps(__m128);
extern __m128d _mm_svml_ceil_pd(__m128d);
extern __m256 _mm256_svml_ceil_ps(__m256);
extern __m256d _mm256_svml_ceil_pd(__m256d);
extern __m128 _mm_svml_round_ps(__m128);
extern __m128d _mm_svml_round_pd(__m128d);
extern __m256 _mm256_svml_round_ps(__m256);
extern __m256d _mm256_svml_round_pd(__m256d);
extern __m128 _mm_fmod_ps(__m128, __m128);
extern __m128d _mm_fmod_pd(__m128d, __m128d);
extern __m256 _mm256_fmod_ps(__m256, __m256);
extern __m256d _mm256_fmod_pd(__m256d, __m256d);
extern __m128 _mm_svml_sqrt_ps(__m128);
extern __m128d _mm_svml_sqrt_pd(__m128d);
extern __m256 _mm256_svml_sqrt_ps(__m256);
extern __m256d _mm256_svml_sqrt_pd(__m256d);
extern __m128 _mm_invsqrt_ps(__m128);
extern __m128d _mm_invsqrt_pd(__m128d);
extern __m256 _mm256_invsqrt_ps(__m256);
extern __m256d _mm256_invsqrt_pd(__m256d);
extern __m128 _mm_cbrt_ps(__m128);
extern __m128d _mm_cbrt_pd(__m128d);
extern __m256 _mm256_cbrt_ps(__m256);
extern __m256d _mm256_cbrt_pd(__m256d);
extern __m128 _mm_invcbrt_ps(__m128);
extern __m128d _mm_invcbrt_pd(__m128d);
extern __m256 _mm256_invcbrt_ps(__m256);
extern __m256d _mm256_invcbrt_pd(__m256d);
extern __m128 _mm_hypot_ps(__m128, __m128);
extern __m128d _mm_hypot_pd(__m128d, __m128d);
extern __m256 _mm256_hypot_ps(__m256, __m256);
extern __m256d _mm256_hypot_pd(__m256d, __m256d);
extern __m128 _mm_cdfnorm_ps(__m128);
extern __m128d _mm_cdfnorm_pd(__m128d);
extern __m256 _mm256_cdfnorm_ps(__m256);
extern __m256d _mm256_cdfnorm_pd(__m256d);
extern __m128 _mm_cdfnorminv_ps(__m128);
extern __m128d _mm_cdfnorminv_pd(__m128d);
extern __m256 _mm256_cdfnorminv_ps(__m256);
extern __m256d _mm256_cdfnorminv_pd(__m256d);
extern __m128 _mm_cexp_ps(__m128);
extern __m256 _mm256_cexp_ps(__m256);
extern __m128 _mm_clog_ps(__m128);
extern __m256 _mm256_clog_ps(__m256);
extern __m128 _mm_csqrt_ps(__m128);
extern __m256 _mm256_csqrt_ps(__m256);
extern __m128 _mm_erf_ps(__m128);
extern __m128d _mm_erf_pd(__m128d);
extern __m256 _mm256_erf_ps(__m256);
extern __m256d _mm256_erf_pd(__m256d);
extern __m128 _mm_erfc_ps(__m128);
extern __m128d _mm_erfc_pd(__m128d);
extern __m256 _mm256_erfc_ps(__m256);
extern __m256d _mm256_erfc_pd(__m256d);
extern __m128 _mm_erfcinv_ps(__m128);
extern __m128d _mm_erfcinv_pd(__m128d);
extern __m256 _mm256_erfcinv_ps(__m256);
extern __m256d _mm256_erfcinv_pd(__m256d);
extern __m128 _mm_erfinv_ps(__m128);
extern __m128d _mm_erfinv_pd(__m128d);
extern __m256 _mm256_erfinv_ps(__m256);
extern __m256d _mm256_erfinv_pd(__m256d);
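As a rough usage sketch (assuming a compiler that actually ships these SVML declarations, i.e. ICC or a recent MSVC), an np.sin-style loop over the declarations above could look like this; the function name sin_f32 and the scalar tail are my own illustration:

#include <immintrin.h>
#include <cmath>
#include <cstddef>

// Hedged sketch: vectorized sin over a contiguous float array using the SVML
// _mm256_sin_ps intrinsic declared above, with a scalar std::sin tail loop.
void sin_f32(const float* in, float* out, std::size_t n)
{
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 v = _mm256_loadu_ps(in + i);
        _mm256_storeu_ps(out + i, _mm256_sin_ps(v));
    }
    for (; i < n; ++i)
        out[i] = std::sin(in[i]);
}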
Did you look through the Universal SIMD NEP and the work being done to make this not specific to x86? The idea is that abstract intrinsics will be replaced (at compile time, by macros) with platform-specific ones.
As long as that package is external to numpy and Python, and is a library that can be loaded and used universally, then I agree. The vector math routines are universal to all platforms and languages and should be decoupled from numpy. Then others can participate, for instance by contributing routines like BooleanSum for all platforms and languages: Rust might want a fast BooleanSum, or the SDS file loader, which uses a boolean mask, and so on.
This is dependent on action from NumPy. It would be good to open an issue there and link that back here.
btw, last week I took the time to show a complete compare & contrast example of the log operation written in numpy with C macro expansion vs C++ templates.
I am biased, but I think the C++ template version is more portable and easier to read and maintain. The cpp file ops_log.cpp stands on its own.
And then another routine here: open source example
Then both the Intel and Microsoft compilers have an _mm256_log_ps() intrinsic (which appears to run faster than the existing one in numpy).
gcc has __m256d _ZGVdN4v_log(__m256d x) built in (and maybe more), which appears to run slower.
Clang might have one as well, but I have not bothered to look.
That is at least five different versions of an AVX log function... which one is the best of breed?
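To make the fragmentation concrete, here is a minimal sketch (illustrative wrapper name, my own code) of what merely selecting between two of those providers looks like; the SVML intrinsic is available under ICC/MSVC, while the _ZGVdN4v_log symbol comes from glibc's libmvec (link with -lmvec):

#include <immintrin.h>

// Hedged sketch: one wrapper, avx_log_pd, over two of the AVX log providers
// mentioned above. Which branch compiles depends on the toolchain.
#if defined(__INTEL_COMPILER) || defined(_MSC_VER)
static inline __m256d avx_log_pd(__m256d x) { return _mm256_log_pd(x); }   // SVML
#else
extern "C" __m256d _ZGVdN4v_log(__m256d);                                  // glibc libmvec
static inline __m256d avx_log_pd(__m256d x) { return _ZGVdN4v_log(x); }
#endif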
Further, the numpy implementation (which is well optimized by a skilled developer who knew what they were doing) still has a missing optimization:
exponent = @isa@_get_exponent(x);
x = @isa@_get_mantissa(x);
The calls to get_exponent and get_mantissa are very similar and could be merged into one routine that returns both the exponent and the mantissa (see the sketch below).
Compare the numpy get_exponent with the templated get_exponent.
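A minimal sketch of the combined helper suggested above (float case only; the struct and function names are illustrative, and numpy's actual routines differ in details such as the normalization interval and denormal handling):

#include <immintrin.h>

// Hedged sketch: return mantissa and exponent from a single helper instead of
// two separate calls, using the AVX-512F getmant/getexp intrinsics.
struct MantExp {
    __m512 mantissa;   // mantissa scaled into [0.5, 1)
    __m512 exponent;   // floor(log2(|x|)), as a float vector
};

static inline MantExp avx512_get_mantissa_exponent(__m512 x)
{
    MantExp r;
    r.mantissa = _mm512_getmant_ps(x, _MM_MANT_NORM_p5_1, _MM_MANT_SIGN_src);
    r.exponent = _mm512_getexp_ps(x);
    return r;
}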
It is easier to spot this optimization in the templated code because it is easier to read and follow. My C++ editor can tag all the functions, showing me both template and intrinsics information. Compiler errors are easier to understand and fix. But my editor has little clue what is going on with the macro expansion C code.
Is anyone going to further optimize the log operation in numpy's src/umath/simd.inc.src? If they do, that optimization benefits only numpy rather than the rest of the world, and the developer is writing in a macro expansion style that is specific to numpy. It would be nice if numpy switched over to the templated style and to an external library that the world could collaborate on and use.
xref numpy/numpy#17698
In particular there is this breakout of the Universal SIMD functions. Maybe we could be its first users?
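For a rough idea of what using them would look like, here is a sketch written against numpy's npyv_* universal intrinsics (assuming numpy's internal "simd/simd.h" header and build tree; not code from that PR). The abstract calls are resolved at compile time to SSE/AVX/NEON/VSX equivalents:

// Hedged sketch: an elementwise add kernel against the universal intrinsics.
// npyv_add_f32 maps to a platform add (e.g. an AVX or NEON add) at build time.
#include "simd/simd.h"   // numpy's internal universal SIMD header

static void add_f32(const float *a, const float *b, float *dst, npy_intp n)
{
    npy_intp i = 0;
#if NPY_SIMD
    for (; i + npyv_nlanes_f32 <= n; i += npyv_nlanes_f32) {
        npyv_f32 va = npyv_load_f32(a + i);
        npyv_f32 vb = npyv_load_f32(b + i);
        npyv_store_f32(dst + i, npyv_add_f32(va, vb));
    }
#endif
    for (; i < n; ++i) {
        dst[i] = a[i] + b[i];
    }
}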