sleef Implementation of masked functions using native masked intrinsic functions

Open shibatch opened this issue 7 years ago • 1 comments

I made a pull request that implements masked functions by combining the current unmasked functions with a selection (blending) function.

https://github.com/shibatch/sleef/pull/139
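
The blending approach can be sketched in portable C as follows. This is only an illustration under stated assumptions, not the code from the PR: `vdouble` and `vopmask` here are plain structs standing in for SIMD registers and predicates, `poly_unmasked` is a hypothetical stand-in for a math function, `vsel` is a hypothetical blend helper, and passing a fallback value for the inactive lanes is an assumption.

```c
#include <stdbool.h>

#define VLEN 4  /* hypothetical lane count, for illustration only */

/* Plain structs standing in for a SIMD register and a predicate. */
typedef struct { double lane[VLEN]; } vdouble;
typedef struct { bool lane[VLEN]; } vopmask;

/* Stand-in for an existing unmasked math function: computes x*x + 1
   on every lane, whether or not the caller needs the result. */
static vdouble poly_unmasked(vdouble x) {
  vdouble r;
  for (int i = 0; i < VLEN; i++) r.lane[i] = x.lane[i] * x.lane[i] + 1.0;
  return r;
}

/* Selection (blending): take t where the mask is set, f elsewhere. */
static vdouble vsel(vopmask m, vdouble t, vdouble f) {
  vdouble r;
  for (int i = 0; i < VLEN; i++) r.lane[i] = m.lane[i] ? t.lane[i] : f.lane[i];
  return r;
}

/* Masked function built from the unmasked one plus a blend.
   The fallback argument for inactive lanes is an assumption. */
static vdouble poly_masked_blend(vdouble x, vdouble fallback, vopmask m) {
  return vsel(m, poly_unmasked(x), fallback);
}
```

In this sketch `poly_unmasked` still does the arithmetic for all lanes; the mask only decides which results are kept.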

However, there is a concern about the performance of this implementation, since the ALUs for the unused elements all remain active. This could increase the power consumption and the heat generated by the computer. It would be better to implement the masked functions so that they use native masked intrinsic functions.

My plan is to approve the above PR, and after the release of version 3.2, we will start implementing masked functions using native masked intrinsic functions in the following way.

All existing FP functions in the helper files will be converted to masked functions. For example,

vdouble vadd_vd_vd_vd_vo(vdouble x, vdouble y, vopmask m) {
  return vaddq_f64(x, y);
}

for an unmasked intrinsic function, and

vdouble vadd_vd_vd_vd_vo(vdouble x, vdouble y, vopmask m) {
  return svadd_f64_x(m, x, y);
}

for a masked intrinsic function.
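
To make the helper convention concrete, here is a minimal sketch that emulates both kinds of back end in portable C. The struct types, the `EMULATE_NATIVE_MASKING` macro, and the `vtwice_plus` caller are hypothetical names introduced for illustration; only the helper name and signature come from the text above. On real hardware the inactive lanes of a masked operation may be unspecified, so the emulation's choice to leave them equal to `x` is also an assumption.

```c
#include <stdbool.h>

#define VLEN 4  /* hypothetical lane count, for illustration only */

/* Plain structs standing in for a SIMD register and a predicate. */
typedef struct { double lane[VLEN]; } vdouble;
typedef struct { bool lane[VLEN]; } vopmask;

#ifdef EMULATE_NATIVE_MASKING
/* Back end with masked intrinsics: only active lanes are computed.
   Inactive lanes keep x here; real hardware may leave them unspecified. */
static vdouble vadd_vd_vd_vd_vo(vdouble x, vdouble y, vopmask m) {
  vdouble r = x;
  for (int i = 0; i < VLEN; i++)
    if (m.lane[i]) r.lane[i] = x.lane[i] + y.lane[i];
  return r;
}
#else
/* Back end without masked intrinsics: the mask is accepted but ignored,
   so every lane is computed. */
static vdouble vadd_vd_vd_vd_vo(vdouble x, vdouble y, vopmask m) {
  (void)m;
  vdouble r;
  for (int i = 0; i < VLEN; i++) r.lane[i] = x.lane[i] + y.lane[i];
  return r;
}
#endif

/* Hypothetical caller: computes 2*x + y by threading the same mask
   through every helper call, without knowing which back end is used. */
static vdouble vtwice_plus(vdouble x, vdouble y, vopmask m) {
  return vadd_vd_vd_vd_vo(vadd_vd_vd_vd_vo(x, x, m), y, m);
}
```

Either back end gives the same values on the active lanes, so code written against the helper does not need to change between targets.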

Then, the implementation of each math function would look like the following.

static const inline vdouble xsin(vdouble arg, vopmask mask) { ... }

EXPORT const vdouble Sleef_sindX_u35YYY(vdouble arg) {
  return xsin(arg, SLEEF_OPMASK_ALLONE);
}

EXPORT const vdouble Sleef_mask_sindX_u35YYY(vdouble arg, vopmask mask) {
  return xsin(arg, mask);
}

The mask argument is assumed to be optimized away if it is not used.

Jan 07 '18 05:01 shibatch

static const inline vdouble xfdim_base(vdouble arg1, vdouble arg2, vopmask mask) { ... }

EXPORT const vdouble xfdim(vdouble arg1, vdouble arg2) {
  return xfdim_base(arg1, arg2, SLEEF_OPMASK_ALLONE);
}

EXPORT const vdouble xfdim_mask(vdouble arg1, vdouble arg2, vopmask mask) {
  return xfdim_base(arg1, arg2, mask);
}

Add the masked intrinsics that are needed in the AVX512F header.
Use tester for the unmasked version and tester3 for the masked one, to do bit-for-bit testing of the masked version against the unmasked one.
After doing one function (fdim), we go wide to the rest, not for the whole library at once but for groups of functions (say, first all the vfloat(vfloat) signatures, then the vfloat(vfloat, vfloat) ones, and so on).
When we are done with all the functions, we remove the unmasked intrinsics from the helper files.
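
The bit-for-bit comparison in the steps above could look roughly like the following sketch. It is not the actual tester or tester3 code: the struct types, `OPMASK_ALLONE`, the function names, and `count_bit_mismatches` are hypothetical, and the fdim body is a scalar emulation written for this example. Only the active lanes are compared, on the assumption that inactive lanes of the masked version carry no defined value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define VLEN 4  /* hypothetical lane count, for illustration only */

/* Plain structs standing in for a SIMD register and a predicate. */
typedef struct { double lane[VLEN]; } vdouble;
typedef struct { bool lane[VLEN]; } vopmask;

static const vopmask OPMASK_ALLONE = {{true, true, true, true}};

/* Emulated fdim: x - y, clamped to 0 when the difference is negative
   or the inputs are equal. NaN inputs propagate. Only active lanes are
   computed; inactive lanes are left as 0 in this emulation. */
static vdouble fdim_base(vdouble a, vdouble b, vopmask m) {
  vdouble r = {{0}};
  for (int i = 0; i < VLEN; i++) {
    if (!m.lane[i]) continue;
    double d = a.lane[i] - b.lane[i];
    if (d < 0 || a.lane[i] == b.lane[i]) d = 0.0;
    r.lane[i] = d;
  }
  return r;
}

static vdouble fdim_unmasked(vdouble a, vdouble b) {
  return fdim_base(a, b, OPMASK_ALLONE);
}

static vdouble fdim_masked(vdouble a, vdouble b, vopmask m) {
  return fdim_base(a, b, m);
}

/* Count the active lanes whose bit patterns differ between u and v. */
static int count_bit_mismatches(vdouble u, vdouble v, vopmask m) {
  assert(sizeof(double) == sizeof(uint64_t));
  int bad = 0;
  for (int i = 0; i < VLEN; i++) {
    if (!m.lane[i]) continue;
    uint64_t ub, vb;
    memcpy(&ub, &u.lane[i], sizeof ub);  /* compare bits, not values */
    memcpy(&vb, &v.lane[i], sizeof vb);
    if (ub != vb) bad++;
  }
  return bad;
}
```

Comparing bit patterns rather than values means that a difference in the sign of zero or in a NaN payload would also be reported.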

Mar 08 '18 06:03 fpetrogalli
