"is mask all on" paradigm to be added to stdlib

Open dbabokin opened this issue 3 years ago • 4 comments

When fine-tuning performance, it can be beneficial to have a separate code path for the case where the mask is "all on". Right now there are multiple ways of expressing this paradigm, none of which is guaranteed to be optimal on all platforms. Possible ways of checking this are:

  1. __all(__mask). Seems to produce good code on all platforms, but uses builtins (which are not recommended for use outside of the standard library).
  2. ((1<<TARGET_WIDTH)-1 ^ lanemask()) == 0. Produces the same good code, but is not an obvious construct.
  3. popcnt(lanemask()) == programCount. Does not produce optimal code.

I suggest adding a standard library function that implements the first option:

inline uniform bool is_mask_all_on() {
    return __all(__mask);
}
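
To illustrate why the second and third checks in the list describe the same condition, here is a minimal sketch in plain C (not ISPC) that models the lane mask as an integer with one bit per program instance. The `width` parameter stands in for TARGET_WIDTH / programCount, and all function names are hypothetical. It only shows logical equivalence; it says nothing about the code ISPC generates for each form.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Plain-C model, not ISPC: bit i of `mask` is set when program instance i
   is active. `width` stands in for TARGET_WIDTH / programCount (1..63). */

/* Check 2 from the list: XOR against a mask with all `width` bits set. */
static bool all_on_xor(uint64_t mask, unsigned width) {
    uint64_t full = (UINT64_C(1) << width) - 1;
    return (full ^ mask) == 0;
}

/* Check 3 from the list: count active lanes and compare with the width. */
static bool all_on_popcount(uint64_t mask, unsigned width) {
    unsigned count = 0;
    for (uint64_t m = mask; m != 0; m &= m - 1) { /* clears the lowest set bit */
        count++;
    }
    return count == width;
}

/* Walk every possible mask of `width` lanes (keep `width` small), assert that
   both checks agree, and return how many masks were reported as "all on". */
static unsigned count_all_on_masks(unsigned width) {
    unsigned hits = 0;
    for (uint64_t mask = 0; mask < (UINT64_C(1) << width); mask++) {
        assert(all_on_xor(mask, width) == all_on_popcount(mask, width));
        if (all_on_xor(mask, width)) {
            hits++;
        }
    }
    return hits;
}
```

For any width, exactly one mask (all bits set) should pass, which is the case a dedicated "all on" code path would target.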

dbabokin avatar Apr 06 '22 01:04 dbabokin

On the topic of operations on the execution mask, have you considered supporting operations such as flipping the bits of the execution mask and getting the first active lane?

I find myself writing code like this pretty often:

if (programIndex == 0) {
    // Some code that should only run for one (active) program instance, doesn't matter which
}

which isn't ideal if the 0'th lane happens to be inactive when this is called.

Is there any way to directly manipulate the mask in general? Is __mask writeable?
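
As a rough illustration of what "first active lane" would mean, here is a plain-C model (not ISPC) that treats the mask as one bit per program instance. The name `first_active_lane` is hypothetical and is not an existing ISPC function.

```c
#include <stdint.h>

/* Plain-C model, not ISPC: return the index of the lowest active lane in
   `mask` (bit i set means lane i is active), or -1 if no lane is active. */
static int first_active_lane(uint64_t mask) {
    for (int lane = 0; lane < 64; lane++) {
        if ((mask >> lane) & 1) {
            return lane;
        }
    }
    return -1;
}
```

With a mask of 0x6 (lanes 1 and 2 active), a `programIndex == 0` test would match no active lane, while this helper picks lane 1.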

pema99 avatar Apr 21 '22 19:04 pema99

Thanks for bringing this up.

For single-lane execution, it depends on the kind of code you run, but in general you should not use the (programIndex == 0) paradigm. Instead, perform the computation on uniform types: that guarantees it is executed as scalar code whenever execution reaches it with any lane of the mask on. What you are doing makes sense for languages that have no concept of uniform (e.g. DPC++, CUDA, OpenCL).

For flipping bits, I haven't considered this operation for several reasons:

  1. I don't see a clear use case. If you have one, it would be good to see it.
  2. It's not clear which bits should be flipped: all bits, or only those turned off during the function execution.
  3. It would be very difficult or impossible to implement on some GPUs (if going forward we support more than Intel GPUs, or use the hardware mask on Intel GPUs).

dbabokin avatar Apr 21 '22 19:04 dbabokin

I see my first suggestion is nonsense since external functions are only executed once per call regardless, which was my primary use case. That was a misunderstanding on my part, apologies.

As for the second operation, my primary use case has been to work around issues caused by values on inactive lanes. I've experienced segfaults when accessing memory on only some lanes, when the inactive lanes have invalid addresses. Also, SIGFPEs in divisions when inactive lanes have 0 as their divisor. Writing this out, I realize these are probably both bugs that should be addressed as such... Didn't think much of them at the time. To be clear, I had something like this in mind:

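flip_mask();                      // hypothetical: switch to the currently inactive lanes
address = /* something valid */;  // give those lanes a safe address
flip_mask();                      // hypothetical: switch back to the original lanes
val = arr[address];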
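
The idea above amounts to giving inactive lanes a harmless value before an operation that is carried out on every lane. A minimal plain-C model of that for the division case (not ISPC; `LANES` and `masked_div` are hypothetical names, and 8 lanes is an arbitrary choice):

```c
#include <assert.h>
#include <stdint.h>

#define LANES 8

/* Plain-C model, not ISPC: divide num[i] by den[i] on active lanes only.
   Inactive lanes get their divisor replaced by 1, so the full-width
   division cannot trap on them, and their slots in `out` are left untouched.
   A zero divisor on an ACTIVE lane is still the caller's problem. */
static void masked_div(const int32_t num[LANES], const int32_t den[LANES],
                       uint32_t mask, int32_t out[LANES]) {
    assert(num && den && out);
    for (int lane = 0; lane < LANES; lane++) {
        int active = (mask >> lane) & 1;
        int32_t safe_den = active ? den[lane] : 1;
        int32_t q = num[lane] / safe_den; /* computed on all lanes, like a vector op */
        if (active) {
            out[lane] = q;
        }
    }
}
```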

pema99 avatar Apr 21 '22 20:04 pema99

I don't know if it's what you're looking for, but you can manipulate the mask:

unmasked inline void masked_store(uniform float * uniform addr, unsigned int32 mask, float src) {
#if 0
    @llvm.x86.avx.maskstore.ps.256((uniform int8* uniform)&addr[0], mask, src);
#else
    unsigned int32 oldmask = __mask;
    *((varying unsigned int32*)&__mask) = mask;
    addr[programIndex] = src;
    *((varying unsigned int32*)&__mask) = oldmask;
#endif
}

*((varying unsigned int32*)&__mask) = sign_extend(condition); // sets if the lane is active according to varying condition

alex-mat avatar May 03 '22 11:05 alex-mat