It would be nice if histogram could take a mask to select active data
This would make code a lot easier to write. For example, suppose I have code like this:
# data_pointer is my data, with size count
# BLOCK_SIZE is some power-of-two working set size
base = BLOCK_SIZE * triton.language.program_id(axis=0)
# per-lane element offsets for this block
offsets = base + triton.language.arange(0, BLOCK_SIZE)
mask = offsets < count
data = triton.language.load(data_pointer + offsets, mask=mask)
counts = triton.language.histogram(data, DATA_MAX_VALUE) # oops: counts every lane, masked or not
The issue here is that histogram looks at the entire data tensor, even though only the elements that mask marks as in bounds were actually loaded. It counts all BLOCK_SIZE elements, including the masked-off ones, whose contents are garbage.

This is quite inconvenient to fix. If I can fit a sentinel value into my data, I can fill the masked-off elements with it (for example through load's other argument), and histogram will then put all of those extra elements into the sentinel's bucket. But finding such a value is difficult, and it takes extra effort regardless: I have to make my binned tensor larger than it needs to be to accommodate this "junk" bin, and then figure out a way to get rid of it.

It would be much nicer if histogram just took a mask directly, like load and store do, so that it only counted the data I care about.
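
To make the difference concrete, here is a minimal sketch of the sentinel workaround next to the behaviour I am asking for. It is a plain-Python model rather than Triton code, so it runs anywhere; the function names (`masked_load`, `sentinel_histogram`, `masked_histogram`) and the choice of sentinel are hypothetical and only illustrate the idea.

```python
def histogram(block, num_bins):
    """Model of an unmasked histogram: every lane in the block is counted."""
    counts = [0] * num_bins
    for value in block:
        counts[value] += 1
    return counts


def masked_load(data, base, block_size, other):
    """Model of a masked load: out-of-bounds lanes receive `other`."""
    return [data[i] if i < len(data) else other
            for i in range(base, base + block_size)]


def sentinel_histogram(data, base, block_size, num_bins):
    """Workaround: send masked-off lanes to an extra junk bin, then drop it."""
    junk_bin = num_bins  # hypothetical sentinel: one past the last real bin
    block = masked_load(data, base, block_size, other=junk_bin)
    counts = histogram(block, num_bins + 1)  # result needs one extra slot
    return counts[:num_bins]  # discard the junk bin


def masked_histogram(data, base, block_size, num_bins):
    """Requested behaviour: lanes outside the mask are never counted."""
    counts = [0] * num_bins
    for i in range(base, min(base + block_size, len(data))):
        counts[data[i]] += 1
    return counts


data = [0, 1, 1, 3, 2]  # count = 5, so a block of 8 has 3 masked-off lanes
print(sentinel_histogram(data, base=0, block_size=8, num_bins=4))
print(masked_histogram(data, base=0, block_size=8, num_bins=4))
```

Both functions give the same counts, but the sentinel version needs a value outside the real data range, a larger result, and a final slice. In a real kernel the extra bin is likely more awkward than it looks here, since as far as I understand Triton tensor shapes must be powers of two, so adding one junk bin can mean doubling the number of bins.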