sgkit
Add CuPy masked array to design discussions
@eric-czech mentioned on a recent developer call that we use Numba rather than CuPy to target GPUs because CuPy does not have masked array support (https://github.com/cupy/cupy/issues/2225).
How are masks handled today with Numba?
They aren't afaik, looking at https://github.com/numba/numba/issues/1834.
We use -1 in int8 arrays as a placeholder for missing values and then roll our own nan-aware Numba operations.
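To illustrate the sentinel convention described above, here is a minimal sketch. It uses plain NumPy rather than Numba to keep it dependency-light, and the names `MISSING` and `mean_ignoring_missing` are hypothetical, not sgkit functions.

```python
import numpy as np

MISSING = -1  # assumed sentinel for a missing call, per the comment above


def mean_ignoring_missing(calls: np.ndarray, axis: int = -1) -> np.ndarray:
    """Mean over non-missing entries along `axis`; NaN where every entry is missing."""
    valid = calls != MISSING
    # Zero out the sentinel so it does not contribute to the sum,
    # and accumulate in int64 to avoid int8 overflow.
    total = np.where(valid, calls, 0).sum(axis=axis, dtype=np.int64)
    count = valid.sum(axis=axis)
    out = np.full(total.shape, np.nan)
    # Only divide where at least one value is present.
    np.divide(total, count, out=out, where=count > 0)
    return out


calls = np.array([[0, 1, -1], [-1, -1, -1], [1, 1, 0]], dtype=np.int8)
means = mean_ignoring_missing(calls)
```

The point is only that every reduction has to know about the sentinel explicitly; nothing in the dtype itself marks -1 as missing.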
Adding a little more detail to that, our design iterated like this:
- Let's try to use the numpy API for everything, or at the very least try to stick to the Dask API
  - This broke down at not having nan-aware operations for small integers
  - Genetic variant calls only have 3 states (present, absent, or missing), so small integer dtypes are ideal
- Instead, let's rely on `guvectorize(..., target=[cpu|gpu])` as a means to write hardware-agnostic code with support for missing values
  - This broke down first, as @aktech found, at broadcasting not being supported: https://github.com/numba/numba/issues/6421
- Let's write separate numba GPU kernels and CPU functions (where we are today)
Masked array support in CuPy might suffice, but we had a number of issues even making that work well on a CPU w/ Dask iirc. Nullable int support (or maybe even float8?) would be better.
Yeah, NumPy supports masked arrays, as does Dask, though of course CuPy does not ( https://github.com/cupy/cupy/issues/2225 ). Implementing the structure of a masked array is not hard. The hard part comes when making all of the different kernels mask-aware. As masks were a bit of an afterthought on the array side of the ecosystem, there's probably still some performance left on the table (using packed bits, etc.). That said, ufuncs, which are supported in NumPy, CuPy, and Dask, do support `where`, which would allow computing on portions of arrays. So there may be enough things implemented in terms of ufuncs to avoid needing to implement essential kernels.
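As a rough sketch of the ufunc `where` idea, shown with NumPy only (whether the same keyword behaves identically in CuPy or Dask for a given operation is something to check, not assumed here):

```python
import numpy as np

# int8 calls with -1 as the missing placeholder (an assumption carried over
# from the sentinel convention discussed earlier in the thread).
calls = np.array([[0, 1, -1], [1, -1, 1]], dtype=np.int8)
valid = calls >= 0

# Reduction restricted to valid entries; masked-out positions are skipped.
alt_counts = np.add.reduce(calls, axis=1, dtype=np.int64, where=valid)

# Elementwise ufunc: positions where `valid` is False keep the value
# already present in `out`, so the sentinel is preserved.
doubled = np.full(calls.shape, -1, dtype=np.int8)
np.multiply(calls, 2, out=doubled, where=valid)
```

Here the boolean `valid` array plays the role of the mask, so no dedicated masked-array type is needed for operations that are expressible as ufuncs.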
The DataFrame story is a bit cleaner as masked data in that space is more common. So Pandas, cuDF, and Dask support this. IDK whether that would make sense for this use case.
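For comparison, a small sketch of what nullable integers look like on the DataFrame side, using the pandas `Int8` extension dtype (the variable names are illustrative only):

```python
import pandas as pd

# Nullable 8-bit integers: missing calls are stored as <NA>, not as a sentinel.
calls = pd.Series([0, 1, None, 1], dtype="Int8")

n_missing = int(calls.isna().sum())
# Reductions skip missing values by default (skipna=True).
alt_count = int(calls.sum())
mean_call = float(calls.mean())
```

The missing-value handling comes from the dtype itself here, which is the part the array libraries currently lack for small integers.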