
Add CuPy masked array to design discussions

Open hammer opened this issue 4 years ago • 4 comments

@eric-czech mentioned on a recent developer call that we use Numba rather than CuPy to target GPUs because CuPy does not have masked array support https://github.com/cupy/cupy/issues/2225.

hammer avatar Apr 29 '21 16:04 hammer

How are masks handled today with Numba?

jakirkham avatar Apr 29 '21 16:04 jakirkham

They aren't afaik, looking at https://github.com/numba/numba/issues/1834.

We use -1 in int8 arrays as a placeholder for missing values and then roll our own NaN-aware Numba operations.
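A minimal sketch of that sentinel approach, for illustration only. The function name `sentinel_mean` and the sample data are hypothetical and not taken from sgkit; the fallback decorator is just there so the snippet also runs where Numba is not installed.

```python
import numpy as np

try:
    from numba import njit  # compiles the loop when Numba is available
except ImportError:
    def njit(func):
        # Fallback so the sketch still runs as plain Python.
        return func


@njit
def sentinel_mean(calls):
    """Mean of an int8 vector, skipping entries equal to the -1 sentinel."""
    total = 0
    n = 0
    for i in range(calls.shape[0]):
        v = calls[i]
        if v >= 0:  # negative values mark missing calls
            total += v
            n += 1
    if n == 0:
        return np.nan  # nothing observed, mirror np.nanmean on all-NaN input
    return total / n


calls = np.array([0, 1, -1, 1, -1, 0], dtype=np.int8)
result = sentinel_mean(calls)
```

The data stays in one byte per call, at the cost of every kernel having to know about the sentinel.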

eric-czech avatar Apr 29 '21 16:04 eric-czech

Adding a little more detail to that, our design iterated like this:

  1. Let's try to use the numpy API for everything, or at the very least try to stick to the Dask API
    • This broke down at not having nan-aware operations for small integers
      • Genetic variant calls only have 3 states (present, absent, or missing), so small integer dtypes are ideal
  2. Instead, let's rely on guvectorize(..., target=[cpu|gpu]) as a means to write hardware-agnostic code with support for missing values
    • This broke down first, as @aktech found, at broadcasting not being supported: https://github.com/numba/numba/issues/6421
  3. Let's write separate numba GPU kernels and CPU functions (where we are today)

Masked array support in CuPy might suffice, but we had a number of issues even making that work well on a CPU w/ Dask iirc. Nullable int support (or maybe even float8?) would be better.
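For reference, this is roughly what the masked array route looks like on the CPU with plain NumPy. It is only a sketch with made-up data, assuming -1 marks a missing call; it does not show the Dask or GPU side, where the issues mentioned above came up.

```python
import numpy as np

# Two samples, three variants each; -1 is assumed to mean "missing".
calls = np.array([[0, 1, -1],
                  [1, -1, -1]], dtype=np.int8)

# Wrap the int8 data in a masked array that hides the sentinel entries.
masked = np.ma.masked_equal(calls, -1)

n_called = masked.count(axis=1)                 # non-missing calls per row
row_means = masked.mean(axis=1).filled(np.nan)  # mean over non-missing calls
```

Note that the mask is stored as a separate boolean array alongside the data, so memory use is about double that of the sentinel encoding.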

eric-czech avatar Apr 29 '21 16:04 eric-czech

Yeah, NumPy supports masked arrays, as does Dask, though of course CuPy does not ( https://github.com/cupy/cupy/issues/2225 ). Implementing the structure of a masked array is not hard. The hard part is making all of the different kernels mask-aware. As masks were a bit of an afterthought on the array side of the ecosystem, there's probably still some performance left on the table (using packed bits, etc.). That said, ufuncs, which are supported in NumPy, CuPy, and Dask, do support `where`, which would allow computing on portions of arrays. So there may be enough things implemented in terms of ufuncs to avoid having to implement the essential kernels ourselves.
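To make the `where` idea concrete, here is a small NumPy-only sketch. The data and variable names are made up, and -1 is assumed to be the missing-value sentinel; whether the same calls behave identically in CuPy or Dask is not checked here.

```python
import numpy as np

calls = np.array([[0, 1, -1],
                  [1, -1, 1]], dtype=np.int8)
valid = calls >= 0  # boolean mask of non-missing entries

# Reduction restricted to valid entries: skipped positions contribute nothing.
alt_count = np.add.reduce(calls, axis=1, where=valid, dtype=np.int64)

# Elementwise ufunc restricted to valid entries: positions where the mask is
# False keep whatever `out` already holds, so pre-fill it with the sentinel.
doubled = np.full(calls.shape, -1, dtype=np.int8)
np.multiply(calls, 2, out=doubled, where=valid)
```

The mask here is an ordinary boolean array passed per call, so nothing in the array type itself has to be mask-aware.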

The DataFrame story is a bit cleaner as masked data in that space is more common. So Pandas, cuDF, and Dask support this. IDK whether that would make sense for this use case.
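As a point of comparison, a sketch of what nullable small integers look like in pandas, using its `Int8` extension dtype. The data is made up, and this says nothing about whether the approach would fit sgkit's array-based workloads.

```python
import pandas as pd

# "Int8" (capital I) is the nullable extension dtype; None becomes <NA>.
calls = pd.Series([0, 1, None, 1], dtype="Int8")

n_missing = calls.isna().sum()    # number of missing calls
alt_count = calls.sum()           # missing values are skipped by default
call_rate = calls.notna().mean()  # fraction of calls that are present
```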

jakirkham avatar Apr 29 '21 16:04 jakirkham