sgkit icon indicating copy to clipboard operation
sgkit copied to clipboard

Windowing functions should be lazy

Open tomwhite opened this issue 5 years ago • 2 comments

See https://github.com/pystatgen/sgkit/pull/303#discussion_r507906940

tomwhite avatar Oct 20 '20 13:10 tomwhite

Currently, window variables (window_{contig,start,stop}) are numpy arrays, which as Eric points out in the comment do not scale well to 100M variants. I think there are two things to do:

  1. Make the windowing functions construct Dask arrays for the window variables so that they can be computed more efficiently.
  2. Make window computations (window_statistic and ld_matrix) use Dask arrays.

For 1. we need a Dask implementation of np.searchsorted, both for finding contig boundaries, and for finding windows by genomic position. See https://github.com/dask/dask/pull/7696

2 is more work, since we can't materialize the window arrays as we do now. Instead, the strategy is probably: rechunk the window arrays to match the target variable chunks (e.g. genotypes for computing some of the popgen stats), then use map_overlap to process the target variable and the window definitions chunk by chunk. We'll also need a Dask delayed version of this in cases where we can't use map_overlap, such as when using window_by_position where the overlap depth is not bounded. This is similar to the current implementation in ld_matrix, which has the additional complication of returning a Dask dataframe, rather than a Dask array like map_overlap does.

So it might be worth having a couple of utility functions, map_windows and a more flexible Dask delayed version.

tomwhite avatar May 25 '21 12:05 tomwhite

It's not obvious to me that there really is a scaling issue here - have we done some experiments to see what sort of time these windowing functions take? Sounds like the dask versions are doing to be fairly tricky so might be worth doing a few quick checks before embarking on it?

jeromekelleher avatar Jun 02 '21 18:06 jeromekelleher

Closing this for the time being, can re-open if we see scalability issues.

tomwhite avatar Jan 04 '23 16:01 tomwhite