sgkit
Diversity calculations on windowed datasets generate lots of unmanaged memory
When working on a reasonably large dataset (7TiB Zarr store), I noticed that diversity calculations, in particular windowed ones, generate lots of unmanaged memory. The call_genotype data portion is 280GiB stored, 7.2TiB actual size. There are some 3 billion sites and 1000 samples. At the end of a run, there is almost 300GiB of unmanaged memory, which rules out the use of 256GB (and possibly 512GB) memory nodes (I've been testing with a Dask LocalCluster). Maybe this is more an issue with Dask, but I thought I'd post it in case there is something that can be done to free up memory in the underlying implementation.
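For scale, a rough back-of-the-envelope check (assuming int8 genotype calls and ploidy 2, both assumptions, not stated above) shows why the uncompressed call_genotype array is in the multi-TiB range:

```python
# Back-of-the-envelope size estimate for call_genotype.
# Assumptions (for illustration only): int8 calls, ploidy 2.
variants = 3_000_000_000   # "some 3 billion sites"
samples = 1_000
ploidy = 2
bytes_per_call = 1         # int8

total_bytes = variants * samples * ploidy * bytes_per_call
tib = total_bytes / 2**40
print(f"~{tib:.1f} TiB uncompressed")  # same order of magnitude as the reported 7.2 TiB
```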
Very interesting, thanks @percyfal! Can you give more details about how you got these numbers please?
Basically by monitoring the task manager. I will be running lots of these calculations in the coming days so I'll make sure to include a screenshot. On the low-memory nodes (256GB RAM), the worker memory bars very quickly fill up with light blue (unmanaged) and orange (spillover).
Unfortunately the task graph is too large to display so I couldn't easily tell what the scheduler is holding on to (@benjeffery suggested I look at this).
I see, this is coming from Dask. Unfortunately our experience has been that it's quite hard to keep a handle on Dask's memory usage - this is one of the key motivations for cubed, which we plan to support in sgkit (#908).
@tomwhite - are the popgen calculations something that will run on Cubed currently?
Not yet unfortunately, since the windowing introduces variable-sized chunks, which Cubed can't currently handle.
Having said that, I'd like to look at getting popgen methods working on Cubed in the new year. I'm wondering if the xarray groupby work (using flox) would work here, since that is something that Cubed does support (see https://github.com/cubed-dev/cubed/pull/476).
@percyfal what windowing method are you using? It would be useful to have an example to target if it's possible to share the data (or even just the windows) easily.
I'm currently using window_by_position, since fixed-size windows are commonly used to show genome-wide distributions of various statistics; I'm targeting 50k and 100k windows. I'm running the analyses as we speak, and ramping up the node memory did resolve the issue. I could certainly share the windows, though - which data arrays in particular are you thinking of here?
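To make the target concrete: with fixed-size windows, each variant falls into the window `(position - 1) // size`. This is a conceptual sketch (assuming 1-based positions on a single contig), not sgkit's actual implementation:

```python
# Conceptual sketch of fixed-size windowing by position (single contig,
# 1-based positions). Not the sgkit implementation.
def window_index(positions, size):
    """Map 1-based variant positions to 0-based fixed-size window indices."""
    return [(pos - 1) // size for pos in positions]

positions = [1, 49_999, 50_000, 50_001, 150_000]
print(window_index(positions, 50_000))  # → [0, 0, 0, 1, 2]
```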
It would be useful to have the precise call to window_by_position (there are lots of options), and the variant_contig and variant_position data, if possible.
Ok, I'll compile the data and let you know where you can get hold of it.
I've got some updates for this.
First, here's a PR that gets diversity working under Cubed: #1291. And secondly, here's an example that computes diversity over fixed-sized windows specified by window_by_position: https://github.com/sgkit-dev/sgkit/commit/cdb9431a46b448d6527dd3bcdc7b6245ea2da186
- This uses flox, which provides powerful groupby operations for Xarray. Flox works with Cubed, so this approach will work with Dask and Cubed.
- The main limitation with groupby for windowing is that it won't work for overlapping windows (since each element can only be in one group). For the example here it's fine since the windows don't overlap, but there may be some genomics cases where it is a problem. (A possible workaround would be to do two groupby operations each of which is non-overlapping.)
- The example actually computes per-variant diversity and then performs the group aggregate operation (sum). Perhaps this is better in general: separate the two operations rather than make the `diversity` function (and other popgen functions) aware of windowing.
- In terms of running this on a real dataset, I would suggest using a large multicore machine with Cubed's threads (the default) or processes executor. You also need to tell Xarray (and hence sgkit) to use Cubed as its chunk manager as follows (before running any other code):
```python
import xarray as xr
# set xarray to use cubed by default
xr.set_options(chunk_manager="cubed")
```
(This is how it's done for the tests.)
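To illustrate the "separate the two operations" idea above, here is a minimal plain-Python sketch of summing per-variant values into non-overlapping fixed windows (flox/Xarray performs the equivalent groupby-sum over chunked arrays; the values and positions here are hypothetical):

```python
# Sketch: aggregate per-variant statistics into non-overlapping fixed-size
# windows via a groupby-sum keyed on the window index. Plain Python for
# clarity; flox does the same reduction over chunked arrays.
from collections import defaultdict

def windowed_sum(per_variant_values, positions, size):
    """Sum per-variant values within 0-based windows of the given size."""
    totals = defaultdict(float)
    for value, pos in zip(per_variant_values, positions):
        totals[(pos - 1) // size] += value
    return dict(totals)

# Hypothetical per-variant diversity values at three positions:
print(windowed_sum([0.25, 0.25, 0.5], [10, 60_000, 70_000], 50_000))
# → {0: 0.25, 1: 0.75}
```

Since each element belongs to exactly one group, this also makes the non-overlapping limitation mentioned above easy to see: overlapping windows would need each variant counted in more than one group, hence the suggested workaround of two separate non-overlapping groupby passes.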
Thanks @tomwhite, I'll have a look shortly.