sgkit
Diversity calculations on windowed datasets generate lots of unmanaged memory
When working on a reasonably large dataset (7TiB Zarr store), I noticed that diversity calculations, in particular windowed ones, generate lots of unmanaged memory. The call_genotype data portion is 280GiB stored, 7.2TiB actual size. There are some 3 billion sites and 1000 samples. At the end of a run, there is almost 300GiB of unmanaged memory, which rules out the use of 256GB (and possibly 512GB) memory nodes (I've been testing with a Dask LocalCluster). Maybe this is more an issue with Dask, but I thought I'd post it in case there is something that can be done to free up memory in the underlying implementation.
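For scale, a rough back-of-the-envelope check (assuming int8 genotype calls and ploidy 2, both assumptions, not stated above) shows why the uncompressed call_genotype array is in the multi-TiB range:

```python
# Back-of-the-envelope size estimate for call_genotype.
# Assumptions (for illustration only): int8 calls, ploidy 2.
variants = 3_000_000_000   # "some 3 billion sites"
samples = 1_000
ploidy = 2
bytes_per_call = 1         # int8

total_bytes = variants * samples * ploidy * bytes_per_call
tib = total_bytes / 2**40
print(f"~{tib:.1f} TiB uncompressed")  # same order of magnitude as the reported 7.2 TiB
```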
Very interesting, thanks @percyfal! Can you give more details about how you got these numbers please?
Basically by monitoring the task manager. I will be running lots of these calculations in the coming days so I'll make sure to include a screenshot. On the low-memory nodes (256GB RAM), the worker memory bars very quickly fill up with light blue (unmanaged) and orange (spillover).
Unfortunately the task graph is too large to display so I couldn't easily tell what the scheduler is holding on to (@benjeffery suggested I look at this).
I see, this is coming from Dask. Unfortunately our experience has been that it's quite hard to keep a handle on Dask's memory usage - this is one of the key motivations for cubed, which we plan to support in sgkit (#908).
@tomwhite - are the popgen calculations something that will run on Cubed currently?
Not yet unfortunately, since the windowing introduces variable-sized chunks, which Cubed can't currently handle.
Having said that, I'd like to look at getting popgen methods working on Cubed in the new year. I'm wondering if the xarray groupby work (using flox) would work here, since that is something that Cubed does support (see https://github.com/cubed-dev/cubed/pull/476).
@percyfal what windowing method are you using? It would be useful to have an example to target if it's possible to share the data (or even just the windows) easily.
I'm currently using window_by_position, since fixed-size windows are commonly used to show genome-wide distributions of various statistics; I'm targeting 50k and 100k windows. I'm running the analyses as we speak, and ramping up the node memory did resolve the issue. I could certainly share the windows, though - which data arrays in particular are you thinking of here?
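To make the target concrete: with fixed-size windows, each variant falls into the window `(position - 1) // size`. This is a conceptual sketch (assuming 1-based positions on a single contig), not sgkit's actual implementation:

```python
# Conceptual sketch of fixed-size windowing by position (single contig,
# 1-based positions). Not the sgkit implementation.
def window_index(positions, size):
    """Map 1-based variant positions to 0-based fixed-size window indices."""
    return [(pos - 1) // size for pos in positions]

positions = [1, 49_999, 50_000, 50_001, 150_000]
print(window_index(positions, 50_000))  # → [0, 0, 0, 1, 2]
```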
It would be useful to have the precise call to window_by_position (there are lots of options), and the variant_contig and variant_position data, if possible.
Ok, I'll compile the data and let you know where you can get hold of it.
I've got some updates for this.
First, here's a PR that gets diversity working under Cubed: #1291. And secondly, here's an example that computes diversity over fixed-sized windows specified by window_by_position: https://github.com/sgkit-dev/sgkit/commit/cdb9431a46b448d6527dd3bcdc7b6245ea2da186
- This uses flox, which provides powerful groupby operations for Xarray. Flox works with Cubed, so this approach will work with Dask and Cubed.
- The main limitation with groupby for windowing is that it won't work for overlapping windows (since each element can only be in one group). For the example here it's fine since the windows don't overlap, but there may be some genomics cases where it is a problem. (A possible workaround would be to do two groupby operations each of which is non-overlapping.)
- The example actually computes per-variant diversity and then performs the group aggregate operation (sum). Perhaps this is better in general: separate the two operations rather than make the `diversity` function (and other popgen functions) aware of windowing.
- In terms of running this on a real dataset, I would suggest using a large multicore machine with Cubed's threads (the default) or processes executor. You also need to tell Xarray (and hence sgkit) to use Cubed as its chunk manager as follows (before running any other code):
```python
import xarray as xr
# set xarray to use cubed by default
xr.set_options(chunk_manager="cubed")
```
(This is how it's done for the tests.)
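To illustrate the "separate the two operations" idea above, here is a minimal plain-Python sketch of summing per-variant values into non-overlapping fixed windows (flox/Xarray performs the equivalent groupby-sum over chunked arrays; the values and positions here are hypothetical):

```python
# Sketch: aggregate per-variant statistics into non-overlapping fixed-size
# windows via a groupby-sum keyed on the window index. Plain Python for
# clarity; flox does the same reduction over chunked arrays.
from collections import defaultdict

def windowed_sum(per_variant_values, positions, size):
    """Sum per-variant values within 0-based windows of the given size."""
    totals = defaultdict(float)
    for value, pos in zip(per_variant_values, positions):
        totals[(pos - 1) // size] += value
    return dict(totals)

# Hypothetical per-variant diversity values at three positions:
print(windowed_sum([0.25, 0.25, 0.5], [10, 60_000, 70_000], 50_000))
# → {0: 0.25, 1: 0.75}
```

Since each element belongs to exactly one group, this also makes the non-overlapping limitation mentioned above easy to see: overlapping windows would need each variant counted in more than one group, hence the suggested workaround of two separate non-overlapping groupby passes.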
Thanks @tomwhite, I'll have a look shortly.