
Out-of-core ancestor matching

Open benjeffery opened this issue 2 years ago • 0 comments

Ancestor matching on 100k-diploid sample data sets looks like it will take more than a month when the number of available cores is in the 20-30 range (a typical value on some HPCs), no matter how large we make the epochs, so we need to go to multiple machines. I consider this an orthogonal concern to exactly how we slice the epochs.

The outline for the loop that ancestor matching goes through is this (I'm skipping a lot of detail here):

tsb = empty tree sequence builder
for epoch in epochs:
    paths = [find_path(ancestor, tsb) for ancestor in epoch.ancestors]
    for path in paths:
        tsb.add_path(path)

We could use a scheme similar to the sample matching in #807; as a reminder, the loop there is:

tsb = ancestor tree sequence
paths = [find_path(sample, tsb) for sample in samples]
for path in paths:
    tsb.add_path(path)

With a workflow of:

tsinfer match-sample-slice sampledata.zarr ancestors.ts --start 0 --end M
tsinfer match-sample-slice sampledata.zarr ancestors.ts --start M --end N
....
tsinfer combine-sample-slices sampledata.zarr ancestors.ts path-data.0-M.slice path-data.M-N.slice ...
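To make the combine step above concrete, here is a minimal sketch of what a slice-combining command could do internally. Everything here is invented for illustration: the `DummyBuilder` class, `combine_slices` helper, and pickle-based slice format are assumptions, not the actual tsinfer API or file format.

```python
import os
import pickle
import tempfile

class DummyBuilder:
    # Stand-in for the real tree sequence builder; only add_path matters here.
    def __init__(self):
        self.paths = []

    def add_path(self, path):
        self.paths.append(path)

def combine_slices(tsb, slice_files):
    # Each slice file holds the precomputed paths for one contiguous range of
    # samples. Applying a path is cheap compared to finding it, so a single
    # process can do the final combine, provided slices are replayed in order.
    for fname in slice_files:
        with open(fname, "rb") as f:
            for path in pickle.load(f):
                tsb.add_path(path)

# Write two fake slice files and combine them.
tmpdir = tempfile.mkdtemp()
files = []
for i, chunk in enumerate([["p0", "p1"], ["p2"]]):
    fname = os.path.join(tmpdir, f"path-data.{i}.slice")
    with open(fname, "wb") as f:
        pickle.dump(chunk, f)
    files.append(fname)

tsb = DummyBuilder()
combine_slices(tsb, files)
```

The key property is that the expensive `find_path` calls happen in the parallel slice jobs, while the cheap, order-dependent `add_path` calls happen once, serially.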

Translating this style to a strawman for ancestor matching:

tsinfer determine-epochs ancestors.zarr -O epochs.json
Then for each epoch_index E
    tsinfer match-ancestor-slice partial-ancestors-{E-1}.ts --epoch=E --partition-count N --partition-index M (this writes path-data.epoch-{E}.part-{M}.slice, partial-ancestors-{E-1}.ts is ignored for epoch 0)
    tsinfer match-ancestor-slice partial-ancestors-{E-1}.ts --epoch=E --partition-count N --partition-index M+1 
    ... other partitions ...
    tsinfer combine-ancestor-slice partial-ancestors-{E-1}.ts -O partial-ancestors-{E}.ts path-data.epoch-{E}.part-{M}.slice path-data.epoch-{E}.part-{M+1}.slice ...

As the number of ancestors varies per epoch, a partition count and partition index are specified rather than explicit ancestor indexes (thinking about it, we should probably do this for the samples too).
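A (partition count, partition index) pair could map to an ancestor range along these lines; `partition_bounds` is a hypothetical helper sketched here, not an existing tsinfer function.

```python
def partition_bounds(n_ancestors, partition_count, partition_index):
    # Spread ancestors in an epoch as evenly as possible across partitions,
    # so callers never need to know absolute ancestor indexes. The first
    # `rem` partitions get one extra ancestor each.
    base, rem = divmod(n_ancestors, partition_count)
    start = partition_index * base + min(partition_index, rem)
    end = start + base + (1 if partition_index < rem else 0)
    return start, end
```

Because the bounds are derived from the ancestor count at run time, the same `--partition-count N --partition-index M` invocation works for every epoch regardless of how many ancestors it contains.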

Whilst I think this would work, it leaves a lot of coordination up to the user. It would probably be possible with snakemake, but it certainly would not be simple.

I am of the opinion that it might be better to bring the coordination into Python, maybe by using dask actors: https://distributed.dask.org/en/stable/actors.html. These enable stateful objects on workers that are plain Python classes whose methods can be called remotely. The user is likely to already have a dask cluster set up from working with sgkit to prepare the sample data. Each worker would hold an AncestorMatcher instance in an actor, with the coordinating process calling find_path with a different ancestor on each worker, then calling add_path for each path on all workers once the epoch was complete. This is similar to the thread management in the existing __match_ancestors_multi_threaded.
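The coordination pattern can be sketched without a dask cluster by standing in threads for actors. This is only a shape sketch: the `AncestorMatcher` class and its `find_path`/`add_path` methods here are toy placeholders, not the real tsinfer matcher, and with real dask actors the `pool.submit` calls would become actor method calls on remote workers.

```python
from concurrent.futures import ThreadPoolExecutor

class AncestorMatcher:
    # Toy stand-in for a per-worker stateful matcher held in an actor.
    def __init__(self):
        self.paths = []  # stands in for the tree sequence builder state

    def find_path(self, ancestor):
        return f"path-for-{ancestor}"

    def add_path(self, path):
        self.paths.append(path)

def run_epoch(matchers, pool, ancestors):
    # Coordinator matches each ancestor on one worker (round-robin here).
    futures = [
        pool.submit(matchers[i % len(matchers)].find_path, a)
        for i, a in enumerate(ancestors)
    ]
    paths = [f.result() for f in futures]
    # Once the epoch is complete, replay every path on all workers so their
    # builder state stays in sync before the next epoch starts.
    for m in matchers:
        for p in paths:
            m.add_path(p)
    return paths

matchers = [AncestorMatcher() for _ in range(3)]
with ThreadPoolExecutor(max_workers=3) as pool:
    paths = run_epoch(matchers, pool, ["a0", "a1", "a2", "a3"])
```

The epoch boundary acts as a barrier: no worker sees a path from the current epoch until all of that epoch's find_path calls have finished, mirroring the threaded implementation.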

Anyway, it was useful to write this out to get straight in my head what the process is here; probably best to thrash this out on a whiteboard.

benjeffery avatar Mar 31 '23 14:03 benjeffery