Ben Jeffery issues

Results: 116 issues of Ben Jeffery

Batching during match ancestors wastes worker CPU.

When using dask for ancestor matching we currently batch the ancestors into batches of 5000, and send these to dask serially. This lets us store batch results in the resume...

enhancement

Sample matching progress bar incorrect

When sample matching with dask, the progress bar appears to count each sample twice.

Update docs build

Ancestral allele handling

@hyanwong and I sat down and tried to think through properly how ancestral allele handling from sgkit should work. The sgkit `variant_ancestral_allele` string array needs to be converted to a...

Python 3.11 support

Excessive RAM usage in the final stages of `match_samples`

At some point after the `Splitting ultimate ancestor` log line, tsinfer tries to allocate over 128GB for large datasets. Hopefully this can be investigated locally with smaller datasets.

Parallelise mapping additional sites

Mapping these sites takes over 12 hours on large datasets, so it would be good to use multiple threads when doing this.

Add `AncestorData` diagnostic plots function.

To plan a large ancestor match one needs to see plots of:
- Group size (determines parallelism)
- Ancestor size distribution (determines needed worker RAM)
- Total ancestor length in...

Add "Inferring large datasets" documentation

Need to document all the tips and tricks for each stage of inference when working with biobank-scale data.

Split large ancestor groups up for both caching and dask scheduling.

We have very large ancestor groups towards the end of matching. As these take over a month of CPU each, it would be best to split them up. This would...