sgkit icon indicating copy to clipboard operation
sgkit copied to clipboard

Add 'matching' method option to identity_by_state

Open timothymillar opened this issue 1 year ago • 3 comments

  • [x] Fixes #1227
  • [x] Tests added
  • [x] User visible changes (including notable bug fixes) are documented in changelog.rst

timothymillar avatar Jun 19 '24 09:06 timothymillar

Requiring a single chunk in the samples dimension is a bit restrictive - perhaps we could add a note on how that could be generalised?

jeromekelleher avatar Jun 24 '24 13:06 jeromekelleher

Thanks @jeromekelleher, I can have a look into chunking in the samples dimension before I merge.

timothymillar avatar Jun 24 '24 20:06 timothymillar

Do you want to have another look at this @jeromekelleher? It now supports chunking of the samples dimension with quite a bit more code. I would have preferred to avoid having two gufuncs but doing so gives an almost 2x speed up when there is no chunking in the samples dimension (most common case).

Some rough benchmarks show how the 'matching' approach scales much better with the number of alleles. Memory ussage is also much better. The 'frequencies' method is roughly an order of magnitude faster in the biallelic case:

matching_perf

(Other params where: n_variant=10_000, n_sample=100, n_ploidy=4, missing_pct=0.01 and chunks of 5000 variants * 50 samples).

timothymillar avatar Jul 01 '24 08:07 timothymillar