Decoding haplotypes without trees

Open benjeffery opened this issue 2 months ago • 3 comments

I started to have a look at https://github.com/tskit-dev/tskit/issues/1896 where one of the possible solutions is decoding haplotypes in C. (The current methods all do this by iterating over sites then returning the full samples*sites matrix).

Recent work on parent_index_array in the JIT code gave me an idea, I then couldn't sleep until I had tried it. The code in this PR is NSFJK (Not Safe For Jerome) but I thought I would share it now as I think it is interesting and I have to draw a line! Iteresting bits are here

The outline is:

Build parent edge index (using the fancy tricks in the JIT code)
Then for every node you are interested in:
- Create a bitset resolved over sites and a result string of ancestral site states.
- Iterate through the mutations above your chosen node, setting result of the mutations site to the derived allele, marking these sites as resolved
- Iterate over the parent edges of your node, if they overlap any unresolved sites push them onto a stack
- While there are entries in the stack:
- - Pop an edge, process mutations you see, marking in result and resolved. Push any parent edges that have unresolved sites on the stack
- result should now be your node's haplotype
- Clear the bitset and result, do the next node

Here is the perf on a tree sequence which is a simplified subset of 100k samples of chr 21 Quebecois sim: variant_vs_haplotype The summary being "if you are extracting less than 1000 genotypes the haplotype way is faster" and "if you're grabbing one node it can be 100 times faster"

I have done a little perf work, but I think there are probably big wins to be had by clever caching of haplotypes of nodes higher up the tree or something like that, and not calculating the full parent edge index when you're grabbing a slice of the genome.

Oct 16 '25 12:10 benjeffery