Decoding haplotypes without trees
I started to have a look at https://github.com/tskit-dev/tskit/issues/1896 where one of the possible solutions is decoding haplotypes in C. (The current methods all do this by iterating over sites then returning the full samples*sites matrix).
Recent work on parent_index_array in the JIT code gave me an idea, I then couldn't sleep until I had tried it. The code in this PR is NSFJK (Not Safe For Jerome) but I thought I would share it now as I think it is interesting and I have to draw a line! Iteresting bits are here
The outline is:
- Build parent edge index (using the fancy tricks in the JIT code)
- Then for every node you are interested in:
-
- Create a bitset
resolvedover sites and aresultstring of ancestral site states.
- Create a bitset
-
- Iterate through the mutations above your chosen node, setting
resultof the mutations site to the derived allele, marking these sites asresolved
- Iterate through the mutations above your chosen node, setting
-
- Iterate over the parent edges of your node, if they overlap any unresolved sites push them onto a stack
-
- While there are entries in the stack:
-
-
- Pop an edge, process mutations you see, marking in
resultandresolved. Push any parent edges that have unresolved sites on the stack
- Pop an edge, process mutations you see, marking in
-
-
resultshould now be your node's haplotype
-
- Clear the bitset and result, do the next node
Here is the perf on a tree sequence which is a simplified subset of 100k samples of chr 21 Quebecois sim:
The summary being "if you are extracting less than 1000 genotypes the haplotype way is faster" and "if you're grabbing one node it can be 100 times faster"
I have done a little perf work, but I think there are probably big wins to be had by clever caching of haplotypes of nodes higher up the tree or something like that, and not calculating the full parent edge index when you're grabbing a slice of the genome.