msprime Python algorithm without S avl tree

Proposal for adapting algorithms.py (see #1993) such that we can track the number of samples each segment is ancestral to without having to rely on AVL tree S.

Nov 07 '22 11:11 GertjanBisschop

@benjeffery any idea what's up with CircleCI here?

Nov 07 '22 12:11 jeromekelleher

Can we remove the bintrees dependency or are we still using it somewhere?

I think the main issues here are with the verification code - how much have you run this? The tests run it a little bit but it's good to run the algorithm for a longish time to really shake out weird problems.

Nov 07 '22 12:11 jeromekelleher

@benjeffery any idea what's up with CircleCI here?

Try pushing again, seen some discussion that this might be transitory.

Nov 07 '22 12:11 benjeffery

Can we remove the bintrees dependency or are we still using it somewhere?

I think the main issues here are with the verification code - how much have you run this? The tests run it a little bit but it's good to run the algorithm for a longish time to really shake out weird problems.

Don't think I have run it enough then. Will do that.

Nov 07 '22 13:11 GertjanBisschop

Can we remove the bintrees dependency or are we still using it somewhere?

I think the main issues here are with the verification code - how much have you run this? The tests run it a little bit but it's good to run the algorithm for a longish time to really shake out weird problems.

bintrees is still used here: dtwf_generation()

Nov 07 '22 13:11 GertjanBisschop

First benchmarks are not looking good:

dev branch

Human: length: 75000000.0, sample size:1000 Function _run() {} Took 45.8196 seconds Human: length: 75000000.0, sample size:10000 Function _run() {} Took 78.5620 seconds Drosophila: length: 10000000, sample size:100 Function _run() {} Took 61.1237 seconds

main branch

Human: length: 75000000.0, sample size:1000 Function _run() {} Took 15.0323 seconds Human: length: 75000000.0, sample size:10000 Function _run() {} Took 17.4346 seconds Drosophila: length: 10000000, sample size:100 Function _run() {} Took 23.6431 seconds

This is for a single run for each of the parameter settings. The performance penalty seems to be mostly due to the fact that when keeping track of ancestral_to per segment, being able to refactor the segment chain becomes quite a rare event. This is because we cannot merge two adjoining segments that are ancestral to a different number of samples. This leads to a clear increase in the total and mean number of segments (see graph). So instead of updating the avl tree. We seem to be spending our time looping over a larger number of segments.

Dev - branch: human like parameters, averaged across 50 reps.

human_dev

Main - branch: human_main

Nov 23 '22 11:11 GertjanBisschop

Dammit, didn't see that coming :frowning_face:

I'm surprised that there's so many adjacent segments with identical nodes but different ancestral_to values, I thought that would be rare...

Nov 23 '22 12:11 jeromekelleher

Can you push the changes to the C code please @GertjanBisschop - I'd like to have a look at the segment squashing logic

Nov 23 '22 12:11 jeromekelleher

Hmm, maybe we're not defragging often enough. Can you set it up so that we always defrag the segment chain after msp_merge_two_segments (probably just set defrag_required=true) and look at those numbers of segment plots again?

Nov 23 '22 15:11 jeromekelleher

This does not have any impact: human_main_defrag

Nov 23 '22 15:11 GertjanBisschop

This is a real head scratcher... The extra segments all seem to be late in the simulation, when they should be pretty widely separated within the lineage (I would have thought!)

Can you paste in the top of the perf report (from a full run) for the Drosophila example please?

Nov 23 '22 16:11 jeromekelleher

Sorry @jeromekelleher, forgot to mention that the time (x-axis) is in generations, and not on the coalescent time scale. So this is the very beginning of the coalescent process.

Nov 23 '22 22:11 GertjanBisschop

Perf report for the Drosophila example:

35.37% python3 _msprime.cpython-38-x86_64-linux-gnu.so 8.90% python3 _msprime.cpython-38-x86_64-linux-gnu.so 7.67% python3 _msprime.cpython-38-x86_64-linux-gnu.so 6.68% python3 _msprime.cpython-38-x86_64-linux-gnu.so 6.32% python3 _msprime.cpython-38-x86_64-linux-gnu.so 5.31% python3 _msprime.cpython-38-x86_64-linux-gnu.so 4.66% python3 _msprime.cpython-38-x86_64-linux-gnu.so 4.35% python3 _msprime.cpython-38-x86_64-linux-gnu.so 2.90% python3 _msprime.cpython-38-x86_64-linux-gnu.so 2.57% python3 _msprime.cpython-38-x86_64-linux-gnu.so 2.06% python3 _msprime.cpython-38-x86_64-linux-gnu.so 1.62% python3 libm-2.31.so 1.42% python3 _msprime.cpython-38-x86_64-linux-gnu.so 0.82% python3 _msprime.cpython-38-x86_64-linux-gnu.so 0.77% python3 _msprime.cpython-38-x86_64-linux-gnu.so 0.63% python3 _msprime.cpython-38-x86_64-linux-gnu.so 0.61% python3 _msprime.cpython-38-x86_64-linux-gnu.so 0.45% python3 libgsl.so.23.1.0 0.39% python3 libc-2.31.so 0.36% python3 _msprime.cpython-38-x86_64-linux-gnu.so 0.28% python3 _msprime.cpython-38-x86_64-linux-gnu.so 0.25% python3 libgsl.so.23.1.0 [.] msp_merge_two_ancestors [.] rate_map_position_to_mass [.] avl_rebalance [.] fenwick_set_value [.] msp_defrag_segment_chain [.] avl_at [.] fenwick_find [.] cmp_individual [.] avl_search_closest [.] msp_set_segment_mass [.] fenwick_increment [.] __log1p [.] rate_map_mass_between [.] msp_run [.] msp_choose_uniform_breakpoint [.] msp_get_segment.isra.0 [.] avl_unlink_node [.] gsl_rng_uniform_int [.] msort_with_tmp.part.0 [.] cmp_node_mapping [.] fenwick_get_cumulative_sum [.] gsl_ran_exponential

Nov 24 '22 10:11 GertjanBisschop

Yes, that looks about right for your hypothesis. I'm still surprised there are so many adjacent, squashable segments with different ancestral_to values - how can this happen?

Nov 24 '22 10:11 jeromekelleher

For every pair of overlapping segments with non-corresponding starts and ends I think you create three squashable segments in the new merged lineage. One section will correspond to a coalescence event (ancestral_to = sum of both) and the two other bordering segments have the ancestral_to value of one of the original segments.

Nov 24 '22 10:11 GertjanBisschop

Right, I guess that's it.

At times I've wondered if actually keeping track of these values is worth it at all. In principle, you could just look through all segments periodically and when you see an interval that contains only one segment, throw it away. There would a tradeoff, of course, between the frequency of doing this vs generated unnecessary recombination events. Do you think this would be worth prototyping?

Nov 24 '22 13:11 jeromekelleher

For every pair of overlapping segments with non-corresponding starts and ends I think you create three squashable segments in the new merged lineage. One section will correspond to a coalescence event (ancestral_to = sum of both) and the two other bordering segments have the ancestral_to value of one of the original segments.

Gee, this sounds related to the 'extend edges' thing again! (although maybe only tangentially)

Nov 24 '22 17:11 petrelharp

Gee, this sounds related to the 'extend edges' thing again! (although maybe only tangentially)

Hah, yes! Really great point Peter.

This is definitely worth exploring, but we should look at it in the algorithms.py before diving into C. What we want to do is do merge_ancestors in two passes: first check if there's any coalescence, and if so we map both lineages to the new node at all positions along the genome whether they coalesce or not. This is something we'll want to support in msprime anyway, so good to explore it in this context in case it fixes these performance issues.

Nov 24 '22 20:11 jeromekelleher

Cool. These issues are related indeed. Just thinking. When mapping both lineages to the new node, wouldn't we still have to merge all segments of both original lineages? So we would still be iterating over the extra segments we create by using ancestral_to.

Nov 25 '22 15:11 GertjanBisschop

So we would still be iterating over the extra segments we create by using ancestral_to.

Yes, you're right. So maybe this idea of "keeping more of the ARG" is orthogonal and we should investigate any performance benefits we might get separately. Let's talk about this offline.

A thought that just struck me then is that if we are keeping this extra ARG information then we could probably take the max of the ancestral_to values in adjacent squashable segments, and it might all work out quite nicely.

Nov 28 '22 09:11 jeromekelleher

Additional reason to switch is that bintrees is deprecated and the development has stopped: https://github.com/mozman/bintrees

Mar 18 '24 10:03 not-a-feature

msprime msprime copied to clipboard

Python algorithm without S avl tree

msprime
msprime copied to clipboard