msprime
msprime copied to clipboard
Python algorithm without S avl tree
Proposal for adapting algorithms.py
(see #1993) such that we can track the number of samples each segment is ancestral to without having to rely on AVL tree S
.
@benjeffery any idea what's up with CircleCI here?
Can we remove the bintrees dependency or are we still using it somewhere?
I think the main issues here are with the verification code - how much have you run this? The tests run it a little bit but it's good to run the algorithm for a longish time to really shake out weird problems.
@benjeffery any idea what's up with CircleCI here?
Try pushing again, seen some discussion that this might be transitory.
Can we remove the bintrees dependency or are we still using it somewhere?
I think the main issues here are with the verification code - how much have you run this? The tests run it a little bit but it's good to run the algorithm for a longish time to really shake out weird problems.
Don't think I have run it enough then. Will do that.
Can we remove the bintrees dependency or are we still using it somewhere?
I think the main issues here are with the verification code - how much have you run this? The tests run it a little bit but it's good to run the algorithm for a longish time to really shake out weird problems.
bintrees is still used here: dtwf_generation()
First benchmarks are not looking good:
- dev branch
Human: length: 75000000.0, sample size:1000 Function _run() {} Took 45.8196 seconds Human: length: 75000000.0, sample size:10000 Function _run() {} Took 78.5620 seconds Drosophila: length: 10000000, sample size:100 Function _run() {} Took 61.1237 seconds
- main branch
Human: length: 75000000.0, sample size:1000 Function _run() {} Took 15.0323 seconds Human: length: 75000000.0, sample size:10000 Function _run() {} Took 17.4346 seconds Drosophila: length: 10000000, sample size:100 Function _run() {} Took 23.6431 seconds
This is for a single run for each of the parameter settings.
The performance penalty seems to be mostly due to the fact that when keeping track of ancestral_to
per segment, being able to refactor the segment chain becomes quite a rare event. This is because we cannot merge two adjoining segments that are ancestral to a different number of samples. This leads to a clear increase in the total and mean number of segments (see graph). So instead of updating the avl tree. We seem to be spending our time looping over a larger number of segments.
Dev - branch: human like parameters, averaged across 50 reps.
Main - branch:
Dammit, didn't see that coming :frowning_face:
I'm surprised that there's so many adjacent segments with identical nodes but different ancestral_to values, I thought that would be rare...
Can you push the changes to the C code please @GertjanBisschop - I'd like to have a look at the segment squashing logic
Hmm, maybe we're not defragging often enough. Can you set it up so that we always defrag the segment chain after msp_merge_two_segments (probably just set defrag_required=true
) and look at those numbers of segment plots again?
This does not have any impact:
This is a real head scratcher... The extra segments all seem to be late in the simulation, when they should be pretty widely separated within the lineage (I would have thought!)
Can you paste in the top of the perf report (from a full run) for the Drosophila example please?
Sorry @jeromekelleher, forgot to mention that the time (x-axis) is in generations, and not on the coalescent time scale. So this is the very beginning of the coalescent process.
Perf report for the Drosophila example:
35.37% python3 _msprime.cpython-38-x86_64-linux-gnu.so [.] msp_merge_two_ancestors 8.90% python3 _msprime.cpython-38-x86_64-linux-gnu.so [.] rate_map_position_to_mass 7.67% python3 _msprime.cpython-38-x86_64-linux-gnu.so [.] avl_rebalance 6.68% python3 _msprime.cpython-38-x86_64-linux-gnu.so [.] fenwick_set_value 6.32% python3 _msprime.cpython-38-x86_64-linux-gnu.so [.] msp_defrag_segment_chain 5.31% python3 _msprime.cpython-38-x86_64-linux-gnu.so [.] avl_at 4.66% python3 _msprime.cpython-38-x86_64-linux-gnu.so [.] fenwick_find 4.35% python3 _msprime.cpython-38-x86_64-linux-gnu.so [.] cmp_individual 2.90% python3 _msprime.cpython-38-x86_64-linux-gnu.so [.] avl_search_closest 2.57% python3 _msprime.cpython-38-x86_64-linux-gnu.so [.] msp_set_segment_mass 2.06% python3 _msprime.cpython-38-x86_64-linux-gnu.so [.] fenwick_increment 1.62% python3 libm-2.31.so [.] __log1p 1.42% python3 _msprime.cpython-38-x86_64-linux-gnu.so [.] rate_map_mass_between 0.82% python3 _msprime.cpython-38-x86_64-linux-gnu.so [.] msp_run 0.77% python3 _msprime.cpython-38-x86_64-linux-gnu.so [.] msp_choose_uniform_breakpoint 0.63% python3 _msprime.cpython-38-x86_64-linux-gnu.so [.] msp_get_segment.isra.0 0.61% python3 _msprime.cpython-38-x86_64-linux-gnu.so [.] avl_unlink_node 0.45% python3 libgsl.so.23.1.0 [.] gsl_rng_uniform_int 0.39% python3 libc-2.31.so [.] msort_with_tmp.part.0 0.36% python3 _msprime.cpython-38-x86_64-linux-gnu.so [.] cmp_node_mapping 0.28% python3 _msprime.cpython-38-x86_64-linux-gnu.so [.] fenwick_get_cumulative_sum 0.25% python3 libgsl.so.23.1.0 [.] gsl_ran_exponential
Yes, that looks about right for your hypothesis. I'm still surprised there are so many adjacent, squashable segments with different ancestral_to
values - how can this happen?
For every pair of overlapping segments with non-corresponding starts and ends I think you create three squashable segments in the new merged lineage. One section will correspond to a coalescence event (ancestral_to = sum of both) and the two other bordering segments have the ancestral_to value of one of the original segments.
Right, I guess that's it.
At times I've wondered if actually keeping track of these values is worth it at all. In principle, you could just look through all segments periodically and when you see an interval that contains only one segment, throw it away. There would a tradeoff, of course, between the frequency of doing this vs generated unnecessary recombination events. Do you think this would be worth prototyping?
For every pair of overlapping segments with non-corresponding starts and ends I think you create three squashable segments in the new merged lineage. One section will correspond to a coalescence event (ancestral_to = sum of both) and the two other bordering segments have the ancestral_to value of one of the original segments.
Gee, this sounds related to the 'extend edges' thing again! (although maybe only tangentially)
Gee, this sounds related to the 'extend edges' thing again! (although maybe only tangentially)
Hah, yes! Really great point Peter.
This is definitely worth exploring, but we should look at it in the algorithms.py before diving into C. What we want to do is do merge_ancestors
in two passes: first check if there's any coalescence, and if so we map both lineages to the new node at all positions along the genome whether they coalesce or not. This is something we'll want to support in msprime anyway, so good to explore it in this context in case it fixes these performance issues.
Cool. These issues are related indeed.
Just thinking. When mapping both lineages to the new node, wouldn't we still have to merge all segments of both original lineages? So we would still be iterating over the extra segments we create by using ancestral_to
.
So we would still be iterating over the extra segments we create by using ancestral_to.
Yes, you're right. So maybe this idea of "keeping more of the ARG" is orthogonal and we should investigate any performance benefits we might get separately. Let's talk about this offline.
A thought that just struck me then is that if we are keeping this extra ARG information then we could probably take the max of the ancestral_to
values in adjacent squashable segments, and it might all work out quite nicely.
Additional reason to switch is that bintrees
is deprecated and the development has stopped:
https://github.com/mozman/bintrees