
[C code] DTWF performance

Open jeromekelleher opened this issue 4 years ago • 23 comments

Run a performance analysis on the DTWF for a large simulation and see where the time is being spent. Since it seems like we're not using the Fenwick tree at all within DTWF simulations, it may be worthwhile 'turning off' Fenwick tree updates during a DTWF sim (which we should be able to do easily enough now, with the new abstractions).

If we do, we can update the Fenwick tree after we finish the DTWF phase so that the simulation state is correct (we want to be able to transition into a Hudson simulation if necessary).
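To make the idea concrete, here is a minimal sketch (not msprime's actual internals; `fenwick_t`, `fenwick_rebuild`, and `fenwick_prefix_sum` are hypothetical names) of "turning off" per-operation Fenwick tree maintenance during a DTWF phase and rebuilding the tree in a single O(n) pass before transitioning to a Hudson simulation:

```c
/* Sketch, assuming we keep the raw per-index values up to date during DTWF
 * and only rebuild the Fenwick array when switching models. This is a toy
 * structure, not msprime's real fenwick_t. */
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    size_t size;
    long *values; /* raw values, 1-indexed; always current */
    long *tree;   /* Fenwick array; allowed to go stale during DTWF */
} fenwick_t;

/* O(n) rebuild: push each partial sum to its immediate Fenwick parent. */
static void
fenwick_rebuild(fenwick_t *f)
{
    memset(f->tree, 0, (f->size + 1) * sizeof(*f->tree));
    for (size_t i = 1; i <= f->size; i++) {
        f->tree[i] += f->values[i];
        size_t parent = i + (i & (~i + 1)); /* i + lowest set bit */
        if (parent <= f->size) {
            f->tree[parent] += f->tree[i];
        }
    }
}

static long
fenwick_prefix_sum(const fenwick_t *f, size_t i)
{
    long s = 0;
    for (; i > 0; i -= i & (~i + 1)) {
        s += f->tree[i];
    }
    return s;
}
```

The point is that a full rebuild is linear, so paying it once at the DTWF→Hudson handover is much cheaper than logarithmic updates on every event that DTWF never queries.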

jeromekelleher avatar Feb 05 '20 14:02 jeromekelleher

@DomNelson, any thoughts?

jeromekelleher avatar Feb 05 '20 14:02 jeromekelleher

@jeromekelleher absolutely, sounds like a good plan. As I remember it, we only updated the Fenwick tree to simplify hybrid simulation models, so if that can be done another way I don't see any reason to keep it. I imagine we'll want something similar for the pedigree sims too then.

Any idea what kind of improvement we could see?

DomNelson avatar Feb 05 '20 14:02 DomNelson

> Any idea what kind of improvement we could see?

Dunno, we'll see once @daniel-goldstein has a chance to do a little benchmarking on it.

jeromekelleher avatar Feb 05 '20 14:02 jeromekelleher

I fiddled with a few parameters to get some benchmarks. Here are some times looking at sample size and Ne, with sequence length 1000 and recombination rate 1e-10. From here it doesn't look like sample size has much impact; Ne dominates performance. I stopped any runs with Ne over 10k because they were taking a long time. Some of the times seem a bit out of line to me, but this is averaged over just a few replicates per configuration, so maybe that's why. I ran it a few times, both with and without a set seed, and the results are pretty consistent.

Screen Shot 2020-02-13 at 2 06 43 PM

Will do some profiling to investigate perf fixes.

daniel-goldstein avatar Feb 13 '20 14:02 daniel-goldstein

Good stuff, thanks @daniel-goldstein. What's your sequence length here? 1e-10 is quite a low recombination rate (1e-8 and Ne=10^4 is the usual human regime). I think a good benchmark would be Ne=10^4, r=1e-8 and L=100 megabases, and to run this for a fixed number of generations (say, 100 or 1000, depending on what's feasible). That's what we really want DTWF for, I think. (Also, for Ne=10^5 and 10^6)

jeromekelleher avatar Feb 13 '20 16:02 jeromekelleher

Whoops, I had written down 1e-10 for recombination rate (and quite a short sequence length since I thought that didn't affect much). I'll rerun the sim with the above specs.

daniel-goldstein avatar Feb 13 '20 17:02 daniel-goldstein

Here are times for a simulation with r=1e-8, L=1e8, and the end time set to 100 generations. Definitely seeing sample size play a factor now.

Screen Shot 2020-02-13 at 7 26 25 PM

daniel-goldstein avatar Feb 13 '20 19:02 daniel-goldstein

Also, I'm seeing ~70% of simulation time spent in malloc/free in msp_merge_ancestors.

daniel-goldstein avatar Feb 13 '20 20:02 daniel-goldstein

> Also, I'm seeing ~70% of simulation time spent in malloc/free in msp_merge_ancestors.

Great - sounds like some low-hanging fruit for perf then!

jeromekelleher avatar Feb 14 '20 07:02 jeromekelleher

Adding the merge queue to the msp struct did solve the time spent on malloc. However, I had an unfortunate error in my benchmarking: I was not measuring the largest sample size but one order of magnitude less, and unsurprisingly the limiting factor for 10^4 samples is not malloc. Time for smaller sample counts was greatly reduced, but there's a deeper problem at larger sample sizes: the time in msp_merge_ancestors increases steadily as the simulation continues. A good chunk of that is in the function itself, along with some time spent searching/rebalancing the avl tree. There's no single bottleneck that I have the granularity to see at the moment.
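For reference, the "hoist the scratch buffer into the simulator state" pattern looks roughly like this sketch (`sim_state_t` and `ensure_merge_capacity` are hypothetical names, not msprime's actual API): the queue is allocated once, grown geometrically, and reused across every merge instead of being malloc'd and freed per call.

```c
/* Sketch: a reusable, geometrically grown scratch buffer kept in the
 * simulator struct, replacing per-call malloc/free in a hot merge loop.
 * Names here are illustrative only. */
#include <assert.h>
#include <stdlib.h>

typedef struct {
    int *merge_queue;      /* reused across calls */
    size_t merge_capacity; /* current allocated length */
} sim_state_t;

/* Grow the queue if needed; returns 0 on success, -1 on allocation failure. */
static int
ensure_merge_capacity(sim_state_t *s, size_t needed)
{
    if (needed > s->merge_capacity) {
        size_t new_cap = s->merge_capacity == 0 ? 64 : s->merge_capacity;
        while (new_cap < needed) {
            new_cap *= 2; /* geometric growth amortises reallocations */
        }
        int *p = realloc(s->merge_queue, new_cap * sizeof(*p));
        if (p == NULL) {
            return -1;
        }
        s->merge_queue = p;
        s->merge_capacity = new_cap;
    }
    return 0;
}
```

With this shape, the common case in the merge loop is a single branch rather than an allocator round trip, which is why the malloc/free hotspot disappears from the profile.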

daniel-goldstein avatar Feb 17 '20 14:02 daniel-goldstein

After removing unnecessary calls to merge_ancestors, you can see the results here.

However, DTWF separately struggles very quickly as Ne increases. Here's DTWF performance for 1Mb, r = 1e-8, sample_size = 1_000, and Ne up to 100_000.

Screen Shot 2020-03-10 at 2 07 10 PM

I have not yet found anything easily fixable in the implementation that explains the slowdown; profiling has shown a lot of time spent merging ancestors and tracking overlap counts.

@DomNelson Any thoughts on this? Are these numbers expected?

daniel-goldstein avatar Mar 10 '20 14:03 daniel-goldstein

I did some work on this, and I think the culprit is a massive calloc(Ne, ...) here. I managed to fix this by using the avl tree as a generic container to index parents, and attaching a linked list (of offspring) to each node. For Ne=50M, this brings down the runtime from 13s to 3s.

Still have some tests failing (zero_pop_size), will submit a pull request once it is fixed.

Question: using avl as a generic container feels a little hacky. Specifically, I rely on avl_insert, which is not used anywhere else in the code. I also had to create a one-off linked list to keep track of every offspring of a given parent. Is there a better way?

@DomNelson @daniel-goldstein https://github.com/ivan-krukov/msprime/blob/878/lib/msprime.c

ivan-krukov avatar Apr 06 '20 18:04 ivan-krukov

That's awesome @ivan-krukov! Do you have profiling numbers showing where the time was being spent, and what other parameters are you running? I was hoping it would be some wasted-memory problem, but I wasn't able to uncover it in profiling.

An avl tree seems reasonable to me because otherwise we would be dealing with a very sparse array of parents. Basically we want the dictionary-like behavior we have in python but we have to roll our own dictionaries and this is how we'd do it.

However, I do think you're right that there's a little unnecessary complexity in how you're using it currently. An avl node wraps a record, or key-value pair in our case, that knows nothing about the avl tree. The avl tree then manages connections between the nodes that allow for quick querying/traversal. So, I think your parent_record_t type could look something like this:

typedef struct {
  uint32_t parent;
  segment_list_t *children;
} parent_record_t;

Then when you're iterating over the tree you can start at tree->head, which gives you an avl node with a pointer to the next in the tree. Here's a small example of iterating over the nodes in an avl tree:

https://github.com/tskit-dev/msprime/blob/ae6402c16dd82da643c490fb9595ad2167413410/lib/msprime.c#L768-L774

When updating the tree, I would recommend the strategy of search --> if there then update --> if not then insert new.
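That search-then-update-or-insert pattern can be sketched as follows. To keep the example self-contained it uses a plain unbalanced BST in place of the avl library, and `child_node_t` is a hypothetical stand-in for segment_list_t; the record layout mirrors the parent_record_t above.

```c
/* Sketch of the search -> if-present-update -> else-insert dictionary
 * pattern, with a linked list of offspring hanging off each record.
 * A real implementation would use the avl tree for balanced lookups. */
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct child_node {
    int child_id;
    struct child_node *next;
} child_node_t;

typedef struct record_node {
    uint32_t parent;          /* key */
    child_node_t *children;   /* value: offspring of this parent */
    struct record_node *left, *right;
} record_node_t;

/* Find the record for `parent`, inserting a fresh one if absent,
 * then prepend the child to its offspring list. Returns the root. */
static record_node_t *
add_child(record_node_t *root, uint32_t parent, int child_id)
{
    record_node_t **slot = &root;
    while (*slot != NULL && (*slot)->parent != parent) {
        slot = parent < (*slot)->parent ? &(*slot)->left : &(*slot)->right;
    }
    if (*slot == NULL) {
        record_node_t *rec = calloc(1, sizeof(*rec));
        rec->parent = parent;
        *slot = rec;
    }
    child_node_t *c = malloc(sizeof(*c));
    c->child_id = child_id;
    c->next = (*slot)->children;
    (*slot)->children = c;
    return root;
}

static size_t
count_children(const record_node_t *root, uint32_t parent)
{
    while (root != NULL && root->parent != parent) {
        root = parent < root->parent ? root->left : root->right;
    }
    size_t n = 0;
    for (const child_node_t *c = root ? root->children : NULL; c != NULL; c = c->next) {
        n++;
    }
    return n;
}
```

The key property is that each parent is looked up once per child, and only a miss pays for an insertion, so errno never needs to be inspected to distinguish the two cases.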

daniel-goldstein avatar Apr 06 '20 19:04 daniel-goldstein

You're right, we don't seem to use avl_insert anywhere. It seems the pattern in place is to do inserts like msp_insert_individual. Personally though, I don't see any reason not to use avl_insert in this case.

daniel-goldstein avatar Apr 06 '20 19:04 daniel-goldstein

Awesome! I still need to do some refactoring, so your comments are on point. I was mostly concerned that the avl_tree was specifically intended for msprime segments (since the struct has left/right pointers, which seemed eerily familiar). But I was wrong about it.

I lifted the iteration from the avl_free I think. But what you have does look more elegant.

Wrt search first, insert after - makes sense. That will let me avoid checking errno, which is a little clunky.

Thank you

ivan-krukov avatar Apr 06 '20 20:04 ivan-krukov

This is fantastic, thanks @ivan-krukov! Can you open a PR with your changes? It'll make it easier to see the diff and discuss. Don't worry if it's not fully ready for action.

jeromekelleher avatar Apr 07 '20 07:04 jeromekelleher

Working on the benchmarks now. Also looks like I need to add a few more tests.

As far as the code goes - I needed to add a new structure parent_record_t, which has an identifying parent member. This struct has a custom cmp to be used within the avl_tree. I also use the same struct as a node in a linked list, each node corresponding to an offspring of a parent. Compared to the alternative of having a different struct for linked list nodes, this allows me to call malloc less often. Since this structure is only used here (so far), this seemed like a better option.

If you think it would be better to be more explicit about the data structures used, I won't mind adding generic linked_list_t and avl_storable_t implementations to util.

ivan-krukov avatar Apr 07 '20 17:04 ivan-krukov

The runtime is less clean than I had expected. It seems that I am paying an avl price to maintain the tree with a large sample size. However, with smaller sample sizes it seems to be quite beneficial. The runtimes below are based on:

msprime.simulate(sample_size=n, Ne=Ne, 
    length=1e8, recombination_rate=1e-8, 
    model="dtwf", end_time=100)

bench_plot

I am not sure why the decrease in runtime with sample size n=10_000 is happening, though. I managed to reproduce it, too, so this is not an artifact.

ivan-krukov avatar Apr 07 '20 19:04 ivan-krukov

@ivan-krukov Interesting, and great for the small sample sizes! I also don't fully understand the drop, but have found that different configurations can go through the first 100 generations in very different ways. Are you able to see what the timings look like for the whole simulation? That might give some more insight.

My guess would be that when the sample size is close to Ne there are more coalescences early on which make those 100 generations take longer, but that's a rough guess. As for why the avl tree is taking longer, maybe we're getting good cache behavior when Ne is small so allocating the full array + quick access beats the overhead of managing the AVL tree. But that can only go so far. Is the calloc the primary factor in the spike from Ne=10^7 to 10^8? @jeromekelleher Do you think we're facing a tradeoff with the cache, or am I in the wrong ballpark here?

daniel-goldstein avatar Apr 07 '20 20:04 daniel-goldstein

This looks like an excellent improvement, well done @ivan-krukov! I've no real idea what's going on with the drop in run time, but I guess there could be strong stochastic effects here so maybe we should be wary of single realisations.

I think we need to run a few examples through linux perf to get more insight. I'd be interested to see the profile of n=1000, Ne=10**7 for both (since they're about equal). This should give us a good view into the tradeoffs.

jeromekelleher avatar Apr 08 '20 07:04 jeromekelleher

I wonder if the whole sims will be less informative, since their runtime will be dependent on the total coalescent time. But I will check.

The calloc is not the primary issue on its own - it's iterating over all the elements that was the problem, I believe. That's why it didn't show up in the profiler: the runtime was spread over several calls. I think a line-based profiler would have picked this up.

The scaling with respect to the population size looks vaguely logarithmic to me, but it remains quadratic with respect to the sample size. I've replotted the same numbers, but facetted on the population size this time:

bench_plot_n

ivan-krukov avatar Apr 08 '20 15:04 ivan-krukov

Really interesting, thanks @ivan-krukov. I'd be surprised if the big uptick we're seeing for your code on large sample sizes is real (well, more accurately, something we can't easily fix). Let's run this through perf to get some insights. I'm guessing there's a crapton of little mallocs happening there because we're not currently using the local allocators for this yet.

(One minor request: would you mind drawing lines on these plots if you're doing them again? For some reason I just can't see the trends as easily when it's just points!)

jeromekelleher avatar Apr 09 '20 08:04 jeromekelleher

Dropping this one out of the 1.0 list, we don't have time to get this done before release now.

jeromekelleher avatar Mar 13 '21 19:03 jeromekelleher