
Performance on trees with internal samples

Open · hyanwong opened this issue · 0 comments

Chatting to @molpopgen, he says:

When there are sufficient numbers of ancient samples, doing anything with trees is terribly inefficient, and you can recover literally orders of magnitude by simplifying to each time point for which there are ancient samples. An example that I run into a lot, and I'm sure that @petrelharp has too, is simulations where you remember everyone for some period of time. In those cases, performance regresses from logarithmic to linear, and a tremendous amount of time is spent updating information about nodes that have nothing to do with your current time slice. In a simulation, most ancient samples will tend to be internal, and many are not ancestral to the final generation. Here's a figure I made yesterday based on a massively polygenic simulation: there are millions of internal nodes making up the time series, and the plot takes over an hour to make if you don't simplify to each time point separately.

[figure: D statistic time series — 20,000 nodes per time point across 100 time points] The D statistic is calculated from a random sample of 50 diploid individuals. You basically have to simplify to that sample for the figure to be possible at all. With few samples, performance stays close to logarithmic; with many, it is quite poor. Like I said, this is an extremely common case: I'm pretty sure that Peter does this routinely, and I certainly do.

This seems like prime material for one of the "High performance" tutorials (see #151). There's an open issue on it at https://github.com/molpopgen/fwdpy11/issues/394, but this is really a general tree sequence issue and so might well be a candidate for incorporation here.

hyanwong avatar Dec 09 '21 15:12 hyanwong