Benchmarking the Two-Locus Framework and LD Calculator
For the purposes of assessing how the two-locus framework performs relative to the LD Calculator, I've performed a series of benchmarks. In our initial benchmarks, the two-locus framework was rather slow relative to the LD Calculator (up to 10x slower). We went ahead and improved some of the low-hanging inefficiencies in the code and now I'm happy to report that the two-locus framework is within 2x the speed of the LD Calculator, and in some cases faster. I have created a document that attempts to thoroughly detail the changes made and the reasoning behind them: benchmarks.pdf. For brevity, I've listed some of the highlights below for a high-level overview.
Here is a brief description of the changes that we made to increase performance:
- Base case: The original code that is currently in tskit (link).
- Malloc out of hot loop: Preallocate a structure of arrays used for temporary calculations instead of allocating and freeing for every pair of sites (link).
- Refactor Bit Arrays: Refactor bit array interface, removing the need for temporary arrays in many cases. All functions now take a row index as a parameter (link).
- Precompute Allele Counts and Biallelic Summary Function: Store precomputed bit arrays for each sample set and allele and the count of samples for each allele. Introduce a biallelic summary function that avoids multiple redundant computations, leaving the original normalized summary function for multiallelic sites (link).
Python and C tests are all passing for each of these patches.
We benchmark each change (in our benchmarks, each change is layered on the next) on a set of tree sequences with the following parameters:
| parameter | value |
|---|---|
| r | 1e-8 |
| $N_e$ | 1e4 |
| L | 2e6 |
With ploidy=1 and sample sizes ranging from $10^2$ - $10^5$. We sample 15 replicates to obtain these results:
In these plots the Relative difference is $(\textrm{tl} - \textrm{ldc}) / \textrm{ldc}$ where ldc is the LD Calculator and tl is the two-locus framework.
Next, we generated a larger tree sequence with L=6e6 and sample sizes ranging from $10^3$ - $10^6$. Comparing all optimizations on the smaller (panel A) and larger (panel B), we get the following results:
I'm not sure that there's any desire to deprecate/remove the LD Calculator at this point, but at least this provides some real-world numbers on the relative performance between these two methods. I'd like to incorporate these patches into the codebase if at all possible.
cc @jeromekelleher @petrelharp @apragsdale