`ts.pair_coalescence_rates` can give large or infinite rates
Here's an example giving a maximum value of infinity in the returned matrix:
```python
import msprime
import numpy as np

L = 10e6  # 10 Mb
params = {"sequence_length": L, "population_size": 1e4, "recombination_rate": 1e-8, "random_seed": 6}
sweep_model = msprime.SweepGenicSelection(
    position=L / 2, start_frequency=0.0001, end_frequency=0.9999, s=0.25, dt=1e-6
)
ts = msprime.sim_ancestry(20, model=[sweep_model, msprime.StandardCoalescent()], **params)

# time intervals need to go from 0 -> infinity
time_intervals = np.logspace(0, np.log10(ts.max_time), 20)
time_intervals = np.concatenate(([0], time_intervals, [np.inf]))
genome_intervals = np.linspace(0, L, 21)
rates = ts.pair_coalescence_rates(time_windows=time_intervals, windows=genome_intervals)
np.nanmax(rates)  # gives np.inf
```
If you swap the `random_seed` to 3, you get a maximum rate of 134217728.0, which also seems wrong. These large rates are all in the last time slice, so presumably we are hitting a division-by-zero or division-by-epsilon error.
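A quick diagnostic sketch (reusing `rates` from the snippet above; the 1.0 cutoff is just an arbitrary value far above any plausible pairwise rate for this population size) confirms that the offending entries all fall in the final time window:

```python
import numpy as np

# Flag infinite or implausibly large rate estimates.
bad = np.isinf(rates) | (rates > 1.0)
# Print the (genome window, time window) indices of the flagged entries:
# for this example they all have the last time-window index.
print(np.argwhere(bad))
```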
As discussed previously with @nspope. I've assigned you to this: I hope that's OK.
The issue is that the last time bin starts at the oldest node in the tree sequence. In this case the last bin contains a single node, and the corresponding rate estimator is effectively 1 / (E[pair time | events > epoch start] - epoch start), which blows up because only floating-point error keeps the denominator away from zero. This should be easy enough to catch in practice.
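To make the blow-up concrete, here is a minimal numerical sketch in plain numpy (not the library internals; the node time is made up): when the last epoch starts exactly at the oldest node, the conditional mean pair coalescence time in that epoch equals the epoch start up to floating-point error, so the estimator's denominator is zero or epsilon-sized.

```python
import numpy as np

epoch_start = 26342.8710030926           # hypothetical oldest node time
pair_times = np.array([epoch_start])     # the only coalescence event in the last epoch
denom = pair_times.mean() - epoch_start  # exactly 0.0 here; in the library only FP error keeps it nonzero
rate = np.inf if denom <= 0 else 1.0 / denom
print(rate)                              # inf, or an absurdly large finite value
```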
Changing `time_intervals = np.logspace(0, np.log10(ts.max_time), 20)` to `time_intervals = np.logspace(0, np.log10(ts.max_time + 1), 20)` gives something much more reasonable. Starting the last epoch at a fixed time just before the final node gives rate estimates that are bogus (e.g. totally dependent on the choice of fixed timepoint), but at least numerically stable.
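For completeness, here is that workaround applied to the example above (reusing `ts` and `genome_intervals` from the original snippet):

```python
# Shift the top of the log-spaced grid just past the oldest node, so the last
# finite breakpoint lies above ts.max_time rather than exactly at it.
time_intervals = np.logspace(0, np.log10(ts.max_time + 1), 20)
time_intervals = np.concatenate(([0], time_intervals, [np.inf]))
rates = ts.pair_coalescence_rates(time_windows=time_intervals, windows=genome_intervals)
print(np.nanmax(rates))  # now finite and of a much more plausible magnitude
```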