Revisit time discretization for path rescaling

Open nspope opened this issue 6 months ago • 2 comments

Rescaling intervals in tsdate are set by taking quantiles of "mutational area" or "mutational path length", i.e. dividing time into bins such that each bin holds an equal amount of area/path length. In the latter case, it's possible to cook up scenarios where these quantiles get heavily skewed towards older times (basically when there are a ton of polytomies and a ton of samples), in which case the adjustment is too coarse in recent times.
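As a rough illustration of how quantile-based intervals can end up coarse in recent times, here is a toy sketch. It is not tsdate's implementation: `quantile_breakpoints`, `times`, and `mass` are made-up names, and `mass` just stands in for whichever measure (area or path length) is being split into equal shares.

```python
import numpy as np

def quantile_breakpoints(times, mass, num_intervals):
    """Toy sketch: time breakpoints such that each bin holds ~equal `mass`."""
    order = np.argsort(times)
    t = np.asarray(times, dtype=float)[order]
    w = np.asarray(mass, dtype=float)[order]
    # Cumulative share of mass at or below each time point
    cum = np.cumsum(w) / w.sum()
    # Interior quantile levels, e.g. 0.25, 0.5, 0.75 for four intervals
    probs = np.linspace(0, 1, num_intervals + 1)[1:-1]
    idx = np.searchsorted(cum, probs, side="left")
    idx = np.minimum(idx, t.size - 1)  # guard against rounding at the top
    # Duplicate breakpoints collapse, so fewer intervals may come back
    return np.unique(np.concatenate([[0.0], t[idx], [t[-1]]]))

times = [1, 10, 100, 1000, 10000]
# Almost all of the mass sits at the oldest time point
skewed = quantile_breakpoints(times, [1, 1, 1, 1, 96], 4)
# Mass spread evenly across time points
even = quantile_breakpoints(times, [1, 1, 1, 1, 1], 4)
print(skewed, even)
```

With the skewed weights every interior quantile lands on the oldest time point, so the four requested intervals collapse into one that spans everything younger.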

For example, here is a case with a bunch of artefactual polytomies and 40k samples, using the default settings (x-axis: true node ages, y-axis: inferred node ages):

Image

where it's clear that there's only a single rescaling interval from 0-100 generations. Upping the number of intervals by 10x gives:

Image

One solution would be to use a fixed logarithmic grid, collapsing out bins with zero mutational mass (but this needs some thought, as it might blow up in the lower tail). Further, I'm not sure if this is actually a problem with real data (the examples above are pathological by design, tsinfer makes nowhere near that many polytomies), but I worry it would start to be with UKB numbers of samples. So it'd be worth seeing what the time discretization looks like on UKB.
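A minimal sketch of the fixed logarithmic grid idea, under some loud assumptions: `log_grid_breakpoints` is a hypothetical helper (nothing in tsdate), `mass` again stands in for mutational area/path length, and the lower-tail issue is sidestepped by starting the grid at the smallest positive time rather than solved properly.

```python
import numpy as np

def log_grid_breakpoints(times, mass, num_intervals):
    """Toy sketch: log-spaced time grid with zero-mass bins collapsed."""
    t = np.asarray(times, dtype=float)
    w = np.asarray(mass, dtype=float)
    # A log grid cannot start at zero, so only positive times define it
    keep = t > 0
    t, w = t[keep], w[keep]
    edges = np.geomspace(t.min(), t.max(), num_intervals + 1)
    # Total mass falling into each log-spaced bin
    bin_mass, _ = np.histogram(t, bins=edges, weights=w)
    # Keep only the upper edge of bins with mass; an empty bin is
    # thereby merged into the next non-empty bin above it
    upper = edges[1:][bin_mass > 0]
    if upper.size == 0 or upper[-1] < edges[-1]:
        upper = np.append(upper, edges[-1])  # always cover the oldest time
    # Everything younger than the first kept edge shares one bin from 0
    return np.concatenate([[0.0], upper])

times = [2, 20, 200, 2000, 9000]
mass = [1, 1, 1, 1, 96]  # same kind of skew towards old times
breaks = log_grid_breakpoints(times, mass, 8)
print(breaks)
```

In this toy case the grid keeps several breakpoints below 100 even though nearly all of the mass is old, which is the behaviour the quantile scheme loses. How small the youngest bins are allowed to get is still the open question.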

nspope avatar Jun 20 '25 00:06 nspope

I'd be happy to look at this in UKB, @nspope. What might be a good way to visualise what time discretisation looks like, as we don't have true ages? (Would it help to compare node ages for different values of rescale_intervals?)

savitakartik avatar Jun 20 '25 13:06 savitakartik

Thanks Savita! We'll have to string together some internal functions to do this. Will ping you about it later.

nspope avatar Jun 20 '25 15:06 nspope