tsinfer
tsinfer copied to clipboard
Allow sample data and ancestors files to store their time units
In https://github.com/tskit-dev/tsinfer/pull/680 we set the time units in a tsinferred tree sequence to "uncalibrated" (unless specified otherwise when calling infer() or match_ancestors()). The time_units parameter was deliberately left undocumented.
A better solution, IMO, would be for both the sample data file and (more importantly) the ancestors file to know about the time units that they are using. In particular, each ancestor has a time, and the meaning of that time would be a useful thing both to know and to pass on to the inferred tree sequence. It would be sensible to detect whether time-as-frequency was used here or here, and if so, set the time_units to "uncalibrated". This would also allow us to detect if, for example, different meanings of time are used when matching historical samples.
An initial pass at this was made in https://github.com/tskit-dev/tsinfer/pull/679 but it turned out to be rather complicated, so I closed that PR without merging. We can perhaps discuss the requirements in this issue before attempting it again.
A quick note: I think it is uncontroversial to add time units to the ancestors file. However, the sample data file could potentially have 2 different sorts of time units: one associated with the sites time, and another associated with the individuals time. We could imagine historical samples in a sample data file associated with time units of (say) "years", but where we don't have any obvious time units for the sites.
Given the potential move to SGkit, it seems premature to associate time units with the sample data file, when it's a complex issue. So I think we should punt it down the line.
Actually, my comment above is wrong, I think. There is no call to mix time units within a sample data file apart from when the individuals have a specific time unit, and the sites are all of unknown time (which is indicated by a special NaN value). So I take back my comment above. I think it is reasonable to associate a single time_units attribute with a sample data file. If the site times are tskit.is_unknown_time then we set the AncestorData file to tskit.TIME_UNITS_UNCALIBRATED
I'd rather punt this one to the next release after we've got basic interface with sgkit done (#728). We'll get a better sense of how attributes like this could be done then.
This is just a nicety so we get the time units right at the end of the pipeline, right?
I'd rather punt this one to the next release after we've got basic interface with sgkit done (#728). We'll get a better sense of how attributes like this could be done then.
Yes, happy to punt this. Actually it's much easier than we had thought: I sketched out a simple PR yesterday evening and it's only a few lines of code (and will remove some complexity that we added in https://github.com/tskit-dev/tsinfer/commit/4f76b6adfb40665994ad48c15da0bf1c27b2ff11). Should I open as a draft PR just so we can look at it, or is this a distraction?
This is just a nicety so we get the time units right at the end of the pipeline, right?
Essentially, yes. It would hopefully stop people doing silly things when matching historical samples (such as adding them into a tree sequence where the time units are frequencies)
Let's avoid adding anything we don't need to the data file formats before the switch over.
Yes, fair point. I wonder whether it's worth saving the PR somewhere though (for future ref) or simply detailing the logic verbally in an issue (e.g. here).
See also #729 - will need to deal with this problem more generally.
I'm hoping that we can get to a point ultimately where we can provide a list of data files containing dated haplotypes drawn from various sources as input to tsinfer, and it will go ahead and do the inference by pulling from all those files in time steps. This would massively increase our flexibility.
Probably to be thought about in the context of the SGkit move, either release 0.4 or 0.4.1