tsinfer
tsinfer copied to clipboard
Should we always set a metadata "name" for populations in tsinfer?
Similar to https://github.com/MesserLab/SLiM/issues/169, should tsinfer always create a "name" (and maybe "description") for populations in the resulting tree sequence? I think we should.
If a name is not given in the metadata field (when calling add_population ) then we could simply create a name like "sample_data_pop1". Alternatively, we could simply add a required parameter name to add_population, either storing it in the sample data metadata field (easiest), or creating a new zarr array for it, and requiring that each population name is unique. Do you have any preferences/thoughts @jeromekelleher ?
This would help with https://github.com/tskit-dev/tsinfer/issues/604 as then we could match populations by name.
In the transition to SGkit, we would presumably use the "name" string for a cohort (AKA population) that @jeromekelleher argued for at https://github.com/pystatgen/sgkit/issues/224#issuecomment-694350208, so this would fit reasonably nicely into that structure.
Without looking at the details, I think we should always set a name. I'd lean towards sticking it into metadata, but it depends on how things work out. I guess we can just keep a set of population names during the add phase to make sure they are unique?
I also think we should stick them in metadata. I wonder if during the creation of a SampleData file, we should have a parameter "population_metadata_schema", which is a schema that allows "name" and "description" to be set. If None, we use the msprime population schema.
Then during the add phase, we simply write to "name" and "description", assuming they exist (and we would automatically error out here if the user passed in an incompatible schema when creating the file).
Alternatively, we could just set the population metadata schema to the permissive_json_schema().
Sounds good, but we should be a little wary about any performance implications (we've hit some walls recently with metadata encoding) I see the required flag in the JSON schema as just telling users that the key is guaranteed to exist BTW.
So, setting the schema at the end may be short term pragmatic option.
I assume (with no evidence!) that metadata in populations is unlikely to cause performance issues, because we rarely have large numbers of them in a loop. I guess we might want to avoid doing lots of comparisons of population names between samples though?
Yes, of course - my bad
I've just been trying to get this to work with the test suite. I actually wonder if we want to make the "default" population names more-or-less unique to the sample data file. Otherwise, when we merge sample data files (there is a function to do that), and assuming we merge on the name, we we'll simply end up merging the first pop in one file with the first in the other, etc. So I wonder if the default name, if no name is given, should also include the UUID of the sample data file?
That's pretty messy from a user perspective - you'd expect the names to be something you could predict. We'll have to figure out a different way of fixing the tests.
That's pretty messy from a user perspective - you'd expect the names to be something you could predict. We'll have to figure out a different way of fixing the tests.
Yeah, fair point. It's not so much fixing tests though. It's a question of whether, if you merge together 2 separate sample data files (both with populations defined, say 2 in the first file and 3 in the second), you would expect the populations in each file to be treated as separate (i.e. 5 populations in the final file), or whether you would expect the 2 populations in the first file to be equated to the first 2 populations in the second. I think you would expect the former behaviour, creating 5 populations. If we allocate the same set of names by default, and match on names only we will get the second behaviour.
A few other ways we could get the behaviour I would expect
- by default, put the UUID in the description field and merge only if both name and description are the same. But then how do we make names unique in the final file?
- Treat the default, auto-generated notation (e.g.
p1,p2etc) as special: if we detect names of that format when merging, we don't merge togetherp1from self andp1from other, even thought they have the same name. That seems a bit of a hack. - Somehow flag up the population names as autogenerated. E.g. put "autogenerated name" in the description field, and don't merge populations that have that string as a description.
I'm leaning towards 3.
Here's the (rather complicated) sketch of the SampleData.add_population method that I propose. What do you think @jeromekelleher ?
def add_population(self, metadata=None, name=None, description=None):
"""
Adds a new :ref:`sec_inference_data_model_population` to this
:class:`.SampleData` and returns its ID. Metadata including (at a minimum)
a name and a description, will be associated with each population.
All calls to this method must be made **before** individuals or sites
are defined.
:param dict metadata: A JSON encodable dict-like object containing
metadata that is to be associated with this population. Keys called
"name" or "description" may have their values overwritten using the
`name` and `description` parameters below.
:param str name: A unique name for this population (of length > 0). If given,
this will override any name defined in `metadata`. If not given, and
no "name" key exists in `metadata`, a suitable name will be autogenerated:
population 0 will be given the name `p0`, population 1 `p1`, etc.
:param str description: A description for this population (default: None). If not
None, this will override any description provided in `metadata`. Otherwise
(i.e. no "description" key exists in `metadata` and `description` is None)
if the population name was name was autogenerated the description will be
set to the string "autogenerated name", whereas if the name was not
autogenerated the description will default to the empty string.