specify samples by population name
It'd be nice to do e.g.
ts.samples("popA")
where "popA" is the name of a population, rather than the index.
Currently we end up doing things like this:
pop_names = [p.metadata['name'] for p in ts.populations()]
sample_sets = [ts.samples(pop_names.index(n)) for n in ["popA", "popB"]]
ts.divergence(
sample_sets,
indexes=[(0, 0), (0, 1), (1, 1)]
)
However, this does mean that name is maybe "no longer metadata" since we're using it...
I'm totally +1 for this. I think we can blur the line a bit here with the data/metadata distinction by saying it's data if we use it in the C library. Also, if we do decide we want to add a name field to the population at some point (which wouldn't be a terrible idea) we can easily imagine dealing with that here by saying the name field takes precedence, and if this is null we check for a metadata key called name.
Yes, I'm happy with this, and with the idea that the C API defines what is metadata/data. Especially as we start to standardise metadata, I see it that there are three tiers:
- data: used by C API and in tables
- standard metadata: used by the Python API
- user metadata: not used by tskit
Also note that you "can" do ts.samples(population="A") and it returns an empty array, which I think should be counted as a bug.
Some more up-to-date notes as this issue got posted as dupe in #2294.
This is very close to https://github.com/tskit-dev/tskit/issues/2239 where there is a proposal for the TreeSequence row-accessors to get similar functionality.
Personally I would prefer an explicit ts.samples(name='BIG') to ts.samples('BIG'). One option would be to make this the equivalent of #2239 which would be ts.samples(metadata=("name", "BIG"))
For the syntactically sweeter options name is promoted to "a metadata key used by the Python API" which means the standardization mentioned in https://github.com/tskit-dev/tskit/issues/1051 is relevant.
Personally I would prefer an explicit
ts.samples(name='BIG')tots.samples('BIG')
If so, I suggest that ts.samples('BIG') should raise an error.
I just hit this "bug" again, when I tried to use ts.samples(population='BIG'). Note that In @benjeffery's suggestion, it's not clear that the 'BIG' is the name of a population (not, e.g an individual). It so happens that the first argument of ts.samples is population, but that seems a bit obscure to me. Perhaps
ts.samples(population_name='big')
or allow any argument to take a metadata dictionary:
ts.samples(population={'name': 'BIG'}) # if dict, assume we look in the metadata
or we could just use the .population accessor with Ben's suggestion:
ts.samples(population=ts.population(name="BIG")) # if dict, assume we look in the metadata