tskit icon indicating copy to clipboard operation
tskit copied to clipboard

specify samples by population name

Open petrelharp opened this issue 4 years ago • 8 comments

It'd be nice to do e.g.

ts.samples("popA")

where "popA" is the name of a population, rather than the index. Currently we end up doing things like this:

pop_names = [p.metadata['name'] for p in ts.populations()]
sample_sets = [ts.samples(pop_names.index(n)) for n in ["popA", "popB"]]
ts.divergence(
        sample_sets,
        indexes=[(0, 0), (0, 1), (1, 1)]
)

However, this does mean that name is maybe "no longer metadata" since we're using it...

petrelharp avatar Sep 13 '21 17:09 petrelharp

I'm totally +1 for this. I think we can blur the line a bit here with the data/metadata distinction by saying it's data if we use it in the C library. Also, if we do decide we want to add a name field to the population at some point (which wouldn't be a terrible idea) we can easily imagine dealing with that here by saying the name field takes precedence, and if this is null we check for a metadata key called name.

jeromekelleher avatar Sep 14 '21 09:09 jeromekelleher

Yes, I'm happy with this, and with the idea that the C API defines what is metadata/data. Especially as we start to standardise metadata, I see it that there are three tiers:

  • data: used by C API and in tables
  • standard metadata: used by the Python API
  • user metadata: not used by tskit

benjeffery avatar Sep 14 '21 13:09 benjeffery

Also note that you "can" do ts.samples(population="A") and it returns an empty array, which I think should be counted as a bug.

hyanwong avatar Dec 10 '21 11:12 hyanwong

Some more up-to-date notes as this issue got posted as dupe in #2294.

This is very close to https://github.com/tskit-dev/tskit/issues/2239 where there is a proposal for the TreeSequence row-accessors to get similar functionality. Personally I would prefer an explicit ts.samples(name='BIG') to ts.samples('BIG'). One option would be to make this the equivalent of #2239 which would be ts.samples(metadata=("name", "BIG"))

For the syntactically sweeter options name is promoted to "a metadata key used by the Python API" which means the standardization mentioned in https://github.com/tskit-dev/tskit/issues/1051 is relevant.

benjeffery avatar May 24 '22 09:05 benjeffery

Personally I would prefer an explicit ts.samples(name='BIG') to ts.samples('BIG')

If so, I suggest that ts.samples('BIG') should raise an error.

hyanwong avatar Jan 01 '23 12:01 hyanwong

I just hit this "bug" again, when I tried to use ts.samples(population='BIG'). Note that In @benjeffery's suggestion, it's not clear that the 'BIG' is the name of a population (not, e.g an individual). It so happens that the first argument of ts.samples is population, but that seems a bit obscure to me. Perhaps

ts.samples(population_name='big')

or allow any argument to take a metadata dictionary:

ts.samples(population={'name': 'BIG'})  # if dict, assume we look in the metadata

or we could just use the .population accessor with Ben's suggestion:

ts.samples(population=ts.population(name="BIG"))  # if dict, assume we look in the metadata

hyanwong avatar Nov 13 '23 13:11 hyanwong