tskit icon indicating copy to clipboard operation
tskit copied to clipboard

add `individuals=` argument to simplify

Open petrelharp opened this issue 4 years ago • 9 comments

An error that I have seen quite a few people make already is that they pass in a list of individual ids to samples=, instead of their constituent nodes. It's an easy mistake to make. If we had an "individuals" argument to simplify, this could help avoid this confusion. It might work like this:

for ind in individuals:
   nn = self.individual(ind).nodes
   warned = False
   for n in n:
     if (not warned) and n in samples:
        raise warnings.warn("Some individual nodes are already present in the 'samples' argument - did you mean to pass both individuals and their nodes?")
        warned = True
     samples.append(n)

petrelharp avatar Feb 20 '20 16:02 petrelharp

Good idea. Why not make the individuals and samples arguments mutually exclusive though? Seems simpler, and I can't think of a good reason for mixing them.

jeromekelleher avatar Feb 20 '20 17:02 jeromekelleher

Yeah, you're right. I guess the main question is whether to make samples optional with default None or make people pass in None explicitly (the first, I imagine?).

petrelharp avatar Feb 20 '20 17:02 petrelharp

I think samples defaults to None already (in this case it'll take all the samples in the ts). So another individuals=None argument should work fine.

jeromekelleher avatar Feb 20 '20 17:02 jeromekelleher

Here's the nuclear option. We could do the above and then also make all the arguments to simplify keyword-only, so users would have to say either samples= or individuals= in the call. It would break a lot of code (mostly our own, I think), so we'd have to have a release where we issued a deprecation warning first. But it would (hopefully) reduce the number of times users would make the error. Almost certainly too extreme a solution, but I thought it worth mentioning.

hyanwong avatar Nov 27 '20 09:11 hyanwong

+1 for this option. Simplifying to n individuals by looking just at their nodes does not always result in n individuals after simplification.

Consider a temporally sampled tree sequence, and the desire to simplify to a subset (say A) of individuals. Suppose also that in the pre-simplified tree sequence, there is an individual not in A which contains a node that is an ancestor to individuals in A. As far as I can tell, simplifying to the nodes in the A subset of individuals also retains the ancestor individual who is not in A.

grahamgower avatar Jun 09 '21 13:06 grahamgower

Consider a temporally sampled tree sequence, and the desire to simplify to a subset (say A) of individuals. Suppose also that in the pre-simplified tree sequence, there is an individual not in A which contains a node that is an ancestor to individuals in A. As far as I can tell, simplifying to the nodes in the A subset of individuals also retains the ancestor individual who is not in A.

That's true, and by design: what simplify does is to retain only information about the samples and the nodes necessary to describe their genealogies. "Individuals" are really just metadata about nodes, and we don't throw away the metadata of the ancestral nodes either.

But, I realize that this is a common error, to assume that samples and individuals are interchangeable. This is why in pyslim we've got individuals_alive_at( ), to easily pull out the individuals you want. I guess I think we should deal with that elsewhere, not in simplify? Options might be (a) remind people more about this common error (b) provide a tool to remove certain individuals so people can 'tidy' up their tree sequence (c) provide a tool to more easily extract the individuals they want (I think we have an issue for this: e.g., give me the individuals attached to these nodes).

petrelharp avatar Jun 09 '21 14:06 petrelharp

I was hoping that an individuals= option might return a tree sequence with the individuals table limited to the specified individuals. Wouldn't that be the least surprising thing?

grahamgower avatar Jun 09 '21 14:06 grahamgower

I was hoping that an individuals= option might return a tree sequence with the individuals table limited to the specified individuals. Wouldn't that be the least surprising thing?

I don't think so - for instance, you might argue by analogy that you should also have those individuals parents, now that we have those. I think the least confusing thing is just to make individuals= argument basically an alias for nodes=(the nodes in those individuals.

petrelharp avatar Jun 09 '21 17:06 petrelharp

I was hoping that an individuals= option might return a tree sequence with the individuals table limited to the specified individuals. Wouldn't that be the least surprising thing?

Perhaps what you want, @grahamgower, is a delete_individuals function. You call that first, then simplify(individuals=np.arange(ts.num_individuals)). This does, however, result in the nodes associated with all the individuals being marked as samples, which might not be what you want (there are several simulation cases where non-sample nodes are associated with individuals)?

hyanwong avatar Dec 09 '21 19:12 hyanwong