tskit
tskit copied to clipboard
add `individuals=` argument to simplify
An error that I have seen quite a few people make already is that they pass in a list of individual ids to samples=, instead of their constituent nodes. It's an easy mistake to make. If we had an "individuals" argument to simplify, this could help avoid this confusion. It might work like this:
for ind in individuals:
nn = self.individual(ind).nodes
warned = False
for n in n:
if (not warned) and n in samples:
raise warnings.warn("Some individual nodes are already present in the 'samples' argument - did you mean to pass both individuals and their nodes?")
warned = True
samples.append(n)
Good idea. Why not make the individuals and samples arguments mutually exclusive though? Seems simpler, and I can't think of a good reason for mixing them.
Yeah, you're right. I guess the main question is whether to make samples optional with default None or make people pass in None explicitly (the first, I imagine?).
I think samples defaults to None already (in this case it'll take all the samples in the ts). So another individuals=None argument should work fine.
Here's the nuclear option. We could do the above and then also make all the arguments to simplify keyword-only, so users would have to say either samples= or individuals= in the call. It would break a lot of code (mostly our own, I think), so we'd have to have a release where we issued a deprecation warning first. But it would (hopefully) reduce the number of times users would make the error. Almost certainly too extreme a solution, but I thought it worth mentioning.
+1 for this option. Simplifying to n individuals by looking just at their nodes does not always result in n individuals after simplification.
Consider a temporally sampled tree sequence, and the desire to simplify to a subset (say A) of individuals. Suppose also that in the pre-simplified tree sequence, there is an individual not in A which contains a node that is an ancestor to individuals in A. As far as I can tell, simplifying to the nodes in the A subset of individuals also retains the ancestor individual who is not in A.
Consider a temporally sampled tree sequence, and the desire to simplify to a subset (say A) of individuals. Suppose also that in the pre-simplified tree sequence, there is an individual not in A which contains a node that is an ancestor to individuals in A. As far as I can tell, simplifying to the nodes in the A subset of individuals also retains the ancestor individual who is not in A.
That's true, and by design: what simplify does is to retain only information about the samples and the nodes necessary to describe their genealogies. "Individuals" are really just metadata about nodes, and we don't throw away the metadata of the ancestral nodes either.
But, I realize that this is a common error, to assume that samples and individuals are interchangeable. This is why in pyslim we've got individuals_alive_at( ), to easily pull out the individuals you want. I guess I think we should deal with that elsewhere, not in simplify? Options might be (a) remind people more about this common error (b) provide a tool to remove certain individuals so people can 'tidy' up their tree sequence (c) provide a tool to more easily extract the individuals they want (I think we have an issue for this: e.g., give me the individuals attached to these nodes).
I was hoping that an individuals= option might return a tree sequence with the individuals table limited to the specified individuals. Wouldn't that be the least surprising thing?
I was hoping that an individuals= option might return a tree sequence with the individuals table limited to the specified individuals. Wouldn't that be the least surprising thing?
I don't think so - for instance, you might argue by analogy that you should also have those individuals parents, now that we have those. I think the least confusing thing is just to make individuals= argument basically an alias for nodes=(the nodes in those individuals.
I was hoping that an
individuals=option might return a tree sequence with the individuals table limited to the specified individuals. Wouldn't that be the least surprising thing?
Perhaps what you want, @grahamgower, is a delete_individuals function. You call that first, then simplify(individuals=np.arange(ts.num_individuals)). This does, however, result in the nodes associated with all the individuals being marked as samples, which might not be what you want (there are several simulation cases where non-sample nodes are associated with individuals)?