tskit icon indicating copy to clipboard operation
tskit copied to clipboard

sort so mutation parents are above children

Open petrelharp opened this issue 4 years ago • 9 comments

Now #1199 does this for individuals (yay!) but we still don't have a method, besides canonicalise, that will do this. Seems like it should be a standard part of sort. We could:

  1. decide that people who need this should use canonicalise, or
  2. copy the relevant code over from canonicalise, or
  3. be fancier and make it optional for performance reasons

I think the main impact on performance would be memory use? And, it's probably not a big deal?

I think this is the same thing as #1227, but perhaps what's happening there is different (and thus a bug).

petrelharp avatar Mar 04 '21 17:03 petrelharp

Would we do this using the same topological sort algorithm as individuals @petrelharp? If so, I'd vote for taking the same strategy as individuals: doing the O(n) topological sort as for TableCollection.sort() and the more expensive version for canonicalise(). I'm sure we could abstract the topological sorting code out if we wanted to, but it may not be worth it.

The next question is, when do we need this? I think we can probably put this off until the next C release (i.e., not the impending one), since we have non-parent sorting code for mutations out in the wild already?

jeromekelleher avatar Mar 05 '21 08:03 jeromekelleher

I'd also like to defer this until the next release as this one is already long in the tooth.

benjeffery avatar Mar 05 '21 11:03 benjeffery

Yep, no hurry on this.

petrelharp avatar Mar 05 '21 12:03 petrelharp

About the algorithm: if the number of mutations per site is k, the canonicalise implementation is O(n log k), so I don't think the topological sort has any practical advantages (it might even be slower in typical cases?).

petrelharp avatar Mar 05 '21 12:03 petrelharp

About the algorithm: if the number of mutations per site is k, the canonicalise implementation is O(n log k), so I don't think the topological sort has any practical advantages (it might even be slower in typical cases?).

Yes, that's true, but consistency is nice? I find the topological sort algorithm very pretty, and I really doubt it could be slower, as calling qsort() lots of times is much more expensive than a bunch of linear passes through the arrays.

jeromekelleher avatar Mar 05 '21 12:03 jeromekelleher

Note: The test at https://github.com/tskit-dev/tskit/blob/main/python/tests/test_tables.py#L4195 should be reinstated when this is fixed.

benjeffery avatar Apr 21 '21 22:04 benjeffery

Adding this to the 1.0 release for now. I guess it's not a critical thing API wise, though?

jeromekelleher avatar Jul 23 '21 13:07 jeromekelleher

@petrelharp we're triaging stuff for the C 1.0 API. Since this doesn't affect the API as such, I think we can probably drop it from this milestone (and move to C Upcoming). Unless you'd like to pick it up?

jeromekelleher avatar Sep 10 '21 13:09 jeromekelleher

Sounds good.

petrelharp avatar Sep 10 '21 16:09 petrelharp

Fixed in #3257

benjeffery avatar Sep 23 '25 16:09 benjeffery