tskit icon indicating copy to clipboard operation
tskit copied to clipboard

Create a new merge() function, and use that for concatenate

Open hyanwong opened this issue 6 months ago • 19 comments

Description

It appears as if union wasn't really the right underlying function to use for concatenate, e.g. it doesn't deal as expected with sites above samples, so I had to roll my own merge utility.

Fixes #3181

PR Checklist:

  • [x] Tests that fully cover new/changed functionality.
  • [x] Documentation including tutorial content if appropriate.
  • [x] Changelogs, if there are API changes.

hyanwong avatar May 29 '25 17:05 hyanwong

Codecov Report

:white_check_mark: All modified and coverable lines are covered by tests. :white_check_mark: Project coverage is 86.76%. Comparing base (9d2ff8e) to head (294dd89). :warning: Report is 76 commits behind head on main.

:exclamation: There is a different number of reports uploaded between BASE (9d2ff8e) and HEAD (294dd89). Click for more details.

HEAD has 6 uploads less than BASE
Flag BASE (9d2ff8e) HEAD (294dd89)
python-tests 6 0
Additional details and impacted files
@@             Coverage Diff             @@
##             main    #3183       +/-   ##
===========================================
- Coverage   98.91%   86.76%   -12.16%     
===========================================
  Files          17       11        -6     
  Lines        7542    24427    +16885     
  Branches     1258     4615     +3357     
===========================================
+ Hits         7460    21193    +13733     
- Misses         33     1853     +1820     
- Partials       49     1381     +1332     
Flag Coverage Δ
c-tests 86.66% <ø> (?)
lwt-tests 80.38% <ø> (?)
python-c-tests 88.18% <ø> (?)
python-tests ?

Flags with carried forward coverage won't be shown. Click here to find out more. see 28 files with indirect coverage changes

:rocket: New features to boost your workflow:
  • :snowflake: Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

codecov[bot] avatar May 29 '25 17:05 codecov[bot]

I strongly vote for a more descriptive name than merge, since (as we saw with union) a generic name can lead to wrong expectations. Why not union_edges or merge_edges? (this also merges eg mutations but we could think of mutations as being carried along with the edge they're on)

I also wonder if we actually want the TreeSequence method, as these kinds of operations are best done on tables?

Might you be able to write down an explicit list of what exactly gets added, in the style of union?

Other things needed for the docs:

  • a warning that the result might not be a legal tree sequence
  • more explicit description of the arguments in the TreeSequence method

petrelharp avatar May 29 '25 17:05 petrelharp

We also need some tests that use the WF-with-overlapping-gens sims (and so thus might have some samples that are parents of others, etc) - I think we have this built in to tsutil.

petrelharp avatar May 29 '25 17:05 petrelharp

Thanks @petrelharp

I strongly vote for a more descriptive name than merge

Yep, I can see that argument. I don't think of this as purely edge-based, however: it also merges mutations, sites, migrations, etc.

I also wonder if we actually want the TreeSequence method

The tree sequence method comes along for free, but we could easily 'nix it. We tend to expose these table methods to tree-sequence space too, like union, so it seemed remiss not to?

a warning that the result might not be a legal tree sequence

I think it always will be, because of the sort() call? Otherwise that will fail, I think.

more explicit description of the arguments in the TreeSequence method

Good catch: I forgot that and have now added it.

hyanwong avatar May 29 '25 17:05 hyanwong

p.s. I set add_populations=False in the concatenate method, overriding the default, as I expect concatenate to be used e.g. on multiple chromosomes with the same population definitions, and it would be confusing to create new populations for each chromosome, I feel.

Defaulting to add_populations=True for the merge method seems OK though, as this is more of a low-level thing, and matches union.

hyanwong avatar May 29 '25 17:05 hyanwong

Might you be able to write down an explicit list of what exactly gets added, in the style of union?

Good idea. I'll do that, and add some WF-with-overlapping-gens type tests (I have hacked a few with internal samples in there already, but I can add one or two more)

hyanwong avatar May 29 '25 17:05 hyanwong

Added the following list, and a couple of WF tests for concatenate. We still need to agree on a different name than "merge" though...

Items that will be ported from ``other`` into ``self`` (and given new IDs) are:

1. All edges in ``other``
2. All migrations in ``other``
3. All mutations in ``other``
4. Sites whose positions are new to ``self``
5. Individuals whose nodes are new to ``self``.
6. If ``add_populations=True``, populations whose nodes are new to ``self``

hyanwong avatar May 29 '25 19:05 hyanwong

We still need to agree on a different name than "merge" though

Here's a suggestion for reducing confusion. I implement a 'union' like part to the function in this PR which skips merging edges if they are identical. Then we call this one union_edges and rename the existing union to union_nodes. We keep the name union as a deprecated (but permanently maintained) alias to union_nodes (we could or could not document it, I suppose)

hyanwong avatar May 29 '25 21:05 hyanwong

a warning that the result might not be a legal tree sequence I think it always will be, because of the sort() call? Otherwise that will fail, I think.

Yes, that's the point - I'm imagining a sentence after the list of things that will be added which says "If this results in an invalid tree sequence, this will produce an error."

petrelharp avatar May 30 '25 13:05 petrelharp

TODO: put this in https://tskit.dev/tskit/docs/latest/python-api.html#modification

petrelharp avatar May 30 '25 13:05 petrelharp

Items that will be ported from other into self (and given new IDs) are:

  1. All edges in other
  2. All migrations in other
  3. All mutations in other
  4. Sites whose positions are new to self
  5. Individuals whose nodes are new to self.
  6. If add_populations=True, populations whose nodes are new to ``self`

This list is missing the new nodes thenselves? (those for which node_mapping is NULL); otherwise, looks good!

petrelharp avatar May 30 '25 14:05 petrelharp

A general question is how much checking we do: do we check for equality of nodes that are the same? Of sites? (they could have metadata) If we allow overlapping stuff, what about mutations?

I think by default we do check it - the checking with union is very annoying but in nearly all cases where it fails because of the check_shared_overlap, the thing the person was trying to do was not right.

petrelharp avatar May 30 '25 14:05 petrelharp

About the name - I do think this could be called "edgewise" (even though it does mutations and migrations) because those could be thought of as living on edges?

I guess I'm up for the renaming of union to union_node (should it be union_nodes?), although maybe it's confusing because it adds on more than just nodes, just as this one adds on more than just edges. I need to think about this.

petrelharp avatar May 30 '25 14:05 petrelharp

Thinking out loud here: what's the difference between this and union? (I'll call "this" merge here.) Comparing union to the list of what's added here:

  • merge will add edges between existing nodes; union won't.
  • merge will add mutations above existing nodes; union won't.
  • merge and union add the same individuals; neither adds new individuals whose nodes are all shared between the two.

Note in particular that merge does strictly more than union.

So: merge just basically goes ahead and adds everything (except not all individuals); whereas union only adds stuff related to new nodes. The reason for this is the different use cases:

  • union is designed for the case where they share history entirely past some point but have diverged more recently;
  • merge is designed for the case when they describe dijoint sections of genome.

What would union do, abstractly? Well, if a set A is the disjoint union of X and Y, and B is the disjoint union of Y and Z; then unioning A and B adds Z to A; so a fully general tree sequence union would need to figure out what Y is (ie what's the intersection). This is in general a hard (at least annoying) problem and we don't have use cases for the general case, so we're not trying to do this. Instead, we're specializing to these two special cases of union, where (union) Y is "attributes of and relationships between a given set of existing nodes"; and (merge) Y is "a given set of existing nodes".

So - given that merge does strictly more than union, and otherwise they are quite similar, it seems like a good idea to just add a new argument to union which toggles the behavior of merge. Probably something qutie simple like ts.union(other, all_edges=True, all_mutations=True)? These would turn off check_shared_equality. (I thought about calling it disjoint=True; or disjoint_edges=True or some such, since we're basically assuming that the edges to be disjoint, but this is more straightforward.) I think this would be straightforward: just changing this condition and this condition.

How does this sound?

petrelharp avatar May 31 '25 20:05 petrelharp

Note: I found another current use of union: https://tskit.dev/pyslim/docs/latest/vignette_continuing.html#continuing-the-simulation; this one could be done with either union or merge.

petrelharp avatar Jun 01 '25 00:06 petrelharp

How does this sound?

I like the idea of not creating more confusion by similarly named functions, so something like that SGTM.

I guess the current function could be used as a python implementation in the test suite if we want. How about the checking of e.g. populations against each other (which includes metadata). I suppose this would be a good thing for the current union too?

hyanwong avatar Jun 01 '25 07:06 hyanwong

I guess the current function could be used as a python implementation in the test suite if we want. How about the checking of e.g. populations against each other (which includes metadata). I suppose this would be a good thing for the current union too?

Good question. I believe the current union does this by check_shared_overlap (but we can't use that for the new option).

Want me to have a go at the C version?

petrelharp avatar Jun 01 '25 14:06 petrelharp

Want me to have a go at the C version?

I would love it if you have time, thanks so much.

I'm happy to modify my python tests to check it works as intended.

hyanwong avatar Jun 01 '25 15:06 hyanwong

I converted this to draft, as @petrelharp 's suggestion is different, but we should use the tests from this one.

hyanwong avatar Jun 11 '25 12:06 hyanwong

Closing in favour of #3283

hyanwong avatar Oct 08 '25 19:10 hyanwong