Create a new merge() function, and use that for concatenate
Description
It appears as if union wasn't really the right underlying function to use for concatenate, e.g. it doesn't deal as expected with sites above samples, so I had to roll my own merge utility.
Fixes #3181
PR Checklist:
- [x] Tests that fully cover new/changed functionality.
- [x] Documentation including tutorial content if appropriate.
- [x] Changelogs, if there are API changes.
Codecov Report
:white_check_mark: All modified and coverable lines are covered by tests.
:white_check_mark: Project coverage is 86.76%. Comparing base (9d2ff8e) to head (294dd89).
:warning: Report is 76 commits behind head on main.
:exclamation: There is a different number of reports uploaded between BASE (9d2ff8e) and HEAD (294dd89). Click for more details.
HEAD has 6 uploads less than BASE
Flag BASE (9d2ff8e) HEAD (294dd89) python-tests 6 0
Additional details and impacted files
@@ Coverage Diff @@
## main #3183 +/- ##
===========================================
- Coverage 98.91% 86.76% -12.16%
===========================================
Files 17 11 -6
Lines 7542 24427 +16885
Branches 1258 4615 +3357
===========================================
+ Hits 7460 21193 +13733
- Misses 33 1853 +1820
- Partials 49 1381 +1332
| Flag | Coverage Δ | |
|---|---|---|
| c-tests | 86.66% <ø> (?) |
|
| lwt-tests | 80.38% <ø> (?) |
|
| python-c-tests | 88.18% <ø> (?) |
|
| python-tests | ? |
Flags with carried forward coverage won't be shown. Click here to find out more. see 28 files with indirect coverage changes
:rocket: New features to boost your workflow:
- :snowflake: Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
I strongly vote for a more descriptive name than merge, since (as we saw with union) a generic name can lead to wrong expectations. Why not union_edges or merge_edges? (this also merges eg mutations but we could think of mutations as being carried along with the edge they're on)
I also wonder if we actually want the TreeSequence method, as these kinds of operations are best done on tables?
Might you be able to write down an explicit list of what exactly gets added, in the style of union?
Other things needed for the docs:
- a warning that the result might not be a legal tree sequence
- more explicit description of the arguments in the TreeSequence method
We also need some tests that use the WF-with-overlapping-gens sims (and so thus might have some samples that are parents of others, etc) - I think we have this built in to tsutil.
Thanks @petrelharp
I strongly vote for a more descriptive name than merge
Yep, I can see that argument. I don't think of this as purely edge-based, however: it also merges mutations, sites, migrations, etc.
I also wonder if we actually want the TreeSequence method
The tree sequence method comes along for free, but we could easily 'nix it. We tend to expose these table methods to tree-sequence space too, like union, so it seemed remiss not to?
a warning that the result might not be a legal tree sequence
I think it always will be, because of the sort() call? Otherwise that will fail, I think.
more explicit description of the arguments in the TreeSequence method
Good catch: I forgot that and have now added it.
p.s. I set add_populations=False in the concatenate method, overriding the default, as I expect concatenate to be used e.g. on multiple chromosomes with the same population definitions, and it would be confusing to create new populations for each chromosome, I feel.
Defaulting to add_populations=True for the merge method seems OK though, as this is more of a low-level thing, and matches union.
Might you be able to write down an explicit list of what exactly gets added, in the style of union?
Good idea. I'll do that, and add some WF-with-overlapping-gens type tests (I have hacked a few with internal samples in there already, but I can add one or two more)
Added the following list, and a couple of WF tests for concatenate. We still need to agree on a different name than "merge" though...
Items that will be ported from ``other`` into ``self`` (and given new IDs) are:
1. All edges in ``other``
2. All migrations in ``other``
3. All mutations in ``other``
4. Sites whose positions are new to ``self``
5. Individuals whose nodes are new to ``self``.
6. If ``add_populations=True``, populations whose nodes are new to ``self``
We still need to agree on a different name than "merge" though
Here's a suggestion for reducing confusion. I implement a 'union' like part to the function in this PR which skips merging edges if they are identical. Then we call this one union_edges and rename the existing union to union_nodes. We keep the name union as a deprecated (but permanently maintained) alias to union_nodes (we could or could not document it, I suppose)
a warning that the result might not be a legal tree sequence I think it always will be, because of the sort() call? Otherwise that will fail, I think.
Yes, that's the point - I'm imagining a sentence after the list of things that will be added which says "If this results in an invalid tree sequence, this will produce an error."
TODO: put this in https://tskit.dev/tskit/docs/latest/python-api.html#modification
Items that will be ported from
otherintoself(and given new IDs) are:
- All edges in
other- All migrations in
other- All mutations in
other- Sites whose positions are new to
self- Individuals whose nodes are new to
self.- If
add_populations=True, populations whose nodes are new to ``self`
This list is missing the new nodes thenselves? (those for which node_mapping is NULL); otherwise, looks good!
A general question is how much checking we do: do we check for equality of nodes that are the same? Of sites? (they could have metadata) If we allow overlapping stuff, what about mutations?
I think by default we do check it - the checking with union is very annoying but in nearly all cases where it fails because of the check_shared_overlap, the thing the person was trying to do was not right.
About the name - I do think this could be called "edgewise" (even though it does mutations and migrations) because those could be thought of as living on edges?
I guess I'm up for the renaming of union to union_node (should it be union_nodes?), although maybe it's confusing because it adds on more than just nodes, just as this one adds on more than just edges. I need to think about this.
Thinking out loud here: what's the difference between this and union? (I'll call "this" merge here.)
Comparing union to the list of what's added here:
mergewill add edges between existing nodes;unionwon't.mergewill add mutations above existing nodes;unionwon't.mergeandunionadd the same individuals; neither adds new individuals whose nodes are all shared between the two.
Note in particular that merge does strictly more than union.
So: merge just basically goes ahead and adds everything (except not all individuals); whereas union only adds stuff related to new nodes. The reason for this is the different use cases:
unionis designed for the case where they share history entirely past some point but have diverged more recently;mergeis designed for the case when they describe dijoint sections of genome.
What would union do, abstractly? Well, if a set A is the disjoint union of X and Y, and B is the disjoint union of Y and Z; then unioning A and B adds Z to A; so a fully general tree sequence union would need to figure out what Y is (ie what's the intersection). This is in general a hard (at least annoying) problem and we don't have use cases for the general case, so we're not trying to do this. Instead, we're specializing to these two special cases of union, where (union) Y is "attributes of and relationships between a given set of existing nodes"; and (merge) Y is "a given set of existing nodes".
So - given that merge does strictly more than union, and otherwise they are quite similar, it seems like a good idea to just add a new argument to union which toggles the behavior of merge. Probably something qutie simple like ts.union(other, all_edges=True, all_mutations=True)? These would turn off check_shared_equality. (I thought about calling it disjoint=True; or disjoint_edges=True or some such, since we're basically assuming that the edges to be disjoint, but this is more straightforward.) I think this would be straightforward: just changing this condition and this condition.
How does this sound?
Note: I found another current use of union: https://tskit.dev/pyslim/docs/latest/vignette_continuing.html#continuing-the-simulation; this one could be done with either union or merge.
How does this sound?
I like the idea of not creating more confusion by similarly named functions, so something like that SGTM.
I guess the current function could be used as a python implementation in the test suite if we want. How about the checking of e.g. populations against each other (which includes metadata). I suppose this would be a good thing for the current union too?
I guess the current function could be used as a python implementation in the test suite if we want. How about the checking of e.g. populations against each other (which includes metadata). I suppose this would be a good thing for the current union too?
Good question. I believe the current union does this by check_shared_overlap (but we can't use that for the new option).
Want me to have a go at the C version?
Want me to have a go at the C version?
I would love it if you have time, thanks so much.
I'm happy to modify my python tests to check it works as intended.
I converted this to draft, as @petrelharp 's suggestion is different, but we should use the tests from this one.
Closing in favour of #3283