span-wise union
As discussed in #3168, something we'd like to do is squash together tree sequences describing different bits of the genome. For instance: say I've got simulations of 3L and 3R and would like to make a tree sequence of the entire chromosome 3. One possible not-very-good name: ts1.union_span(ts2)? (Or: merge?)
Possible functionality:
- the span covered by all edges in ts1 and in ts2 do not overlap (no overlap)
- any edge in ts2 describes a parent for a node in ts2 that has no parent on that span (no node-wise overlap)
- ts1 and ts2 can overlap but they must be identical where they do
So, where ts.union was a node-wise union, this would be an edge-wise union. Case (3) is the most general, but also definitely the most annoying to code up and I am not aware of any use cases. Case (1) would be easy and efficient: simply loop through edges in each in insertion order and check for any overlap; @hyanwong suggested special-casing this one anyhow. Case (2) would require building the trees, I think, so I'm not sure if it'd be much easier than (3).
So I think I'm proposing doing (1), and am looking for a name? Edit: this is covered by concatenate. Suggestions of use cases for (2) or (3) welcome?
This is very similar to the (new) ts.concatenate method, but without the built-in "shift" functionality.
However, that function was using union so has some bugs, which I am now trying to fix.
If we did implement a merge, it makes concatenate trivial (it's a shift then a merge). So perhaps I should spin out the main body into a "merge" function, with appropriate checking?
Here's what I currently have in Python. I haven't implemented checks for 1, 2, or 3, but these would presumably be picked up when running the self.sort() line. We could remove identical edges if we want to do 3. Is this the sort of thing you had in mind @petrelharp ?
def merge(
self,
other,
node_mapping,
*,
add_populations=None,
check_populations=None,
record_provenance=True,
):
"""
Merge another table collection into this one, edgewise. Nodes in ``other``
whose mapping is set to :data:`tskit.NULL` will be added as new nodes.
All other nodes will remain unaltered (hence will be associated with the
same populations and individuals as in ``self``). Individuals associated
with newly created nodes will be added to the individuals table. If
``add_populations`` is ``True``, populations associated with newly created
nodes will be also be added to the populations table, otherwise it will be
assumed that the populations for each node in ``other`` correspond to the
populations with the same ID in ``self``.
.. seealso::
:meth:`.union` for a combining two tables nodewise
:param TableCollection other: Another table collection.
:param list node_mapping: An array of node IDs that relate nodes in
``other`` to nodes in ``self``: the k-th element of ``node_mapping``
should be the index of the equivalent node in ``self``, or
``tskit.NULL`` if the node is not present in ``self`` (in which case it
will be added to self).
:param bool record_provenance: If True (default), record details of this
call to ``concatenate`` in the returned tree sequence's provenance
information (Default: True).
:param bool add_populations: If True, any populations referred to from nodes
new to ``self`` will be added as new populations. If False (default)
new populations will not be created, and populations with the same ID
in ``self`` and the other tree sequences will be reused.
:param bool check_populations: If True (default), check that the
populations referred to from nodes in the other tree sequences
are identical to those in ``self``.
"""
if add_populations is None:
add_populations = False # Note that this is different from `union`: perhaps we want to default to `True`?
if check_populations is None:
check_populations = True
node_map = util.safe_np_int_cast(node_mapping, np.int32, copy=True)
if check_populations:
nodes_pop = other.nodes.population
if add_populations:
# Only need to check those populations used by nodes in the node_mapping
nodes_pop = nodes_pop[node_map != tskit.NULL]
for i in np.unique(nodes_pop[nodes_pop != tskit.NULL]):
if self.populations[i] != other.populations[i]:
raise ValueError("Non-matching populations")
individual_map = {}
population_map = {}
for new_node in np.where(node_map == tskit.NULL)[0]:
params = {}
node = other.nodes[new_node]
if node.individual != tskit.NULL:
if node.individual not in individual_map:
individual_map[node.individual] = self.individuals.append(
other.individuals[node.individual]
)
params["individual"] = individual_map[node.individual]
if node.population != tskit.NULL:
if add_populations:
if node.population not in population_map:
population_map[node.population] = self.populations.append(
other.populations[node.population]
)
params["population"] = population_map[node.population]
else:
if node.population >= self.populations.num_rows:
raise ValueError(
"One of the tree sequences to concatenate has a "
"population not present in the existing tree sequence")
node_map[new_node] = self.nodes.append(node.replace(**params))
for e in other.edges:
self.edges.append(
e.replace(child=node_map[e.child], parent=node_map[e.parent])
)
for site in other.sites:
self.sites.append(site)
for mut in other.mutations:
self.mutations.append(mut.replace(node=node_map[mut.node]))
for mig in other.migrations:
self.migrations.append(mig.replace(
node=node_map[mig.node],
source=population_map.get(mig.source, mig.source),
dest=population_map.get(mig.dest, mig.dest),
))
self.sort()
if record_provenance:
parameters = {
"command": "merge",
"TODO": "add merge parameters", # tricky as both have provenances
}
self.provenances.add_row(
record=json.dumps(provenance.get_provenance_dict(parameters))
)
Oh, right! concatenate, aha. #3164, #3165 - I think this addresses (1) quite well.
And, very nice! This is a nice way to do (2), much simpler than I was imagining. As you say, conflicts will be caught when making it a tree sequence. Hm, and you're right, (3) could be done easily by checking for and removing identical edges post-sort. (We'd just need to say that if we've got two edges where one is a subset of the other we don't deal with that case: edges must be disjoint or identical.)
I guess my vote is that the method you've got here is called merge_edges, or if it does (3) then edge_union or union_edges, and that it be a TableCollection method, not a TreeSequence method (this looks like what you've got anyhow, but rationale: this operation does not require building trees, and this sort of mucking about editing tables is best done only in table-land).
Most of the complexity of union was due to the check_shared_overlap, and that was desired because of our use case.
The proposal now (which @petrelharp and I agree with) is to modify union to allow this behaviour. See https://github.com/tskit-dev/tskit/pull/3183#issuecomment-2925689823
Finally fixed in #3283