tskit icon indicating copy to clipboard operation
tskit copied to clipboard

span-wise union

Open petrelharp opened this issue 6 months ago • 4 comments

As discussed in #3168, something we'd like to do is squash together tree sequences describing different bits of the genome. For instance: say I've got simulations of 3L and 3R and would like to make a tree sequence of the entire chromosome 3. One possible not-very-good name: ts1.union_span(ts2)? (Or: merge?)

Possible functionality:

  1. the span covered by all edges in ts1 and in ts2 do not overlap (no overlap)
  2. any edge in ts2 describes a parent for a node in ts2 that has no parent on that span (no node-wise overlap)
  3. ts1 and ts2 can overlap but they must be identical where they do

So, where ts.union was a node-wise union, this would be an edge-wise union. Case (3) is the most general, but also definitely the most annoying to code up and I am not aware of any use cases. Case (1) would be easy and efficient: simply loop through edges in each in insertion order and check for any overlap; @hyanwong suggested special-casing this one anyhow. Case (2) would require building the trees, I think, so I'm not sure if it'd be much easier than (3).

So I think I'm proposing doing (1), and am looking for a name? Edit: this is covered by concatenate. Suggestions of use cases for (2) or (3) welcome?

petrelharp avatar May 29 '25 12:05 petrelharp

This is very similar to the (new) ts.concatenate method, but without the built-in "shift" functionality.

However, that function was using union so has some bugs, which I am now trying to fix.

If we did implement a merge, it makes concatenate trivial (it's a shift then a merge). So perhaps I should spin out the main body into a "merge" function, with appropriate checking?

hyanwong avatar May 29 '25 13:05 hyanwong

Here's what I currently have in Python. I haven't implemented checks for 1, 2, or 3, but these would presumably be picked up when running the self.sort() line. We could remove identical edges if we want to do 3. Is this the sort of thing you had in mind @petrelharp ?

    def merge(
        self,
        other,
        node_mapping,
        *,
        add_populations=None,
        check_populations=None,
        record_provenance=True,
    ):
        """
        Merge another table collection into this one, edgewise. Nodes in ``other``
        whose mapping is set to :data:`tskit.NULL` will be added as new nodes.
        All other nodes will remain unaltered (hence will be associated with the
        same populations and individuals as in ``self``). Individuals associated
        with newly created nodes will be added to the individuals table. If
        ``add_populations`` is ``True``, populations associated with newly created 
        nodes will be also be added to the populations table, otherwise it will be
        assumed that the populations for each node in ``other`` correspond to the
        populations with the same ID  in ``self``.
        

        .. seealso::
            :meth:`.union` for a combining two tables nodewise

        :param TableCollection other: Another table collection.
        :param list node_mapping: An array of node IDs that relate nodes in
            ``other`` to nodes in ``self``: the k-th element of ``node_mapping``
            should be the index of the equivalent node in ``self``, or
            ``tskit.NULL`` if the node is not present in ``self`` (in which case it
            will be added to self).
        :param bool record_provenance: If True (default), record details of this
            call to ``concatenate`` in the returned tree sequence's provenance
            information (Default: True).
        :param bool add_populations: If True, any populations referred to from nodes
            new to ``self`` will be added as new populations. If False (default)
            new populations will not be created, and populations with the same ID
            in ``self`` and the other tree sequences will be reused.
        :param bool check_populations: If True (default), check that the
            populations referred to from nodes in the other tree sequences
            are identical to those in ``self``.

        """
        if add_populations is None:
            add_populations = False  # Note that this is different from `union`: perhaps we want to default to `True`?
        if check_populations is None:
            check_populations = True

        node_map = util.safe_np_int_cast(node_mapping, np.int32, copy=True)
        if check_populations:
            nodes_pop = other.nodes.population
            if add_populations:
                # Only need to check those populations used by nodes in the node_mapping
                nodes_pop = nodes_pop[node_map != tskit.NULL]
            for i in np.unique(nodes_pop[nodes_pop != tskit.NULL]):
                if self.populations[i] != other.populations[i]:
                    raise ValueError("Non-matching populations")
            individual_map = {}
            population_map = {}
            for new_node in np.where(node_map == tskit.NULL)[0]:
                params = {}
                node = other.nodes[new_node]
                if node.individual != tskit.NULL:
                    if node.individual not in individual_map:
                        individual_map[node.individual] = self.individuals.append(
                            other.individuals[node.individual]
                        )
                    params["individual"] = individual_map[node.individual]
                if node.population != tskit.NULL:
                    if add_populations:
                        if node.population not in population_map:
                            population_map[node.population] = self.populations.append(
                                other.populations[node.population]
                            )
                        params["population"] = population_map[node.population]
                    else:
                        if node.population >= self.populations.num_rows:
                            raise ValueError(
                                "One of the tree sequences to concatenate has a "
                                "population not present in the existing tree sequence")
                node_map[new_node] = self.nodes.append(node.replace(**params))

            for e in other.edges:
                self.edges.append(
                    e.replace(child=node_map[e.child], parent=node_map[e.parent])
                )
            for site in other.sites:
                self.sites.append(site)
            for mut in other.mutations:
                self.mutations.append(mut.replace(node=node_map[mut.node]))
            for mig in other.migrations:
                self.migrations.append(mig.replace(
                    node=node_map[mig.node],
                    source=population_map.get(mig.source, mig.source),
                    dest=population_map.get(mig.dest, mig.dest),
                ))
        self.sort()
    
        if record_provenance:
            parameters = {
                "command": "merge",
                "TODO": "add merge parameters",  # tricky as both have provenances
            }
            self.provenances.add_row(
                record=json.dumps(provenance.get_provenance_dict(parameters))
            )

hyanwong avatar May 29 '25 13:05 hyanwong

Oh, right! concatenate, aha. #3164, #3165 - I think this addresses (1) quite well.

And, very nice! This is a nice way to do (2), much simpler than I was imagining. As you say, conflicts will be caught when making it a tree sequence. Hm, and you're right, (3) could be done easily by checking for and removing identical edges post-sort. (We'd just need to say that if we've got two edges where one is a subset of the other we don't deal with that case: edges must be disjoint or identical.)

I guess my vote is that the method you've got here is called merge_edges, or if it does (3) then edge_union or union_edges, and that it be a TableCollection method, not a TreeSequence method (this looks like what you've got anyhow, but rationale: this operation does not require building trees, and this sort of mucking about editing tables is best done only in table-land).

Most of the complexity of union was due to the check_shared_overlap, and that was desired because of our use case.

petrelharp avatar May 29 '25 16:05 petrelharp

The proposal now (which @petrelharp and I agree with) is to modify union to allow this behaviour. See https://github.com/tskit-dev/tskit/pull/3183#issuecomment-2925689823

hyanwong avatar Jun 10 '25 13:06 hyanwong

Finally fixed in #3283

hyanwong avatar Nov 09 '25 22:11 hyanwong