tskit
Document visual ways of summarising subtrees (clades) when plotting
As part of https://github.com/tskit-dev/tutorials/issues/182 we should think of alternative ways of showing big trees. One possibility is to collapse clades in a tree if e.g. all the nodes underneath belong to the same population (or have no population). I think this is more of a viz issue than a tree-sequence manipulation issue, as it would be done per tree, and we would want to visually distinguish the collapsed clades somehow (perhaps we could use a larger triangle: I don't know if we would want to vary the triangle size depending on the number of samples in the clade).
We could allow this to happen even if e.g. a small proportion of the samples are in a different population. But then that gets pretty complicated. A more sophisticated thing would be to replace the circular internal nodes with a pie chart of the proportions of sample tips underneath the node. I guess in the viz you could specify which nodes you wanted to do this for.
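A minimal sketch of the bookkeeping both ideas need, assuming a plain dict representation rather than the tskit API: `children` and `population` are hypothetical stand-ins for `tree.children(u)` and a per-tip population label. It counts tip populations beneath every node (the data a pie chart would show) and finds the topmost clades whose tips all share one population.

```python
from collections import Counter


def population_counts(children, population, root):
    """Map each node to a Counter of the tip populations beneath it."""
    counts = {}
    # Iterative post-order: a node is finished only after its children.
    stack = [(root, False)]
    while stack:
        u, expanded = stack.pop()
        if expanded:
            total = Counter()
            for c in children[u]:
                total += counts[c]
            counts[u] = total
        elif not children[u]:
            counts[u] = Counter([population[u]])
        else:
            stack.append((u, True))
            stack.extend((c, False) for c in children[u])
    return counts


def pure_clade_roots(children, counts, root):
    """Return the topmost nodes whose tips all belong to one population."""
    result = []
    stack = [root]
    while stack:
        u = stack.pop()
        if len(counts[u]) == 1:
            result.append(u)  # collapse here and do not descend further
        else:
            stack.extend(children[u])
    return sorted(result)


# Toy tree (hypothetical data): node 6 is the root, nodes 0-3 are tips.
children = {6: [4, 5], 4: [0, 1], 5: [2, 3], 0: [], 1: [], 2: [], 3: []}
population = {0: "A", 1: "A", 2: "A", 3: "B"}
counts = population_counts(children, population, 6)
collapsible = pure_clade_roots(children, counts, 6)
```

Here node 4 is a single-population clade and could be drawn collapsed, while node 5 is mixed and would need the pie-chart treatment (`counts[5]` gives the proportions). Tips are trivially "pure", so they also appear in `collapsible`.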
@savitakartik would be interested in this.
One thing we could do without requiring population semantics is to have a cutoff on the number of leaves that we draw. I'd find this super helpful for the SARS-CoV-2 trees, where you'd like to look at the deep structure of the tree. It would really help if we could always draw something quickly.
So, basically if you hit an internal node that has > sample_threshold samples below it, we draw a big box which says (X samples) and stop traversing downwards at that point.
Ah yes, that's a nice idea. Another possibility if the tree is dated is to cut off a section at the bottom such that only X lineages are in the resulting tree (and summarise the tips somehow).
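A rough sketch of how such a cut time could be chosen, assuming plain dicts `time` (node to time) and `parent` (node to parent, `None` for the root) as hypothetical stand-ins for the node times and `tree.parent(u)`. It scans candidate node times from the present backwards and returns the first one crossed by at most the requested number of branches.

```python
def lineages_at(time, parent, t):
    """Count branches crossing time t (child at or below t, parent above)."""
    return sum(
        1
        for u, p in parent.items()
        if p is not None and time[u] <= t < time[p]
    )


def cut_time(time, parent, max_lineages):
    """Most recent node time with at most max_lineages branches crossing it."""
    for t in sorted(set(time.values())):
        if lineages_at(time, parent, t) <= max_lineages:
            return t
    return None


# Toy dated tree (hypothetical data): tips 0-3 at time 0, root 6 at time 3.
time = {0: 0, 1: 0, 2: 0, 3: 0, 4: 1, 5: 2, 6: 3}
parent = {0: 4, 1: 4, 2: 5, 3: 5, 4: 6, 5: 6, 6: None}
t_cut = cut_time(time, parent, max_lineages=3)
```

Everything younger than `t_cut` would then be summarised rather than drawn. The scan assumes the lineage count does not increase going back in time, which holds when all tips are contemporaneous but not necessarily otherwise.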
Lots of these operations are probably per-tree, rather than on the entire ts, by the way.
Here's a quick hack where we limit drawing by depth:
import msprime
import numpy as np
import tskit

ts = msprime.sim_ancestry(10, random_seed=1)
print(ts.first().draw_text())


def chop_draw(tree, max_depth):
    ts = tree.tree_sequence
    # Work on an editable copy: keep the nodes, rebuild the edges.
    tables = ts.dump_tables()
    tables.edges.clear()
    # Unmark all samples; only the collapsed nodes are re-flagged below.
    tables.nodes.flags = np.zeros_like(tables.nodes.flags)
    stack = [(root, 0) for root in tree.roots]
    node_labels = {}
    while len(stack) > 0:
        u, depth = stack.pop()
        node = ts.node(u)
        node_labels[u] = f"{u}"
        if depth < max_depth:
            for v in tree.children(u):
                stack.append((v, depth + 1))
        else:
            # Stop descending: this node stands in for its whole clade.
            node_labels[u] = f"{tree.num_samples(u)} samples"
            node = node.replace(flags=tskit.NODE_IS_SAMPLE)
            tables.nodes[u] = node
        parent = tree.parent(u)
        if parent != tskit.NULL:
            # Use the tree's own interval so this also works for trees
            # other than the first one in the tree sequence.
            tables.edges.add_row(
                tree.interval.left, tree.interval.right, parent, u
            )
    tables.sort()
    ts = tables.tree_sequence()
    ctree = ts.at(tree.interval.left)
    print(ctree.draw_text(node_labels=node_labels))


chop_draw(ts.first(), 4)
gives
38
┏━━━━━━━━┻━━━━━━━┓
┃ 37
┃ ┏━┻━┓
┃ ┃ 36
┃ ┃ ┏┻━┓
35 ┃ ┃ ┃
┏━━━━━━━━┻━━━━━━━━┓ ┃ ┃ ┃
34 ┃ ┃ ┃ ┃
┏━━━━━━━━┻━━━━━━━━━┓ ┃ ┃ ┃ ┃
33 ┃ ┃ ┃ ┃ ┃
┏━━━━━┻━━━━━┓ ┃ ┃ ┃ ┃ ┃
┃ ┃ 32 ┃ ┃ ┃ ┃
┃ ┃ ┏━━━┻━━━┓ ┃ ┃ ┃ ┃
┃ ┃ ┃ ┃ 31 ┃ ┃ ┃
┃ ┃ ┃ ┃ ┏┻━┓ ┃ ┃ ┃
┃ 30 ┃ ┃ ┃ ┃ ┃ ┃ ┃
┃ ┏━━┻━━┓ ┃ ┃ ┃ ┃ ┃ ┃ ┃
┃ ┃ ┃ 29 ┃ ┃ ┃ ┃ ┃ ┃
┃ ┃ ┃ ┏━┻━━┓ ┃ ┃ ┃ ┃ ┃ ┃
┃ ┃ ┃ 28 ┃ ┃ ┃ ┃ ┃ ┃ ┃
┃ ┃ ┃ ┏━┻┓ ┃ ┃ ┃ ┃ ┃ ┃ ┃
┃ ┃ ┃ ┃ 27 ┃ ┃ ┃ ┃ ┃ ┃ ┃
┃ ┃ ┃ ┃ ┏┻┓ ┃ ┃ ┃ ┃ ┃ ┃ ┃
┃ 26 ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃
┃ ┏━┻━━┓ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃
┃ 25 ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃
┃ ┏━━┻━┓ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃
┃ ┃ 24 ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃
┃ ┃ ┏━┻┓ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃
┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ 23 ┃ ┃ ┃ ┃ ┃
┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┏━┻┓ ┃ ┃ ┃ ┃ ┃
┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ 22 ┃ ┃
┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┏━┻┓ ┃ ┃
21 ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃
┏━┻━┓ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃
20 ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃
┏┻┓ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃
0 6 17 2 12 13 5 14 3 7 9 19 11 18 8 10 1 16 4 15
38
┏━━━━━━━━━┻━━━━━━━━┓
37 ┃
┏━┻━┓ ┃
┃ 36 ┃
┃ ┏┻━┓ ┃
┃ ┃ ┃ 35
┃ ┃ ┃ ┏━━━━━━━━━━┻━━━━━━━━━━┓
┃ ┃ ┃ ┃ 34
┃ ┃ ┃ ┃ ┏━━━━━━━━━┻━━━━━━━━━┓
┃ ┃ ┃ ┃ 33 ┃
┃ ┃ ┃ ┃ ┏━━━━┻━━━━┓ ┃
┃ ┃ ┃ ┃ ┃ ┃ 32
┃ ┃ ┃ ┃ ┃ ┃ ┏━━━━┻━━━━┓
┃ ┃ ┃ 31 ┃ ┃ ┃ ┃
┃ ┃ ┃ ┏━┻┓ ┃ ┃ ┃ ┃
┃ ┃ ┃ ┃ ┃ ┃ 5 samples ┃ ┃
┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃
┃ ┃ ┃ ┃ ┃ ┃ ┃ 4 samples
┃ ┃ ┃ ┃ ┃ ┃ ┃
┃ ┃ ┃ ┃ ┃ ┃ 2 samples
┃ ┃ ┃ ┃ ┃ ┃
22 ┃ ┃ ┃ ┃ ┃
┏━┻┓ ┃ ┃ ┃ ┃ ┃
┃ ┃ ┃ ┃ ┃ ┃ 3 samples
┃ ┃ ┃ ┃ ┃ ┃
1 16 4 15 8 10
Really nice! And building on that, here's another way, which restricts the total number of lineages instead. Not tested much though.
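import msprime
import numpy as np
import tskit

ts = msprime.sim_ancestry(10, random_seed=1)
print(ts.first().draw_text())


def chop_draw2(tree, max_lineages):
    ts = tree.tree_sequence
    # Work on an editable copy: keep the nodes, rebuild the edges.
    tables = ts.dump_tables()
    tables.edges.clear()
    tables.nodes.flags = np.zeros_like(tables.nodes.flags)
    node_labels = {}
    tips = set(tree.roots)
    # Expand nodes from oldest to youngest until the lineage cap is reached.
    for n in tree.nodes(order="timedesc"):
        # Note: this check is conservative by one, because n itself is
        # removed from tips when it is expanded.
        if tree.num_children(n) + len(tips) > max_lineages:
            break
        children = tree.children(n)
        if len(children) > 0:
            tips.remove(n)
            for c in children:
                tips.add(c)
                # Use the tree's own interval so this also works for trees
                # other than the first one in the tree sequence.
                tables.edges.add_row(
                    tree.interval.left, tree.interval.right, n, c
                )
    for u in tips:
        # Unexpanded internal nodes are labelled with their clade size.
        node_labels[u] = (
            str(u) if tree.is_leaf(u) else f"{tree.num_samples(u)} samples"
        )
        node = ts.node(u).replace(flags=tskit.NODE_IS_SAMPLE)
        tables.nodes[u] = node
    tables.sort()
    ts = tables.tree_sequence()
    ctree = ts.at(tree.interval.left)
    print(ctree.draw_text(node_labels=node_labels))


chop_draw2(ts.first(), 10)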
38
┏━━━━━━━━┻━━━━━━━━┓
┃ 37
┃ ┏━┻━┓
┃ ┃ 36
┃ ┃ ┏┻━┓
35 ┃ ┃ ┃
┏━━━━━━━━━┻━━━━━━━━━┓ ┃ ┃ ┃
34 ┃ ┃ ┃ ┃
┏━━━━━━━━━┻━━━━━━━━━┓ ┃ ┃ ┃ ┃
┃ 33 ┃ ┃ ┃ ┃
┃ ┏━━━┻━━━┓ ┃ ┃ ┃ ┃
32 ┃ ┃ ┃ ┃ ┃ ┃
┏━━━┻━━━┓ ┃ ┃ ┃ ┃ ┃ ┃
┃ ┃ ┃ ┃ 31 ┃ ┃ ┃
┃ ┃ ┃ ┃ ┏┻━┓ ┃ ┃ ┃
┃ ┃ 30 ┃ ┃ ┃ ┃ ┃ ┃
┃ ┃ ┏━━┻━━┓ ┃ ┃ ┃ ┃ ┃ ┃
29 ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃
┏━┻━┓ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃
┃ 28 ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃
┃ ┏━┻┓ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃
┃ ┃ 27 ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃
┃ ┃ ┏┻┓ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃
┃ ┃ ┃ ┃ ┃ 26 ┃ ┃ ┃ ┃ ┃ ┃ ┃
┃ ┃ ┃ ┃ ┃ ┏━━┻━━┓ ┃ ┃ ┃ ┃ ┃ ┃ ┃
┃ ┃ ┃ ┃ ┃ 25 ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃
┃ ┃ ┃ ┃ ┃ ┏━┻━┓ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃
┃ ┃ ┃ ┃ ┃ ┃ 24 ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃
┃ ┃ ┃ ┃ ┃ ┃ ┏┻━┓ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃
┃ ┃ ┃ ┃ 23 ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃
┃ ┃ ┃ ┃ ┏┻━┓ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃
┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ 22 ┃ ┃
┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┏━┻┓ ┃ ┃
┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ 21 ┃ ┃ ┃ ┃ ┃ ┃
┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┏━┻━━┓ ┃ ┃ ┃ ┃ ┃ ┃
┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ 20 ┃ ┃ ┃ ┃ ┃ ┃ ┃
┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┏┻━┓ ┃ ┃ ┃ ┃ ┃ ┃ ┃
0 3 7 9 11 18 2 12 13 5 14 6 19 17 8 10 1 16 4 15
┃
┏━━━━━━━━━━┻━━━━━━━━━━┓
┃ ┃
┏━━━┻━━┓ ┃
┃ ┃ ┃
┏━┻┓ ┃ ┃
┃ ┃ ┃ ┃
┃ ┃ ┃ ┏━━━━━━━━━━━┻━━━━━━━━━━┓
┃ ┃ ┃ ┃ ┃
┃ ┃ ┃ ┃ ┏━━━━━━━━━┻━━━━━━━━━┓
┃ ┃ ┃ ┃ ┃ ┃
┃ ┃ ┃ ┃ ┏━━━━┻━━━━┓ ┃
┃ ┃ ┃ ┃ ┃ ┃ ┃
┃ ┃ ┃ ┃ ┃ ┃ ┏━━━━┻━━━━┓
┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃
┃ ┃ ┃ ┏┻━┓ ┃ ┃ ┃ ┃
┃ ┃ ┃ ┃ ┃ ┃ 5 samples ┃ ┃
┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃
┃ ┃ ┃ ┃ ┃ ┃ ┃ 4 samples
┃ ┃ ┃ ┃ ┃ ┃ ┃
┃ ┃ ┃ ┃ ┃ ┃ 2 samples
┃ ┃ ┃ ┃ ┃ ┃
┃ ┃ 2 samples ┃ ┃ ┃
┃ ┃ ┃ ┃ ┃
┃ ┃ ┃ ┃ 3 samples
┃ ┃ ┃ ┃
4 15 8 10
This is a perfect candidate for a dynamic notebook widget, where you click a node to toggle summarisation. One day!
Meanwhile, is this something that @savitakartik would like to tackle? I could help.
Yes, I'm very interested in this issue and would love to work on it!
Great. Shall we chat about it tomorrow, perhaps?
I've just been discussing this with @savitakartik. One problem with the "edit the tree sequence" approach is that it might be hard to apply to an entire tree sequence, for instance, if an internal node should be collapsed in one tree in the ts, but uncollapsed in another.
Here's another idea: we could create a new function for iterating over the nodes of a tree in a tree sequence, that flags up whether a node is a "collapsed" node or not. Something like
class Tree:
    # **kwargs stands in for further options left open in this proposal
    def nodes_collapsed(self, order=None, return_hidden_nodes=None, collapse_method="time", **kwargs):
        """Returns an iterator over the tuples (node_id, is_collapsed)"""
        ...
Then we adjust the drawing routines to use tree.nodes_collapsed() instead of tree.nodes()
for u, is_collapsed in ts.first().nodes_collapsed():
    # use this in the plotting routines instead of tree.nodes()
    # if is_collapsed is True, then plot labels using tree.samples(u)
    ...
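To make the proposal concrete, here is a minimal standalone sketch of such an iterator. It is not part of tskit: the free function, its `min_time` cutoff, and the dict inputs (`children`, `time`) are all hypothetical stand-ins for `tree.children(u)` and `tree.time(u)`. It treats an internal node as collapsed when it is younger than the cutoff and yields nothing beneath it.

```python
def nodes_collapsed(children, time, root, min_time):
    """Yield (node_id, is_collapsed) in preorder.

    An internal node is reported as collapsed when it is younger than
    min_time; nothing beneath a collapsed node is yielded.
    """
    stack = [root]
    while stack:
        u = stack.pop()
        is_collapsed = bool(children[u]) and time[u] < min_time
        yield u, is_collapsed
        if not is_collapsed:
            # Reverse so that children are visited left to right.
            stack.extend(reversed(children[u]))


# Toy dated tree (hypothetical data): tips 0-3 at time 0, root 6 at time 3.
children = {6: [4, 5], 4: [0, 1], 5: [2, 3], 0: [], 1: [], 2: [], 3: []}
time = {0: 0, 1: 0, 2: 0, 3: 0, 4: 1, 5: 2, 6: 3}
visited = list(nodes_collapsed(children, time, 6, min_time=2))
```

With this cutoff node 4 is reported as collapsed and its tips 0 and 1 are never visited, so a drawing routine consuming the iterator would not need to know how the decision was made.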
This is more involved, but I think more flexible, and I can see how we could use it to implement a (naive) version where we have SVG interactivity to hide and show clades in a tree (sequence) plot. I think we should meet with @jeromekelleher or @benjeffery to discuss the best approach here.
On a quick glance I would agree with having the collapsing be done in the drawing code rather than in the tree sequence. This also allows you to do fancy things in future such as having a node icon that summarises the clade below (e.g. pie chart by population).
Here's a fun thing: download this SVG and open it in a browser. It should allow you to hide and show subclades by clicking on them:
@benjeffery can probably improve my poorly coded JS, which I added to the end of the SVG file.
function toggle_child_node_visibility(evt) {
  // Toggle every descendant element with class "node" below the clicked node
  var children = evt.currentTarget.getElementsByClassName("node");
  for (var i = 0; i < children.length; i++) {
    if (children[i].style.visibility == "hidden") {
      children[i].style.visibility = null;
    } else {
      children[i].style.visibility = "hidden";
    }
  }
  // Stop the click from also toggling the ancestors of this node
  evt.stopPropagation();
}

var nodes = document.getElementsByClassName("node");
for (var i = 0; i < nodes.length; i++) {
  nodes[i].addEventListener("click", toggle_child_node_visibility);
}
Cool! I'm guessing in the general case you'd want to expand/collapse the layout though?
> Cool! I'm guessing in the general case you'd want to expand/collapse the layout though?
Yes, I think usually you would want to squash up nodes on the x axis if they contain branches that have been collapsed, but doing it interactively would mean the node positions would hop about, so I was wondering if the interactive version might want a statically positioned option such as this.
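As an illustration of the "squashed" layout, here is a minimal sketch in which every drawn tip or collapsed clade takes one slot on the x axis and each internal node sits at the mean of its children. This is not the tskit drawing code; `children` and `collapsed` are hypothetical plain-Python inputs.

```python
def assign_x(children, root, collapsed):
    """Return {node: x}; nodes hidden beneath a collapsed clade get no entry."""
    x = {}
    next_slot = 0
    stack = [(root, False)]
    while stack:
        u, expanded = stack.pop()
        if expanded:
            # All children have been placed: centre the parent over them.
            xs = [x[c] for c in children[u]]
            x[u] = sum(xs) / len(xs)
        elif u in collapsed or not children[u]:
            # A tip or a collapsed clade occupies exactly one slot.
            x[u] = float(next_slot)
            next_slot += 1
        else:
            stack.append((u, True))
            # Reverse so that children are placed left to right.
            stack.extend((c, False) for c in reversed(children[u]))
    return x


# Toy tree (hypothetical data): node 6 is the root, nodes 0-3 are tips.
children = {6: [4, 5], 4: [0, 1], 5: [2, 3], 0: [], 1: [], 2: [], 3: []}
full = assign_x(children, 6, collapsed=set())
squashed = assign_x(children, 6, collapsed={4})
```

Comparing `full` and `squashed` shows the hopping described above: collapsing node 4 moves nodes 2, 3, 5 and 6 to new x positions, which is why a statically positioned option may suit the interactive case better.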
> Yes, I think usually you would want to squash up nodes on the x axis
Just chiming in here to remind you that you can use link_ancestors to do this pushing up of nodes
Erm, I'm not sure I follow. Here we are talking about adjusting the X position of the nodes, I think?
ah, sorry. I've only skimmed this thread -- when I saw this
> One possibility is to collapse clades in a tree if e.g. all the nodes underneath belong to the same population
I assumed you meant you were going to find nodes whose descendants all have the same 'population' label, and 'simplify' the tree by getting rid of the intermediate edges between them and the leaves
so it wouldn't necessarily help you do the plotting itself, but it could help to show you which nodes need to be collapsed together
Right. Re viz, I know that the ETE developer has thought about interactive large tree viz in a conventional style and has some funding for it (e.g. demos with circular trees at https://www.youtube.com/watch?v=jnkuNrfx6iM).
Following on from our discussion just now:
- We want to reorganise the SvGDraw.assign_x_coordinates() method so that it doesn't require num_leaves to be calculated up top, but used at the end
- We want to be able to specify the starting node(s) for tree drawing, e.g. tree.draw_svg(start_nodes=[1, 2]) where start_nodes=None means use tree.roots here: https://github.com/tskit-dev/tskit/blob/10a74b4df18641ff32c29b8d0b3632b094c6cbb9/python/tskit/drawing.py#L1448 . I'm not sure this makes huge amounts of sense to do for an entire tree sequence, though. Certainly not for a first pass anyway.
- We want to be able to specify a cutoff for the number of lineages displayed. This can be done by carrying out a level order traversal until a certain number of lineages are surpassed, then marking all the nodes so far visited as to-be-drawn, and terminating early when drawing here: https://github.com/tskit-dev/tskit/blob/10a74b4df18641ff32c29b8d0b3632b094c6cbb9/python/tskit/drawing.py#L1450
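The lineage cutoff in the last point could look roughly like the sketch below. It is a variant that stops just before the cap would be exceeded rather than after it is surpassed, and it uses a hypothetical `children` dict and `roots` list in place of `tree.children(u)` and `tree.roots` (or the proposed `start_nodes`).

```python
from collections import deque


def nodes_to_draw(children, roots, max_lineages):
    """Level-order traversal that never exceeds max_lineages drawn tips.

    Returns (drawn, tips): every node to draw, and the subset at which
    drawing terminates (true leaves or collapsed clades).
    """
    queue = deque(roots)
    drawn = set(roots)
    lineages = len(roots)
    tips = set()
    while queue:
        u = queue.popleft()
        kids = children[u]
        # Expanding u swaps one lineage for len(kids) lineages.
        if not kids or lineages - 1 + len(kids) > max_lineages:
            tips.add(u)
            continue
        lineages += len(kids) - 1
        drawn.update(kids)
        queue.extend(kids)
    return drawn, tips


# Toy tree (hypothetical data): node 6 is the root, nodes 0-3 are tips.
children = {6: [4, 5], 4: [0, 1], 5: [2, 3], 0: [], 1: [], 2: [], 3: []}
drawn, tips = nodes_to_draw(children, [6], max_lineages=3)
```

In this example node 5 ends up in `tips` although it has children, so the drawing code would render it as a collapsed clade and skip everything beneath it.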
This paper has some discussion about identifying identical clades in two partially collapsed trees: https://academic.oup.com/mbe/article/33/8/2163/2579233?login=false