Add dict_of_dicts to tree sequence
Description
Fixes #1294. Should probably add some tests for attributes of the graph like branch_length, left, and right.
PR Checklist:
- [x] Tests that fully cover new/changed functionality.
- [x] Documentation including tutorial content if appropriate.
- [x] Changelogs, if there are API changes.
Codecov Report
Merging #1296 (935d052) into main (f277006) will increase coverage by
0.00%. The diff coverage is100.00%.
@@ Coverage Diff @@
## main #1296 +/- ##
=======================================
Coverage 93.81% 93.81%
=======================================
Files 26 26
Lines 22187 22198 +11
Branches 1006 1009 +3
=======================================
+ Hits 20814 20825 +11
Misses 1340 1340
Partials 33 33
| Flag | Coverage Δ | |
|---|---|---|
| c-tests | 92.44% <ø> (ø) |
|
| lwt-tests | 92.97% <ø> (ø) |
|
| python-c-tests | 95.15% <100.00%> (+<0.01%) |
:arrow_up: |
| python-tests | 98.86% <100.00%> (+<0.01%) |
:arrow_up: |
Flags with carried forward coverage won't be shown. Click here to find out more.
| Impacted Files | Coverage Δ | |
|---|---|---|
| python/tskit/trees.py | 97.91% <100.00%> (+0.01%) |
:arrow_up: |
Continue to review full report at Codecov.
Legend - Click here to learn more
Δ = absolute <relative> (impact),ø = not affected,? = missing dataPowered by Codecov. Last update f277006...935d052. Read the comment docs.
Very neat! Looks very nice, but I think we want to be sure that this is the "right" graph representation before we make it part of the public API. Some questions:
- What do these graphs look like when we plot them? I.e., what does a single tree ts look like, and one with two trees?
- What would be need to do to round trip the topology? Ideally we would want
d = ts.as_dict_of_dicts()
tsp = tskit.TreeSequence.from_dict_of_dicts(d)
assert tsp.equals(ts, ignore_provenance=True)
We would need to have some visibilty on whether this is possible before we can fix on the graph representation.
Nice! In the round trip example @jeromekelleher, do you mean that just the topology would round trip? Or other things like populations and migrations? (I guess these could be included as an attribute at the root dict?)
Just the topology initially, we don't have to worry about the other stuff like individuals, migrations etc. But yes, in the case of populations it'd probably be simplest to add them in as attrs in the root dict.
AFAIK when NetworkX imports from dict of dicts you can only set attributes on edges: I don't know how to set them on nodes. So round tripping might be difficult. Even setting node times would have to be imperfectly guessed from edge lengths.
AFAIK when NetworkX imports from dict of dicts you can only set attributes on edges: I don't know how to set them on nodes. So round tripping might be difficult. Even setting node times would have to be imperfectly guessed from edge lengths.
We would round-trip the dicts, not the NetworkX representation.
Yes - I'm afraid we have to do the due diligence here @hyanwong. We need to spend time exploring the properties of this representation if we're doing to make it part of the API. The first question we need to answer is "is this a bijection"?
I assumed this would simply be an analogue to Tree.as_dict_of_dicts, so purely a (non complete) export format for NetworkX. If you want it to be something more, then it gets complex, I agree.
We don't have a Tree.from_dod method, so I'm not sure we need (want?) a TS one?
Apologies for the terse replies: I'm on a mobile. There's no reason to merge this any time soon. As it is, I need to use a slightly bespoke version anyway, which randomises the node IDs (otherwise locating isomorphic nodes is biased towards the input order). I just imagine that if I find a function like this useful for analysing the tree sequence as a graph, others might too. So it's probably worth having a similar function in the API. But as you say, worth doing the due diligence first. I'm in no hurry either way.
OK - since we're making quite a big decision here (what's the graph representation of a tree sequence) @hyanwong, I think it would be better if you used the version you have for your own work, and we took our time with this. The Tree version is much simpler and we don't worry about round tripping it because there's no way of creating a Tree on it's own anyway. The TreeSequence is the core data structure.
So, I think we can mark this as a draft?
So, I think we can mark this as a draft?
Done.
FWIW I don't think as_dict_of_dict should be the "canonical" graph representation of a TS. I think this only needs to be a function used for reading a specific format into NetworkX. We shouldn't claim anything grander about it. If we want a canonical format we could just use .__dict__() (or something) instead.
Could we update this method to return the representation I'm advocating for here @hyanwong? https://github.com/tskit-dev/tskit/discussions/2068#discussioncomment-1821721
I think it's better to annotate the nodes with their time, and the edges with intervals, rather than making a multigraph with branch lengths like we have here. Making the multigraph version for viz should be simple enough as a post-processing step.
Could we update this method to return the representation I'm advocating for here @hyanwong? #2068 (reply in thread)
I haven't figured out how to use the "dict_of_dicts" version to allocate attributes (like time) to nodes. So maybe this isn't the right way anyway?
I think it's better to annotate the nodes with their time, and the edges with intervals, rather than making a multigraph with branch lengths like we have here. Making the multigraph version for viz should be simple enough as a post-processing step.
Hmm, is that true. Isn't it just as easy to go from multigraph -> graph as the other way round?