Use struct for node metadata
Fixes #416
This has the new metadata scheme, where the tsinfer-specific metadata is in a "tsinfer" key.
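For illustration, a node's metadata under this scheme might look like the sketch below. The field name is taken from the discussion further down; the values and exact contents are made up, not the final schema.

```python
# Sketch of the nesting described above: tsinfer-specific fields live under
# a top-level "tsinfer" key, leaving the rest of the metadata namespace free
# for other tools. Field name from the discussion below; values are made up.
node_metadata = {
    "tsinfer": {
        "ancestor_data_id": 42,  # hypothetical ID into the AncestorData file
    }
}
```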
Codecov Report
Merging #642 (eb4fce8) into main (99cad13) will increase coverage by 0.00%. The diff coverage is 100.00%.
```
@@            Coverage Diff            @@
##             main     #642     +/-  ##
========================================
  Coverage   93.08%   93.09%
========================================
  Files          17       17
  Lines        5136     5140       +4
  Branches      954      954
========================================
+ Hits         4781     4785       +4
  Misses        235      235
  Partials      120      120
```
| Flag | Coverage Δ | |
|---|---|---|
| C | 93.09% <100.00%> (+<0.01%) | :arrow_up: |
| python | 96.31% <100.00%> (+<0.01%) | :arrow_up: |
Flags with carried forward coverage won't be shown.
| Impacted Files | Coverage Δ | |
|---|---|---|
| tsinfer/formats.py | 98.43% <100.00%> (+<0.01%) | :arrow_up: |
| tsinfer/inference.py | 98.57% <100.00%> (+<0.01%) | :arrow_up: |
This is ready for review. I reckon if we merge it (or whatever we end up with) at the same time as #607, then we'd make the two format changes at roughly the same time.
As I say, the main point of using struct here for me would be to enable direct access to the metadata as numpy arrays. Let's get some info on doing that first before discussing the other details any further.
Ah, that's not the main reason I would be using a struct, TBH. I didn't even think about wanting to get the entire array out. I can see that would be useful sometimes, though.
BTW, and not that it matters, but the sample_data_id is mostly unfilled, I think.
If the info is sparse and you mainly plan on leaving things blank then there's no point at all in struct, I think. But being able to process these things efficiently in Python seems attractive to me.
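To make the "processing efficiently in Python" point concrete, here is a minimal sketch, not tsinfer's actual schema: because the node table stores all metadata as one flat byte column, a uniform fixed-width struct layout can be viewed in bulk as a numpy structured array, with no per-node JSON parsing. The single int32 field standing in for ancestor_data_id is an assumption for illustration.

```python
import struct

import numpy as np
import tskit

# Pack a hypothetical fixed-width record (one little-endian int32, standing
# in for ancestor_data_id) into each node's raw metadata.
tables = tskit.TableCollection(sequence_length=1.0)
for aid in range(3):
    tables.nodes.add_row(time=aid, metadata=struct.pack("<i", aid))

# Since every record has the same width, the flat metadata byte column can be
# reinterpreted as a structured array, one entry per node.
dtype = np.dtype([("ancestor_data_id", "<i4")])
records = tables.nodes.metadata.view(dtype)
print(records["ancestor_data_id"])  # -> [0 1 2]
```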
The sample_data_id value is sparse (I think). The ancestor_data_id is not, and storing 27 million copies of the identical string "ancestor_data_id" in JSON form in the public tree sequences seems pretty wasteful to me.
Edit - that's at least half a gigabyte in duplicated text in the tree sequence, I think?
Maybe some concrete numbers for chr1 would be helpful.
OK, for the p-arm of chr1, which is 123.4 MB (i.e. about 1/25th of the whole), we are storing 30 MB in node metadata, of which just over a million copies of the byte string "ancestor_data_id" account for 16 MB, and no copies of the string "sample_data_id" (because we only use this for historically inserted samples, which we don't do in the unified ts).
So what would that be for 8 bytes per node? Around 8 MB, I guess?
Yep, 8.9 MB. Note that the released unified genealogy tree sequences also don't contain the tsdate means and variances associated with node times (I'm not sure why they don't). If we included those, it would add JSON info such as
b'"mn":11039.99, "vr":1218813.29'
to the JSON, which I estimate would multiply the metadata storage requirements in the JSON format by ~2.5 times, and would either increase the struct version by another 8.9 MB if we store mean and var as 4-byte floats, or by 17.8 MB if we use doubles.
So if we store the tsdate mean and var, for all chromosomes in total, we use up roughly 1.9 GB in a JSON encoding, and either 445 MB (if storing floats) or 667 MB (if storing doubles) for the struct encoding.
If we don't store tsdate means and variances, we use up ~750 MB for JSON and 220 MB for struct.
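For anyone checking the arithmetic, the totals follow from scaling the chr1 p-arm figures quoted above (roughly 1/25 of the whole dataset) up to the full genome:

```python
# Back-of-envelope check of the figures above. The chr1 p-arm holds ~30 MB of
# JSON node metadata, or ~8.9 MB at 8 bytes per node, and is ~1/25 of the whole.
scale = 25
json_total = 30 * scale                # ~750 MB of JSON, genome-wide
struct_total = 8.9 * scale             # ~222 MB, i.e. the "~220 MB" quoted
json_with_tsdate = json_total * 2.5    # ~1875 MB, i.e. ~1.9 GB
struct_floats = (8.9 + 8.9) * scale    # ~445 MB with 4-byte mean/var
struct_doubles = (8.9 + 17.8) * scale  # ~667 MB with 8-byte mean/var
```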
I've come and gone several times on the "top-level key with name of generating software" concept, but I think I've convinced myself that it is a bad idea. Here's the thinking: if downstream software wants to use a metric stored in metadata, it will have to know which software wrote it, likely hard-coding the name of that software. This would prevent a different upstream tool from replacing the first, or, even worse, cause it to write metrics to a key named after the first piece of software in order to stay compatible! In other words, the identity of the metric is its key, its name; the generating software is incidental (but should absolutely be stored in the description of the field). Tagging @hyanwong and @jeromekelleher, as I think this is something we need to decide for the ecosystem and put into the tskit docs at some point.
Thanks @benjeffery. I guess there might be a difference between metadata that is generalisable, and metadata that is very tightly linked to the software package. In the case of the sample_data_id and ancestor_data_id from tsinfer, these values reference specific internal IDs created by the tsinfer inference process, so I can't imagine they would be useful outside that context.
That said, I'm perfectly happy to label them simply sample_data_id and to document their origin in the description field (or even to use tsinfer_sample_data_id, although that slightly undermines the justification of not hard-coding the software name).
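As a sketch of what that convention could look like, here is a hypothetical schema, using the JSON codec purely for readability; the field name and description text are illustrative, not a proposal for the actual schema:

```python
import tskit

# Hypothetical schema following the convention argued for above: the key is
# the generic "sample_data_id", and the generating software (tsinfer) is
# recorded only in the field's description.
schema = tskit.MetadataSchema(
    {
        "codec": "json",
        "type": "object",
        "properties": {
            "sample_data_id": {
                "type": "number",
                "description": "ID in the tsinfer SampleData file used for inference",
            }
        },
    }
)
```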
Yes, good point. What downstream uses are there of sample_data_id and ancestor_data_id? Is it just for analysis of tsinfer's performance? I'm trying to get an idea of where they sit in the generalisability spectrum.
Some of the rationale is at https://github.com/tskit-dev/tsinfer/issues/301 and https://github.com/tskit-dev/tsinfer/issues/604#issuecomment-968979060
Basically, it is useful for manipulating the tree sequences when matching historical individuals into a tree sequence, inserting them, etc. This could be done either in the standard internal tsinfer routines or as part of a post-processing step.
Closing, as we decided simply to use permissive JSON, at least for the time being.
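For reference, the "permissive JSON" option settled on here is tskit's built-in schema that accepts any JSON object; the metadata contents in this snippet are illustrative only:

```python
import tskit

tables = tskit.TableCollection(sequence_length=1.0)
# Accept arbitrary JSON objects as node metadata, with no fixed layout.
tables.nodes.metadata_schema = tskit.MetadataSchema.permissive_json()
tables.nodes.add_row(time=0, metadata={"tsinfer": {"ancestor_data_id": 42}})
```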