
Use struct for node metadata

Open hyanwong opened this issue 3 years ago • 16 comments

Fixes #416

hyanwong avatar Mar 05 '22 11:03 hyanwong

This has the new metadata scheme, where the tsinfer-specific metadata is in a "tsinfer" key.
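A minimal sketch of what a namespaced node-metadata blob might look like under a JSON codec (field names taken from later in this thread; the exact layout here is illustrative, not tsinfer's actual output):

```python
import json

# Illustrative only: tsinfer-specific fields live under a single
# "tsinfer" key, leaving the top level free for other tools' entries.
metadata = {"tsinfer": {"ancestor_data_id": 1234}}
blob = json.dumps(metadata, separators=(",", ":")).encode()
# Note that every node pays for the full key names in every copy of the blob.
assert json.loads(blob)["tsinfer"]["ancestor_data_id"] == 1234
```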

hyanwong avatar Mar 07 '22 18:03 hyanwong

Codecov Report

Merging #642 (eb4fce8) into main (99cad13) will increase coverage by 0.00%. The diff coverage is 100.00%.

Impacted file tree graph

@@           Coverage Diff           @@
##             main     #642   +/-   ##
=======================================
  Coverage   93.08%   93.09%           
=======================================
  Files          17       17           
  Lines        5136     5140    +4     
  Branches      954      954           
=======================================
+ Hits         4781     4785    +4     
  Misses        235      235           
  Partials      120      120           
Flag     Coverage Δ
C        93.09% <100.00%> (+<0.01%) :arrow_up:
python   96.31% <100.00%> (+<0.01%) :arrow_up:

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files         Coverage Δ
tsinfer/formats.py     98.43% <100.00%> (+<0.01%) :arrow_up:
tsinfer/inference.py   98.57% <100.00%> (+<0.01%) :arrow_up:

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Powered by Codecov. Last update 99cad13...eb4fce8.

codecov[bot] avatar Mar 07 '22 18:03 codecov[bot]

This is ready for review. I reckon if we merge it (or whatever we end up with) at the same time as #607, then we make the two format changes at roughly the same time.

hyanwong avatar Mar 08 '22 08:03 hyanwong

As I say, the main point of using struct here for me would be to enable direct access to the metadata as numpy arrays. Let's get some info on doing that first before discussing the other details any further.
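A sketch of the kind of access a struct encoding would enable (not tsinfer's actual API; assumes each node's metadata is a fixed-size record, here a single 4-byte little-endian id):

```python
import numpy as np

# With fixed-size struct records, the whole packed metadata column can be
# viewed as a numpy array in one call, with no per-row JSON parsing.
packed = np.arange(5, dtype="<i4").tobytes()  # stand-in for the metadata column
ancestor_data_id = np.frombuffer(packed, dtype="<i4")
assert ancestor_data_id.tolist() == [0, 1, 2, 3, 4]
```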

jeromekelleher avatar Mar 08 '22 10:03 jeromekelleher

Ah, that's not the main reason I would be using a struct, TBH. I didn't even think about wanting to get the entire array out. I can see that would be useful sometimes, though.

BTW, and not that it matters, but the sample_data_id is mostly unfilled, I think.

hyanwong avatar Mar 08 '22 10:03 hyanwong

If the info is sparse and you mainly plan on leaving things blank then there's no point at all in struct, I think. But being able to process these things efficiently in Python seems attractive to me.

jeromekelleher avatar Mar 08 '22 10:03 jeromekelleher

The sample_data_id value is sparse (I think). The ancestor_data_id is not, and storing 27 million copies of the identical string "ancestor_data_id" in JSON form in the public tree sequences seems pretty wasteful to me.

Edit - that's at least half a gigabyte in duplicated text in the tree sequence, I think?

hyanwong avatar Mar 08 '22 10:03 hyanwong

Maybe some concrete numbers for chr1 would be helpful.

jeromekelleher avatar Mar 08 '22 10:03 jeromekelleher

OK, for the p-arm of chr1, which is 123.4 MB (i.e. about 1/25th of the whole), we are storing 30 MB in node metadata, of which we have just over a million copies of the byte string "ancestor_data_id" (= 16 MB), and no copies of the string "sample_data_id" (because we only use that for historically inserted samples, which we don't do in the unified ts).

hyanwong avatar Mar 08 '22 11:03 hyanwong

So what would that be for 8 bytes per node? Around 8 MB, I guess?

jeromekelleher avatar Mar 08 '22 12:03 jeromekelleher

> So what would that be for 8 bytes per node? Around 8 MB, I guess?

Yep, 8.9 MB. Note that the released unified genealogy tree sequences also don't contain the tsdate means and variances associated with node times (I'm not sure why not). If we included those, it would add JSON info such as

b'"mn":11039.99, "vr":1218813.29'

to the JSON, which I estimate would multiply the metadata storage requirements of the JSON format by ~2.5 times, and would increase the struct version by either another 8.9 MB if we store mean and var as 4-byte floats, or 17.8 MB if we use doubles.
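These figures are internally consistent; a quick back-of-envelope check (the node count here is inferred from the 8.9 MB / 8-bytes-per-node figure above, not stated directly in the thread):

```python
# Nodes implied by "8 bytes per node ~ 8.9 MB" for the chr1 p-arm.
n_nodes = 8.9e6 / 8                   # ~1.11 million nodes (assumption)
floats_mb = n_nodes * 2 * 4 / 1e6     # mean + var as two 4-byte floats
doubles_mb = n_nodes * 2 * 8 / 1e6    # mean + var as two 8-byte doubles
assert round(floats_mb, 1) == 8.9
assert round(doubles_mb, 1) == 17.8
```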

hyanwong avatar Mar 08 '22 14:03 hyanwong

So if we store the tsdate mean and var, for all chromosomes in total we use up roughly 1.9 GB in the JSON encoding, and either 445 MB (storing floats) or 667 MB (storing doubles) in the struct encoding.

If we don't store tsdate means and variances, we use up ~750 MB for JSON and 220 MB for struct.
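Scaling the chr1 p-arm figures by the ~25x factor quoted earlier gives a rough check of these totals (the per-arm inputs are the 30 MB JSON and 8.9 MB struct numbers from this thread):

```python
scale = 25                        # chr1 p-arm is ~1/25 of the whole dataset
json_mb = 30 * scale              # ~30 MB JSON node metadata per p-arm
struct_mb = 8.9 * scale           # ~8.9 MB struct metadata per p-arm
json_with_dates_gb = json_mb * 2.5 / 1000  # the ~2.5x tsdate multiplier
assert json_mb == 750             # matches the ~750 MB JSON figure
assert 215 < struct_mb < 225      # close to the ~220 MB struct figure
assert 1.8 < json_with_dates_gb < 2.0  # close to the ~1.9 GB figure
```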

hyanwong avatar Mar 08 '22 14:03 hyanwong

I've come and gone on the "top-level key with name of generating software" concept several times, but I think I've convinced myself that it is a bad idea. Here's the thinking: if downstream software wants to use a metric stored in metadata, it will have to know what software wrote it, likely hard-coding the name of that software. This would prevent a different upstream tool from replacing the first, or, even worse, cause the replacement to write metrics under a key bearing the first tool's name in order to stay compatible! In other words, the identity of the metric is its key, its name, and the generating software is incidental (but should absolutely be stored in the description of the field). Tagging @hyanwong and @jeromekelleher as I think this is something we need to decide for the ecosystem and put into the tskit docs at some point.
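For illustration, a hypothetical schema fragment in the style of tskit's struct-codec metadata schemas (not taken from this PR) showing the suggestion: the field name stays generic, and the generating software appears only in the human-readable description:

```python
# Hypothetical schema fragment: no software name in the key itself;
# provenance lives in the description instead.
schema = {
    "codec": "struct",
    "type": "object",
    "properties": {
        "ancestor_data_id": {
            "type": "number",
            "binaryFormat": "i",
            "description": "ID of the ancestor in the tsinfer ancestors file",
        },
    },
}
assert "tsinfer" not in schema["properties"]  # key is software-agnostic
```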

benjeffery avatar Apr 07 '22 23:04 benjeffery

Thanks @benjeffery. I guess there might be a difference between metadata that is generalisable, and metadata that is very tightly linked to the software package. In the case of the sample_data_id and ancestor_data_id from tsinfer, these values reference specific internal IDs created by the tsinfer inference process, so I can't imagine they would be useful outside that context.

Having said which, I'm perfectly happy to label them simply sample_data_id and document their origin in the description field (or even to use tsinfer_sample_data_id, although that slightly undermines the justification of not hard-coding the software name).

hyanwong avatar Apr 08 '22 09:04 hyanwong

Yes, good point. What downstream uses are there of sample_data_id and ancestor_data_id? Is it just for analysis of tsinfer's performance? I'm trying to get an idea of where they sit in the generalisability spectrum.

benjeffery avatar Apr 08 '22 10:04 benjeffery

Some of the rationale is at https://github.com/tskit-dev/tsinfer/issues/301 and https://github.com/tskit-dev/tsinfer/issues/604#issuecomment-968979060

Basically, it is useful for manipulating the tree sequences when matching historical individuals into a tree sequence, inserting them, etc. This could be done either in the standard internal tsinfer routines or as part of a post-processing step.

hyanwong avatar Apr 08 '22 10:04 hyanwong

Closing, as we decided simply to use permissive JSON, at least for the time being.

hyanwong avatar Jul 22 '24 21:07 hyanwong