Ideas for improving how data flows through augur export v2

Open jameshadfield opened this issue 2 years ago • 1 comments

See https://github.com/nextstrain/docs.nextstrain.org/pull/113 for (new) documentation detailing how augur export v2 treats metadata (TSV and node-data JSONs). Writing those docs allowed me to collect my thoughts on how we could improve this.

Exporting metadata fields must be explicit

Currently, whether a metadata field gets exported depends on some or all of: (a) whether the metadata came from a TSV or a JSON, (b) whether it matches a hardcoded exclude field, of which there are at least 19, (c) whether it matches a hardcoded include field, of which there are at least a dozen, (d) whether the key is changed at read time, which happens in at least 2 places, (e) which command line arguments are provided, and (f) which colorings, filters and geo-resolutions are provided in the auspice-config.

We could take steps to simplify this via the following logic:

* All metadata is read in without changes.
* A certain set of specific fields are “magic” (num_date, branch_length etc.) and will be exported. These are clearly documented. No colorings will be automatically produced.
* Colorings are taken from the command line or the auspice-config. If you don’t specify either then there will be no available colorings in the resulting dataset.
* Geo-resolutions are taken from the command line or the auspice-config…
* Filters are specified via the auspice config. If you don’t specify any, then there will be no filters in the footer of auspice [1].
* Other traits specified via --extra-meta or “extra_traits” in the auspice config will be exported. These will be available for sidebar filtering [1] or shown when clicking on the node in the auspice tree.

However, it is arguable how much simpler that really is.
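To make the proposed "explicit export" rule concrete, here is a minimal Python sketch. It is not augur's implementation: the function name `select_exported_fields`, the `MAGIC_FIELDS` set (which contains only the two examples named above) and the choice to export filter keys are all assumptions made for illustration.

```python
# Hypothetical sketch of the proposed rule; none of these names exist in augur.
# MAGIC_FIELDS holds only the two examples from the issue; the full set is undecided.
MAGIC_FIELDS = {"num_date", "branch_length"}


def select_exported_fields(node_attrs, colorings=(), geo_resolutions=(),
                           filters=(), extra_traits=()):
    """Return the attributes to export for a single node.

    A key is exported only if it is a magic field or was explicitly requested
    as a coloring, geo-resolution, filter or extra trait. Nothing else is
    exported implicitly, so no hardcoded include/exclude lists are needed.
    """
    requested = (MAGIC_FIELDS | set(colorings) | set(geo_resolutions)
                 | set(filters) | set(extra_traits))
    return {key: value for key, value in node_attrs.items() if key in requested}


# Example: all metadata is read in unchanged, then filtered at export time.
node = {
    "num_date": 2020.5,
    "branch_length": 0.001,
    "country": "NZ",
    "host": "human",
    "lab_notes": "internal",
}
exported = select_exported_fields(node, colorings=["country"], extra_traits=["host"])
```

In this example `lab_notes` is dropped because nothing requested it, and calling the function with no arguments beyond the node would export only the magic fields.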

A complementary direction would be to allow node-data JSONs to encode relevant config settings. It is often the case that the command which generates them knows about how they should be displayed. As an example:

/* output from augur refine */
auspice_config: {
  colorings: [{key: "num_date", title: "Sampling Date", scale: "temporal"}],
  divergence_key: "branch_length",
  divergence_unit: "mutations",
  temporal_key: "num_date"
},
nodes: ...

The merging of these would be tricky, but the idea seems sound to me. This shifts the logic into each subcommand rather than maintaining code to “export everything except these 19 exclusions” as we currently do. It’s plausible that this approach would remove the need for an auspice-config JSON in many workflows.
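One possible merge policy is sketched below in Python, purely to show where the tricky parts are. The `auspice_config` block inside node-data JSONs is the hypothetical structure proposed above, and the function name `merge_embedded_configs` and its rules (list settings are concatenated and de-duplicated by `key`, scalar settings must agree) are assumptions rather than a settled design.

```python
def merge_embedded_configs(node_data_list):
    """Merge the (hypothetical) `auspice_config` blocks of several node-data dicts.

    Assumed policy, for illustration only:
    - list-valued settings (e.g. colorings) are concatenated in input order and
      de-duplicated by their "key", with the first definition winning;
    - scalar settings (e.g. temporal_key) must agree across inputs, otherwise
      a ValueError is raised so the conflict is not resolved silently.
    """
    merged = {}
    for node_data in node_data_list:
        for setting, value in node_data.get("auspice_config", {}).items():
            if isinstance(value, list):
                existing = merged.setdefault(setting, [])
                if not isinstance(existing, list):
                    raise ValueError(f"{setting!r} is a list in one input but not another")
                seen = {entry["key"] for entry in existing}
                for entry in value:
                    if entry["key"] not in seen:
                        existing.append(entry)
                        seen.add(entry["key"])
            elif setting in merged and merged[setting] != value:
                raise ValueError(
                    f"conflicting values for {setting!r}: {merged[setting]!r} vs {value!r}"
                )
            else:
                merged[setting] = value
    return merged


# Example inputs shaped like the node-data sketches in this issue.
refine_output = {
    "auspice_config": {
        "colorings": [{"key": "num_date", "title": "Sampling Date", "scale": "temporal"}],
        "divergence_key": "branch_length",
        "temporal_key": "num_date",
    },
    "nodes": {},
}
clades_output = {
    "auspice_config": {
        "colorings": [{"key": "clade_membership", "title": "Clade", "scale": "categorical"}],
    },
    "nodes": {},
}
merged = merge_embedded_configs([refine_output, clades_output])
```

Raising on conflicting scalars is only one option; an alternative would be to let a user-supplied auspice-config override whatever the node-data JSONs declare.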

Branch labels should be defined

Continuing the theme above, we currently only allow one hardcoded branch label, and one “magic” branch label generated during export. This is https://github.com/nextstrain/augur/issues/720, which should be revisited. Using the idea above, this could be encoded in the node-data JSON as:

/* output from augur clades */
auspice_config: {
  colorings: [{key: "clade_membership", title: "Clade", scale: "categorical"}],
  branch_labels: [{key: "clade_annotation", title: "Clade"}] // structure TBD
},
nodes: {
  "{NODE_NAME}": {
    "clade_annotation": "hMPXV-1 A",
    "clade_membership": "hMPXV-1 A"
  }
}

Similarly, the “magic” aa branch label could be computed by augur translate.

Authors and accessions

The above approach doesn't take care of authors and accessions, both of which are automatically exported currently (assuming certain fields appear in the merged metadata). I don't have a great idea on how to improve this.

Deprecated auspice-config

This is more of a technical debt question than a feature, but the code is hard to read as we’re constantly considering deprecated key names and structures. We could wrap this up into a single function which is called at the start, e.g. config = update_config(parse_config(filename)). Or we could drop support for the deprecated format entirely, as this change was introduced in augur v6!
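A rough Python sketch of that single up-front normalisation step follows. `parse_config` and `update_config` are the names suggested above, but their bodies here are assumptions, and the entry in `RENAMED_KEYS` is a placeholder rather than a real augur key. Structural changes (as opposed to plain renames) would need their own converters, which this sketch does not attempt.

```python
import json

# Placeholder mapping of deprecated -> current top-level key names.
# "old_key" and "new_key" are hypothetical; the real entries would come from
# the deprecations augur already handles.
RENAMED_KEYS = {"old_key": "new_key"}


def parse_config(filename):
    """Read an auspice-config JSON file into a dict."""
    with open(filename, encoding="utf-8") as handle:
        return json.load(handle)


def update_config(config):
    """Return a copy of `config` with deprecated keys renamed to current ones.

    If both the deprecated and the current key are present, the current key
    wins and the deprecated value is discarded. The input dict is not modified.
    """
    updated = dict(config)
    for old, new in RENAMED_KEYS.items():
        if old in updated:
            value = updated.pop(old)
            updated.setdefault(new, value)
    return updated
```

With this in place, the rest of the export code would only ever see current key names, e.g. after `config = update_config(parse_config(filename))`.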


[1] I want to make a corresponding change to augur so that the filters detailed in the footer are decoupled from the sidebar filtering. The sidebar filtering should use all metadata set on any node, regardless of whether it’s a colouring etc.

jameshadfield, Jun 23 '22 22:06

A complementary direction would be to allow node-data JSONs to encode relevant config settings. It is often the case that the command which generates them knows about how they should be displayed.

I haven't thought thru any of the implications here, but on the surface, I like this suggestion a lot as a way to simplify the interface/mechanics of export.

tsibley, Jul 05 '22 22:07