
augur measurements proposal

Open joverlee521 opened this issue 2 years ago • 1 comments

Context

A new measurements panel has been added to Auspice which displays data from a measurements sidecar JSON file. A previous PR added the measurements JSON schema and a new sub-subcommand augur validate measurements to validate measurements JSONs against the schema. This issue is created to discuss the UI/UX for the proposed augur measurements subcommand that generates the measurements sidecar JSON.

Description

The measurements JSON schema was designed such that each collection is completely independent. This makes it easy to generate a JSON for each collection then concatenate them all into a single sidecar JSON for a dataset.

I propose two sub-subcommands:

augur measurements export
augur measurements concat

Export

With so many display options that can be customized, the export command line options can get very long and complicated. However, this may not be an issue if we expect this command to be used as part of a workflow. I think the export command should only generate the JSON for a single collection of measurements. This would make the UI simpler since the user would not have to worry about matching specific options to specific collections. As for the command line UI itself, I think it can be developed in a two-step process:

Step 1

At first, I expect this command would only be used by Nextstrain and our group is pretty comfortable with creating/editing JSONs. We can start with only the "essential" command line options and push all other display configs into a config JSON. The "essential" options alone would be sufficient to generate a measurements JSON without any display customizations:

augur measurements export \
    --collection <path-to-collection-TSV> \
    --strain-column <column-for-strain-names> \
    --value-column <column-for-measurement-values> \
    --grouping-column <grouping-column-1> <grouping-column-2> \
    --minify-json \
    --output-json <path-to-output-JSON>
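As a rough illustration of what the "essential" options would need to do, here is a minimal Python sketch. The function name `build_collection` and the output shape (a `measurements` list whose records carry fixed `strain` and `value` keys, plus a `groupings` list) are assumptions for discussion, not the actual augur implementation or a restatement of the schema.

```python
import csv
import io


def build_collection(tsv_text, strain_column, value_column, grouping_columns):
    """Build one collection dict from TSV text (hypothetical output shape)."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    header = reader.fieldnames or []

    # Fail early if any requested column is absent from the TSV header.
    requested = [strain_column, value_column, *grouping_columns]
    missing = [column for column in requested if column not in header]
    if missing:
        raise ValueError(f"columns not found in TSV: {missing}")

    measurements = []
    for row in reader:
        record = dict(row)
        # Rename the user's columns to the fixed "strain" and "value" keys
        # (an assumption about what the final JSON expects).
        strain = record.pop(strain_column)
        value = float(record.pop(value_column))
        record["strain"] = strain
        record["value"] = value
        measurements.append(record)

    return {
        "groupings": [{"key": column} for column in grouping_columns],
        "measurements": measurements,
    }
```

Every other column is carried through unchanged, so the grouping columns stay available on each measurement record.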

To include display customizations, pass a --collection-config option pointing to a config JSON (which looks pretty similar to the final JSON):

{
    "key": "collection-key",
    "title": "collection-display-title",
    "fields": [
        {
            "key": "column-1-name",
            "title": "column-1-display-title"
        },
        {
            "key": "column-2-name",
            "title": "column-2-display-title"
        }
    ],
    "groupings": [
        {
            "key": "grouping-1-column-name",
            "order": [
                "grouping-1-value-1",
                "grouping-1-value-2"
            ]
        },
        {
            "key": "grouping-2-column-name",
            "order": [
                "grouping-2-value-1",
                "grouping-2-value-2"
            ]
        }
    ],
    "filters": [
        "filter-1-column-name",
        "filter-2-column-name"
    ],
    "x_axis_label": "label",
    "threshold": 2.0,
    "display_defaults": {
        "group_by": "default-grouping-column-name",
        "measurements_display": "mean",
        "show_overall_mean": true,
        "show_threshold": true
    }
}

Step 2

We can then dive into pulling out specific parts of the config JSON into command line options. Similar to augur export, the command line arguments would override the config JSON. This is pretty straightforward for most of the display options. The only ones that would require more discussion are the configs with a nested structure, specifically the fields and groupings options.
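The override rule could be sketched like this in Python. `merge_config` is a hypothetical helper, and treating `None` as "option not given on the command line" is an assumption about how the parsed arguments would arrive.

```python
def merge_config(config, cli_overrides):
    """Return a copy of the config JSON with command line values applied on top."""
    merged = dict(config)
    for key, value in cli_overrides.items():
        # None means the option was not passed, so the config value stands.
        if value is not None:
            merged[key] = value
    return merged
```

This is a shallow merge: a nested block such as display_defaults would be replaced as a whole rather than merged key by key, which is one of the details that would need deciding.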

For the --fields option, I propose we use structured arguments where each field can be passed as key:title or key=title. This does mean whatever delimiter we choose will not be allowed in the key or title values.
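A minimal sketch of parsing one such structured argument, assuming `=` as the delimiter by default. `parse_field` is a hypothetical name; nothing like it exists in augur as far as this issue states.

```python
def parse_field(argument, delimiter="="):
    """Parse a 'key<delimiter>title' argument into a field dict."""
    # partition splits on the first delimiter only, so the key can never
    # contain it; a second delimiter would end up in the title.
    key, separator, title = argument.partition(delimiter)
    if not separator or not key or not title:
        raise ValueError(f"expected key{delimiter}title, got {argument!r}")
    if delimiter in title:
        raise ValueError(f"{delimiter!r} is not allowed in key or title: {argument!r}")
    return {"key": key, "title": title}
```

For example, `parse_field("ha_titer=HA titer")` would yield a field with key `ha_titer` and title `HA titer`, while `parse_field("a=b=c")` is rejected as ambiguous.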

For the --groupings option, I honestly can't think of a clean way to make this a command line option. We can potentially do structured arguments as well, such as key:order_value_1,order_value_2,order_value_3, but that doesn't seem very intuitive.
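For comparison, parsing the key:order_value_1,order_value_2 form might look like the sketch below. `parse_grouping` is a hypothetical helper, and it assumes the order list is optional so that a bare key is also accepted.

```python
def parse_grouping(argument):
    """Parse 'key' or 'key:value_1,value_2' into a grouping dict."""
    key, separator, order_text = argument.partition(":")
    if not key:
        raise ValueError(f"expected key or key:value_1,value_2, got {argument!r}")
    grouping = {"key": key}
    if separator:
        # Drop empty entries left by stray or trailing commas.
        order = [value for value in order_text.split(",") if value]
        if order:
            grouping["order"] = order
    return grouping
```

The parsing itself is short, but any grouping value containing a comma cannot be expressed this way, which adds to the case for leaving groupings in the config JSON.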

Concat

The concat command would be very straightforward. It takes multiple collection JSONs, validates each collection against the schema, and then combines them into a single sidecar JSON.

augur measurements concat \
    --jsons <collection-1.json> <collection-2.json> <collection-3.json> \
    --default-collection <collection-1-key> \
    --minify-json \
    --output-json <path-to-output-JSON>

My only question here is what should happen if a single collection fails the validation. Should it just be excluded from the output with a warning message or should the whole command exit with an error?
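Both behaviors could sit behind a single switch, as in this Python sketch. Everything here is hypothetical: `looks_valid` is only a stand-in for real schema validation, and the output shape (a `collections` list plus a `default_collection` key) is an assumption about the sidecar JSON.

```python
import sys


def looks_valid(collection):
    """Stand-in check; a real command would validate against the schema."""
    return (
        isinstance(collection, dict)
        and "key" in collection
        and "measurements" in collection
    )


def concat_collections(collections, default_collection=None, strict=True):
    """Combine collection dicts into one sidecar dict."""
    kept = []
    for index, collection in enumerate(collections):
        if looks_valid(collection):
            kept.append(collection)
        elif strict:
            # Option A: the whole command exits with an error.
            raise ValueError(f"collection {index} failed validation")
        else:
            # Option B: exclude the collection and warn.
            print(f"WARNING: skipping collection {index}", file=sys.stderr)

    sidecar = {"collections": kept}
    if default_collection is not None:
        if default_collection not in {c["key"] for c in kept}:
            raise ValueError(f"default collection {default_collection!r} not in output")
        sidecar["default_collection"] = default_collection
    return sidecar
```

One thing the sketch surfaces: with the warn-and-skip behavior, the default collection itself can be the one that gets dropped, so that case needs an error either way.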

Additional Thoughts

This proposal is just the most basic command that takes a measurements TSV and display config and exports them to the expected measurements sidecar JSON. I know there are additional scripts that would be required to generate the TSV file for modeling and calculations. I personally think these scripts will be very pathogen specific and therefore should live in their respective pathogen repos rather than in augur.

joverlee521 avatar Mar 16 '22 00:03 joverlee521

I'm a fan of the proposal. The general workflow of generating TSVs through pathogen-specific scripts and packaging these up with the necessary labeling via augur measurements export makes sense to me. And then packaging multiple stand-alone collections together via augur measurements concat also makes sense. Presumably, if there's only a single collection, you could just have the output of augur measurements export be {pathogen}_measurements.json and this could be passed directly to Auspice. Only two immediate thoughts here:

  1. With our use of augur export to generate main JSONs for Auspice we haven't tried to stuff all the options in --auspice-config into command line options. I think it's reasonable to have common configuration exposed in command line and less common only available via --collection-config.

  2. I would think about how augur measurements export is meant to interact with augur titers sub and augur titers tree. Ideally it should be pretty straightforward to pipe the output of the augur titers commands into augur measurements (realizing that there will need to be config). I guess one potential option here, if we want to enforce TSV input into augur measurements, is to update augur titers to allow --output-json vs --output-tsv.

Hmm... looking now, there's a lot that's bespoke in the output of augur titers sub, ie:

"nodes": {
    "A/Abidjan/456/2021": {
      "cTiterSub": 5.38020380914579,
      "dTiterSub": 0
    },
...
"substitution": {
    "HA1:D186S": 0.2942,
    "HA1:D190N": 0.3282,
    "HA1:D53N": 0.0521,
...

In order for this to be passed into augur measurements, you'd need a rawer output of predicted titer value for each strain against each serum. For HI titers, this would be one collection.

So, it's the same as before: we either need to have an ancillary script that processes the output of augur titers sub into a TSV for augur measurements, or we need to update the output options of augur titers sub. I'd lean towards the latter.

And in writing out (2), I realize that this is tangential to the issue at hand. We can carve this out to a separate thread if desired.

trvrb avatar Mar 16 '22 01:03 trvrb