Left join multiple metadata files for the same named input

Open • huddlej opened this issue 4 years ago • 1 comment

Context

A common use case for the ncov workflow is for users to download GISAID data for a specific analysis (e.g., a state-focused build) and combine other existing metadata they have about those state-level data (based on GISAID accession or strain name) with the GISAID data. This joining of public GISAID data and public/private local data allows users to create trees that can be colored, filtered, etc. by custom metadata annotations.

This type of joining requires programming and bioinformatics expertise, which can be a burden or a complete roadblock for users trying to accomplish this task. The benefit of a merged metadata file over a drag-and-drop metadata file in Auspice is that the annotations persist in the build JSONs, which can be more easily shared with other users.

Description

Add support for left joining standard GISAID/GenBank metadata with custom metadata files on strain name and/or standard accession IDs. This merging should be supported per named input for a given builds.yaml file, so users can define input-specific custom metadata or reuse the same custom metadata across multiple inputs as they choose.

Possible solution

One possible solution is to allow users to define multiple metadata paths for the same input like so:

inputs:
  - name: mydata
    sequences: data/sequences.fasta
    metadata: ["data/metadata.tsv", "data/local_metadata.tsv"]

We can modify the sanitize metadata script to accept one or more metadata inputs (nargs="+"). When multiple inputs are provided, we can merge these files on predefined columns (strain name, accession id, etc.) and stream the resulting data frame as the final output.
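To make the idea concrete, here is a minimal sketch of what the merge could look like, using only the Python standard library. It is not the actual sanitize metadata script: the `--metadata` and `--join-on` flag names, the helper functions, and the rule that a non-empty value from a later file overwrites an earlier one are all assumptions made for illustration.

```python
import argparse
import csv
import io


def build_parser():
    """Hypothetical CLI: the flag names are assumptions, not the real script's."""
    parser = argparse.ArgumentParser(description="Left join metadata files.")
    parser.add_argument("--metadata", nargs="+", required=True,
                        help="one or more TSV files; the first is the primary")
    parser.add_argument("--join-on", default="strain",
                        help="column shared by all files")
    return parser


def read_tsv(handle):
    """Return (fieldnames, rows) for a tab-delimited file handle."""
    reader = csv.DictReader(handle, delimiter="\t")
    rows = list(reader)
    return list(reader.fieldnames or []), rows


def left_join(primary, extras, key="strain"):
    """Left join each table in `extras` onto `primary` by `key`.

    Every primary row is kept and rows found only in an extra table are
    dropped. A non-empty value from a later table overwrites an earlier
    one (an assumed policy that the issue does not specify).
    """
    fields, rows = primary
    fields = list(fields)
    rows = [dict(row) for row in rows]
    for extra_fields, extra_rows in extras:
        lookup = {row[key]: row for row in extra_rows}
        fields.extend(f for f in extra_fields if f not in fields)
        for row in rows:
            match = lookup.get(row[key], {})
            for field in extra_fields:
                if field == key:
                    continue
                value = match.get(field, "")
                if value or field not in row:
                    row[field] = value
    return fields, rows


# In-memory demonstration with made-up records.
gisaid = read_tsv(io.StringIO(
    "strain\tdate\nA\t2021-01-01\nB\t2021-01-02\n"))
local = read_tsv(io.StringIO(
    "strain\tclinic\nB\tnorth\nC\tsouth\n"))
fields, rows = left_join(gisaid, [local])

# Write the merged table as TSV (here to a string instead of a file).
output = io.StringIO()
writer = csv.DictWriter(output, fieldnames=fields, delimiter="\t",
                        lineterminator="\n")
writer.writeheader()
writer.writerows(rows)
```

In this sketch the first file listed is treated as the primary table, so strain `C`, which appears only in the local file, does not reach the output, while strain `A` keeps its row with an empty `clinic` value.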

If we use this approach, where should we define the columns to join on? There is already a configuration section in the builds YAML for the sanitize metadata step, so we could add a list of columns to join on by default that the user could override as needed.
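As a rough sketch of the default-plus-override idea, the lookup could work something like the following. The `join_columns` key and the default column list are hypothetical; nothing like them exists in the builds YAML today, and the config is represented here as a plain dictionary, as it would be after the YAML is parsed.

```python
# Assumed defaults for illustration only; the real list would need discussion.
DEFAULT_JOIN_COLUMNS = ["strain", "gisaid_epi_isl", "genbank_accession"]


def resolve_join_column(config, columns_per_file):
    """Pick the first candidate join column that every metadata file has.

    `config` is the parsed builds config. `columns_per_file` is a list with
    one list of column names per metadata file.
    """
    section = config.get("sanitize_metadata", {})
    # A user-supplied list replaces the defaults entirely.
    candidates = section.get("join_columns", DEFAULT_JOIN_COLUMNS)
    for candidate in candidates:
        if all(candidate in columns for columns in columns_per_file):
            return candidate
    raise ValueError(
        "None of the join columns %r is present in every metadata file"
        % (candidates,)
    )


# Defaults apply when the user has not configured anything.
default_choice = resolve_join_column(
    {}, [["strain", "date"], ["strain", "clinic"]])

# A user override takes precedence over the defaults.
override_choice = resolve_join_column(
    {"sanitize_metadata": {"join_columns": ["gisaid_epi_isl"]}},
    [["strain", "gisaid_epi_isl"], ["gisaid_epi_isl", "clinic"]],
)
```

Failing loudly when no shared column exists seems preferable to silently producing an unmerged file, but that is a design choice open to debate.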

huddlej, Jul 30 '21 19:07

This would be a big improvement. The sanitize metadata script already has code for inferring the column representing "strain". Would requiring this column to be present in all files be too onerous? If we want to specify a column name to merge on using config["sanitize_metadata"], are we OK with this being the same for all inputs (not a big limitation)? We'll probably need an extra step to resolve duplicates for this column as well.
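One way the duplicate-resolution step might look is sketched below. Keeping the last row seen for each key is an assumption; keeping the first row, or raising an error, would be equally defensible, and the function name is made up for this example.

```python
def drop_duplicate_keys(rows, key="strain"):
    """Keep the last row seen for each key value.

    Returns the deduplicated rows, ordered by first appearance of each key,
    plus the list of key values that occurred more than once so they can be
    reported to the user.
    """
    latest = {}
    duplicates = []
    for row in rows:
        value = row[key]
        if value in latest and value not in duplicates:
            duplicates.append(value)
        # Reassigning an existing key keeps its original position in the dict.
        latest[value] = row
    return list(latest.values()), duplicates


# Made-up records: strain "A" appears twice in the custom metadata.
custom_rows = [
    {"strain": "A", "clinic": "north"},
    {"strain": "B", "clinic": "south"},
    {"strain": "A", "clinic": "east"},
]
unique_rows, duplicated = drop_duplicate_keys(custom_rows)
```

Running this step on each custom metadata file before the join would guarantee at most one match per strain, so the merge cannot multiply rows.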

jameshadfield, Aug 02 '21 04:08