
Add `augur curate` subcommands

huddlej opened this issue 2 years ago • 25 comments

Context

Prior to the SARS-CoV-2 pandemic, we assumed that Nextstrain workflows would begin with pre-curated sequences and metadata sourced either from our internal database ("fauna") or custom scripts maintained by individual users. This expectation is reflected in the workflows for Zika, Ebola, and TB. These internal databases allowed us to curate a collection of high-quality sequences and metadata across multiple data sources (e.g., GISAID, ViPR, INSDC, etc.) and start workflows from this “source of truth”.

During development of the ncov workflow, we made an effort to support users outside of the Nextstrain team and accept input data from diverse sources. These inputs typically need to be standardized to a single format for sequences and metadata, and duplicate records resolved, prior to running the rest of the workflow. Standardization appears in several places in our codebase as "sanitize" or "transform" steps of ncov-ingest or ncov, including:

Despite being written for the ncov workflow, these scripts perform common preprocessing tasks users will need for Nextstrain analyses of any pathogen. Much of the functionality in ncov-ingest scripts overlaps with that of the “sanitize” scripts in the ncov repository. This redundancy highlights the need for these common utilities even for a single pathogen with multiple valid input data sources (in this case, the GISAID API and the GISAID website search/downloads interface).

Some of the closest functionality of this kind lives in augur parse as functions to replace “forbidden characters” in strain names, “prettify” values of text fields, and fix date formatting. As augur parse was often the first command in a Nextstrain workflow, these functions fulfilled the “sanitize” or “transform” needs we had even after pulling from a curated database. As Emma pointed out in a previous discussion, users (ourselves included) benefit from learning about data quality issues at the beginning of the workflow instead of later on.

Similarly, latitude/longitude processing components of ncov-ingest originate from Emma’s own project to standardize locations and latitude/longitude values for use by Augur. This kind of functionality potentially applies to any pathogen with geographic information.

Description

Given the cross-pathogen nature of the standardization functions described above, this issue proposes a new augur subcommand to wrap sub-subcommands corresponding to the most useful sanitize/transform functions defined in the scripts above. The name of this subcommand should be generic enough to encompass the functions it performs but specific enough to clearly communicate what those functions are. Some possible candidates include:

augur clean
augur transform
augur sanitize
augur standardize
augur prepare (coming full circle to the old days!)

This command would implement the following functions from ncov-ingest and ncov scripts (and perhaps from augur parse and elsewhere):

Some of these functions obviously have many different implementations, so we should determine the best generic implementation for an augur command. Other functions may be too specific to SARS-CoV-2 or even the GISAID API ingest to warrant implementation in augur.

Most of these functions operate on individual records in metadata or sequences, while the deduplication logic operates across entire input files (adding columns arguably has the same scope). We may want to implement a separate augur deduplicate command to distinguish the scope of these operations. One may need to clean/transform/sanitize input, for example, before one can resolve duplicates.

Edit: As noted below, these subcommands should ideally support UNIX-style piping so we can compose/chain multiple subcommands in a single transformation operation.

Examples

A few examples of the functions above as commands (picking transform at random):

augur transform titlecase \
  --metadata metadata.tsv \
  --fields region country division \
  --output transformed_metadata.tsv

augur transform format-date \
  --metadata metadata.tsv \
  --date-column collection_date \
  --output transformed_metadata.tsv

augur transform rename-columns \
  --metadata metadata.tsv \
  --columns "Virus name=strain" "Collection date=date" \
  --output transformed_metadata.tsv

augur transform multi-to-one-line \
  --sequences sequences.fasta \
  --output transformed_sequences.fasta

augur transform normalize-strings \
  --metadata metadata.tsv \
  --output transformed_metadata.tsv

With ideal chaining support, we might be able to do the following:

augur transform titlecase --metadata metadata.tsv --fields region country division \
  | augur transform format-date --date-column collection_date \
  | augur transform rename-columns --columns "Virus name=strain" "Collection date=date" \
  | augur transform normalize-strings --output transformed_metadata.tsv

Proposed initial sub-subcommands

List of proposed initial sub-subcommands; we will link individual issues for each sub-subcommand as we plan them out.

The following commands should be straightforward to implement based on past scripts in ncov-ingest and monkeypox/ingest:

  • [x] normalize-strings (https://github.com/nextstrain/augur/issues/998)
  • [x] titlecase (https://github.com/nextstrain/augur/issues/999)
  • [x] format-dates (https://github.com/nextstrain/augur/issues/1001)
  • [ ] apply-geolocation-rules (https://github.com/nextstrain/augur/issues/1003)
  • [ ] apply-record-annotations

The following commands will require more thinking and design. There are implementations in fauna/ncov-ingest/monkeypox that can be improved:

  • [ ] format-strain-name (https://github.com/nextstrain/augur/issues/1204)
  • [ ] deduplicate (https://github.com/nextstrain/augur/issues/919)

huddlej avatar Mar 10 '22 00:03 huddlej

This is a great detailed overview, @huddlej! Thank you for the obvious care in putting it together.

One thought for now after a first read: I imagine any given workflow will invoke many of these commands in some sort of combination, the same way our existing workflows combine several of these sorts of steps into bespoke programs. For efficiency, it would be good to ensure that the output of one command can be cleanly and easily piped into the input of another (e.g. augur transform rename-column … | augur transform format-date …) or possibly even chained together in the same process (e.g. augur transform rename-column … | format-date …, syntax TBD). This allows unnecessary I/O overhead (disk reads/writes, serialization/deserialization, etc.) to be avoided, naturally makes use of multiple cores, and (most importantly! ;-) avoids the need to name all the intermediate files.

Relatedly, it occurs to me that maybe I should give a tour of recs (recs.pl) sometime (maybe a lab meeting? or something ad-hoc sooner?), as this proposal is starting to cover some of the same ground recs does/did, and things I learned with it could be useful here.

tsibley avatar Mar 10 '22 01:03 tsibley

Tom beat me to this; I had a similar response but a different possible solution...

Thanks so much for the detailed write-up. This really helped with my own understanding. I like the unix-y flavor of these small modular commands. However, I suspect in practice that I'd want to chain a bunch of bog-standard cleaning functions for my data from (say) ViPR. It will be somewhat annoying to write out a series of Snakemake rules that go from titlecase → format-date → rename-columns → multi-to-one-line → normalize-strings, when I really want the default behavior for a number of these.

This feels a bit similar to the issue we have with augur filter being chained repeatedly in the workflow to handle subsampling and then the desire to create a wrapper function of augur subsample to accomplish this chaining with less hassle.

Maybe you'd end up with a config-driven augur clean that could run multiple augur transform commands? It would create the complexity of an additional config file but result in a more streamlined workflow. Or maybe we could be better at breaking up the workflow into manageable linear chunks so that this could all be wrapped into an initial .smk workflow file.

trvrb avatar Mar 10 '22 01:03 trvrb

A config file with each step's args/params could absolutely be the "syntax TBD" I mentioned for multiple transforms "chained together in the same process". :-)

It would be important that the fields for each step in the config file map without much change to the command line args for that step run separately, so there's not a new set of options to learn and document. One downside of config files is the hazard of embedding filenames into them, as this makes them brittle. For workflows with wildcards, for example, you couldn't use a static config file with filenames in it, but each step might need different filenames (so it might be hard to pass them separately from the config).

(Even with config files, I do think actually piping between individual invocations is also important to allow (and not all that hard).)

tsibley avatar Mar 10 '22 01:03 tsibley

Thank you both for pointing this out! I imagined we could support piping between these commands, as a minimal way to support composing transforms. I just updated the issue to note this goal and include an example chained command.

p.s. recs looks cool, @tsibley! I would love to hear more about it and your experience building it...

huddlej avatar Mar 11 '22 21:03 huddlej

How would chaining work for transforming both sequences and metadata? Or is it expected that transform would operate on sequences or metadata separately, transforming the metadata first and then applying it to the sequences?

Also I'd imagine that when dealing with very large metadata files, loading them into memory repeatedly during chaining is not a great idea (though I don't know the internals of how augur loads data into memory via chunking).

An alternative I'd like to propose: allowing multiple transformations in a single command with --task and --fields flags. For example, this:

augur transform titlecase --metadata metadata.tsv --fields region country division \
  | augur transform format-date --date-column collection_date \
  | augur transform rename-columns --columns "Virus name=strain" "Collection date=date" \
  | augur transform normalize-strings --output transformed_metadata.tsv

Would be:

augur transform --metadata metadata.tsv \
--task titlecase \
--fields region country division \
--task format-date \
--fields collection_date \
--task rename-columns \
--fields  "Virus name=strain" "Collection date=date" \
--task normalize-strings \
--output transformed_metadata.tsv

Note I used only a single flag to specify the columns to operate on (--fields), compared to the three unique flags used across the chained transformations.

One issue is the more complex argument parsing that needs to happen, though I think argparse can handle such cases. Another issue is multiple actions on the same column, e.g. titlecase + normalize-strings.
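
As a rough illustration of that parsing, argparse custom actions could group each --fields with the preceding --task. This is a hypothetical sketch, not a settled design; all names are placeholders:

import argparse

class TaskAction(argparse.Action):
    # Start a new (task, fields) pair each time --task appears.
    def __call__(self, parser, namespace, values, option_string=None):
        tasks = getattr(namespace, "tasks", None) or []
        tasks.append({"task": values, "fields": []})
        namespace.tasks = tasks

class FieldsAction(argparse.Action):
    # Attach --fields values to the most recent --task.
    def __call__(self, parser, namespace, values, option_string=None):
        tasks = getattr(namespace, "tasks", None)
        if not tasks:
            parser.error("--fields must follow a --task")
        tasks[-1]["fields"] = values

parser = argparse.ArgumentParser()
parser.add_argument("--metadata")
parser.add_argument("--task", action=TaskAction)
parser.add_argument("--fields", nargs="+", action=FieldsAction)
parser.add_argument("--output")

args = parser.parse_args([
    "--metadata", "metadata.tsv",
    "--task", "titlecase", "--fields", "region", "country", "division",
    "--task", "format-date", "--fields", "collection_date",
    "--output", "transformed_metadata.tsv",
])
# args.tasks == [{'task': 'titlecase', 'fields': ['region', 'country', 'division']},
#                {'task': 'format-date', 'fields': ['collection_date']}]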

ammaraziz avatar May 10 '22 05:05 ammaraziz

@ammaraziz Thanks for your comment! It's nice to have more people thinking about this.

I think there are a number of downsides to the syntax you propose (some of which you note). My preference is for both multi-process chaining with pipes/files to be supported as well as one or more methods for single-process chaining.

I think one of the methods for single-process chaining can be driven by a config file describing each subcommand's normal arguments (see https://github.com/nextstrain/augur/issues/860#issuecomment-1063553185) instead of the --task + --fields deconstruction.

Another method for single-process chaining would be syntax like the following (or similar):

augur transform pipeline '
    titlecase --metadata metadata.tsv --fields region country division
  | format-date --date-column collection_date
  | rename-columns --columns "Virus name=strain" "Collection date=date"
  | normalize-strings --output transformed_metadata.tsv
'

This avoids having to genericize the arguments/options for each subcommand and should be familiar enough at a glance (albeit a bit magic-feeling maybe) to anyone used to the multi-process chaining with pipes. It does introduce potential for confusion around quoting, but that feels surmountable/avoidable in many cases (and there are always alternates; this wouldn't be the only invocation form allowed).

tsibley avatar May 10 '22 22:05 tsibley

Thank you @tsibley. I always worry I am intruding in these discussions!

A few things to clarify/add. I like generic arguments, but I have noticed the overall style of augur is to use specific arguments and options. There are both advantages and disadvantages to generic vs. specific flags. But in my opinion, for commands such as transform, which can potentially have many 'transformations', a specific flag for each transformation adds up quickly, and it might be a better tradeoff if generic arguments are used. However, as you guys will be doing the hard yards, I will be happy with any outcome!

I too like the idea of a config file, excellent idea! However, as Emma mentioned in another thread, supporting more than one style of chaining could just add extra complexity. The piping style mentioned above as an example and in the first post is really good.

Finally, just to clarify as I may have missed it: will the transform subcommand also act on the sequences, or is it designed specifically to work on the metadata file, with a separate and possibly final option to modify the fasta file corresponding to the metadata file?

ammaraziz avatar May 15 '22 23:05 ammaraziz

@ammaraziz Not intruding at all. :-) It's good to get feedback and input and questions from outside the core team.

To answer your question about handling of sequences, our thinking has been that some of the transform subcommands will definitely have to handle sequences. For example, anything that's massaging strain name in metadata would want to make corresponding updates to the name used for the sequence. It's an interesting idea to batch these up and do them once at the end instead of doing them as the data streams through. I'm not sure how well that'd work with a more modular design of individual commands; it seems like it might make modularity harder. Would be interesting to think through some more though.

tsibley avatar May 27 '22 18:05 tsibley

FYI, I built the monkeypox/ingest transform rule as a prototype for this proposed subcommand.

The shell pipeline connects small Python scripts that all read NDJSON records from stdin and then output the transformed records to stdout. The NDJSON records contain both metadata and sequences so the final step separates the NDJSON records to a metadata TSV file and a sequence FASTA file.
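
For a concrete picture of that convention, a minimal curate-style step might look like the following Python sketch, which reads one NDJSON record per line from stdin, transforms it, and writes it back to stdout (the "strain" field name and the transformation itself are assumptions for illustration):

#!/usr/bin/env python3
import json
import sys

for line in sys.stdin:
    record = json.loads(line)
    # Hypothetical transformation: tidy whitespace around the strain name.
    record["strain"] = record.get("strain", "").strip()
    # Write the transformed record back out as one NDJSON line.
    print(json.dumps(record))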

joverlee521 avatar Jun 08 '22 21:06 joverlee521

Thanks @joverlee521 that example is super helpful :) Using ndjson as the format allows sequences & metadata to be piped together which allows us to change the strain name. Nice!

I'm thinking about how we would support data that wasn't row-based (like sequences + metadata are). For instance, given the following dataset:

ID   city        latitude  longitude  lineage  lineage__color
AAA  Auckland    -37       174        B.1.1    #F020E2
BBB  Auckland    -39       175        C.12     #20C9F0
CCC  Wellington  -41       174        C.12     #20C9F0

I've written scripts at various points in time to extract out the following (a sketch of the first extraction appears after this list):

  • a colours TSV mapping lineage -> hex. (Including some form of colour averaging if there are multiple hexes per lineage, which happens.)
  • A city -> lat/long TSV, which sets the lat/long based on the average of observed lat/longs for each city. Here there'd be 2 demes: Auckland, Wellington
  • A new metadata column which contains the unique lat/longs observed in the dataset, and a corresponding lat/long TSV file. Here there would be 3 demes, but we may also parametrise this to aggregate so that there are fewer.
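
As a rough sketch of the first extraction, assuming the example columns above (the file names and the three-column colors TSV layout are assumptions for illustration, and the colour-averaging variant is omitted):

import csv

hex_by_lineage = {}
with open("metadata.tsv") as infile:
    for row in csv.DictReader(infile, delimiter="\t"):
        # Keep the first hex seen per lineage.
        hex_by_lineage.setdefault(row["lineage"], row["lineage__color"])

with open("colors.tsv", "w", newline="") as outfile:
    writer = csv.writer(outfile, delimiter="\t")
    for lineage, hex_color in sorted(hex_by_lineage.items()):
        # Rows of: trait name, trait value, hex colour.
        writer.writerow(["lineage", lineage, hex_color])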

Would this be considered in scope for augur transform? (I'd hope so). Can this be achieved by a ndjson interchange file (and thus use pipes)?

jameshadfield avatar Jun 09 '22 03:06 jameshadfield

@jameshadfield The examples you've listed seem like very general CSV/TSV manipulations that can be done with other tools (e.g. tsv-utils or csvtk). I don't think we want to reinvent the wheel here with augur transform. I would prefer to limit augur transform to very specific data transformations needed to standardize metadata/sequences.

joverlee521 avatar Jun 09 '22 19:06 joverlee521

On naming: my only thought is not augur transform, since it looks/sounds similar to the existing but entirely different augur translate. I don't have strong thoughts on the other proposals so far, and don't have any other suggestions.

victorlin avatar Jun 16 '22 18:06 victorlin

Re: names: augur curate is my current frontrunner as the concept of curation encompasses all of the subcommands we've talked about. It occurred to me earlier this week when talking with @joverlee521.

tsibley avatar Jun 16 '22 18:06 tsibley

Reposting my comment from a Slack thread:

Augur also already has a standard of supporting multiple output formats for a single command, depending on the user’s needs (augur filter is the best example of this), but I think it’s reasonable for transform subcommands to support arguments like --output-ndjson, --output-sequences, and --output-metadata where NDJSON to stdout is the default output and NDJSON input from stdin is the default input. This would simplify chaining multiple transforms and still support the standard output formats a user would expect for any one-off transforms.

Along the same lines as that comment, our SURP intern recently needed to use the transform-date-fields tool from the monkeypox ingest to clean up dates from VIPR metadata. She wanted to run this command like:

./transform-date-fields \
  --date-fields date \
  --expected-date-formats "%Y_%m_%d" "%Y_%m" \
  metadata.tsv > metadata_cleaned.tsv

But this didn't work, since the script only reads from stdin and only expects to receive NDJSON. She then thought to use the csv-to-ndjson script to convert her metadata prior to piping into the date script, but her metadata was in TSV format (our standard output from augur parse and elsewhere in our workflows), so she got stuck needing to manually convert TSV to CSV. Then the output of the transform-date-fields script is NDJSON, so she had to convert NDJSON back to TSV. In the end, the command looked something more like:

./tsv-to-csv metadata.tsv  | # hypothetical transform 
  ./csv-to-ndjson |
  ./transform-date-fields --date-fields date --expected-date-formats "%Y_%m_%d" "%Y_%m" |
  ./ndjson-to-tsv > metadata.tsv # hypothetical transform

A couple of UI features appear important in these examples:

  • We should support TSV as a standard format for metadata, even if we also support CSV. The rest of our workflows and tools generally expect TSV (even if some commands like augur filter accept and produce both).
  • We should support optional specification of non-NDJSON inputs and outputs for each augur curate subcommand. Although many examples we've discussed to date emphasize the chained nature of these commands, it's a reasonable use case for users to apply a single curate command for a single purpose. In these cases, it is also reasonable to not ask the user to know about converting data formats that are unique to the curate command (and that are likely foreign to users). These I/O arguments will also facilitate simpler interactions with other bioinformatics tools without boilerplate code that we'd otherwise require to transform between file formats.

Putting those features together with the Slack comment above, I propose that we support explicit input and output flags for each curate command that allow users to specify standard file formats and that we support default inputs and outputs from stdin and stdout in NDJSON format. Specific examples include:

# Single use of date field transform.
./transform-date-fields [...args...] --metadata metadata.tsv --output-metadata metadata_cleaned.tsv

# Chained transforms within `augur curate`.
# Input is explicitly passed as a TSV file to the first command.
# Output from the first command is implicitly NDJSON to be consumed by the next command.
# Input and output for the second are implicitly NDJSON.
# Input for the third command is implicitly NDJSON and output is explicitly TSV.
./transform-date-fields [...args...] --metadata metadata.tsv |
  ./transform-strain-names [...args...] |
  ./transform-string-fields [...args...] --output-metadata metadata_cleaned.tsv

# Chained transforms with `augur curate` and another bioinformatics tool.
# Explicit output to stdout in TSV enables the interaction with csvtk. 
./transform-date-fields [...args...] --metadata metadata.tsv |
  ./transform-strain-names [...args...] |
  ./transform-string-fields [...args...] --output-metadata /dev/stdout |
  csvtk --tabs sample [...args...] > metadata_cleaned.tsv

The use of /dev/stdout is not ideal in the last example. We could instead make output to stdout the default and allow users to specify the format of that output with a flag like:

# Chained transforms with `augur curate` and another bioinformatics tool.
# Explicit output to stdout in TSV enables the interaction with csvtk. 
./transform-date-fields [...args...] --metadata metadata.tsv |
  ./transform-strain-names [...args...] |
  ./transform-string-fields [...args...] --output-format tsv |
  csvtk --tabs sample [...args...] > metadata_cleaned.tsv

This approach of the format flag only works if we can know that each command will only have a single output stream. Maybe that's a reasonable assumption and ndjson-to-tsv-and-fasta is the exception to the rule.
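
Under that single-output-stream assumption, the dispatch behind an --output-format flag could be as simple as this hypothetical sketch (the function and flag names are illustrative only, not an actual augur API):

import csv
import json
import sys

def write_records(records, out=sys.stdout, output_format="ndjson"):
    # Serialize all records to one output stream based on --output-format.
    if output_format == "ndjson":
        for record in records:
            print(json.dumps(record), file=out)
        return
    if output_format in ("tsv", "csv"):
        records = iter(records)
        first = next(records, None)
        if first is None:
            return
        writer = csv.DictWriter(out, fieldnames=list(first),
                                delimiter="\t" if output_format == "tsv" else ",")
        writer.writeheader()
        writer.writerow(first)
        writer.writerows(records)
        return
    raise ValueError(f"unsupported output format: {output_format!r}")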

The current proposed requirement of NDJSON as the only possible input and output format for curate commands would require something like:

# Chained transforms with `augur curate` and another bioinformatics tool.
./tsv-to-ndjson metadata.tsv | # hypothetical transform
  ./transform-date-fields [...args...] |
  ./transform-strain-names [...args...] |
  ./transform-string-fields [...args...] |
  ./ndjson-to-tsv |
  csvtk --tabs sample [...args...] > metadata_cleaned.tsv

The UX cost of the initial and final transforms to/from NDJSON decreases with the number of intermediate commands in the chain. The cost is highest for a single-use transform like the first example above, though. The question is whether we want to charge that UX cost to the user with the NDJSON-only curate commands or whether we want to charge the technical cost to ourselves to provide alternative I/O options for those commands.

The HTSlib (e.g., samtools/bcftools) ecosystem of commands that originally inspired the modular Augur design could be a useful example for this same kind of UI consideration. For example, it's possible to chain bcftools commands with an initial input in a standard file format (VCF), intermediate data streamed to stdout in uncompressed BCF format (analogous here to our use of NDJSON), and final data written to an explicit output file in a standard format (compressed VCF).

huddlej avatar Jun 30 '22 21:06 huddlej

I was really psyched about curate, but then the more I (over)thought about it, the more it seems that "curation" is more like "filtering" in that it describes carefully selecting (or organizing) a subset of items from all available items. This doesn't match actions of most subcommands listed above like converting date strings from one format to another.

Other options that occurred to me last night included fix and format.

I like fix for its brevity, but it also implies there is something wrong or broken with the input data. That's an opinionated stance we could take, maybe...

I like format because most of the subcommands proposed so far involve text formatting. It is a little prosaic compared to "curate", but it seems more aligned with what the subcommands are doing.

I'm also still ok with transform, since that verb accurately describes what's happening in most subcommands. I'm not as worried about it sounding like translate.

Trying this out with the example commands from my last comment:

# Single use of date field transform.
augur format date-fields [...args...] --metadata metadata.tsv --output-metadata metadata_cleaned.tsv

# Chained transforms with `augur curate` and another bioinformatics tool.
# Explicit output to stdout in TSV enables the interaction with csvtk. 
augur format date-fields [...args...] --metadata metadata.tsv |
  augur format strain-names [...args...] |
  augur format string-fields [...args...] --output-format tsv |
  csvtk --tabs sample [...args...] > metadata_cleaned.tsv

huddlej avatar Jul 01 '22 17:07 huddlej

Missing out on using @joverlee521's great work in other pathogen repos over naming issues is unfortunate. This feels like chicken and egg to me: I'd like to use these in practice to get a feel for what would be good naming, but I can't use them until a name is decided for implementation.

If we can't decide on a name now, would it be possible to start with curate with the option to alias/change it to something else later? An example timeline:

  • Until name is settled, print WARNING: experimental command! Name and behavior subject to change.
  • 16.1.0: augur curate format-date is available!
  • 16.2.0: augur curate transform-authors is available!
  • 16.3.0: augur curate merge-user-metadata is available!
  • 16.4.0: We decide format is a better name. Old commands still work, but are aliases with DeprecationWarnings:
    • augur curate format-date -> augur format date
    • augur curate transform-authors -> augur format authors
    • augur curate merge-user-metadata -> augur format merge-user-metadata
  • 17.0.0: After using these in practice for a while, we realize augur format is indeed the better name. Remove old command names and warnings.

victorlin avatar Jul 01 '22 23:07 victorlin

+1 for @huddlej's thoughts on I/O. They match how I imagined the framework that individual commands plug into (regardless of whether Augur-provided or user-provided).

Re: --output-metadata /dev/stdout being non-ideal, a standard convention would be handling --output-metadata - to mean the same thing. The bare hyphen when a filename is expected almost always means "stdout" or "stdin" depending on context. This potentially avoids the need for an --output-format flag and the caveats it comes with that John mentioned.
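
In code, handling the bare hyphen is a small wrapper; this is a generic sketch of the convention, not Augur's actual I/O layer:

import sys
from contextlib import contextmanager

@contextmanager
def open_output(path):
    # "-" means stdout, per the bare-hyphen convention.
    if path == "-":
        yield sys.stdout
    else:
        with open(path, "w") as handle:
            yield handle

# e.g. --output-metadata - streams TSV to stdout for the next pipe stage:
with open_output("-") as out:
    out.write("strain\tdate\n")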

+1 for @victorlin's thoughts on naming and getting something out sooner than later.

Re: the term curate and the semantic concerns @huddlej raised, I suggested curate in the sense of data curation, which is heavily focused on fixing, formatting, tidying, annotating, augmenting, etc., rather than museum curation or music curation, which are indeed more heavily focused on selecting, filtering, etc. Wikipedia says this about data curation:

Data curation is the organization and integration of data collected from various sources. It involves annotation, publication and presentation of the data such that the value of the data is maintained over time, and the data remains available for reuse and preservation.

which seems to fit well with what we're doing here.

tsibley avatar Jul 05 '22 16:07 tsibley

Good call on clarifying the "data curation" context (tidy is another good option!), although that ambiguity is what concerned me in the first place (the determining "what information is worth saving and for how long" bit).

If no one else is bothered by this, though, I'd say we adopt curate permanently. We might as well get it right the first time and not worry about aliasing in the future.

huddlej avatar Jul 05 '22 17:07 huddlej

Thank you for the detailed examples @huddlej! I fully support the I/O options for individual commands especially to make one-off calls easier for users. With that in mind, the augur curate work would no longer be blocked by the issue to expand augur parse (if we even still want to pursue that?)

We would want to have a clean UI for these I/O options, which I think we can achieve with a separate shared base parser (a la https://stackoverflow.com/a/23296874).
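
A minimal sketch of that shared base parser, using argparse's parents= mechanism from the linked answer (the subcommand and option names are placeholders, not the final design):

import argparse

# Shared I/O options defined once and inherited by every curate subcommand.
io_parser = argparse.ArgumentParser(add_help=False)
io_parser.add_argument("--metadata", help="input metadata file (default: NDJSON from stdin)")
io_parser.add_argument("--output-metadata", help="output metadata file (default: NDJSON to stdout)")

parser = argparse.ArgumentParser(prog="augur curate")
subparsers = parser.add_subparsers(dest="subcommand")

titlecase = subparsers.add_parser("titlecase", parents=[io_parser])
titlecase.add_argument("--fields", nargs="+")

format_dates = subparsers.add_parser("format-dates", parents=[io_parser])
format_dates.add_argument("--expected-date-formats", nargs="+")

args = parser.parse_args(["titlecase", "--metadata", "metadata.tsv", "--fields", "region"])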

Re: naming, I'm happy with augur curate

joverlee521 avatar Jul 05 '22 23:07 joverlee521

In a similar vein of UX cost, I was thinking that each individual command should normalize strings itself, in addition to there being a standalone normalize-strings command.

Ideally, an augur curate chain would start with normalize-strings so that subsequent commands do not have to worry about it. However, for one-off calls, each command should normalize strings to avoid unexpected results from string comparisons. The default behavior for each command should be to normalize strings as the first step so users don't have to remember to "turn it on". The commands can all have a --skip-normalization flag to skip this first step in the case that they are in an augur curate chain:

augur curate normalize-strings ...
      | augur curate titlecase --skip-normalization ...
      | augur curate format-dates --skip-normalization ...
      | augur curate apply-geolocation-rules --skip-normalization ...
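
For reference, the normalization step itself is tiny. Here is a sketch of what each command might run by default, assuming Unicode NFC as the target form (the actual form is an open design choice):

import unicodedata

def normalize_strings(record, form="NFC"):
    # Normalize all keys and string values of a record to one Unicode form
    # so that later string comparisons behave predictably.
    return {
        unicodedata.normalize(form, key):
            unicodedata.normalize(form, value) if isinstance(value, str) else value
        for key, value in record.items()
    }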

joverlee521 avatar Jul 05 '22 23:07 joverlee521

Dumping some thoughts/questions here after talking more about the I/O framework with @tsibley today.


If a user provides both a metadata TSV and a FASTA file as inputs, I think we would only ever do an "inner" join that only keeps records that are present in both files. However, I do want to provide users with an option to dictate how to handle unmatched records. This could be an --unmatched option with the following choices:

  1. error = raise an error for unmatched records
  2. warn = output warnings for unmatched records
  3. silent = pass unmatched records silently

With the assumption that records in the two files should be 1:1, I would make error the default.
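
As a sketch of how the inner join and the --unmatched option could interact (dicts keyed by record id stand in for the real streamed inputs; all names are hypothetical):

import sys

def inner_join(metadata, sequences, unmatched="error"):
    # Records present in only one of the two inputs are unmatched.
    unmatched_ids = set(metadata) ^ set(sequences)
    if unmatched_ids and unmatched != "silent":
        message = f"{len(unmatched_ids)} unmatched record(s): {sorted(unmatched_ids)}"
        if unmatched == "error":
            raise ValueError(message)
        print(f"WARNING: {message}", file=sys.stderr)  # unmatched == "warn"
    # Keep only records present in both inputs.
    for record_id in metadata.keys() & sequences.keys():
        yield {**metadata[record_id], "sequence": sequences[record_id]}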


Should augur curate handle FASTA files that have metadata/descriptions in the FASTA headers?

We already handle this type of FASTA with augur parse so @tsibley suggested that we can point users to use augur parse to separate the file into two files that can then be provided as inputs to the augur curate command.

If they already have metadata file and only want to keep the sequence id in the FASTA header, then they can also use something like seqkit replace to edit the headers.

It's unclear to me how often people work with this type of FASTA file, so I'm not sure if it would be a huge UX burden to ask users to preprocess FASTA files to only have the sequence id in the header.

joverlee521 avatar Aug 24 '22 00:08 joverlee521

When you refer to a record, do you mean data in the fasta file, the metadata row entry, or both? I think there needs to be a distinction made between unmatched metadata records and unmatched fasta records. A missing metadata record can cause the augur pipeline to fail, while a missing fasta record is usually ignored (I think; feel free to correct me).

I think the option to error, warn, or stay silent is a good idea. I would like a helpful message when an error/warning is produced, such as:

The following strains were not found in the metadata file:
The following strains were not found in the fasta file:

Or write out which strains are missing metadata or fasta records.

Personally, I care more about missing metadata than missing fasta entries, so a warning is really helpful.

Should augur curate handle FASTA files that have metadata/descriptions in the FASTA headers?

Agreed this is solved by augur parse or seqkit replace so there's no need to repeat things.

ammaraziz avatar Aug 24 '22 06:08 ammaraziz

Thank you for the feedback @ammaraziz!

When you refer to record do you mean data in the fasta file, the metadata row entry or both? I think there needs to be a distinction made between unmatched metadata records and unmatched fasta records.

Yes, I mean records in both files. I agree that there definitely should be a distinction between unmatched metadata records and unmatched fasta records.

joverlee521 avatar Aug 24 '22 17:08 joverlee521

If a user provides both a metadata TSV and a FASTA file as inputs, I think we would only ever do an "inner" join that only keeps records that are present in both files.

There are cases when the user's input may include multiple sequence records for the same strain id. In the past, this has been an issue with GISAID downloads (caused by multiple submissions of the same strain name that would need to be deduplicated by accession id, if we had it). If we assume that curation happens after deduplication, the inner join assumption seems great. If we expect curation to precede deduplication, though, we need to allow for a left-join with metadata as the source of truth on the left.

It is also possible that the user could have multiple entries in the metadata for the same strain id, although I can't remember encountering this myself with standard GISAID or NCBI data. It might be nice to distinguish in the warning/error message to users between missing and duplicate records, to handle this case. Then, we could include a reminder for users to deduplicate their data prior to running the curate commands.

huddlej avatar Aug 25 '22 16:08 huddlej