ncov icon indicating copy to clipboard operation
ncov copied to clipboard

augur refine crashes when the root sequence is not included in the build

Open huddlej opened this issue 3 years ago • 12 comments

Context A common use case for ncov builds is to search for specific sequences in GISAID (e.g., for a specific emerging lineage, state, country, etc.) and create a build for these data. The simplest possible builds.yaml for this analysis should be:

inputs:
  - name: gisaid-data
     sequences: data/downloaded_sequences.fasta
     metadata: data/downloaded_metadata.tsv

In practice, users need to also include metadata and sequences for the reference Wuhan/Hu-1/2019 because the workflow uses this strain to root the time tree. When users do not manually include this reference, the workflow runs all the way through to the refine step before throwing a confusing error about "unsupported rooting mechanisms or root not found".

Description We should include the reference sequence and metadata as inputs to all analysis even if users haven't requested them.

Possible solution Instead of including the reference and metadata early in the workflow, we should force-include it either just before creating subsampled data or just after.

Alternately, we could define a different rooting mechanism that doesn't rely on a specific reference strain.

Additional context Thank you to @BryanTegomoh for reporting this issue!

(Edit: Updated title to reflect the actual error and not a proposed solution.)

huddlej avatar Aug 02 '21 22:08 huddlej

Hi Nextstrain Team,

I was wondering if I could have this issue assigned to me. My approach would be similar to the possible solution. I would include the reference and metadata before creating the subsampled data to to the time tree. I would also love any feedback on this approach. Thank you!

emontag47 avatar Dec 01 '21 15:12 emontag47

@emontag47 Assigned! Thanks for volunteering and please feel free to discuss/comment on things here.

tsibley avatar Dec 04 '21 00:12 tsibley

Hi Nextstrain Team and @tsibley,

I am attempting to push some changes to a branch I created to eventually submit a pull request. When I try to push, I get the error "remote: Permission to nextstrain/ncov.git denied to emontag47. fatal: unable to access 'https://github.com/nextstrain/ncov.git/': The requested URL returned error: 403". After searching for a bit on Google, I found that this error could be due to not having permission to push to the repository. Is there anyway someone could give me permission to push to branches or offer some insight into how I could solve this issue? Thank you!

emontag47 avatar Dec 14 '21 03:12 emontag47

Hi Eric - the common way to do this is to fork the repo and submit a PR from your fork. Thanks in advance for your contribution!

jameshadfield avatar Dec 14 '21 03:12 jameshadfield

Hi @jameshadfield! I will try that and post here again if I am still having issues. Thank you for your help!

emontag47 avatar Dec 14 '21 03:12 emontag47

@jsan4christ reported this issue recently to @rneher and myself. San is one of many people who have experienced this issue.

Although there is a workaround of including the reference data as a separate input for a given build, this solution is not clearly documented. Ideally, the workflow would be more aware of the presence of the reference as early as possible and handle missing reference data more gracefully by either:

  1. telling the user how to modify their build configuration to include the reference
  2. automatically adding the references, as needed, and then removing them when they are no longer needed.

Both solutions require the workflow to check for the presence of the root in the subsampled data or perhaps the original input data.

Both solutions could be implemented as a rule that runs a script that checks for the root in the subsampled data. To implement the first solution, that script would print a helpful error message about how to modify the build config and exit the workflow with an error. If the reference data are present, the script would create a placeholder file that allows the workflow to continue without an error. A major disadvantage to this solution is that it alerts the user about the missing references after the most I/O-intensive parts of the workflow have completed and it requires users to modify their build config and re-run all of those parts again, just to get a single sequence added to their analysis. The rule to check for the reference data could be placed earlier in the workflow where it could check all inputs, but this would be complicated by the many different entry points to the workflow and the potentially large input files that would need to be searched.

To implement the second solution, a similar script would also accept reference sequences and metadata as an input and produce a new set of subsampled data with the reference data added, when the reference data are missing. It would never throw an error and would always produce output FASTA and metadata to be used by the rest of the workflow. This solution does have a disadvantage of modifying the user's data without their express approval.

Augur's align command uses something like the second solution to ensure that the requested reference is included in an alignment and then provides an option to remove the reference after alignment. A third solution would be to modify augur refine to behave similarly to augur align, by accepting a reference sequence and metadata as the root that could be automatically removed after running TreeTime in the case where it wasn't present already.

huddlej avatar Feb 25 '22 01:02 huddlej

Bumping this, since @danrlu and team just ran into this issue in another annoyingly subtle manifestation where the reference was in the excluded_by_diagnostics.txt file because it didn't have the expected number of mutations compared to the rest of the data in the build.

Building on what I previously described in February, we could solve this issue with the following steps:

  1. Add a new rule after the combine samples rule that adds the reference sequence and metadata to the subsampled data, if they don't already exist.
  2. Add another hardcoded value to the --include argument of the filter step of the workflow that includes strains in defaults/include.txt (despite what the user provides as the include file).

I think this solution is the simplest one that accomplishes the outcome we want. The reference has to be included right after subsampling (or alignment) instead of right before tree building or time tree building because it needs to be masked and indexed.

If we are worried that users will not want the reference force-included in their analyses (despite the fact that we need it to root the time tree), we could add a configuration option to remove_reference and a rule after refine that removes the reference.

Yet another approach would be to change the rooting mechanism for refine to "best" instead of hardcoding a specific strain. This would be the simplest solution, as it eliminates the need for the reference metadata and sequence completely. That said, I don't think this is what we want biologically. There is nothing stopping others from changing the value of root to "best" in their own build configurations though...

huddlej avatar Apr 18 '22 18:04 huddlej

I like the 1+2 proposal above. Keep in mind that even if we could reliably root on "best" (essentially maximizing root-to-tip correlation) we'd be left with an issue of calling mutations. By including Wuhan root mutations are reliably called relative to this reference. This is important for clade assignment as well as general understanding.

trvrb avatar Apr 18 '22 18:04 trvrb

I agree with Trevor - I think we'd invite more confusion if mutation calling started getting out-of-whack with what's standard.

emmahodcroft avatar Apr 18 '22 18:04 emmahodcroft

This sounds great, thank you @huddlej .

I would perhaps also consider adding a conditional to the proximity calculation script that excludes the reference (i.e., don't enrich for samples similar to the ref sequence).

sidneymbell avatar Apr 18 '22 19:04 sidneymbell

@sidneymbell I was wondering about that point in @danrlu's Slack thread. I would expect the reference to get ignored during proximity calculations based on the --ignore-seqs that specifies the root strain name. Do you find that you still see the root strain (Wuhan/Hu-1/2019) in the proximity scores or perhaps you see the other Wuhan reference strain that we use for GenBank-based builds (Wuhan/WH01/2019)?

huddlej avatar Apr 18 '22 20:04 huddlej

I'm learning so much through this discussion 😆 I didn't know --ignore-seqs Wuhan/Hu-1/2019 is in place by default (I see it in our run log). We removed all ref strain from include.txt just so we don't get sequences pulled due to proximity to the root, but in fact we only needed to remove the other 2 ref sequences (such as Wuhan/WH01/2019), and can keep Wuhan/Hu-1/2019 in include.txt 🤩 But currently we also repurposed our free-of-reference include.txt for record keeping and metadata overlay template generation. Just want to add our train of thought for reference. Thanks very much for all the improvements!!!

danrlu avatar Apr 18 '22 21:04 danrlu