usher
Many sequences missing, like USA/PA-VSP3406/2021
When uploading sequences to UShER, I often get an error or warning saying that at least some of the sequences weren't found in the tree. What's the reason for this? Am I doing anything wrong? I use strain names as extracted from GISAID by Nextstrain's ncov-ingest.
A particularly bad case was this Bahrain strain list, where only 2 out of 300 sequences were found by UShER.
This is the error message:
Unable to find 47 of your sequences in the tree, e.g. 'USA/PA-VSP3406/2021',
'USA/NY-PRL-2021_0726_00K09/2021', 'USA/CA-CDC-LC0202791/2021',
'USA/NM-NMDOH-2021133303/2021', 'USA/LA-BIE-LSUH001500/2021'

There are several possible reasons:
- Some sequences are permanently excluded because they had 10 or more equally parsimonious placements
- Some sequences are temporarily excluded because they had more than 5 equally parsimonious placements (they get a new chance each day)
- Some sequences are permanently excluded because they have fewer than 20,000 non-N bases
- The tree has an outdated name for the sequence
- [This is probably the reason for most of your examples] The sequence has been submitted to both GenBank and GISAID, but with slightly different names. The tree uses the GenBank name. We're currently unable to match up the GISAID name -- but I should be able to fix this, thanks for sending examples!
For your USA/PA-VSP3406/2021 example, its GenBank name is USA/VSP3406/2021. The web interface currently has a file that maps EPI_ISL_4498479 to the GenBank accession and name -- but it does not have a mapping of the GISAID name to the GenBank name. I should be able to add that. In the meantime, would it be possible for you to try EPI_ISL IDs instead of names?
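In case it helps with switching to EPI_ISL IDs, here is a minimal sketch of looking up the IDs for a list of strain names in a metadata TSV. It assumes the metadata has `strain` and `gisaid_epi_isl` columns, which I believe is what ncov-ingest's metadata uses, so adjust the column names if yours differ. The function name `strains_to_epi_isl` is made up for this example.

```python
import csv
from typing import Iterable, List, TextIO, Tuple


def strains_to_epi_isl(
    metadata: TextIO,
    strains: Iterable[str],
    name_col: str = "strain",          # assumed column name
    id_col: str = "gisaid_epi_isl",    # assumed column name
) -> Tuple[List[str], List[str]]:
    """Return (ids, missing): EPI_ISL IDs for the strain names that were found
    in the metadata, and the strain names that had no usable ID."""
    # De-duplicate while keeping the order of the input list; skip blank lines.
    wanted = list(dict.fromkeys(s.strip() for s in strains if s.strip()))
    wanted_set = set(wanted)

    lookup = {}
    for row in csv.DictReader(metadata, delimiter="\t"):
        name = row.get(name_col) or ""
        epi = row.get(id_col) or ""
        # Only accept values that look like GISAID accessions.
        if name in wanted_set and epi.startswith("EPI_ISL_"):
            lookup[name] = epi

    ids = [lookup[s] for s in wanted if s in lookup]
    missing = [s for s in wanted if s not in lookup]
    return ids, missing
```

You would call it with an open metadata file and your list of names, for example `strains_to_epi_isl(open("metadata.tsv", newline=""), open("strains.txt"))` (both file names are placeholders), then upload the returned IDs one per line instead of the names. Anything that comes back in `missing` has no EPI_ISL ID in the metadata and would need to be checked by hand.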