Many sequences missing, like USA/PA-VSP3406/2021

Open corneliusroemer opened this issue 4 years ago • 1 comments

When uploading sequences to UShER, I often get an error or warning saying that some of the sequences weren't found. What's the reason for this? Am I doing something wrong? I use strain names as extracted from GISAID by Nextstrain's ncov-ingest.

A particularly bad case was this Bahrain strain list, where UShER found only 2 out of 300 sequences.

This is the error message:

Unable to find 47 of your sequences in the tree, e.g. 'USA/PA-VSP3406/2021',
'USA/NY-PRL-2021_0726_00K09/2021', 'USA/CA-CDC-LC0202791/2021',
'USA/NM-NMDOH-2021133303/2021', 'USA/LA-BIE-LSUH001500/2021'

S:1264L.subsample_1000.txt

corneliusroemer, Oct 15 '21 19:10

There are several possible reasons:

  • Some sequences are permanently excluded because they had 10 or more equally parsimonious placements
  • Some sequences are temporarily excluded because they had more than 5 equally parsimonious placements (they get a new chance each day)
  • Some sequences are permanently excluded because they have fewer than 20,000 non-N bases
  • The tree has an outdated name for the sequence
  • [This is probably the reason for most of your examples] The sequence has been submitted to both GenBank and GISAID, but with slightly different names. The tree uses the GenBank name. We're currently unable to match up the GISAID name -- but I should be able to fix this, thanks for sending examples!
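
To make the first three thresholds concrete, here is a rough Python sketch of how they combine. It is only an illustration of the rules listed above, not the actual tree-building pipeline code; the function names `count_non_n_bases` and `classify_exclusion` are hypothetical, and treating every character other than N as a "non-N base" is a simplifying assumption.

```python
def count_non_n_bases(sequence: str) -> int:
    """Count characters that are not 'N' (case-insensitive).

    Assumption: every non-N character counts as a base.
    """
    return sum(1 for base in sequence.upper() if base != "N")


def classify_exclusion(num_placements: int, non_n_bases: int) -> str:
    """Apply the thresholds from the list above (hypothetical helper).

    num_placements: number of equally parsimonious placements.
    non_n_bases: number of non-N bases in the sequence.
    """
    if non_n_bases < 20_000:
        return "permanently excluded: fewer than 20,000 non-N bases"
    if num_placements >= 10:
        return "permanently excluded: 10 or more equally parsimonious placements"
    if num_placements > 5:
        return "temporarily excluded: more than 5 placements, retried daily"
    return "eligible"


if __name__ == "__main__":
    print(classify_exclusion(num_placements=7, non_n_bases=29_500))
```

Note that none of these checks covers the naming mismatches in the last two bullets, which need a name lookup rather than a quality filter.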

For your USA/PA-VSP3406/2021 example, its GenBank name is USA/VSP3406/2021. The web interface currently has a file that maps EPI_ISL_4498479 to the GenBank accession and name -- but it does not have a mapping of the GISAID name to the GenBank name. I should be able to add that. In the meantime, would it be possible for you to try EPI_ISL IDs instead of names?
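
If it helps, converting a strain list to EPI_ISL IDs could look roughly like the sketch below. It assumes a tab-separated metadata file with one column of strain names and one of EPI_ISL IDs; the column names `strain` and `gisaid_epi_isl` and the function `names_to_epi_isl` are assumptions for illustration, so adjust them to whatever your ncov-ingest output actually contains. The sample row pairs the name and ID from the example above.

```python
import csv
import io


def names_to_epi_isl(names, metadata_tsv, name_col="strain", id_col="gisaid_epi_isl"):
    """Map strain names to EPI_ISL IDs using a TSV metadata file object.

    Returns (found, missing): a dict of name -> ID, and a list of names
    with no entry in the metadata. Column names are assumptions.
    """
    reader = csv.DictReader(metadata_tsv, delimiter="\t")
    lookup = {row[name_col]: row[id_col] for row in reader}
    found = {name: lookup[name] for name in names if name in lookup}
    missing = [name for name in names if name not in lookup]
    return found, missing


# Tiny in-memory stand-in for a real metadata file.
sample_metadata = io.StringIO(
    "strain\tgisaid_epi_isl\n"
    "USA/PA-VSP3406/2021\tEPI_ISL_4498479\n"
)
found, missing = names_to_epi_isl(
    ["USA/PA-VSP3406/2021", "Example/not-in-metadata/2021"],
    sample_metadata,
)
print("\n".join(found.values()))  # one ID per line, ready to upload
```

With a real file, pass `open("metadata.tsv", newline="")` instead of the `StringIO` object, and check the `missing` list for names that need manual attention.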

AngieHinrichs, Oct 15 '21 20:10