
Fix strain name sanitize logic

Open huddlej opened this issue 2 years ago • 1 comments

Description of proposed changes

Fixes three bugs. The first bug is in the workflow's sanitize logic when it runs on data from GISAID's full FASTA and metadata downloads. GISAID replaces whitespace in strain names with underscores for some downloads, but not for the larger ones. The sanitize metadata script previously converted whitespace in strain names to underscores (following the GISAID convention), while the sanitize sequences script removed whitespace from strain names (following the Nextstrain convention). As a result, builds that relied on sequences from countries with spaces in their names (e.g., "South Africa") were missing all of the expected sequences from those countries, because the metadata had names like South_Africa and the sequences had names like SouthAfrica. After this PR is merged, all data from the full GISAID downloads will have names like SouthAfrica, which fixes the original bug.
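A minimal sketch of the first bug, assuming each sanitize step boils down to a single regular-expression substitution. The function names and the example strain are hypothetical and are not taken from the workflow's scripts.

```python
import re


def sanitize_metadata_name_old(strain: str) -> str:
    # Hypothetical stand-in for the old metadata behavior:
    # each whitespace character becomes an underscore (GISAID convention).
    return re.sub(r"\s", "_", strain)


def sanitize_sequence_name_old(strain: str) -> str:
    # Hypothetical stand-in for the sequence behavior:
    # whitespace is removed entirely (Nextstrain convention).
    return re.sub(r"\s", "", strain)


def sanitize_name_fixed(strain: str) -> str:
    # Behavior described for this PR: both steps remove whitespace,
    # so metadata and sequences agree on the final name.
    return re.sub(r"\s", "", strain)


# Hypothetical strain name with a space in the country field.
raw = "hCoV-19/South Africa/EXAMPLE-1/2021"

old_metadata = sanitize_metadata_name_old(raw)
old_sequence = sanitize_sequence_name_old(raw)
fixed = sanitize_name_fixed(raw)

# The old names disagree, so the sequence has no matching metadata record.
print(old_metadata, old_sequence, old_metadata == old_sequence)
print(fixed)
```

Applying one shared function to both inputs is the simplest way to guarantee that the two files cannot drift apart again.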

The second bug affects genomes from countries with single quotes in their names (e.g., Cote d'Ivoire): the single quotes are dropped from their strain names during tree building, causing a mismatch between tip names and metadata records at the augur refine step. The solution here is a brute-force one: replace these single quotes with hyphens in both metadata and sequence records, so the names do not get mangled downstream. Ideally, we should replace this kind of logic with an augur curate command or equivalent csvtk/seqkit commands.
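The quote replacement can be sketched as follows. This is an illustration only: `replace_single_quotes` and `simulate_tree_builder` are hypothetical helpers, and the second one merely imitates the quote-dropping behavior described above rather than calling any real tree builder.

```python
def replace_single_quotes(strain: str) -> str:
    # Brute-force fix: swap single quotes for hyphens before tree building.
    return strain.replace("'", "-")


def simulate_tree_builder(strain: str) -> str:
    # Hypothetical imitation of the downstream step that drops single quotes.
    return strain.replace("'", "")


# Hypothetical strain name containing a single quote.
raw = "hCoV-19/Cote d'Ivoire/EXAMPLE-1/2021"

# Without the fix, the tip name no longer matches the metadata record.
tip_without_fix = simulate_tree_builder(raw)
mismatch = tip_without_fix != raw

# With the fix applied to both metadata and sequences, the name contains
# no quotes, so the downstream step leaves it unchanged.
metadata_name = replace_single_quotes(raw)
tip_with_fix = simulate_tree_builder(replace_single_quotes(raw))

print(mismatch, metadata_name == tip_with_fix)
```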

The third bug is a mismatch between spellings of Cote d'Ivoire in latitude/longitude data and in GISAID's metadata. This PR adds a latitude and longitude value for "Cote d'Ivoire" (how it is spelled in GISAID) along with the existing "Côte d'Ivoire" spelling (how it is spelled in Nextstrain-sanitized data from ncov-ingest). This fix allows the country's data to appear on the map in the correct place when users run the workflow with GISAID data.

Testing

  • [x] Tested manually with Africa CDC builds

huddlej avatar Aug 04 '22 22:08 huddlej

Wow - good catch on these bugs John!!

emmahodcroft avatar Aug 05 '22 08:08 emmahodcroft

Thank you for testing this and confirming it works, @j23414!

I might be lacking context here, but do we need to account for these GISAID downloads that do have the underscores?

@joverlee521 Looking back at this question after a couple of months (🤦🏻), I don't expect the data with underscores to be an issue because the GISAID data downloaded with that option will have underscores in both metadata and sequences. The issue this PR fixes is when there are spaces in the strain names and the two sanitize steps treated the spaces differently.

I'm going to merge this now, but we can keep our eyes peeled for downstream issues.

huddlej avatar Dec 08 '22 03:12 huddlej

I don't expect the data with underscores to be an issue because the GISAID data downloaded with that option will have underscores in both metadata and sequences

Just a quick note: I tried following the docs to download a few test Burkina Faso samples, and the sequences had headers with underscores:

>**/**/Burkina_Faso/**/**

while the metadata didn't give me an option to "replace spaces with underscores" so I ended up with metadata strains with spaces:

**/**/Burkina Faso/**/**

j23414 avatar Dec 08 '22 22:12 j23414

Ah, got it. Thank you for checking this, @j23414! I forgot that path did not allow replacing spaces with underscores. That suggests that the original code in sanitize metadata was correct to replace whitespace with underscores (as you and @joverlee521 suggested) and that the better fix would be to modify the logic in sanitize sequences to replace whitespace in strain names with underscores instead of removing it.
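To illustrate why underscores are the safer choice for this download path, here is a small comparison of the two strategies. The strain names are hypothetical placeholders that mirror the Burkina Faso example, and both functions are sketches rather than the workflow's actual code.

```python
import re


def strip_whitespace(name: str) -> str:
    # Strategy merged in this PR: remove whitespace entirely.
    return re.sub(r"\s", "", name)


def underscore_whitespace(name: str) -> str:
    # Proposed alternative: replace each whitespace character with an underscore.
    return re.sub(r"\s", "_", name)


# Hypothetical names: the FASTA header already has an underscore,
# while the metadata strain still has a space.
fasta_header = "hCoV-19/Burkina_Faso/EXAMPLE-1/2021"
metadata_strain = "hCoV-19/Burkina Faso/EXAMPLE-1/2021"

# Removing whitespace leaves the existing underscore in the FASTA header,
# so the two names still differ.
removal_matches = strip_whitespace(fasta_header) == strip_whitespace(metadata_strain)

# Replacing whitespace with underscores makes both names identical.
underscore_matches = (
    underscore_whitespace(fasta_header) == underscore_whitespace(metadata_strain)
)

print(removal_matches, underscore_matches)
```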

The worst-case scenario for this implementation is a build that combines custom GISAID data with GISAID data curated by our team, where spaces have been replaced by nothing: the build ends up with duplicates of the same sample that differ only in their delimiters. But since that is unlikely to happen for any users outside of our team, it isn't a huge issue.
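One way to spot such duplicates is to group strain names by a key with the delimiters stripped out. This is a rough sketch with hypothetical helper names and example strains; it assumes that spaces and underscores are the only delimiters involved, and it would also conflate names whose underscores are meaningful elsewhere in the strain.

```python
from collections import defaultdict


def delimiter_free_key(strain: str) -> str:
    # Hypothetical comparison key: drop spaces and underscores.
    return strain.replace(" ", "").replace("_", "")


def find_delimiter_duplicates(strains):
    # Group names that become identical once delimiters are removed.
    groups = defaultdict(set)
    for strain in strains:
        groups[delimiter_free_key(strain)].add(strain)
    # Keep only groups with more than one distinct spelling.
    return {key: sorted(names) for key, names in groups.items() if len(names) > 1}


# Hypothetical mix of custom data (underscores) and curated data (no delimiter).
strains = [
    "hCoV-19/Burkina_Faso/EXAMPLE-1/2021",
    "hCoV-19/BurkinaFaso/EXAMPLE-1/2021",
    "hCoV-19/Kenya/EXAMPLE-2/2021",
]

duplicates = find_delimiter_duplicates(strains)
print(duplicates)
```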

huddlej avatar Dec 09 '22 18:12 huddlej