dada2 icon indicating copy to clipboard operation
dada2 copied to clipboard

Silva missing ranks incorrectly propagated into DADA2-formatted database by `dada2:::makeTaxonomyFasta_SilvaNR()`

Open mikemc opened this issue 3 years ago • 22 comments

A user pointed out to me in https://github.com/mikemc/dada2-reference-databases/issues/1#issue-826059569 problems with some of the taxonomy assignments from our Silva 138 database (https://zenodo.org/record/3986799) in which taxonomy names were sometimes showing up in the wrong rank. I tracked this issue to a problem in the Silva precursor file SILVA_138_SSURef_NR99_tax_silva.fasta.gz. For example, for the genus Anaerococcus,

❯ zcat SILVA_138_SSURef_NR99_tax_silva.fasta.gz | grep "Anaerococcus" | head -n 2                                                                                                                            main
>HL281652.2.1435 Bacteria;Firmicutes;Clostridia;Peptostreptococcales-Tissierellales;Anaerococcus;unidentified
>AF542233.1.1411 Bacteria;Firmicutes;Clostridia;Peptostreptococcales-Tissierellales;Anaerococcus;Anaerococcus lactolyticus

the Family field is actually missing from the fasta headers. The end result is that in the user's assignments, the genus Anaerococcus showed up as the family name, and the genus was NA. This problem seems to be fixed in Silva 138.1,

❯ zcat SILVA_138.1_SSURef_NR99_tax_silva.fasta.gz | grep "Anaerococcus" | head -n 2                                                                                                                          main
>HL281652.2.1435 Bacteria;Firmicutes;Clostridia;Peptostreptococcales-Tissierellales;Family XI;Anaerococcus;unidentified
>AF542233.1.1411 Bacteria;Firmicutes;Clostridia;Peptostreptococcales-Tissierellales;Family XI;Anaerococcus;Anaerococcus lactolyticus

However, I'm wondering if we should modify dada2:::makeTaxonomyFasta_SilvaNR() to catch such errors rather than propagate them when creating our database files from the Silva files. A simple method might be to check that every header has 6 semi-colons.

Edit: The problem with 138 was also in the taxonomy file,

❯ grep "Anaerococcus" tax_slv_ssu_138.txt
Bacteria;Firmicutes;Clostridia;Peptostreptococcales-Tissierellales;Anaerococcus;        45574   genus           138

❯ grep "Anaerococcus" tax_slv_ssu_138.1.txt
Bacteria;Firmicutes;Clostridia;Peptostreptococcales-Tissierellales;Family XI;Anaerococcus;      50192   genus           138.1

Edit: There are also a much fewer set of cases where there were a bunch of extra semi-colons, so we'd want to catch too many semi-colons as well

❯ zcat SILVA_138_SSURef_NR99_tax_silva.fasta.gz | grep "AB511003.1.1389"
>AB511003.1.1389 Bacteria;Patescibacteria;Saccharimonadia;Saccharimonadales;YM;S32;TM7;uncultured bacterium

❯ zcat SILVA_138.1_SSURef_NR99_tax_silva.fasta.gz | grep "AB511003.1.1389"
>AB511003.1.1389 Bacteria;Patescibacteria;Saccharimonadia;Saccharimonadales;YM_S32_TM7_50_20;uncultured bacterium

mikemc avatar Mar 09 '21 22:03 mikemc