pr2database

DADA2 assignSpecies/addSpecies

Open: kjurdzinski opened this issue 1 year ago • 1 comment

We noticed that the latest PR2 release contains only one file for DADA2 annotation, compatible with assignTaxonomy, which is based on a naive Bayesian classifier. However, we wanted to get species-level identification using the exact matching implemented in the assignSpecies and addSpecies functions of DADA2. We made in-house reference files to perform the analysis and ran it successfully. However, the following issues remain open:

  1. It has been suggested to us that we remove reference sequences not annotated to a specific species, i.e. the ones ending with "_sp.". However, we do get query sequences with an exact match to one of these reference sequences as well as to another reference for which we have a species name. If the references annotated as ..._sp. were removed, we would classify these sequences to the named species, even though based on current results we know there are other, unnamed species out there with exactly matching sequences as well. Should the "_sp." annotations instead be removed after the taxonomic classification with DADA2 is done? (That is, the ones which were the only exact matches to a sequence.)
  2. Some sequences which were annotated to one species using assignTaxonomy had exact matches to more than one species, e.g. to Skeletonema marinoi and Skeletonema costatum. (This might be an issue more relevant for DADA2 developers, but still important in the context of best practices for PR2-based analysis using DADA2)

kjurdzinski, Jun 07 '23 18:06

Hi Krzysztof

Answers to your two excellent comments:

In GenBank some sequences may be labelled Genus_sp. although they correspond to a specific species, because the person who deposited the sequence did not (or could not) identify the species. This assignment is often carried over to PR2 because it is often impossible to decide which species in the sample the sequence originated from. This is even more critical for environmental sequences, for which there are no morphology data. Also, a short 18S fragment may not discriminate between two (or more) different species. Even the full 18S is sometimes 100% identical between different species. This is the case, for example, for Gephyrocapsa (formerly Emiliania) huxleyi and G. oceanica, which are morphologically very different. The problem is even more acute if you assign short metabarcodes, which may not allow discrimination between two closely related species.

The case of the two Skeletonema species you mention is interesting. Because S. marinoi was described only in 2005 and is hard to distinguish from S. costatum by light microscopy, many papers have kept using S. costatum as the species name. You will need to align reference sequences for the two species in the region of your metabarcodes to see if they differ. In PR2 there are 178 sequences of S. marinoi and 48 of S. costatum, but it would be best for you to choose only the ones from the original publication (Sarno, D., Kooistra, W., Medlin, L.K., Percopo, I. & Zingone, A. 2005. Diversity in the genus Skeletonema (Bacillariophyceae). II. An assessment of the taxonomy of S. costatum-like species with the description of four new species. Journal of Phycology. 41:151–76.). Please note also that only sequences from cultures should be considered.
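As a rough illustration of checking whether references differ in the metabarcode region, here is a minimal Python sketch. It cuts out the region between a forward primer and the reverse complement of a reverse primer, then groups the references that share an identical region. The function names, primers and sequences are all made up for illustration (they are not real Skeletonema data or real 18S primers), and the primer search is plain exact string matching, so degenerate primer bases and indels would need a proper aligner in a real analysis.

```python
def extract_amplicon(seq, fwd_primer, rev_primer_rc):
    """Return the region between the two primers, or None if a primer is missing.

    Exact string matching only: no degenerate bases, no mismatches (assumption).
    """
    seq = seq.upper()
    start = seq.find(fwd_primer)
    if start == -1:
        return None
    start += len(fwd_primer)
    end = seq.find(rev_primer_rc, start)
    if end == -1:
        return None
    return seq[start:end]


def group_identical_amplicons(references, fwd_primer, rev_primer_rc):
    """Map each distinct amplicon to the sorted species names that share it."""
    groups = {}
    for species, seq in references:
        amplicon = extract_amplicon(seq, fwd_primer, rev_primer_rc)
        if amplicon is not None:
            groups.setdefault(amplicon, set()).add(species)
    return {amplicon: sorted(names) for amplicon, names in groups.items()}


# Toy data, invented for this example only.
TOY_FWD = "ACGT"
TOY_REV_RC = "TTGG"
toy_refs = [
    ("Skeletonema marinoi", "GGACGTAAACCCGGGTTGGCC"),
    ("Skeletonema costatum", "TTACGTAAACCCGGGTTGGAA"),
    ("Skeletonema sp.", "ACGTAAACTCGGGTTGG"),
]
toy_groups = group_identical_amplicons(toy_refs, TOY_FWD, TOY_REV_RC)
```

Any amplicon whose list contains more than one name is a barcode that cannot separate those species, however different the full-length references are.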

In practice,

  • If one sequence is identified by assignSpecies as both a bona fide species and Genus_sp. (from the same genus as the bona fide species), I would probably assign it to the bona fide species.
  • On the other hand, if the sequence is only identified as Genus_sp., I would assign it to the genus but not to any species in particular.
  • You may have some cases where one sequence matches several species, and in this case it may be wiser just to assign it to the genus because the barcode may not resolve at species level.
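These three rules of thumb can be written down as a small post-processing step. The Python sketch below is only an illustration, not part of DADA2 or PR2: it assumes the exact matches for each sequence have already been collected as a list of names (for instance from assignSpecies run with allowMultiple = TRUE), and the helper names are hypothetical. Matches spanning more than one genus are not covered by the rules above, so the sketch leaves them unassigned, which is my own assumption.

```python
def _normalise(name):
    """Turn 'Genus_sp.' style labels into 'Genus sp.' for uniform handling."""
    return name.replace("_", " ").strip()


def _is_unnamed(name):
    """True for placeholder labels such as 'Skeletonema sp.'."""
    return name.split()[-1] in ("sp.", "sp")


def resolve_exact_matches(matches):
    """Return (genus, species) for one sequence; species is None if unresolved."""
    names = {_normalise(m) for m in matches if m and m.strip()}
    if not names:
        return (None, None)
    genera = {n.split()[0] for n in names}
    if len(genera) != 1:
        # Assumption: matches across several genera are left unassigned here.
        return (None, None)
    genus = genera.pop()
    named = sorted(n for n in names if not _is_unnamed(n))
    if len(named) == 1:
        # One bona fide species, possibly alongside Genus_sp.: keep the species.
        return (genus, named[0])
    # Only Genus_sp., or several named species: stop at genus level.
    return (genus, None)
```

For example, a sequence matching both "Skeletonema marinoi" and "Skeletonema_sp." would be kept as S. marinoi, while one matching both S. marinoi and S. costatum would be reported at genus level only.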

If you have other specific cases, it could be interesting to discuss them.

PS. We plan to add a flag in PR2 for sequences that correspond to species types, i.e. sequences that are associated with the publication describing a given species. But this will of course take a lot of effort and time (it will need a team effort for each taxonomic group), as we have to go back to the original papers.

vaulot, Jun 08 '23 10:06