dada2
Large number of unclassified data
Hi! First of all, thank you for this fantastic tool and the support here on GitHub. I am working with ITS sequencing data from tropical soil samples, and I have followed the ITS workflow so far (https://benjjneb.github.io/dada2/ITS_workflow.html). However, I found a large number of ASVs that are unclassified at the phylum level (up to 75% of the community in certain samples). The sequences are of very good quality, and I keep most of them in the workflow even with strict parameters (maxN = 0, maxEE = c(1, 1), and truncQ = 2). The primers were also almost completely removed using Cutadapt.
After reading about this issue in other forums, I switched to the developer's version of the UNITE database, and my results improved. However, the number of unclassified ASVs is still large (now less than 35%). I have two questions, if you can help me:
(1) Do you have any other recommendations in this case? (2) How can I know if I have a problem with mixed orientations?
I really appreciate any help you can provide.
(2) How can I know if I have a problem with mixed orientations?
Run assignTaxonomy(..., tryRC=TRUE) to test both orientations at the same time. If that strongly reduces the number of unclassified ASVs, then mixed orientations were the problem.
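To make that comparison concrete, here is a minimal sketch that computes the fraction of reads in ASVs with no phylum assignment, so the runs with and without tryRC can be compared with one number. The helper name frac_unclassified and the object names seqtab.nochim and unite.ref are placeholders, not part of dada2.

```r
# Fraction of reads belonging to ASVs with no phylum-level assignment.
# `taxa`:   matrix as returned by assignTaxonomy() (rows = ASVs, with a "Phylum" column)
# `seqtab`: sample-by-ASV count table, columns in the same order as the rows of `taxa`
frac_unclassified <- function(taxa, seqtab) {
  stopifnot(nrow(taxa) == ncol(seqtab))
  unclassified <- is.na(taxa[, "Phylum"])
  sum(seqtab[, unclassified, drop = FALSE]) / sum(seqtab)
}

# Usage with dada2 loaded (object and file names are placeholders):
# library(dada2)
# taxa_fwd  <- assignTaxonomy(seqtab.nochim, unite.ref, multithread = TRUE)
# taxa_both <- assignTaxonomy(seqtab.nochim, unite.ref, multithread = TRUE, tryRC = TRUE)
# frac_unclassified(taxa_fwd,  seqtab.nochim)
# frac_unclassified(taxa_both, seqtab.nochim)
```

If the two fractions are close, orientation is probably not the cause.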
(1) Do you have any other recommendations in this case?
If the above doesn't work, I would recommend identifying the top 10 or so ASVs that are unclassified at the phylum level, and BLASTing them against nt: https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastn&PAGE_TYPE=BlastSearch&LINK_LOC=blasthome
That is often very informative, for example if some unintended DNA has been amplified.
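One way to pull those sequences out is sketched below. It assumes the usual layout from the workflow: a taxonomy matrix with a "Phylum" column and a sequence table whose column names are the ASV sequences. The function name top_unclassified_fasta and the output file name are made up for illustration.

```r
# Return FASTA lines for the n most abundant ASVs lacking a phylum assignment.
# `taxa`:   matrix as returned by assignTaxonomy() (rows = ASVs, with a "Phylum" column)
# `seqtab`: sample-by-ASV count table; column names are the ASV sequences
top_unclassified_fasta <- function(taxa, seqtab, n = 10) {
  stopifnot(nrow(taxa) == ncol(seqtab))
  abund <- colSums(seqtab)                       # total reads per ASV
  idx <- which(is.na(taxa[, "Phylum"]))          # ASVs with no phylum
  idx <- idx[order(abund[idx], decreasing = TRUE)]
  idx <- head(idx, n)
  if (length(idx) == 0) return(character(0))
  seqs <- colnames(seqtab)[idx]
  headers <- paste0(">unclassified_", seq_along(seqs), ";reads=", abund[idx])
  as.vector(rbind(headers, seqs))                # interleave header, sequence
}

# Usage (placeholder object and file names):
# writeLines(top_unclassified_fasta(taxa, seqtab.nochim), "top_unclassified.fasta")
```

The resulting file can be uploaded directly on the BLAST page linked above.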
Thank you @benjjneb! I used tryRC=TRUE but the results are pretty much the same. I blasted the top ASVs, and they really seem to be unclassified fungi.
A Master's student I am supervising had results very similar to yours with the UNITE dataset that does not include singletons. We also BLASTed the sequences, and much of it came back unclassified from NCBI as well. However, if we include the singletons, all of it is classified, so it looks like a matter of database completeness. We trusted the smaller UNITE database more, and he argued in his thesis that databases for eukaryotes are incomplete and still need a lot more data. Maybe a custom combination of several databases would be an improvement, but I struggle with formatting them correctly. Maybe @benjjneb has some resources/bash scripts to share?
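On the formatting point, a rough sketch in base R follows. It assumes the general reference layout that assignTaxonomy reads, as I understand the dada2 documentation: each FASTA header is the lineage with ranks separated by semicolons (for example >Kingdom;Phylum;Class;Order;Family;Genus;). The function name format_dada2_ref is hypothetical, and I have not checked how such entries behave when mixed with the UNITE-style headers in the same file.

```r
# Build FASTA lines with semicolon-separated taxonomy headers.
# `tax_table`: character matrix, one row per reference sequence, one column per rank
# `seqs`:      character vector of reference sequences, same order as the rows
format_dada2_ref <- function(tax_table, seqs) {
  stopifnot(nrow(tax_table) == length(seqs))
  headers <- apply(tax_table, 1, function(lineage) {
    # Truncate the lineage at the first missing rank so that ranks never shift
    missing <- is.na(lineage) | !nzchar(lineage)
    lineage <- lineage[cumsum(missing) == 0]
    paste0(">", paste(lineage, collapse = ";"), ";")
  })
  as.vector(rbind(headers, toupper(seqs)))       # interleave header, sequence
}

# Usage (placeholder object and file names):
# writeLines(format_dada2_ref(my_tax, my_seqs), "custom_reference.fasta")
```

Entries from several sources would need to be brought to the same rank order before being written to one file.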
If you don't mind me asking: What environment are you working on?