warn to trim primers
There is a discussion about whether or not to trim primers prior to DADA2 or Deblur: https://forum.qiime2.org/t/deblur-vs-dada2-questions/2093/7
If I understand the protocol correctly, the primers are designed to anneal to the conserved regions before / after the variable V4 region, and the taxonomic information is to be harvested from that variable region.
Furthermore, when comparing sequence features across experiments, we need to ensure we are treating them in the same way. I don't see an option to remove primers (even if they are specified in the pcr_primers column of the prep files) from per_sequence_fastq files. Thus, it is the user's obligation to ensure sequences are treated correctly.
I myself did not :-/ It took me half a year to realize this, and I had to re-analyse three projects. The issue came to my attention because I was reading https://doi.org/10.1038/s41586-019-0878-z, downloaded genomes of those bugs, used PrimerProspector to obtain V4 reads, and checked whether I could find those features in my study. Still having the primer in my Deblur feature table biased this analysis a lot :-/
Therefore, I think Qiita should warn the user of this situation, e.g. by reading the pcr_primers column, comparing it against the beginning / end of the uploaded sequences, and testing whether primer sequences are still contained in, say, more than 90% of the individual reads or in 90% of the features of the final feature table.
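As a rough illustration of what such a warning could look like (this is not Qiita code; the 515F primer, the file name, and the 90% threshold below are just assumptions for the example):

```python
# Hypothetical sketch of the proposed check: does the (degenerate) forward primer
# still sit at the 5' end of most reads?
import gzip
import re

# IUPAC degenerate bases -> regex character classes
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "R": "[AG]", "Y": "[CT]",
         "S": "[CG]", "W": "[AT]", "K": "[GT]", "M": "[AC]", "B": "[CGT]",
         "D": "[AGT]", "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]"}

def primer_regex(primer):
    return re.compile("".join(IUPAC[b] for b in primer.upper()))

def fraction_with_primer(fastq_gz, primer):
    pattern = primer_regex(primer)
    total = hits = 0
    with gzip.open(fastq_gz, "rt") as fh:
        for i, line in enumerate(fh):
            if i % 4 == 1:  # sequence lines of the FASTQ records
                total += 1
                if pattern.match(line):
                    hits += 1
    return hits / total if total else 0.0

# 515F primer and file name are illustrative assumptions
frac = fraction_with_primer("sample_R1.fastq.gz", "GTGYCAGCMGCCGCGGTAA")
if frac > 0.9:
    print(f"Warning: {frac:.0%} of reads still start with the primer; trim before Deblur/DADA2.")
```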
> Still having the primer in my Deblur feature table biased this analysis a lot :-/
This is interesting; do you have examples of this? How much is "a lot", or what was the bias?
Thanks!
Regarding the 11-strain mix from the paper: there was no difference; we could not find any feature sequence from their genomes in our study. For an extended list of 40 genomes, however, we found 4 matching sequences in our study when primers were trimmed away, compared to 0 without trimming. That said, the 4 matching features are quite unspecific for the taxon / strain (thousands of 100% identity matches when BLASTing against NR, with hundreds of different taxonomy IDs). Still, inconsistent primer absence / presence can ruin promising search engines like redbiom.
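To make the mechanism concrete, here is a rough sketch of that comparison, not the actual analysis: the file names, the 100 nt Deblur trim length, and the 19 nt 515F primer length are assumptions, and the genome amplicons are assumed to be primer-free.

```python
# Toy comparison: do study features exactly match in-silico V4 amplicons from
# reference genomes, with and without the primer still attached to the features?

def read_fasta(path):
    """Minimal FASTA reader yielding sequences (headers ignored)."""
    seq = []
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if line.startswith(">"):
                if seq:
                    yield "".join(seq)
                seq = []
            else:
                seq.append(line)
    if seq:
        yield "".join(seq)

TRIM_LEN = 100   # Deblur trim length in this context (assumption)
PRIMER_LEN = 19  # length of the 515F primer (assumption)

amplicons = list(read_fasta("genome_v4_amplicons.fasta"))  # e.g. PrimerProspector output
features = list(read_fasta("deblur_rep_seqs.fasta"))       # study feature sequences

# Reference prefixes at the two possible alignment frames
ref_full = {a[:TRIM_LEN] for a in amplicons}
ref_shifted = {a[:TRIM_LEN - PRIMER_LEN] for a in amplicons}

hits_untrimmed = sum(f[:TRIM_LEN] in ref_full for f in features)
hits_trimmed = sum(f[PRIMER_LEN:TRIM_LEN] in ref_shifted for f in features)
print(f"matches with primer left on: {hits_untrimmed}; after trimming the primer: {hits_trimmed}")
```

The point is simply that a feature carrying the primer is offset by the primer length relative to a primer-free reference, so exact matching fails until the primer is trimmed.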
Has this issue been resolved?
Note that I'm seeing some really weird results when I pull down all of the human fecal samples with redbiom. Namely, if I run
ctx="Deblur-Illumina-16S-V4-100nt-fbc5b2"
redbiom search metadata "(feces | fecal | faecal | stool) & (sapien | human)" > human_ids.txt
cat human_ids.txt | redbiom fetch samples --context $ctx --output human-stool-deblur.biom
I'll get a deblurred table with 993083 features.
The weird part is that if I then run closed-reference OTU picking with vsearch, I only get 34977 hits (there are 599482 unmatched sequences). I wonder if a lack of primer trimming is inflating the miss rate.
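In case it helps others reproduce the counting, the feature total can be read straight off the fetched table with the biom API (the file name is taken from the redbiom command above):

```python
# Count features and samples in the redbiom-fetched Deblur table.
import biom

table = biom.load_table("human-stool-deblur.biom")
n_features, n_samples = table.shape  # biom tables are observations x samples
print(f"{n_features} features across {n_samples} samples")
```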
The issue hasn't been resolved. Note that this specific issue is about warning users about this potential problem with wet-lab methods that do not remove the linker primer from the sequences.
Anyway, a few questions to try to help you with your specific question:
- Have you checked your CR failures for linker primers? (See the sketch after this list.)
- Are you using forward/reverse matching for CR? In some cases the reference and the sequences are not in the same orientation.
- Which reference are you using?
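For the first two points, here is a quick check one could run on the failures; the FASTA file name is an assumption (e.g. exported from the unmatched sequences artifact), and 515F is used as the example primer.

```python
# Count how many closed-reference failures still start with the 515F primer, and
# how many only match after reverse-complementing (a possible orientation problem).
import re

# 515F GTGYCAGCMGCCGCGGTAA with degenerate bases expanded (Y -> [CT], M -> [AC])
PRIMER = re.compile("GTG[CT]CAGC[AC]GCCGCGGTAA")
COMP = str.maketrans("ACGTN", "TGCAN")

def read_fasta(path):
    """Minimal FASTA reader yielding sequences (headers ignored)."""
    seq = []
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if line.startswith(">"):
                if seq:
                    yield "".join(seq)
                seq = []
            else:
                seq.append(line)
    if seq:
        yield "".join(seq)

fwd = rev = total = 0
for seq in read_fasta("unmatched_sequences.fasta"):
    total += 1
    if PRIMER.match(seq):
        fwd += 1
    elif PRIMER.match(seq.translate(COMP)[::-1]):
        rev += 1
print(f"{fwd}/{total} failures start with the primer; {rev}/{total} match only after reverse-complementing")
```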
Right on, thanks for responding! Maybe I should move this over to a new issue? Regarding your questions:
- No, I didn't - this was just pulling all of the human fecal data from Qiita and running closed-reference picking with the default vsearch parameters.
- I didn't do that either. Is this a problem upstream? Would it be worthwhile to enforce quality control in Qiita?
- I used the latest Greengenes reference dataset.
The command that I used is as follows:
qiime vsearch cluster-features-closed-reference \
  --i-reference-sequences ~/Documents/databases/97_seqs.qza \
  --i-sequences human-stool-deblur-seqs.qza \
  --i-table human-stool-deblur.biom.qza \
  --p-perc-identity .97 \
  --output-dir closed-reference \
  --p-threads 30
No problem! Open a new issue? I guess this is fine; next time, I'd suggest sending an email to [email protected].
- Then I wonder how you got to this issue or assumed it was the linker primer ¯\_(ツ)_/¯
- AFAIK Deblur will not check the orientation against any reference, as it's a denoising step; that happens in taxonomy classification or CR. Try adding --p-strand both - K
@mortonjt, is it that 599482 unique features from the input Deblur table failed to recruit to Greengenes, or that a total of 599482 reads failed to recruit? What's the total % of reads lost following recruitment, and what fraction of the features failing to recruit are singletons or doubletons?
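For anyone wanting to compute these, a rough sketch using the biom Python API; the file names are assumptions (the redbiom-fetched Deblur table, the exported closed-reference table, and a text file of feature IDs that failed to recruit, e.g. the headers of the unmatched sequences FASTA).

```python
# Rough sketch of the numbers asked for above.
import biom
import numpy as np

deblur = biom.load_table("human-stool-deblur.biom")
clustered = biom.load_table("clustered_table.biom")

# % of reads lost after closed-reference recruitment
reads_total = deblur.sum()
reads_recruited = clustered.sum()
print(f"reads lost: {1 - reads_recruited / reads_total:.1%}")

# fraction of non-recruiting features that are singletons or doubletons
with open("unmatched_ids.txt") as fh:
    unmatched_ids = {line.strip() for line in fh if line.strip()}
unmatched = deblur.filter(unmatched_ids, axis="observation", inplace=False)
totals = unmatched.sum(axis="observation")  # total count of each unmatched feature
print(f"{np.mean(totals <= 2):.1%} of non-recruiting features are singletons or doubletons")
```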
A few considerations:
- deblur parameters in Qiita include --p-min-reads 1, so singletons are retained
- closed ref for human fecal from western individuals typically only recruits 80-90% of reads at 97% IIRC
- many of the human fecal samples in Qiita are non-western and there likely are taxon sampling issues in the reference
I'm not seeing evidence of a quality control issue or a problem, but rather that there are unknowns. I'm not sure if there is a deviation from expectation, as I don't think there has been much investigation (that I'm aware of, at least) into the application of closed reference following Deblur.
It's the former - 599482 unique features from the Deblur table failed to recruit to Greengenes.
- Understood, that makes sense. Note that there are 217 features observed in at least 10000 samples, and 1341 features observed in at least 1000 samples, that weren't observed in GG (see the sketch after this list for how such prevalence counts can be tallied).
- Right, I was expecting around 80% recruitment
- Makes sense
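A hedged sketch of how those prevalence counts can be derived from the tables; the file names are assumptions, and the non-recruiting feature IDs are again assumed to be available as a text file.

```python
# Count in how many samples each non-recruiting feature is observed,
# then tally features above a few prevalence thresholds.
import biom
import numpy as np

deblur = biom.load_table("human-stool-deblur.biom")
with open("unmatched_ids.txt") as fh:
    unmatched_ids = {line.strip() for line in fh if line.strip()}

unmatched = deblur.filter(unmatched_ids, axis="observation", inplace=False)

# underlying sparse matrix is observations x samples; count non-zero samples per feature
prevalence = np.asarray((unmatched.matrix_data != 0).sum(axis=1)).ravel()

for threshold in (10000, 1000, 100):
    n = int((prevalence >= threshold).sum())
    print(f"{n} unmatched features observed in at least {threshold} samples")
```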
If quality control isn't an issue, then I think this is really exciting! I didn't realize how many novel bugs weren't represented in GG.
The recurrence of features across samples is exciting. Do these recruit to Greengenes at, say, 90%?