Drop in quality score in the middle of the reads
Hi, I am looking at some data from the V4 region sequenced at 2 x 150 bp, and when I look at the quality plots we see a drop in the middle of the read for both forward and reverse. Furthermore, many of our reads don't merge, and we have a relatively high percentage of chimeras (between 23% and 35%). I have tried different settings for the filtering:
out <- filterAndTrim(fnFs, filtFs, fnRs, filtRs, truncLen=c(135,125), maxN=0, maxEE=c(3,2), truncQ=2, rm.phix=TRUE, compress=TRUE, multithread=TRUE)
     input filtered denoisedF denoisedR merged nonchim
S217 28311    27556     26812     27263   8361    7182
S218 71293    69817     68873     69245   5743    3848
S219 54009    52129     51461     51417   9244   18130
S220 42999    41111     40682     40562  18860   16787
S221 46393    44781     44175     44475  11427   10013
S222 44044    42420     42022     41865  18219   17392
sum(seqtab.nochim)/sum(seqtab) = 0.770275
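The quality plots mentioned above can be reproduced with dada2's plotQualityProfile; a minimal sketch, reusing the fnFs and fnRs file vectors from the filterAndTrim call above:

library(dada2)
# Per-cycle quality profiles for the first two forward and reverse fastq files;
# the mid-read dip shows up as a drop in the mean quality line.
plotQualityProfile(fnFs[1:2])
plotQualityProfile(fnRs[1:2])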
Here, we left out the truncLen part:
out <- filterAndTrim(fnFs, filtFs, fnRs, filtRs, maxN=0, maxEE=c(3,2), truncQ=2, rm.phix=TRUE, compress=TRUE, multithread=TRUE)
     input filtered denoisedF denoisedR merged nonchim
S217 28311    26963     26240     26518  18448   13343
S218 71293    68615     67456     67761  54402   26459
S219 54009    51042     50224     50386  31696   25513
S220 42999    40072     39625     39559  26085   22037
S221 46393    43918     42427     43474  31217   22158
S222 44044    41441     41008     40993  27124   22091
sum(seqtab.nochim)/sum(seqtab) = 0.6565934
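For reference, the seqtab.nochim/seqtab ratios quoted here come from the standard chimera-removal step; a minimal sketch, assuming the merged-pair object is named mergers as in the dada2 tutorial:

library(dada2)
# Build a sequence table from the merged pairs and remove chimeras by the consensus method;
# the fraction of reads retained is the ratio reported above.
seqtab <- makeSequenceTable(mergers)
seqtab.nochim <- removeBimeraDenovo(seqtab, method="consensus", multithread=TRUE, verbose=TRUE)
sum(seqtab.nochim)/sum(seqtab)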
Could the sudden drop in read quality in the middle be caused by a sequencing error, and could that explain the high number of chimeras found in the data?
Thanks for the feedback.
I don't have a satisfying answer, but yes, the abnormal quality drop in the middle of the reads could be connected to the large number of reads lost at the merging and chimera-removal steps. I don't know what the underlying mechanism would be, though, as I've never seen this type of quality profile before. One thing to check is whether the reads pre- and post-quality-drop both look like bacterial 16S sequences, and whether there is any low-complexity sequence in the data (see the dada2 plotComplexity and seqComplexity functions for a quick way to look for that).
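A minimal sketch of that quick check, assuming fnFs and fnRs are the vectors of raw fastq paths used in the filterAndTrim calls above:

library(dada2)
# Histogram of per-read complexity scores for the first two forward and reverse files;
# a clean V4 amplicon library should show a single high-complexity mode.
plotComplexity(fnFs[1:2])
plotComplexity(fnRs[1:2])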
Thanks for the feedback. The output from plotComplexity for the first two samples looks like this:
When I try the seqComplexity I get this error:
hist(seqComplexity("S217_R1_001.fastq"), 100)
Error in .Call2("new_XString_from_CHARACTER", class(x0), string, start, :
  key 50 (char '2') not in lookup table
In addition: Warning message:
In seqComplexity("S217_R1_001.fastq") :
  Not all sequences were A/C/G/T only, which can interfere with the calculation of the Shannon richness.
I have attached the first two forward and reverse samples:
S217_R1_001.fastq.gz S217_R2_001.fastq.gz
S218_R1_001.fastq.gz S218_R2_001.fastq.gz
I might try to get the raw data and do the demultiplexing myself.
Those first two plots in particular look troubling. Bimodal distributions of complexity scores suggest there is a mixture of normal (high-complexity) sequences and sequences that partially contain low-complexity regions. This should not be observed in V4 data.
To use seqComplexity you'll need to read in the sequences first (it doesn't work on filenames), so sq <- getSequences("my/fastq/file.fq"); seqComplexity(sq) should get you a vector of complexity scores for the sequences in that file. Then, depending on how comfortable you are with R, you can poke around the set of sequences in the low mode and in the high mode of complexity to see if anything obvious jumps out.
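A minimal sketch of that poking around, using one of the attached files and an illustrative threshold of 13 read off the histogram:

library(dada2)
sq <- getSequences("S217_R1_001.fastq")   # read the sequences in (seqComplexity doesn't take filenames)
cmplx <- seqComplexity(sq)                # per-sequence complexity scores
hist(cmplx, 100)                          # should reproduce the bimodal histogram
low  <- sq[cmplx < 13]                    # sequences in the low-complexity mode (13 is illustrative)
high <- sq[cmplx >= 13]                   # sequences in the high-complexity mode
head(low)                                 # eyeball a few low-complexity reads, e.g. for BLASTing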
Ahh, thanks for that; I did as you suggested
I furthermore did a BLAST of some of the sequences I got from one of the samples, and the hits came back as bacteria. I have read in other feedback you have given that you might recommend using the rm.lowcomplex argument in the filterAndTrim function; would that be an idea?
Thanks
I have read in other feedback you have given that you might recommend using the rm.lowcomplex argument in the filterAndTrim function; would that be an idea?
Yes, you would set the rm.lowcomplex threshold based on the plotComplexity histograms so that the low-score mode is removed. So maybe something like 13, judging from the plots above?
Thanks for the input so far! I have tried different parameters to see what would give me the best results, but I keep losing a lot of reads at the merging step. Would it be better to work only with the reverse reads and skip the forward reads, as that seems to be where the problems are?
However, if I do so, will I then run into trouble when assigning taxonomy, as I am working with V4 2 x 150 bp?
Finally, if I use truncLen I seem to lose more reads when merging than when I don't use truncLen; however, my seqtab.nochim/seqtab percentage is much better when using truncLen. Without truncLen:
out <- filterAndTrim(fnFs, filtFs, fnRs, filtRs, maxN=0, maxEE=c(3,2), truncQ=2, rm.phix=TRUE, compress=TRUE, multithread=TRUE, rm.lowcomplex = 13)
I get a ratio of sum(seqtab.nochim)/sum(seqtab) = 0.6312996
     input filtered denoisedF denoisedR merged nonchim
S217 28311    12809     12289     12379  10879    7446
S218 71293    65764     64674     64858  53490   25977
S219 54009    18146     17493     17646  13899    8675
S220 42999    10944     10584     10577   7945    5242
S221 46393    23366     22046     22968  20470   12945
S222 44044    11088     10811     10838   9797    6395
whereas when I use truncLen I get this ratio:
out <- filterAndTrim(fnFs, filtFs, fnRs, filtRs, truncLen=c(117,147), maxN=0, maxEE=c(3,2), truncQ=2, rm.phix=TRUE, compress=TRUE, multithread=TRUE, rm.lowcomplex = 13)
sum(seqtab.nochim)/sum(seqtab) = 0.7294229
     input filtered denoisedF denoisedR merged nonchim
S217 28311    21459     20811     21053   5206    4391
S218 71293    67665     65995     67084   9482    6057
S219 54009    32892     32441     32399  10286    8630
S220 42999    23861     23342     23476   7828    7076
S221 46393    32658     32128     32292   5764    4759
S222 44044    24482     24056     24157   9003    6568
Sorry for the long post and thanks again!
I should have caught this earlier, but the truncLen you are using is too small. If using the common EMP V4 primer set, the length of the sequenced amplicon is 251-255 nts. The mergePairs function requires a minimum of 12 nts of overlap to merge the forward and reverse reads by default. So, it is necessary to have at least 255+12=267 nts between the forward and reverse reads after truncation, and some buffer is recommended. Your total nts post-truncation are 117+147=264 and 135+125=260 in the filterAndTrim calls you posted above. This is what is driving the very low merging rates. So if you want to explore truncLen, you need to set a hard requirement that the forward+reverse truncation lengths add up to a minimum of 270 (or you can mess with minOverlap, I guess).
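A minimal sketch of a filterAndTrim call meeting that requirement; the specific truncation lengths are illustrative, not a recommendation:

library(dada2)
# 145 + 130 = 275 >= 255 (max amplicon length) + 12 (default minOverlap), with a small buffer
out <- filterAndTrim(fnFs, filtFs, fnRs, filtRs, truncLen=c(145,130),
                     maxN=0, maxEE=c(3,2), truncQ=2, rm.phix=TRUE,
                     compress=TRUE, multithread=TRUE, rm.lowcomplex=13)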
sum(seqtab.nochim)/sum(seqtab) = 0.6312996
This is a high rate of chimeric reads, but I would check whether it stays this high after fixing the truncLen issue. We do see up to 25% of reads as chimeric in some amplicon sequencing data (although that suggests improved PCR protocols would be appropriate in the future).
Thanks for that; I had set minOverlap lower, all the way down to 3, in the examples I sent you to see if that would make any difference; sorry for not informing you of that. I will run it again with an appropriate truncLen that gives a minimum of 270 nts and see how that goes.
The data is something I received from someone else to combine with other data for a paper, so I don't know how the laboratory protocols were carried out.
Hi @MadsBjornsen. I'm having similar issues with a metabarcoding dataset. Did you ever figure out the culprit in your case? Was it contamination? Too many PCR duplicates?