
Drop in quality score in the middle of the reads

MadsBjornsen opened this issue 10 months ago

Hi, I am looking at some data from the 2 x 150 bp V4 region, and in the quality plots we see a drop in the middle of the read for both forward and reverse. Furthermore, many of our reads don't merge, and we have a relatively high percentage of chimeras (between 23% and 35%). I have tried different settings for the filtering:

out <- filterAndTrim(fnFs, filtFs, fnRs, filtRs, truncLen=c(135,125), maxN=0, maxEE=c(3,2), truncQ=2, rm.phix=TRUE, compress=TRUE, multithread=TRUE)

       input  filtered  denoisedF  denoisedR  merged  nonchim
S217   28311     27556      26812      27263    8361     7182
S218   71293     69817      68873      69245    5743     3848
S219   54009     52129      51461      51417    9244    18130
S220   42999     41111      40682      40562   18860    16787
S221   46393     44781      44175      44475   11427    10013
S222   44044     42420      42022      41865   18219    17392

sum(seqtab.nochim)/sum(seqtab) = 0.770275

Here, we left out the truncLen part: out <- filterAndTrim(fnFs, filtFs, fnRs, filtRs, maxN=0, maxEE=c(3,2), truncQ=2, rm.phix=TRUE, compress=TRUE, multithread=TRUE)

       input  filtered  denoisedF  denoisedR  merged  nonchim
S217   28311     26963      26240      26518   18448    13343
S218   71293     68615      67456      67761   54402    26459
S219   54009     51042      50224      50386   31696    25513
S220   42999     40072      39625      39559   26085    22037
S221   46393     43918      42427      43474   31217    22158
S222   44044     41441      41008      40993   27124    22091

sum(seqtab.nochim)/sum(seqtab) = 0.6565934

[Attached: forward quality plot, reverse quality plot]

Could the sudden drop in read quality in the middle be caused by a sequencing error, and could this explain the high number of chimeras found in the data?

Thanks for the feedback.

MadsBjornsen avatar Apr 25 '24 12:04 MadsBjornsen

I don't have a satisfying answer, but yes, the abnormal quality drop in the middle of the reads could be connected to the large number of reads lost at merging and chimera removal. But I don't know what the underlying mechanism would be, as I've never seen this type of quality profile before. One thing to check is whether the reads pre- and post-quality-drop both look like bacterial 16S sequences, and whether there is any low-complexity sequence in the data (see the dada2 plotComplexity and seqComplexity functions for a quick way to look for that).
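For example, a minimal sketch of that complexity check (the file names are placeholders for your own fastq files):

# Minimal sketch: plot per-read complexity histograms for one sample.
# The file names are placeholders; point them at your own fastq files.
library(dada2)
plotComplexity("S217_R1_001.fastq.gz")   # forward reads
plotComplexity("S217_R2_001.fastq.gz")   # reverse reads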

benjjneb avatar Apr 25 '24 21:04 benjjneb

Thanks for the feedback. The output from the plotComplexity for the first 2 samples looks like this:

[Attached: plotComplexity histograms for S217_R1_001, S217_R2_001, S218_R1_001, and S218_R2_001]

When I try the seqComplexity I get this error:

hist(seqComplexity("S217_R1_001.fastq"), 100)
Error in .Call2("new_XString_from_CHARACTER", class(x0), string, start, :
  key 50 (char '2') not in lookup table
In addition: Warning message:
In seqComplexity("S217_R1_001.fastq") :
  Not all sequences were A/C/G/T only, which can interfere with the calculation of the Shannon richness.

I have attached the first two forward and reverse samples:

S217_R1_001.fastq.gz S217_R2_001.fastq.gz

S218_R1_001.fastq.gz S218_R2_001.fastq.gz

I might try to get the raw data and do the demultiplexing myself.

MadsBjornsen avatar Apr 29 '24 10:04 MadsBjornsen

Those first two plots in particular look troubling. Bimodal distributions of complexity scores suggest there is a mixture between normal (high complexity) sequences, and sequences that partially contain low complexity regions. This should not be observed in V4 data.

To use seqComplexity you'll need to read in the sequences first (it doesn't work on filenames), so sq <- getSequences("my/fastq/file.fq"); seqComplexity(sq) should get you a vector of complexity scores for the sequences in that file. Then depending on how comfortable you are with R, you can poke around the set of sequences in the low mode, and in the high mode of complexity to see if anything obvious jumps out.
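For example, something along these lines (the threshold here is purely illustrative and should be read off your own histogram):

# Minimal sketch: split sequences into the low- and high-complexity modes
# and eyeball a few from each. The threshold is a hypothetical cutoff;
# read it off your own plotComplexity histogram.
library(dada2)
sq <- getSequences("S217_R1_001.fastq")   # character vector of read sequences
cmplx <- seqComplexity(sq)                # complexity score per read
threshold <- 13                           # hypothetical cutoff between the two modes
table(cmplx < threshold)                  # how many reads fall in each mode
head(sq[cmplx < threshold])               # inspect some low-complexity reads
head(sq[cmplx >= threshold])              # inspect some normal-looking reads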

benjjneb avatar Apr 29 '24 14:04 benjjneb

Ahh, thanks for that; I did as you suggested

[Attached: seqComplexity histograms for S217_R1_001, S217_R2_001, S218_R1_001, and S218_R2_001]

I also did a BLAST of some of the sequences from one of the samples and got back bacteria. I have read in other feedback you have given that you might recommend the rm.lowcomplex argument of the filterAndTrim function; would that be an idea?

Thanks

MadsBjornsen avatar Apr 29 '24 15:04 MadsBjornsen

I have read in other feedback you have given that you might recommend the rm.lowcomplex argument of the filterAndTrim function; would that be an idea?

Yes, you would set the rm.lowcomplex threshold based on the plotComplexity histograms so that it removes the low-score mode. So maybe something like 13, judging from the plots above?
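As a quick sanity check before filtering, a sketch of how you might estimate what that cutoff would discard (again, 13 is just a value read off the histograms):

# Sketch: roughly what fraction of reads would rm.lowcomplex = 13 remove?
# Uses the same complexity scores as the histograms above.
library(dada2)
sq <- getSequences("S217_R1_001.fastq")
cmplx <- seqComplexity(sq)
mean(cmplx < 13)   # approximate proportion of reads below the proposed cutoff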

benjjneb avatar Apr 29 '24 15:04 benjjneb

Thanks for the input so far! I have tried different parameters to see what would give me the best results, but I keep losing a lot of reads during the merging step. Would it be better to work only with the reverse reads and skip the forward reads, since that seems to be where the problems are?

However, if I do so, will I then run into trouble when assigning taxonomy, as I am working with the V4 region at 2 x 150 bp?

Finally, if I use truncLen I seem to lose more reads when merging compared to when I don't use truncLen; however, my seqtab.nochim/seqtab percentage is much better when using truncLen.

out <- filterAndTrim(fnFs, filtFs, fnRs, filtRs, maxN=0, maxEE=c(3,2), truncQ=2, rm.phix=TRUE, compress=TRUE, multithread=TRUE, rm.lowcomplex = 13)

I get a percentage: sum(seqtab.nochim)/sum(seqtab) = 0.6312996

       input  filtered  denoisedF  denoisedR  merged  nonchim
S217   28311     12809      12289      12379   10879     7446
S218   71293     65764      64674      64858   53490    25977
S219   54009     18146      17493      17646   13899     8675
S220   42999     10944      10584      10577    7945     5242
S221   46393     23366      22046      22968   20470    12945
S222   44044     11088      10811      10838    9797     6395

whereas when I use truncLen I get this ratio:

out <- filterAndTrim(fnFs, filtFs, fnRs, filtRs, truncLen=c(117,147), maxN=0, maxEE=c(3,2), truncQ=2, rm.phix=TRUE, compress=TRUE, multithread=TRUE, rm.lowcomplex = 13)

sum(seqtab.nochim)/sum(seqtab) = 0.7294229

       input  filtered  denoisedF  denoisedR  merged  nonchim
S217   28311     21459      20811      21053    5206     4391
S218   71293     67665      65995      67084    9482     6057
S219   54009     32892      32441      32399   10286     8630
S220   42999     23861      23342      23476    7828     7076
S221   46393     32658      32128      32292    5764     4759
S222   44044     24482      24056      24157    9003     6568

Sorry for the long post and thanks again!

MadsBjornsen avatar May 10 '24 10:05 MadsBjornsen

I should have caught this earlier, but the truncLen you are using is too small. If using the common EMP V4 primer set, the length of the sequenced amplicon is 251-255 nts. The mergePairs function requires a minimum of 12 nts of overlap to merge the forward and reverse reads by default. So, it is necessary to have at least 255+12=267 nts between the forward and reverse reads after truncation, and some buffer is recommended. Your total nts post truncation are 117+147=264 and 135+125=260 in the filterAndTrim calls you posted above. This is what is driving the very low merging rates. So if you want to explore truncLen, you need to set a hard requirement that the forward+reverse truncation lengths add up to a minimum of 270 (or you can mess with minOverlap I guess).
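To make the arithmetic concrete, a quick sketch for checking a candidate truncLen pair (the amplicon length, overlap, and candidate values are taken or assumed from the numbers above):

# Sketch: does a candidate truncLen pair leave enough total length to merge?
# Assumes the EMP V4 amplicon is at most ~255 nts and the default
# mergePairs minOverlap of 12; aim a little above the bare minimum.
amplicon_max <- 255
min_overlap  <- 12
required     <- amplicon_max + min_overlap   # 267 nts, absolute minimum
truncLen <- c(140, 130)                      # hypothetical forward/reverse truncation for 2 x 150 bp reads
sum(truncLen) >= required                    # TRUE here, but a total of >= 270 leaves a safer buffer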

sum(seqtab.nochim)/sum(seqtab) = 0.6312996

This is a high rate of chimeric reads, but I would check whether it stays this high after fixing the truncLen issue. We do see up to 25% of reads flagged as chimeric in some amplicon sequencing data (although that suggests improved PCR protocols would be appropriate in the future).

benjjneb avatar May 14 '24 15:05 benjjneb

Thanks for that; I had set minOverlap lower, all the way down to 3, in the examples I sent you to see if that would make any difference; sorry for not informing you of that. I will run it again with an appropriate truncLen that gives a minimum of 270 nts and see how that goes.

The data is something I received from someone else to combine with other data for a paper, so I don't know what laboratory protocols were used.

MadsBjornsen avatar May 14 '24 22:05 MadsBjornsen

Hi @MadsBjornsen . I'm having similar issues with a metabarcoding dataset. Did you ever figure out the culprit in your case? Was it contamination? Too many PCR duplicates?

alexkrohn avatar Jul 08 '24 19:07 alexkrohn