
Unexplained read loss during merging step

Open SimonMorvan opened this issue 2 years ago • 11 comments

Hello @benjjneb and DADA2 community!



I am having a very similar problem to issue #831: I lose a considerable portion of my sequences during the merging step even though I maintain sufficient overlap.
Context: soil samples, 2x300 bp Illumina MiSeq sequencing, 16S V3-V4 region, primers 341F-805R.

For filterAndTrim I used maxEE = c(2,2) and truncLen = c(270,240), which should leave enough overlap.
I removed the primers using trimLeft = c(17,21). Other settings were left at their default values.
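
For reference, a minimal sketch of that filtering call, assuming fnFs/fnRs and filtFs/filtRs are the input/output filename vectors as in the standard DADA2 tutorial (the names here are illustrative):

```r
library(dada2)

# Sketch of the filtering step described above; fnFs/fnRs and filtFs/filtRs
# are assumed filename vectors as in the DADA2 tutorial.
out <- filterAndTrim(fnFs, filtFs, fnRs, filtRs,
                     truncLen = c(270, 240),  # truncate forward/reverse reads
                     trimLeft = c(17, 21),    # strip 341F (17 nt) / 805R (21 nt)
                     maxEE = c(2, 2),         # other arguments left at defaults
                     multithread = TRUE)
```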

I used pool = "pseudo" for dada() and left the minimum overlap for merging at the default 12 bp. Here is what my track table looks like for 3 of my samples.
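
In code, that corresponds roughly to the following (a sketch; errF/errR and derepFs/derepRs are assumed to come from learnErrors() and derepFastq() as in the tutorial):

```r
# Denoising with pseudo-pooling; errF/errR from learnErrors() (assumed).
dadaFs <- dada(derepFs, err = errF, pool = "pseudo", multithread = TRUE)
dadaRs <- dada(derepRs, err = errR, pool = "pseudo", multithread = TRUE)

# Merging with the defaults: minOverlap = 12, maxMismatch = 0.
mergers <- mergePairs(dadaFs, derepFs, dadaRs, derepRs, verbose = TRUE)
```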

[image: read-tracking table for the three samples]

At first, I thought a quality problem with the reads was interfering with the merging step, but the quality profiles of the filtered reads look okay:

[image: QP_filtered]

Furthermore, when I use returnRejects = TRUE to generate the mergers object, there are very few rejects due to mismatches. The main reason sequences are rejected seems to be that nmatch is < 12, with a lot of sequences matching on only 3 bp.
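
To reproduce that check, here is a sketch of how the rejected pairings can be inspected (object names carried over from the sketch above):

```r
# Keep rejected pairings in the mergers output for inspection.
mergers.all <- mergePairs(dadaFs, derepFs, dadaRs, derepRs,
                          returnRejects = TRUE, verbose = TRUE)

# Rejected rows for the first sample are those with accept == FALSE.
rej <- mergers.all[[1]][!mergers.all[[1]]$accept, ]
table(rej$nmatch)     # overlap lengths among the rejects
table(rej$nmismatch)  # mismatches within the overlap
```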

I’ve asked the sequencing provider about heterogeneity spacers, as I saw these could explain a proportion of reads not making it through the merging step. They responded that they did not use heterogeneity spacers but used staggered primers that look like this:

Forward Primer | Sequence
--- | ---
341FP1-CS1 | CCTACGGGNGGCWGCAG
341FP2-CS1 | TCCTACGGGNGGCWGCAG
341FP3-CS1 | ACCCTACGGGNGGCWGCAG
341FP4-CS1 | CTACCTACGGGNGGCWGCAG

Reverse Primer | Sequence
--- | ---
805RP1 | GACTACHVGGGTATCTAATCC
805RP2 | TGACTACHVGGGTATCTAATCC
805RP3 | ACGACTACHVGGGTATCTAATCC
805RP4 | CTAGACTACHVGGGTATCTAATCC

I've tried being more stringent with the primer removal and changed trimLeft from c(17,21) to c(20,24) in order to remove even the longest versions of the primers, but the track table remains similar and I still lose a lot of reads during the merging step.

[image: read-tracking table with trimLeft = c(20,24)]

I really don't know what is going on. I've also sequenced the ITS communities of the same samples and I don't lose nearly as many sequences during their merging step, so something seems to be going on specifically with the 16S dataset.

I’ve included the 6 fastq files for the three samples, containing the raw sequences obtained from the sequencing provider, in case you want to take a look.

3B_16S_R1.fastq.gz 3B_16S_R2.fastq.gz 5A_16S_R1.fastq.gz 5A_16S_R2.fastq.gz 12B_16S_R1.fastq.gz 12B_16S_R2.fastq.gz

Thank you for your help!

SimonMorvan avatar Oct 26 '21 20:10 SimonMorvan

"Staggered primers" cause the exact same problem as heterogeneity spacers, they mean that sequences are beginning at different positions on the read, introducing artificial variation in the data that interferes with the DADA2 denoising process. Removing a fixed number of bases at the beginning doesn't work because it doesn't solve the variation in seqeunce start position in the reads.

You will need to use an external progam like cutadapt to remove the primers, and then enter into the DADA2 workflow.

benjjneb avatar Oct 26 '21 21:10 benjjneb

Hello @benjjneb, thank you for your quick reply and your help on the matter! I will try cutadapt, based on what you did in the ITS pipeline. Can I still use truncLen = c(270,240) after cutadapt, or is it better to stick with only maxEE as in the ITS pipeline?

SimonMorvan avatar Oct 27 '21 13:10 SimonMorvan

Yes, for 16S data, which does not have much overall length variation, enforcing a truncation length after removing primers is still a good idea to remove the low-quality tails.

benjjneb avatar Oct 27 '21 14:10 benjjneb

I've just tried using cutadapt to remove all the versions of the primers and their reverse complements.

A higher proportion of the reads merge, but I still lose a lot of reads at the merging step.

```
        input  filtered denoisedF denoisedR merged nonchim      ID
3B_16S  117999    87173     77205     83243  46661   30045  3B_16S
5A_16S  116101    86648     77780     83392  51362   39865  5A_16S
12B_16S 111644    84077     72540     79426  39373   25018 12B_16S
```
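
(For context, such a table is typically built with the read-tracking snippet from the DADA2 tutorial, assuming the standard object names:)

```r
# Track reads through the pipeline (from the DADA2 tutorial).
getN <- function(x) sum(getUniques(x))
track <- cbind(out, sapply(dadaFs, getN), sapply(dadaRs, getN),
               sapply(mergers, getN), rowSums(seqtab.nochim))
colnames(track) <- c("input", "filtered", "denoisedF", "denoisedR",
                     "merged", "nonchim")
rownames(track) <- sample.names
```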

I've attached the R script I've used. Script.txt

SimonMorvan avatar Oct 27 '21 15:10 SimonMorvan

What does the primerHits check show before and after cutadapt runs? Considering just FP1 and RP1, what is the output of:

```r
# Check presence of every version of the primers
rbind(FWD.ForwardReads = sapply(FP1.orients, primerHits, fn = fnFs.filtN[[1]]),
      FWD.ReverseReads = sapply(FP1.orients, primerHits, fn = fnRs.filtN[[1]]),
      REV.ForwardReads = sapply(RP1.orients, primerHits, fn = fnFs.filtN[[1]]),
      REV.ReverseReads = sapply(RP1.orients, primerHits, fn = fnRs.filtN[[1]]))
```

both before and after (with new filenames) running cutadapt?
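
(For readers landing here: primerHits and the FP1.orients/RP1.orients objects are the helpers from the DADA2 ITS workflow tutorial; a sketch of their definitions, with the FP1/RP1 sequences from this thread filled in:)

```r
library(ShortRead)
library(Biostrings)

# All orientations of a primer sequence (from the DADA2 ITS tutorial).
allOrients <- function(primer) {
    dna <- DNAString(primer)
    orients <- c(Forward = dna, Complement = Biostrings::complement(dna),
                 Reverse = Biostrings::reverse(dna),
                 RevComp = Biostrings::reverseComplement(dna))
    return(sapply(orients, toString))
}

# Count reads in which a primer is found anywhere, allowing IUPAC codes.
primerHits <- function(primer, fn) {
    nhits <- vcountPattern(primer, sread(readFastq(fn)), fixed = FALSE)
    return(sum(nhits > 0))
}

FP1.orients <- allOrients("CCTACGGGNGGCWGCAG")      # 341FP1-CS1
RP1.orients <- allOrients("GACTACHVGGGTATCTAATCC")  # 805RP1
```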

benjjneb avatar Oct 29 '21 15:10 benjjneb

Here is the output:

Before cutadapt:

```r
rbind(FWD.ForwardReads = sapply(FP1.orients, primerHits, fn = fnFs.filtN[[1]]),
      FWD.ReverseReads = sapply(FP1.orients, primerHits, fn = fnRs.filtN[[1]]),
      REV.ForwardReads = sapply(RP1.orients, primerHits, fn = fnFs.filtN[[1]]),
      REV.ReverseReads = sapply(RP1.orients, primerHits, fn = fnRs.filtN[[1]]))
                 Forward Complement Reverse RevComp
FWD.ForwardReads  101736          0       0       0
FWD.ReverseReads       0          0       0    3599
REV.ForwardReads       0          0       0    6094
REV.ReverseReads  106595          0       0       0
```

After cutadapt:

```r
rbind(FWD.ForwardReads = sapply(FP1.orients, primerHits, fn = fnFs.cut[[1]]),
      FWD.ReverseReads = sapply(FP1.orients, primerHits, fn = fnRs.cut[[1]]),
      REV.ForwardReads = sapply(RP1.orients, primerHits, fn = fnFs.cut[[1]]),
      REV.ReverseReads = sapply(RP1.orients, primerHits, fn = fnRs.cut[[1]]))
                 Forward Complement Reverse RevComp
FWD.ForwardReads       0          0       0       0
FWD.ReverseReads       0          0       0       0
REV.ForwardReads       0          0       0       0
REV.ReverseReads       0          0       0       0
```

I've also checked with the other FPs and RPs, and cutadapt seems to work, as I get zero primer hits after running it. I do have a warning in the cutadapt log saying that an adapter is preceded by "A" extremely often; the adapter in question is RP4.RC (the reverse complement of the 4th version of the staggered 805R primer). Do you think I should add an A to this adapter, and should I therefore add a T at the end of the RP4 primer as well?

SimonMorvan avatar Oct 29 '21 15:10 SimonMorvan

That looks like the primers are being effectively removed, so I am somewhat confused about why there is still a merging issue.

Can you attach the exact cutadapt/DADA2 script you are running, so I can try to reproduce on the three example files you provided?

benjjneb avatar Nov 02 '21 15:11 benjjneb

Absolutely. Script 1 is the original script I used, based on your recommendation. Script1.docx

I've also tested Script 2 after talking to a bioinformatician, who advised me to give cutadapt only the longest versions of the primers and their reverse complements, and to add some parameters (--overlap 7, --discard-untrimmed, --minimum-length 220 for both reads, and --pair-filter=any). I've checked that the other versions of the primers were effectively removed after cutadapt. Script2.docx
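
As a rough illustration, a cutadapt call with those parameters driven from R in the style of the ITS tutorial might look like this (the path and filename vectors are hypothetical; this is a sketch of the described settings, not the actual Script 2):

```r
cutadapt <- "/usr/local/bin/cutadapt"  # hypothetical path; adjust to your system

# Longest staggered-primer versions and their reverse complements, per the
# advice above; fnFs.filtN/fnRs.filtN (inputs) and fnFs.cut/fnRs.cut
# (outputs) are assumed filename vectors.
FWD <- "CTACCTACGGGNGGCWGCAG"      # 341FP4-CS1
REV <- "CTAGACTACHVGGGTATCTAATCC"  # 805RP4
FWD.RC <- dada2::rc(FWD)
REV.RC <- dada2::rc(REV)

for (i in seq_along(fnFs.filtN)) {
    system2(cutadapt, args = c(
        "-g", FWD, "-a", REV.RC,   # 5' primer and read-through RC on R1
        "-G", REV, "-A", FWD.RC,   # same for R2
        "--overlap", 7,
        "--discard-untrimmed",
        "--minimum-length", 220,
        "--pair-filter=any",
        "-o", fnFs.cut[i], "-p", fnRs.cut[i],
        fnFs.filtN[i], fnRs.filtN[i]))
}
```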

Even with those adjustments, the track table stays similar:

```
        input  filtered denoisedF denoisedR merged nonchim      ID
3B_16S  103150    83947     74405     80139  45066   29042  3B_16S
5A_16S  101317    83210     74684     80091  49469   38389  5A_16S
12B_16S  99173    80767     69632     76137  37528   24235 12B_16S
```

I've looked at the rejected sequences that did not make it through the merging step. Some are rejected because of a mismatch or an indel in the overlap, but they are a minority. Most of the rejected sequences are removed because the overlap between the two reads is < 5 bp, even though the minimum overlap I've set is 12 bp. I have a hard time understanding what is causing this: I truncate the sequences to 270 and 240 nt after cutadapt, which should leave plenty of overlap, as the 16S V3-V4 amplicon is supposed to be around 460 bp.
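
A back-of-envelope check of the expected overlap (assuming the amplicon, primers included, is ~465 bp, the commonly cited size for 341F-805R):

```r
# Rough overlap expectation; the 465 bp amplicon size is an assumption.
biological.len <- 465 - 17 - 21         # ~427 bp left after primer removal
overlap <- 270 + 240 - biological.len   # ~83 bp expected overlap
overlap
```

So read pairs overlapping by only 3-5 bp are far shorter than the amplicon size predicts, which is what makes the < 12 bp rejects so puzzling.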

Thank you for your time!

SimonMorvan avatar Nov 02 '21 15:11 SimonMorvan

Hi Ben,

Have you had the time to check whether something was off with the sequences, or with how I was treating them? Although I lose a lot of reads during the merging step, the rarefaction curves show I reach a plateau, so I guess I will continue with my analysis unless you believe I should try something else.
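
(One common way to draw such curves from a DADA2 sequence table is vegan's rarecurve; a sketch, not necessarily the exact code used here:)

```r
library(vegan)

# seqtab.nochim: chimera-filtered sequence table from removeBimeraDenovo(),
# with samples as rows (assumed).
rarecurve(seqtab.nochim, step = 100,
          xlab = "Reads sampled", ylab = "ASVs observed", label = TRUE)
```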

Thank you for your time!

Simon

[image: Rarefaction_curve]

SimonMorvan avatar Nov 22 '21 15:11 SimonMorvan

Hi Simon and Ben, I am experiencing this exact issue with the same primer pair on the same sequencing platform. Even after removing primers with cutadapt, I am still only retaining ~19% of my reads through the entire pipeline, with the majority being lost at the merging step. Was the root of this issue ever identified?

gcuster1991 avatar Jun 08 '22 16:06 gcuster1991

Hi @gcuster1991, in my case I continued with my analyses, as the rarefaction curves were looking okay. It's still a bit frustrating to keep such a small percentage of your sequencing dataset...

SimonMorvan avatar Jun 08 '22 19:06 SimonMorvan

> Hi Ben,
>
> Have you had the time to check whether something was off with the sequences, or with how I was treating them? Although I lose a lot of reads during the merging step, the rarefaction curves show I reach a plateau, so I guess I will continue with my analysis unless you believe I should try something else.
>
> Thank you for your time!
>
> Simon
>
> [image: Rarefaction_curve]

Hello Simon, I am trying to make rarefaction curves for my reads to see if they reach a plateau. Would you please be able to share the code you used to produce your rarefaction curve plot here? Thanks and regards!

Mehrdad-M3614 avatar Jun 09 '23 19:06 Mehrdad-M3614