
Mixed-orientation reads & orient.fwd loses >50% of sequences


Hello,

I'm working with some publicly available mixed-orientation reads, but I'm having trouble making it through the processing steps without losing >75% of the reads from start to finish. This is Illumina NovaSeq paired-end 2 x 250 bp data produced with the 515F/907R primers (so a ~370 bp amplicon once the primers are removed). I'm showing results for NCBI accession SRR19744062 (85,011 reads) as an example.

Both the forward (R1) and reverse (R2) read files contain reads beginning with the forward primer and reads beginning with the reverse primer, all in their forward (non-complemented) orientation:

| SRR19744062 | Forward | Complement | Reverse | RevComp |
| --- | ---: | ---: | ---: | ---: |
| FWD.ForwardReads | 41,909 | 0 | 0 | 46 |
| REV.ForwardReads | 42,203 | 0 | 0 | 15 |
| FWD.ReverseReads | 40,529 | 0 | 0 | 15 |
| REV.ReverseReads | 38,194 | 0 | 0 | 41 |
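
For reference, this table is in the format produced by the primer-orientation check from the DADA2 ITS workflow; here's a minimal sketch of that check (file names are placeholders for wherever you've stored the accession):

```r
library(Biostrings)
library(ShortRead)

FWD <- "GTGCCAGCMGCCGCGGTAA"   # 515F
REV <- "CCGTCAATTCMTTTRAGTTT"  # 907R

# Generate all four orientations of a primer
allOrients <- function(primer) {
  dna <- DNAString(primer)
  orients <- c(Forward = dna, Complement = Biostrings::complement(dna),
               Reverse = Biostrings::reverse(dna),
               RevComp = Biostrings::reverseComplement(dna))
  sapply(orients, toString)
}

# Count the reads in a fastq file containing at least one primer hit
# (fixed = FALSE lets the IUPAC ambiguity codes M and R match)
primerHits <- function(primer, fn) {
  nhits <- vcountPattern(primer, sread(readFastq(fn)), fixed = FALSE)
  sum(nhits > 0)
}

fnF <- "SRR19744062_1.fastq.gz"  # placeholder paths
fnR <- "SRR19744062_2.fastq.gz"
rbind(FWD.ForwardReads = sapply(allOrients(FWD), primerHits, fn = fnF),
      REV.ForwardReads = sapply(allOrients(REV), primerHits, fn = fnF),
      FWD.ReverseReads = sapply(allOrients(FWD), primerHits, fn = fnR),
      REV.ReverseReads = sapply(allOrients(REV), primerHits, fn = fnR))
```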

The read orientations do not follow an interleaved pattern - i.e., they do not alternate fwd/rev/fwd/rev - and the orientation pattern is consistent between the forward and reverse files (when R1 starts with the forward primer, R2 starts with the reverse primer, and vice versa). To the best of my knowledge, that makes this a mixed-orientation dataset. For example, here are sequences 8-15 in this sample, with (+) indicating that the read starts with the forward primer and (-) indicating that the read starts with the reverse primer:

| Sequence | Forward | Reverse |
| --- | :---: | :---: |
| seq.8 | + | - |
| seq.9 | + | - |
| seq.10 | + | - |
| seq.11 | - | + |
| seq.12 | + | - |
| seq.13 | + | - |
| seq.14 | - | + |
| seq.15 | - | + |
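
(For the curious, here's one way to make these per-read +/- calls in R; a sketch, with an assumed file name:)

```r
library(Biostrings)
library(ShortRead)

# Call each read's orientation by whether it starts with FWD or REV
# (fixed = FALSE matches the M/R ambiguity codes in the primers)
fq <- sread(readFastq("SRR19744062_1.fastq.gz"))  # placeholder path
starts.fwd <- vcountPattern("GTGCCAGCMGCCGCGGTAA", narrow(fq, 1, 19), fixed = FALSE) > 0
starts.rev <- vcountPattern("CCGTCAATTCMTTTRAGTTT", narrow(fq, 1, 20), fixed = FALSE) > 0
orientation <- ifelse(starts.fwd, "+", ifelse(starts.rev, "-", "?"))
orientation[8:15]  # the calls shown in the table above
```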

I have read nearly every post I can find about mixed-orientation reads, including this extremely helpful one (https://github.com/benjjneb/dada2/issues/938), but I'm afraid those suggestions still aren't enough to solve my problem. No matter how I process these reads, I lose about 50% of the data between denoising and merging, for a total loss of almost 80% of the reads from start to finish - despite these being high-quality reads!

My attempts so far have included:

  1. I tried generating ASVs "as is" from the original R1 and R2 files. I figured this would make the best use of dada2's error model, and that I could re-orient the reverse-orientation ASVs after I got my final sequence table; if a reverse ASV ended up being a duplicate of a forward ASV, I could just collapse them together afterwards (see the recombination sketch after this list). However, I lose over half the data at the merging step.
  2. I tried the orient.fwd option in filterAndTrim (first sketch after this list). This does exactly what it should - the filtered reads all come out in their expected orientation, so I'm feeding correctly oriented sequences into mergePairs - but for reasons unknown to me, I still lose half my reads at the merging step, even for the reads that were already in their "expected" orientation.
  3. I tried option 2 again but set maxMismatch=2 for merging. Generous, I know. I don't lose quite as many reads upon merging, but it's still just shy of 50% (and a greater percentage of the merged reads get flagged as chimeras).
  4. I followed the recommendation from the post linked above to divide the R1 and R2 files into four files (R1_forward_seqs, R1_reverse_seqs, R2_forward_seqs, R2_reverse_seqs), processed the two orientations separately, and recombined afterwards. I still lose half my reads at the merging step.
  5. To see whether this was a problem with the reads themselves, I tried merging the paired ends before running dada2 at all. vsearch --fastq_mergepairs merges 85% of the read pairs - but when I then process the merged reads as single-end input to dada2, I lose a remarkable percentage of reads during denoising.
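
For concreteness, here's a sketch of the pipeline for attempts 2 and 3 (paths are placeholders, and the filtering parameters match the notes further down; I pass an unambiguous prefix of 515F to orient.fwd here, since the full primer contains an ambiguous M). The returnRejects = TRUE flag in mergePairs keeps the failed pairs so you can at least distinguish no-overlap failures (small nmatch) from mismatch failures (large nmismatch):

```r
library(dada2)

# Filter, orienting everything to start with the forward primer
filterAndTrim("SRR19744062_1.fastq.gz", "filtered/S1_F.fastq.gz",
              "SRR19744062_2.fastq.gz", "filtered/S1_R.fastq.gz",
              orient.fwd = "GTGCCAGC",  # unambiguous prefix of 515F
              trimLeft = c(20, 20), maxEE = c(2, 2), truncQ = 2,
              multithread = TRUE)

# Standard error learning and denoising
errF <- learnErrors("filtered/S1_F.fastq.gz", multithread = TRUE)
errR <- learnErrors("filtered/S1_R.fastq.gz", multithread = TRUE)
derepF <- derepFastq("filtered/S1_F.fastq.gz")
derepR <- derepFastq("filtered/S1_R.fastq.gz")
ddF <- dada(derepF, err = errF, multithread = TRUE)
ddR <- dada(derepR, err = errR, multithread = TRUE)

# Merge, keeping rejected pairs for inspection
# (attempt 3 adds maxMismatch = 2 here)
mm <- mergePairs(ddF, derepF, ddR, derepR, verbose = TRUE, returnRejects = TRUE)
table(accepted = mm$accept)
summary(mm[!mm$accept, c("nmatch", "nmismatch")])
```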
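And here's a sketch of the recombination step for attempts 1 and 4 (object names are hypothetical; mergers.fwd and mergers.rev stand for the mergePairs outputs from the two orientations):

```r
library(dada2)

# Hypothetical inputs: mergePairs() outputs from the forward- and
# reverse-oriented read sets, respectively
st.fwd <- makeSequenceTable(mergers.fwd)
st.rev <- makeSequenceTable(mergers.rev)

# Flip the reverse-orientation ASVs into forward orientation
colnames(st.rev) <- rc(colnames(st.rev))

# Sum abundances for the same sample across the two tables, then collapse
# ASVs that are identical up to shifts/length variation
st.all <- mergeSequenceTables(st.fwd, st.rev, repeats = "sum")
st.all <- collapseNoMismatch(st.all)
```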

Here's a more detailed summary of these attempts, showing the number of reads at each step of the pipeline for the five approaches described above (the number of denoised sequences is about the same for the forward and reverse reads, so I've shown a single value):

| Attempt | Sample | Input N | N filtered | N denoised | N merged | N nochim | Final orientation |
| ---: | --- | ---: | ---: | ---: | ---: | ---: | --- |
| 1 | S1 | 85,011 | 60,702 | 52,893 | 16,857 | 15,287 | mixed |
| 2 | S1 | 85,011 | 60,391 | 53,837 | 21,021 | 18,114 | forward |
| 3 | S1 | 85,011 | 60,391 | 53,837 | 27,996 | 21,922 | forward |
| 4 | S1a | 42,378 | 31,309 | 27,098 | 9,781 | 8,847 | forward |
| 4 | S1b | 42,191 | 29,374 | 25,815 | 6,980 | 6,220 | reverse |
| 5 | S1 | 73,944 | 66,311 | 39,089 | - | 37,491 | mixed |

Other notes - I've been running dada2 in R, and I haven't been truncating the reads at all (they're pretty high quality across their entire length), just removing the primers with trimLeft=c(20,20) and then filtering with maxEE=c(2,2), truncQ=2. That leaves ~230 bp per read, so a ~370 bp amplicon should have roughly 90 bp of overlap, well above the default minOverlap of 12. Even if I truncate R2 a bit, I get a similar outcome.

I've been spinning my wheels, and the peers I've asked are equally stumped: we just can't figure out why all the reads are disappearing. Where do they go? Why do they leave? Maybe this is supposed to happen and I just don't understand why? Or is there a problem in the data that I can't discern? I'd appreciate any insight or recommendations for things to try!

Primer sequences for reference: FWD (5'-GTGCCAGCMGCCGCGGTAA-3'), REV (5'-CCGTCAATTCMTTTRAGTTT-3')

sasugden · Apr 3, 2024