
Mixed-orientation reads & orient.fwd loses >50% of sequences


Hello,

I'm working with some publicly available mixed-orientation reads, but I'm having trouble making it through the processing steps without losing >75% of the reads from start to finish. This is Illumina NovaSeq paired-end 2 x 250 bp data produced with the 515F/907R primers (so a ~370 bp amplicon once the primers are removed). I'm showing results for NCBI accession SRR19744062 (85,011 reads) as an example.

Both the forward (R1) and reverse (R2) read files contain reads beginning with the forward primer and reads beginning with the reverse primer, all in their forward (non-complemented) orientation:

| SRR19744062 | Forward | Complement | Reverse | RevComp |
| --- | ---: | ---: | ---: | ---: |
| FWD.ForwardReads | 41,909 | 0 | 0 | 46 |
| REV.ForwardReads | 42,203 | 0 | 0 | 15 |
| FWD.ReverseReads | 40,529 | 0 | 0 | 15 |
| REV.ReverseReads | 38,194 | 0 | 0 | 41 |
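
For reference, this table is in the format produced by the primer-orientation check from the DADA2 ITS workflow; here's a minimal sketch of that check (file names are placeholders for wherever you've stored the accession):

```r
library(Biostrings)
library(ShortRead)

FWD <- "GTGCCAGCMGCCGCGGTAA"   # 515F
REV <- "CCGTCAATTCMTTTRAGTTT"  # 907R

# Generate all four orientations of a primer
allOrients <- function(primer) {
  dna <- DNAString(primer)
  orients <- c(Forward = dna, Complement = Biostrings::complement(dna),
               Reverse = Biostrings::reverse(dna),
               RevComp = Biostrings::reverseComplement(dna))
  sapply(orients, toString)
}

# Count the reads in a fastq file containing at least one primer hit
# (fixed = FALSE lets the IUPAC ambiguity codes M and R match)
primerHits <- function(primer, fn) {
  nhits <- vcountPattern(primer, sread(readFastq(fn)), fixed = FALSE)
  sum(nhits > 0)
}

fnF <- "SRR19744062_1.fastq.gz"  # placeholder paths
fnR <- "SRR19744062_2.fastq.gz"
rbind(FWD.ForwardReads = sapply(allOrients(FWD), primerHits, fn = fnF),
      REV.ForwardReads = sapply(allOrients(REV), primerHits, fn = fnF),
      FWD.ReverseReads = sapply(allOrients(FWD), primerHits, fn = fnR),
      REV.ReverseReads = sapply(allOrients(REV), primerHits, fn = fnR))
```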

The read orientations do not follow an interleaved pattern - i.e., they do not alternate fwd/rev/fwd/rev - and the orientation pattern is consistent between the forward and reverse files (when R1 starts with the forward primer, R2 starts with the reverse primer, and vice versa). To the best of my knowledge, that makes this a mixed-orientation dataset. For example, here are sequences 8-15 in this sample, with (+) indicating that the read starts with the forward primer and (-) indicating that the read starts with the reverse primer:

| Sequence | Forward | Reverse |
| --- | :---: | :---: |
| seq.8 | + | - |
| seq.9 | + | - |
| seq.10 | + | - |
| seq.11 | - | + |
| seq.12 | + | - |
| seq.13 | + | - |
| seq.14 | - | + |
| seq.15 | - | + |
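
(For the curious, here's one way to make these per-read +/- calls in R; a sketch, with an assumed file name:)

```r
library(Biostrings)
library(ShortRead)

# Call each read's orientation by whether it starts with FWD or REV
# (fixed = FALSE matches the M/R ambiguity codes in the primers)
fq <- sread(readFastq("SRR19744062_1.fastq.gz"))  # placeholder path
starts.fwd <- vcountPattern("GTGCCAGCMGCCGCGGTAA", narrow(fq, 1, 19), fixed = FALSE) > 0
starts.rev <- vcountPattern("CCGTCAATTCMTTTRAGTTT", narrow(fq, 1, 20), fixed = FALSE) > 0
orientation <- ifelse(starts.fwd, "+", ifelse(starts.rev, "-", "?"))
orientation[8:15]  # the calls shown in the table above
```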

I have read nearly every post I can find about mixed-orientation reads, including this extremely helpful one (https://github.com/benjjneb/dada2/issues/938), but I'm afraid those suggestions still aren't enough to solve my problem. No matter how I process these reads, I lose about 50% of the data between denoising and merging, for a total loss of almost 80% of the reads from start to finish - despite these being high-quality reads!

My attempts so far have included:

  1. I tried generating ASVs "as is" from the original R1 and R2 files. I figured this would make the best use of dada2's error model, and that I could re-orient the reverse-orientation ASVs after I got my final sequence table; if a reverse ASV ended up being a duplicate of a forward ASV, I could just collapse them together afterwards (see the recombination sketch after this list). However, I lose over half the data at the merging step.
  2. I tried the orient.fwd option in filterAndTrim (first sketch after this list). This does exactly what it should - the filtered reads all come out in their expected orientation, so I'm feeding correctly oriented sequences into mergePairs - but for reasons unknown to me, I still lose half my reads at the merging step, even for the reads that were already in their "expected" orientation.
  3. I tried option 2 again but set maxMismatch=2 for merging. Generous, I know. I don't lose quite as many reads upon merging, but it's still just shy of 50% (and a greater percentage of the merged reads get flagged as chimeras).
  4. I followed the recommendation from the post linked above to divide the R1 and R2 files into four files (R1_forward_seqs, R1_reverse_seqs, R2_forward_seqs, R2_reverse_seqs), processed the two orientations separately, and recombined afterwards. I still lose half my reads at the merging step.
  5. To see whether this was a problem with the reads themselves, I tried merging the paired ends before running dada2 at all. vsearch --fastq_mergepairs merges 85% of the read pairs - but when I then process the merged reads as single-end input to dada2, I lose a remarkable percentage of reads during denoising.
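
For concreteness, here's a sketch of the pipeline for attempts 2 and 3 (paths are placeholders, and the filtering parameters match the notes further down; I pass an unambiguous prefix of 515F to orient.fwd here, since the full primer contains an ambiguous M). The returnRejects = TRUE flag in mergePairs keeps the failed pairs so you can at least distinguish no-overlap failures (small nmatch) from mismatch failures (large nmismatch):

```r
library(dada2)

# Filter, orienting everything to start with the forward primer
filterAndTrim("SRR19744062_1.fastq.gz", "filtered/S1_F.fastq.gz",
              "SRR19744062_2.fastq.gz", "filtered/S1_R.fastq.gz",
              orient.fwd = "GTGCCAGC",  # unambiguous prefix of 515F
              trimLeft = c(20, 20), maxEE = c(2, 2), truncQ = 2,
              multithread = TRUE)

# Standard error learning and denoising
errF <- learnErrors("filtered/S1_F.fastq.gz", multithread = TRUE)
errR <- learnErrors("filtered/S1_R.fastq.gz", multithread = TRUE)
derepF <- derepFastq("filtered/S1_F.fastq.gz")
derepR <- derepFastq("filtered/S1_R.fastq.gz")
ddF <- dada(derepF, err = errF, multithread = TRUE)
ddR <- dada(derepR, err = errR, multithread = TRUE)

# Merge, keeping rejected pairs for inspection
# (attempt 3 adds maxMismatch = 2 here)
mm <- mergePairs(ddF, derepF, ddR, derepR, verbose = TRUE, returnRejects = TRUE)
table(accepted = mm$accept)
summary(mm[!mm$accept, c("nmatch", "nmismatch")])
```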
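And here's a sketch of the recombination step for attempts 1 and 4 (object names are hypothetical; mergers.fwd and mergers.rev stand for the mergePairs outputs from the two orientations):

```r
library(dada2)

# Hypothetical inputs: mergePairs() outputs from the forward- and
# reverse-oriented read sets, respectively
st.fwd <- makeSequenceTable(mergers.fwd)
st.rev <- makeSequenceTable(mergers.rev)

# Flip the reverse-orientation ASVs into forward orientation
colnames(st.rev) <- rc(colnames(st.rev))

# Sum abundances for the same sample across the two tables, then collapse
# ASVs that are identical up to shifts/length variation
st.all <- mergeSequenceTables(st.fwd, st.rev, repeats = "sum")
st.all <- collapseNoMismatch(st.all)
```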

Here's a more detailed summary of these attempts, showing the number of reads at each step of the pipeline for the five approaches described above (the number of denoised sequences is about the same for the forward and reverse reads, so I've shown a single value):

| Attempt | Sample | Input N | N filtered | N denoised | N merged | N nochim | Final orientation |
| ---: | --- | ---: | ---: | ---: | ---: | ---: | --- |
| 1 | S1 | 85,011 | 60,702 | 52,893 | 16,857 | 15,287 | mixed |
| 2 | S1 | 85,011 | 60,391 | 53,837 | 21,021 | 18,114 | forward |
| 3 | S1 | 85,011 | 60,391 | 53,837 | 27,996 | 21,922 | forward |
| 4 | S1a | 42,378 | 31,309 | 27,098 | 9,781 | 8,847 | forward |
| 4 | S1b | 42,191 | 29,374 | 25,815 | 6,980 | 6,220 | reverse |
| 5 | S1 | 73,944 | 66,311 | 39,089 | - | 37,491 | mixed |

Other notes - I've been running dada2 in R, and I haven't been truncating the reads at all (they're pretty high quality across their entire length), just removing the primers with trimLeft=c(20,20) and then filtering with maxEE=c(2,2), truncQ=2. That leaves ~230 bp per read, so a ~370 bp amplicon should have roughly 90 bp of overlap, well above the default minOverlap of 12. Even if I truncate R2 a bit, I get a similar outcome.

I've been spinning my wheels, and the peers I've asked are equally stumped: we just can't figure out why all the reads are disappearing. Where do they go? Why do they leave? Maybe this is supposed to happen and I just don't understand why? Or is there a problem in the data that I can't discern? I'd appreciate any insight or recommendations for things to try!

Primer sequences for reference: FWD (5'-GTGCCAGCMGCCGCGGTAA-3'), REV (5'-CCGTCAATTCMTTTRAGTTT-3')

sasugden · Apr 3, 2024