dada2
Mixed-orientation reads & orient.fwd loses >50% of sequences
Hello,
I'm working with some publicly available mixed-orientation reads, but I'm having trouble making it through the processing steps without losing >75% of the reads from start to finish. This is Illumina NovaSeq paired-end 2 x 250 bp data produced with the 515F/907R primers (so a ~370 bp amplicon with primers removed). I'm showing results for NCBI accession SRR19744062 (which contains 85,011 reads) as an example.
The forward and reverse read files contain both the forward and reverse primers in their forward orientation:
SRR19744062 | Forward | Complement | Reverse | Reverse Complement |
---|---|---|---|---|
FWD.ForwardReads | 41,909 | 0 | 0 | 46 |
REV.ForwardReads | 42,203 | 0 | 0 | 15 |
FWD.ReverseReads | 40,529 | 0 | 0 | 15 |
REV.ReverseReads | 38,194 | 0 | 0 | 41 |
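For reference, I generated these counts with the orientation-checking snippet from the dada2 ITS tutorial, roughly as follows (a sketch; the fastq file names are placeholders for my local paths):

```r
library(dada2)
library(ShortRead)
library(Biostrings)

FWD <- "GTGCCAGCMGCCGCGGTAA"
REV <- "CCGTCAATTCMTTTRAGTTT"

# All orientations of a primer sequence
allOrients <- function(primer) {
  dna <- DNAString(primer)
  orients <- c(Forward = dna,
               Complement = Biostrings::complement(dna),
               Reverse = Biostrings::reverse(dna),
               RevComp = Biostrings::reverseComplement(dna))
  return(sapply(orients, toString))
}

# Count the reads in which a primer orientation is found
primerHits <- function(primer, fn) {
  nhits <- vcountPattern(primer, sread(readFastq(fn)), fixed = FALSE)
  return(sum(nhits > 0))
}

rbind(FWD.ForwardReads = sapply(allOrients(FWD), primerHits, fn = "SRR19744062_1.fastq.gz"),
      REV.ForwardReads = sapply(allOrients(REV), primerHits, fn = "SRR19744062_1.fastq.gz"),
      FWD.ReverseReads = sapply(allOrients(FWD), primerHits, fn = "SRR19744062_2.fastq.gz"),
      REV.ReverseReads = sapply(allOrients(REV), primerHits, fn = "SRR19744062_2.fastq.gz"))
```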
The read orientations do not follow an interleaved pattern - i.e., they do not alternate fwd/rev/fwd/rev - and the orientation pattern matches between the forward and reverse files. To the best of my knowledge, that makes this a mixed-orientation workflow. For example, here are sequences 8-15 in this sample, with (+) indicating that the read starts with the forward primer and (-) indicating that the read starts with the reverse primer:
Sequence | Forward | Reverse |
---|---|---|
seq.8 | + | - |
seq.9 | + | - |
seq.10 | + | - |
seq.11 | - | + |
seq.12 | + | - |
seq.13 | + | - |
seq.14 | - | + |
seq.15 | - | + |
I have read nearly every post I can find about mixed-orientation reads, including this extremely helpful one (https://github.com/benjjneb/dada2/issues/938), but I'm afraid those suggestions are still not enough to solve my problem. No matter how I go about processing these reads, I lose about 50% of the data from denoising to merging, for a total loss of almost 80% of the reads from start to finish - despite these being high-quality reads!
My attempts so far have included:
- I tried generating ASVs "as is" with the original R1 and R2 files. I figured that this would maximize my use of dada2's error model, and that I could re-orient the reverse ASVs after I got my final sequence table. If a reverse ASV ended up being a duplicate of a forward ASV, I could just collapse them together afterwards. However, I lose over half the data at the merging step.
- I tried the "orient.fwd" functionality in the filterAndTrim command. This does exactly what it should - the filtered reads are all in their expected orientation, so I'm feeding correctly oriented sequences into mergePairs - but for reasons unknown to me, I still lose half my reads at the merging step, even for the reads in their "expected" orientation.
- I tried option 2 again but set maxMismatch=2 for merging. Generous, I know. I don't lose quite as many reads upon merging, but it's still just shy of 50% (and a greater percentage of reads get identified as chimeras).
- I followed the recommendation from the post linked above to divide the R1 and R2 files into four files: R1_forward_seqs, R1_reverse_seqs, R2_forward_seqs, R2_reverse_seqs. I still lose half my reads at the merging step.
- To see if this was a problem with the reads themselves, I tried merging the paired ends before running dada2 at all. I get 85% of the reads to merge when I use `vsearch --fastq_mergepairs`, but when I then process the merged reads as single-end reads, I lose a remarkable percentage of reads during denoising.
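For attempt 5, the pre-merge command was along these lines (a sketch with placeholder file names; I left vsearch's mismatch settings at their defaults):

```shell
vsearch --fastq_mergepairs SRR19744062_1.fastq.gz \
        --reverse SRR19744062_2.fastq.gz \
        --fastqout SRR19744062_merged.fastq
```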
Here's more detail on these attempts, showing the number of reads at each step in the pipeline for the five approaches described above (the number of denoised sequences is about the same between forward and reverse reads, so I've shown just one value):
Attempt | Sample | Input N | N filtered | N denoised | N merged | N nochim | final orientation |
---|---|---|---|---|---|---|---|
1 | S1 | 85,011 | 60,702 | 52,893 | 16,857 | 15,287 | mixed |
2 | S1 | 85,011 | 60,391 | 53,837 | 21,021 | 18,114 | forward |
3 | S1 | 85,011 | 60,391 | 53,837 | 27,996 | 21,922 | forward |
4 | S1a | 42,378 | 31,309 | 27,098 | 9,781 | 8,847 | forward |
4 | S1b | 42,191 | 29,374 | 25,815 | 6,980 | 6,220 | reverse |
5 | S1 | 73,944 | 66,311 | 39,089 | - | 37,491 | mixed |
Other notes - I've been running dada2 in R, and I haven't been truncating reads at all (they're pretty high quality across their entire length), just removing the primers with `trimLeft=c(20,20)` and then filtering with `maxEE=c(2,2), truncQ=2`. Even if I truncate R2 a bit, I get a similar outcome.
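For concreteness, my filtering call (attempt 2) looks roughly like this - a sketch with placeholder file paths, where FWD is the 515F primer:

```r
library(dada2)

FWD <- "GTGCCAGCMGCCGCGGTAA"  # 515F, used to re-orient reads

filterAndTrim(
  fwd = "SRR19744062_1.fastq.gz", filt = "filtered/SRR19744062_F.fastq.gz",
  rev = "SRR19744062_2.fastq.gz", filt.rev = "filtered/SRR19744062_R.fastq.gz",
  trimLeft = c(20, 20),  # strip primers; no truncLen since quality stays high
  maxEE = c(2, 2),
  truncQ = 2,
  orient.fwd = FWD,      # attempt 2: put all reads into the forward orientation
  multithread = TRUE
)
```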
I've spun my wheels on this and asked some of my peers, and we just can't figure out why all the reads are disappearing. Where do they go? Why do they leave? Maybe this is supposed to happen and I just don't understand why? Or is there a problem in the data that I can't discern? I'd appreciate any insight or recommendations for things to try!
Primer sequences for reference: FWD (5'-GTGCCAGCMGCCGCGGTAA-3'), REV (5'-CCGTCAATTCMTTTRAGTTT-3')