dada2
Multiple degenerate primers 16S
Hello Benjamin,
I sequenced the V3-V4 region of 96 stool samples (2x250 bp) with these degenerate, closely related primers:
FWD 341: CAACCTACGGGNGGCWGCAG ACCCTACGGGNGGCWGCAG TCCTACGGGNGGCWGCAG CCTACGGGNGGCWGCAG
REV 805: CATGACTACHVGGGTATCTAATCC ACGACTACHVGGGTATCTAATCC TGACTACHVGGGTATCTAATCC GACTACHVGGGTATCTAATCC
Each sample has about 90k reads. I removed the primers successfully with cutadapt (-m 229:224 -M 235:232 --pair-filter=any -e 0 --no-indels) and right-truncated the resulting reads to lengths 228 (forward) and 223 (reverse) (bbduk ftr=228; bbduk ftr=223) in order to use the figaro software (--ampliconLength 427 --forwardPrimerLength 1 --reversePrimerLength 1 --minimumOverlap 19).
Up to the merging step, 90% of the reads were kept. However, the removeBimeraDenovo step kept only 30% of the reads (reads, not unique merged sequences). I hoped the collapseNoMismatch function might help in some way before running removeBimeraDenovo, but it never finished after 6 days on a node with 120 GB of RAM.
What procedure would you advise for processing these data?
Thanks for your help,
Corentin
Hello, any advice on this please?
Corentin
Up to the merging step, 90% of the reads were kept. However, the removeBimeraDenovo step kept only 30% of the reads (reads, not unique merged sequences).
This suggests that the primer removal did not successfully remove (just) the primers, and that the cut-adapted sequences still start at different positions.
The key thing that needs to happen when primers of variable length are on the reads is for the primer to be removed from the front of the read, so that the remaining reads all start at the same position in the actual 16S gene sequence. It is also preferable to have all reads truncated to a common length after that happens.
I do not see how that is being accomplished by the cutadapt flags you included here, which enforce a minimum length but don't remove the variable-length primer sequences you described.
One thing that would help is to provide the full cutadapt command you are executing. Another thing would be a minimal test that your cutadapt command is achieving what you want. For example, make a mock fastq file with the same sequence prepended by each of your 4 primers. After you run your cutadapt command on that file, do you get a trimmed set of sequences that are all identical?
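To make the suggestion concrete, here is one way such a mock file could be built; this is a sketch, not part of anyone's pipeline, and the file name and quality string are arbitrary:

```python
# Sketch: build a mock single-end FASTQ to sanity-check the cutadapt command.
# The four forward primer variants and the insert are taken from this thread;
# the output file name is arbitrary. Degenerate positions (N, W) are given
# concrete bases so cutadapt's exact matching (-e 0) can find them.
primers = [
    "CCTACGGGNGGCWGCAG",
    "TCCTACGGGNGGCWGCAG",
    "ACCCTACGGGNGGCWGCAG",
    "CAACCTACGGGNGGCWGCAG",
]
# One fixed "biological" insert; after primer trimming, all four reads
# should be identical to this sequence.
insert = "TGGGGAATATTGCACAATGGGGGAAACCCTGATGCAGCGACGCCGCGTGAGT"

with open("mock_R1.fastq", "w") as fh:
    for i, p in enumerate(primers, 1):
        read = p.replace("N", "A").replace("W", "A") + insert
        fh.write(f"@mock_{i}\n{read}\n+\n{'I' * len(read)}\n")
```

Running the real cutadapt command on mock_R1.fastq should then produce four identical trimmed reads, all equal to the insert.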
Hello @benjjneb,
I come back with more details. I modified my approach so that each primer is removed over its whole length, and the resulting cutadapt command is:
cutadapt -g file:primers_fwd.fasta -G file:primers_rev.fasta \
-e 0 --no-indels --pair-filter=any \
-o mysample_trimmed_R1.fastq.gz -p mysample_trimmed_R2.fastq.gz \
mysample_R1.fastq.gz mysample_R2.fastq.gz
I added a "^" before each of my primer sequences to make them 5'-anchored, as follows:
>Primer_FWD_1
^CCTACGGGNGGCWGCAG
>Primer_FWD_2
^TCCTACGGGNGGCWGCAG
>Primer_FWD_3
^ACCCTACGGGNGGCWGCAG
>Primer_FWD_4
^CAACCTACGGGNGGCWGCAG
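For intuition, anchored 5' matching with -e 0 and these degenerate bases amounts to something like the sketch below. This is only an illustration, not cutadapt's implementation; the IUPAC table is restricted to the codes appearing in these primers.

```python
# Illustration of anchored 5' exact matching with IUPAC degenerate bases
# (N = any base, W = A/T, H = A/C/T, V = A/C/G). Not cutadapt's code.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "N": "ACGT", "W": "AT", "H": "ACT", "V": "ACG"}

def strip_anchored_primer(read, primers):
    """Return the read with the first matching anchored primer removed, else None."""
    for primer in primers:
        if len(read) >= len(primer) and all(
            base in IUPAC[p] for p, base in zip(primer, read)
        ):
            return read[len(primer):]
    return None

# Longest variant first, so the most specific primer is tried first.
primers = ["CAACCTACGGGNGGCWGCAG", "ACCCTACGGGNGGCWGCAG",
           "TCCTACGGGNGGCWGCAG", "CCTACGGGNGGCWGCAG"]

# Shortened example modeled on the lowercased reads shown below:
print(strip_anchored_primer("CCTACGGGGGGCAGCAGTGGGG", primers))  # -> TGGGG
```

Because the match is anchored at position 0 with no errors allowed, at most one primer variant can match a given read, which is what makes the trimmed reads line up at the same 16S position.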
Cutadapt seems to handle it well:
=== First read: Adapter Primer_FWD_1 ===
Sequence: CCTACGGGNGGCWGCAG; Type: anchored 5'; Length: 17; Trimmed: 32886 times
No. of allowed errors: 0
Overview of removed sequences
length count expect max.err error counts
17 32886 0.0 0 32886
And so on for the other primers. For this sample, 90.6% of pairs passed the filters. I also ran the command with the matched primer lowercased instead of removed (cutadapt --action=lowercase) to show what was detected as a primer. The first 10 forward reads (shortened for readability):
cctacggggggcagcagTGGGGAATATTGCACAATGGGGGAAACCCTGATGCAGCGACGCCGCGTGAGT
caacctacgggtggcagcagTGGGGAATATTGCACAATGGGGGAAACCCTGATGCAGCGACGCCGCGTGAGT
tcctacggggggctgcagTGGGGGATATTGGACAATGGGGGGAACCCTGATCCAGCGACGCCGCGTGAGT
tcctacgggcggctgcagTGGGGAATATTGCACAATGGGGGAAACCCTGATGCAGCGACGCCGCGTGAGT
tcctacgggtggcagcagTGGGGAATATTGCACAATGGGGGAAACCCTGATGCAGCGACGCCGCGTGAGT
tcctacgggaggctgcagTGGGGAATATTGCACAATGGGGGAAACCCTGATGCAGCGACGCCGCGTGAGT
caacctacggggggctgcagTGAGGAATATTGGTCAATGGGCGGGAGCCTGAACCAGCCATGCCGCGTGAAG
caacctacggggggctgcagTGGGGAATATTGCACAATGGGGGAAACCCTGATGCAGCGACGCCGCGTGAGT
tcctacgggtggctgcagTGGGGAATATTGCACAATGGGCGCAAGCCTGATGCAGCGACGCCGCGTGAGC
cctacggggggcagcagTGGGGGATATTGCACAATGGAGGAAACTCTGATGCAGCGACGCCGCGTGAGG
The resulting reads do all seem to start at the same position. Finally, I ran bbduk to right-trim the resulting reads:
bbduk.sh in=mysample_trimmed_R1.fastq.gz out=mysample_trimmed_cut_R1.fastq.gz ftr=229
bbduk.sh in=mysample_trimmed_R2.fastq.gz out=mysample_trimmed_cut_R2.fastq.gz ftr=225
After getting figaro optimized parameters and running dada2, I get this track for this sample:
input filtered denoisedF denoisedR merged nonchim %_reads_retained
mysample 105847 94426 93208 93510 81800 37347 35.3
All samples range from 35% to 55% retained reads. Do you think I can improve these results?
Thank you very much for your help, Best regards, Corentin
That all looks good to me, other than the fraction of reads passing chimera filtering of course.
I'm not sure.
Maybe looking more closely at what is being removed as chimeric could help. Instead of removeBimeraDenovo, you can run isBimeraDenovo -- it will run the same code, but return a TRUE/FALSE logical vector, with TRUE for the ASVs identified as chimeric.
With that, some questions can be asked about the chimera-flagged sequences. Are they different lengths than the other sequences? Are there high-abundance chimeras? (The vast majority of chimeras are very low abundance.) Is there any taxonomic pattern among the abundant chimeras? Looking at just the most abundant chimeras, are there some that are very similar to each other, with differences appearing repeatedly at the same base positions?