dada2
Multiple degenerate primers 16S
Hello Benjamin,
I sequenced the V3-V4 region of 96 stool samples (2x250 bp) with these degenerate, closely related primers:
FWD 341: CAACCTACGGGNGGCWGCAG ACCCTACGGGNGGCWGCAG TCCTACGGGNGGCWGCAG CCTACGGGNGGCWGCAG
REV 805: CATGACTACHVGGGTATCTAATCC ACGACTACHVGGGTATCTAATCC TGACTACHVGGGTATCTAATCC GACTACHVGGGTATCTAATCC
Each sample has about 90k reads. I removed the primers successfully with cutadapt (-m 229:224 -M 235:232 --pair-filter=any -e 0 --no-indels) and right-truncated the resulting reads to lengths 228 (forward) and 223 (reverse) (bbduk ftr=228; bbduk ftr=223) in order to use the figaro software (--ampliconLength 427 --forwardPrimerLength 1 --reversePrimerLength 1 --minimumOverlap 19).
Up to the merging step, 90% of the reads were kept. However, the removeBimeraDenovo step kept only 30% of the reads (reads, not unique merged sequences). I hoped the collapseNoMismatch function might help in some way before running removeBimeraDenovo, but it never finished after 6 days on a node with 120 GB of RAM.
What procedure would you advise for processing these data?
Thanks for your help,
Corentin
Hello, any advice on this please?
Corentin
Up to the merging step, 90% of the reads were kept. However, the removeBimeraDenovo step kept only 30% of the reads (reads, not unique merged sequences).
This suggests that the primer removal did not successfully remove (just) the primers, and that the cut-adapted sequences still start at different positions.
The key thing that needs to happen when primers of variable length are on the reads is for the primer to be removed from the front of the read, so that the remaining reads all start at the same position in the actual 16S gene sequence. It is also preferable to have all reads truncated to a common length after that happens.
I do not see how that is being accomplished by the cutadapt flags you included here, which enforce a minimum length but don't remove the variable-length primer sequences you described.
One thing that would help is to provide the full cutadapt command you are executing. Another thing would be a minimal test that your cutadapt command is achieving what you want. For example, make a mock fastq file with the same sequence prepended by each of your 4 primers. After you run your cutadapt command on that file, do you get a trimmed set of sequences that are all identical?
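To make the suggestion concrete, here is one way such a mock file could be built; this is a sketch, not part of anyone's pipeline, and the file name and quality string are arbitrary:

```python
# Sketch: build a mock single-end FASTQ to sanity-check the cutadapt command.
# The four forward primer variants and the insert are taken from this thread;
# the output file name is arbitrary. Degenerate positions (N, W) are given
# concrete bases so cutadapt's exact matching (-e 0) can find them.
primers = [
    "CCTACGGGNGGCWGCAG",
    "TCCTACGGGNGGCWGCAG",
    "ACCCTACGGGNGGCWGCAG",
    "CAACCTACGGGNGGCWGCAG",
]
# One fixed "biological" insert; after primer trimming, all four reads
# should be identical to this sequence.
insert = "TGGGGAATATTGCACAATGGGGGAAACCCTGATGCAGCGACGCCGCGTGAGT"

with open("mock_R1.fastq", "w") as fh:
    for i, p in enumerate(primers, 1):
        read = p.replace("N", "A").replace("W", "A") + insert
        fh.write(f"@mock_{i}\n{read}\n+\n{'I' * len(read)}\n")
```

Running the real cutadapt command on mock_R1.fastq should then produce four identical trimmed reads, all equal to the insert.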
Hello @benjjneb,
I come back with more details. I modified my approach so that each primer is removed over its whole length, and the resulting cutadapt command is:
cutadapt -g file:primers_fwd.fasta -G file:primers_rev.fasta \
-e 0 --no-indels --pair-filter=any \
-o mysample_trimmed_R1.fastq.gz -p mysample_trimmed_R2.fastq.gz \
mysample_R1.fastq.gz mysample_R2.fastq.gz
I added a "^" before each of my primer sequences to make them 5'-anchored, as follows:
>Primer_FWD_1
^CCTACGGGNGGCWGCAG
>Primer_FWD_2
^TCCTACGGGNGGCWGCAG
>Primer_FWD_3
^ACCCTACGGGNGGCWGCAG
>Primer_FWD_4
^CAACCTACGGGNGGCWGCAG
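For intuition, anchored 5' matching with -e 0 and these degenerate bases amounts to something like the sketch below. This is only an illustration, not cutadapt's implementation; the IUPAC table is restricted to the codes appearing in these primers.

```python
# Illustration of anchored 5' exact matching with IUPAC degenerate bases
# (N = any base, W = A/T, H = A/C/T, V = A/C/G). Not cutadapt's code.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "N": "ACGT", "W": "AT", "H": "ACT", "V": "ACG"}

def strip_anchored_primer(read, primers):
    """Return the read with the first matching anchored primer removed, else None."""
    for primer in primers:
        if len(read) >= len(primer) and all(
            base in IUPAC[p] for p, base in zip(primer, read)
        ):
            return read[len(primer):]
    return None

# Longest variant first, so the most specific primer is tried first.
primers = ["CAACCTACGGGNGGCWGCAG", "ACCCTACGGGNGGCWGCAG",
           "TCCTACGGGNGGCWGCAG", "CCTACGGGNGGCWGCAG"]

# Shortened example modeled on the lowercased reads shown below:
print(strip_anchored_primer("CCTACGGGGGGCAGCAGTGGGG", primers))  # -> TGGGG
```

Because the match is anchored at position 0 with no errors allowed, at most one primer variant can match a given read, which is what makes the trimmed reads line up at the same 16S position.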
Cutadapt seems to handle it well:
=== First read: Adapter Primer_FWD_1 ===
Sequence: CCTACGGGNGGCWGCAG; Type: anchored 5'; Length: 17; Trimmed: 32886 times
No. of allowed errors: 0
Overview of removed sequences
length count expect max.err error counts
17 32886 0.0 0 32886
And so on for the other primers. For this sample, 90.6% of pairs passed the filters. I also ran the command with the matched primer lowercased instead of removed (cutadapt --action=lowercase) to show what was detected as a primer. The first 10 forward reads (shortened for readability):
cctacggggggcagcagTGGGGAATATTGCACAATGGGGGAAACCCTGATGCAGCGACGCCGCGTGAGT
caacctacgggtggcagcagTGGGGAATATTGCACAATGGGGGAAACCCTGATGCAGCGACGCCGCGTGAGT
tcctacggggggctgcagTGGGGGATATTGGACAATGGGGGGAACCCTGATCCAGCGACGCCGCGTGAGT
tcctacgggcggctgcagTGGGGAATATTGCACAATGGGGGAAACCCTGATGCAGCGACGCCGCGTGAGT
tcctacgggtggcagcagTGGGGAATATTGCACAATGGGGGAAACCCTGATGCAGCGACGCCGCGTGAGT
tcctacgggaggctgcagTGGGGAATATTGCACAATGGGGGAAACCCTGATGCAGCGACGCCGCGTGAGT
caacctacggggggctgcagTGAGGAATATTGGTCAATGGGCGGGAGCCTGAACCAGCCATGCCGCGTGAAG
caacctacggggggctgcagTGGGGAATATTGCACAATGGGGGAAACCCTGATGCAGCGACGCCGCGTGAGT
tcctacgggtggctgcagTGGGGAATATTGCACAATGGGCGCAAGCCTGATGCAGCGACGCCGCGTGAGC
cctacggggggcagcagTGGGGGATATTGCACAATGGAGGAAACTCTGATGCAGCGACGCCGCGTGAGG
The resulting reads do all seem to start at the same position. Finally, I ran bbduk to right-trim the resulting reads:
bbduk.sh in=mysample_trimmed_R1.fastq.gz out=mysample_trimmed_cut_R1.fastq.gz ftr=229
bbduk.sh in=mysample_trimmed_R2.fastq.gz out=mysample_trimmed_cut_R2.fastq.gz ftr=225
After getting figaro optimized parameters and running dada2, I get this track for this sample:
input filtered denoisedF denoisedR merged nonchim %_reads_retained
mysample 105847 94426 93208 93510 81800 37347 35.3
All samples range from 35% to 55% retained reads. Do you think I can improve these results?
Thank you very much for your help, Best regards, Corentin
That all looks good to me, other than the fraction of reads passing chimera filtering of course.
I'm not sure.
Maybe looking more closely at what is being removed as chimeric could help. Instead of removeBimeraDenovo, you can run isBimeraDenovo -- it will run the same code, but return a TRUE/FALSE logical vector, with TRUE for the ASVs identified as chimeric.
With that, some questions can be asked about the chimera-flagged sequences. Are they different lengths than the other sequences? Are there high-abundance chimeras? (The vast majority of chimeras are very low abundance.) Is there any taxonomic pattern among the abundant chimeras? Looking at just the most abundant chimeras, are there some that are very similar to each other, with differences appearing repeatedly at the same base positions?