Question: How to trim multiple overhang adapter sequences and primers from paired-end Fastq files
Hello, I am new to Cutadapt and would like some guidance on how to trim overhang adapter sequences and 16S rRNA V3-V4 primers from 96 paired-end FASTQ files. The amplicon PCR was done with a pool of seven 16S rRNA gene-specific primers targeting the V3-V4 region, with overhang adapter sequences appended 5’ of the target sequence, as follows:
Forward primer
XT_338F Overhang Adaptor Sequence 5’-3’: TCGTCGGCAGCGTCAGATGTGTATAAGAGACAG
XT_338F V3V4 16S Primer Sequence 5’-3’: ACTCCTRCGGGAGGCAGCAG
Reverse primer
XT_806R Overhang Adaptor Sequence 5’-3’: GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAG
XT_806R V3V4 16S Primer Sequence 5’-3’: GGACTACHVGGGTWTCTAAT
However, a heterogeneity spacer (0-6 bases) was included to provide phase separation across the library as shown in the attached screenshot. Please note that the sequences in red are the 16S V3-V4 primers whereas those highlighted in blue are the heterogeneity spacer bases, which are part of the overhang adapter sequence.
Should I trim the reads as shown below, or is there a more appropriate way to go about this?
$ cutadapt -g ADAPTER_FWD1 -g ADAPTER_FWD2...-g ADAPTER_FWD7 -G ADAPTER_REV1 -G ADAPTER_REV2...-G ADAPTER_REV7 -o out.1.fastq -p out.2.fastq sample1_R1.fastq sample1_R2.fastq
Alternatively, should I combine all the sequences of the seven oligo sets as one adapter in the forward and reverse orientations and trim accordingly?
Also, is it possible for me to create a loop in bash to trim the 96 Fastq files in one go instead of repeating the trimming 96 times for each paired end file?
I would really appreciate your response, since this is quite confusing for me. Thank you.
Hi, the adapter sequences you give seem to be the sequencing primers. They need to be added to the DNA fragments so that the sequencing process can start, but you shouldn’t see them at the beginning of the respective read because the first sequenced base is the one right after them. With one exception: If the fragment/insert itself is shorter than the read length, then you will start to see the respective other sequencing primer at the end of the read. See here for a good explanation: https://teichlab.github.io/scg_lib_structs/methods_html/Illumina.html
To confirm, you could have a look at your FASTQ files and check whether you see the adapter at the beginning of each read, or whether each read starts with the heterogeneity spacer + primer.
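One way to do that check from the command line is sketched below. It is only a rough sketch: it assumes uncompressed, four-line-per-record FASTQ files (for .gz files, decompress first), and the function names show_read_starts and count_prefix are made up for this example. Exact prefix matching only makes sense for the overhang adapter, which has no ambiguity codes; the primers contain degenerate bases such as R, so look at those by eye.

```shell
# Hypothetical helpers for eyeballing how reads begin.
# Assumption: plain (uncompressed) FASTQ with exactly 4 lines per record.

# Print the sequence line of the first N reads (default 10).
show_read_starts() {
    fastq="$1"
    n="${2:-10}"
    # The sequence is line 2 of every 4-line record.
    head -n "$((n * 4))" "$fastq" | awk 'NR % 4 == 2'
}

# Count how many of the first N reads (default 1000) start exactly with PREFIX.
count_prefix() {
    fastq="$1"
    prefix="$2"
    n="${3:-1000}"
    head -n "$((n * 4))" "$fastq" |
        awk -v p="$prefix" 'NR % 4 == 2 && index($0, p) == 1 { c++ } END { print c + 0 }'
}

# Example usage (file name taken from the question):
#   show_read_starts sample1_R1.fastq 10
#   count_prefix sample1_R1.fastq TCGTCGGCAGCGTCAGATGTGTATAAGAGACAG 1000
```

If count_prefix reports (close to) zero for the overhang adapter, the reads most likely start with the spacer and primer instead.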
If each read does start directly with a variable-length spacer followed by the primer sequence, you can deal with the spacer by prefixing your primer with X and six N nucleotides (making this a "noninternal 5' adapter", see the documentation). Also use a minimum overlap corresponding to your primer length (20 nt for both primers): -g "XNNNNNNACTCCTRCGGGAGGCAGCAG;o=20". Do the same for the R2 reads. Overall, it should look something like this (N{6} is shorthand for NNNNNN):
cutadapt -g "XN{6}ACTCCTRCGGGAGGCAGCAG;o=20" -G "XN{6}GGACTACHVGGGTWTCTAAT;o=20" -o out.1.fastq -p out.2.fastq in.1.fastq in.2.fastq
Regarding how to create a loop in bash to process all samples, please see https://cutadapt.readthedocs.io/en/stable/recipes.html#many-samples.
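As a rough illustration of what such a loop could look like for your file names, here is a minimal sketch. It assumes every sample follows the <sample>_R1.fastq / <sample>_R2.fastq pattern from your example command; the function name trim_all, the output directory "trimmed", the ".trimmed.fastq" suffix and the CUTADAPT variable (handy for a dry run) are all made up for this example, so adjust them to your setup.

```shell
# Sketch: run the same cutadapt command on every R1/R2 pair in a directory.
# Assumptions: files are named <sample>_R1.fastq and <sample>_R2.fastq;
# output names and the CUTADAPT override are illustrative only.
CUTADAPT="${CUTADAPT:-cutadapt}"

trim_all() {
    indir="${1:-.}"
    outdir="${2:-trimmed}"
    mkdir -p "$outdir"
    for r1 in "$indir"/*_R1.fastq; do
        # Skip the unexpanded pattern when no file matches.
        [ -e "$r1" ] || continue
        sample=$(basename "$r1" _R1.fastq)
        r2="$indir/${sample}_R2.fastq"
        if [ ! -e "$r2" ]; then
            echo "skipping $sample: no R2 file found" >&2
            continue
        fi
        "$CUTADAPT" \
            -g "XN{6}ACTCCTRCGGGAGGCAGCAG;o=20" \
            -G "XN{6}GGACTACHVGGGTWTCTAAT;o=20" \
            -o "$outdir/${sample}_R1.trimmed.fastq" \
            -p "$outdir/${sample}_R2.trimmed.fastq" \
            "$r1" "$r2"
    done
}

# Example usage:
#   trim_all . trimmed                 # run cutadapt on all pairs in the current directory
#   CUTADAPT=echo; trim_all . trimmed  # dry run: only print the arguments
```

Running it once with CUTADAPT=echo first is a cheap way to check that the R1/R2 pairing and output names come out as expected before processing all 96 samples.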
Dear Marcel,
Thank you for your insight and response. I will inspect the Fastq files as you've suggested.