
Incomplete Adapter Trimming with Cutadapt 5.0

Open Anna-MarieSeelen opened this issue 8 months ago • 3 comments

Hi,

First of all, thank you for developing such a useful trimming tool!

I'm using Cutadapt 5.0 with Python 3.12.9, installed via conda. I'm running into an issue where not all adapter sequences are being trimmed from my reads, and I'm not sure why.

My reads were generated using the Illumina DNA PCR-Free Prep kit, which, according to Illumina's adapter sequence documentation, uses the following two adapter sequences in both the forward and reverse reads:

This means my adapter sequences are CTGTCTCTTATACACATCT and ATGTGTATAAGAGACA

According to the website, "When performing adapter trimming, the software independently assesses each adapter for trimming." Based on this—and what I observe in my reads—I think this means that not all reads contain both adapters; some have only one, and others have none, i.e. the adapters are not linked.

Here are some examples from my raw data where both adapter sequences are present in the same read:

Forward reads (examples):

36202:ATCCCCTGTGAGATGTGTATAAGAGACAGGGCCGAGGCCACCCCGACGTTCATGCACTCGGGGTGGAACAAGCCCGCGACATACAACGCCATAGTGTTCATAGAGGGCGACACCGAGCCCGGCACGACCTGTCTCTTATACACATCTCCGA

36910:GCGAGATGTGTATAAGAGACAGGTCCCAGGACGTCGACGTGGCCTCGAAGTTCCGCCTGGCCTTCAAGGAGCGGTGCTGGGCCGGGGCCGGGATCTTCAACTGGGTCTGGCAGTACCCTGTCTCTTATACACATCTCCGAGCCCACGAGAC

56694:CTGCGAGATGTGTATAAGAGACAGCTTTCTGACCCATCTGATGGATGACCCGGGCCGACATGGGGAACGCCGAGATACCGCACGCCCCCAGCCTGTCTCTTATACACATCTCCGAGCCCACGAGACGACACCATGTATCTCGTATGCCGTC

Reverse reads (examples):

820498:CTGCGAGATGTGTATAAGAGACAGCAGCTTGTTCGCTTTGTTACGTGCTTTTCAAATACGCTTTCTGCGGGTCACTTCAGGGCGAAGATTTCTCCACTCCTGCTATAGCTGTCTCTTATACACATCTCGCAGGGGATAGTCAGATGACGCT

This is the command I used to trim the adapter sequences:

cutadapt --cores 6 \
  -g ATGTGTATAAGAGACA -a CTGTCTCTTATACACATCT --nextseq-trim=20 \
  -G ATGTGTATAAGAGACA -A CTGTCTCTTATACACATCT \
  -O 7 -o 7 \
  -e 0.2 \
  --cut -1 -U -1 \
  -m 80 -q 20,20 --max-n 0 --trim-n \
  -o "$out1" -p "$out2" "$in1" "$in2"

However, when I check for partial adapter sequences in the trimmed files, I get this result:

  • TGTCTCTTATACACA was found 48 times in the forward reads
  • ATGTGTATAAGAG was found 1712 times in the forward reads
  • TGTCTCTTATACACA was found 1 time in the reverse reads
  • ATGTGTATAAGAG was found 54 times in the reverse reads
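
The post does not show how these counts were obtained. As a minimal sketch, one way to count reads containing an exact substring is shown below; it assumes an uncompressed FASTQ file with exactly four lines per record, and the helper name `count_reads_with` is hypothetical.

```shell
# Count reads whose sequence line contains an exact substring.
# Assumption: plain FASTQ with four lines per record, so the sequence is
# always line 2 of each record (quality lines are deliberately ignored).
count_reads_with() {
    awk -v s="$1" 'NR % 4 == 2 && index($0, s) > 0 { n++ } END { print n + 0 }' "$2"
}

# Usage, with a file name as attached to this issue:
# count_reads_with ATGTGTATAAGAG trimmed_R1.txt
```

Restricting the search to sequence lines avoids counting accidental hits in quality strings, which a plain text search over the whole file could include.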

To investigate further, I searched for ATGTGTATAAGAG in the trimmed forward reads and found this example:

22362:GCTGCGAGATGTGTATAAGAGACAGGCGAGATGTGTATAAGAGACAGGAGGTGGATGAAGCTCACTCCGAAAGTCCAGTCGTCTGATCAGGCGAGGATCGTCGCGTTTCTGGCCCGCGACTCCAA

Here's the corresponding line from the original forward reads before trimming:

22486:CTGCGAGATGTGTATAAGAGACAGCTGCGAGATGTGTATAAGAGACAGGCGAGATGTGTATAAGAGACAGGAGGTGGATGAAGCTCACTCCGAAAGTCCAGTCGTCTGATCAGGCGAGGATCGTCGCGTTTCTGGCCCGCGACTCCAAGGT

So my questions are:

  • Why are partial adapter sequences still present after trimming?
  • Am I using the correct parameters to ensure both adapters are fully trimmed?
  • Additionally, could you explain where the extra G at the beginning of the trimmed read is coming from?

I copied the first 100000 lines of my raw forward and reverse FASTQ files and of my trimmed forward and reverse files, and attached them to the issue in case you need them for troubleshooting.

raw_R1.txt raw_R2.txt trimmed_R1.txt trimmed_R2.txt

I hope you can help me with this, and thanks in advance!

Best regards,

Anna

Anna-MarieSeelen May 01 '25 10:05

Sorry, but I missed your question. Is this still relevant?

marcelm May 28 '25 07:05

No problem about the delay, and thank you for your answer. Yes, this issue is still relevant for my research.

Anna-MarieSeelen May 28 '25 07:05

Hi,

My reads were generated using the Illumina DNA PCR-Free Prep kit, which, according to Illumina's adapter sequence documentation, uses the following two adapter sequences in both the forward and reverse reads: This means my adapter sequences are CTGTCTCTTATACACATCT and ATGTGTATAAGAGACA According to the website, "When performing adapter trimming, the software independently assesses each adapter for trimming."

To be honest, that sentence is not very clear and at least I cannot understand it in isolation. But I don’t think it means that both adapters are used in both R1 and R2 reads.

I often refer to this page when trying to understand read structure: https://teichlab.github.io/scg_lib_structs/methods_html/Illumina.html

Under the "Nextera Dual Index Library" heading, you can see that CTGTCTCTTATACACATCT comes after the insert. So if the insert is shorter than the read length, this is what you would see in the read. Thus, that adapter would be a 3' adapter in Cutadapt terminology.

I’m less sure about the other adapter ATGTGTATAAGAGACA. If you search for it on that page, you can see that it appears just before the insert, but there are some bases (AG) before it and one (G) after it. This corresponds to the sequence shown on the Illumina page under the "Nextera Mate Pair Adapter Trimming" heading, so I guess there are some small differences between the protocols.

In any case, it appears that this adapter sequence would only appear in R2 (if the fragment is shorter than the read length), but because R2 is sequenced in the other orientation, you would see its reverse complement TGTCTCTTATACACAT. This would then also be a 3' adapter.

However, I tried this now on the data you provided and R1 is one nucleotide too long if I use that sequence, so it seems you should trim the reverse complement of that slightly longer sequence. And it turns out that the reverse complement of that slightly longer adapter is identical to the first adapter ...

So my best guess is therefore that you need -a CTGTCTCTTATACACATCT -A CTGTCTCTTATACACATCT.
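
The reverse-complement relationships described above can be checked with a few lines of shell. This is only an illustrative sketch: `revcomp` is a hypothetical helper name, it handles upper-case A, C, G and T only, and the longer sequence is assembled from the flanking bases (AG before, G after) mentioned above.

```shell
# Reverse-complement a DNA sequence (upper-case A, C, G, T only).
revcomp() {
    printf '%s\n' "$1" | rev | tr ACGT TGCA
}

# The 16-nt adapter as given in the question:
revcomp ATGTGTATAAGAGACA       # prints TGTCTCTTATACACAT

# The same adapter with the flanking bases (AG before, G after):
revcomp AGATGTGTATAAGAGACAG    # prints CTGTCTCTTATACACATCT
```

The second result is identical to the first adapter, CTGTCTCTTATACACATCT, which is why the same sequence ends up being used for both -a and -A.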

Note that if you use only these options and then re-run Cutadapt on the output with -a ATGTGTATAAGAGACA, you will find some occurrences of that adapter as well, but I just don’t know where they come from, so I cannot say what you should do with them. Perhaps they represent some kind of contamination and you should remove the reads that contain them with --discard-trimmed (in a second Cutadapt round, that is).
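
As a rough sketch, the two-round approach described above could look like the following. The function name `trim_then_discard` and the intermediate file names are hypothetical placeholders, and the sketch assumes Cutadapt's default pair filtering, under which a pair is dropped when the filter applies to either read.

```shell
# Round 1: trim the 3' adapter from both reads.
# Round 2: discard read pairs in which ATGTGTATAAGAGACA is still found in R1.
# Arguments: raw R1, raw R2, final R1, final R2 (all placeholders).
trim_then_discard() {
    in1=$1; in2=$2; out1=$3; out2=$4
    tmp1=round1_R1.fastq    # hypothetical intermediate file names
    tmp2=round1_R2.fastq
    cutadapt -a CTGTCTCTTATACACATCT -A CTGTCTCTTATACACATCT \
        -o "$tmp1" -p "$tmp2" "$in1" "$in2" &&
    cutadapt -a ATGTGTATAAGAGACA --discard-trimmed \
        -o "$out1" -p "$out2" "$tmp1" "$tmp2"
}

# Usage:
# trim_then_discard raw_R1.fastq raw_R2.fastq clean_R1.fastq clean_R2.fastq
```

Whether discarding these reads is appropriate depends on where the remaining adapter occurrences come from, which is not resolved here.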

Maybe one more note: You usually don’t need to get the trimming perfect. There will always be some type of contamination that you cannot get rid of. Sometimes trimming isn’t even necessary at all, for example, when adapter occurrences are rare and you use a read mapper afterwards that can do soft clipping (most can).

Based on this—and what I observe in my reads—I think this means that not all reads contain both adapters; some have only one, and others have none, i.e. the adapters are not linked.

Sounds correct. Linked adapters are usually used for removing primer sequences.

This is the command I used to trim the adapter sequences: cutadapt --cores 6 -g ATGTGTATAAGAGACA -a CTGTCTCTTATACACATCT --nextseq-trim=20 -G ATGTGTATAAGAGACA -A CTGTCTCTTATACACATCT -O 7 -o 7 -e 0.2 --cut -1 -U -1 -m 80 -q 20,20 --max-n 0 --trim-n -o "$out1" -p "$out2" "$in1" "$in2"

A couple of remarks:

  • Remove -o 7: When you provide -O 7, it is applied to all adapters (both R1 and R2). When you use -o, you are actually setting the output file name to 7 (but this is overridden later when you use -o "$out1").
  • Use either --nextseq-trim or -q. You should not use both types of quality trimming at the same time.
  • You are being quite strict with --max-n 0. Depending on what you are doing, reads with one or so N in them may be perfectly usable.
  • Why do you remove the last nucleotide of each read (--cut -1 -U -1)?
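
Putting these remarks together with the adapter suggestion from earlier in this reply, a revised invocation might look like the sketch below. It is one possible reading, not a tested recommendation: it keeps --nextseq-trim=20 and drops -q 20,20 (the reverse choice would be equally consistent with the remarks), and it leaves out --max-n 0 and --cut -1 -U -1 because their purpose is questioned above. The wrapper name `trim_pair` is hypothetical.

```shell
# Arguments: raw R1, raw R2, trimmed R1, trimmed R2 (placeholders).
# Changes relative to the original command:
#   - no stray "-o 7" (only -O 7, the minimum overlap, is kept)
#   - only one kind of quality trimming (--nextseq-trim, no -q)
#   - only the 3' adapter, given for both R1 (-a) and R2 (-A)
trim_pair() {
    cutadapt --cores 6 \
        -a CTGTCTCTTATACACATCT -A CTGTCTCTTATACACATCT \
        -O 7 -e 0.2 \
        --nextseq-trim=20 \
        -m 80 --trim-n \
        -o "$3" -p "$4" "$1" "$2"
}

# Usage:
# trim_pair "$in1" "$in2" "$out1" "$out2"
```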

  • Why are partial adapter sequences still present after trimming?

Cutadapt searches for all adapter sequences that you provide, but by default removes only the best matching one from each read (that is, at most one from R1, one from R2). This applies even if you provide one 5' adapter and one 3' adapter - if the 3' adapter matches better, only it is removed and vice versa.
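
For completeness, Cutadapt has a --times (-n) option that raises the number of adapter-removal rounds per read. The sketch below only illustrates that option with the adapters from the original command; it is not what is recommended earlier in this reply (which is to use -a/-A only), and the wrapper name `trim_two_rounds` is hypothetical.

```shell
# Arguments: raw R1, raw R2, trimmed R1, trimmed R2 (placeholders).
# With -n 2, up to two adapter matches are removed from each read instead
# of only the single best match.
trim_two_rounds() {
    cutadapt -n 2 \
        -g ATGTGTATAAGAGACA -a CTGTCTCTTATACACATCT \
        -G ATGTGTATAAGAGACA -A CTGTCTCTTATACACATCT \
        -o "$3" -p "$4" "$1" "$2"
}

# Usage:
# trim_two_rounds "$in1" "$in2" "$out1" "$out2"
```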

  • Am I using the correct parameters to ensure both adapters are fully trimmed?

  • Additionally, could you explain where the extra G at the beginning of the trimmed read is coming from?

You mean like the one in line 22362? The original has two occurrences of the adapter sequence. The sequence preceding the second adapter sequence happens to be one base longer than the sequence preceding the first occurrence. Cutadapt trims the leftmost occurrence of an adapter by default.
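
The effect of removing the leftmost occurrence can be reproduced on the start of raw read 22486 with plain shell string handling. This is a simplified sketch: it uses exact matching only, whereas Cutadapt tolerates mismatches according to -e, and the variable names are illustrative.

```shell
adapter=ATGTGTATAAGAGACA

# Start of raw read 22486 as quoted in this issue (truncated for brevity).
read_seq=CTGCGAGATGTGTATAAGAGACAGCTGCGAGATGTGTATAAGAGACAGGCGAGATGTGTATAAGAGACAGGAGGTGG

# "#" strips the shortest prefix matching *adapter, that is, everything up
# to and including the leftmost exact occurrence of the adapter.
rest=${read_seq#*"$adapter"}

printf '%s\n' "$rest"
# prints GCTGCGAGATGTGTATAAGAGACAGGCGAGATGTGTATAAGAGACAGGAGGTGG
```

In this exact-match sketch the remainder begins with the G that directly follows the 16-nt adapter in the raw read, and it still contains further copies of the adapter, which matches the start of trimmed read 22362.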

marcelm May 28 '25 13:05