5' adapters that are incomplete at the tail

Open lokapal opened this issue 4 years ago • 4 comments

Hello, Marcel!

Is there any chance to remove 5' adapters that are incomplete at the tail, not the head? E.g. I have the adapter MYVERYLONGADAPTER and a lot of reads like

read1 MYVERYLONGAmysequence1
read2 MYVERYLOmysequence2

The adapters are REALLY long, so in any case the part that remains is longer than 30 bp and quite unique. I don't know precisely how much of the adapter is left in any given read. I could of course supply cutadapt with a list of 40 or so adapters covering every length between the minimum and the full length, but how will cutadapt treat them? I mean something like:

-g MYVERYLONGADAPTER
-g MYVERYLONGADAPTE
-g MYVERYLONGADAPT
-g MYVERYLONGADAP
-g MYVERYLONGADA
-g MYVERYLONGAD
-g MYVERYLONGA
-g MYVERYLONG
-g MYVERYLON
-g MYVERYLO

Should I put them from the longer to the shorter?

Thanks in advance!

lokapal, Aug 06 '21 17:08

Interesting, can you elaborate a little bit on why you get this type of data? I just wonder whether it would be worth adding support for this to Cutadapt (no promises, though).

In any case, you’ll currently have to provide all possible adapter prefixes manually, similar to what you did above. However, you should follow these recommendations, that is,

  • use anchored 5' adapters (-g ^MYVERYLONGA),
  • use an allowed number of errors not higher than -e 2,
  • do not use wildcards in the adapter sequence.

If you cannot follow the above, trimming will be quite slow (but it’ll still work). Also, you can put the sequences in a FASTA file for convenience.
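As a rough illustration of the FASTA suggestion, the prefix list could be generated with a small script instead of being typed by hand. This is only a sketch: the function names, the record names (`prefix_17`, ...) and the output file name `prefixes.fasta` are made up for this example, and it assumes that a leading `^` inside the FASTA record is treated as an anchor in the same way as on the command line.

```python
from typing import List


def anchored_prefixes(adapter: str, min_length: int) -> List[str]:
    """Return all prefixes of `adapter` from full length down to `min_length`,
    each marked as an anchored 5' adapter with a leading '^'."""
    if not 1 <= min_length <= len(adapter):
        raise ValueError("min_length must be between 1 and len(adapter)")
    return ["^" + adapter[:n] for n in range(len(adapter), min_length - 1, -1)]


def to_fasta(prefixes: List[str]) -> str:
    """Format the anchored prefixes as FASTA text.

    Record names are hypothetical; they only encode the prefix length.
    """
    lines = []
    for prefix in prefixes:
        length = len(prefix) - 1  # do not count the '^' anchor
        lines.append(f">prefix_{length}")
        lines.append(prefix)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Placeholder adapter from this thread; replace with the real sequence.
    fasta_text = to_fasta(anchored_prefixes("MYVERYLONGADAPTER", 8))
    with open("prefixes.fasta", "w") as handle:  # hypothetical file name
        handle.write(fasta_text)
```

The resulting file could then be passed along the lines of `cutadapt -g file:prefixes.fasta -e 2 -o trimmed.fastq reads.fastq` (the input and output file names are placeholders).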

"Should I put them from the longer to the shorter?"

It should not matter in this case in which order you provide the sequences.

marcelm, Aug 07 '21 09:08

As a matter of fact, my current reads have THREE adapters: the first is complete; the second is broken (sometimes truncated at the head with the end present, sometimes truncated at the tail with the head present); the third is the Illumina/PE adapter, usually (but not always!) at the 3' end. These are sequenced 4C libraries (a different experiment, but basically the same technology): https://www.sciencedirect.com/science/article/pii/S1046202318304742. Two adapters: A1 and A2. A1D means A1 direct, A1RC means A1 reverse complement, A2D means A2 direct, and A2RC means A2 reverse complement.

The examples of reads (marked up) are in the attached file reads.fa.gz

lokapal, Aug 07 '21 11:08

Hi @marcelm, I have a similar problem, but a more common one.

Given a read with a 5' adapter, e.g. ALONGADAPTORsequence, if the read is low quality at the end or has a polyG tail, cutadapt will trim it to ALONGADAPTORseq (1st case) or ALONGADAP (2nd case). The -g argument will then remove the adapter in the 1st case but not in the 2nd, which causes adapter contamination in the filtered reads.

y9c, Sep 16 '21 17:09

@yech1990 Thanks for reporting! I have opened a separate issue (#565) as this needs to be fixed in a different way than the problem that @lokapal has.

marcelm, Sep 17 '21 09:09