cutadapt icon indicating copy to clipboard operation
cutadapt copied to clipboard

Does cutadapt support the repeated bases range inside adapter sequences?

Open lokapal opened this issue 3 years ago • 7 comments

Hello!

I cannot figure out from the documentation - does cutadapt support range of repeated bases inside the adapter? Surely I'll try to do it, but it would be good anyway have "officially" supported the possibility to set something like -a ATAPN{10-15}TER. The notation means that inside ADAPTER after "P" could be ANY amount of N's from 10 to 15.

lokapal avatar Feb 12 '22 09:02 lokapal

This isn’t supported at the moment. (If it gets supported, the syntax would use a comma instead of a hyphen as in N{10,15} because that is what is used in regular expressions.) However, there are sometimes ways to achieve what you want anyway. Can you describe a bit more what your read layout is?

marcelm avatar Feb 12 '22 09:02 marcelm

As a matter of fact I have FaeI sequence (palindrome CATG) AFTER that I need to remove all to the end of the read, but retain FaeI itself (it's common genomic sequence). And it should not to be TOO far from the end of the read. (i.e. don't touch the sequence if the distance between CATG and 3' end is more than 10-20 letters and remove the "tail" in the other case).
The examples:

@HISEQ:535:XXXXXXXXXXXXXX 1:N:0:TATAATAT
AAGTTAAGAATTCAGGC**CATG**GTCTGGAGGATCCCAGCTCCCAGCAGAGGCACTTTTTTTCACACAGAATCCTGGGCAG
         Don't touch the tail!

@HISEQ:535:XXXXXXXXXXXXXX 1:N:0:TATAATAT
GAATTCAAAAGTTCATTTGACCAGTCTGTCAATCCAGCTGCTTTAGGATGATGGGGAA**CATG**GTCCCGAGG
                                                               Remove the tail!

Surely I can write something at BioPerl/BioPython to do it, but cutadapt is too smart already ;-) I try to figure out how to do it "automatically"

lokapal avatar Feb 12 '22 10:02 lokapal

Just to clarify: What would these two reads need to look like after trimming?

marcelm avatar Feb 12 '22 10:02 marcelm

I think you can get what you want by using -a CATGN{20}X, that is, a non-internal 3' adapter. You can see the X as not being allowed to match any base of the read, so you can have 20 bases following the CATG, but not more.

You’ll also need --action=retain if you do not want the CATG itself to be removed.

marcelm avatar Feb 12 '22 10:02 marcelm

It doesn't work at all, these sequence were not trimmed at all (I guess retain is in effect):

@HISEQ:XXXXXXXXXXXXXX 1:N:0:TATAATAT GAATTCTTCACTGGGGCTCCTAGGATGGGCTGGTTGGGTGGGGGTGGAGAATGTCTGCCTCCAGAGTTTGCATGGA

@HISEQ:XXXXXXXXXXXXXX 1:N:0:TATAATAT TGTGACCAAAAGTGCCTCTGCTGGGAGCTGGGATCCTCGGGACCATGATCCCAACAATGTTCACAG

The commandline was: cutadapt -j 20 --action=retain --minimum-length 20 -q 24 -a CATGN{20}X

lokapal avatar Feb 12 '22 12:02 lokapal

I've tried to build linked adapter something like to: "CATG;required;action=retain;noindels;min_overlap=4;max_error_rate=0.01...N{20}X;optional;action=trim" but to no avail: action per adapter is not supported

lokapal avatar Feb 12 '22 12:02 lokapal

You’re right, --action=retain won’t work the way you need it. I don’t think the linked adapters approach works either. I don’t have any further suggestions at the moment except that you could create an info file and parse that. But if you are just looking for CATG without indels and without mismatches, then you might as well implement this yourself just using str.find() without using Cutadapt at all.

I think what you would need is a way to specify where exactly to cut the sequence when an adapter has been found. Something like an extra character in the adapter sequence, so -a CATG_N{20} would cut between CATG and the rest. And then --action=retain would be the same as specifying -a ADAPTER_.

I’ll think further about this, but I will likely not have any time for anything like this soon.

marcelm avatar Feb 12 '22 13:02 marcelm