
Heterogeneity Spacers and Primers

Open l-gallucci opened this issue 1 year ago • 5 comments

Hi @marcelm,

Thank you for developing Cutadapt. I'm currently using the latest stable version with Python 3.10.

I'm dealing with these heterogeneity spacers+primers:

Bacterial region V3-V4:
341F (5´-CCTACGGGNGGCWGCAG-3´)
341Fb (5´-TCCTACGGGNGGCWGCAG-3´)
341Fc (5´-ATCCTACGGGNGGCWGCAG-3´)
341Fd (5´-TGTCCTACGGGNGGCWGCAG-3´)
785R (5´-GACTACHVGGGTATCTAATCC-3´)

I would like to know whether there is a recommended way to deal with this. I was thinking of putting these primers in a file (as for demultiplexing), but they all represent the same primer and are used equally on the same samples, so I don't need four different outputs, one per variant. If I set up the file like:

341F ... 341Fb ... ...I expect that this will lead to different outputs depending on the primer sequence given, so I'm pretty sure that this is not the right approach.

Do you have suggestions?

l-gallucci avatar Jul 02 '24 12:07 l-gallucci

Hi, in case it is still relevant: You can provide multiple primers/adapters with the same name. So something like this:

>341F
(sequence of 341F)
>341F
(sequence of 341Fb)

Then demultiplexing will send them to the same file.
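For illustration, a file like the one above could be generated with a few lines of Python. This is only a sketch: the shared record name `341F` follows the suggestion above, while the variable names and the idea of building each variant as spacer plus primer are assumptions made for the example.

```python
# Build a primer FASTA in which every heterogeneity-spacer variant
# carries the same record name, so that all variants share one name.
PRIMER = "CCTACGGGNGGCWGCAG"      # 341F, no spacer
SPACERS = ["", "T", "AT", "TGT"]  # 341F, 341Fb, 341Fc, 341Fd
SHARED_NAME = "341F"

# Each variant is simply the spacer followed by the primer.
variants = [spacer + PRIMER for spacer in SPACERS]

# One FASTA record per variant, all with the same header line.
fasta = "".join(f">{SHARED_NAME}\n{seq}\n" for seq in variants)
print(fasta, end="")
```

Saved to a file (for example under the hypothetical name `forward_primers.fasta`), this could then be passed to Cutadapt as `-g file:forward_primers.fasta`.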

Please leave the issue open as I would like to document this.

marcelm avatar Aug 10 '24 06:08 marcelm

Hi @marcelm, still useful thank you!

Just a quick question, which may also be useful for people with the same question in the future: if I'm using just the first one (341F) and not the others, can we assume that the spacers are removed anyway? The structure is 'spacer+primer', where 341F is just the primer and the others (b, c, d) are 'spacer+primer'. I was assuming that everything before the primer is removed.

l-gallucci avatar Aug 10 '24 07:08 l-gallucci

Getting back to this: I don’t understand the last question. If still relevant, can you re-phrase?

marcelm avatar Nov 13 '24 10:11 marcelm

Hi, @marcelm

Kind of, but I will rephrase anyway just to clarify.

Suppose I have heterogeneity spacers+primers in my paired-end data, like:

Bacterial region V3-V4:
341F (5´-CCTACGGGNGGCWGCAG-3´)
341Fb (5´-TCCTACGGGNGGCWGCAG-3´)
341Fc (5´-ATCCTACGGGNGGCWGCAG-3´)
341Fd (5´-TGTCCTACGGGNGGCWGCAG-3´)
785R (5´-GACTACHVGGGTATCTAATCC-3´)

If I use only 341F (the original primer, without any spacers) and 785R as input for the tool, does Cutadapt trim everything found before the primer, including the spacers, even though I gave it only the original primers? Or do the spacers remain and only the primer is removed?

So, can we consider this a valid removal approach, or do I need a separate file listing all variants, as you suggested in your previous response, to be sure that the spacers are also removed?

>341F
(sequence of 341F)
>341F
(sequence of 341Fb)
>341F
(sequence of 341Fc)
>341F
(sequence of 341Fd)

l-gallucci avatar Nov 13 '24 10:11 l-gallucci

Suppose I have heterogeneity spacers+primers in my paired-end data, like:

Bacterial region V3-V4:
341F (5´-CCTACGGGNGGCWGCAG-3´)
341Fb (5´-TCCTACGGGNGGCWGCAG-3´)
341Fc (5´-ATCCTACGGGNGGCWGCAG-3´)
341Fd (5´-TGTCCTACGGGNGGCWGCAG-3´)
785R (5´-GACTACHVGGGTATCTAATCC-3´)

Let me reformat this to make it more visible what is going on:

341F       CCTACGGGNGGCWGCAG
341Fb    T-CCTACGGGNGGCWGCAG
341Fc   AT-CCTACGGGNGGCWGCAG
341Fd  TGT-CCTACGGGNGGCWGCAG

785R  GACTACHVGGGTATCTAATCC

If I use only 341F (the original primer, without any spacers) and 785R as input for the tool, does Cutadapt trim everything found before the primer, including the spacers, even though I gave it only the original primers? Or do the spacers remain and only the primer is removed?

If you provide a sequence with -g SEQUENCE, Cutadapt considers this to be a 5' adapter (actually primer in this case), and it uses these rules:

  • The sequence may appear anywhere within the read. That is, there can be any number of bases before it.
  • When trimming, the sequence itself and anything preceding it is removed from the read.
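The two rules above can be sketched in a few lines of Python. This is a simplified illustration, not Cutadapt's actual algorithm: it assumes exact matching with IUPAC wildcards in the primer only, ignores mismatches, indels and partial matches, and the example read and insert sequences are made up.

```python
import re

# IUPAC nucleotide codes mapped to the bases they stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def primer_regex(primer):
    """Translate a primer with IUPAC wildcards into a regular expression."""
    return "".join("[" + IUPAC[base] + "]" for base in primer)

def trim_5prime(read, primer):
    """Remove the first primer occurrence and everything preceding it.

    The read is returned unchanged if the primer is not found.
    """
    match = re.search(primer_regex(primer), read)
    return read[match.end():] if match else read

PRIMER = "CCTACGGGNGGCWGCAG"       # 341F without any spacer
insert = "TGGGGAATATTGCACAATGG"    # made-up amplicon start
# Reads carrying the four spacer variants in front of the primer.
reads = [spacer + "CCTACGGGAGGCAGCAG" + insert for spacer in ("", "T", "AT", "TGT")]
trimmed = [trim_5prime(read, PRIMER) for read in reads]
print(trimmed)
```

In this sketch all four reads end up identical after trimming, because the spacer bases sit before the primer and are removed along with it.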

So, can we consider this a valid removal approach, or do I need a separate file listing all variants, as you suggested in your previous response, to be sure that the spacers are also removed?

Yes, this is fine. Just provide the 341F sequence without the heterogeneity spacers in order to automatically remove the primer and heterogeneity spacers in one go.

One consideration that could be relevant is that, as I mentioned, the primer can appear anywhere within the read. If you want to be a bit more specific, you could require that at most a certain number of bases appear before the primer. You can use a non-internal 5' adapter for this. It would look like this: -g XN{3}CCTACGGGNGGCWGCAG;o=17 where the 3 is the maximum number of nucleotides you allow before the primer, and o=17 is used to ensure that you see the full primer sequence. (But it probably does not make a big difference – if any.)
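To make the effect of that restriction concrete, here is a rough Python sketch of "at most three bases before the full primer". It only approximates what the `XN{3}...;o=17` specification asks for: matching is exact apart from IUPAC wildcards in the primer, Cutadapt's error tolerance is ignored, and the function name and example reads are hypothetical.

```python
import re

# IUPAC nucleotide codes mapped to the bases they stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def trim_near_start(read, primer, max_prefix=3):
    """Trim the primer only if at most `max_prefix` bases precede it.

    The full primer must be present; otherwise the read is returned unchanged.
    """
    primer_pattern = "".join("[" + IUPAC[base] + "]" for base in primer)
    # Up to max_prefix arbitrary bases, then the complete primer.
    pattern = "[ACGTN]{0,%d}%s" % (max_prefix, primer_pattern)
    match = re.match(pattern, read)  # re.match anchors at the read start
    return read[match.end():] if match else read

PRIMER = "CCTACGGGNGGCWGCAG"
insert = "TGGGGAATATTGCACAATGG"  # made-up amplicon start

with_spacer = "TGT" + "CCTACGGGAGGCAGCAG" + insert       # 3 bases before primer
too_far_in = "ACGTA" + "CCTACGGGAGGCAGCAG" + insert      # 5 bases before primer

print(trim_near_start(with_spacer, PRIMER))
print(trim_near_start(too_far_in, PRIMER))
```

The first read is trimmed down to the insert, while the second is left untouched because the primer starts more than three bases into the read.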

marcelm avatar Nov 13 '24 11:11 marcelm