suggestions on primer+spacers using cutadapt
Hi,
I have these primers+spacers used for my sequencing. I was wondering if you suggest adding each primer separately and ask for --times=7 or use one primer+the longest spacer and use less stringent overlap and error to find the primers. If the second way is better I'm not sure how low for overlap and how high for the error rate I should go to make sure non-primer partial sequences don't get trimmed accidentally. Since I don't want to anchor my primers to be bale to remove partial primers.
primers: 319F_1: "ACTCCTRCGGGAGGCAGCAG" 319F_2: "CACTCCTRCGGGAGGCAGCAG" 319F_3: "ACACTCCTRCGGGAGGCAGCAG" 319F_4: "TACACTCCTRCGGGAGGCAGCAG" 319F_5: "GTACACTCCTRCGGGAGGCAGCAG" 319F_6: "CGTACACTCCTRCGGGAGGCAGCAG" 319F_7: "ACGTACACTCCTRCGGGAGGCAGCAG"
319F_1_rc: "CTGCTGCCTCCCGYAGGAGT" 319F_2_rc: "CTGCTGCCTCCCGYAGGAGTG" 319F_3_rc: "CTGCTGCCTCCCGYAGGAGTGT" 319F_4_rc: "CTGCTGCCTCCCGYAGGAGTGTA" 319F_5_rc: "CTGCTGCCTCCCGYAGGAGTGTAC" 319F_6_rc: "CTGCTGCCTCCCGYAGGAGTGTACG" 319F_7_rc: "CTGCTGCCTCCCGYAGGAGTGTACGT"
806R_1: "GGACTACHVGGGTWTCTAAT" 806R_2: "GGGACTACHVGGGTWTCTAAT" 806R_3: "TGGGACTACHVGGGTWTCTAAT" 806R_4: "ATGGGACTACHVGGGTWTCTAAT" 806R_5: "CATGGGACTACHVGGGTWTCTAAT" 806R_6: "TCATGGGACTACHVGGGTWTCTAAT" 806R_7: "ATCATGGGACTACHVGGGTWTCTAAT"
806R_1_rc: "ATTAGAWACCCBDGTAGTCC" 806R_2_rc: "ATTAGAWACCCBDGTAGTCCC" 806R_3_rc: "ATTAGAWACCCBDGTAGTCCCA" 806R_4_rc: "ATTAGAWACCCBDGTAGTCCCAT" 806R_5_rc: "ATTAGAWACCCBDGTAGTCCCATG" 806R_6_rc: "ATTAGAWACCCBDGTAGTCCCATGA" 806R_7_rc: "ATTAGAWACCCBDGTAGTCCCATGAT"
Thanks,
Is this paired-end or single-end data?
Don’t use --times=7 – this is only meant to be used when you want to remove up to 7 adapters from each read.
I think this could work:
-a 'XACGTACACTCCTRCGGGAGGCAGCAG;o=20...ATTAGAWACCCBDGTAGTCC' --revcomp
- This specifies a linked adapter (these consist of a 5' adapter and a 3' adapter).
- 'XACGTACACTCCTRCGGGAGGCAGCAG;o=20' will find any of 319F_1 to 319F_7 at the 5' end of the read.
ATTAGAWACCCBDGTAGTCCis 806R_1_rc. It’s sufficient to provide just one sequence because when an adapter is found, the adapter itself and anything following it is removed, so the subsequent spacer nucleotides are also removed.--revcomptakes care of the reverse-complemented case where the read starts with 806R and ends with 319F_x_rc.
Since I don't want to anchor my primers to be bale to remove partial primers.
In which way would the primers occur partially?
Thanks for your prompt reply and suggested solution.
This is paired-end data.
The spacers (with different length) are before the primers (in a way being partial), do they still get trimmed doing it the way you suggested?
Down below primers are in bold (showing fwd_7 and rev_7 sequences) FWD primer with the longest spacer: ACGTACACTCCTRCGGGAGGCAGCAG REV primer with the longest spacer: ATCATGGGACTACHVGGGTWTCTAAT
There were cases in which, I used -e 15 and -o up to 0.25, but sometimes I still saw primer sequences left in some reads. When I inspected those reads, only eight bases matched the provided primer sequence. I wasn’t sure how to handle those cases (maybe reducing minimum overlap but I see that you suggested 20bp for this case).
I was also wondering what is the best practice to check if primers were completely removed from reads.
Many thanks
This is paired-end data.
Then please also have a look here: https://cutadapt.readthedocs.io/en/stable/recipes.html#trimming-amplicon-primers-from-paired-end-reads This is how to trim paired-end data "in spirit", but you need to adjust it to deal with the spacers.
The spacers (with different length) are before the primers (in a way being partial), do they still get trimmed doing it the way you suggested?
I’m still not sure I understand fully when you say "partial". Do you consider ACGTAC to be the full spacer and the shorter versions CGTAC, GTAC, TAC, AC, C to be partial versions of the full spacer? Then yes, these shorter versions will also be found and trimmed.
Down below primers are in bold (showing fwd_7 and rev_7 sequences) FWD primer with the longest spacer: ACGTACACTCCTRCGGGAGGCAGCAG REV primer with the longest spacer: ATCATGGGACTACHVGGGTWTCTAAT
There were cases in which, I used -e 15 and -o up to 0.25, but sometimes I still saw primer sequences left in some reads. When I inspected those reads, only eight bases matched the provided primer sequence. I wasn’t sure how to handle those cases (maybe reducing minimum overlap but I see that you suggested 20bp for this case).
It shouldn’t be necessary to use such a high maximum error rate (assuming this is Illumina data).
How did you check whether primer sequences were still in the reads? And which eight bases of the primer were still in the read?
I was also wondering what is the best practice to check if primers were completely removed from reads.
I search for them using Cutadapt ...
I don’t know what exactly you want to achieve, but my aim would not be to remove all primer sequences (wherever they occur), but to keep only those reads that look as expected. That is, I would want to keep only those reads that have the spacer+primer in the expected positions and perhaps which are in a certain length range after trimming. For everything else, something undesirable happened in the library preparation, so I’d rather skip the entire read pair. That is, I’d use --discard-untrimmed and perhaps some length filtering.
Hi again,
Thanks for your explanation. I ended up updating my cutadapt to v5.1 and then:
fwd_primer: "ACGTACACTCCTRCGGGAGGCAGCAG" (longest sequence, containing the longest spacer+primer) rev_primer: "ATCATGGGACTACHVGGGTWTCTAAT" (longest sequence, containing the longest spacer+primer)
cutadapt -m 150:150 -O 15 -e 0.15 --discard-untrimmed --revcomp -g {config[fwd_primer]} -G {config[rev_primer]} -o {output.R1} -p {output.R2} {input.R1} {input.R2}
I really like the reports now since they are cleaner and more understandable using the --revcomp flag.
I would like to remove forward/reverse primers from 5' and their reverse complement from 3'. I was wondering how I would make sure that happens properly using --revcomp.
After checking the results, I still see some reads containing the primers. This is how I found them: I checked multiqc results over represented sequences but I didn't see primers existance in the overrepresented sequences then I used the grep function after gunzipping the fastq.gz files after running cutadapt to see if those primer sequences are found.
Some reads have primer sequences in the middle and I think those could be somehow undesirable happened in the library preparation or maybe accidentally the 16S sequence has a similar section like primer in that variable region (cause technically primers should be in the beginning of the reads). I hope those reads get removed by chimera removal, or ASV target-length removal or no annotation at lower levels.
For some reads though I still saw the primers existing:
I couldn’t find the data where I had observed eight primer bases still present in the reads to show you some examples. From what I remember, only the first few bases of the primer matched; the rest were different nucleotides that didn’t align with the original primer sequence. I’m not sure whether that means those reads lacked the primer altogether or whether a high mismatch rate prevented its removal.
Since I receive data prepared with different kits (and thus varying primer designs and fragment lengths), I’m wondering if there are any guidelines for choosing initial values for the minimum overlap (-O) and error rate (-e) parameters, which I can then tweak to optimize results.
Thanks!!!
Hi!
To follow up on this I also figured out that some ASVs at the end still had the primers in them but they were usually low abundant, annotated as NA at kingdom or phylum levels or they had a non-target length which means majority could get excluded downstream. However, I was wondering why even if I used discard untrimmed reads I still see them as ASVs in the final sequence table.
Thanks for your help!
Hi,
For some reads though I still saw the primers existing:
Since you used --discard-untrimmed, I’m very sure that a primer was removed from those reads. These examples are probably from cases where there are two (or more) consecutive primers at the 5' end of the read. Cutadapt trims the leftmost one, the other one remains.
There are a couple of things you can do about this.
- Use the
rightmostoption to make Cutadapt trim the rightmost primer, as in-g "{config[fwd_primer]};rightmost". - Use
--times=2to run a second round of adapter removal after the first one. - Run Cutadapt again on the trimmed reads, but with
--discard-trimmedin order to remove all reads with double primer occurrences.
I don’t have enough context to say which of these options would be best. The first option is probably the simplest one, but if you want to ensure that you only use reads that have one primer for each side, the third would be better.
(Also, a small request: Please avoid pasting screenshots of text. I would have liked to count the number of nucleotides in the first read to see whether it is indeed shorter than 150 bp, but doing that from an image is too much work. If you want to highlight something, you can use bold or italics.)
Since I receive data prepared with different kits (and thus varying primer designs and fragment lengths), I’m wondering if there are any guidelines for choosing initial values for the minimum overlap (-O) and error rate (-e) parameters, which I can then tweak to optimize results.
The minimum overlap -O 20 I suggested above was to ensure that only full primer occurrences are found (since these have a length of 20 nt), so there’s not that much to tune. Regarding error rate, I find the default maximum error rate of 10% (-e 0.1) already quite generous given that the sequencing error rate for Illumina reads is much lower. I just leave that at the default value unless there’s a reason not to (e.g. non-Illumina data, some kind of variability in the adapter/primer).
Thanks for your detailed responses! I apologize for not pasting the reads in a more readable format.