cutadapt empty reads outputted when --revcomp is used

Hello, First of all, I am a fan of Cutadapt, used and cited many many times!

However, I am having an issue right now I cannot actually understand why, I have several fastq files marker reads and they contain a mixture of forward and reverse reads. I want to trim everything extending out from the PCR primers one these reads. I tried to build a command that could trim the forward reads, look for the primers in reverse complemented reads and if found, output the rc form in the output.

So I looped over the fastq files like this

wSSUmCf="TATYGYTCTTNAACGAGGAATC" # amf forward primer
wLSUmBr="AACACTCGCAYAYATGYTAGA" # amf reverse primer

for file in *.fastq
do cutadapt \
	-g $wSSUmCf -a $wLSUmBr \
	-e 0.01 -n 2 \
	--cores=10 \
	--match-read-wildcards \
	--revcomp \
	--times 10 \
	--trimmed-only \
	-o ${file//.fastq/_stripped.fastq} $file
done

Cutadapt runs smoothly but the out put is kind of weird. In some cases the reverse complemented sequence is written in the output correctly, in some other cases is written like this

I am running cutadapt=4.0=py39hbf8eff0_0 installed form bioconda.

A small subset of the read including the one that is not outputted is provided as a zipped test.fastq file.

test.zip

Thanks so much for your help! Gian

Jun 10 '22 20:06 Gian77

Hi! Thanks for providing the test file – this makes it easy to reproduce what you saw.

When I manually reverse complement your /1715/ccs example sequence, I can see that it starts with AACACTCGCACATATGCTAGA, that is, it matches your 3' adapter sequence. Because it’s a 3' adapter, the occurrence itself and everything following it is removed, so the read is empty after trimming. I’m not super up-to-date on how amplicon sequencing with PacBio works, so I don’t know why this would happen there. In Illumina sequencing, however, seeing a 3' adapter at the beginning of a read would be a sign for a primer dimer (two adapters attached to each other) with no fragment of interest in between, so trimming such a read to length zero would be the correct thing to do.

I have a couple of remarks regarding the used command-line arguments:

You specify -n 2 but also --times 10. -n is an alias for --times, so one of them is ignored (I think the last occurrence is the one that counts).
Why do you need --times? In case you want to make sure you find the rightmost occurrence of the 5' adapter as described in #254, note that I added a "rightmost" search parameter just a couple of days ago. You would need to upgrade to Cutadapt 4.1 to use it.
-e 0.01 specifies a maximum error rate of 0.01, that is, of 1%. Because your primer sequences are shorter than 100 nt, this is the same as -e 0 and I would prefer to write it that way because it makes it clearer that no errors are allowed.
--match-read-wildcards is needed to allow wildcards in the reads. If this happens in CCS reads, then the option is fine, but if your intention was to allow wildcards in the primer, then you can leave the option out (wildcards in the adapter are allowed by default).

I’m available on Monday again in case you have further questions.

Jun 10 '22 22:06 marcelm

Hello @marcelm,

Thanks a lot for the fast reply and the explanation. I will correct my commands accordingly.

The cutadapt behavior you described on my reads makes total sense to me.

As for the PacBio run, the sequencing facility told me that "In a multi-plexed environment, the output is a single consensus sequence for each barcoded amplicon sequence on a per-ZMW bases (one consensus sequence saved per-ZMW). The orientation of the [barcoded] ccs reads will be both forward and reverse and are amenable to PCR primer-trimming..." I believe those reversed reads are good quality, I BLAST a few and they gave me the exact taxa I was expecting to see with high % coverage and identity.

If I use the rightmost option wouldn't I still keep the adapter on the rc reads?

So, I need to find a way to reverse complement those reads and trim the adapters in the correct way, that 3' adapter has to be interpreted as a 5'.

What would you suggest? Can I first reverse complement the reads that are in the reverse form, then run cutadapt twice with the adapters in the correct order 5' -> 3' so that first time half the reads are trimmed each cutadapt round?

Thanks so much again! Gian

Jun 14 '22 14:06 Gian77

Hi again and sorry for the delay this time – not sure if this is relevant anymore.

I’m once again not sure how your reads look (your sequencing facility didn’t say where the primers should be), but I would guess that the --revcomp option would do the right thing. Are you aware of how it works? With it, Cutadapt does an adapter search both on the forward and reverse-complemented read. It then checks the match scores: If the score was higher for the reverse complement, then it reverse complements the read and continues with that. You could say that the --revcomp option is intended to be used for "normalizing" the orientation of the reads. You shouldn’t need to do any reverse-complementing yourself.

Note that one could also swap 5' with 3' adapters, swap rightmost to leftmost etc., but it is much easier to just reverse-complement the read and leave everything else the same.

Jul 01 '22 09:07 marcelm

Hello @marcelm,

No worries, I understand - we are all crazy busy in science... So, I think I found a solution, but please correct me if I am wrong.

Fist run cutadapt to find the adapters and to --revcomp the reads in where the adapters was reversed. In this way it seems that all the reads are reoriented in the right 5' -> 3' direction. To visualize the primers I alsdo added --action lowercase . Reversing the reads I think is necesary because they need to be clustered into ASVs at some point so they will need to align correctly.
Then, I did a second run of cutadapt to trim the adapters using -e 0.1 which I think gave a good compromise between accuray and reads lost. I looked at a random subset of 3000 reads after running this and it seems reads are all good, trimmed, and oriented. This is the code:

onda activate cutadapt
echo -e "cutadapt version: `cutadapt --version`" 

for file in `ls *.fastq`
do cutadapt \
	-g $wSSUmCf -a $wLSUmBr_RC \
	-e 0.1 \
	--cores=$cores \
	--match-read-wildcards \
	--revcomp \
	--action lowercase \
	--discard-untrimmed \
	-o ${file//.fastq/_rc.fastq} $file
done

for file in `ls *_rc.fastq`
do cutadapt \
	-g $wSSUmCf -a $wLSUmBr_RC \
	-e 0.1 \
	--cores=$cores \
	--match-read-wildcards \
	-o ${file//_rc.fastq/_trim.fastq} $file
done
conda deactivate

Thanks a lot! G.

Jul 01 '22 14:07 Gian77

Hi again. I was on vacation, sorry for the delay. Is the above working for you? Perhaps it is clear, but you should not need to run Cutadapt twice in your final pipeline: If you omit --action=lowercase from the first command, it should already do everything necessary (trim the reads and normalize their orientation).

It looks ok otherwise, just a note regarding the for loop: Instead of

for file in `ls *.fastq`

you can just write

for file in *.fastq

Aug 02 '22 12:08 marcelm

Hi @marcelm,

Hope you had fun during your vacation - here in Michigan is hot and humid and no vacation yet...

Thanks for approving - I believe that cutadapt worked well on PacBio ccs reads. --action lowercase was just to really see what cutadapt was doing and it worked. I think you can close this.

Regards, Gian

Aug 05 '22 01:08 Gian77

Thanks! Summer in Stockholm is short, so everyone is on vacation in July ...

Just open a new issue or re-open this one if you have further questions.

Aug 05 '22 07:08 marcelm

cutadapt cutadapt copied to clipboard

empty reads outputted when --revcomp is used

cutadapt
cutadapt copied to clipboard