Issue with Mismatch Settings in Cutadapt for Demultiplexing
Hello,
I am using cutadapt for demultiplexing my environmental DNA samples and want to allow a 1 bp mismatch across the entire index/adaptor sequence, which is 8 bp long in my case.
My current command is as follows:
However, in the output, it appears that cutadapt by default only allows a 1 bp mismatch in the last four bases of the 8 bp index/adaptor sequence. Below is a screenshot from my log file showing the issue:
Could you help me understand why this is happening, and how I can modify the settings to allow 1 bp of mismatch across all positions of my index/adaptor sequence?
Thank you for your help!
Best, Karoline
Hi,
(a small ask for next time: Please avoid posting screenshots of terminal output, and paste the actual text instead. This makes it easier for me to read and to copy and paste.)
First, may I suggest that you use Cutadapt’s demultiplexing functionality instead of writing your own loop? You can just provide Cutadapt with all the adapters/barcodes in a single FASTA file and let it split out the reads into one output file per adapter. Please see this section in the documentation: https://cutadapt.readthedocs.io/en/stable/guide.html#demultiplexing. This should be a lot more efficient (=faster) than the custom loop.
You can also run Cutadapt on multiple cores to speed it up (use option --cores with the number of cores you want it to use).
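To make the two suggestions above concrete, here is a minimal Python sketch that builds a barcode FASTA file and the corresponding Cutadapt command line. The barcode names and sequences, the file names `barcodes.fasta` and `reads.fastq.gz`, the output pattern `demux-{name}.fastq.gz` and the core count are all placeholders, not taken from the original command. The sketch only prints the command and does not run Cutadapt. Also check that your Cutadapt version supports combining --cores with demultiplexing.

```python
import shlex

# Hypothetical barcode names and 8 bp sequences; replace with your own.
barcodes = {
    "sample1": "ACGTACGT",
    "sample2": "TGCATGCA",
}


def barcode_fasta(barcodes):
    """Return FASTA text with one record per barcode."""
    return "".join(f">{name}\n{seq}\n" for name, seq in barcodes.items())


def demultiplex_command(fasta_path, reads_path, cores=4):
    """Build a Cutadapt demultiplexing command as an argument list.

    -g file:... reads all 5' adapters from the FASTA file, and the
    {name} placeholder in -o yields one output file per adapter name.
    """
    return [
        "cutadapt",
        "-e", "1",             # allow one error over the full adapter
        "--overlap", "8",      # require the full 8 bp to match
        "--cores", str(cores),
        "-g", f"file:{fasta_path}",
        "-o", "demux-{name}.fastq.gz",
        reads_path,
    ]


fasta_text = barcode_fasta(barcodes)
command = demultiplex_command("barcodes.fasta", "reads.fastq.gz")

print(fasta_text, end="")
print(shlex.join(command))
```

Write `fasta_text` to `barcodes.fasta` and run the printed command in a shell (or pass `command` to `subprocess.run`) once the paths point at real files.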
Regarding the error rate: The numbers under the "No. of allowed errors" heading mean something a bit different from what you are suggesting. They do not refer to positions within the adapter, but to the length of the match, and they are relevant only for partial adapter matches. You did provide the option --overlap 8 to avoid partial matches, but since the adapter type that you are using ("regular 5'") allows partial matches, this info is printed anyway. The line "1-4 bp: 0; 5-8 bp: 1" says: If the match involves 1 to 4 bases of the adapter, zero errors are allowed. If the match involves 5 to 8 bases, 1 error is allowed, and that error can be at any position of the match. See also this section in the documentation: https://cutadapt.readthedocs.io/en/stable/guide.html#error-tolerance. Perhaps I could improve this message to take the minimum overlap length into account.
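As an illustration of where those numbers come from, here is a small Python sketch of the rule described in the error-tolerance section of the documentation: the number of allowed errors is the match length multiplied by the maximum error rate, rounded down. The function names are made up for this example, and this is not Cutadapt's actual code.

```python
import math


def allowed_errors(match_length, error_rate):
    """Allowed errors for a match of the given length: floor(length * rate)."""
    return math.floor(match_length * error_rate)


def error_table(adapter_length, error_rate):
    """Map each possible match length (1..adapter_length) to its allowed errors."""
    return {
        length: allowed_errors(length, error_rate)
        for length in range(1, adapter_length + 1)
    }


# An 8 bp adapter with -e 0.2, as in this thread.
table = error_table(8, 0.2)
for length, errors in table.items():
    print(f"{length} bp matched: {errors} error(s) allowed")
```

With an 8 bp adapter and an error rate of 0.2, match lengths 1 to 4 allow zero errors and lengths 5 to 8 allow one, which corresponds to the "1-4 bp: 0; 5-8 bp: 1" line in the log. With --overlap 8, only the full-length case is ever used.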
Another suggestion (just for readability, will not change results): You can write -e 1 instead of -e 0.2; then it’s a bit clearer that your intention is to allow one error over the length of the adapter.
Is it correct that your 5' adapter can be anywhere within the read? That is, do you have a variable number of nucleotides preceding the 5' adapter? If the adapter is actually at the beginning of the read, you instead need to use an "anchored 5' adapter" by preceding the sequence with a ^ (please see the documentation).
Hi Mark,
Thank you for your quick response! It was really helpful, and I now have a much better understanding of how to interpret the "No. of allowed errors" section in the log file, which allows me to move forward with my analysis.
Apologies for the screenshots—I’ll make sure to post the actual text next time.
Regarding the position of my adapters, they are not at the very beginning of the reads, as a few nucleotides (N) were added at the start of each sequence to facilitate reading the Illumina signals.
I also appreciate the tip on making demultiplexing run faster—I’ll definitely look into that!
Best, Karoline