cutadapt icon indicating copy to clipboard operation
cutadapt copied to clipboard

Amplicon primer trimming and Demultiplexing

Open HuGang-c opened this issue 8 months ago • 4 comments

Hi,

I have amplicon sequencing paired end data which has barcodes on each end of a paired read.

My primer is like:

N{8}(8 base barcodes)_Fwd_primer_sequence N{8}(8 base barcodes)_Rev_primer_sequence

My barcodes is anchored on the start of each read. However, many barcodes in my data is shorter than expected (8 base). Thus, I want to use 4 base before the primer sequence instead, I have checked the barcodes sequence and 4 base is enough for distinguish all the barcodes I have.

I want to know if there is a way for cutadapt to use my primer sequence as a identifier and keep the primer sequence, 4 base before the primer sequence (the barcodes), all base after the primer sequence.

If it possible, I can use the output to do demultiplexing with cutapapt using anchored 4 base barcodes.

Or maybe there is a more elegent way to do this.

Thanks in advance.

HuGang-c avatar Apr 22 '25 02:04 HuGang-c

You should be able to use a non-internal 5' adapter instead of an anchored 5' adapter, and then require that the minimum overlap is 4. For example, if your barcode is AACCGGTT, you can specify it as -g 'XAACCGGTT;o=4'. Then partial occurrences with at least four bases will also be found.

marcelm avatar Apr 28 '25 13:04 marcelm

Hi,

Thanks for your suggestion.

I have tried as you suggested.

However, I found that the reads were not demutiplexed as expected.

I have paired-end reads with combinatorial dual indexes.

I first tried this command:

cutadapt -e 1 --no-indels -g 'file:barcodes_fwd_8base_inter.fasta;o=5' -G 'file:barcodes_rev_8base_inter.fasta;o=5' -o {name1}-{name2}.1.fastq.gz -p {name1}-{name2}.2.fastq.gz A_1.fq.gz A_2.fq.gz -j 8 --action=none`

Part of my barcodes_fwd_8_base_inter.fasta is:

>1
XGAGTAGTA
>2
XTTCGCTTG
>3
XTATCTTTG
>4
XTTCAGCTA

Part of my barcodes_rev_8_base_inter.fasta is:

>A
XTACTTATG
>B
XCGTTGCGA
>C
XTCACGTCT
>D
XACGCTTGC

I then tried this command:

cutadapt -e 1 --no-indels -g 'XGAGTAGTA;o=5' -G 'XTACTTATG;o=5' -o s1-A.1.fastq.gz -p s1-A.2.fastq.gz A_1.fq.gz A_2.fq.gz -j 8 --action=none`

Here I used the '1' fwd barcode and 'A' rev barcode to test if the before command is running as expected.

I found the 1-A.1.fastq.gz output by the first command is the same as s1-A.1.fastq.gz

So I think the first command was working.

However, I found I got many reads (around 90%) in unknown-unknown.fastq.gz

I then test with my primer sequence:

cutadapt -e 1 --no-indels -g 'XTGGAGTGAGTACGGTGTGC' -G 'XTGAGTTGGATGCTGGATGG' -o withprimer.1.fastq.gz -p withprimer.2.fastq.gz A_1.fq.gz A_2.fq.gz -j 16 --action=none --untrimmed-output untrim.fastq1.gz --untrimmed-paired-output untrim.fastq2.gz

I am confused with the output.

First, some reads in withprimer.1.fastq.gz do not have any primer sequence, like:

@LH00215:393:22T3HJLT4:2:1101:44524:4195 1:N:0:CGCTCATT+AGGATAGG
GTACGGTGTGCGCATCATCCTCATTCTCATCAACAACAACCACAACCACAGCCGCAACATGAGTATGGCATGCCGGATCTTAGACAGTTTATCACCAATCGACCTTCTCATCATTTTCCTGGTATCCCACAAGCAGCCGCGGCGGAATTGT

Second, some reads contain the primer sequence are in untrim.fastq1.gz , like:

@LH00215:393:22T3HJLT4:2:1101:37947:1126 1:N:0:CGCTCATT+AGGATAGG GCTTTATCTTTGTGGAGTGAGTACGGTGTGCGAAGGGCCATAAGAAGAAGAAAGTTGGGCGGTGAATCCGGCGGTGATGATGCTTATGTCAAAGCCATCCAGCATCCAACTCATCTTCTGCAC

The pair for this reads in untrim.fastq2.gz also contain the primer sequence, is:

@LH00215:393:22T3HJLT4:2:1101:37947:1126 2:N:0:CGCTCATT+AGGATAGG GTGCAGAAGATGAGTTGGATGCTGGATGGCTTTGACATAAGCATCATCACCGCCGGATTCACCGCCCAACTTTCTTCTTCTTATGGCCCTTCGCACACCGTACTCACTCCACAAAGATAAAGC

I read from the document, and think the reads I list here in untrim file should be classed into: somethingelseADAPTERmysequence

I think it should be processed and write in withpimer file, but it is not, what happened?

I may got misunderstanding for the usage/command, hope you can help.

Thanks

HuGang-c avatar Apr 29 '25 02:04 HuGang-c

However, I found I got many reads (around 90%) in unknown-unknown.fastq.gz

Ok, maybe I misunderstood: In your first message, you said "many barcodes in my data is shorter than expected (8 base)." In which way are they shorter than expected? I was assuming that they are missing bases in the beginning, but maybe you meant that bases are missing at the end of the barcode? Do you have an example?

If bases are missing at the end of the barcode, you can just use a shortened barcode. So instead of searching for -g ^GAGTAGTA, you use -g ^GAGTA (you said your barcodes are anchored, which is why I added the ^).

I then test with my primer sequence:

cutadapt -e 1 --no-indels -g 'XTGGAGTGAGTACGGTGTGC' -G 'XTGAGTTGGATGCTGGATGG' -o withprimer.1.fastq.gz -p withprimer.2.fastq.gz A_1.fq.gz A_2.fq.gz -j 16 --action=none --untrimmed-output untrim.fastq1.gz --untrimmed-paired-output untrim.fastq2.gz

I am confused with the output.

You should probably not use the X for your primer sequence. With the X, only these types of occurrences are found:

ADAPTERmysequence
PTERmysequence

That is:

  • No bases are allowed before the adapter/primer.
  • Partial occurrences are allowed.

First, some reads in withprimer.1.fastq.gz do not have any primer sequence, like:

@LH00215:393:22T3HJLT4:2:1101:44524:4195 1:N:0:CGCTCATT+AGGATAGG
GTACGGTGTGCGCATCATCCTCATTCTCATCAACAACAACCACAACCACAGCCGCAACATGAGTATGGCATGCCGGATCTTAGACAGTTTATCACCAATCGACCTTCTCATCATTTTCCTGGTATCCCACAAGCAGCCGCGGCGGAATTGT

There is a partial primer in the beginning:

read:           GTACGGTGTGCGCATCATCCTCATTC...
                |||||||||||
primer: TGGAGTGAGTACGGTGTGC

Second, some reads contain the primer sequence are in untrim.fastq1.gz , like:

@LH00215:393:22T3HJLT4:2:1101:37947:1126 1:N:0:CGCTCATT+AGGATAGG GCTTTATCTTTGTGGAGTGAGTACGGTGTGCGAAGGGCCATAAGAAGAAGAAAGTTGGGCGGTGAATCCGGCGGTGATGATGCTTATGTCAAAGCCATCCAGCATCCAACTCATCTTCTGCAC

Yes, this is not found because there are bases before the primer.

I read from the document, and think the reads I list here in untrim file should be classed into: somethingelseADAPTERmysequence

If you want to find this type of occurrence, use a regular 5' adapter (without the X). If you want to disallow partial occurrences, use a minimum overlap. So for example -g 'TGGAGTGAGTACGGTGTGC;o=19'

marcelm avatar Apr 29 '25 06:04 marcelm

Hi,

Thanks for the details, it is very clear and helpful, I think I now understand how it work.

I confirmed the read structure again, it is expected as:

read1: gctt_barcodes_Fwdprimer_mysequence read2: ctgt_barcodes_Revprimer_mysequence

So I first run: cutadapt -g '^GCTT' -G '^CTGT' -o cut_read1.fastq.gz -p cut_read2.fastq.gz A_1.fq.gz A_2.fq.gz

I found that around 80% of my reads were not trimmed at all, but I think this is fine for the demultiplexing.

After remove the GCTT and CTGT sequence, I assumed that the reads should have anchored barcodes.

I checked the output and found that many of my reads contain shortened barcodes.

So, In which way are they shorter than expected?

The barcodes are shortened in the beginning. Like GAGTAGTA become TAGTA, and the amounts of shortened bases is not same for each read. I also confirmed that these reads is my sequence, as they contain my primer sequence. So for demultiplexing, I want to use the last 4 bases of my barcodes which is also enough for distinguish the reads.

I tried to use some new commands for cutting and demultiplexing after reading your detailed guidance , but still have many unknown-unknown reads, I think maybe I still not understand all the information.

Do you have any idea about what should I do for cutting and demultiplexing?

HuGang-c avatar Apr 30 '25 06:04 HuGang-c