Demultiplexing and adapter/barcode trimming from PacBio amplicon reads
Hi there!
I'm new to bioinformatic analysis of PacBio amplicon data, and while searching for tools for demultiplexing and adapter/primer trimming, cutadapt was suggested to me.
But since we are talking about three different actions, I would like some guidance on how to perform them properly. The HiFi reads are oriented as follows: Barcode1-Adapter-FW_Primer-Sequence-RW_Primer-Adapter'-Barcode1'. Adapters and barcodes differ between the two ends; the barcodes depend on the sample, while the adapters are specific to the FW_ and RW_ primers.
My question is the following: Would it be possible to perform the demultiplexing step with cutadapt by including both the adapter and the barcode as a single "ADAPTER" in one step, or is this not good practice for some reason? Or would it be better to first demultiplex the samples based on the barcode and then remove adapters and primers in consecutive steps?
It's also important to note that, as emphasized in https://github.com/marcelm/cutadapt/issues/627, not all reads are oriented the same way, so I should somehow account for that as well.
I have already started demultiplexing the sequences with PacBio's lima tool, but now I'm wondering whether it would be best to perform all three steps with the same software.
If you could give me some suggestions on the best way to handle the data in my case, I would really appreciate it.
Thank you very much in advance!
My question is the following: Would it be possible to perform the demultiplexing step with cutadapt by including both the adapter and the barcode as a single "ADAPTER" in one step, or is this not good practice for some reason?
This is possible and I would probably do it that way myself.
Or would it be better to first demultiplex the samples based on the barcode and then remove adapters and primers in consecutive steps?
There is a small advantage to doing this separately: you can set the maximum error rate individually per barcode/adapter/primer. If you combine the adapter and barcode into one, the maximum error rate applies to the whole thing, and you cannot control where the errors are allowed to be; they could all end up in the barcode. For example, if you have a barcode of length 20 and a primer of length 30 and process them individually with a maximum error rate of 0.1, this allows at most 2 errors in the barcode and 3 in the primer. If you combine them and also use a maximum error rate of 0.1, you can have at most 5 errors in total, and in principle all 5 could fall within the barcode. How big of an issue this is depends on the actual lengths of the barcode and primer and on the sequencing error rate. I would assume it is not a problem.
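To make this concrete, here is a sketch of the separate, two-step variant with made-up sequences (with real data you would give one -g NAME=SEQUENCE per sample, or -g file:barcodes.fasta; the {name} placeholder and the NAME=SEQUENCE syntax are standard cutadapt features):

# Step 1: demultiplex on the 20-nt barcode alone; at -e 0.1 the up to
# 2 allowed errors can only fall inside the barcode. {name} makes
# cutadapt write one output file per named adapter.
cutadapt -g "HT19=^AACCGGTTAACCGGTTAACC" -e 0.1 \
    -o demux-{name}.fastq input.fastq

# Step 2: trim the 30-nt adapter+primer with its own error budget
# (up to 3 errors at -e 0.1), independent of the barcode.
cutadapt -g ACGTACGTACGTACGTACGTACGTACGTAC -e 0.1 \
    --discard-untrimmed -o trimmed-HT19.fastq demux-HT19.fastq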
@marcelm Thanks a lot for your answer, much appreciated! I'll try to combine both the adapter and the barcode in one ADAPTER then and see how it works.
For trimming the primers, though, I think it might be better to process them separately in a second round, since they include degenerate bases, which I suspect interferes with the error-rate calculation. Or does it? What are your thoughts on this? Thank you in advance!
Hm, I think I'd try this as well, maybe decreasing the allowed error rate a little if the combined length gets large. What do you mean by "messes up the error rate"? Degenerate bases are handled by Cutadapt (you can have IUPAC wildcards in the adapters, and they only match the bases they represent). N wildcards are ignored when computing the error rate of a match (so if ACNGT matches ATCGT, the error rate is 1/4, not 1/5). (The other wildcards, which represent 2 or 3 nucleotides, are counted, which is maybe not fully accurate.)
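For example, a quick toy check on the command line (made-up read and adapter; --no-indels just keeps the counting simple):

# ACNGT vs the read prefix ATCGT is 1 mismatch; the N is not counted,
# so the error rate is 1/4 = 0.25 and the match is accepted at -e 0.3.
# The read is trimmed to AAAACCCCGGGG.
printf '@r1\nATCGTAAAACCCCGGGG\n+\nIIIIIIIIIIIIIIIII\n' | \
    cutadapt -g ACNGT -e 0.3 --no-indels -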
Hi,
First of all, thank you @marcelm for your reply! In the end I decided to do the demultiplexing with PacBio's lima tool and then continue with trimming using cutadapt with the following settings (in each sequence, the first part is the linking sequence, which I called ADAPTER in the opening post, and the part containing the degenerate bases is the primer):
cutadapt \
    -g CTCTCTATGGGCAGTCCGCTSTTYMTNGAYAGYCAG \
    -a CTCGTGTCTCCGACTCAKRTGCANSGCRTGGCAGAA \
    --discard-untrimmed \
    -e 0.1 \
    -m 50 \
    -j 8 \
    -o trimmed_output.fastq input.fastq
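Since not all reads are oriented the same way (the issue I linked in the opening post), my understanding is that cutadapt's --revcomp option would also try the reverse complement of each read and keep the better-matching orientation, so I assume the command could simply be extended like this (untested):

cutadapt \
    -g CTCTCTATGGGCAGTCCGCTSTTYMTNGAYAGYCAG \
    -a CTCGTGTCTCCGACTCAKRTGCANSGCRTGGCAGAA \
    --revcomp \
    --discard-untrimmed \
    -e 0.1 \
    -m 50 \
    -j 8 \
    -o trimmed_output.fastq input.fastq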
To say a bit more about the project: we amplified ~720 bp amplicons, which were sequenced on PacBio, and we received cleaned HiFi reads in BAM format. I converted the BAM to FASTQ and then used lima to demultiplex. After running cutadapt with the above settings, I obtained the results below (just for one sample out of 120, to give a picture). I'm attaching the trimmed and the "raw" sequences for that sample: HT1--HT19.zip.
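For completeness, the preprocessing looked roughly like this (a sketch from memory; barcodes.fasta is my barcode file, and the exact lima options should be checked against lima --help):

# Convert the HiFi BAM to FASTQ (PacBio's bam2fastq would also work)
samtools fastq hifi_reads.bam > hifi_reads.fastq

# Demultiplex with lima; --same because each read carries the same
# barcode on both ends; the per-sample outputs are then named like
# nosZ1_hifi_demux.HT1--HT19.fastq
lima hifi_reads.fastq barcodes.fasta nosZ1_hifi_demux.fastq --same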
Do I understand the report correctly that a large proportion of the reads is getting heavily trimmed at some point (e.g. the Adapter 2 row "1480 26 0.0 3 26")? Going over the explanation of the report in the user guide, I would assume this means a sequence 1480 nucleotides long was trimmed away from 26 reads; is this the correct way to think about it? My assumption is that this is probably an artefact of PacBio sequencing, which uses circular consensus sequencing (CCS), and that the same amplicon sequence (~720 nucleotides) is, for example, repeated twice in the same read; maybe the HiFi reads were not cleaned well enough.
Anyway, my second question is: since the final trimmed sequences span 187 to 2171 nucleotides, I should probably do some quality/length filtering afterwards. I'm again assuming the longer trimmed reads come from CCS, and even though they are not numerous, biological information could be lost by simply discarding them. So does cutadapt recognize when the same adapter is repeated multiple times in one read, and could it be used to trim those longer reads, e.g. the 2171-nucleotide ones, where the target amplicon sequence is perhaps repeated three times?
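From the manual I gather that the -n/--times option makes cutadapt repeat the adapter search within each read; would something like the following (an untested sketch of my current command) be a reasonable way to handle such concatemer-like reads?

cutadapt \
    -g CTCTCTATGGGCAGTCCGCTSTTYMTNGAYAGYCAG \
    -a CTCGTGTCTCCGACTCAKRTGCANSGCRTGGCAGAA \
    -n 3 \
    --discard-untrimmed \
    -e 0.1 \
    -m 50 \
    -j 8 \
    -o trimmed_output.fastq input.fastq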
Thank you very much in advance!
The results from cutadapt:
[2025-08-16 21:14:39] Processing nosZ1_hifi_demux.HT1--HT19 ...
This is cutadapt 3.7 with Python 3.6.8
Command line parameters: -g CTCTCTATGGGCAGTCCGCTSTTYMTNGAYAGYCAG -a CTCGTGTCTCCGACTCAKRTGCANSGCRTGGCAGAA --discard-untrimmed -e 0.1 -m 50 -j 8 -o /data1/home/tonig/data/SOMMIT_nosZ/trimmed/nosZ1/nosZ1_hifi_demux.HT1--HT19_trimmed.fastq /data1/home/tonig/data/SOMMIT_nosZ/demultiplexed/nosZ1/nosZ1_hifi_demux.HT1--HT19.fastq
Processing reads on 8 cores in single-end mode ...
Finished in 0.28 s (25 us/read; 2.42 M reads/minute).
=== Summary ===
Total reads processed: 11,480
Reads with adapters: 11,454 (99.8%)

== Read fate breakdown ==
Reads that were too short: 5,826 (50.7%)
Reads discarded as untrimmed: 26 (0.2%)
Reads written (passing filters): 5,628 (49.0%)

Total basepairs processed: 8,445,470 bp
Total written (filtered): 3,933,745 bp (46.6%)
=== Adapter 1 ===
Sequence: CTCTCTATGGGCAGTCCGCTSTTYMTNGAYAGYCAG; Type: regular 5'; Length: 36; Trimmed: 5624 times
Minimum overlap: 3
No. of allowed errors: 1-9 bp: 0; 10-19 bp: 1; 20-29 bp: 2; 30-35 bp: 3
Overview of removed sequences
length count expect max.err error counts
3      1     179.4  0       1
4      1     44.8   0       1
34     11    0.0    3       2 0 8 1
35     122   0.0    3       0 120 2
36     5133  0.0    3       5022 77 34
37     258   0.0    3       18 230 5 5
38     73    0.0    3       4 4 62 3
39     17    0.0    3       0 1 1 15
40     1     0.0    3       1
44     1     0.0    3       0 1
46     1     0.0    3       1
783    1     0.0    3       1
786    1     0.0    3       1
803    1     0.0    3       1
816    1     0.0    3       1
819    1     0.0    3       1
=== Adapter 2 ===
Sequence: CTCGTGTCTCCGACTCAKRTGCANSGCRTGGCAGAA; Type: regular 3'; Length: 36; Trimmed: 5830 times
Minimum overlap: 3
No. of allowed errors: 1-9 bp: 0; 10-19 bp: 1; 20-29 bp: 2; 30-35 bp: 3
Bases preceding removed adapters: A: 0.1% C: 0.2% G: 0.1% T: 0.2% none/other: 99.5%
Overview of removed sequences
length count expect max.err error counts
243    1     0.0    3       1
256    1     0.0    3       1
258    1     0.0    3       1
268    1     0.0    3       1
269    3     0.0    3       3
272    2     0.0    3       2
275    1     0.0    3       1
277    3     0.0    3       3
281    1     0.0    3       0 1
283    1     0.0    3       1
287    1     0.0    3       0 1
288    5     0.0    3       5
297    2     0.0    3       2
299    1     0.0    3       1
301    3     0.0    3       3
309    1     0.0    3       1
310    1     0.0    3       1
311    1     0.0    3       1
312    2     0.0    3       2
313    1     0.0    3       1
317    1     0.0    3       1
318    3     0.0    3       3
320    1     0.0    3       1
321    1     0.0    3       1
323    2     0.0    3       2
325    1     0.0    3       1
326    2     0.0    3       2
336    1     0.0    3       1
347    1     0.0    3       1
348    1     0.0    3       1
352    5     0.0    3       5
353    3     0.0    3       3
354    2     0.0    3       2
355    1     0.0    3       1
356    1     0.0    3       1
357    2     0.0    3       2
359    1     0.0    3       1
380    1     0.0    3       1
384    2     0.0    3       2
387    2     0.0    3       2
391    1     0.0    3       1
395    1     0.0    3       1
396    2     0.0    3       2
401    3     0.0    3       3
402    2     0.0    3       2
406    3     0.0    3       2 1
408    5     0.0    3       5
412    1     0.0    3       1
413    2     0.0    3       2
417    3     0.0    3       3
....................................................
688    1     0.0    3       1
689    4     0.0    3       3 1
690    36    0.0    3       36
691    1     0.0    3       0 1
695    1     0.0    3       0 1
700    1     0.0    3       0 1
701    2     0.0    3       2
702    67    0.0    3       66 1
703    8     0.0    3       7 1
704    7     0.0    3       5 2
705    203   0.0    3       197 5 1
706    15    0.0    3       14 1
707    14    0.0    3       13 1
708    14    0.0    3       11 3
709    1     0.0    3       1
711    3     0.0    3       2 1
712    2     0.0    3       1 1
714    1     0.0    3       1
715    1     0.0    3       1
716    1     0.0    3       1
717    2     0.0    3       1 1
718    2     0.0    3       1 0 0 1
719    2     0.0    3       2
720    6     0.0    3       5 1
721    1     0.0    3       1
722    2     0.0    3       1 1
723    5     0.0    3       4 1
724    5     0.0    3       5
725    4     0.0    3       3 1
726    11    0.0    3       10 1
727    7     0.0    3       5 2
728    18    0.0    3       14 3 1
729    31    0.0    3       26 5
730    48    0.0    3       40 6 2
731    240   0.0    3       205 34 1
732    3361  0.0    3       3296 55 2 8
733    381   0.0    3       332 44 4 1
734    181   0.0    3       150 23 7 1
735    140   0.0    3       106 29 3 2
736    74    0.0    3       51 19 4
737    63    0.0    3       46 12 3 2
738    47    0.0    3       38 9
739    56    0.0    3       32 18 5 1
740    39    0.0    3       23 13 3
741    35    0.0    3       22 9 4
742    24    0.0    3       16 7 1
743    30    0.0    3       18 8 4
744    29    0.0    3       17 6 5 1
745    26    0.0    3       17 8 0 1
746    14    0.0    3       6 5 3
747    14    0.0    3       8 6
748    20    0.0    3       14 4 2
749    17    0.0    3       9 7 1
750    9     0.0    3       5 4
751    11    0.0    3       5 4 1 1
752    9     0.0    3       6 1 2
753    15    0.0    3       9 1 4 1
754    3     0.0    3       2 1
755    15    0.0    3       12 3
756    9     0.0    3       5 2 1 1
757    4     0.0    3       2 1 1
758    5     0.0    3       4 1
759    5     0.0    3       1 4
760    6     0.0    3       4 0 2
761    5     0.0    3       3 2
762    5     0.0    3       3 2
763    7     0.0    3       5 1 1
764    4     0.0    3       0 3 1
765    3     0.0    3       1 1 1
766    5     0.0    3       4 1
....................................................
891    9     0.0    3       9
892    1     0.0    3       1
893    3     0.0    3       3
894    1     0.0    3       1
901    2     0.0    3       2
902    1     0.0    3       1
917    1     0.0    3       1
931    2     0.0    3       2
936    2     0.0    3       2
938    1     0.0    3       1
959    1     0.0    3       1
978    1     0.0    3       1
983    1     0.0    3       1
991    1     0.0    3       1
996    1     0.0    3       1
1002   1     0.0    3       1
1004   1     0.0    3       1
1012   3     0.0    3       3
1016   1     0.0    3       1
1019   1     0.0    3       1
1041   3     0.0    3       3
1046   1     0.0    3       1
1047   1     0.0    3       1
1054   1     0.0    3       1
1072   1     0.0    3       1
1102   1     0.0    3       1
1110   1     0.0    3       1
1112   1     0.0    3       1
1113   1     0.0    3       1
1126   1     0.0    3       1
1135   1     0.0    3       1
1142   1     0.0    3       1
1143   1     0.0    3       1
1173   1     0.0    3       1
1174   1     0.0    3       1
1184   1     0.0    3       1
1191   3     0.0    3       3
1194   2     0.0    3       2
1202   1     0.0    3       1
1230   1     0.0    3       1
1239   1     0.0    3       1
1292   1     0.0    3       1
1374   1     0.0    3       1
1426   2     0.0    3       2
1438   4     0.0    3       4
1439   1     0.0    3       1
1450   1     0.0    3       1
1451   1     0.0    3       1
1453   2     0.0    3       2
1473   1     0.0    3       1
1476   1     0.0    3       1
1477   1     0.0    3       1
1479   4     0.0    3       4
1480   26    0.0    3       26
1481   3     0.0    3       3
1482   1     0.0    3       1
1487   1     0.0    3       1
1488   1     0.0    3       1
1489   1     0.0    3       1
1498   2     0.0    3       2
1508   2     0.0    3       2
1520   2     0.0    3       2
1568   1     0.0    3       1
1652   1     0.0    3       1
1759   1     0.0    3       1
1839   1     0.0    3       1
1876   1     0.0    3       1
1980   1     0.0    3       1
2006   1     0.0    3       1
2196   1     0.0    3       1