cutadapt
Finding the correct files after demultiplexing my 16S amplicon raw dataset with the cutadapt combinatorial dual-index command
- Cutadapt version used: 4.8
- The method you used to install the tool (conda or pip, for example): Used mamba
- The command-line parameters you used for demultiplexing paired-end reads with combinatorial dual indexes cutadapt command:
cutadapt \
-e 0.15 --no-indels \
-g ^file:barcodes_fwd.fasta \
-G ^file:barcodes_rev.fasta \
-o {name1}-{name2}.1.fastq.gz -p {name1}-{name2}.2.fastq.gz \
input.1.fastq.gz input.2.fastq.gz
Here is some of the output generated by the command above:
LIB1_sample_189-LIB1_sample_183.1.fastq.gz
LIB1_sample_189-LIB1_sample_183.2.fastq.gz
LIB1_sample_189-LIB1_sample_184.1.fastq.gz
LIB1_sample_189-LIB1_sample_184.2.fastq.gz
LIB1_sample_189-LIB1_sample_185.1.fastq.gz
LIB1_sample_189-LIB1_sample_185.2.fastq.gz
LIB1_sample_189-LIB1_sample_186.1.fastq.gz
LIB1_sample_189-LIB1_sample_186.2.fastq.gz
LIB1_sample_189-LIB1_sample_187.1.fastq.gz
LIB1_sample_189-LIB1_sample_187.2.fastq.gz
LIB1_sample_189-LIB1_sample_188.1.fastq.gz
LIB1_sample_189-LIB1_sample_188.2.fastq.gz
LIB1_sample_189-LIB1_sample_189.1.fastq.gz
LIB1_sample_189-LIB1_sample_189.2.fastq.gz
LIB1_sample_189-LIB1_sample_190.1.fastq.gz
LIB1_sample_189-LIB1_sample_190.2.fastq.gz
- What is the correct combination of fastq.gz files for further analysis? I have a total of 230 samples.
- How can I pick those right files from the combination pool folder?
How many barcodes are in barcodes_fwd.fasta and barcodes_rev.fasta, respectively? Are you sure you have combinatorial dual indices? According to Illumina’s documentation, you can have up to 96 samples, so I wonder how the 230 samples fit in there.
Thanks for your question. It was 160 (Set B: 96, and the rest in Set C).
Kit used: NextFlex Rapid XP kit
Forward primer: GTGCCAGCMGCCGCGGTAA
Reverse primer: GGACTACHVGGGTWTCTAAT
Two pairs of files have been received:
sample1_1_L001_R1_001.fastq sample1_1_L001_R2_001.fastq
sample2_1_L001_R1_001.fastq sample2_1_L001_R2_001.fastq
head -n 4 sample2_1_L001_R1_001.fastq
@A00783:1516:HWGV2DRX3:1:2101:5891:5055 1:N:0:CGCTGCTC+GATCTGCC
GTGCCAGCCGCCGCGGTAATACATAGGATGCAAGCGTTATCCGGATTTACTGGGCGTAAAGCGAGCTCAGGCGGATTTACAAGTCTGATGTTAAAGACAACTGCTTAACGGTTGTTTGCATTGGAAACTGTAAGTCTAGAGTATAGTAGAGAGTTTTGGAACTCCATGTGGAGCGGTGGAATGCGTAGATATATGGAAGAACACCAGAGGCGAAGGCGAAAACTTAGGCTATAACTGACGCT
+A00783:1516:HWGV2DRX3:1:2101:5891:5055 1:N:0:CGCTGCTC+GATCTGCC
,,:FFF,F,:FF:F:F:FFFF:F,F,FF:,FFF:F,,,,:FFF,F,,:FF::F:,:FFFF:F:F:F,FF::F:FF:,FFFFF,FF,,F,F,,FFFFFFFFF::FF:FF:FFF:FF:FF,,:,:FF:FF:::FF:FF,F:FFF,:F,F:FFFFF:F,:F,FFFFFFF:FFFFF:FFFFFFFF:F:FFFFFF:FFFFFFFFFFFFFFFFF::FFFFF:FFFFFFF:FFF,FFFFFFFFFFFFFF
How many barcodes are in barcodes_fwd.fasta and barcodes_rev.fasta, respectively?
Two 96-well barcode plates, B and C; see the example below:
In Plate B, the barcode combinations are as follows:
"LIB1" "LIB1_sample_001" "AACAAGCC:GGAATGAG" "GTGCCAGCMGCCGCGGTAA" "GGACTACHVGGGTWTCTAAT" "F @ position=01_01A"
"LIB1" "LIB1_sample_002" "TTACCGCT:CCAGTATG" "GTGCCAGCMGCCGCGGTAA" "GGACTACHVGGGTWTCTAAT" "F @ position=01_01B"
"LIB1" "LIB1_sample_003" "TTACGCCA:TCATAGCG" "GTGCCAGCMGCCGCGGTAA" "GGACTACHVGGGTWTCTAAT" "F @ position=01_01C"
"LIB1" "LIB1_sample_004" "GTGATCTC:ACTTGGCT" "GTGCCAGCMGCCGCGGTAA" "GGACTACHVGGGTWTCTAAT" "F @ position=01_01D"
"LIB1" "LIB1_sample_005" "GGATAGCA:CAGTAGAC" "GTGCCAGCMGCCGCGGTAA" "GGACTACHVGGGTWTCTAAT" "F @ position=01_01E"
"LIB1" "LIB1_sample_006" "AACCTCAG:GGATGATC" "GTGCCAGCMGCCGCGGTAA" "GGACTACHVGGGTWTCTAAT" "F @ position=01_01F"
"LIB1" "LIB1_sample_007" "CAACCTCA:GGATTCGA" "GTGCCAGCMGCCGCGGTAA" "GGACTACHVGGGTWTCTAAT" "F @ position=01_01G"
"LIB1" "LIB1_sample_008" "CGCCAATT:TAGCAAGG" "GTGCCAGCMGCCGCGGTAA" "GGACTACHVGGGTWTCTAAT" "F @ position=01_01H"
"LIB1" "LIB1_sample_009" "GGAATGAG:AATTGCCG" "GTGCCAGCMGCCGCGGTAA" "GGACTACHVGGGTWTCTAAT" "F @ position=01_02A"
"LIB1" "LIB1_sample_010" "CCAGTATG:TGAGATGC" "GTGCCAGCMGCCGCGGTAA" "GGACTACHVGGGTWTCTAAT" "F @ position=01_02B"
"LIB1" "LIB1_sample_011" "TCATAGCG:TGAGGACA" "GTGCCAGCMGCCGCGGTAA" "GGACTACHVGGGTWTCTAAT" "F @ position=01_02C"
In Plate C, the barcode combinations are as follows:
"LIB1" "LIB1_sample_001" "AACAAGCC:AATTGCCG" "GTGCCAGCMGCCGCGGTAA" "GGACTACHVGGGTWTCTAAT" "F @ position=01_01A"
"LIB1" "LIB1_sample_002" "TTACCGCT:TGAGATGC" "GTGCCAGCMGCCGCGGTAA" "GGACTACHVGGGTWTCTAAT" "F @ position=01_01B"
"LIB1" "LIB1_sample_003" "TTACGCCA:TGAGGACA" "GTGCCAGCMGCCGCGGTAA" "GGACTACHVGGGTWTCTAAT" "F @ position=01_01C"
"LIB1" "LIB1_sample_004" "GTGATCTC:CGATACAC" "GTGCCAGCMGCCGCGGTAA" "GGACTACHVGGGTWTCTAAT" "F @ position=01_01D"
"LIB1" "LIB1_sample_005" "GGATAGCA:TAGCCACT" "GTGCCAGCMGCCGCGGTAA" "GGACTACHVGGGTWTCTAAT" "F @ position=01_01E"
"LIB1" "LIB1_sample_006" "AACCTCAG:TATCTGGC" "GTGCCAGCMGCCGCGGTAA" "GGACTACHVGGGTWTCTAAT" "F @ position=01_01F"
"LIB1" "LIB1_sample_007" "CAACCTCA:TTAGGCAC" "GTGCCAGCMGCCGCGGTAA" "GGACTACHVGGGTWTCTAAT" "F @ position=01_01G"
"LIB1" "LIB1_sample_008" "CGCCAATT:GTCACAGA" "GTGCCAGCMGCCGCGGTAA" "GGACTACHVGGGTWTCTAAT" "F @ position=01_01H"
"LIB1" "LIB1_sample_009" "GGAATGAG:AACAAGCC" "GTGCCAGCMGCCGCGGTAA" "GGACTACHVGGGTWTCTAAT" "F @ position=01_02A"
"LIB1" "LIB1_sample_010" "CCAGTATG:TTACCGCT" "GTGCCAGCMGCCGCGGTAA" "GGACTACHVGGGTWTCTAAT" "F @ position=01_02B"
"LIB1" "LIB1_sample_011" "TCATAGCG:TTACGCCA" "GTGCCAGCMGCCGCGGTAA" "GGACTACHVGGGTWTCTAAT" "F @ position=01_02C"
"LIB1" "LIB1_sample_012" "ACTTGGCT:GTGATCTC" "GTGCCAGCMGCCGCGGTAA" "GGACTACHVGGGTWTCTAAT" "F @ position=01_02D"
"LIB1" "LIB1_sample_013" "CAGTAGAC:GGATAGCA" "GTGCCAGCMGCCGCGGTAA" "GGACTACHVGGGTWTCTAAT" "F @ position=01_02E"
"LIB1" "LIB1_sample_014" "GGATGATC:AACCTCAG" "GTGCCAGCMGCCGCGGTAA" "GGACTACHVGGGTWTCTAAT" "F @ position=01_02F"
"LIB1" "LIB1_sample_015" "GGATTCGA:CAACCTCA" "GTGCCAGCMGCCGCGGTAA" "GGACTACHVGGGTWTCTAAT" "F @ position=01_02G"
I combined plates B and C and created barcodes_fwd.fasta and barcodes_rev.fasta files for each sample number. This means that plate B starts with LIB1_sample_001 and ends with LIB1_sample_096, while plate C starts with LIB1_sample_097 and ends with LIB1_sample_160.
barcodes_fwd.fasta
>LIB1_sample_001
AACAAGCC
>LIB1_sample_002
TTACCGCT
>LIB1_sample_003
TTACGCCA
>LIB1_sample_004
GTGATCTC
>LIB1_sample_005
GGATAGCA
>LIB1_sample_006
AACCTCAG
>LIB1_sample_007
CAACCTCA
...................................
>LIB1_sample_160
barcodes_rev.fasta
>LIB1_sample_001
GGAATGAG
>LIB1_sample_002
CCAGTATG
>LIB1_sample_003
TCATAGCG
>LIB1_sample_004
ACTTGGCT
>LIB1_sample_005
CAGTAGAC
>LIB1_sample_006
GGATGATC
>LIB1_sample_007
GGATTCGA
...............................
>LIB1_sample_160
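For reference, paired barcode files like these can be generated from a sample table so that both files are guaranteed to stay in the same order. A minimal Python sketch, using the first few Plate B entries from above as example data (the output file names match the ones used in the cutadapt command):

```python
# Sketch: build paired barcode FASTA files (one entry per sample,
# same order in both files) from a sample -> "fwd:rev" barcode table.
# Example data: the first three Plate B entries listed above.

samples = [
    ("LIB1_sample_001", "AACAAGCC:GGAATGAG"),
    ("LIB1_sample_002", "TTACCGCT:CCAGTATG"),
    ("LIB1_sample_003", "TTACGCCA:TCATAGCG"),
]

with open("barcodes_fwd.fasta", "w") as fwd, open("barcodes_rev.fasta", "w") as rev:
    for name, pair in samples:
        f, r = pair.split(":")
        fwd.write(f">{name}\n{f}\n")  # forward barcode under the sample name
        rev.write(f">{name}\n{r}\n")  # reverse barcode under the same name
```

Keeping both files generated from one table avoids the fwd/rev entries drifting out of sync when samples are added or removed.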
I still don’t have a clear picture of how your data is structured. I believe you need to understand this yourself before you can proceed. In particular, you need to figure out where the index sequences are.
First, to be explicit: There is a difference between unique dual indexing and combinatorial indexing. Unless you have reliable information that combinatorial indexing was used, it is more likely that unique dual indexing was done.
kit used: NextFlex Rapid XP kit
I am not familiar with it, but the manual for version 2 of that kit talks about Unique Dual Indices:
In addition, the availability of up to 1,536 different Unique Dual Index adapter barcodes facilitates high-throughput applications.
According to the same manual, the UDIs need to be bought separately, so it would still be possible to use combinatorial indexing, but that would be against Illumina’s own advice:
Illumina recommends using unique dual indexing (UDI) whenever possible for the most accurate demultiplexing.
Your first read looks like this:
head -n 4 sample2_1_L001_R1_001.fastq
@A00783:1516:HWGV2DRX3:1:2101:5891:5055 1:N:0:CGCTGCTC+GATCTGCC
GTGCCAGCCGCCGCGGTAATACATAGGATGCAAGCGTTATCCGGATTTACTGGGCGTAAAGCGAGCTCAGGCGGATTTACAAGTCTGATGTTAAAGACAACTGCTTAACGGTTGTTTGCATTGGAAACTGTAAGTCTAGAGTATAGTAGAGAGTTTTGGAACTCCATGTGGAGCGGTGGAATGCGTAGATATATGGAAGAACACCAGAGGCGAAGGCGAAAACTTAGGCTATAACTGACGCT
...
- The first bases are GTGCCAGCCGCCGCGGTAA, which matches your forward primer. That means the indices are not at the beginning of R1, and you therefore cannot use -g ^file:... to demultiplex, because that option is for providing "anchored 5' adapters", which must be at the 5' end of the read.
- The CGCTGCTC+GATCTGCC in the FASTQ header are index sequences. This information is added by the Illumina software that produced the FASTQ file. This means that the software has already done demultiplexing. (It is not impossible that two levels of multiplexing were done, though.)
- When I BLAST the read sequence above, it yields a 100% match against 16S rRNA; that is, there is no index sequence at all in the read.
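You can check this yourself by counting how many reads begin with the forward primer. A rough Python sketch, assuming plain (uncompressed) FASTQ input; the IUPAC table below only covers the ambiguity codes appearing in these primers (M, H, V, W):

```python
import re

# Partial IUPAC -> regex mapping; extend if your primer uses other codes.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "M": "[AC]", "H": "[ACT]", "V": "[ACG]", "W": "[AT]"}

def primer_regex(primer):
    """Compile a primer with IUPAC ambiguity codes into a regex."""
    return re.compile("".join(IUPAC[b] for b in primer))

def count_primer_starts(fastq_path, primer):
    """Count reads whose sequence line starts with the given primer."""
    pat = primer_regex(primer)
    total = hits = 0
    with open(fastq_path) as fh:
        for i, line in enumerate(fh):
            if i % 4 == 1:  # every 2nd of each 4-line record is the sequence
                total += 1
                if pat.match(line):
                    hits += 1
    return hits, total
```

If nearly all reads start with the primer, there is no barcode left at the 5' end; if most reads start with 8 extra bases before the primer, the barcodes are still in the reads.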
Two files have been received. sample1_1_L001_R1_001.fastq sample1_1_L001_R2_001.fastq sample2_1_L001_R1_001.fastq sample2_1_L001_R2_001.fastq
Do all reads in the first file contain CGCTGCTC+GATCTGCC? What about the second file?
Hello there
Do all reads in the first file contain CGCTGCTC+GATCTGCC? What about the second file?

Yes.
Checking LIB1_L2_1.fq for sequence CGCTGCTC+GATCTGCC...
Number of occurrences in LIB1_1_L2_1.fq: 23013460
Total number of reads in LIB1_L2_1.fq: 23013460
Checking LIB1_L2_2.fq for sequence CGCTGCTC+GATCTGCC...
Number of occurrences in LIB1_L2_2.fq: 23013460
Total number of reads in LIB1_L2_2.fq: 23013460
Checking LIB1_L1_1.fq for sequence CGCTGCTC+GATCTGCC...
Number of occurrences in LIB1_L1_1.fq: 24176149
Total number of reads in LIB1_L1_1.fq: 24176149
Checking LIB1_L1_2.fq for sequence CGCTGCTC+GATCTGCC...
Number of occurrences in LIB1_L1_2.fq: 24176149
Total number of reads in LIB1_L1_2.fq: 24176149
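A check like the one above can be scripted in a few lines. A minimal Python sketch (assuming uncompressed FASTQ; it counts header lines containing a given index pair, which is where Illumina writes the indices):

```python
# Sketch: count how many reads in a FASTQ file carry a given index pair
# in their header line (e.g. "... 1:N:0:CGCTGCTC+GATCTGCC").

def count_index_pair(fastq_path, index_pair):
    """Return (reads whose header contains index_pair, total reads)."""
    total = hits = 0
    with open(fastq_path) as fh:
        for i, line in enumerate(fh):
            if i % 4 == 0:  # every 4th line starting at 0 is a header
                total += 1
                if index_pair in line:
                    hits += 1
    return hits, total
```

If hits equals the total for every file, as in the output above, all reads were assigned the same index pair, i.e. the files are single-index-pair libraries.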
I had given a bad example; here is the right one, from L001_1.fastq.
In this string 'TTACCGCTGTGCCAGCAGCCGCGGTAA' from L001_1, the first eight nucleotides are the barcode TTACCGCT, and the rest is the forward primer.
@A00783:1516:HWGV2DRX3:1:2145:3992:12289 1:N:0:CGCTGCTC+GATCTGCC
TTACCGCTGTGCCAGCAGCCGCGGTAACACATAGGATGCAAGCGTTATCCGGATTTACTGGGCGTAAAGCGAGCGCAGGCGGATTTACAAGTCTGATGTTAAAGACAACTGCTTAACGGTTGTTTGCATTGGAAACTGTAAGTCTAGAGTATAGTAGAGAGTTTTGGAACTCCATGTGGAGCGGTGGAATGCGTAGATATATGGAAGAACACCAGAGGCGAAGGCGAAAACTTAGGCTATAACTGACGCT
+
FFFFFF:FFFF:FFFFFFFFFF:FFFFFFFF:FFFFFFFFFFFFFFFFFF:FFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFF:FFFFFFFF:FFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFF:FFFFFFF:FFFF,FFFFFFFFFF:FFFFF::FFFFF:FFFF:FFFFFF,,FFF,FFFFFF:FF:FFF:FFF:FFFFFFFFFFF:FF:FFFFFFFFF
@A00783:1516:HWGV2DRX3:1:2146:4408:26099 1:N:0:CGCTGCTC+GATCTGCC
GCCTTCGCCTCTGGTGTTCTTCCATATATCTACGCATTCCACCGCTCCACATGGAGTTCCAAAACTCTCTACTATACTCTAGACTTACAGTTTCCAATGCAAACAACCGTTAAGCAGTTGTCTTTAACATCAGACTTGTAAATCCGCCTGCGCCCGCTTTACGCCCAGTAAATCCGGATAACGCTTGCATCCTATGTGTTACCGCTGTGCCAGCAGCCGCGGTAACAGATCGGAAGAGCACACGTCTGAA
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,FF:,FF:FFFFFFFFFFF
@A00783:1516:HWGV2DRX3:1:2147:20889:26741 1:N:0:CGCTGCTC+GATCTGCC
ATTACCGCTGTGCCAGCAGCCGCGGTAATACATAGGATGCAAGCGTTATCCGGATTTACTGGGCGTAAAGCGAGCGCAGGCGGATTTACAAGTCTGATGTTAAAGACAACTGCTTAACGGTTGTTTGCATTGGAAACTGTAAGTCTAGAGTATAGTAGAGAGTTTTGGAACTCCATGTGGAGCGGTGGAATGCGTAGAGATATGTAAGTCCCCCGGAGGCGAACGAGACACCTGAGGCGATCTCTGACGC
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF::FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,FFFFFFFFFF:FFFFFFFFF,FFFFFFFFFFFF,FF::F,F,:,,:,,F,:,,F,FF,,,,FF,F,F,,:F:,,F,,,F,,F,FF
@A00783:1516:HWGV2DRX3:1:2151:12608:26271 1:N:0:CGCTGCTC+GATCTGCC
AGCCTAAGTTTTCGCCTTCGCCTCTGGTGTTCTTCCATATATCTACGCATTCCACCGCTCCACATGGAGTTCCAAAACTCTCTACTATACTCTAGACTTACAGTTTCCAATGCAAACAACCGTTAAGCAGTTGTCTTTAACATCAGACTTGTAAATCCGCCTGCGCTCGCTTTACGCCCAGTAAATCCGGATAACGCTTGCATCCTATGTATTACCGCTGTGCCAGCAGCCGCGGTAAAGATCGGAAGAG
L001_1.fastq
In this string 'TTACCGCTGGACTACAGGGGTATCTAAT' of L001_1, the first eight nucleotides are the barcode TTACCGCT, and the rest is the reverse primer.
@A00783:1516:HWGV2DRX3:1:2116:7536:34084 1:N:0:CGCTGCTC+GATCTGCC
TTACCGCTGGACTACAGGGGTATCTAATCCTGTTTGCTCCCCACGCTTTCGCACCTGAGCGTCAGTCTTCGTCCAGGGGGCCGCCTTCGCCACCGGTATTCCTCCAGATCTCTACGCATTTCACCGCTACACCTGGAATTCTACCCCCCTCTACGAGACTCAAGCTTGCCAGTATCAGATGCAGTTCCCAGGTTGAGCCCGGGGATTTCACATCTGACTTAACAAACCGCCTGCGTGCGCTTTACGCCCA
+
,FF:FFF:FFFFFFFF:F::FFFFFFF::F:F:FFFFFFFF:FFFF:FFFF:FFFFFF:FFFFF:FFF:FFF:FFFFFFFFFF:FFFFFFFFFFFF:FFFFF:FFFFFFFFF:FFFFFF:FFFFFFFFFFFFFFFFFFFFFF:,FFFFFFF:FFFFFFFFFFF::F,,F:FF,FFFF:FFF:FF::FF:F,FF::F::FF:,FFFF:::FFF:FF,FFFF:FFF:FF,F:F:,,FFFFF:FFFF,,FFFF
@A00783:1516:HWGV2DRX3:1:2117:22815:14325 1:N:0:CGCTGCTC+GATCTGCC
CTTGTTACCGCTGGACTACAGGGGTATCTAATCCTGTTTGCTACCCACGCTTTCGAGCCTCAGTGTCAGTATGATGCCAGGAAGCTGCCTTCGCCATCGGTATTCCTTCAGATCTCTACGCATTTCACCGCTACACCTGAAATTCTACTTCCCTCTCACCTACTCTAGCCTAACAGTTTCAGATGCAGTTCCCAGGTTAAGCCCGGGGATTTCACATCTGACTTATCAAGCCACCTACGCTCGCTTTACG
+
FFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,FFFFFFFFFFFFF
@A00783:1516:HWGV2DRX3:1:2121:20509:32659 1:N:0:CGCTGCTC+GATCTGCC
CACCTTACCGCTGGACTACAGGGGTATCTAATCCTGTTTGCTCCCCACACTTTCGCACCTCAGCGTCAGTATCGAGCCAGTGAGCCGCCTTCGCCACTGGTGTTCCTCCGAATATCTACGAATTTCACCTCTACACTCGGAATTCCACTCACCTCTCTCGAACTCAAGACCAGGAGTTTACAAGGCAGTTCCAGGGTTGAGCCCTGGGATTTCACCCTATACTTTCTGATCCGCCTACGTGCGCTTTACG
What is the correct combination of fastq.gz files for further analysis? I have a total of 230 samples. How can I pick those right files from the combination pool folder?
Do you know which 230 barcode combinations are possible? Then you can look for only those, in the following way:
- Create a file forward.fasta and one reverse.fasta. The first lists all 230 forward barcodes and the other the 230 corresponding reverse barcodes. That is, the first barcode in forward.fasta and the first in reverse.fasta are the barcodes that are used for the first sample, the second from both are the barcodes for the second sample, etc.
- Use the --pair-adapters option (see the documentation) and use {name} in the file name template.
- Cutadapt will produce 230 files, one for each sample.
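Put together, the invocation could look roughly like this. This is a sketch, not a definitive command: the error rate and input file names are carried over from the original command above, and forward.fasta/reverse.fasta are the paired barcode files just described.

```shell
# Sketch: demultiplex with paired barcodes. --pair-adapters ties the
# i-th entry of forward.fasta to the i-th entry of reverse.fasta, so
# {name} resolves to one sample name per barcode pair.
cutadapt \
    -e 0.15 --no-indels \
    --pair-adapters \
    -g ^file:forward.fasta \
    -G ^file:reverse.fasta \
    -o '{name}.1.fastq.gz' -p '{name}.2.fastq.gz' \
    input.1.fastq.gz input.2.fastq.gz
```

Note that the anchored ^file: form only works here because your corrected read examples show the barcodes sitting at the very 5' end of the reads.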
If you do not know which combinations are possible, just use the 230 biggest files. That may not be entirely correct, but it may be good enough.
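If you go the biggest-files route, keep each sample's .1 and .2 files together rather than ranking the files individually. A small Python sketch that ranks samples by the combined size of their read pair, assuming the {name1}-{name2}.1.fastq.gz / .2.fastq.gz naming produced by the command above:

```python
from pathlib import Path

def largest_sample_pairs(directory, n):
    """Rank samples by combined size of their .1/.2 files; return top n names."""
    sizes = {}
    for p in Path(directory).glob("*.1.fastq.gz"):
        mate = p.with_name(p.name.replace(".1.fastq.gz", ".2.fastq.gz"))
        stem = p.name[: -len(".1.fastq.gz")]  # sample name without suffix
        sizes[stem] = p.stat().st_size + (mate.stat().st_size if mate.exists() else 0)
    return sorted(sizes, key=sizes.get, reverse=True)[:n]
```

For example, `largest_sample_pairs("demux_output", 230)` would return the 230 sample names with the most data; the corresponding .1/.2 file pairs are then the ones to keep.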