cutadapt icon indicating copy to clipboard operation
cutadapt copied to clipboard

Ensuring to find the correct file after Demultiplexing my 16S amplicon raw dataset with combinatorial dual indexes cutadapt command

Open JayalalKJ opened this issue 10 months ago • 5 comments

  • Cutadapt and Python version used: 4.8
  • The method you used to install the tool (conda or pip, for example): Used mamba
  • The command-line parameters you used for demultiplexing paired-end reads with combinatorial dual indexes cutadapt command:
cutadapt \
    -e 0.15 --no-indels \
    -g ^file:barcodes_fwd.fasta \
    -G ^file:barcodes_rev.fasta \
    -o {name1}-{name2}.1.fastq.gz -p {name1}-{name2}.2.fastq.gz \
    input.1.fastq.gz input.2.fastq.gz

Here are some of the output generated from the code above:

LIB1_sample_189-LIB1_sample_183.1.fastq.gz
LIB1_sample_189-LIB1_sample_183.2.fastq.gz
LIB1_sample_189-LIB1_sample_184.1.fastq.gz
LIB1_sample_189-LIB1_sample_184.2.fastq.gz
LIB1_sample_189-LIB1_sample_185.1.fastq.gz
LIB1_sample_189-LIB1_sample_185.2.fastq.gz
LIB1_sample_189-LIB1_sample_186.1.fastq.gz
LIB1_sample_189-LIB1_sample_186.2.fastq.gz
LIB1_sample_189-LIB1_sample_187.1.fastq.gz
LIB1_sample_189-LIB1_sample_187.2.fastq.gz
LIB1_sample_189-LIB1_sample_188.1.fastq.gz
LIB1_sample_189-LIB1_sample_188.2.fastq.gz
LIB1_sample_189-LIB1_sample_189.1.fastq.gz
LIB1_sample_189-LIB1_sample_189.2.fastq.gz
LIB1_sample_189-LIB1_sample_190.1.fastq.gz
LIB1_sample_189-LIB1_sample_190.2.fastq.gz
  1. What is the correct combination of fastq.gz files for further analysis? I have a total of 230 samples.
  2. How can I pick those right files from the combination pool folder?

JayalalKJ avatar Apr 16 '24 21:04 JayalalKJ

How many barcodes are in barcodes_fwd.fasta and barcodes_rev.fasta, respectively? Are you sure you have combinatorial dual indices? According to Illumina’s documentation, you can have up to 96 samples, so I wonder how the 230 samples fit in there.

marcelm avatar Apr 16 '24 22:04 marcelm

Thanks for your question. It was 160 (set B- 96 and the rest in Set C)

kit used: NextFlex Rapid XP kit

Forward primer: GTGCCAGCMGCCGCGGTAA Reverse primer: GGACTACHVGGGTWTCTAAT

Two files have been received. sample1_1_L001_R1_001.fastq sample1_1_L001_R2_001.fastq
sample2_1_L001_R1_001.fastq sample2_1_L001_R2_001.fastq

head -n 4 sample2_1_L001_R1_001.fastq @A00783:1516:HWGV2DRX3:1:2101:5891:5055 1:N:0:CGCTGCTC+GATCTGCC GTGCCAGCCGCCGCGGTAATACATAGGATGCAAGCGTTATCCGGATTTACTGGGCGTAAAGCGAGCTCAGGCGGATTTACAAGTCTGATGTTAAAGACAACTGCTTAACGGTTGTTTGCATTGGAAACTGTAAGTCTAGAGTATAGTAGAGAGTTTTGGAACTCCATGTGGAGCGGTGGAATGCGTAGATATATGGAAGAACACCAGAGGCGAAGGCGAAAACTTAGGCTATAACTGACGCT +A00783:1516:HWGV2DRX3:1:2101:5891:5055 1:N:0:CGCTGCTC+GATCTGCC ,,:FFF,F,:FF:F:F:FFFF:F,F,FF:,FFF:F,,,,:FFF,F,,:FF::F:,:FFFF:F:F:F,FF::F:FF:,FFFFF,FF,,F,F,,FFFFFFFFF::FF:FF:FFF:FF:FF,,:,:FF:FF:::FF:FF,F:FFF,:F,F:FFFFF:F,:F,FFFFFFF:FFFFF:FFFFFFFF:F:FFFFFF:FFFFFFFFFFFFFFFFF::FFFFF:FFFFFFF:FFF,FFFFFFFFFFFFFF

How many barcodes are in barcodes_fwd.fasta and barcodes_rev.fasta, respectively?

Two 96 plates: B, and C barcode , example see below:

In Plate B, the barcode combination is as follows: "LIB1" "LIB1_sample_001" "AACAAGCC:GGAATGAG" "GTGCCAGCMGCCGCGGTAA" "GGACTACHVGGGTWTCTAAT" "F @ position=01_01A" "LIB1" "LIB1_sample_002" "TTACCGCT:CCAGTATG" "GTGCCAGCMGCCGCGGTAA" "GGACTACHVGGGTWTCTAAT" "F @ position=01_01B" "LIB1" "LIB1_sample_003" "TTACGCCA:TCATAGCG" "GTGCCAGCMGCCGCGGTAA" "GGACTACHVGGGTWTCTAAT" "F @ position=01_01C" "LIB1" "LIB1_sample_004" "GTGATCTC:ACTTGGCT" "GTGCCAGCMGCCGCGGTAA" "GGACTACHVGGGTWTCTAAT" "F @ position=01_01D" "LIB1" "LIB1_sample_005" "GGATAGCA:CAGTAGAC" "GTGCCAGCMGCCGCGGTAA" "GGACTACHVGGGTWTCTAAT" "F @ position=01_01E" "LIB1" "LIB1_sample_006" "AACCTCAG:GGATGATC" "GTGCCAGCMGCCGCGGTAA" "GGACTACHVGGGTWTCTAAT" "F @ position=01_01F" "LIB1" "LIB1_sample_007" "CAACCTCA:GGATTCGA" "GTGCCAGCMGCCGCGGTAA" "GGACTACHVGGGTWTCTAAT" "F @ position=01_01G" "LIB1" "LIB1_sample_008" "CGCCAATT:TAGCAAGG" "GTGCCAGCMGCCGCGGTAA" "GGACTACHVGGGTWTCTAAT" "F @ position=01_01H" "LIB1" "LIB1_sample_009" "GGAATGAG:AATTGCCG" "GTGCCAGCMGCCGCGGTAA" "GGACTACHVGGGTWTCTAAT" "F @ position=01_02A" "LIB1" "LIB1_sample_010" "CCAGTATG:TGAGATGC" "GTGCCAGCMGCCGCGGTAA" "GGACTACHVGGGTWTCTAAT" "F @ position=01_02B" "LIB1" "LIB1_sample_011" "TCATAGCG:TGAGGACA" "GTGCCAGCMGCCGCGGTAA" "GGACTACHVGGGTWTCTAAT" "F @ position=01_02C"

In Plate C, the barcode combination is as follows: "LIB1" "LIB1_sample_001" "AACAAGCC:AATTGCCG" "GTGCCAGCMGCCGCGGTAA" "GGACTACHVGGGTWTCTAAT" "F @ position=01_01A" "LIB1" "LIB1_sample_002" "TTACCGCT:TGAGATGC" "GTGCCAGCMGCCGCGGTAA" "GGACTACHVGGGTWTCTAAT" "F @ position=01_01B" "LIB1" "LIB1_sample_003" "TTACGCCA:TGAGGACA" "GTGCCAGCMGCCGCGGTAA" "GGACTACHVGGGTWTCTAAT" "F @ position=01_01C" "LIB1" "LIB1_sample_004" "GTGATCTC:CGATACAC" "GTGCCAGCMGCCGCGGTAA" "GGACTACHVGGGTWTCTAAT" "F @ position=01_01D" "LIB1" "LIB1_sample_005" "GGATAGCA:TAGCCACT" "GTGCCAGCMGCCGCGGTAA" "GGACTACHVGGGTWTCTAAT" "F @ position=01_01E" "LIB1" "LIB1_sample_006" "AACCTCAG:TATCTGGC" "GTGCCAGCMGCCGCGGTAA" "GGACTACHVGGGTWTCTAAT" "F @ position=01_01F" "LIB1" "LIB1_sample_007" "CAACCTCA:TTAGGCAC" "GTGCCAGCMGCCGCGGTAA" "GGACTACHVGGGTWTCTAAT" "F @ position=01_01G" "LIB1" "LIB1_sample_008" "CGCCAATT:GTCACAGA" "GTGCCAGCMGCCGCGGTAA" "GGACTACHVGGGTWTCTAAT" "F @ position=01_01H" "LIB1" "LIB1_sample_009" "GGAATGAG:AACAAGCC" "GTGCCAGCMGCCGCGGTAA" "GGACTACHVGGGTWTCTAAT" "F @ position=01_02A" "LIB1" "LIB1_sample_010" "CCAGTATG:TTACCGCT" "GTGCCAGCMGCCGCGGTAA" "GGACTACHVGGGTWTCTAAT" "F @ position=01_02B" "LIB1" "LIB1_sample_011" "TCATAGCG:TTACGCCA" "GTGCCAGCMGCCGCGGTAA" "GGACTACHVGGGTWTCTAAT" "F @ position=01_02C" "LIB1" "LIB1_sample_012" "ACTTGGCT:GTGATCTC" "GTGCCAGCMGCCGCGGTAA" "GGACTACHVGGGTWTCTAAT" "F @ position=01_02D" "LIB1" "LIB1_sample_013" "CAGTAGAC:GGATAGCA" "GTGCCAGCMGCCGCGGTAA" "GGACTACHVGGGTWTCTAAT" "F @ position=01_02E" "LIB1" "LIB1_sample_014" "GGATGATC:AACCTCAG" "GTGCCAGCMGCCGCGGTAA" "GGACTACHVGGGTWTCTAAT" "F @ position=01_02F" "LIB1" "LIB1_sample_015" "GGATTCGA:CAACCTCA" "GTGCCAGCMGCCGCGGTAA" "GGACTACHVGGGTWTCTAAT" "F @ position=01_02G"

I combined plates B and C and created barcodes_fwd.fasta and barcodes_rev.fasta files for each sample number. This means that plate B starts with LIB1_sample_001 and ends with LIB1_sample_096, while plate C starts with LIB1_sample_097 and ends with LIB1_sample_160.

barcodes_fwd.fasta

LIB1_sample_001 AACAAGCC LIB1_sample_002 TTACCGCT LIB1_sample_003 TTACGCCA LIB1_sample_004 GTGATCTC LIB1_sample_005 GGATAGCA LIB1_sample_006 AACCTCAG LIB1_sample_007 CAACCTCA ................................... LIB1_sample_160

barcodes_rev.fasta

LIB1_sample_001 GGAATGAG LIB1_sample_002 CCAGTATG LIB1_sample_003 TCATAGCG LIB1_sample_004 ACTTGGCT LIB1_sample_005 CAGTAGAC LIB1_sample_006 GGATGATC LIB1_sample_007 GGATTCGA ............................... LIB1_sample_160

JayalalKJ avatar Apr 17 '24 06:04 JayalalKJ

I still don’t have a clear picture of how your data is structured. I believe you need to understand this yourself before you can proceed. In particular, you need to figure out where the index sequences are.

First, to be explicit: There is a difference between unique dual indexing and combinatorial indexing. Unless you have reliable information that combinatorial indexing was used, it is more likely that unique dual indexing was done.

kit used: NextFlex Rapid XP kit

I am not familiar with it, but the manual for version 2 of that kit talks about Unique Dual Indices:

In addition, the availability of up to 1,536 different Unique Dual Index adapter barcodes facilitates high-throughput applications.

According to the same manual, the UDIs need to be bought separately, so it would still possible to use combinatorial indexing, but that would be against Illumina’s own advice:

Illumina recommends using unique dual indexing (UDI) whenever possible for the most accurate demultiplexing.

Your first read looks like this:

head -n 4 sample2_1_L001_R1_001.fastq

@A00783:1516:HWGV2DRX3:1:2101:5891:5055 1:N:0:CGCTGCTC+GATCTGCC
GTGCCAGCCGCCGCGGTAATACATAGGATGCAAGCGTTATCCGGATTTACTGGGCGTAAAGCGAGCTCAGGCGGATTTACAAGTCTGATGTTAAAGACAACTGCTTAACGGTTGTTTGCATTGGAAACTGTAAGTCTAGAGTATAGTAGAGAGTTTTGGAACTCCATGTGGAGCGGTGGAATGCGTAGATATATGGAAGAACACCAGAGGCGAAGGCGAAAACTTAGGCTATAACTGACGCT
...
  • The first bases are GTGCCAGCCGCCGCGGTAA, which matches your forward primer. That means the indices are not in the beginning of R1 and you therefore cannot use -g ^file:... to demultiplex because that option is for providing "anchored 5' adapters", which must be at the 5' end of the read.
  • The CGCTGCTC+GATCTGCC in the FASTQ header are index sequences. This information is added by the Illumina software that produced the FASTQ file. This means that the software has already done demultiplexing. (It is not impossible that two levels of multiplexing were done, though.)
  • When I BLAST the read sequence above, it yields a 100% match against 16S RNA, that is, there is no index sequence at all in the read.

Two files have been received. sample1_1_L001_R1_001.fastq sample1_1_L001_R2_001.fastq sample2_1_L001_R1_001.fastq sample2_1_L001_R2_001.fastq

Do all reads in the first file contain CGCTGCTC+GATCTGCC? What about the second file?

marcelm avatar Apr 18 '24 08:04 marcelm

Hello there

Do all reads in the first file contain CGCTGCTC+GATCTGCC? What about the second file? Yes.

Checking LIB1_L2_1.fq for sequence CGCTGCTC+GATCTGCC... Number of occurrences in LIB1_1_L2_1.fq: 23013460 Total number of reads in LIB1_L2_1.fq: 23013460 Checking LIB1_L2_2.fq for sequence CGCTGCTC+GATCTGCC... Number of occurrences in LIB1_L2_2.fq: 23013460 Total number of reads in LIB1_L2_2.fq: 23013460 Checking LIB1_L1_1.fq for sequence CGCTGCTC+GATCTGCC... Number of occurrences in LIB1_L1_1.fq: 24176149 Total number of reads in LIB1_L1_1.fq: 24176149 Checking LIB1_L1_2.fq for sequence CGCTGCTC+GATCTGCC... Number of occurrences in LIB1_L1_2.fq: 24176149 Total number of reads in LIB1_L1_2.fq: 24176149

I had given a bad example,: here is right one from L001_1.fastq

In this string 'TTACCGCTGTGCCAGCAGCCGCGGTAA' of of L001_1, the first eight nucleotides are barcodesTTACCGCT, and the rest is forward primer.

@A00783:1516:HWGV2DRX3:1:2145:3992:12289 1:N:0:CGCTGCTC+GATCTGCC TTACCGCTGTGCCAGCAGCCGCGGTAACACATAGGATGCAAGCGTTATCCGGATTTACTGGGCGTAAAGCGAGCGCAGGCGGATTTACAAGTCTGATGTTAAAGACAACTGCTTAACGGTTGTTTGCATTGGAAACTGTAAGTCTAGAGTATAGTAGAGAGTTTTGGAACTCCATGTGGAGCGGTGGAATGCGTAGATATATGGAAGAACACCAGAGGCGAAGGCGAAAACTTAGGCTATAACTGACGCT + FFFFFF:FFFF:FFFFFFFFFF:FFFFFFFF:FFFFFFFFFFFFFFFFFF:FFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFF:FFFFFFFF:FFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFF:FFFFFFF:FFFF,FFFFFFFFFF:FFFFF::FFFFF:FFFF:FFFFFF,,FFF,FFFFFF:FF:FFF:FFF:FFFFFFFFFFF:FF:FFFFFFFFF

@A00783:1516:HWGV2DRX3:1:2146:4408:26099 1:N:0:CGCTGCTC+GATCTGCC GCCTTCGCCTCTGGTGTTCTTCCATATATCTACGCATTCCACCGCTCCACATGGAGTTCCAAAACTCTCTACTATACTCTAGACTTACAGTTTCCAATGCAAACAACCGTTAAGCAGTTGTCTTTAACATCAGACTTGTAAATCCGCCTGCGCCCGCTTTACGCCCAGTAAATCCGGATAACGCTTGCATCCTATGTGTTACCGCTGTGCCAGCAGCCGCGGTAACAGATCGGAAGAGCACACGTCTGAA + FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,FF:,FF:FFFFFFFFFFF

@A00783:1516:HWGV2DRX3:1:2147:20889:26741 1:N:0:CGCTGCTC+GATCTGCC ATTACCGCTGTGCCAGCAGCCGCGGTAATACATAGGATGCAAGCGTTATCCGGATTTACTGGGCGTAAAGCGAGCGCAGGCGGATTTACAAGTCTGATGTTAAAGACAACTGCTTAACGGTTGTTTGCATTGGAAACTGTAAGTCTAGAGTATAGTAGAGAGTTTTGGAACTCCATGTGGAGCGGTGGAATGCGTAGAGATATGTAAGTCCCCCGGAGGCGAACGAGACACCTGAGGCGATCTCTGACGC + FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF::FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,FFFFFFFFFF:FFFFFFFFF,FFFFFFFFFFFF,FF::F,F,:,,:,,F,:,,F,FF,,,,FF,F,F,,:F:,,F,,,F,,F,FF

@A00783:1516:HWGV2DRX3:1:2151:12608:26271 1:N:0:CGCTGCTC+GATCTGCC AGCCTAAGTTTTCGCCTTCGCCTCTGGTGTTCTTCCATATATCTACGCATTCCACCGCTCCACATGGAGTTCCAAAACTCTCTACTATACTCTAGACTTACAGTTTCCAATGCAAACAACCGTTAAGCAGTTGTCTTTAACATCAGACTTGTAAATCCGCCTGCGCTCGCTTTACGCCCAGTAAATCCGGATAACGCTTGCATCCTATGTATTACCGCTGTGCCAGCAGCCGCGGTAAAGATCGGAAGAG

L001_1.fastq

In this string 'TTACCGCTGGACTACAGGGGTATCTAAT' of L001_1 , the first eight nucleotides are barcodesTTACCGCT, and the rest is reverse primer.

@A00783:1516:HWGV2DRX3:1:2116:7536:34084 1:N:0:CGCTGCTC+GATCTGCC TTACCGCTGGACTACAGGGGTATCTAATCCTGTTTGCTCCCCACGCTTTCGCACCTGAGCGTCAGTCTTCGTCCAGGGGGCCGCCTTCGCCACCGGTATTCCTCCAGATCTCTACGCATTTCACCGCTACACCTGGAATTCTACCCCCCTCTACGAGACTCAAGCTTGCCAGTATCAGATGCAGTTCCCAGGTTGAGCCCGGGGATTTCACATCTGACTTAACAAACCGCCTGCGTGCGCTTTACGCCCA + ,FF:FFF:FFFFFFFF:F::FFFFFFF::F:F:FFFFFFFF:FFFF:FFFF:FFFFFF:FFFFF:FFF:FFF:FFFFFFFFFF:FFFFFFFFFFFF:FFFFF:FFFFFFFFF:FFFFFF:FFFFFFFFFFFFFFFFFFFFFF:,FFFFFFF:FFFFFFFFFFF::F,,F:FF,FFFF:FFF:FF::FF:F,FF::F::FF:,FFFF:::FFF:FF,FFFF:FFF:FF,F:F:,,FFFFF:FFFF,,FFFF

@A00783:1516:HWGV2DRX3:1:2117:22815:14325 1:N:0:CGCTGCTC+GATCTGCC CTTGTTACCGCTGGACTACAGGGGTATCTAATCCTGTTTGCTACCCACGCTTTCGAGCCTCAGTGTCAGTATGATGCCAGGAAGCTGCCTTCGCCATCGGTATTCCTTCAGATCTCTACGCATTTCACCGCTACACCTGAAATTCTACTTCCCTCTCACCTACTCTAGCCTAACAGTTTCAGATGCAGTTCCCAGGTTAAGCCCGGGGATTTCACATCTGACTTATCAAGCCACCTACGCTCGCTTTACG + FFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,FFFFFFFFFFFFF

@A00783:1516:HWGV2DRX3:1:2121:20509:32659 1:N:0:CGCTGCTC+GATCTGCC CACCTTACCGCTGGACTACAGGGGTATCTAATCCTGTTTGCTCCCCACACTTTCGCACCTCAGCGTCAGTATCGAGCCAGTGAGCCGCCTTCGCCACTGGTGTTCCTCCGAATATCTACGAATTTCACCTCTACACTCGGAATTCCACTCACCTCTCTCGAACTCAAGACCAGGAGTTTACAAGGCAGTTCCAGGGTTGAGCCCTGGGATTTCACCCTATACTTTCTGATCCGCCTACGTGCGCTTTACG

JayalalKJ avatar Apr 19 '24 13:04 JayalalKJ

What is the correct combination of fastq.gz files for further analysis? I have a total of 230 samples. How can I pick those right files from the combination pool folder?

Do you know which 230 barcode combinations are possible? Then you look for only those in the following way:

  • Create a file forward.fasta and one reverse.fasta. The first lists all 230 forward barcodes and the other the 230 corresponding reverse barcodes. That is, the first barcode in forward.fasta and the first in reverse.fasta are the barcodes that are used for the first sample, the second from both are the barcodes for the second sample etc.
  • Use the --pair-adapters option (see the documentation) and use {name} in the file name template.
  • Cutadapt will produce 230 files, one for each sample.

If you do not know which combinations are possible, just use the 230 biggest files. That may not be entirely correct, but it may be good enough.

marcelm avatar Apr 22 '24 13:04 marcelm