dorado icon indicating copy to clipboard operation
dorado copied to clipboard

[question] duplex and barcode demultiplexing?

Open JWDebler opened this issue 1 year ago • 14 comments

Hi, I am wondering if there is currenly a way to do barcode demultiplexing and duplex calling using only dorado, and if so, what would be the best way to do it?

JWDebler avatar Dec 13 '23 01:12 JWDebler

Hi @JWDebler , duplex barcode demux is currently not implemented in Dorado. The best way to currently do it would be to demux the simplex reads and then use the duplex read IDs (which are a concatenation of simplex read IDs) to demux.

vellamike avatar Dec 13 '23 15:12 vellamike

we have this on our roadmap and we plan to implement this for a future release

tijyojwad avatar Dec 14 '23 20:12 tijyojwad

Are there updates on the timeline to release barcode demultiplexing for duplex ? It seems to be a heavily requested feature for a while now... Since the duplex reads are already in the data it's a shame to not be able to use them :/

vdruelle avatar Apr 05 '24 07:04 vdruelle

You can extract them, it's just a bit annoying as it basically means basecalling twice. At least that's what we are currently doing. Basecall in simplex mode, classify into barcodes, extract barcode reads into their own pod5 files and then duplex call those individually:

location=/path/to/raw/data/containing/folder

dorado basecaller sup -r $location --kit-name SQK-NBD114-24 > all.bam && \
dorado demux --output-dir simplex --no-classify all.bam && \
rm simplex/*unclassified* && \
for file in simplex/*.bam; do id=$(echo "$file" | grep -oP 'barcode\d+'); samtools view -h $file | cut -f 1 > simplex/$id.readids.txt; done && \
for file in simplex/*readids.txt; do id=$(echo "$file" | grep -oP 'barcode\d+'); grep -v "@" $file > simplex/$id.clean.txt; done && \
for file in simplex/*clean.txt; do id=$(echo "$file" | grep -oP 'barcode\d+'); pod5 filter -r $location -i $file -t 10 --missing-ok --duplicate-ok --output $id.pod5; done && \
for file in *.pod5; do id=$(echo "$file" | grep -oP 'barcode\d+'); dorado duplex sup $file > $id.duplex.untrimmed.bam; done && \
for file in *untrimmed.bam; do id=$(echo "$file" | grep -oP 'barcode\d+'); dorado trim $file > $id.duplex.bam && \
rm *untrimmed.bam && \
for file in *duplex.bam; do id=$(echo "$file" | grep -oP 'barcode\d+'); samtools view  -@ 8 -O fastq -d dx:1 $file | pigz -9 > $id.duplex.fastq.gz; done && \
for file in *duplex.bam; do id=$(echo "$file" | grep -oP 'barcode\d+'); samtools view  -@ 8 -O fastq -d dx:0 $file | pigz -9 > $id.simplex.fastq.gz; done 

JWDebler avatar Apr 05 '24 08:04 JWDebler

I see, thanks a lot for the details of your implementation !

vdruelle avatar Apr 05 '24 08:04 vdruelle

I created this script to demux a duplex basecalled BAM. Takes a summary file after demultiplexing simplex reads and your duplex called BAM. There’s some options for dealing with situations where the two reads that make a duplex read don’t have the same barcode in simplex mode or if one is unclassified. Also supports fastq output

mbhall88 avatar Apr 05 '24 11:04 mbhall88

Thanks for your answer and the link to your code ! Great work on the manuscript as well, it's super nice to see the variant calling metrics there (also shows that duplex is maybe not essential ^^).

vdruelle avatar Apr 05 '24 12:04 vdruelle

interesting @mbhall88. I'm a bit confused. How did you get barcodes in the duplex bam file? I thought dorado couldn't do duplex calling and barcode assignment at the same time yet, which is why we're doing it the long way around.

JWDebler avatar Apr 05 '24 12:04 JWDebler

There’s no barcodes in the duplex bam. But that script uses the assignment from the sequencing summary file from the simplex basecalling to assign barcodes for the duplex reads. I also found cases when testing that there were sometime simplex reads that are part of a duplex read that have different barcode assignments (normally it was just that one was unclassified and the other was classified) so I added options for how you want to deal with this

mbhall88 avatar Apr 06 '24 02:04 mbhall88

Ahhh, ok, that makes sense. Do you do simplex live calling during the run then? We always use post run basecalling, and I don't think that produces the summary file.

JWDebler avatar Apr 06 '24 02:04 JWDebler

No I generally basecall after the fact also. But you can create a summary file with https://github.com/nanoporetech/dorado?tab=readme-ov-file#sequencing-summary

the demux command also has an —emit-summary option which will place a summary file in the output directory (see https://github.com/nanoporetech/dorado?tab=readme-ov-file#classifying-existing-datasets)

mbhall88 avatar Apr 06 '24 03:04 mbhall88

Thanks for the details, I'll give it a try ! Looking at your script it does not seem super complicated to demultiplex duplex reads. I'm a bit puzzled why there is no native way of doing it with Dorado...

vdruelle avatar Apr 08 '24 07:04 vdruelle

This is my current implementation, maybe useful for somebody ... Assuming you have seqkit and samtools installed in the respective conda environments you can run this bash script after simplex basecalling from the nanopore output directory (the one that contains the fastq_pass and pod5 folder)

dorado_duplex_bact.sh

This uses the bacterial research model for the basecalling, just change this to "sup" or "hac" in line 53 if you want to use the standard models

hagen-wende avatar May 24 '24 10:05 hagen-wende

You can extract them, it's just a bit annoying as it basically means basecalling twice. At least that's what we are currently doing. Basecall in simplex mode, classify into barcodes, extract barcode reads into their own pod5 files and then duplex call those individually:

location=/path/to/raw/data/containing/folder

dorado basecaller sup -r $location --kit-name SQK-NBD114-24 > all.bam && \

Just to clarify, when running doradobase caller would --no-trim need to be added for the subsequent demultiplex step to work? my understanding is that otherwise the barcodes are removed and dorado demux won't work?

luckybillion avatar Jul 23 '24 20:07 luckybillion