
Document the relationship between ont-guppy-duplex-pipeline and duplex_tools

Open tbooth opened this issue 2 years ago • 3 comments

The current advice from ONT regarding how to perform duplex basecalling is here:

https://community.nanoporetech.com/posts/guppy-v6-0-0-release (dated 6th December 2021 - login required to view)

It makes no mention of duplex-tools, but says to pip install ont-guppy-duplex-pipeline and then run the script from that package, guppy_duplex, on the original fast5 files.

As far as I can see, this script is a rather clunky wrapper that calls guppy in simplex mode, then performs the equivalent of duplex_tools pairs_from_summary (the code for this is in ont_guppy_duplex_pipeline/channel_neighbours.py and looks like it's related to your duplex_tools/pairs_from_summary.py but the logic is not quite the same) and then runs guppy_basecaller_duplex to get the final result.

My main interest just now is to get a good but quick assessment of the approximate number of duplex reads in each dataset, for QC purposes, and so duplex-tools seems the more useful approach. But to save others from having to peer through source code like I've been doing, could you please add some info to the README.md explaining the relationship between these two ONT-developed packages?

Cheers!

tbooth, Aug 26 '22 13:08

Sorry, my mistake - I see ont-guppy-duplex-pipeline does also incorporate an alignment-based filtering step, but it does not yield the same results as this package. I get about twice the number of candidate duplex pairs. I guess I'll need to actually basecall these to see how many are false positives.

tbooth, Aug 26 '22 16:08

Hi @tbooth,

The scripts in the current version of Guppy were taken from an earlier version of this repository, hence the similarities. Guppy needs updating; IIRC, the major difference is the compute performance. @ollenordesjo can comment on the output differences.

cjw85, Aug 26 '22 16:08

Hi @tbooth, sorry, have been on vacation, just seeing this now.

The option that most affects the output is the min_qscore filter (https://github.com/nanoporetech/duplex-tools/blob/master/duplex_tools/pairs_from_summary.py#L320). Without re-running the same data through both ont-guppy-duplex-pipeline and duplex_tools and checking carefully, that would be my first guess as to why the number of candidate duplex pairs differs.

Depending on your requirements, you may want to set this threshold lower than the default (I would suggest including the best ~85% of reads or something similar, wherever that threshold may be for your dataset). We had some discussions about setting this threshold more adaptively, but decided that a constant threshold would keep it more reproducible on a per-read level.
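
For anyone wanting to find where that ~85% cutoff falls for their own data, here is a rough sketch in plain Python. It is not part of duplex_tools: the helper names `read_qscores` and `qscore_threshold` are made up for illustration, and it assumes a tab-separated sequencing summary with a per-read `mean_qscore_template` column, so adjust the column name if your summary differs.

```python
import csv
import math


def read_qscores(handle, column="mean_qscore_template"):
    """Read per-read mean q-scores from a tab-separated summary.

    The column name is an assumption; check the header of your own file.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    return [float(row[column]) for row in reader]


def qscore_threshold(qscores, keep_fraction=0.85):
    """Return the lowest q-score among the best `keep_fraction` of reads.

    Filtering with "q-score >= returned value" keeps roughly that
    fraction of the reads (more if there are ties at the cutoff).
    """
    if not qscores:
        raise ValueError("no q-scores supplied")
    if not 0 < keep_fraction <= 1:
        raise ValueError("keep_fraction must be in (0, 1]")
    ranked = sorted(qscores, reverse=True)  # best reads first
    n_keep = max(1, math.ceil(keep_fraction * len(ranked)))
    return ranked[n_keep - 1]


if __name__ == "__main__":
    import sys

    # Usage: python qscore_cutoff.py sequencing_summary.txt
    with open(sys.argv[1], newline="") as handle:
        scores = read_qscores(handle)
    print(f"{len(scores)} reads, 85% cutoff: {qscore_threshold(scores):.2f}")
```

The printed value could then be used as the min_qscore setting mentioned above. Because it is computed once per dataset and then held fixed, the filter remains a constant threshold for every read in that run.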

onordesjo, Sep 05 '22 09:09