Improving results for nanopore

Open jsgounot opened this issue 2 years ago • 1 comments

Hi. I am exploring the possibility of using sourmash to identify the origin of isolates from nanopore data. Each sample is supposed to contain only one species. I know that ONT reads are not ideal for a k-mer approach but, as reported in this tracker, at least one paper has used them this way. I tried gather with and without trimming (trim-low-abund.py -C 3 -Z 18 -V -M 2e9, even though it's not really appropriate here), and while the best hit seems concordant with what is expected, f_orig_query is very low for both raw (mean=2%) and trimmed (mean=5%) data. Have you explored other sourmash or khmer parameters to improve results with nanopore reads?
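For anyone wanting to compare runs the same way, here is a minimal sketch of how the best-hit f_orig_query could be pulled out of gather CSV output (as written by `sourmash gather ... -o sample.csv`) and averaged across samples. The sample names and values are invented for illustration; only the `name` and `f_orig_query` column names are taken from gather's CSV output, and the inline CSV strings stand in for real files.

```python
import csv
import io
from statistics import mean


def best_hit(csv_text):
    """Return (name, f_orig_query) for the hit covering most of the query.

    Returns None when gather reported no matches.
    """
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if not rows:
        return None
    top = max(rows, key=lambda row: float(row["f_orig_query"]))
    return top["name"], float(top["f_orig_query"])


# Hypothetical gather outputs for two samples (all values are invented).
EXAMPLE_CSVS = {
    "sample_raw": "name,f_orig_query\nspecies_A,0.02\nspecies_B,0.001\n",
    "sample_trimmed": "name,f_orig_query\nspecies_A,0.05\n",
}

hits = {sample: best_hit(text) for sample, text in EXAMPLE_CSVS.items()}
mean_fraction = mean(fraction for _, fraction in hits.values())

for sample, (name, fraction) in hits.items():
    print(f"{sample}: best hit {name}, f_orig_query={fraction:.1%}")
print(f"mean f_orig_query of best hits: {mean_fraction:.1%}")
```

With real data, each CSV string would be replaced by the contents of one gather output file.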

jsgounot avatar Aug 23 '22 09:08 jsgounot

this came across lab slack today -

https://labs.epi2me.io/progressive-kraken2/

Luiz said:

Granularity is different (reads, not contigs/genomes), but would be fun to try
with sourmash (maybe with a s=100 db it would work with reads too?)
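A rough back-of-the-envelope sketch of why a smaller scaled value might matter for single reads: with FracMinHash, roughly 1 in `scaled` k-mers is kept, so a read only contributes about (L - k + 1) / scaled hashes. The read length and k-mer size below are assumed values for illustration, not numbers from the linked post, and the estimate ignores repeated k-mers.

```python
def expected_hashes(read_length, ksize, scaled):
    """Approximate number of hashes a single read contributes to a sketch.

    Assumes every k-mer in the read is distinct and is retained
    independently with probability 1/scaled.
    """
    n_kmers = max(read_length - ksize + 1, 0)
    return n_kmers / scaled


# Hypothetical long read: 10,030 bp, k=31 (so 10,000 k-mers).
for scaled in (1000, 100):
    n = expected_hashes(read_length=10_030, ksize=31, scaled=scaled)
    print(f"scaled={scaled}: about {n:.0f} hashes per read")
```

Under these assumptions a read yields only a handful of hashes at scaled=1000, which is little to match against, and about ten times more at scaled=100.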

ctb avatar Sep 22 '22 11:09 ctb

hi @jsgounot this paper systematically confirms that ONT messes up sourmash -

Evaluation of taxonomic classification and profiling methods for long-read shotgun metagenomic sequencing datasets: https://www.biorxiv.org/content/10.1101/2022.01.31.478527v2

See Fig 3 in particular; screenshot:

[Screenshot of Fig 3 from the preprint]

It seems pretty clear that the error profile for nanopore is terrible for sourmash :(.

@dportik, @bluegenes and I are thinking of doing a bit more exploring, but we have no simple solution to offer. thoughts welcome!
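For intuition on why read errors hurt exact k-mer matching, here is a small sketch under a deliberately simplified model: errors are assumed to be independent per base, so a k-mer is error-free with probability (1 - e)^k. The error rates and k-mer sizes below are illustrative assumptions, not measurements from the preprint, and real nanopore errors are not independent.

```python
def error_free_kmer_fraction(error_rate, ksize):
    """Expected fraction of k-mers with no errors.

    Simplified model: each base is wrong independently with
    probability `error_rate`.
    """
    return (1.0 - error_rate) ** ksize


# Illustrative per-base error rates and k-mer sizes (assumed values).
for error_rate in (0.01, 0.05, 0.10):
    for ksize in (21, 31, 51):
        frac = error_free_kmer_fraction(error_rate, ksize)
        print(f"error={error_rate:.0%} k={ksize}: {frac:.1%} of k-mers error-free")
```

Under this model the surviving fraction drops quickly as either the error rate or k grows, which is at least consistent with the low f_orig_query values reported above.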

ctb avatar Nov 16 '22 14:11 ctb

(see https://github.com/sourmash-bio/sourmash/issues/2360 for some discussion of thresholding that is not entirely irrelevant ;)

ctb avatar Nov 16 '22 14:11 ctb

hi @ctb, thank you for sharing this. It looks like MEGAN-LR is the right approach for this kind of data at the moment; do you share the same conclusion?

jsgounot avatar Nov 17 '22 01:11 jsgounot

that's my reading as well but @bluegenes @dportik should weigh in!

ctb avatar Nov 17 '22 02:11 ctb

Hi @jsgounot - as @ctb mentioned the error profile of ONT appears to negatively affect sourmash's performance (at least for now).

There are two good options for ONT. We found BugSeq actually had the best performance; it is highly tuned to ONT. However, it is a cloud-based analysis and you have to sign up for it. If you are looking for a DIY option, I would recommend the DIAMOND & MEGAN-LR approach. That pipeline is available as a snakemake workflow at https://github.com/PacificBiosciences/pb-metagenomics-tools. If you choose to build an independent pipeline for this, just be aware that there are some landmines involved in getting the DIAMOND outputs into MEGAN.

dportik avatar Nov 17 '22 22:11 dportik