
use cases

Open ctb opened this issue 7 years ago • 9 comments

This issue can serve as a placeholder for use cases for sourmash/MinHash more generally.

Stuff we already have implemented:

  • basic MinHash comparisons etc
  • metagenome taxonomy breakdown
  • streaming sequence classification
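
For readers new to the idea, "basic MinHash comparisons" can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not sourmash's implementation: the SHA-1 based `hash_kmer`, the tiny k-mer size, and the sketch size `num` are all stand-ins chosen for the example.

```python
import hashlib


def kmers(seq, k):
    """Yield every overlapping k-mer of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def hash_kmer(kmer):
    """Map a k-mer to a 64-bit integer (stand-in hash, assumed for this example)."""
    return int.from_bytes(hashlib.sha1(kmer.encode()).digest()[:8], "big")


def sketch(seq, k=5, num=50):
    """Bottom-k MinHash sketch: the num smallest distinct k-mer hashes."""
    return sorted({hash_kmer(km) for km in kmers(seq, k)})[:num]


def jaccard_estimate(a, b, num=50):
    """Estimate Jaccard similarity from two bottom-k sketches."""
    # Take the num smallest hashes of the union, then count how many
    # of those are present in both sketches.
    merged = sorted(set(a) | set(b))[:num]
    if not merged:
        return 0.0
    shared = set(a) & set(b)
    return sum(1 for h in merged if h in shared) / len(merged)
```

Comparing a sequence's sketch with itself gives 1.0, and two sequences with no k-mers in common give 0.0; real genomes land somewhere in between.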

Off-label and emerging use cases:

  • examining genomic contamination
  • comparing and validating different binning approaches
  • analysis of unknowns / hashes as stable identifiers
  • 16S etc. clustering

please add more here - we're in danger of forgetting all the great ideas we come up with ;)

ctb avatar May 09 '17 11:05 ctb

tetramer nucleotide clustering

basic kmer searching (--scaled 1)
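
A rough sketch of why `--scaled 1` amounts to plain k-mer search: a scaled sketch keeps only hashes below a threshold of roughly (hash space) / scaled, so with scaled=1 nothing is discarded. The code below is a simplified illustration, not sourmash's code; the SHA-1 based `hash_kmer` is an assumption, and it ignores reverse complements.

```python
import hashlib

MAX_HASH = 2**64  # size of the 64-bit hash space used in this example


def hash_kmer(kmer):
    """Map a k-mer to a 64-bit integer (stand-in hash, assumed for this example)."""
    return int.from_bytes(hashlib.sha1(kmer.encode()).digest()[:8], "big")


def scaled_sketch(seq, k, scaled):
    """Keep only k-mer hashes that fall below MAX_HASH // scaled."""
    threshold = MAX_HASH // scaled
    hashes = (hash_kmer(seq[i:i + k]) for i in range(len(seq) - k + 1))
    return {h for h in hashes if h < threshold}


def contains_kmer(sketch, kmer):
    """Exact k-mer lookup; only reliable when the sketch was built with scaled=1."""
    return hash_kmer(kmer) in sketch
```

With scaled=1 every distinct k-mer hash is retained, so `contains_kmer` behaves like exact membership; with larger values the sketch is a subset of that full set and lookups can miss k-mers that are present.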

ctb avatar May 09 '17 13:05 ctb

contamination detection

ctb avatar May 11 '17 13:05 ctb

  • speed up genome-scale search and membership analysis of arbitrarily large WGS metagenomes by 1,000- to 1,000,000-fold.
  • cluster metagenome WGS data sets by similarity on very large scales
  • classify strain variants (as in the above blog post) very quickly
  • index public and private collections of metagenomes and genomes on the scale of ~100k+ to make them publicly and privately searchable.
  • identify known genomes in metagenomes very quickly
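
As a toy illustration of the "identify known genomes in metagenomes" item, one common approach is containment: the fraction of a genome's hashes that also appear in the metagenome. The sketch below assumes the hash sets have already been computed; `find_known_genomes` and its `threshold` parameter are hypothetical names for this example, not sourmash functions.

```python
def containment(query_hashes, target_hashes):
    """Fraction of query hashes that are also present in target."""
    if not query_hashes:
        return 0.0
    return len(query_hashes & target_hashes) / len(query_hashes)


def find_known_genomes(metagenome_hashes, genome_hashes_by_name, threshold=0.5):
    """Return (name, containment) pairs at or above threshold, best first."""
    hits = [
        (name, containment(hashes, metagenome_hashes))
        for name, hashes in genome_hashes_by_name.items()
    ]
    kept = [hit for hit in hits if hit[1] >= threshold]
    return sorted(kept, key=lambda hit: hit[1], reverse=True)


# Toy data: integers stand in for k-mer hashes.
metagenome = {1, 2, 3, 4, 5, 6}
genomes = {"a": {1, 2, 3}, "b": {5, 6, 7, 8}, "c": {9, 10}}
matches = find_known_genomes(metagenome, genomes)
```

Containment is asymmetric on purpose: a small genome can be fully contained in a very large metagenome even though their Jaccard similarity is close to zero.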

ctb avatar May 13 '17 13:05 ctb

  • using our public database, find NCBI accession of genome you're working with
  • using our public database, find (all) strains of the genome you're working with
  • build a discovery & notification service for new SRA/genbank/IMG/etc genomes

ctb avatar May 18 '17 13:05 ctb

via Cameron Thrash, "when we have pure culture genomes and want to see in which datasets we can recruit large numbers of reads for ecological comparison"

ctb avatar May 18 '17 14:05 ctb

I think "find NCBI accession of genome you're working with" could actually be expanded quite a bit - this could be a super convenient approach to getting full taxonomic information for something quickly, linking out to public databases, and cross-referencing across what NCBI/SRA/IMG/etc have made available. Actually a pretty exciting solution to a whole host of problems.

ctb avatar May 29 '17 14:05 ctb

differential presence of sequences per https://github.com/dib-lab/sourmash/issues/1266 is a pretty good one

ctb avatar Jan 04 '21 14:01 ctb

metagenome "pivot query" use cases: https://github.com/sourmash-bio/sourmash/issues/485

ctb avatar Aug 03 '22 10:08 ctb

Dealing with ridiculous amounts of data:

All samples were sequenced using Illumina shotgun metagenomic sequencing on the NovaSeq 6000 platform with 150 bp PE reads. Some samples were sequenced to excessive depth and the total dataset is approx. 20 Terabases in size. In addition to the metagenomes, we grew approx. 5,000 microbial isolates from a subset of the samples and sequenced 3,000 of those genomes to build an in-house microbial genome database.

ctb avatar Oct 17 '22 13:10 ctb