
calculate signature complexity

Open taylorreiter opened this issue 5 years ago • 4 comments

Would it be possible to combine size, scaled, and track-abundance info to calculate the complexity of a signature in some way? I think what I want to know is the approximate number of k-mers as a ratio of the number of input nucleotides.

taylorreiter avatar Jan 05 '19 02:01 taylorreiter

This strikes me as related to an issue that @luizirber proposed a while back - ISTR it was keeping track of abundance with HLL or some such. I can't find it in the sourmash tracker, wonder if it's in khmer?

Anyway, a few thoughts --

  • complexity is an overloaded term; it'd help to know your use cases for this feature?
  • the specific question is straightforward to approximate but I'm not sure about approximation accuracy!
  • the HULK paper might have some particular relevance to this!
  • we might also want to look at the idea of computing more things as part of signatures (this is a bit out there) - as long as we're iterating over the data once, why not compute histosketches and other features?

Also see #246, tracking number of bp and input sequences.
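
For what it's worth, here is a rough sketch of the approximation being discussed, in plain Python rather than any sourmash API. It assumes a scaled signature keeps roughly 1 in `scaled` of the distinct k-mers, so the number of retained hashes times `scaled` estimates the distinct k-mer count. The function name `estimate_complexity` and all of its parameters are hypothetical, and `input_bp` has to be supplied by the caller since this sketch does not assume the signature records it. How accurate this is for small or repetitive inputs is an open question.

```python
def estimate_complexity(num_hashes, scaled, input_bp, abundances=None):
    """Rough complexity estimates from scaled-sketch summary numbers.

    Hypothetical helper, not part of sourmash.
    num_hashes: number of hashes retained in the sketch
    scaled:     scaled factor (about 1 in `scaled` distinct k-mers is kept)
    input_bp:   number of input nucleotides, supplied by the caller
    abundances: optional per-hash abundances (track-abundance sketches)
    """
    if scaled <= 0 or input_bp <= 0:
        raise ValueError("scaled and input_bp must be positive")

    # Each retained hash stands in for roughly `scaled` distinct k-mers.
    est_distinct = num_hashes * scaled
    result = {
        "est_distinct_kmers": est_distinct,
        "distinct_per_bp": est_distinct / input_bp,
    }

    if abundances is not None:
        # With abundances, the same scaling gives a rough total k-mer count.
        est_total = sum(abundances) * scaled
        result["est_total_kmers"] = est_total
        result["distinct_over_total"] = (
            est_distinct / est_total if est_total else 0.0
        )
    return result


# Made-up numbers purely for illustration.
stats = estimate_complexity(
    num_hashes=500, scaled=1000, input_bp=1_000_000, abundances=[2] * 500
)
print(stats)
```

A `distinct_per_bp` near 1 would suggest mostly unique sequence, while lower values point to repeats or high coverage; that interpretation is an assumption and depends on the use case.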

ctb avatar Jan 05 '19 14:01 ctb

@ctb this? https://github.com/dib-lab/sourmash/pull/506

luizirber avatar Jan 07 '19 05:01 luizirber

no... I'll chat about it in person!

ctb avatar Jan 07 '19 16:01 ctb

ref https://github.com/sourmash-bio/sourmash/issues/33 too

ctb avatar Aug 03 '22 10:08 ctb