sourmash
calculate signature complexity
Would it be possible to combine size, scaled, and track-abundance info to calculate the complexity of a signature in some way? I think what I want to know is the approximate number of k-mers as a ratio of the number of input nucleotides.
This strikes me as related to an issue that @luizirber proposed a while back - ISTR it was keeping track of abundance with HLL or some such. I can't find it in the sourmash tracker; I wonder if it's in khmer?
Anyway, a few thoughts --
- complexity is an overloaded term; it'd help to know your use cases for this feature.
- the specific question is straightforward to approximate but I'm not sure about approximation accuracy!
- the HULK paper might have some particular relevance to this!
- we might also want to look at the idea of computing more things as part of signatures (this is a bit out there) - as long as we're iterating over the data once, why not compute histosketches and other features?
Also see #246, tracking number of bp and input sequences.
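For what it's worth, here is a rough sketch of the "straightforward to approximate" version, in plain Python rather than against the sourmash API. The function names and inputs are hypothetical: it assumes you already have the number of retained hashes, the `scaled` value, the total input bp (which the signature does not currently record), and optionally the per-hash abundances. Accuracy is untested, especially for small sketches.

```python
def estimate_distinct_kmers(num_hashes, scaled):
    """Estimate distinct k-mers in the input from a scaled sketch.

    Assumption: a scaled sketch keeps roughly 1 in `scaled` distinct
    k-mer hashes, so each retained hash stands in for ~`scaled` k-mers.
    """
    if scaled <= 0:
        raise ValueError("scaled must be a positive integer")
    return num_hashes * scaled


def kmers_per_nucleotide(num_hashes, scaled, total_bp):
    """Estimated distinct k-mers divided by the number of input nucleotides.

    `total_bp` has to be supplied by the caller (hypothetical input).
    """
    if total_bp <= 0:
        raise ValueError("total_bp must be positive")
    return estimate_distinct_kmers(num_hashes, scaled) / total_bp


def distinct_fraction_from_abundance(abundances):
    """With abundance tracking, approximate distinct k-mers / total k-mers.

    `abundances` maps hash -> count. The `scaled` factor cancels out:
    (len * scaled) / (sum * scaled) == len / sum.
    """
    total = sum(abundances.values())
    if total <= 0:
        raise ValueError("abundances must sum to a positive value")
    return len(abundances) / total


# Made-up numbers, for illustration only.
ratio = kmers_per_nucleotide(num_hashes=5000, scaled=1000, total_bp=10_000_000)
fraction = distinct_fraction_from_abundance({11: 1, 22: 3, 33: 4})
```

A ratio near 1 would suggest mostly unique k-mers, while a low ratio would suggest repetitive or high-coverage input. The abundance-based version has the advantage of not needing the input bp at all.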
@ctb this? https://github.com/dib-lab/sourmash/pull/506
no... I'll chat about it in person!
ref https://github.com/sourmash-bio/sourmash/issues/33 too