sourmash
some thoughts on saving/loading/selecting `SourmashSignature`
I've been digging into some `Storage` stuff, and thinking about:
- `SourmashSignature` and how we treat signatures and `MinHash` objects as 1:1
- how selectors on databases are really tightly tied to details of `MinHash` (ksize, moltype, etc.)
- how manifests are really tightly tied to details of `MinHash` (ksize, moltype, etc.)

and also how right now we really have no actual unique handle for either a `SourmashSignature` or a `MinHash` object, since the `md5sum` is calculated on the `MinHash` object and doesn't take the signature name into account.
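To make the handle problem concrete, here is a minimal toy sketch. It does not use the real sourmash classes: `ToySketch` and `ToySignature` are hypothetical stand-ins, and the digest recipe (ksize plus sorted hashes) is an assumption chosen only to illustrate the point that a digest computed on the sketch alone cannot tell two differently named signatures apart.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class ToySketch:
    """Hypothetical stand-in for a MinHash: parameters plus hash values."""
    ksize: int
    moltype: str
    hashes: tuple

    def md5sum(self):
        # Assumed recipe for illustration: digest over ksize and sorted hashes.
        m = hashlib.md5()
        m.update(str(self.ksize).encode("ascii"))
        for h in sorted(self.hashes):
            m.update(str(h).encode("ascii"))
        return m.hexdigest()


@dataclass(frozen=True)
class ToySignature:
    """Hypothetical stand-in for a signature: a name wrapped around one sketch."""
    name: str
    sketch: ToySketch

    def md5sum(self):
        # The name is never fed into the digest; only the sketch is.
        return self.sketch.md5sum()


sketch = ToySketch(ksize=31, moltype="DNA", hashes=(1, 2, 3))
sig_a = ToySignature("genome A", sketch)
sig_b = ToySignature("genome A, renamed", sketch)

# Two distinct signatures, one shared "handle".
same_handle = sig_a.md5sum() == sig_b.md5sum()
```

In this toy, `sig_a != sig_b` but `same_handle` is true, so the digest cannot serve as a unique key for the signature.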
In https://github.com/sourmash-bio/sourmash/issues/616 we talk about how signatures and MinHash objects are tightly tied together pretty clearly, but the situation has not been improved by selectors and manifests and picklists ;).
This also all gets in the way of storing related `MinHash` objects in a single `SourmashSignature` / leaf node in an SBT per https://github.com/sourmash-bio/sourmash/issues/198.
And, more generally, this also prevents us from supporting multiple different sketch types. We don't really have any yet (beyond `num`/`scaled` signatures, and maybe `noabund`/`abund`), but it would be nice to support them, which was the goal of https://github.com/sourmash-bio/sourmash/issues/1514.
So I'm thinking about slowly moving in the following direction:
- a `SourmashSignature` will become a collection of different sketch types calculated from the same underlying sequence data, and the best one for a given comparison will be chosen when a comparison is requested.
- `Storage` will support saving and loading `SourmashSignature`s of this type, but a storage location will contain at most one `SourmashSignature` (and one or more sketches under that signature).
- this then lets us use a storage location as a unique handle, which is useful for various kinds of search indices (SBTs and reverse index, in particular)
- `Storage` then becomes something that stores collections of signatures while `Index` structures like SBT and revindex move towards being a fast search index for some types of sketches in those signatures, e.g. sketches of a particular ksize/moltype. But then you can use those search indices to pull up the full `SourmashSignature`, which will let you transition between different sketches on the same signature (ksize, moltype, etc.)
- selectors move towards being things you query with a `SourmashSignature` in order to find `SourmashSignature`s with compatible operations available.
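A rough sketch of how the pieces in the list above might fit together, purely as a thought experiment. None of these classes are the existing sourmash API: `Sketch`, `Signature`, `DictStorage`, and `ToyIndex` are hypothetical names, the in-memory dict stands in for a real storage backend, and the linear Jaccard scan stands in for an SBT or revindex.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Sketch:
    """Hypothetical sketch: one set of hashes for one (ksize, moltype)."""
    kind: str
    ksize: int
    moltype: str
    hashes: frozenset


@dataclass
class Signature:
    """Hypothetical signature: several sketches of the same sequence data."""
    name: str
    sketches: list = field(default_factory=list)

    def select(self, ksize, moltype):
        return [s for s in self.sketches
                if s.ksize == ksize and s.moltype == moltype]


class DictStorage:
    """Toy storage: each location holds at most one signature."""
    def __init__(self):
        self._data = {}

    def save(self, location, sig):
        if location in self._data:
            raise ValueError(f"location already in use: {location}")
        self._data[location] = sig

    def load(self, location):
        return self._data[location]


class ToyIndex:
    """Toy search index over a single (ksize, moltype); keeps only locations."""
    def __init__(self, storage, ksize, moltype):
        self.storage = storage
        self.ksize = ksize
        self.moltype = moltype
        self._entries = []  # list of (location, hashes)

    def insert(self, location):
        sig = self.storage.load(location)
        for sk in sig.select(self.ksize, self.moltype):
            self._entries.append((location, sk.hashes))

    def search(self, query_hashes, threshold=0.1):
        results = []
        for location, hashes in self._entries:
            union = len(hashes | query_hashes)
            jaccard = len(hashes & query_hashes) / union if union else 0.0
            if jaccard >= threshold:
                results.append((jaccard, location))
        return sorted(results, reverse=True)


storage = DictStorage()
storage.save("sigs/genomeA", Signature("genome A", [
    Sketch("minhash", 21, "DNA", frozenset({10, 20, 30})),
    Sketch("minhash", 31, "DNA", frozenset({1, 2, 3})),
]))

index = ToyIndex(storage, ksize=31, moltype="DNA")
index.insert("sigs/genomeA")

# Search with k=31 hashes, then use the location as the handle to reach k=21.
hits = index.search(frozenset({1, 2, 3, 4}))
best_location = hits[0][1]
full_sig = storage.load(best_location)
k21_sketches = full_sig.select(21, "DNA")
```

The point of the toy is that the index never needs to own the signature. It hands back a location, and the location is enough to load every other sketch computed from the same data.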
One end result would be that things like `MinHash` and `select` would become much less visible at the top level in the code.
A hack I was thinking of implementing is the idea of a sequence as a sketch type, where we can store actual FASTA sequences and/or collections of k-mers as a signature. It sounds kinda stupid, but could be a good proof of concept in the current absence of different sketches.
+100
`HLL` and `Nodegraph` are also good candidates for different sketches, but I like the idea of using the sequence as a sketch type too!
as a side note, we could totally use `SqliteIndex` in #1808 as a signature storage for SBTs, but this breaks my brain a little at the moment.
but what I really came here to say was that storing FASTA/FASTQ in sqlite might be one way to go in terms of providing FASTA/FASTQ as a sketch type. In particular, using storage converters (see this and this) with gzip compression could work for efficient on-disk storage and retrieval of large FASTA sequences.
And, while thinking about that, it might actually make some vague sense to support sequence storage directly in `SqliteIndex` as optional columns in the `sketches` table. Then you would have both the hashes and the actual sequence in there 🤯, and would only "suffer" the file size and load time penalties when you used them.
also see sqlite-zstd: https://phiresky.github.io/blog/2022/sqlite-zstd/ - for in-database compression.
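For the gzip-via-converters idea, here is a minimal sketch using the adapter/converter hooks in Python's standard `sqlite3` module. The table layout is a simplified, hypothetical stand-in and not the actual `SqliteIndex` schema; the `Sequence` wrapper class and the `GZSEQ` declared type name are likewise made up for illustration.

```python
import gzip
import sqlite3


class Sequence:
    """Hypothetical wrapper so sqlite3 knows which values to compress."""
    def __init__(self, text):
        self.text = text


def adapt_sequence(seq):
    # Python -> SQLite: store the sequence as a gzip-compressed BLOB.
    return gzip.compress(seq.text.encode("ascii"))


def convert_sequence(blob):
    # SQLite -> Python: decompress on read. Not called for NULL values.
    return Sequence(gzip.decompress(blob).decode("ascii"))


sqlite3.register_adapter(Sequence, adapt_sequence)
sqlite3.register_converter("GZSEQ", convert_sequence)


def open_db(path=":memory:"):
    # PARSE_DECLTYPES makes sqlite3 apply converters by declared column type.
    conn = sqlite3.connect(path, detect_types=sqlite3.PARSE_DECLTYPES)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sketches ("
        " id INTEGER PRIMARY KEY,"
        " name TEXT,"
        " ksize INTEGER,"
        " sequence GZSEQ"  # optional column; NULL when no sequence is stored
        ")"
    )
    return conn


conn = open_db()
long_seq = "ACGT" * 1000
conn.execute(
    "INSERT INTO sketches (name, ksize, sequence) VALUES (?, ?, ?)",
    ("with_seq", 31, Sequence(long_seq)),
)
conn.execute(
    "INSERT INTO sketches (name, ksize, sequence) VALUES (?, ?, ?)",
    ("hashes_only", 31, None),
)

# Only queries that name the sequence column pay the decompression cost.
with_seq = conn.execute(
    "SELECT sequence FROM sketches WHERE name = ?", ("with_seq",)
).fetchone()[0]
hashes_only = conn.execute(
    "SELECT sequence FROM sketches WHERE name = ?", ("hashes_only",)
).fetchone()[0]
stored_bytes = conn.execute(
    "SELECT length(sequence) FROM sketches WHERE name = ?", ("with_seq",)
).fetchone()[0]
```

Rows without a sequence just carry a NULL, and queries that do not select the `sequence` column never invoke the converter, which matches the "only pay when you use it" idea. Swapping gzip for another codec would only change the two small functions.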