some thoughts on saving/loading/selecting `SourmashSignature`

Open ctb opened this issue 3 years ago • 4 comments

I've been digging into some Storage stuff, and thinking about:

  • SourmashSignature and how we treat signatures and MinHash objects as 1:1
  • how selectors on databases are really tightly tied to details of MinHash (ksize, moltype, etc.)
  • how manifests are really tightly tied to details of MinHash (ksize, moltype, etc.)

and also how right now we really have no actual unique handle for either a SourmashSignature or a MinHash object, since the md5sum is calculated on the MinHash object and doesn't take the signature name into account.
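
To make the handle problem concrete, here is a minimal sketch of one way to get an identifier that distinguishes two signatures sharing the same MinHash. The function `signature_handle` and its inputs are hypothetical and not part of sourmash; it only assumes the sketch's md5sum is available as a hex string.

```python
import hashlib


def signature_handle(name: str, minhash_md5: str) -> str:
    """Hypothetical handle: digest over the signature name AND the sketch md5.

    Not a sourmash API; just an illustration of mixing the name into the
    identifier so identical sketches with different names stay distinct.
    """
    h = hashlib.sha256()
    h.update(name.encode("utf-8"))
    h.update(b"\x00")  # separator so ("ab", "c") and ("a", "bc") differ
    h.update(minhash_md5.encode("ascii"))
    return h.hexdigest()


# Placeholder md5 string, made up for the example.
same_sketch = "0123456789abcdef0123456789abcdef"
handle_a = signature_handle("genome A", same_sketch)
handle_b = signature_handle("genome B", same_sketch)
```

With the md5sum alone, `genome A` and `genome B` would collide; with the name mixed in, they do not.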

In https://github.com/sourmash-bio/sourmash/issues/616 we talk about how signatures and MinHash objects are tightly tied together pretty clearly, but the situation has not been improved by selectors and manifests and picklists ;).

This also all gets in the way of storing related MinHash objects in a single SourmashSignature / leaf node in an SBT per https://github.com/sourmash-bio/sourmash/issues/198.

And, more generally, this also prevents us from supporting multiple different sketch types. We don't really have any yet (beyond num/scaled signatures, and maybe noabund/abund), but it would be nice to support them, which was the goal of https://github.com/sourmash-bio/sourmash/issues/1514.

So I'm thinking about slowly moving in the following direction:

  • a SourmashSignature will become a collection of different sketch types calculated from the same underlying sequence data, and the best one for a given comparison will be chosen when a comparison is requested.
  • Storage will support saving and loading SourmashSignatures of this type, but a storage location will contain at most one SourmashSignature (and one or more sketches under that signature).
  • this then lets us use a storage location as a unique handle, which is useful for various kinds of search indices (SBTs and reverse index, in particular)
  • Storage then becomes something that stores collections of signatures while Index structures like SBT and revindex move towards being a fast search index for some types of sketches in those signatures, e.g. sketches of a particular ksize/moltype. But then you can use those search indices to pull up the full SourmashSignature which will let you transition between different sketches on the same signature (ksize, moltype, etc.)
  • selectors move towards being things you query with a SourmashSignature in order to find SourmashSignatures with compatible operations available.
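
The direction in the bullets above could look roughly like the following. This is only a sketch under assumptions: `Sketch`, `MultiSketchSignature`, `select`, and `compatible_with` are hypothetical names for illustration and do not exist in sourmash in this form.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Sketch:
    """Hypothetical stand-in for one sketch (MinHash, HLL, ...)."""
    sketch_type: str  # e.g. "minhash"
    ksize: int
    moltype: str      # e.g. "DNA", "protein"


@dataclass
class MultiSketchSignature:
    """Hypothetical signature: several sketches of the same sequence data."""
    name: str
    location: str  # the storage location doubles as the unique handle
    sketches: list = field(default_factory=list)

    def select(self, *, sketch_type=None, ksize=None, moltype=None):
        """Return the sketches matching every criterion that was given."""
        return [
            s for s in self.sketches
            if (sketch_type is None or s.sketch_type == sketch_type)
            and (ksize is None or s.ksize == ksize)
            and (moltype is None or s.moltype == moltype)
        ]

    def compatible_with(self, other):
        """Pair up sketches from self and other that can be compared."""
        pairs = []
        for mine in self.sketches:
            for theirs in other.select(sketch_type=mine.sketch_type,
                                       ksize=mine.ksize,
                                       moltype=mine.moltype):
                pairs.append((mine, theirs))
        return pairs


sig_a = MultiSketchSignature("genome A", "sigs/a", [
    Sketch("minhash", 31, "DNA"), Sketch("minhash", 21, "DNA")])
sig_b = MultiSketchSignature("genome B", "sigs/b", [
    Sketch("minhash", 31, "DNA"), Sketch("minhash", 10, "protein")])
pairs = sig_a.compatible_with(sig_b)
```

Here the query is a whole signature, and the caller gets back the sketch pairs that can actually be compared, so ksize and moltype never have to be spelled out at the top level.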

One end result would be that things like MinHash and select would become much less visible at the top level in the code.

A hack I was thinking of implementing is the idea of a sequence as a sketch type, where we can store actual FASTA sequences and/or collections of k-mers as a signature. It sounds kinda stupid, but could be a good proof of concept in the current absence of different sketches.

ctb avatar Jun 29 '21 14:06 ctb

+100

HLL and Nodegraph are also good candidates for different sketches, but I like the idea of using the sequence as a sketch type too!

luizirber avatar Jun 29 '21 14:06 luizirber

as a side note, we could totally use SqliteIndex in #1808 as a signature storage for SBTs, but this breaks my brain a little at the moment.

ctb avatar Jan 26 '22 15:01 ctb

but what I really came here to say was that storing FASTA/FASTQ in sqlite might be one way to go in terms of providing FASTA/FASTQ as a sketch type. In particular, using storage converters (see this and this) with gzip compression could work for efficient on-disk storage and retrieval of large FASTA sequences.
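
A rough sketch of what that could look like with the adapter/converter hooks in Python's standard `sqlite3` module. The `Sequence` wrapper, the `GZSEQ` declared type, and the `sequences` table are all made up for illustration; none of this is the actual SqliteIndex schema.

```python
import gzip
import sqlite3


class Sequence:
    """Hypothetical wrapper so sqlite3 knows which values to compress."""

    def __init__(self, text: str):
        self.text = text


def adapt_sequence(seq: Sequence) -> bytes:
    # Python -> SQLite: store the sequence as a gzip-compressed BLOB.
    return gzip.compress(seq.text.encode("ascii"))


def convert_sequence(blob: bytes) -> Sequence:
    # SQLite -> Python: decompress columns declared as GZSEQ.
    return Sequence(gzip.decompress(blob).decode("ascii"))


sqlite3.register_adapter(Sequence, adapt_sequence)
sqlite3.register_converter("GZSEQ", convert_sequence)


def open_db(path: str = ":memory:") -> sqlite3.Connection:
    # PARSE_DECLTYPES makes sqlite3 apply converters by declared column type.
    conn = sqlite3.connect(path, detect_types=sqlite3.PARSE_DECLTYPES)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sequences (name TEXT PRIMARY KEY, seq GZSEQ)"
    )
    return conn


conn = open_db()
original = "ACGT" * 1000
conn.execute("INSERT INTO sequences VALUES (?, ?)", ("contig1", Sequence(original)))
(loaded,) = conn.execute(
    "SELECT seq FROM sequences WHERE name = ?", ("contig1",)
).fetchone()
(stored_bytes,) = conn.execute("SELECT length(seq) FROM sequences").fetchone()
```

The compression and decompression happen transparently at the database boundary, so calling code only ever sees `Sequence` objects.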

And, while thinking about that, it might actually make some vague sense to support sequence storage directly in SqliteIndex as optional columns in the sketches table. Then you would have both the hashes and the actual sequence in there 🤯, and would only "suffer" the file size and load time penalties when you used them.

ctb avatar Jan 26 '22 15:01 ctb

also see sqlite-zstd: https://phiresky.github.io/blog/2022/sqlite-zstd/ - for in-database compression.

ctb avatar Aug 02 '22 10:08 ctb