thoughts on `Storage`, manifests, etc.

Open • ctb opened this issue 2 years ago • 1 comment

This issue extracts the leftover bits from https://github.com/sourmash-bio/sourmash/issues/1352, mostly to do with evolving the Storage class and SBTs to better intersect.

From comment - @luizirber:

Quick comment on supporting multiple indexes per storage: during #648 I added the list_sbts method to ZipStorage with a trivial implementation, with the idea that a similar method on Storage would eventually be useful for pulling the index descriptions and selecting the desired one. The current implementation just returns the first .sbt.json file it finds, but we could store multiple indices in one storage and share the signatures/ between them, for example.
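As a rough illustration of what "return every index rather than the first one" could look like, here is a minimal sketch using only the standard-library `zipfile` module. The helper name `list_sbt_indices` and the file names in the demo archive are hypothetical; this is not the `ZipStorage.list_sbts` implementation.

```python
import io
import zipfile


def list_sbt_indices(zf):
    """Return the names of all '*.sbt.json' entries in an open ZipFile, sorted.

    Hypothetical helper: it lists every index description instead of
    stopping at the first one found.
    """
    return sorted(name for name in zf.namelist() if name.endswith(".sbt.json"))


def make_demo_zip():
    """Build an in-memory zip with two index files sharing one signatures/ prefix."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("k31.sbt.json", "{}")
        zf.writestr("k21.sbt.json", "{}")
        zf.writestr("signatures/abc123.sig", "[]")
    buf.seek(0)
    return buf


with zipfile.ZipFile(make_demo_zip()) as zf:
    indices = list_sbt_indices(zf)
```

A caller could then pick one entry from `indices` by name instead of relying on whichever file happens to come first.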

From comment - @ctb:

I think we can evolve Storage classes a little bit to support the relevant functionality. We'd want something like:

  • Storage.get_manifest(...) - loads exactly one manifest file

  • Index.signatures_with_internal(...) - the current method for getting all signatures in a location in order to build a manifest.

In the short term, supporting manifests on SBTs would be pretty easy as long as we don't try to allow multiple indices within an SBT. I would probably start by supporting separate manifest creation, storage, and loading as per the ZipFileLinearIndex implementation in #1590. This would give us good picklist functionality with SBT.signatures(...) which is very useful in the short term for supporting charcoal and grist use cases.
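To make the picklist benefit concrete: with a manifest, selection happens on lightweight metadata rows, and only the matching signatures need to be loaded afterwards. A minimal sketch, assuming a manifest is a list of dict rows; the field names and the `select_rows` helper are hypothetical and do not reflect sourmash's actual manifest format.

```python
# Hypothetical manifest: one metadata row per signature, no sketch data loaded.
manifest = [
    {"name": "genomeA", "ksize": 31, "moltype": "DNA",
     "internal_location": "signatures/a.sig"},
    {"name": "genomeB", "ksize": 31, "moltype": "DNA",
     "internal_location": "signatures/b.sig"},
    {"name": "genomeC", "ksize": 21, "moltype": "DNA",
     "internal_location": "signatures/c.sig"},
]


def select_rows(rows, picklist, ksize=None):
    """Keep rows whose name is in the picklist and, optionally, whose ksize matches."""
    picked = set(picklist)
    return [
        row for row in rows
        if row["name"] in picked and (ksize is None or row["ksize"] == ksize)
    ]


# Only these locations would need to be opened and parsed.
to_load = [
    row["internal_location"]
    for row in select_rows(manifest, {"genomeA", "genomeC"}, ksize=31)
]
```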

In the longer-term, some questions will arise -

  • do we want to have separate SBT indices and manifests in a single file? I think the answer is "yes" because that way we can support full signature files in SBTs with #198, and multiple indices on those signature files, all within a single zip file. (Which is good, right?)
  • if we want to support multiple SBT indices and want to keep them separate from a manifest, then I think we must forego using the SBT index itself as a manifest, because it will not necessarily contain all the signatures, and mixing the two roles gets confusing.
  • if we don't want to support multiple SBT indices, we could upgrade the SBT index to support all the information that needs to be in the manifest, e.g. ksize and moltype (#63).

and more -

hmm, maybe we want to start thinking about using Storage for arbitrary collections of signatures, with manifests; and Index for indexed collections that support search/gather/prefetch?

Then the .signatures() method (and related methods) are defined on Storage, while .find(...) is provided for Index classes, and things like SBTs and LCA DBs support both.

Could also add a Storage.indexes(...) that returns all indexes (currently just SBTs?) in any given storage.

Then manifests belong to Storage, and can be used efficiently with picklists; while Indexes are good for search, but don't work efficiently with picklists.
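One way to picture this split is as two small interfaces, with SBT-like classes implementing both. This is only a sketch under stated assumptions: `StorageSketch`, `IndexSketch`, and `ToyDB` are hypothetical names, signatures are reduced to plain sets of hashes, and `find` uses a simple Jaccard threshold rather than any real sourmash search.

```python
from abc import ABC, abstractmethod


class StorageSketch(ABC):
    """Hypothetical: an arbitrary collection of signatures."""

    @abstractmethod
    def signatures(self):
        """Return the names of all signatures in this collection."""

    def indexes(self):
        """Return the indexes kept in this storage (none by default)."""
        return []


class IndexSketch(ABC):
    """Hypothetical: an indexed collection that supports search."""

    @abstractmethod
    def find(self, query, threshold=0.5):
        """Return (name, similarity) pairs at or above the threshold."""


class ToyDB(StorageSketch, IndexSketch):
    """Stand-in for something like an SBT that supports both roles."""

    def __init__(self, sigs):
        self._sigs = dict(sigs)  # name -> set of hashes

    def signatures(self):
        return sorted(self._sigs)

    def find(self, query, threshold=0.5):
        hits = []
        for name, hashes in self._sigs.items():
            union = hashes | query
            jaccard = len(hashes & query) / len(union) if union else 0.0
            if jaccard >= threshold:
                hits.append((name, jaccard))
        # Best match first.
        return sorted(hits, key=lambda hit: -hit[1])


db = ToyDB({"a": {1, 2, 3, 4}, "b": {3, 4, 5, 6}})
names = db.signatures()
hits = db.find({1, 2, 3})
```

In this arrangement, manifest and picklist handling would hang off the storage side, while search-style calls go through the index side.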

ctb, Mar 26 '22 16:03

Maybe relevant: perhaps Storage classes could provide internal_location information for direct loading via Index.get, as suggested in https://github.com/sourmash-bio/sourmash/issues/1848.

ctb, Aug 15 '22 15:08