thoughts on `Storage`, manifests, etc.
This issue extracts the leftover bits from https://github.com/sourmash-bio/sourmash/issues/1352, mostly to do with evolving the `Storage` class and SBTs to better intersect.
From comment - @luizirber:
Quick comment on supporting multiple indexes per storage: during #648 I added the `list_sbts` method to `ZipStorage` with a trivial implementation, but having in mind that in the future it would be useful to have a similar method in `Storage` to be able to pull the indices descriptions and be able to select the desired one. Current impl just returns the first `.sbt.json` file it finds, but we can store multiple indices in one storage and share the `signatures/` between them, for example.
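To make the "pull the index descriptions and select one" idea concrete, here is a minimal sketch using only the standard-library `zipfile` module. It is not the actual `ZipStorage.list_sbts` implementation; the function names `list_sbts` and `select_sbt` and the file names in the demo are hypothetical, chosen only for illustration.

```python
import io
import zipfile


def list_sbts(zf):
    """Return every index description file (*.sbt.json) in an open zip, sorted."""
    return sorted(name for name in zf.namelist() if name.endswith(".sbt.json"))


def select_sbt(zf, name=None):
    """Pick one index by name; fall back to the first one found (hypothetical helper)."""
    found = list_sbts(zf)
    if not found:
        raise ValueError("no .sbt.json index found in this storage")
    if name is None:
        return found[0]
    if name not in found:
        raise ValueError(f"index {name!r} not found; available: {found}")
    return name


# Build a small in-memory zip with two indices sharing one signatures/ directory.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("k21.sbt.json", "{}")
    zf.writestr("k31.sbt.json", "{}")
    zf.writestr("signatures/abc123.sig", "[]")

with zipfile.ZipFile(buf) as zf:
    available = list_sbts(zf)                 # all index descriptions
    chosen = select_sbt(zf, "k31.sbt.json")   # explicit selection
    default = select_sbt(zf)                  # "first one found" fallback
```

Returning the full list and keeping "first one found" as a fallback would preserve the current behaviour while letting callers choose among several indices.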
From comment - @ctb:
I think we can evolve `Storage` classes a little bit to support the relevant functionality - we'd want some functionality like,
- `Storage.get_manifest(...)` - loads exactly one manifest file
- `Index.signatures_with_internal(...)`, the current method for getting all signatures in a location in order to build a manifest.

In the short term, supporting manifests on SBTs would be pretty easy as long as we don't try to allow multiple indices within an SBT. I would probably start by supporting separate manifest creation, storage, and loading as per the `ZipFileLinearIndex` implementation in #1590. This would give us good picklist functionality with `SBT.signatures(...)` which is very useful in the short term for supporting charcoal and grist use cases.

In the longer-term, some questions will arise -
- do we want to have separate SBT indices and manifests in a single file? I think the answer is "yes" because that way we can support full signature files in SBTs with #198, and multiple indices on those signature files, all within a single zip file. (Which is good, right?)
- if we want to support multiple SBT indices and want to keep them separate from a manifest, then I think we must forego using the SBT index itself as a manifest, because it will not necessarily have all the signatures in it and also that gets confusing.
- if we don't want to support multiple SBT indices, we could upgrade the SBT index to support all the information that needs to be in the manifest, e.g. ksize and moltype (#63).
and more -
hmm, maybe we want to start thinking about using `Storage` for arbitrary collections of signatures, with manifests; and `Index` for indexed collections that support search/gather/prefetch?

Then the `.signatures()` method (and related methods) are defined on `Storage`, while `.find(...)` is provided for `Index` classes, and things like SBTs and LCA DBs support both.

Could also add a `Storage.indexes(...)` that returns all indexes (currently just SBTs?) in any given storage.

Then manifests belong to `Storage`, and can be used efficiently with picklists; while `Index`es are good for search, but don't work efficiently with picklists.
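A rough sketch of that split, under heavy assumptions: the classes below (`Sig`, `Storage`, `ToyIndex`) are hypothetical stand-ins and not sourmash's real classes, the manifest is computed on the fly rather than loaded from a file, and "search" is a plain linear Jaccard scan. The point is only to show `.signatures()` with picklist-style selection living on the storage side, and `.find(...)` living on the index side.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Sig:
    """Hypothetical stand-in for a signature: metadata plus a set of hashes."""
    name: str
    ksize: int
    moltype: str
    hashes: frozenset


@dataclass
class Storage:
    """Hypothetical: an arbitrary collection of signatures plus a manifest."""
    sigs: dict = field(default_factory=dict)  # internal location -> Sig

    def manifest(self):
        # One metadata row per signature; selection never touches the hashes.
        return [
            {"internal_location": loc, "name": s.name,
             "ksize": s.ksize, "moltype": s.moltype}
            for loc, s in self.sigs.items()
        ]

    def signatures(self, picklist=None):
        # Filter on manifest rows first, then yield only the matching signatures.
        for row in self.manifest():
            if picklist is None or row["name"] in picklist:
                yield self.sigs[row["internal_location"]]


class ToyIndex(Storage):
    """Hypothetical: like an SBT or LCA DB, supports .signatures() and .find()."""

    def find(self, query, threshold=0.0):
        # Linear scan standing in for a real index search.
        for s in self.signatures():
            union = len(query.hashes | s.hashes)
            jaccard = len(query.hashes & s.hashes) / union if union else 0.0
            if jaccard >= threshold:
                yield jaccard, s


db = ToyIndex({
    "signatures/a.sig": Sig("a", 31, "DNA", frozenset({1, 2, 3})),
    "signatures/b.sig": Sig("b", 31, "DNA", frozenset({3, 4, 5})),
})
picked = [s.name for s in db.signatures(picklist={"b"})]
query = Sig("q", 31, "DNA", frozenset({1, 2, 3, 4}))
hits = [(round(j, 2), s.name) for j, s in db.find(query, threshold=0.5)]
```

Here the picklist is applied against manifest rows only, which is the property that would make it efficient on a real `Storage`.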
maybe relevant: perhaps `Storage` classes could provide `internal_location` information for direct loading via `Index.get`, suggested in https://github.com/sourmash-bio/sourmash/issues/1848.
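One way to picture direct loading by `internal_location`, again as a sketch under assumptions: the manifest layout, the `load_by_location` helper, and the JSON payloads below are all hypothetical, and the actual signature of `Index.get` is not assumed here. The idea is that a manifest row carries the member path, so a single entry can be read from the zip without scanning the rest.

```python
import io
import json
import zipfile


def load_by_location(zf, internal_location):
    """Read and parse exactly one member of the zip (hypothetical helper)."""
    with zf.open(internal_location) as fp:
        return json.load(fp)


# Build an in-memory zip holding two signature files (toy JSON payloads).
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("signatures/a.sig", json.dumps({"name": "a", "ksize": 31}))
    zf.writestr("signatures/b.sig", json.dumps({"name": "b", "ksize": 21}))

# A toy manifest: metadata rows that point at internal locations.
manifest = [
    {"name": "a", "ksize": 31, "internal_location": "signatures/a.sig"},
    {"name": "b", "ksize": 21, "internal_location": "signatures/b.sig"},
]

with zipfile.ZipFile(buf) as zf:
    # Select on the manifest, then load only the matching member.
    row = next(r for r in manifest if r["ksize"] == 21)
    loaded = load_by_location(zf, row["internal_location"])
```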