
brainstorming: alternative signature storage/loading/query formats

Open ctb opened this issue 3 years ago • 10 comments

from https://github.com/dib-lab/sourmash/issues/1226#issuecomment-748382043, @luizirber says:

The challenge then becomes recalculating the reference sigs with smaller scaled values (100?), and efficiently storing it. JSON + gzip for sigs is at the limit for sizes, but not sure what would be a good format that maintains good archival/self-describing/easy to parse/small trade-offs.

a couple of thoughts here -

  • I feel like we've benefited a lot from using a really boring standard format like JSON which has lots of tools & language support
  • so binary formats are fine if they have said tool & language support, but I'm not "up" on what binary formats are good - maybe protocol buffers are an option?
  • alt, I wonder if we could use a database format that supports the kind of queries we want to do? e.g. in https://github.com/dib-lab/sourmash/issues/821 I suggest sqlite.
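
To make the sqlite idea concrete, here is a minimal sketch using only Python's standard `sqlite3` module. The table layout and the function names (`to_signed`, `build_db`, `count_overlaps`) are assumptions made up for illustration; they are not the schema proposed in the linked issue. One detail is certain: SQLite's INTEGER type is a signed 64-bit value, so unsigned 64-bit hashes have to be remapped before storage.

```python
import sqlite3

def to_signed(h):
    """Map an unsigned 64-bit hash into SQLite's signed 64-bit INTEGER range."""
    return h - 2**64 if h >= 2**63 else h

def build_db(sketches):
    """sketches: dict of sketch name -> iterable of hash values (hypothetical layout)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sketches (id INTEGER PRIMARY KEY, name TEXT UNIQUE)")
    conn.execute("CREATE TABLE hashes (hashval INTEGER NOT NULL, sketch_id INTEGER NOT NULL)")
    for name, hashvals in sketches.items():
        cur = conn.execute("INSERT INTO sketches (name) VALUES (?)", (name,))
        rows = [(to_signed(h), cur.lastrowid) for h in set(hashvals)]
        conn.executemany("INSERT INTO hashes VALUES (?, ?)", rows)
    # The index lets lookups by hash value avoid scanning the whole table.
    conn.execute("CREATE INDEX hashes_by_value ON hashes (hashval)")
    conn.commit()
    return conn

def count_overlaps(conn, query_hashes):
    """Return (sketch name, number of shared hashes), largest overlap first."""
    wanted = [(to_signed(h),) for h in set(query_hashes)]
    conn.execute("CREATE TEMP TABLE IF NOT EXISTS query (hashval INTEGER PRIMARY KEY)")
    conn.execute("DELETE FROM query")
    conn.executemany("INSERT INTO query VALUES (?)", wanted)
    return conn.execute(
        "SELECT s.name, COUNT(*) AS n FROM hashes h "
        "JOIN query q ON q.hashval = h.hashval "
        "JOIN sketches s ON s.id = h.sketch_id "
        "GROUP BY s.name ORDER BY n DESC, s.name"
    ).fetchall()
```

The point of the sketch is that the overlap count happens inside the database, so only the matching rows are ever touched, rather than every signature being parsed first.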

ctb avatar Dec 21 '20 16:12 ctb

I feel like we've benefited a lot from using a really boring standard format like JSON which has lots of tools & language support

Yup, I agree.

so binary formats are fine if they have said tool & language support, but I'm not "up" on what binary formats are good - maybe protocol buffers are an option?

I would really like to avoid protobuf (eg https://twitter.com/fasterthanlime/status/1340944948582113282). On the Rust side, serde has support for a bunch of formats, but performance-wise it would be better to have something that doesn't require encoding/decoding for usage (zero-copy deserialization like cap'n proto, also used by mash, or rkyv, which is rust-only), but that is not as flexible as JSON...

(Tree-buf looks REALLY interesting, but it still has no support for other languages)
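
As a rough illustration of the encode/decode versus zero-copy distinction, the sketch below compares JSON with a fixed-width layout read through a `memoryview`. This is only an analogy using the Python standard library: it is not the cap'n proto or rkyv wire format, the function names are made up, and it assumes native byte order with 8-byte unsigned integers, so the bytes are not portable across architectures.

```python
import json
from array import array

def encode_json(hashes):
    """Text encoding: every hash is rendered as decimal digits."""
    return json.dumps(hashes).encode("utf-8")

def decode_json(blob):
    """Decoding parses the whole buffer and builds a new Python list."""
    return json.loads(blob)

def encode_fixed(hashes):
    """Fixed-width encoding: 8 bytes per hash, native byte order (assumption)."""
    return array("Q", hashes).tobytes()

def view_fixed(blob):
    """Reinterpret the bytes in place; no per-element parsing, no copy of the buffer."""
    return memoryview(blob).cast("Q")
```

With `view_fixed`, indexing a single hash reads 8 bytes straight from the buffer, which is the property that makes this family of formats attractive for large sketches. The trade-off noted above still holds: the layout is rigid, and nothing in the bytes describes itself the way JSON does.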

alt, I wonder if we could have a database in format that supports the kind of queries we want to do? e.g. in #821 I suggest sqlite.

Mixed feelings. I think it is a good idea when compared to using Zip files for databases, but not so sure about single signatures...

Relevant read: https://www.sqlite.org/affcase1.html

luizirber avatar Dec 21 '20 17:12 luizirber

what about AVRO? https://avro.apache.org/

ctb avatar Feb 11 '21 16:02 ctb

what about AVRO? https://avro.apache.org/

This is probably very easy to test, considering that https://github.com/flavray/avro-rs supports serde, and so it is a drop-in replacement in the current codebase.

I was looking more into the Arrow/Parquet direction, which would also make it easier to work with more data-analysis-like workflows (loading into pandas, and so on).

Another direction to consider: in #1221 I was using the bitmagic serialization/deserialization for saving nodegraphs, but it might be also a good representation for scaled minhash sketches (save a "compressed bitmap" of the hashes, instead of a list). bitmagic is not a good portable format, but I wonder if any of the options mentioned here support something along the bitmap idea. (this can make a GIGANTIC difference for very large sketches).

luizirber avatar Feb 11 '21 19:02 luizirber

I started playing with the easy ones (the formats supported by serde) in https://github.com/luizirber/2021-02-11-sourmash-binary-format, will report when I have more results.

luizirber avatar Feb 12 '21 02:02 luizirber

thoughts stemming from all the manifest work that has happened:

between the recent introduction of StandaloneManifestIndex https://github.com/sourmash-bio/sourmash/pull/1891 and the hopefully-soon merge of SQLite manifests in #1808, we have an increasingly clean separation between metadata (manifests) and sketches (things containing actual hashvals). This separation would seem to make it easier to experiment with non-JSON formats in the primary code base.

there's also the idea of storing sketches in kProcessor kDataFrames or other k-mer-specialized formats.
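
A minimal sketch of why that separation helps, assuming a manifest is just a list of metadata rows and the sketch storage sits behind a loader function. The field names (`name`, `ksize`, `location`) and both functions are hypothetical and are not the sourmash manifest API; the loader is the part that could be swapped for a non-JSON or k-mer-specialized backend.

```python
def select_rows(manifest, **criteria):
    """Filter on metadata only; no sketch data is read here."""
    return [
        row for row in manifest
        if all(row.get(key) == value for key, value in criteria.items())
    ]

def load_selected(manifest, loader, **criteria):
    """Load hash values only for rows that pass the metadata filter.

    `loader` is any callable mapping a location to a collection of hashes,
    so the storage format behind it can change without touching the manifest.
    """
    return {
        row["name"]: loader(row["location"])
        for row in select_rows(manifest, **criteria)
    }
```

Because selection never touches the hashes, the format used by `loader` can be experimented with independently of how sketches are picked.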

ctb avatar Apr 20 '22 14:04 ctb

side note: it would be neat to find ways of avoiding even reading or adding hashes (e.g. store them in bands https://github.com/sourmash-bio/sourmash/issues/1578, or hierarchically at different scaled values).
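
One way the hierarchical idea could look, as a sketch under stated assumptions: a sketch at a given scaled value is taken to keep hashes below roughly 2^64 / scaled (the exact cutoff and rounding used by sourmash may differ), and the tier layout and function names here are invented for illustration.

```python
MAX_HASH = 2**64

def threshold(scaled):
    """Approximate cutoff: hashes below this belong to a sketch at `scaled` (assumption)."""
    return MAX_HASH // scaled

def split_into_tiers(hashes, scaled_levels):
    """Put each hash in the coarsest (largest scaled) tier that still keeps it."""
    levels = sorted(scaled_levels, reverse=True)  # coarsest first
    tiers = {scaled: [] for scaled in levels}
    for h in sorted(hashes):
        for scaled in levels:
            if h < threshold(scaled):
                tiers[scaled].append(h)
                break
        # Hashes above the finest cutoff fall outside every tier and are dropped.
    return tiers

def load_at(tiers, scaled):
    """Read only the tiers needed for the requested scaled value."""
    out = []
    for level in sorted(tiers, reverse=True):
        if level >= scaled:
            out.extend(tiers[level])
    return out
```

A query at a coarse scaled value then reads only the first tier or two and never touches the bulk of the hashes stored for the finest level.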

ctb avatar Apr 21 '22 13:04 ctb

briefly looked into Roaring Bitmaps,

https://roaringbitmap.org/about/

which has both rust and python bindings.

however, while the roaring library and roaring-rs both seem to support 64-bit numbers, pyroaring does not yet - https://github.com/Ezibenroc/PyRoaringBitMap/issues/58

update - also see https://pypi.org/project/roroaring64/ which supports deserialization but not serialization.

and also https://pypi.org/project/pilosa-roaring/ which primarily (only?) supports serialization and deserialization. not clear if it supports 64 bits.

and also https://github.com/sunzhaoping/python-croaring/ which is a cffi wrapping? but does not support 64 bits.

ctb avatar Aug 28 '22 14:08 ctb

which has both rust and python bindings.

I'll do a quick check on the rust one for mastiff, I really liked the API!

At the moment #2230 is using rkyv to serialize/deserialize the list of datasets containing a hash, and while that process is fast it is using a regular BTreeMap from the Rust stdlib, which doesn't save much space.

(rkyv is fast, but it has its own binary format, which precludes using it in other languages. roaring bitmaps are well supported in many languages)

luizirber avatar Sep 03 '22 14:09 luizirber

Seems like roaring is smaller and faster than rkyv on a first test, will try more extensive benchmarks soon.

branch: https://github.com/sourmash-bio/sourmash/compare/lirber/mastiff...lirber/mastiff_roaring

luizirber avatar Sep 04 '22 00:09 luizirber

Seems like roaring is smaller and faster than rkyv on a first test, will try more extensive benchmarks soon.

branch: lirber/mastiff...lirber/mastiff_roaring

Caveat on roaring: it only stores presence/absence, so it doesn't work as a replacement for abundance. But I think we can still use a Vec for storing abundances, and call mins.rank(hash) to get the position to get/set in the abundances Vec.
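
The rank trick can be sketched in Python with a sorted list standing in for the bitmap. This is an illustration, not the mastiff code: the class and method names are made up, and `rank(h)` is assumed to return the number of stored hashes less than or equal to `h`, so the abundance for a present hash sits at index `rank(h) - 1`.

```python
from bisect import bisect_right

class AbundanceSketch:
    """Presence in a sorted `mins` list, abundances in a parallel list."""

    def __init__(self, hash_to_abund):
        self.mins = sorted(hash_to_abund)                  # stand-in for the bitmap
        self.abunds = [hash_to_abund[h] for h in self.mins]

    def rank(self, h):
        """Number of stored hashes <= h."""
        return bisect_right(self.mins, h)

    def contains(self, h):
        r = self.rank(h)
        return r > 0 and self.mins[r - 1] == h

    def get(self, h):
        """Abundance of h, or 0 if h is not in the sketch."""
        return self.abunds[self.rank(h) - 1] if self.contains(h) else 0

    def set(self, h, abund):
        """Update the abundance of a hash that is already present."""
        if not self.contains(h):
            raise KeyError(h)
        self.abunds[self.rank(h) - 1] = abund
```

The membership check matters: rank is defined for absent hashes too, so using it without `contains` would silently return a neighbour's abundance.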

luizirber avatar Sep 05 '22 16:09 luizirber