Mash icon indicating copy to clipboard operation
Mash copied to clipboard

Exporting mash hashes for interoperability

Open ctb opened this issue 8 years ago • 51 comments

Hi all,

first some background:

it seems like sourmash is going to be a thing; I'm building it into a metagenomics data exploration tool, and it's already integrated into https://github.com/dib-lab/khmer/ in some interesting and useful ways. Before it becomes too much of a thing, I'm interested in harmonizing with what you've done with mash, both out of gratitude and because it'd be kind of stupid to have multiple different MinHash implementations out there - interoperability would be really handy!

So, on the topic of interop, I poked around under the hood of mash, and am happy to report that I can swizzle sourmash over to use your exact hash function and seed; I will do so forthwith.

It seems like it would be relatively simple for me to write a parser for your .msh files, but that would depend on capnproto, I think. It seems like it would be better to be part of mash. So, what do you think about a 'dump' command for sketches? This would be an explicit "data transfer" format that we could use to transition sketches between MinHash software implementations. I'd guess that something quite minimal (uniquely identified hash function + seed, k size, identifier, and hashes, all in a CSV file) would work. In our 'signature' files we also include an md5sum of the hashes.

If this is not antithetical to the very principles on which mash was founded, then great! Let me know! And I'm happy to whip up a prototype and submit a pull request - I was thinking of adding a new command, 'mash dump'. Alternate ideas very welcome.

cc @luizirber

ctb avatar May 20 '16 21:05 ctb

Hey Titus, Glad you're finding Mash useful. Happy to help support interop however we can. Have you looked into capnproto tools (like pycapnp) for decoding the serialized format? E.g.

Should be pretty easy to slurp in the current Mash serialized format, which is defined here: https://github.com/marbl/Mash/blob/master/src/mash/capnp/MinHash.capnp

Are you opposed to the extra capnp dependency this requires? What other format would you suggest?

aphillippy avatar May 20 '16 21:05 aphillippy

I'm -1 on adding capnp to my tool, but I can certainly write something quick to take the .msh format and export it (I already have C++ code running). I think there's value to having it be part of mash (because no doubt you will at some point change the details of mash and break whatever I've written if it's not part of mash), but I can certainly make do for now without it.

ctb avatar May 20 '16 21:05 ctb

Looking for something binary or text based? Propose anything reasonable and should be pretty easy to add.

aphillippy avatar May 20 '16 21:05 aphillippy

Something quite minimal (uniquely identified hash function + seed, k size, identifier, and hashes, all in a CSV file) would work. In our 'signature' files we also include an md5sum of the hashes.

I'll put together a Python prototype for comments and get back to you.

ctb avatar May 21 '16 16:05 ctb

Separately - do you have any contribution instructions, coding guidelines, or PR/code review policies I could use to guide me? thx!

ctb avatar May 21 '16 16:05 ctb

Wasn't presuming you'd code it up. Just propose a format for the dump and we can implement.

aphillippy avatar May 26 '16 11:05 aphillippy

I've implemented a dump command here: https://github.com/ctb/Mash/pull/1. Let me know what you think -- I am happy to turn it into a pull request against this repo. The dumpfile format can be changed to include more information; I don't know enough about your internals to do a comprehensive job here!

I was thinking it'd be nice to implement a 'load' or 'import' command for mash as well; if you do that, I can provide a dump command from sourmash as well.

ctb avatar Jun 03 '16 13:06 ctb

I was able to pull and run this, and the code is pretty straightforward, so I'd be fine with merging a pull request. A few comments:

  • It might make sense for this to be a function of mash info to keep the command space clean?
  • Other info we may want to throw in for future-proofing would be the alphabet and whether it's canonical.
  • As far as the format itself, we could end up with some pretty long lines e.g. for s=10,000, which isn't ideal, but single-line parsing does have some convenience to it. I'm also not crazy about nesting whitespace-separation within comma-separation (but of course these files aren't really meant to be viewed anyway).

ondovb avatar Jun 07 '16 21:06 ondovb

On Tue, Jun 07, 2016 at 02:43:10PM -0700, ondovb wrote:

I was able to pull and run this, and the code is pretty straightforward, so I'd be fine with merging a pull request. A few comments:

  • It might make sense for this to be a function of mash info to keep the command space clean?

I was thinking that too. Alternatively, part of the "expert" commands that aren't shown by default? (You don't have any just yet, but you could have more.)

  • Other info we may want to throw in for future-proofing would be the alphabet and whether it's canonical.

Absolutely - but I haven't looked into this for mash, so don't know what's important for your current & future functionality. Specific suggestions?

  • As far as the format itself, we could end up with some pretty long lines e.g. for s=10,000, which isn't ideal, but single-line parsing does have some convenience to it. I'm also not crazy about nesting whitespace-separation within comma-separation (but of course these files aren't really meant to be viewed anyway).

As you say - agree to all of it :)

thanks! --titus

ctb avatar Jun 08 '16 21:06 ctb

Hi all,

Has there been any more work on this? Both tools are absolutely fantastic and I'd be happy to follow suit in making yet another MinHash classifier compatible with Mash/sourmash (especially since I've contributed to the overgrowth of implementations).

I'm not opposed to incorporating capnproto but it'd be nice to have a text format as well - the sourmash sigs are great to have around!

edawson avatar Jul 04 '16 22:07 edawson

Hi Eric, Yes, but summer holidays have put us behind. We should get back to this shortly.

Best, -Adam

aphillippy avatar Jul 10 '16 20:07 aphillippy

The master branch now has this implemented via mash info -d. I adhered to JSON syntax for more potential compatibility, but the whitespace is controlled, so a custom line-by-line parser should be pretty straightforward. I'll write up some schema docs once it's solidified; for now it should be mostly self-explanatory.

The one thing I didn't include was the MD5, since I wasn't sure of the scope of each signature (e.g. metadata or multiple sketches).

Update: forgot to print seed in original; just committed fix.

ondovb avatar Jul 26 '16 16:07 ondovb

I plan make to an official release tomorrow, which will include this feature. Any feedback on the format or what is included is certainly welcome...it will be fairly easy to add little things later, but any major changes will become more difficult once it is in use.

ondovb avatar Aug 11 '16 14:08 ondovb

thanks for the heads up! taking a look now.

ctb avatar Aug 11 '16 14:08 ctb

Compiles and runs for me. I have to write a sourmash parser before I can tell if it 100% works - will see what I can do, but today is unlikely. So :+0: from me and maybe +1 by tomorrow :)

ctb avatar Aug 11 '16 15:08 ctb

Ok, take your time and I will hold for your comments, since you're the primary target of the feature.

ondovb avatar Aug 11 '16 22:08 ondovb

Did you get a chance to take a look at the JSON? I think I will go ahead with the release in the next couple days, but it should be easy to throw in any last minute tweaks to the format.

ondovb avatar Aug 22 '16 16:08 ondovb

@ondovb – One potential issue with JSON encoding here is that while serializing in C++ will work for 64-bit uints, the behavior decoding the JSON is going to be implementation specific (i.e., loading in Javascript will truncate the numbers to 64-bit floats).

For the purposes of having a maximally interchangeable format, thoughts on encoding these as strings in the JSON array?

boydgreenfield avatar Aug 24 '16 23:08 boydgreenfield

Ooo, good catch, Nick. Thanks. I am not a JS person, so I don't know the best way to address this. Encoding as a string seems like a sensible solution to me...

aphillippy avatar Aug 25 '16 20:08 aphillippy

Yeah, I think string encoding is the simplest approach here (esp. since many languages may will then load an array of ints interspersed with longs vs. all unsigned 64-bit ints). That would also avoid really cryptic errors where folks pass this JSON blob through, e.g., Javascript services, and then end up with mysteriously truncated hash values.

Does that work for your use case @ctb?

boydgreenfield avatar Aug 25 '16 21:08 boydgreenfield

Yep, encoding as strings works for me!

ctb avatar Aug 31 '16 23:08 ctb

Would it be possible to add the functionality to deserialize the JSON dump into a mash .msh file? This would allow interopability to work for both loading and dumping.

I think having a standard format for mash hash interoperability is a good idea. It may be useful to define an RFC-like document with a description of all the fields. Nothing fancy - a simple document with the specifications of each key and corresponding value. My small recommendation would be to add a version field, this would help with maintenance of changes if additional fields are added in future etc.

All data that might change in the future should be tagged with a version number. -- Joe Armstrong

michaelbarton avatar Dec 09 '16 18:12 michaelbarton

I supoose it also follows that once you have an interchangable format for mashes, a centralised database of precomputed hashes would be useful for everyone. Titus mentions something similar in the last paragraph of his blog post MinHash signatures as ways to find samples, and collaborators?.

I would suggest maintaining the individual hashes of each organism and then allowing each collaborator to mash paste hashes together. This would allow the creation of custom databases, and cross-validation of tools through leave-n out benchmarking.

michaelbarton avatar Dec 09 '16 18:12 michaelbarton

@michaelbarton this is certainly something we are open to, but the standardization you noted I think would be a prerequisite, especially with regard to the hash function, to keep things robust. We'll think about this more and keep you posted.

ondovb avatar Dec 09 '16 20:12 ondovb

On Fri, Dec 09, 2016 at 10:43:06AM -0800, Michael Barton wrote:

I supoose it also follows that once you have an interchangable format for mashes, a centralised database of precomputed hashes would be useful for everyone. Titus mentions something similar in the last paragraph of his blog post MinHash signatures as ways to find samples, and collaborators?.

I would suggest maintaining the individual hashes of each organism and then allowing each collaborator to mash paste hashes together. This would allow the creation of custom databases, and cross-validation of tools through leave-n out benchmarking.

We're working in this direction ourselves, yes.

One thought: the signatures are small, and the databases are fast to search. I'm not sure we will need custom databases made from public signatures, as we might be able to have just one.

ctb avatar Dec 11 '16 15:12 ctb

@ondovb : the standardization of the hash function would indeed be necessary.

However, even with such a standardization sharing k-mers would seem preferable to sharing their hash values. This would allow added robustness through a rather simple check mechanism: check locally that a local hashing function produces the same minhashes as the one in the signature fetched from somewhere. This would also allow fancy handshakes between databases of signatures (pass the kmers and the corresponding hashes, and have an empirical assessment that the hash function is likely the same).

This would require a bit of added code upstream (tools creating signatures) with the need to store kmers in an signature-for-export (kmers and corresponding hash values) at creation time, but this does seem relatively easy to add.

PS: Sharing the k-mers, might also make a JSON-based format relatively free from the concern about JS and long integers.

lgautier avatar Dec 30 '16 21:12 lgautier

It seems like sharing the k-mers that are the lowest n hashes under a particular hash function is almost identical to sharing the lowest n hashes together with the id of the hash function, no? You're not going to get the same n k-mers yielding the lowest n hashes for a particular data set with a different hash function. If the goal is to verify the identity of the hash functions, that can be done by hashing a signal k-mer and sharing that hash specifically.

I can see one or two potential advantages to sharing the k-mers, tho.

ctb avatar Dec 30 '16 22:12 ctb

140 chars version of my previous post: Kmer + hash value allow check. Kmer-only allows to overcome JS issue with integers.

lgautier avatar Dec 30 '16 22:12 lgautier

On Fri, Dec 30, 2016 at 02:29:41PM -0800, Laurent Gautier wrote:

140 chars version of my previous post: Kmer + hash value allow check. Kmer-only allows to overcome JS issue with integers.

:)

ctb avatar Dec 30 '16 22:12 ctb

(...)

If the goal is to verify the identity of the hash functions, that can be done by hashing a signal k-mer and sharing that hash specifically.

Empirical assessment of identical hashing functions by comparing output from known input is not going to be a guarantee no matter the numbers of inputs used but I'd say: use as much as possible while remaining within a practical data transfer and computational burden. This would likely mean more than one, and may point towards using the list of signature k-mers with their hash values.

Should the sharing of both the k-mers and the corresponding hash values for all entries in the signature represent an excessive burden (the signatures are relatively small, but let's assume it could happen), the fact that a signature is constituted of the N lowest hash values provides a rather straightforward way to dynamically adjust the amount of k-mer/hash value pairs in a signature-for-export. One can then only share M k-mer/hash value pairs for verification (with M <= N and and the and the pairs such as the hash values are the M lowest ones), as well as the remaining N-M entries in the signatures (k-mers, or hash values) obviously. It would cover up to the case of one signal k-mer with M=1(or no check at all withM=0`).

lgautier avatar Dec 31 '16 03:12 lgautier