Mash
JSON schema
A first pass of the JSON schema is in the Mash repo: https://github.com/marbl/Mash/blob/master/src/mash/schema.json
For now, I put k-mers as a separate array parallel to hashes rather than an array of tuples, since the latter seemed unwieldy, especially if they are optional.
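To make the layout choice concrete, here is a minimal sketch of reading the parallel-array form, assuming the fields are called `hashes` and `kmers` (hypothetical names for illustration; check the schema for the real ones) and that `kmers` is optional:

```python
# Hypothetical sketch fragment; the field names "hashes" and "kmers" are assumptions.
sketch = {
    "hashes": [1021, 4870, 9912],
    "kmers": ["ACGTA", "TTGCA", "GGATC"],  # optional; same order as "hashes"
}


def paired(sketch):
    """Return (hash, kmer) pairs; kmer is None when the optional array is absent."""
    hashes = sketch["hashes"]
    kmers = sketch.get("kmers")
    if kmers is None:
        return [(h, None) for h in hashes]
    if len(kmers) != len(hashes):
        raise ValueError("kmers and hashes must have the same length")
    return list(zip(hashes, kmers))


pairs = paired(sketch)
without_kmers = paired({"hashes": sketch["hashes"]})
```

With parallel arrays, dropping the optional k-mers just means omitting one key; with an array of tuples, every element would change shape.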
@ondovb This looks like a great start! A few items we've been tracking here that it'd be great to include (we were literally just whiteboarding this):
- A counts array, and extra parameters around count-based trimming of the min-hashes
- An Object type metadata value for storing extra sample metadata (per sketch)
- An optional metadata_schema value that could point to a schema for said metadata
A few more tactical items and questions:
- Is name the filename for each sketch? In our implementation, we've previously also tracked the file size and an md5 checksum, which is useful for deduplicating
- Is length the number of k-mers? Or file size? I think it could be useful to store both. One should probably only count valid k-mers per the alphabet in an implementation
- alphabet should maybe be an enum?
- hashSeed should probably be nullable since not all hash functions expose a seed (or one just makes it 0 in that case?)
@boydgreenfield Thanks for the feedback.
A counts array, and extra parameters around count-based trimming of the min-hashes
Counts make sense, and are in the Mash Cap'n Proto. What kind of trimming parameters did you have in mind?
An Object type metadata value for storing extra sample metadata (per sketch) An optional metadata_schema value that could point to a schema for said metadata
Seems like as good a way as any.
Is name the filename for each sketch? In our implementation, we've previously also tracked the file size and an md5 checksum, which is useful for deduplicating
Usually, but for Mash it could also be a fasta tag if -i was active (in which case any description after whitespace goes in comment). To support these two modes (and format independence, e.g. fasta/fastq), I think an MD5 would have to be based on the sequence itself.
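A minimal sketch of what a sequence-based checksum could look like. The normalisation here (upper-casing, concatenating records in order) is an assumption for illustration, not something the thread or Mash specifies:

```python
import hashlib


def sequence_md5(sequences):
    """MD5 over the concatenated, upper-cased residues of the given sequences."""
    digest = hashlib.md5()
    for seq in sequences:
        # Only the residues are hashed: no headers, qualities, or line breaks.
        digest.update(seq.strip().upper().encode("ascii"))
    return digest.hexdigest()


# The same residues, whether parsed from FASTA or FASTQ, give the same checksum.
from_fasta = sequence_md5(["ACGTACGT", "GGCC"])
from_fastq = sequence_md5(["acgtacgt", "ggcc"])
```

Note that with plain concatenation, record boundaries do not affect the checksum, so two files that split the same residues differently would compare equal. Whether that is desirable would need to be agreed on.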
Is length the number of k-mers? Or file size? I think it could be useful to store both. One should probably only count valid k-mers per the alphabet in an implementation
It is the raw sequence length (or, for read sets, a genome size estimate based on k-mer content). We only use this for p-value calculation. The number of valid k-mers is currently not tracked in Mash, but I agree that allowing more information about this would be better.
alphabet should maybe be an enum?
I assume you mean only allowing nucleotide/protein options? For Mash we kept it as generic as possible in case someone would want to do text mining or something, but I could see the argument for that.
hashSeed should probably be nullable since not all hash functions expose a seed (or one just makes it 0 in that case?)
Makes sense. Regarding the hash function itself, I have it as an enum but am not sure if that's the best way. It would ensure specificity but would also require a new schema version to use any other hash function.
In general we could also allow additional fields ("additionalProperties" : true), though I feel like that would weaken the schema as a validation tool. Of course, if people end up ignoring schema compliance to get functionality, that's not very useful either.
The suggested JSON format appears to be extremely inefficient. BBMap's sketches look like this:
#SZ:30 CD:AD GS:1430 ID:393251 NM:Paenibacillus nanensis NM0:gi|343200804|ref|NR_041491.1| Paenibacillus nanensis strain MX2-3 16S ribosomal RNA gene, partial sequence
7XJemnRFJVG
HXETE>jil
18BHI?<JhP
1?ZKWJ=1CU
48anA5Vkc
1<7TlMGgOo
AT`\bZgKR
2]nnIcK;I_
...etc. They are coded in 2-bit format ASCII-48 with delta-compression so even when you gzip them the size is only reduced by ~30%. So, they are extremely efficient to store and load. I suggest, if the goal is to make an efficient interoperable standard, that you adopt something similar and abandon JSON.
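For readers unfamiliar with the idea, the following is a rough sketch of delta-compression with an ASCII offset of 48. It is not BBMap's actual codec: the 6-bits-per-character digit width, the most-significant-digit-first order, and the function names are all assumptions made for illustration.

```python
def encode_deltas(hashes):
    """Delta-encode sorted hashes, writing each delta in base 64 as chr(48 + digit)."""
    lines, previous = [], 0
    for value in sorted(hashes):
        delta = value - previous
        previous = value
        digits = []
        while True:
            digits.append(chr(48 + (delta & 63)))  # lowest 6 bits -> one character
            delta >>= 6
            if delta == 0:
                break
        lines.append("".join(reversed(digits)))  # most significant digit first
    return lines


def decode_deltas(lines):
    """Invert encode_deltas: rebuild each delta, then take the running sum."""
    hashes, total = [], 0
    for line in lines:
        delta = 0
        for char in line:
            delta = (delta << 6) | (ord(char) - 48)
        total += delta
        hashes.append(total)
    return hashes


original = [5, 70, 4200, 4201]
encoded = encode_deltas(original)
decoded = decode_deltas(encoded)
```

Because sorted hash values are stored as differences, most lines stay short, which is where the compactness comes from.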
@bbushnell: Binary formats are pretty much always more efficient than text formats, and I'd expect tools that care about performance to use their own representation. IIUC this is an initial attempt at a data exchange format between tools. I'd say JSON is attractive because tooling for it is ubiquitous (libraries exist for all languages), and because it defines basic structures like arrays and key-value maps, it lets us focus on the content first (what the format should contain).
As use cases emerge there might be a need to optimize, maybe in incremental steps (2-bit packing of k-mers, although this is limited to {A,(T|U),C,G} sequences; encoding the arrays of minhash values as byte-packed strings; etc.), but this would be for later?
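As an illustration of the 2-bit packing mentioned above, here is a minimal sketch. The base-to-code mapping and the function names are assumptions, and any character outside {A,C,G,T,U} is rejected, which is exactly the limitation noted:

```python
# Assumed mapping; T and U share a code so DNA and RNA pack identically.
CODES = {"A": 0, "C": 1, "G": 2, "T": 3, "U": 3}


def pack_kmer(kmer):
    """Pack a k-mer into an integer, 2 bits per base (raises KeyError on other characters)."""
    value = 0
    for base in kmer.upper():
        value = (value << 2) | CODES[base]
    return value


def unpack_kmer(value, k):
    """Unpack a 2-bit packed integer back into a DNA k-mer of length k."""
    bases = "ACGT"
    return "".join(bases[(value >> (2 * i)) & 3] for i in reversed(range(k)))


packed = pack_kmer("GATTACA")
restored = unpack_kmer(packed, 7)
```

The length k has to be stored alongside the packed value, since leading A bases pack to zero bits.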
BBMap's sketch format is text, not binary, as you can see from my post - that is the exact, literal, first 9 lines of a BBMap sketch. Binary might be more efficient, but then you can't look at the sketches in a text editor, so I'm not really interested in that. They already are only 150% of their gzipped size, so I don't think it's much of a problem.
Oh - as for noncanonical bases... yes, you're right. My sketch format can only accommodate ACGT. I don't think this is a problem, though, because... well, what is the goal of sketches? It's to rapidly evaluate whether sequences are similar. Does anyone care whether you have a poly-N sequence that matches everything? ...no.
@bbushnell I agree with @lgautier that interoperability is the primary goal of this effort, above efficiency. I think the point is that if parsing requires any other custom code or less-than-mainstream libraries, one might as well maximize efficiency with a binary format. This was certainly the motivation behind our use of Cap'n Proto serialization (which actually does provide a schema and libraries for several languages). The ASCII encoding is an interesting middle ground if we want to compress the string within the JSON in the future, but any such solution would have to support the protein alphabet at the very least.
Agree with all points raised by @lgautier and @ondovb
@boydgreenfield's suggestion to add count-based trimming points out that, for DNA, RNA, or protein data, the definition of a minhash sketch extends beyond the definition of a hash function (which the redundancy of sharing k-mers/n-tuples together with their hash values would let one verify empirically), and should cover, to some extent, the nature of the data and the pre-processing that led to the sketch. In a way this is part of the "metadata" that was also suggested. Fully defining it is a complex problem that should probably stay out of scope, but at the same time the information might be important for making use of the sketch / signature (one of the reasons they are exchanged in the first place).
For example, whether a DNA minhash sketch is built from a complete assembled genome or from shotgun sequencing reads for a given genome influences what the sketch means and how it can be used. I am thinking more specifically of the use case where the subset of k-mers that makes up a sketch is used to query a database / service for a matching signature. With a convention, the server might be able to answer in the best way (e.g., prioritize results or adjust thresholds when searching).
My initial feeling is that this looks like opening a Pandora's box, but I also think that the major use cases can be defined and covered well enough to have a practical exchange format.
Would the notion of hash value-level metadata and minhash sketch-level metadata be an interesting starting point?
- hash value-level would be:
  - hash values sorted (say, in increasing order)
  - required hash value-level metadata is the sequence (k-mer/n-gram) from which the hash is computed
  - optional hash value-level metadata can comprise a count (can other hash value-level fields be included?)
- minhash sketch-level would be:
  - filter (metadata Pandora's box again here, but maybe common filters can be agreed on? e.g., count, complexity)
  - total number of k-mers/n-grams evaluated for inclusion in the minhash sketch
alphabet should maybe be an enum?
I assume you mean only allowing nucleotide/protein options? For Mash we kept it as generic as possible in case someone would want to do text mining or something, but I could see the argument for that.
I think I agree with @ondovb: the alphabet explicitly defines the space of k-mers / n-grams. It is not space-optimal (e.g. all amino acids are repeated with each minhash sketch of polypeptides), but the minhash sketch itself is likely taking much more space anyway. It would also allow exotic bases, and all sorts of oddities synthetic biology may come up with.
hashSeed should probably be nullable since not all hash functions expose a seed (or one just makes it 0 in that case?)
Makes sense. Regarding the hash function itself, I have it as an enum but am not sure if that's the best way. It would ensure specificity but would also require a new schema version to use any other hash function.
The definition of the hash function can be relaxed to a string. There can be commonly agreed-upon hash functions, but even so, the redundancy of sharing hash values along with their originating k-mers/n-tuples is there to double-check it empirically.
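The empirical double-check could look roughly like this. The hash used here is a stand-in built on MD5 purely so the example runs with the standard library; it is not the hash Mash uses, and the registry name `md5-64-demo` and the `kmers`/`hashes` field names are hypothetical (`hashFunction` and `hashSeed` are the names used in this thread):

```python
import hashlib


def stand_in_hash(kmer, seed=0):
    """Stand-in 64-bit hash (first 8 bytes of MD5 over seed + k-mer); illustration only."""
    data = seed.to_bytes(4, "little") + kmer.encode("ascii")
    return int.from_bytes(hashlib.md5(data).digest()[:8], "little")


# Hypothetical registry mapping the free-form hashFunction string to an implementation.
HASH_FUNCTIONS = {"md5-64-demo": stand_in_hash}


def verify_sketch(sketch):
    """Return True/False if the hashes can be re-derived, or None if the function is unknown."""
    func = HASH_FUNCTIONS.get(sketch["hashFunction"])
    if func is None:
        return None
    seed = sketch.get("hashSeed") or 0  # treat a null seed as 0 (an assumption)
    return all(func(k, seed) == h for k, h in zip(sketch["kmers"], sketch["hashes"]))


kmers = ["ACGTA", "TTGCA"]
good = {
    "hashFunction": "md5-64-demo",
    "hashSeed": None,
    "kmers": kmers,
    "hashes": [stand_in_hash(k) for k in kmers],
}
tampered = dict(good, hashes=[good["hashes"][0], good["hashes"][1] ^ 1])
unknown = dict(good, hashFunction="something-else")
results = (verify_sketch(good), verify_sketch(tampered), verify_sketch(unknown))
```

A reader that does not recognise the hash function name can still use the k-mers directly, which is one argument for keeping them in the exchange format.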
That's fine. All I care about is efficiency, which is the point of min-hash-sketch. I'm surprised that you guys are willing to compromise efficiency for a basically intangible benefit of interoperability which may or may not happen. Good luck!
@ondovb BBMap's sketch format uses 2-bit notation, but there's no problem with using 5-bit instead, to support proteins... BBMap does not currently support 5-bit format, but I could certainly add it if it would be useful. Currently, it's designed to match nucleotide sequences.
@lgautier @ondovb I agree on both fronts re: the alphabet (to be clear, we're suggesting a list of valid characters as a string, correct?) and am fine with relaxing the hashFunction to a string. I also like your suggestions re: additional "how was this sketch constructed?" information @lgautier. Perhaps we can call all of these params so there's also a place to put metadata about the file being sketched, e.g., this is a sketch of short-read NGS data from a stool sample.
@bbushnell I think the point here is to get to something easy enough to use for interoperability, and so we should try to optimize for parse-ability and ease-of-correct-implementation over efficiency. E.g., we've actually been storing all of our min-hashes as binary data in Postgres.
@boydgreenfield
If you are interested in interoperability, doesn't it make more sense to store data as text? Personally, I consider binary formats to be inherently non-interoperable.
To emphasize this - I have written a lot of tools. All of them support text formats. I have zero interest in writing programs to read custom binary formats that are language-specific or format-specific, when they are less efficient than a text-based protocol.
@bbushnell Maybe a slight misunderstanding here. Text vs. binary was perhaps not the best way to describe it, and it was (inaccurately) implied that your format was the binary one. In other words, everyone has a text format, and that is not why JSON is being considered.
@boydgreenfield Yes, I was suggesting the alphabet be a string, any characters in which would be considered valid. Characters that are different cases of the same letter would also be considered valid unless preserveCase were true. This is based on Mash's default case-insensitivity, but for this format it may make more sense to invert the parameter to ignoreCase.
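A minimal sketch of the validity rule as described, assuming a helper named `is_valid_kmer` (hypothetical) and taking `alphabet` and `preserveCase` from the sketch's parameters:

```python
def is_valid_kmer(kmer, alphabet, preserve_case=False):
    """Check that every character of kmer is allowed by the alphabet string."""
    if preserve_case:
        allowed = set(alphabet)
    else:
        # Either case of each alphabet letter is accepted.
        allowed = set(alphabet.upper()) | set(alphabet.lower())
    return all(char in allowed for char in kmer)


lower_ok = is_valid_kmer("acgt", "ACGT")
lower_strict = is_valid_kmer("acgt", "ACGT", preserve_case=True)
has_n = is_valid_kmer("ACGN", "ACGT")
```

Inverting the parameter to ignoreCase would only flip the default branch; the alphabet string itself stays the same.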
I've updated schema-1.0.0.json in the repo to address some of these issues and I've added comments, which may not be strictly valid JSON but seem to be ignored by the validator I'm using.
- additional properties - We've changed it to allow these, with the idea that this could be a minimal schema that can have additional information (e.g. metadata or novel filters) layered on top in derived schemas. This way tools could adapt the format for their own needs while being able to convey the most necessary information to other tools, with the extra data simply being ignored.
- version - I addressed this with a string that is expected to be the URI of the schema used, whose name would include a version.
- filters - There is a dedicated property for these, currently inhabited only by the minimum copy number filter.
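As a rough illustration of a minimum copy number filter over parallel hash and count arrays (the function and argument names are hypothetical). Note that this applies the threshold to an already-built sketch, which is not necessarily equivalent to filtering k-mers while the sketch is being constructed:

```python
def apply_min_copies(hashes, counts, min_copies):
    """Keep only hashes whose count is at least min_copies, preserving order."""
    kept = [(h, c) for h, c in zip(hashes, counts) if c >= min_copies]
    return [h for h, _ in kept], [c for _, c in kept]


filtered_hashes, filtered_counts = apply_min_copies([10, 20, 30], [1, 3, 2], 2)
```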
We plan on updating Mash to read and write the format as proposed soon, but others are welcome to continue working on standards related to metadata or to create a shared repo. For a name, I would like to propose Jam (JSON MinHash), in keeping with the edibles theme :P
"Jam" has a nice ring, as it can also mean an informal and spontaneous musical performance. Visibility in search engines might be another matter though.
I am about ready to write read/write code for that format but I have a question about the license for the JSON definition being discussed: what is it released under ? (CC-like would seem to make sense).
@lgautier Public domain (I'm a govt employee). If someone else wants to open a repo and merge contributions from others, then I'd vote for CC0.
Thanks. Public domain is good to start with. We can see later on whether anything else is needed because of contributions or the like.
In case anyone is looking for the schema: the URL at the top of this thread appears no longer valid. It is here: https://github.com/marbl/Mash/blob/master/src/mash/schema-1.0.0.json