Brian Bushnell
The suggested JSON format appears to be extremely inefficient. BBMap's sketches look like this: #SZ:30 CD:AD GS:1430 ID:393251 NM:Paenibacillus nanensis NM0:gi|343200804|ref|NR_041491.1| Paenibacillus nanensis strain MX2-3 16S ribosomal RNA gene, partial...
BBMap's sketch format is text, not binary, as you can see from my post - that is the exact, literal, first 9 lines of a BBMap sketch. Binary might be...
Oh - as for noncanonical bases... yes, you're right. My sketch format can only accommodate ACGT. I don't think this is a problem, though, because... well, what is the goal...
That's fine. All I care about is efficiency, which is the point of min-hash-sketch. I'm surprised that you guys are willing to compromise efficiency for a basically intangible benefit of...
@ondovb BBMap's sketch format uses 2-bit notation, but there's no problem with using 5-bit instead, to support proteins... BBMap does not currently support 5-bit format, but I could certainly add...
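For readers unfamiliar with the 2-bit versus 5-bit distinction, here is a minimal sketch of fixed-width symbol packing. It is an illustration only, not BBMap's actual code: the function names `pack_kmer` and `unpack_kmer`, the A=0, C=1, G=2, T=3 mapping, and the 20-letter amino acid ordering are all assumptions made for this example. It also shows why a 2-bit alphabet cannot hold anything other than ACGT.

```python
# Hypothetical illustration of fixed-width symbol packing; not BBMap's code.
# The symbol-to-code mappings below are assumptions chosen for the example.
NUCLEOTIDE_CODES = {"A": 0, "C": 1, "G": 2, "T": 3}  # fits in 2 bits
AMINO_CODES = {aa: i for i, aa in enumerate("ACDEFGHIKLMNPQRSTVWY")}  # 20 symbols, needs 5 bits


def pack_kmer(kmer, bits_per_symbol=2, codes=NUCLEOTIDE_CODES):
    """Pack a k-mer into one integer, bits_per_symbol bits per symbol."""
    value = 0
    for symbol in kmer.upper():
        if symbol not in codes:
            # A 2-bit alphabet has no spare code for N or other ambiguity symbols.
            raise ValueError(f"symbol {symbol!r} is not in the alphabet")
        value = (value << bits_per_symbol) | codes[symbol]
    return value


def unpack_kmer(value, k, bits_per_symbol=2, codes=NUCLEOTIDE_CODES):
    """Invert pack_kmer for a k-mer of known length k."""
    symbols = {code: symbol for symbol, code in codes.items()}
    mask = (1 << bits_per_symbol) - 1
    out = []
    for _ in range(k):
        out.append(symbols[value & mask])  # lowest bits hold the last symbol
        value >>= bits_per_symbol
    return "".join(reversed(out))


if __name__ == "__main__":
    packed = pack_kmer("ACGT")
    print(packed, unpack_kmer(packed, 4))
    print(pack_kmer("CA", bits_per_symbol=5, codes=AMINO_CODES))
```

At 2 bits per symbol a 31-mer fits in 62 bits of a 64-bit word; at 5 bits per symbol the same word holds at most 12 symbols, which is the trade-off behind supporting proteins.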
@boydgreenfield If you are interested in interoperability, doesn't it make more sense to store data as text? Personally, I consider binary formats to be inherently non-interoperable.
To emphasize this - I have written a lot of tools. All of them support text formats. I have zero interest in writing programs to read custom binary formats that...