Clarification of "Safe" binary encoding

Open Zankoku-Okuno opened this issue 2 years ago • 5 comments

I'm running into some issues determining a precise meaning of the specification for encoding BigInteger and BigDecimal.

The specification states:

"Safe" binary encoding simply uses 7 LSB: data is left aligned (i.e. any padding of the last byte is in its rightmost, least-significant, bits).

Which led me to believe that 7 LSB is a well-known encoding method that I simply hadn't yet heard of. Unfortunately, Google searches for "7 LSB" encoding (and similar) only turned up

  • some Lucene documentation on Vint8
  • a bunch of steganography-related content
  • this specification

I spotted the phrase "7/8 encoding" in the description of token type 0xE8 (which I think is the same as the "safe binary encoding" mentioned earlier), but that search term leads to

  • a question on Xilinx Support (so probably about an FPGA or DSP, or some other hardware)
  • a paper in the field of "The translation of finite CSPs into SAT"
  • patents

Digging deeper led to FasterXML/jackson-dataformats-binary#37, which has been open and afaict essentially untouched for five years now. At this point I have to assume that the implementation has won by default, and that's how we'll be proceeding with our own encoder unless integration testing says otherwise.

Anyway, I figured I'd show up to confirm y'all's notion of safe binary encoding (or 7 LSB, or 7/8 encoding) according to the behavior of the implementation:

Safe binary encoding is an 8-bit clean encoding of arbitrary byte arrays. The byte array is segmented into seven-bit groups from lowest-to-highest index and most-to-least significant bit within each byte. Each group is then padded with leading zero bits until the group is the size of a full octet. Note that this means that each full seven bit group is prepended with a single leading zero bit, but the last group (which may have fewer than seven bits) contains its data in the rightmost (least-significant) bits. (Note: the standard at time of writing specifies that a partial trailing group contains padding in its "rightmost, least-significant bits". I'm honestly not sure which is more "elegant" and don't really have an opinion other than "break the fewest things".)

As an example, the encoding of the array {0xDE, 0xAD, 0xBE, 0xEF} is performed as follows:

DE AD BE EF
11011110 10101101 10111110 11101111          // convert to binary
1101111 0101011 0110111 1101110 1111         // segment into 7-bit groups
01101111 00101011 00110111 01101110 1111     // prepend zero bit on each full group
01101111 00101011 00110111 01101110 00001111 // prepend zero padding on the trailing partial group
6F 2B 37 6E 0F                               // convert to octets
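
For anyone who wants to check their own encoder against the steps above, here is a minimal Python sketch of the procedure as described in this issue (trailing partial group right-aligned, i.e. zero-padded on the left). The names `safe_encode` and `safe_decode` are hypothetical and appear in neither the specification nor any implementation. The decoder assumes the original byte length is known from elsewhere, since the encoded octets alone do not say how many bits of the last octet are data.

```python
def safe_encode(data: bytes) -> bytes:
    """Encode bytes into octets that each carry at most 7 data bits.

    Assumption (taken from the description in this issue, not from the
    specification text): a trailing partial group is right-aligned.
    """
    out = bytearray()
    acc = 0    # pending bits, most significant first
    nbits = 0  # how many bits of acc are still pending
    for byte in data:
        acc = (acc << 8) | byte
        nbits += 8
        while nbits >= 7:
            nbits -= 7
            out.append((acc >> nbits) & 0x7F)  # full group, top bit is 0
        acc &= (1 << nbits) - 1  # keep only the bits not yet emitted
    if nbits:
        out.append(acc)  # partial group, data in the low bits
    return bytes(out)


def safe_decode(encoded: bytes, length: int) -> bytes:
    """Inverse of safe_encode; `length` is the original byte count."""
    full_groups, tail_bits = divmod(length * 8, 7)
    out = bytearray()
    acc = 0
    nbits = 0
    for i, octet in enumerate(encoded):
        # Full groups carry 7 bits; the last octet may carry fewer.
        width = 7 if i < full_groups else tail_bits
        acc = (acc << width) | (octet & ((1 << width) - 1))
        nbits += width
        while nbits >= 8:
            nbits -= 8
            out.append((acc >> nbits) & 0xFF)
        acc &= (1 << nbits) - 1
    return bytes(out)


print(safe_encode(bytes.fromhex("DEADBEEF")).hex())  # 6f2b376e0f
```

Switching to the left-aligned tail that the specification text describes would only change the final `out.append(acc)` in the encoder (shift the partial group left by `7 - nbits`) and the matching extraction in the decoder.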

Zankoku-Okuno, Jan 25 '22 17:01