concise-encoding icon indicating copy to clipboard operation
concise-encoding copied to clipboard

Parsing variable-width integers in CBE

Open AkshatM opened this issue 3 years ago • 2 comments

The spec indicates the encoding is [type] [length] <integer in byes>, where length is an unsigned LEB128.

However, since unsigned LEB128 is also variable-width, how would a binary parser tell when length ends and the number actually begins? I took a look at go-concise-encoding but was unable to follow where the actual binary parsing began.

AkshatM avatar Jan 11 '22 00:01 AkshatM

LEB128 encoding stores data in 7-bit groups with a continuation bit field in the high bit of each group. So decoding of the length field continues progressively as long as the high bit of each byte you read is 1. Once it's 0, the current encoded group is the last one in the series (in this case, the end of the length field, so the next byte after it is the first byte of the payload).

Here's an example from https://en.wikipedia.org/wiki/LEB128 that encodes the value 624485:

MSB ------------------ LSB
      10011000011101100101  In raw binary
     010011000011101100101  Padded to a multiple of 7 bits
 0100110  0001110  1100101  Split into 7-bit groups
00100110 10001110 11100101  Add high 1 bits on all but last (most significant) group to form bytes
    0x26     0x8E     0xE5  In hexadecimal

→ 0xE5 0x8E 0x26            Output stream (LSB to MSB)

So if you were reading a HUGE integer that required 624485 bytes to store, you'd have the data [e5 8e 26 ff ...]:

  • e5 has the high bit set, so you keep reading more length data.
  • 8e has the high bit set so you keep reading more length data.
  • 26 has the high bit cleared, so this is the last group of the length data.
  • ff is the first (low) byte of the 624485 bytes of actual numeric data.

Note that for lengths 0-127, the length field will only be 1 byte long because it always fits into a single 7-bit group:

  • 0 = continuation 0, data 0000000 = 0b00000000 = 0x00
  • 1 = continuation 0, data 0000001 = 0b00000001 = 0x01
  • 127 = continuation 0, data 1111111 = 0b01111111 = 0x7f

So for almost all realistic data containing variable length integers, you'll never see more than one length byte.

kstenerud avatar Jan 11 '22 05:01 kstenerud

The code that reads variable length integers in go-concise-encoding is here, which gets the length field by calling here and ultimately this

kstenerud avatar Jan 11 '22 05:01 kstenerud