concise-encoding
concise-encoding copied to clipboard
Parsing variable-width integers in CBE
The spec indicates the encoding is [type] [length] <integer in byes>, where length is an unsigned LEB128.
However, since unsigned LEB128 is also variable-width, how would a binary parser tell when length ends and the number actually begins? I took a look at go-concise-encoding but was unable to follow where the actual binary parsing began.
LEB128 encoding stores data in 7-bit groups with a continuation bit field in the high bit of each group. So decoding of the length field continues progressively as long as the high bit of each byte you read is 1. Once it's 0, the current encoded group is the last one in the series (in this case, the end of the length field, so the next byte after it is the first byte of the payload).
Here's an example from https://en.wikipedia.org/wiki/LEB128 that encodes the value 624485:
MSB ------------------ LSB
10011000011101100101 In raw binary
010011000011101100101 Padded to a multiple of 7 bits
0100110 0001110 1100101 Split into 7-bit groups
00100110 10001110 11100101 Add high 1 bits on all but last (most significant) group to form bytes
0x26 0x8E 0xE5 In hexadecimal
→ 0xE5 0x8E 0x26 Output stream (LSB to MSB)
So if you were reading a HUGE integer that required 624485 bytes to store, you'd have the data [e5 8e 26 ff ...]:
e5has the high bit set, so you keep reading more length data.8ehas the high bit set so you keep reading more length data.26has the high bit cleared, so this is the last group of the length data.ffis the first (low) byte of the 624485 bytes of actual numeric data.
Note that for lengths 0-127, the length field will only be 1 byte long because it always fits into a single 7-bit group:
- 0 = continuation 0, data 0000000 = 0b00000000 = 0x00
- 1 = continuation 0, data 0000001 = 0b00000001 = 0x01
- 127 = continuation 0, data 1111111 = 0b01111111 = 0x7f
So for almost all realistic data containing variable length integers, you'll never see more than one length byte.