
Simplify bitfield format; size, ordering, etc.

Open rvagg opened this issue 5 years ago • 7 comments

In https://github.com/ipld/specs/pull/282 I'm claiming that the map (bitfield) is addressed here in the same way as in the specification, and therefore as in the reference implementation for the IPLD HashMap. The spec currently doesn't use very clear language about how you address a specific bit of a byte array, but it assumes standard LE addressing, where "reading index bit of the byte array" implies a certain ordering that you can apply masks to. I don't know how the Go bitfield works, but it would be good to verify that the bit checking used in this repo matches the operations on the same byte array in https://github.com/rvagg/iamap/blob/master/bit-utils.js, and if not, to document exactly how it works so implementers can be crystal clear.
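
To make the addressing question concrete, the check in bit-utils.js looks roughly like this when ported to Go (a hypothetical sketch assuming LSB-first ordering within each byte; that ordering is exactly the assumption that needs verifying, not an established fact about either implementation):

```go
// Hypothetical Go port of the bitmapHas check from bit-utils.js,
// assuming LSB-first bit ordering within each byte.
func bitmapHas(bitmap []byte, index int) bool {
	byteIndex := index / 8      // which byte holds this bit
	bitIndex := uint(index % 8) // position within that byte, LSB first
	return bitmap[byteIndex]&(1<<bitIndex) != 0
}
```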

rvagg avatar Jul 22 '20 02:07 rvagg

/cc @warpfork

rvagg avatar Jul 22 '20 02:07 rvagg

Changed the title of this after playing a bit with the format. Dropping some thoughts here as I explore this.

The bitfield at the moment uses a big.Int, which can be serialized to a compact big-endian byte representation (some of the guts are in https://golang.org/src/math/big/nat.go with the API in https://golang.org/src/math/big/int.go). We're using the Bytes() and SetBytes() methods for serialization and deserialization. What this gives us is the most compact big-endian representation of the number it holds. We're using SetBit() to set individual bits on and off to indicate the presence or absence of an element in our data array, so the number itself is arbitrary; it's the bits that matter.
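
The mechanics, roughly (a minimal runnable sketch of the big.Int usage described above, not the library's actual code):

```go
package main

import (
	"fmt"
	"math/big"
)

func main() {
	// SetBit toggles presence bits; Bytes() produces the most
	// compact big-endian form; SetBytes() reverses it.
	bf := new(big.Int)
	bf.SetBit(bf, 3, 1) // an element is present at index 3

	out := bf.Bytes()        // compact big-endian serialization
	fmt.Printf("% x\n", out) // 08

	in := new(big.Int).SetBytes(out) // deserialization
	fmt.Println(in.Bit(3))           // 1
}
```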

The maximum length of the bitfield should be 2^bitWidth bits, enough to cover all the indexing we need for any node. So a bitWidth of 8 gives us 256 bits to store our bitmap data. If we were to turn on all the bits because we have a full data array, we'd end up serializing 0xff... for 32 bytes, i.e. 256 1's. But if we only tinker with the first 8 bits, then we only need to serialize one byte. E.g. if we only had an element at index 0 then our bitfield would serialize as a single byte, 0x01, but if we only set bit 8 then we need two bytes and would serialize 0x0100, and so on.
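
Continuing the sketch above, the serialized length tracks the highest bit set, not the bitWidth:

```go
bit0 := new(big.Int).SetBit(new(big.Int), 0, 1)
fmt.Printf("% x\n", bit0.Bytes()) // 01 (one byte)

bit8 := new(big.Int).SetBit(new(big.Int), 8, 1)
fmt.Printf("% x\n", bit8.Bytes()) // 01 00 (two bytes)

// all 256 bits on: (1 << 256) - 1
full := new(big.Int).Sub(new(big.Int).Lsh(big.NewInt(1), 256), big.NewInt(1))
fmt.Println(len(full.Bytes())) // 32
```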

Filecoin only uses a bitWidth of 5, so that's 32 bits, or 4 bytes needed to represent the bitfield.

Some thoughts about this format:

  • It's somewhat Go specific. It's convenient to get these bits from a big.Int and set them on a big.Int, but it's going to be slightly annoying for everyone else unless they already have something that works in exactly the same way. The ideal internal representation is for a node to have a bitfield of exactly 2^bitWidth bits ready and available to set and unset bits on. The convenience of big.Int bypasses this entirely, but that's not going to be the same story across languages. I have https://github.com/rvagg/iamap/blob/master/bit-utils.js for this in JS, but to go to and from this serialization format I'd have to trim the left-most zero bytes on the way in and pad them back on the way out.
  • The randomness of the hash algorithm means that the chances of setting the 32nd bit are the same as setting the 1st. There's no bias toward smaller bits built-in here. So we buy some compaction in the serialization by using this truncated format, but it's only going to get us so far in a HAMT that's got more than a few values in it.
  • In Filecoin the bitWidth is 5, so 32 bits, or a maximum of 4 bytes in the serialization format of this field. There will be some cases with only 3 bytes, fewer with 2, and fewer still with just 1. We're saving bytes, but not many, and at the cost of complexity for everyone but Go implementers.
  • It's hard to validate canonical block form. The current implementation will (probably) take an arbitrarily long byte array and turn it into a valid big.Int. It'll treat as valid a block that has a 1,000-byte array in the position for the bitfield (I believe big.Int can handle this kind of arbitrary size), but it would then round-trip it back out as just-long-enough if the block were re-serialized. (So this is in a similar category to the problems suggested in https://github.com/filecoin-project/specs/issues/1045.) A sketch of what strict validation would need to check follows this list.
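
That validation sketch (a hypothetical helper assuming the usual fmt import; nothing like this exists in the library today):

```go
// validBitfieldBytes is a hypothetical strict check for the current
// truncated format: a byte array only round-trips through big.Int
// unchanged if it is no longer than necessary and has no leading
// zero byte.
func validBitfieldBytes(b []byte, bitWidth uint) error {
	maxLen := (1 << bitWidth) / 8 // 4 bytes for Filecoin's bitWidth of 5
	if len(b) > maxLen {
		return fmt.Errorf("bitfield longer than %d bytes", maxLen)
	}
	if len(b) > 0 && b[0] == 0 {
		return fmt.Errorf("bitfield has a leading zero byte (non-minimal)")
	}
	return nil
}
```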

I don't have a strong opinion here yet; I'd like to hear others' thoughts. My personal preference would be for it to be stable and consistent, with the bitfield byte array in CBOR being exactly 2^bitWidth bits (exactly 4 bytes, every time, for Filecoin) so serialization, validation and explanation of this spec are simpler than they currently are. I doubt that the number of bytes being saved here is very meaningful, but it's not zero.
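
Concretely, the fixed-size alternative could look something like this (hypothetical helpers, still assuming a big.Int-backed bitfield and the usual fmt and math/big imports; the names are mine, not a proposed API):

```go
// marshalBitfield always emits exactly (2^bitWidth)/8 bytes,
// left-padding the compact big-endian form with zeros.
func marshalBitfield(bf *big.Int, bitWidth uint) []byte {
	size := (1 << bitWidth) / 8
	out := make([]byte, size)
	b := bf.Bytes() // never longer than size while bits stay within 2^bitWidth
	copy(out[size-len(b):], b)
	return out
}

// unmarshalBitfield rejects anything that isn't exactly the right
// length, which makes canonical-form validation trivial.
func unmarshalBitfield(b []byte, bitWidth uint) (*big.Int, error) {
	if want := (1 << bitWidth) / 8; len(b) != want {
		return nil, fmt.Errorf("bitfield must be exactly %d bytes, got %d", want, len(b))
	}
	return new(big.Int).SetBytes(b), nil
}
```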

@warpfork @Stebalien @anorth thoughts?

rvagg avatar Aug 07 '20 04:08 rvagg

Thoughts in no particular order:

  • Agree that this is definitely a canonicalization concern and is important to specify.
  • Therefore, agree that there should be a mode flag (on by default, or just hardcoded as true) for strict validation of the bitfield byte size in serial data, one way or another.
  • Expending an expected value of (4-1)/2 = 1.5 bytes as "waste" per bitfield (given max four bytes, and random distribution, as you outlined) seems near to a rounding error... and that's the cost when one bit is set; the expected value for "waste" decreases as more bits are set. So if even a handful of bits are set, the expected waste is less than a single byte. Seems pretty cheap for the consistency and simplicity gain of a fixed size. (The arithmetic is sketched after this list.)
    • And there's one bitfield stored per block, just to sanity check I've got that right? Yeah, this seems very low cost indeed.
  • Agree that seeing bigint just generally sets my spider senses a'tingle. There's nothing that's hard-edged wrong about it, but it would make me more comfortable if we didn't have it.
    • Anecdata: I have (distant) unhappy memories of learning the details of Java's BigInteger the hard way; something about the sign being stored in not at all the way I would've initially expected, and that being reflected in how many bytes were in the serial form. I don't have specific opinions about Go's big.Int, and I haven't made a specific comparison of it to other bigint implementations such as Java's BigInteger... but I bet they're different in "interesting" ways; and even if it turns out they aren't, I think I'd rather not have to think about it (and more importantly, not ask other implementers in other languages to do it either, for whatever their bigint options might be).
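
Sketching that waste arithmetic (my assumptions: bitWidth 5, exactly one bit set, its index uniform over 0..31):

```go
// expectedWasteBytes checks the (4-1)/2 = 1.5 figure: with one bit
// set at a uniformly random index in 0..31, the truncated encoding
// needs index/8 + 1 bytes, versus a fixed 4.
func expectedWasteBytes() float64 {
	const maxBytes = 4
	total := 0.0
	for i := 0; i < 32; i++ {
		needed := i/8 + 1
		total += float64(maxBytes - needed)
	}
	return total / 32 // = 1.5
}
```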

So, I haven't estimated how much work it would be to change this, but I think I broadly agree with your preference for the bitfield byte array being exactly 2^bitWidth bits.

warpfork avatar Aug 07 '20 11:08 warpfork

@warpfork For comparison, we don't use Java's BigInteger in our implementation, but BitSet, which is much simpler and, I think, in line with Go's big.Int? https://docs.oracle.com/javase/8/docs/api/java/util/BitSet.html

ianopolous avatar Aug 07 '20 11:08 ianopolous

I agree with just storing the entire bitmap.

Stebalien avatar Aug 08 '20 02:08 Stebalien

Storing the bitfield explicitly sgtm.

anorth avatar Aug 08 '20 04:08 anorth

done in https://github.com/ipfs/go-hamt-ipld/pull/63

rvagg avatar Aug 13 '20 13:08 rvagg