Floating point variable-length encoding is optimized for values like 0.000000000000000000000000000000000000000000014

Open cormac-ainc opened this issue 1 year ago • 5 comments

I'm coming at this from the perspective of writing a musli-wire format for a very float-heavy data structure. As a primary observation, floats in musli-wire are encoded by bitcasting them to u32, and then encoding that integer using the chosen integer encoding (fixed or variable, default variable).

Variable-length integer encoding is of little use for floats. If you bitcast 1.0f32 to a u32, you get 1065353216 (0x3F800000). In variable encoding this still needs a full 4 bytes for the value, resulting in [68, 63, 128, 0, 0]: five bytes in total, the last four being the big-endian bytes of that integer. The circumstances under which you get an int representation small enough to benefit from variable-length encoding are not obvious without studying IEEE 754, and the set of values that gets optimised is not a useful one: f32::from_bits(10) prints as 0.000000000000000000000000000000000000000000014. It doesn't help with "small integral floats" or even something like "values in a small range with 8-bit quantization". It only helps with extremely small non-negative floats, and +0.0.

If you have floats that close to zero at any time, you are probably using floats wrong and are about to get an enormous floating point error from calculations that use them. The wire format should not be optimized for those danger-zone floats. (Check out Herbie to see this charted visually: example where values of x close to 0 produce extremely inaccurate results.)
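To make the bit patterns concrete, here is a small sketch using only the standard library. The `varint_len` helper is hypothetical and assumes a generic LEB128-style varint (7 payload bits per byte); that is not necessarily the exact scheme musli-wire uses, but any variable-length integer scheme runs into the same problem.

```rust
/// Bit pattern of an f32, i.e. the integer that ends up being encoded.
pub fn float_bits(x: f32) -> u32 {
    x.to_bits()
}

/// Bytes needed by a LEB128-style varint (7 payload bits per byte).
/// ASSUMPTION: a generic varint layout, used here for illustration only.
pub fn varint_len(value: u32) -> usize {
    // Number of significant bits in `value`.
    let bits = 32 - value.leading_zeros() as usize;
    // Round up to whole 7-bit groups; zero still needs one byte.
    std::cmp::max(1, (bits + 6) / 7)
}
```

With these, `float_bits(1.0)` is 1065353216 and `varint_len` of it is 5, which is worse than the fixed 4 bytes. Only bit patterns below 2^28 save anything under this scheme, and those correspond to non-negative floats with a magnitude below roughly 1e-28.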

I suggest:

  1. Add a type param to musli-wire's Encoding for floating point encoding, defaulting it to the chosen (or default) integer encoding, and therefore allowing independent selection of fixed/variable for ints, lengths, and floats.
  2. Optionally extend that with a new VariableFloat encoding that picks the smallest of f32/f24/f16/f8 to which the value converts losslessly. This is probably a lot of work. There's a C++ library called vf128 that does something in this vein: https://github.com/michaeljclark/vf128. I also think I've seen one of the serde crates doing this, but I can't remember which.
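A rough sketch of what the first suggestion could look like at the type level. Every name here (`Variable`, `Fixed`, `FloatEncoding`, this `Encoding`, the aliases) is hypothetical and only mirrors the shape of the idea; it is not musli-wire's actual API. The float parameter `F` defaults to the integer parameter `I`, so existing configurations would keep their behaviour.

```rust
use std::marker::PhantomData;

// Hypothetical marker types, not musli-wire's real ones.
pub struct Variable;
pub struct Fixed;

/// Hypothetical trait: how many bytes an f32 payload takes.
pub trait FloatEncoding {
    fn f32_len(x: f32) -> usize;
}

impl FloatEncoding for Fixed {
    // Fixed encoding always writes the 4 raw bytes.
    fn f32_len(_x: f32) -> usize {
        4
    }
}

impl FloatEncoding for Variable {
    // ASSUMPTION: LEB128-style varint over the bit pattern.
    fn f32_len(x: f32) -> usize {
        let bits = (32 - x.to_bits().leading_zeros()) as usize;
        std::cmp::max(1, (bits + 6) / 7)
    }
}

/// Ints (I), lengths (L) and floats (F) are chosen independently;
/// F falls back to whatever was chosen for I.
pub struct Encoding<I = Variable, L = Variable, F = I> {
    _marker: PhantomData<(I, L, F)>,
}

impl<I, L, F: FloatEncoding> Encoding<I, L, F> {
    pub fn f32_len(x: f32) -> usize {
        F::f32_len(x)
    }
}

// Floats follow the integer choice unless overridden.
pub type AllVariable = Encoding<Variable>;
pub type FixedFloats = Encoding<Variable, Variable, Fixed>;
```

Here `AllVariable::f32_len(1.0)` comes out at 5 bytes while `FixedFloats::f32_len(1.0)` is 4, with integers and lengths left on variable encoding.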

I don't personally need the latter.

Finally, because creating a musli-wire Encoding isn't very ergonomic, and in this issue + #23 I am suggesting a lot more type params on Encoding, I suggest adding an impl of Default for Encoding such that you can write:

pub const WIRE_ENCODING: musli_wire::Encoding<
    musli::mode::DefaultMode,
    musli_wire::int::Fixed,
    musli_wire::int::FixedUsize,
    ...
> = musli_wire::Encoding::default();

instead of duplicating all your choices between the type params and the builder methods.
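One caveat worth sketching: on stable Rust, a trait method such as `Default::default()` cannot be called in a `const` initializer, so for the snippet above to compile, `default` would have to be an inherent `const fn` (or sit alongside one). The types below are hypothetical stand-ins with only two parameters, not musli-wire's real definitions; the point is just that every choice is carried by the type parameters.

```rust
use std::marker::PhantomData;

// Hypothetical marker types standing in for the real configuration types.
pub struct Fixed;
pub struct FixedUsize;

/// Hypothetical encoding whose configuration lives entirely in its type params.
pub struct Encoding<I, L> {
    _marker: PhantomData<(I, L)>,
}

impl<I, L> Encoding<I, L> {
    /// Inherent const constructor, usable in `const` items.
    pub const fn new() -> Self {
        Self { _marker: PhantomData }
    }
}

impl<I, L> Default for Encoding<I, L> {
    // The trait impl is fine at runtime, but is not callable in const context.
    fn default() -> Self {
        Self::new()
    }
}

// Choices are stated once, in the type; the constructor infers them.
pub const WIRE_ENCODING: Encoding<Fixed, FixedUsize> = Encoding::new();
```

The value is zero-sized, so the constant costs nothing at runtime; it only exists to pin down the type parameters in one place.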

cormac-ainc, May 23 '23 09:05