Ensure that std::float16_t/std::bfloat16_t support is exact

Open lemire opened this issue 11 months ago • 0 comments

In release 8.0.0, we support float16_t and bfloat16_t (thanks @dalle). We have reasonable testing and our code is based on an implementation publicly available since GCC 13 (thanks @jakubjelinek for providing support).

However, issues remain:

At least one issue was identified. It is a minor issue so we still released, but it should be fixed. With float16_t, the smallest positive value (a subnormal) that can be represented is 2**-24. Consider 2.98023223876953125E-8, which is exactly 2**-25. This value is exactly the midpoint between the float16 zero and the smallest positive float16 value. It should parse to zero (with round-to-even), but it does not. GCC shares this issue and it is quite minor. But there may be other similar issues, and this requires investigation. One cause of the issue is that our subnormal code assumes that there cannot be short strings requiring round-to-even: that is a mathematically proven assumption for 32-bit and 64-bit floats. However, it is not true in general.
More generally, we need to go through both the float16_t and bfloat16_t code and prove (mathematically) that all parameters are correct and optimal.
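
To make the midpoint case above concrete, here is a minimal sketch of a reference check. It does not use fast_float or std::float16_t: the helper names `round_to_float16_subnormal` and `expected_float16_value` are hypothetical, and the code assumes the decimal input is exactly representable as a double (true for 2**-25 and its small multiples). Under that assumption, it computes the value that round-to-nearest, ties-to-even would produce on the float16 subnormal grid.

```cpp
#include <cassert>
#include <cmath>
#include <cstdlib>

// Hypothetical helper (not part of fast_float): rounds a non-negative double
// lying below the smallest normal float16 (2^-14) onto the float16 subnormal
// grid, whose spacing is 2^-24, using round-to-nearest, ties-to-even.
// Assumption: x is the exact value of the decimal input. If the decimal were
// not exactly representable as a double, this two-step path could suffer
// from double rounding and would not be a valid reference.
inline double round_to_float16_subnormal(double x) {
  assert(x >= 0.0 && x < std::ldexp(1.0, -14));
  double scaled = std::ldexp(x, 24);   // exact: x expressed in units of 2^-24
  double n = std::nearbyint(scaled);   // ties-to-even in the default rounding mode
  return std::ldexp(n, -24);           // exact: back to the original scale
}

// Hypothetical convenience wrapper: parse the decimal string, then round.
inline double expected_float16_value(const char *decimal) {
  return round_to_float16_subnormal(std::strtod(decimal, nullptr));
}
```

With this sketch, `expected_float16_value("2.98023223876953125E-8")` is the tie between 0 and 2**-24 and rounds to the even neighbor, zero, whereas `"8.94069671630859375E-8"` (3 times 2**-25) is the tie between 2**-24 and 2**-23 and rounds up to 2**-23. A parser result for float16_t could be compared against such a reference after widening it to double.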

Feb 07 '25 01:02 lemire