fastutil

Floating point precision loss causes unnecessary rehashing at large set sizes

Open Captain-S0L0 opened this issue 1 year ago • 4 comments

I discovered this behavior while experimenting with very large hash sets, in particular with ObjectLinkedOpenHashSet.

The use of a float to store the load factor can lead to HashCommon.arraySize breaking its method contract and returning a power of two smaller than Math.ceil(expected / f). This can in turn cause HashCommon.maxFill to produce a new size equal to the current set size, triggering a pointless rehash to the same size.

Example: HashCommon.arraySize(50331650, 0.75f) produces 67108864, whilst Math.ceil(50331650 / 0.75d) = 67108867, so the expected result is 134217728.

Floats, of course, can only store every integer value exactly up to 8,388,607, and begin to lose precision as they grow beyond that. These particular functions (and likely others) will lose integer precision when performing operations with large floats. Using a double to store the load factor would resolve these issues, as doubles can represent every integer from 0 up to 4,503,599,627,370,495.
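
For illustration, here is a minimal sketch reproducing the numbers above (not fastutil's actual code; nextPowerOfTwo is a local stand-in for HashCommon.nextPowerOfTwo):

```java
// Minimal reproduction of the precision loss described above.
public final class ArraySizeDemo {

    // Smallest power of two greater than or equal to x (for 1 <= x <= 2^62).
    private static long nextPowerOfTwo(final long x) {
        return x <= 1 ? 1 : Long.highestOneBit(x - 1) << 1;
    }

    // Float version: expected is promoted to float, so 50331650 rounds to 50331648
    // and the quotient is exactly 67108864, already a power of two, so no bump occurs.
    static long arraySizeFloat(final int expected, final float f) {
        return nextPowerOfTwo((long) Math.ceil(expected / f));
    }

    // Double version: the division happens in double precision, the ceiling is
    // 67108867, and the next power of two is 134217728, as the contract requires.
    static long arraySizeDouble(final int expected, final double f) {
        return nextPowerOfTwo((long) Math.ceil(expected / f));
    }

    public static void main(final String[] args) {
        System.out.println(arraySizeFloat(50331650, 0.75f));  // 67108864
        System.out.println(arraySizeDouble(50331650, 0.75));  // 134217728
    }
}
```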

It may also be desirable to add a check to rehashing so that it does nothing when the new size equals the current size.
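
Something along these lines, sketched on a hypothetical stripped-down open-addressing table (field and method names here are illustrative, not fastutil's actual ones):

```java
// Hypothetical same-size guard on a stripped-down open-addressing table.
class TinyOpenHashTable {
    Object[] key;  // the backing array
    int n;         // current table size, always a power of two
    int mask;      // n - 1, used to reduce hashes to table positions

    void rehash(final int newN) {
        if (newN == n) return;  // same size, same mask: moving every key would change nothing
        final Object[] newKey = new Object[newN];
        // ... reinsert the non-null entries of key into newKey using hash & (newN - 1) ...
        key = newKey;
        n = newN;
        mask = newN - 1;
    }
}
```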

I've also included a simple demo that simulates the unnecessary rehashes and shows that using doubles resolves the issue.

FloatingPointPrecisionLossDemo.txt

Captain-S0L0 avatar Oct 19 '24 09:10 Captain-S0L0

Whoa. That's a really tough one to catch; kudos. It seems to me that the problem can be solved by simply casting f to double: at that point the integer is implicitly cast to double as well, and we end up using the Math.ceil(double) function.
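
For concreteness, a before/after sketch of that idea (not the literal patch); the only change is making the division itself happen in double precision:

```java
// Before/after sketch of the cast-to-double fix discussed above.
public final class CastFixSketch {

    static long ceilWithFloatDivision(final int expected, final float f) {
        // expected is promoted to float, the division is carried out in float,
        // and only the already-rounded quotient is widened to double for Math.ceil.
        return (long) Math.ceil(expected / f);
    }

    static long ceilWithDoubleDivision(final int expected, final float f) {
        // casting f to double promotes expected to double as well, so the
        // division is performed in double precision before taking the ceiling.
        return (long) Math.ceil(expected / (double) f);
    }

    public static void main(final String[] args) {
        System.out.println(ceilWithFloatDivision(50331650, 0.75f));   // 67108864
        System.out.println(ceilWithDoubleDivision(50331650, 0.75f));  // 67108867
    }
}
```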

Somehow I had unconsciously assumed that the computation between the promoted integer and the float would happen in double precision, which is true: the problem, however, is that the result is then cast back to float, and that is the moment where the precision loss occurs.

Do you think rehashing to the same size can happen even when using double?

vigna avatar Oct 19 '24 15:10 vigna

If you want to tell me your name I'll credit you in the CHANGES file; otherwise, I'll use your github handle.

vigna avatar Oct 19 '24 15:10 vigna

Rehashing to the same size will not occur with double within integer bounds. It might also still work for long, but I have not tested that.

As for credit, my GitHub handle is fine.

Captain-S0L0 avatar Oct 19 '24 18:10 Captain-S0L0

For longs, problems should appear only after 2^53 items, which is a reasonable bound (in the sense that it seems unlikely to me that anyone has maps of that size).

vigna avatar Oct 19 '24 18:10 vigna

Fixed in 8.5.15.

vigna avatar Oct 20 '24 07:10 vigna

> Floats, of course, can only store every integer value exactly up to 8,388,607, and begin to lose precision as they grow beyond that. These particular functions (and likely others) will lose integer precision when performing operations with large floats. Using a double to store the load factor would resolve these issues, as doubles can represent every integer from 0 up to 4,503,599,627,370,495.

Just a note: it's 16777215 (float) / 9007199254740991 (double). The IEEE 754 32-bit/64-bit representations can represent 24/53 significant bits using a significand of 23/52 bits because the leading bit of a normalized significand is always 1 and can be left implicit.
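
A quick way to check those boundaries (plain IEEE 754 behavior, nothing fastutil-specific):

```java
// 2^24 + 1 is the first integer not exactly representable as a float,
// and 2^53 + 1 the first not exactly representable as a double.
public final class ExactIntegerBounds {
    public static void main(final String[] args) {
        // 16777215 = 2^24 - 1 round-trips through float; 2^24 + 1 collapses onto 2^24.
        System.out.println((int) (float) 16_777_215);                   // 16777215
        System.out.println((float) 16_777_217 == (float) 16_777_216);   // true
        // 9007199254740991 = 2^53 - 1 round-trips through double; 2^53 + 1 collapses onto 2^53.
        System.out.println((long) (double) 9_007_199_254_740_991L);                              // 9007199254740991
        System.out.println((double) 9_007_199_254_740_993L == (double) 9_007_199_254_740_992L);  // true
    }
}
```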

vigna avatar Oct 21 '24 15:10 vigna