Incorrect results for binary search on large arrays due to miscomputation of midpoint
I think I've spotted a bug in d3's binary search implementations. It arises from the use of an unsigned right shift to compute the midpoint, which is a sound approach when the operands really are unsigned 32-bit integers, but not in JavaScript, where numbers are double-precision floats that the shift coerces to 32 bits:
https://observablehq.com/d/229d8bcf448308b9
If I've understood the problem correctly, the fix is simple: replace
const mid = (lo + hi) >>> 1;
in bisector.left and bisector.right with
const mid = Math.floor(lo + (hi - lo) / 2);
For reference: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Operators/Unsigned_right_shift
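To illustrate the mechanics (my own example, not from the notebook): >>> coerces its operands to unsigned 32-bit integers, so once lo + hi reaches 2^32 the sum wraps and the midpoint collapses:
const lo = 2147483648; // 2^31
const hi = 2147483649; // 2^31 + 1
console.log((lo + hi) >>> 1); // 0: the sum 4294967297 wraps to 1 before the shift
console.log(Math.floor(lo + (hi - lo) / 2)); // 2147483648, the correct midpoint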
Fun! Does this manifest in practice? The largest array I've been able to generate has 2,145,386,496 Uint8 elements, and it seems that d3 bisects it correctly (which makes sense: twice that length is still below 2^32, so the shifted sum never wraps).
In Safari, the following code freezes the tab, which I think is consistent with the idea described here.
a = new Uint8Array(2147483649)
d3.bisectRight(a, 0)
https://github.com/d3/d3-array/blob/main/src/bisector.js#L34-L44
Once lo + hi exceeds 2^32, mid wraps around to a small index. The comparison there still returns zero (the array is all zeros), so lo is set to mid + 1 even though mid is no longer in [lo, hi), and the loop never terminates.
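For concreteness, here is a simplified sketch of that loop (paraphrased, not the verbatim d3 source, with a plain <= in place of the comparator), showing how the wrapped midpoint violates the invariant:
function bisectRightSketch(a, x, lo = 0, hi = a.length) {
  while (lo < hi) {
    const mid = (lo + hi) >>> 1; // with lo = 2147483648 and hi = 2147483649,
                                 // (4294967297 >>> 1) === 0, so mid wraps to 0
    if (a[mid] <= x) lo = mid + 1; // a[0] === 0 and x === 0, so lo jumps back to 1
    else hi = mid;                 // and the loop cycles forever
  }
  return lo;
}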
Thanks! My tests were with Chrome, which throws a "RangeError: Array buffer allocation failed".
Confirmed with Safari's debugger – after a few iterations, e is the left endpoint, i is the (exclusive) right endpoint, and r is the midpoint (zero!)
Yes, that makes sense. Since there are only log2(n) such divisions, I don't think it's expensive to change >>> to / 2. The question is more: do we want to support such huge data arrays even though most browsers can't yet handle them? (Apparently you have found a use case!)
If we do, are there any other places in the code where we assume that indices are <= 2^30?
If we don't, should we state this limitation in the documentation (support for arrays of up to 2^30 elements…)?
All good questions.
It looks like this isn't a common issue right now, but given the direction things are heading, I wouldn't be surprised if Chrome lifts the limit at some point in the next few years. Binary search has a rich history of bugs like this that went undiscovered for many years until data sizes grew; the integer-overflow bug in Java's Arrays.binarySearch is the classic example. :)
I took a quick look and d3 uses the x >>> 1 idiom in other places, some of which would probably want a similar update.
FWIW I noticed this while experimenting with techniques for interactively zooming and panning ~billion-point datasets on the client side, which is a fairly niche use case right now.
Actually, I think we could just use
const mid = lo + ((hi - lo) >>> 1);
as a strict improvement, which works correctly for a larger range of values. (The parentheses matter: >>> binds more loosely than +, so without them the expression would reduce to hi >>> 1.)
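A quick check of the range this buys (my own numbers): the shift now sees only the width hi - lo, which stays below 2^32 for arrays of fewer than 2^32 elements, rather than an absolute sum of indices:
const lo = 2147483648, hi = 4294967295;
console.log(lo + ((hi - lo) >>> 1)); // 3221225471, the correct midpoint
console.log((lo + hi) >>> 1); // 1073741823: the sum wrapped modulo 2^32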
I was also reminded today of the existence of Math.trunc, which would work for this case and is said to be faster than Math.floor, though I haven't benchmarked it:
const mid = lo + Math.trunc((hi - lo) / 2);
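A side note from my own check: for the non-negative values involved here, Math.trunc and Math.floor agree; they differ only for negative inputs, where trunc rounds toward zero:
console.log(Math.trunc(3.5), Math.floor(3.5)); // 3 3
console.log(Math.trunc(-3.5), Math.floor(-3.5)); // -3 -4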
Is there any reason not to use the simpler form?
const mid = Math.trunc((lo + hi) / 2);
I mean sure, you could run out of integer precision when adding lo + hi, but that should only happen if the combined index is greater than or equal to 2^53 (9,007,199,254,740,992). Which at a minimum would represent 9 petabytes of memory. And if d3-array is being used on a supercomputer of that size… well, they’d probably find other problems lol.
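To make the precision bound concrete (my own check): doubles represent every integer up to 2^53 exactly, so lo + hi is exact for any indices a real array could have:
const lo = 2 ** 51, hi = 2 ** 52; // absurdly large indices, yet lo + hi is still well under 2^53
console.log(Math.trunc((lo + hi) / 2)); // 3377699720527872
console.log(lo + (hi - lo) / 2); // 3377699720527872, same answer either way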
That's a good point. I don't see any issues with the simpler form.
> Which at a minimum would represent 9 petabytes of memory. And if d3-array is being used on a supercomputer of that size… well, they’d probably find other problems lol.
😄
Related: https://lemire.me/blog/2022/12/06/fast-midpoint-between-two-integers-without-overflow/
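For reference, the trick in that post computes the midpoint of two fixed-width integers without overflow as (lo & hi) + ((lo ^ hi) >> 1). One caveat for this codebase (my reading, worth double-checking): JavaScript's bitwise operators themselves coerce to 32-bit integers, so the trick doesn't escape the limit discussed above:
const mid32 = (lo, hi) => (lo & hi) + ((lo ^ hi) >>> 1);
console.log(mid32(3, 8)); // 5, correct for small indices
console.log(mid32(2 ** 31, 2 ** 31 + 2)); // -2147483647: the operands wrapped to negative int32s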