arkouda
`setops.py` benchmark `check-correctness` for intersect
For the `setops.py` benchmark, I think that sampling with `N = 10**4` might be too low to expect the intersection to be non-empty for `check-correctness`:
https://github.com/Bears-R-Us/arkouda/blob/3ffadffbd7303fc1a88c00479af5d683a8796783/benchmarks/setops.py#L68-L74
Let's assume there's no overlap among the `10**4` integers selected from the range `2**32` for `a`. Then, for `b`, the chance of a single element coinciding with an element in `a` is `10**4/2**32`, since there are `2**32` total options and only `10**4` of them are in `a`.
Some quick math (good chance it's wrong since probability is not my strong suit):
The probability that any given element of `b` does not coincide with any element of `a` is `1-(10**4/2**32)`, or
In [5]: 1-(10**4/2**32)
Out[5]: 0.9999976716935635
There are `10**4` elements in `b`, so we have that many chances to hit an element of `a`. These are independent events, so the probability that they all miss is `(1-(10**4/2**32))**(10**4)`:
In [6]: (1-(10**4/2**32))**(10**4)
Out[6]: 0.9769858682552917
So I think the call `ak.intersect1d(a,b)` is expected to be empty 97.7% of the time. I don't know if this matters, but I figured I'd drop an issue so someone other than me was aware of it. Maybe @reuster986 can weigh in?
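As a sanity check on the 97.7% estimate, here is a minimal Monte Carlo sketch in plain Python. It assumes both arrays are drawn independently and uniformly (with replacement) from the same range, which may not match exactly what the benchmark does. `estimate_empty_rate` and its parameters are hypothetical names for illustration, not part of arkouda.

```python
import random


def estimate_empty_rate(n, range_size, trials, seed=0):
    """Estimate how often two random samples of size n share no element.

    Assumption: each sample is n independent uniform draws from
    [0, range_size). This is a stand-in for the benchmark's sampling,
    not a copy of it.
    """
    rng = random.Random(seed)
    empty = 0
    for _ in range(trials):
        # Build a as a set so membership checks are O(1).
        a = {rng.randrange(range_size) for _ in range(n)}
        # Stop drawing b as soon as one element lands in a.
        hit = any(rng.randrange(range_size) in a for _ in range(n))
        if not hit:
            empty += 1
    return empty / trials


# Same sizes as in the issue; 100 trials keeps the runtime short.
rate = estimate_empty_rate(n=10**4, range_size=2**32, trials=100)
print(f"empty intersection in {rate:.0%} of trials")
```

With only 100 trials the estimate is noisy, so it should land near the computed value rather than match it exactly.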
Note: a quick fix for this is to just lower the range we're drawing from. If we select `a` and `b` from `2**20` instead, we get this:
In [7]: (1-(10**4/2**20))**(10**4)
Out[7]: 2.4193111959798607e-42
So we expect the intersection to be empty about `2.42 * 10**-40` % of the time (so basically never). Granted, this all depends on my math being right, which is a big assumption.
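To compare candidate ranges without retyping the expression, the calculation can be wrapped in a small helper. This is only a sketch under the same independence assumption as the math in this issue; `prob_empty_intersection` and `prob_empty_approx` are hypothetical names, not arkouda functions. The second function uses the standard approximation `(1 - x)**n ≈ exp(-n*x)` for small `x` as a cross-check.

```python
import math


def prob_empty_intersection(n, range_size):
    """Probability that n independent uniform draws all miss a set of
    n distinct values out of range_size possibilities.

    Assumption: a holds n distinct values and b's draws are independent.
    """
    return (1.0 - n / range_size) ** n


def prob_empty_approx(n, range_size):
    """Cross-check using (1 - x)**n ~= exp(-n * x) for small x."""
    return math.exp(-n * n / range_size)


p32 = prob_empty_intersection(10**4, 2**32)
p20 = prob_empty_intersection(10**4, 2**20)

print(f"range 2**32: P(empty) = {p32:.4f}")
print(f"range 2**20: P(empty) = {p20:.3e}")
```

The exponent `n**2 / range_size` is the quantity to watch: the intersection is reliably non-empty only once it is well above 1.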
Hey @mhmerrill, is this something that we care to fix, or should we go ahead and close the issue?