`setops.py` benchmark `check-correctness` for intersect

Open stress-tess opened this issue 2 years ago • 1 comment

For the setops.py benchmark, I think sampling N = 10**4 values might be too few to expect the intersection to be non-empty in the correctness check

https://github.com/Bears-R-Us/arkouda/blob/3ffadffbd7303fc1a88c00479af5d683a8796783/benchmarks/setops.py#L68-L74

Let's assume there's no overlap among the 10**4 integers selected for a from the range 2**32. Then the chance that a single element of b coincides with an element of a is 10**4/2**32, since there are 2**32 total options and only 10**4 of them are in a.

Some quick math (good chance it's wrong since probability is not my strong suit): the probability that any given element of b does not coincide with any element of a is 1-(10**4/2**32), or

In [5]: 1-(10**4/2**32)
Out[5]: 0.9999976716935635

There are 10**4 elements in b, so we have that many chances to hit an element of a. These are independent events, so the probability that they all miss is (1-(10**4/2**32))**(10**4)

In [6]: (1-(10**4/2**32))**(10**4)
Out[6]: 0.9769858682552917

So I think the call ak.intersect1d(a,b) is expected to be empty 97.7% of the time. I don't know if this matters, but I figured I'd drop an issue so someone other than me was aware of it. Maybe @reuster986 can weigh in?
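
For anyone who wants to plug in other values, here is a minimal sketch of the calculation above as a function. The names `prob_empty_intersection` and `prob_empty_approx` are hypothetical helpers, not part of arkouda, and the sketch keeps the same simplifying assumptions as the math above (a has no duplicates, and each element of b is an independent uniform draw). The second helper uses the standard approximation (1 - x)**n ≈ exp(-n*x), which makes it easier to see how the result scales with N and the range.

```python
import math


def prob_empty_intersection(n: int, range_size: int) -> float:
    """Probability that n independent uniform draws from range_size values
    all miss a fixed set of n distinct values (assumption: a has no repeats)."""
    return (1 - n / range_size) ** n


def prob_empty_approx(n: int, range_size: int) -> float:
    """Approximation via (1 - x)**n ~= exp(-n * x), with x = n / range_size."""
    return math.exp(-n * n / range_size)


N = 10**4
p32 = prob_empty_intersection(N, 2**32)  # range currently used in the benchmark
p20 = prob_empty_intersection(N, 2**20)  # a much smaller range, for comparison

print(f"2**32: exact={p32:.6f}, approx={prob_empty_approx(N, 2**32):.6f}")
print(f"2**20: exact={p20:.3e}")
```

The exact value for 2**32 should reproduce the 0.9769... figure from the session above, and the approximation shows that the probability of an empty intersection is governed by N**2 / range_size.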

Note: a quick fix for this is to just lower the range we're drawing from. If we select a and b from 2**20 instead, we get this:

In [7]: (1-(10**4/2**20))**(10**4)
Out[7]: 2.4193111959798607e-42

So we expect the intersection to be empty about 2.42 * 10**-40 % of the time (so basically never). Granted, this all depends on my math being right, which is a big assumption.
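
Since the numbers above rest on my math being right, a quick empirical check might help. This is a rough sketch using only the Python standard library, not arkouda: `empty_fraction` is a hypothetical helper, the draws are made with replacement (which I am assuming is close enough to how the benchmark samples), and the trial count and seed are arbitrary.

```python
import random


def empty_fraction(n: int, range_size: int, trials: int, seed: int = 0) -> float:
    """Estimate how often two random samples of size n from
    [0, range_size) share no element at all."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    empty = 0
    for _ in range(trials):
        # Draw with replacement, then deduplicate by building sets.
        a = {rng.randrange(range_size) for _ in range(n)}
        b = {rng.randrange(range_size) for _ in range(n)}
        if a.isdisjoint(b):
            empty += 1
    return empty / trials


N = 10**4
frac32 = empty_fraction(N, 2**32, trials=50)
frac20 = empty_fraction(N, 2**20, trials=50)

print(f"range 2**32: intersection empty in {frac32:.0%} of trials")
print(f"range 2**20: intersection empty in {frac20:.0%} of trials")
```

If the estimate for 2**32 lands near the predicted 97.7% and the one for 2**20 is zero, that would support the calculation; 50 trials only gives a coarse estimate, so small deviations are expected.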

stress-tess, Apr 11 '22 16:04

Hey @mhmerrill, is this something that we care to fix, or should we go ahead and close the issue?

stress-tess, Jun 14 '22 15:06