cykhash
cykhash copied to clipboard
Intersect and union are order-dependent
First - I love cykhash, this library is amazing, thank you so much for making this. The project I'm working on would be impossible without it.
I found some surprising behavior.
setup
big_set = Int64Set(range(10**6))
small_set = Int64Set(range(10**4))
intersection
%%timeit -n 5 -r 5
big_set.intersection(small_set)
Gives: 35.4 ms ± 1.73 ms per loop (mean ± std. dev. of 5 runs, 5 loops each)
%%timeit -n 5 -r 5
small_set.intersection(big_set)
Gives: 1.36 ms ± 131 µs per loop (mean ± std. dev. of 5 runs, 5 loops each)
union
%%timeit -n 5 -r 5
small_set.union(big_set)
Gives: 76.2 ms ± 2.25 ms per loop (mean ± std. dev. of 5 runs, 5 loops each)
%%timeit -n 5 -r 5
big_set.union(small_set)
Gives: 31.9 ms ± 942 µs per loop (mean ± std. dev. of 5 runs, 5 loops each)
Summary
When intersecting small and big, you want small.intersection(big)
, while union is better with the reverse order.
It's easy enough to fix, you just need an if len(one_set) > len(other_set)
type of thing. But it's unexpected to have to do this, since most set implementations handle this internally.
@manimino thanks for spotting the issue.
I have no plan where the next version will be released, but will include this improvement.
@manimino
After looking more into it: small.update(big)
will always be slower than big.update(small)
because much more elements must be added to small
(and similar for intersection
).
So if performance is the most important part, e.g. the union of two sets should be calculated and it is not important what happens to the input-sets, one could do:
def fast_union(a,b):
if len(a)>len(b):
a.update(b)
return a
else:
b.update(a)
return b
but in general fast_union
has strange side effects on a
and b
so I'm not sure, it should be really part of the simple interface cykhash
provides - one should write this function if performance is what count.
For the time being this will not be added to cykhash for the time being.