cykhash Intersect and union are order-dependent

First - I love cykhash, this library is amazing, thank you so much for making this. The project I'm working on would be impossible without it.

I found some surprising behavior.

setup

big_set = Int64Set(range(10**6))
small_set = Int64Set(range(10**4))

intersection

%%timeit -n 5 -r 5
big_set.intersection(small_set)

Gives: 35.4 ms ± 1.73 ms per loop (mean ± std. dev. of 5 runs, 5 loops each)

%%timeit -n 5 -r 5
small_set.intersection(big_set)

Gives: 1.36 ms ± 131 µs per loop (mean ± std. dev. of 5 runs, 5 loops each)

union

%%timeit -n 5 -r 5
small_set.union(big_set)

Gives: 76.2 ms ± 2.25 ms per loop (mean ± std. dev. of 5 runs, 5 loops each)

%%timeit -n 5 -r 5
big_set.union(small_set)

Gives: 31.9 ms ± 942 µs per loop (mean ± std. dev. of 5 runs, 5 loops each)

Summary

When intersecting small and big, you want small.intersection(big), while union is better with the reverse order.

It's easy enough to fix, you just need an if len(one_set) > len(other_set) type of thing. But it's unexpected to have to do this, since most set implementations handle this internally.

Jul 06 '22 13:07 manimino

@manimino thanks for spotting the issue.

I have no plan where the next version will be released, but will include this improvement.

Jul 07 '22 07:07 realead

@manimino

After looking more into it: small.update(big) will always be slower than big.update(small) because much more elements must be added to small (and similar for intersection).

So if performance is the most important part, e.g. the union of two sets should be calculated and it is not important what happens to the input-sets, one could do:

def fast_union(a,b):
    if len(a)>len(b):
        a.update(b)
        return a
    else:
        b.update(a)
        return b

but in general fast_union has strange side effects on a and b so I'm not sure, it should be really part of the simple interface cykhash provides - one should write this function if performance is what count.

For the time being this will not be added to cykhash for the time being.

Feb 07 '23 07:02 realead

cykhash cykhash copied to clipboard

Intersect and union are order-dependent

setup

intersection

union

Summary

cykhash
cykhash copied to clipboard