Koloboke icon indicating copy to clipboard operation
Koloboke copied to clipboard

Performance issues when copying between Koloboke sets

Open gauravagarwal opened this issue 6 years ago • 2 comments

Hi, due to the way the iteration is performed and new entries are added, If one iterates over one set and copies to another set, the performance is very slow. Consider the following code snippet:

            final Random r = new Random();
            final int num = (int) (1024 * 1024 * 2.1);
            final HashLongSet set1 = HashLongSets.newMutableSet();
            for (int i = 0; i < num; i++) {
                final long oid = r.nextLong();
                set1.add(oid);
            }

            System.out.println("populated first set..");

            final HashLongSet set2 = HashLongSets.newMutableSet();
            final LongCursor cursor = set1.cursor();
            while (cursor.moveNext()) {
                set2.add(cursor.elem());
            }
            System.out.println("populated first set..");

Is there any way to accelerate the population of second set in this case? I understand that if I knew the expected set size upfront, I could have used it on second set construction and made things faster - but that is not always possible - I could have inserted some conditions in between that determined which output set the value needs to be inserted to, or thrown away completely.

gauravagarwal avatar Jan 08 '18 12:01 gauravagarwal

I don't understand what you are asking. Please try to re-formulate.

What is "very slow", to what figure do you compare? What do you expect from set copying?

leventov avatar Jan 10 '18 22:01 leventov

Hi - I have better understanding of the problem now and understand that this probably cannot be improved as it is. What I meant is following:

input: set1
for k in set1 {
 if (cond1(k) {s2.add(k)}
 else if (cond2(k) {s3.add(k)}
 else if (cond3(k) {s4.add(k)}
}

With the above code structure, the addition will be much slow due to the fact that we are iterating over the elements in set1 in reverse order and then when adding to set2, set3, set4, it is leading to a lot of collisions.

Each addition call degenerates into an O(n) operation where n being current size of the set being added to.

I was hoping if there could be an efficient set iteration order that does not lead to the degenerate set add() performance when adding to another koloboke set.

Please let me know if this makes sense; if not, I will post a code snippet that exactly compiles and provides sense of absolute "slowness" I am referring to.

PS: I got around this problem by copying the elements from set1 into a long[] and then shuffling the long[] and iterating over it (instead of iterating over the set cursor() directly) when adding to set2/set3/set4...

gauravagarwal avatar Jan 11 '18 03:01 gauravagarwal