
DRILL-6845: Semi-Hash-Join to skip incoming build duplicates, automatically stop skipping if too few

Ben-Zvi opened this issue 6 years ago • 3 comments

The first two commits here were extracted from the original PR #1522 (DRILL-6735), where the Semi-Hash-Join was implemented in a straightforward way: read the data like a regular hash join (i.e., into partitions, building the hash tables later), and only at probe time stop after at most a single match per probe key. The issue with that implementation is the case of excessive incoming build-side duplicates (more common with synthetic data in benchmarks). In such a case, reading all the data first can blow up the hash join's memory (e.g., cause spills) and regress performance.

This PR addresses the problem by creating the hash tables first and using them to detect build-side duplicates early (before copying from the incoming batch into the partitions), so those duplicates can simply be skipped (see the new method insertKeyIntoHashTable(), and the sketch below). After the entire build side is read (if there was no spill), there is no need to build the hash tables since they already exist - see the new method buildContainers().
All this logic is in the first commit. The issue with this logic is that it adds overhead (e.g., hash-table doubling), which is wasted when there are very few duplicates; that issue is addressed by the second commit. (Also note the new option semi_skip_duplicates, which can be used to disable this whole feature.)
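
Below is a minimal, self-contained sketch of the duplicate-skipping idea described above. It is only an illustration: the HashSet stands in for the real per-partition hash table, and the class and field names (other than insertKeyIntoHashTable()) are invented for the example.

```java
import java.util.HashSet;
import java.util.Set;

/**
 * Illustrative sketch only (not Drill's HashPartition code): each build-side row is
 * first offered to the partition's key table; the row is copied into the partition's
 * batches only when its key was not seen before, so exact build-side duplicates never
 * reach the partition. A semi-join needs at most one match per key, so nothing is lost.
 */
class DuplicateSkippingBuildSketch {
  private final Set<Long> keyTable = new HashSet<>();   // stand-in for the real hash table
  private long duplicatesSkipped = 0;

  /** Returns true if the row should be copied into the partition, false if it is a duplicate. */
  boolean insertKeyIntoHashTable(long keyHash) {
    boolean isNew = keyTable.add(keyHash);
    if (!isNew) {
      duplicatesSkipped++;   // duplicate key: skip the copy entirely
    }
    return isNew;
  }
}
```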

The second commit adds some "runtime statistics" to decide whether there are too few duplicates; in that case it drops the hash tables and falls back to the simple semi-join work (a la PR #1522). The decision uses a "threshold" equal to half the combined size of all the hash tables (so they never have to double), and incoming duplicates are counted. Once that many incoming rows have been processed, the percentage of duplicates is checked: if it is under 20% (hard-coded), skipping stops; otherwise the hash tables continue to be used to eliminate the duplicates.
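
A hedged sketch of that decision follows; the 20% cutoff and the half-of-total-capacity threshold are taken from the description above, while the class and method names are invented for the example.

```java
/**
 * Sketch only: decide whether duplicate skipping is paying for itself. The threshold is
 * half the combined capacity of all hash tables, so reaching it never forces a doubling;
 * 20% is the hard-coded cutoff described above.
 */
class DuplicateSkipDecisionSketch {
  private static final double MIN_DUPLICATE_FRACTION = 0.20;

  private final long rowThreshold;       // half the combined hash-table capacity
  private long incomingRows = 0;
  private long duplicateRows = 0;
  private boolean skippingEnabled = true;

  DuplicateSkipDecisionSketch(long totalHashTableCapacity) {
    this.rowThreshold = totalHashTableCapacity / 2;
  }

  /** Called for every incoming build-side row while skipping is still enabled. */
  void onIncomingRow(boolean wasDuplicate) {
    if (!skippingEnabled) { return; }
    incomingRows++;
    if (wasDuplicate) { duplicateRows++; }
    if (incomingRows >= rowThreshold
        && (double) duplicateRows / incomingRows < MIN_DUPLICATE_FRACTION) {
      // Too few duplicates: drop the hash tables and fall back to the plain semi-join path.
      skippingEnabled = false;
    }
  }

  boolean isSkipping() { return skippingEnabled; }
}
```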

The third commit extends the memory manager to handle this special "duplicate skipping" mode, with a new class HashJoinSpillControlImpl and a new interface HashJoinMemoryCalculator.HashJoinSpillControl. The technique used for shouldSpill() is simply to ensure that the available memory is large enough for at least 3 (see below) more batches. That required adding a currentVectorContainer parameter to all the shouldSpill() methods. Most of the code changes in HashPartition are a rename (batch -> vectorContainer), and in HashJoinBatch some variable names gained a "semi" prefix. As for running out of memory while inserting into the hash table (either when allocating a new keys batch or when resizing the table): the hash table throws RetryAfterSpillException, which is caught in the new insertKeyIntoHashTable(); that triggers a spill, resets the hash table anyway, and returns false (the key is new, so it will be inserted into the new, empty hash table). This case is therefore much simpler than in Hash-Aggr.
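
A minimal sketch of that spill check, under the assumption that "available memory" is simply the limit minus current usage and that the incoming vector container's size approximates one more batch; this is not the actual HashJoinSpillControlImpl code.

```java
/**
 * Sketch only: spill when the remaining memory cannot hold a small reserve of additional
 * batches. The reserve is the hard-coded 3 mentioned above (made configurable as
 * min_batches_in_available_memory in the fourth commit).
 */
class SpillControlSketch {
  private final long memoryLimit;
  private final int minBatchesInAvailableMemory;

  SpillControlSketch(long memoryLimit, int minBatchesInAvailableMemory) {
    this.memoryLimit = memoryLimit;
    this.minBatchesInAvailableMemory = minBatchesInAvailableMemory;
  }

  /** currentVectorContainerSize approximates the memory needed for one more incoming batch. */
  boolean shouldSpill(long memoryUsed, long currentVectorContainerSize) {
    long available = memoryLimit - memoryUsed;
    return available < (long) minBatchesInAvailableMemory * currentVectorContainerSize;
  }
}
```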

The fourth commit adds an option, min_batches_in_available_memory, to replace the hard-coded "3" above. It also adds an IntegerValidator variant that can enforce min/max values.
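
For illustration, a range-checked integer option validator could look roughly like the following; this is an invented sketch, not Drill's actual validator classes.

```java
/**
 * Sketch only: an integer option validator that enforces min/max bounds,
 * along the lines of the IntegerValidator change described above.
 */
class BoundedIntegerValidatorSketch {
  private final String optionName;
  private final long min;
  private final long max;

  BoundedIntegerValidatorSketch(String optionName, long min, long max) {
    this.optionName = optionName;
    this.min = min;
    this.max = max;
  }

  void validate(long value) {
    if (value < min || value > max) {
      throw new IllegalArgumentException(String.format(
          "Option %s must be between %d and %d, but got %d", optionName, min, max, value));
    }
  }
}
```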

Ben-Zvi • Jan 09 '19

Commit added (6fb890c): the number of spilled partitions is now taken from partitionStatSet (rather than being passed via initialize()), and the size of the probe side is also considered when calculating the number of partitions possible (in calculateMemoryUsage()).
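
As a rough illustration of the second point, the number of in-memory partitions could be bounded along the following lines; the names and the exact accounting are assumptions, not the actual calculateMemoryUsage() logic.

```java
/**
 * Sketch only: bound the number of partitions by requiring room for one build-side and
 * one probe-side batch per partition, after setting aside a spill reserve.
 */
class PartitionCountSketch {
  static int maxPartitions(long memoryLimit, long buildBatchSize, long probeBatchSize, long reserve) {
    long perPartition = buildBatchSize + probeBatchSize;  // each partition holds a batch of each side
    long usable = Math.max(0L, memoryLimit - reserve);    // keep the reserve out of the budget
    return (int) Math.max(1L, usable / perPartition);
  }
}
```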

Ben-Zvi • Jan 15 '19

It seems that this PR was closed by mistake; re-opening now. @ilooner - do you have more comments or suggestions? We are trying to finish and commit this work soon. The spill control may not be perfect, but in most use cases this "duplicate skipping" work will not depend on such fine control.

Ben-Zvi • Jan 16 '19

@Ben-Zvi let's not rush this PR. I agree with you that we should not be trying to do things perfectly; that's why I mainly focus on functional correctness in my reviews and avoid superficial comments about variable names and shortening lines of code. As you've experienced firsthand with HashAgg, getting memory calculations right is extremely tricky, and they're even trickier to debug when users hit bugs. Let's make sure the logic is rock solid and unit tested while everything is still fresh in our minds. Doing this now will save us a lot more time in the coming months. Plus, I think we are getting pretty close; I don't think there is that much code left to write.

If there is a time crunch and this needs to go into our private branch soon, I have no issues with putting this into the private branch, and continuing the review process in open source. Since the changes are mainly in the memory calculators, the chances of any significant merge conflict are almost zero.

ilooner • Jan 16 '19