incubator-gluten
incubator-gluten copied to clipboard
Rules of anti join in Spark can be not aligned with query engines' behavior
Rules of Anti Join in Spark
The config naming "is_null_aware_anti_join" from Spark decides how to handle null value. This config is true if below conditions are satisfied:
if (is_null_aware_anti_join) {
require(leftKeys.length == 1, "leftKeys length should be 1")
require(rightKeys.length == 1, "rightKeys length should be 1")
require(joinType == LeftAnti, "joinType must be LeftAnti.")
require(buildSide == BuildRight, "buildSide must be BuildRight.")
require(condition.isEmpty, "null aware anti join optimize condition should be
empty.")
}
If this config is true, there are four cases:
(1) If hash table is empty, return stream side;
(2) Else if existing key in hash table being null, no row will be joined;
(3) Else if the key from stream side is null, this row will not be joined;
(4) Else if the key from stream side cannot be found in build side, this row will be
joined.
If this config is false, there are three cases:
(1) If key from stream side contains null, this row will be joined;
(2) Else if the key from stream side cannot be found in build side, this row will be
joined;
(3) Else the key from stream side does not statisfy the condition, this row
will be joined.
cc @zzcclp
Thank @rui-mo . We also found some issues when executing gluten + clickhouse backend, please refer to CH-93 and CH-94
And then, we tested some cases with pg and ch, found some inconsistent behaviors:
- not in empty set, pg return all left table records, but ch filter null
- left anti join empty right table, pg return all left table records, but ch filter null
- not in all null set, pg return no records, but ch return all records without null
- left anti join all null key right table, pg return all records, but ch return all records without null
- left anti join left table has null key, pg return records with null, but ch return reocrds without null
- not in set has null, pg return empty, but ch return records
- left anti join right table has null key, pg return records with null, but ch return reocrds without null
spark and pg are the same behaviors.
@rui-mo thanks
There are 3 cases for not in or not exists where sets:
- empty
- sets which have null vlues
- sets which have none-null values.
When not in and not exists are implemetented in terms of anti join, then anti-join has the following semantcis:
empty sets | sets which have null vlues | sets which have none-null values. | |
---|---|---|---|
not in | return stream side even if join key has null value | return emtpy | return join key which doesn't exist in hash set, null join key will be filtered |
not exists | return stream side even if join key has null value | return join key which doesn't exist in hash set, null join key will be remained | return join key which doesn't exist in hash set, null join key will be remained |
Actaully only BHJ anti -join supports null aware which is descirbed in SingleColumn Null Aware Anti Join Optimize, both smj and shj don't filter null value in the case 3.
cc @FelixYBW.
is this solved?
The issues on Velox backend are all fixed: https://github.com/facebookincubator/velox/issues/1722 https://github.com/facebookincubator/velox/discussions/3555
velox seems to have solved this. there is a great article on this: https://facebookincubator.github.io/velox/develop/anti-join.html (thanks to zhouyuan)