incubator-gluten icon indicating copy to clipboard operation
incubator-gluten copied to clipboard

Rules of anti join in Spark can be not aligned with query engines' behavior

Open rui-mo opened this issue 2 years ago • 5 comments

Rules of Anti Join in Spark

The config naming "is_null_aware_anti_join" from Spark decides how to handle null value. This config is true if below conditions are satisfied:

  if (is_null_aware_anti_join) {
    require(leftKeys.length == 1, "leftKeys length should be 1")
    require(rightKeys.length == 1, "rightKeys length should be 1")
    require(joinType == LeftAnti, "joinType must be LeftAnti.")
    require(buildSide == BuildRight, "buildSide must be BuildRight.")
    require(condition.isEmpty, "null aware anti join optimize condition should be
empty.")
  }

If this config is true, there are four cases:

  (1) If hash table is empty, return stream side;
  (2) Else if existing key in hash table being null, no row will be joined;
  (3) Else if the key from stream side is null, this row will not be joined;
  (4) Else if the key from stream side cannot be found in build side, this row will be
joined.

If this config is false, there are three cases:

  (1) If key from stream side contains null, this row will be joined;
  (2) Else if the key from stream side cannot be found in build side, this row will be
joined;
  (3) Else the key from stream side does not statisfy the condition, this row
will be joined.

rui-mo avatar Aug 30 '22 03:08 rui-mo

cc @zzcclp

rui-mo avatar Aug 30 '22 03:08 rui-mo

Thank @rui-mo . We also found some issues when executing gluten + clickhouse backend, please refer to CH-93 and CH-94

And then, we tested some cases with pg and ch, found some inconsistent behaviors:

  1. not in empty set, pg return all left table records, but ch filter null
  2. left anti join empty right table, pg return all left table records, but ch filter null
  3. not in all null set, pg return no records, but ch return all records without null
  4. left anti join all null key right table, pg return all records, but ch return all records without null
  5. left anti join left table has null key, pg return records with null, but ch return reocrds without null
  6. not in set has null, pg return empty, but ch return records
  7. left anti join right table has null key, pg return records with null, but ch return reocrds without null

spark and pg are the same behaviors.

zzcclp avatar Aug 30 '22 07:08 zzcclp

@rui-mo thanks

There are 3 cases for not in or not exists where sets:

  1. empty
  2. sets which have null vlues
  3. sets which have none-null values.

When not in and not exists are implemetented in terms of anti join, then anti-join has the following semantcis:

empty sets sets which have null vlues sets which have none-null values.
not in return stream side even if join key has null value return emtpy return join key which doesn't exist in hash set, null join key will be filtered
not exists return stream side even if join key has null value return join key which doesn't exist in hash set, null join key will be remained return join key which doesn't exist in hash set, null join key will be remained

Actaully only BHJ anti -join supports null aware which is descirbed in SingleColumn Null Aware Anti Join Optimize, both smj and shj don't filter null value in the case 3.

baibaichen avatar Aug 30 '22 08:08 baibaichen

cc @FelixYBW.

rui-mo avatar Sep 15 '22 03:09 rui-mo

is this solved?

FelixYBW avatar Oct 18 '22 18:10 FelixYBW

The issues on Velox backend are all fixed: https://github.com/facebookincubator/velox/issues/1722 https://github.com/facebookincubator/velox/discussions/3555

zhouyuan avatar Aug 24 '23 01:08 zhouyuan

velox seems to have solved this. there is a great article on this: https://facebookincubator.github.io/velox/develop/anti-join.html (thanks to zhouyuan)

binmahone avatar Aug 24 '23 02:08 binmahone