spark
spark copied to clipboard
[SPARK-33538][SQL]Directly push IN predicates to the Hive Metastore
What changes were proposed in this pull request?
see https://issues.apache.org/jira/browse/SPARK-33538. we want to pushdown IN predictor as hive2.0 already support it, instead of converting to Or predicate.
Why are the changes needed?
see https://issues.apache.org/jira/browse/SPARK-33538.
Does this PR introduce any user-facing change?
No
How was this patch tested?
UT
Was this patch authored or co-authored using generative AI tooling?
No
@yuexing do you have perf numbers for this change? with which kind/version of backend RDBMS of HMS?
Blindly replacing = to IN (...) even for a single-element lookups can actually cause regressions, depending on HMS backend.
Worse performance: IN (...) can skip indexes or change query plans probably affects pgsql, Oracle, SQL Server, MySQL etc. Needs to be properly tested for regressions!
NULL behavior differences: col = NULL vs col IN (NULL) behave differently.
Plan caching / cursor sharing changes - affects at least Oracle, SQL Server. May be a lesser issue, but could add up for metastores with high qps.
Safer Strategy - only rewrite to IN (...): If there's more than one value.
Separately, spark.sql.hive.metastorePartitionPruningDirectInEnabled is too long? I would not understand what this means.
Perthaps spark.sql.hive.metastore.directInPruning or spark.sql.hive.hms.directIn is better? Just as an option
Separately,
spark.sql.hive.metastorePartitionPruningDirectInEnabledis too long? I would not understand what this means. Perthapsspark.sql.hive.metastore.directInPruningorspark.sql.hive.hms.directInis better? Just as an option
hi @Tagar Thanks for commenting, let me reply to those questions;
1st, 'in' predicate with null is already skipped in old code, see
def convert(expr: Expression): Option[String] = expr match {
case Not(InSet(_, values)) if values.size > inSetThreshold =>
None
case Not(In(_, list)) if hasNullLiteral(list) => None
case Not(InSet(_, list)) if list.contains(null) => None
...
2nd, the conf is already removed following the code review comment. please see newest patch
Safer Strategy - only rewrite to IN (...): If there's more than one value.
@Tagar OptimizeIn will rewrite In to EqualTo If there's only one value: https://github.com/apache/spark/blob/da67217590cda3d287eaf66f68eb2613273629b4/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala#L324-L329
Merged into master.Thanks @yuexing @wangyum @pan3793 and @Tagar