spark icon indicating copy to clipboard operation
spark copied to clipboard

[SPARK-33538][SQL]Directly push IN predicates to the Hive Metastore

Open yuexing opened this issue 5 months ago • 4 comments

What changes were proposed in this pull request?

see https://issues.apache.org/jira/browse/SPARK-33538. we want to pushdown IN predictor as hive2.0 already support it, instead of converting to Or predicate.

Why are the changes needed?

see https://issues.apache.org/jira/browse/SPARK-33538.

Does this PR introduce any user-facing change?

No

How was this patch tested?

UT

Was this patch authored or co-authored using generative AI tooling?

No

yuexing avatar Jun 09 '25 12:06 yuexing

@yuexing do you have perf numbers for this change? with which kind/version of backend RDBMS of HMS?

pan3793 avatar Jun 10 '25 06:06 pan3793

Blindly replacing = to IN (...) even for a single-element lookups can actually cause regressions, depending on HMS backend.

Worse performance: IN (...) can skip indexes or change query plans probably affects pgsql, Oracle, SQL Server, MySQL etc. Needs to be properly tested for regressions!

NULL behavior differences: col = NULL vs col IN (NULL) behave differently.

Plan caching / cursor sharing changes - affects at least Oracle, SQL Server. May be a lesser issue, but could add up for metastores with high qps.

Safer Strategy - only rewrite to IN (...): If there's more than one value.

Tagar avatar Jun 10 '25 15:06 Tagar

Separately, spark.sql.hive.metastorePartitionPruningDirectInEnabled is too long? I would not understand what this means. Perthaps spark.sql.hive.metastore.directInPruning or spark.sql.hive.hms.directIn is better? Just as an option

Tagar avatar Jun 10 '25 15:06 Tagar

Separately, spark.sql.hive.metastorePartitionPruningDirectInEnabled is too long? I would not understand what this means. Perthaps spark.sql.hive.metastore.directInPruning or spark.sql.hive.hms.directIn is better? Just as an option

hi @Tagar Thanks for commenting, let me reply to those questions;

1st, 'in' predicate with null is already skipped in old code, see

def convert(expr: Expression): Option[String] = expr match {
      case Not(InSet(_, values)) if values.size > inSetThreshold =>
        None

      case Not(In(_, list)) if hasNullLiteral(list) => None
      case Not(InSet(_, list)) if list.contains(null) => None
...

2nd, the conf is already removed following the code review comment. please see newest patch

yuexing avatar Jun 11 '25 06:06 yuexing

Safer Strategy - only rewrite to IN (...): If there's more than one value.

@Tagar OptimizeIn will rewrite In to EqualTo If there's only one value: https://github.com/apache/spark/blob/da67217590cda3d287eaf66f68eb2613273629b4/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala#L324-L329

wangyum avatar Jun 16 '25 10:06 wangyum

Merged into master.Thanks @yuexing @wangyum @pan3793 and @Tagar

LuciferYang avatar Jun 17 '25 02:06 LuciferYang