hudi [WIP][HUDI-6472] fix spark sql does not ignore case

Change Logs

github issue: #10558

first: SimpleAnalyzer is case-sensitive, so it is replaced with the sessionState analyzer.
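A minimal sketch of the difference, assuming a simplified model of name resolution rather than Spark's actual analyzer code. The object `ResolverSketch`, the `Resolver` type and the `resolve` helper are hypothetical names used only for illustration. In Spark the session analyzer follows the `spark.sql.caseSensitive` setting, which is false by default.

```scala
// Simplified model of column-name resolution; not Spark's analyzer code.
object ResolverSketch {
  // A resolver decides whether two identifiers refer to the same column.
  type Resolver = (String, String) => Boolean

  // Exact match: "ID" and "id" are different columns.
  val caseSensitive: Resolver = (a, b) => a == b

  // Case-insensitive match: "ID" and "id" are the same column.
  val caseInsensitive: Resolver = (a, b) => a.equalsIgnoreCase(b)

  // Look up a referenced name in a table schema with the given resolver.
  def resolve(schema: Seq[String], name: String, resolver: Resolver): Option[String] =
    schema.find(field => resolver(field, name))

  def main(args: Array[String]): Unit = {
    val schema = Seq("id", "name", "price", "ts")
    println(resolve(schema, "ID", caseSensitive))   // None
    println(resolve(schema, "ID", caseInsensitive)) // Some(id)
  }
}
```

With the case-sensitive resolver a query that writes `ID` cannot be matched to the table column `id`, which is the behavior reported in the issue.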

second: after the first change, the following exception is thrown.

Unexpected exception thrown: java.lang.RuntimeException: After applying rule org.apache.spark.sql.catalyst.optimizer.FoldablePropagation in batch Operator Optimization before Inferring Filters, the structural integrity of the plan is broken.
org.opentest4j.AssertionFailedError: Unexpected exception thrown: java.lang.RuntimeException: After applying rule org.apache.spark.sql.catalyst.optimizer.FoldablePropagation in batch Operator Optimization before Inferring Filters, the structural integrity of the plan is broken.
	at org.junit.jupiter.api.AssertDoesNotThrow.createAssertionFailedError(AssertDoesNotThrow.java:83)
	at org.junit.jupiter.api.AssertDoesNotThrow.assertDoesNotThrow(AssertDoesNotThrow.java:54)
	at org.junit.jupiter.api.AssertDoesNotThrow.assertDoesNotThrow(AssertDoesNotThrow.java:37)
	at org.junit.jupiter.api.Assertions.assertDoesNotThrow(Assertions.java:3060)
	at org.apache.spark.sql.hudi.TestInsertTable.$anonfun$new$226(TestInsertTable.scala:2469)
	at org.apache.spark.sql.hudi.TestInsertTable.$anonfun$new$226$adapted(TestInsertTable.scala:2438)
	at scala.collection.immutable.List.foreach(List.scala:392)
	at org.apache.spark.sql.hudi.TestInsertTable.$anonfun$new$225(TestInsertTable.scala:2438)

We replace the column name in the DataFrame schema with the primaryKey name, because Avro is case-sensitive. The optimizer rule FoldablePropagation replaces attributes with aliases of the original foldable expressions where possible, so the plan needs to be optimized before the primaryKey name is substituted. For example, take the SQL `insert into table $tableName select 1 as ID, name, price, ts from $tableNameA order by ID`.

If the primaryKey is id and the logical plan is already resolved, replacing the name of ID#22 with id gives an analyzed plan like:
   Project [id#22, name#24, price#25, ts#26L]
     +- Sort [ID#22 ASC NULLS FIRST], true
        +- Project [1 AS ID#22, name#24, price#25, ts#26L]
           +- SubqueryAlias spark_catalog.default.h1
              +- Relation default.h1[ID#23,name#24,price#25,ts#26L] parquet
The FoldablePropagation rule then optimizes this logical plan to:
   Project [1 AS ID#22, name#24, price#25, ts#26L]
     +- Sort [1 ASC NULLS FIRST], true
        +- Project [1 AS ID#22, name#24, price#25, ts#26L]
           +- Relation default.h1[ID#23,name#24,price#25,ts#26L] parquet
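In the optimizer, `RuleExecutor` uses `isPlanIntegral` to check that the schemas of the previous plan and the current plan are the same. Here the name of the first output column of the top-level Project changes from id to ID, so the check fails and the unit test throws the exception above.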
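The failing comparison can be sketched as follows. This is a simplified model, not Spark's `RuleExecutor` code: the object `IntegritySketch`, the `Field` case class and `sameSchema` are hypothetical names, and the column types are assumptions, since the plans above do not show them.

```scala
// Simplified model of the schema comparison; not Spark's RuleExecutor code.
object IntegritySketch {
  // One output column: its name and a type label (types are assumed here).
  final case class Field(name: String, dataType: String)

  // Field names are compared exactly, so "id" and "ID" count as different.
  def sameSchema(before: Seq[Field], after: Seq[Field]): Boolean =
    before == after

  // Output of the analyzed plan after ID#22 was renamed to the primary key "id".
  val renamed: Seq[Field] = Seq(
    Field("id", "int"), Field("name", "string"),
    Field("price", "double"), Field("ts", "long"))

  // Output after FoldablePropagation restored the alias `1 AS ID`.
  val propagated: Seq[Field] = renamed.updated(0, Field("ID", "int"))

  def main(args: Array[String]): Unit = {
    println(sameSchema(renamed, renamed))    // true
    println(sameSchema(renamed, propagated)) // false
  }
}
```

Because the two schemas differ only in the case of one column name, an exact comparison still reports a mismatch, which is why the rename has to happen after the optimizer rule has run.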

Impact

Insert into and merge SQL statements will incur extra optimizer work during analysis.

Risk level (write none, low medium or high below)

low

Documentation Update

none

Contributor's checklist

[ ] Read through contributor's guide
[ ] Change Logs and Impact were stated clearly
[ ] Adequate tests were added if applicable
[ ] CI passed

Jan 29 '24 14:01 KnightChess

CI report:

b2f4afe93e6c67f73bcfde03557268a047f422a1 Azure: FAILURE

Bot commands

@hudi-bot supports the following commands:

@hudi-bot run azure re-run the last Azure build

Jan 29 '24 17:01 hudi-bot

Hi @KnightChess, what's the plan for this fix? Do we want to include it in release 0.14.2?

Feb 18 '24 02:02 danny0405

I made some changes to this PR and put them into a new one: https://github.com/apache/hudi/pull/10826. @danny0405 how should we proceed?

Mar 06 '24 01:03 jonvex

Sorry for the late reply. @jonvex I will close this PR. Thank you for your work on it.

Mar 06 '24 02:03 KnightChess
