[WIP][HUDI-6472] fix spark sql does not ignore case
Change Logs
GitHub issue: #10558
First: SimpleAnalyzer is case-sensitive, so this PR replaces it with the analyzer from sessionState.
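To illustrate the difference, here is a minimal sketch in plain Java. It is a toy model of name resolution, not Spark's actual analyzer; the class `ResolverSketch` and its method `resolve` are hypothetical names used only for this example.

```java
import java.util.List;
import java.util.Optional;
import java.util.function.BiPredicate;

// Toy model of column-name resolution; not Spark's actual classes.
final class ResolverSketch {
    // Case-sensitive resolution: the two names must match exactly.
    static final BiPredicate<String, String> CASE_SENSITIVE = String::equals;
    // Case-insensitive resolution: the two names may differ in case.
    static final BiPredicate<String, String> CASE_INSENSITIVE = String::equalsIgnoreCase;

    // Returns the first table column that the resolver accepts for the requested name.
    static Optional<String> resolve(List<String> columns, String requested,
                                    BiPredicate<String, String> resolver) {
        return columns.stream()
                .filter(column -> resolver.test(column, requested))
                .findFirst();
    }

    public static void main(String[] args) {
        List<String> columns = List.of("id", "name", "price", "ts");
        // A case-sensitive lookup of "ID" finds nothing.
        System.out.println(resolve(columns, "ID", CASE_SENSITIVE));   // Optional.empty
        // A case-insensitive lookup of "ID" resolves to the column "id".
        System.out.println(resolve(columns, "ID", CASE_INSENSITIVE)); // Optional[id]
    }
}
```

The session analyzer follows the session's `spark.sql.caseSensitive` setting (false by default), which corresponds to the second lookup here.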
Second: once the first change is in place, the following exception is thrown:
Unexpected exception thrown: java.lang.RuntimeException: After applying rule org.apache.spark.sql.catalyst.optimizer.FoldablePropagation in batch Operator Optimization before Inferring Filters, the structural integrity of the plan is broken.
org.opentest4j.AssertionFailedError: Unexpected exception thrown: java.lang.RuntimeException: After applying rule org.apache.spark.sql.catalyst.optimizer.FoldablePropagation in batch Operator Optimization before Inferring Filters, the structural integrity of the plan is broken.
at org.junit.jupiter.api.AssertDoesNotThrow.createAssertionFailedError(AssertDoesNotThrow.java:83)
at org.junit.jupiter.api.AssertDoesNotThrow.assertDoesNotThrow(AssertDoesNotThrow.java:54)
at org.junit.jupiter.api.AssertDoesNotThrow.assertDoesNotThrow(AssertDoesNotThrow.java:37)
at org.junit.jupiter.api.Assertions.assertDoesNotThrow(Assertions.java:3060)
at org.apache.spark.sql.hudi.TestInsertTable.$anonfun$new$226(TestInsertTable.scala:2469)
at org.apache.spark.sql.hudi.TestInsertTable.$anonfun$new$226$adapted(TestInsertTable.scala:2438)
at scala.collection.immutable.List.foreach(List.scala:392)
at org.apache.spark.sql.hudi.TestInsertTable.$anonfun$new$225(TestInsertTable.scala:2438)
We replace the matching field name in the DataFrame schema with the primaryKey name, because Avro is case-sensitive.
The optimizer rule FoldablePropagation replaces attributes with the aliases of the original foldable expressions where possible.
The plan therefore needs to be optimized before we substitute the primaryKey name.
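The renaming step can be pictured with a small sketch in plain Java. This is an assumption about the general idea, not the code in this PR; `RecordKeyAlignSketch` and `alignToRecordKey` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of aligning output column names with the configured primary key.
final class RecordKeyAlignSketch {
    // Renames any column that equals the record key ignoring case
    // to the record key's exact spelling; other columns are kept as-is.
    static List<String> alignToRecordKey(List<String> columns, String recordKey) {
        List<String> aligned = new ArrayList<>();
        for (String column : columns) {
            aligned.add(column.equalsIgnoreCase(recordKey) ? recordKey : column);
        }
        return aligned;
    }

    public static void main(String[] args) {
        // The query selects "1 as ID", but the table's primary key is "id".
        List<String> queryOutput = List.of("ID", "name", "price", "ts");
        System.out.println(alignToRecordKey(queryOutput, "id")); // [id, name, price, ts]
    }
}
```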
For example, take the SQL `insert into table $tableName select 1 as ID, name, price, ts from $tableNameA order by ID`.
If the primaryKey is `id` and the logical plan is already resolved, then replacing the name of ID#22 with `id` gives an analyzed plan like:
Project [id#22, name#24, price#25, ts#26L]
+- Sort [ID#22 ASC NULLS FIRST], true
   +- Project [1 AS ID#22, name#24, price#25, ts#26L]
      +- SubqueryAlias spark_catalog.default.h1
         +- Relation default.h1[ID#23,name#24,price#25,ts#26L] parquet
The FoldablePropagation rule then rewrites this logical plan into:
Project [1 AS ID#22, name#24, price#25, ts#26L]
+- Sort [1 ASC NULLS FIRST], true
   +- Project [1 AS ID#22, name#24, price#25, ts#26L]
      +- Relation default.h1[ID#23,name#24,price#25,ts#26L] parquet
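In the optimizer, `RuleExecutor` uses `isPlanIntegral` to check that the schemas of the previous plan (prePlan) and the current plan (curPlan) are the same. Here the top-level output column changes from `id` to `ID`, so the check fails and the test throws the exception above.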
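The failure can be reproduced in miniature with plain Java. This is a simplified sketch of the mechanism described above, not Spark's implementation; the class `FoldablePropagationSketch` and its methods are hypothetical, and columns are modeled as `{name, exprId}` string pairs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Toy model of how propagating a foldable alias can change the output schema.
final class FoldablePropagationSketch {
    // Extracts the column names from {name, exprId} pairs.
    static List<String> namesOf(List<String[]> output) {
        List<String> names = new ArrayList<>();
        for (String[] column : output) {
            names.add(column[0]);
        }
        return names;
    }

    // Replaces each column whose exprId has a foldable alias with that alias,
    // so the column takes the alias's name (foldableAliases maps exprId -> alias name).
    static List<String> propagate(List<String[]> output, Map<String, String> foldableAliases) {
        List<String> names = new ArrayList<>();
        for (String[] column : output) {
            names.add(foldableAliases.getOrDefault(column[1], column[0]));
        }
        return names;
    }

    // Simplified integrity check: names must be equal, case-sensitively and in order.
    static boolean schemaUnchanged(List<String> before, List<String> after) {
        return before.equals(after);
    }

    public static void main(String[] args) {
        // Top-level Project after renaming ID#22 to id#22.
        List<String[]> output = List.of(
                new String[]{"id", "22"}, new String[]{"name", "24"},
                new String[]{"price", "25"}, new String[]{"ts", "26"});
        // The child Project defines the foldable alias "1 AS ID#22".
        Map<String, String> foldableAliases = Map.of("22", "ID");

        List<String> before = namesOf(output);
        List<String> after = propagate(output, foldableAliases);
        System.out.println(before);                         // [id, name, price, ts]
        System.out.println(after);                          // [ID, name, price, ts]
        System.out.println(schemaUnchanged(before, after)); // false
    }
}
```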
Impact
`insert into` and `merge` SQL statements will incur an additional optimizer pass during analysis.
Risk level (write none, low medium or high below)
low
Documentation Update
none
Contributor's checklist
- [ ] Read through contributor's guide
- [ ] Change Logs and Impact were stated clearly
- [ ] Adequate tests were added if applicable
- [ ] CI passed
CI report:
- b2f4afe93e6c67f73bcfde03557268a047f422a1 Azure: FAILURE
Hi @KnightChess, what's the plan for this fix? Do we want to include it in release 0.14.2?
I made some changes to this PR and put them into a new one: https://github.com/apache/hudi/pull/10826. @danny0405, how should we proceed?
Sorry for the late reply. @jonvex, I will close this PR. Thank you for your work on it.