[SPARK-47482] Add HiveDialect to sql module
What changes were proposed in this pull request?
Add HiveDialect to sql module
Why are the changes needed?
In scenarios with multiple Hive catalogs configured through JDBCTableCatalog over jdbc:hive2 URLs, queries throw a ParseException: no built-in JDBC dialect matches jdbc:hive2, so identifiers are quoted with double quotes instead of the backticks Hive expects.
SQL
bin/spark-sql \
--conf "spark.sql.catalog.aaa=org.apache.spark.sql.execution.datasources.v2.jdbc.JDBCTableCatalog" \
--conf "spark.sql.catalog.aaa.url=jdbc:hive2://172.16.10.12:10000/data" \
--conf "spark.sql.catalog.aaa.driver=org.apache.hive.jdbc.HiveDriver" \
--conf "spark.sql.catalog.bbb=org.apache.spark.sql.execution.datasources.v2.jdbc.JDBCTableCatalog" \
--conf "spark.sql.catalog.bbb.url=jdbc:hive2://172.16.10.13:10000/data" \
--conf "spark.sql.catalog.bbb.driver=org.apache.hive.jdbc.HiveDriver"
select count(1) from aaa.data.data_part;
Exception
24/03/19 21:58:25 INFO HiveSessionImpl: Operation log session directory is created: /tmp/root/operation_logs/f15a5434-6356-455b-aa8e-4ce9903c1b81
24/03/19 21:58:25 INFO SparkExecuteStatementOperation: Submitting query 'SELECT * FROM "data"."data_part" WHERE 1=0' with a7459d6d-2a5c-4b56-945c-3159e58d12fd
24/03/19 21:58:25 INFO SparkExecuteStatementOperation: Running query with a7459d6d-2a5c-4b56-945c-3159e58d12fd
24/03/19 21:58:25 INFO DAGScheduler: Asked to cancel job group a7459d6d-2a5c-4b56-945c-3159e58d12fd
24/03/19 21:58:25 ERROR SparkExecuteStatementOperation: Error executing query with a7459d6d-2a5c-4b56-945c-3159e58d12fd, currentState RUNNING,
org.apache.spark.sql.catalyst.parser.ParseException:
Syntax error at or near '"data"'(line 1, pos 14)
== SQL ==
SELECT * FROM "data"."data_part" WHERE 1=0
--------------^^^
at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:306)
at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:143)
at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:52)
at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:89)
at org.apache.spark.sql.SparkSession.$anonfun$sql$2(SparkSession.scala:620)
at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111)
at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:620)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:779)
at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:617)
at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:651)
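The failure comes from identifier quoting: no built-in JdbcDialect accepts jdbc:hive2 URLs, so the fallback dialect double-quotes identifiers when the JDBC catalog probes the remote table schema. A minimal sketch that shows the behavior (using the URL from the example above):

```scala
import org.apache.spark.sql.jdbc.JdbcDialects

// No registered dialect handles jdbc:hive2, so the fallback dialect is returned.
val dialect = JdbcDialects.get("jdbc:hive2://172.16.10.12:10000/data")

// The fallback quotes identifiers ANSI-style, which produces the
// SELECT * FROM "data"."data_part" statement seen in the log above.
println(dialect.quoteIdentifier("data_part")) // prints: "data_part"
// Hive only understands backtick quoting: `data_part`
```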
Does this PR introduce any user-facing change?
no
How was this patch tested?
Local manual test.
Was this patch authored or co-authored using generative AI tooling?
no
@xleoken I think you can implement a catalog plugin and register two custom Hive JDBC dialects.
Just FYI, SPARK-47496 makes loading a custom dialect much easier.
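For reference, a minimal sketch of that workaround (MyHive2Dialect is an illustrative name, not an existing Spark class):

```scala
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}

// Illustrative user-defined dialect for jdbc:hive2 endpoints.
object MyHive2Dialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:hive2")

  // Hive quotes identifiers with backticks, not ANSI double quotes.
  override def quoteIdentifier(colName: String): String =
    s"`${colName.replace("`", "``")}`"
}

// Register before any query touches the jdbc:hive2 catalogs,
// e.g. at the start of a spark-shell session or an application's main().
JdbcDialects.registerDialect(MyHive2Dialect)
```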
This is too heavy for users and there's no need for it.
As Daniel Fernandez said in https://issues.apache.org/jira/browse/SPARK-22016, only two functions need to be overridden.
https://issues.apache.org/jira/browse/SPARK-21063 https://issues.apache.org/jira/browse/SPARK-22016 https://issues.apache.org/jira/browse/SPARK-31457
Hi @dongjoon-hyun @yaooqinn @HyukjinKwon, please look into this issue seriously. The old related PRs haven't been active for a long time, so we can discuss it here.
When we hit this issue, the client reported Table or view not found while the server reported org.apache.spark.sql.catalyst.parser.ParseException. We spent a lot of time analyzing this issue before we solved it.
By the way, could Spark throw a "jdbc:hive2 is not supported" exception directly? Or update the docs to tell users they need a custom dialect.
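To illustrate the first suggestion, the guard being asked for might look roughly like this (purely a sketch; requireSupportedUrl is hypothetical, not existing Spark code):

```scala
// Sketch only: fail fast for Hive JDBC URLs instead of generating SQL
// with double-quoted identifiers that the hive2 endpoint cannot parse.
def requireSupportedUrl(url: String): Unit = {
  if (url.startsWith("jdbc:hive2")) {
    throw new UnsupportedOperationException(
      s"No built-in JDBC dialect supports '$url'. Register a custom JdbcDialect " +
        "that quotes identifiers with backticks.")
  }
}
```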
To summarize:
- From the following exception stack trace alone, we had to spend a lot of time to work out that the root cause is JdbcDialects#quoteIdentifier.
- Providing the dialect as a third-party library, or implementing a catalog plugin, is too heavy for users. As @yaooqinn said, it's difficult to register a custom JDBC dialect today: https://github.com/apache/spark/pull/45626

Steps to reproduce:
1. Start the thriftserver
sbin/start-thriftserver.sh
2. Start spark-shell
bin/spark-shell \
--conf spark.sql.catalog.aaa=org.apache.spark.sql.execution.datasources.v2.jdbc.JDBCTableCatalog \
--conf spark.sql.catalog.aaa.url=jdbc:hive2://172.16.10.12:10000/data \
--conf spark.sql.catalog.aaa.driver=org.apache.hive.jdbc.HiveDriver
3. Run the query
select * from aaa.data.data_part limit 1
4. Client exception (Table or view not found: aaa.data.data_part)
scala> spark.sql("select * from aaa.data.data_part limit 1").show();
24/03/22 08:35:53 WARN HiveConnection: Failed to connect to 172.16.10.12:10000
org.apache.spark.sql.AnalysisException: Table or view not found: aaa.data.data_part; line 1 pos 14;
'GlobalLimit 1
+- 'LocalLimit 1
+- 'Project [*]
+- 'UnresolvedRelation [aaa, data, data_part], [], false
at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$1(CheckAnalysis.scala:131)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$1$adapted(CheckAnalysis.scala:102)
at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:367)
5. Server exception (org.apache.spark.sql.catalyst.parser.ParseException)
24/03/22 08:45:42 INFO ThriftCLIService: Client protocol version: HIVE_CLI_SERVICE_PROTOCOL_V10
24/03/22 08:45:42 INFO HiveSessionImpl: Operation log session directory is created: /tmp/root/operation_logs/4d373392-cc24-45fd-b9b7-4e27eeb48292
24/03/22 08:45:42 INFO SparkExecuteStatementOperation: Submitting query 'SELECT * FROM "data"."data_part" WHERE 1=0' with b5e0d91c-6d3f-4a79-9bd6-d78233150e56
24/03/22 08:45:42 INFO SparkExecuteStatementOperation: Running query with b5e0d91c-6d3f-4a79-9bd6-d78233150e56
24/03/22 08:45:42 INFO DAGScheduler: Asked to cancel job group b5e0d91c-6d3f-4a79-9bd6-d78233150e56
24/03/22 08:45:42 ERROR SparkExecuteStatementOperation: Error executing query with b5e0d91c-6d3f-4a79-9bd6-d78233150e56, currentState RUNNING,
org.apache.spark.sql.catalyst.parser.ParseException:
Syntax error at or near '"data"'(line 1, pos 14)
== SQL ==
SELECT * FROM "data"."data_part" WHERE 1=0
--------------^^^
at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:306)
at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:143)
at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:52)
at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:89)
at org.apache.spark.sql.SparkSession.$anonfun$sql$2(SparkSession.scala:620)
at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111)
at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:620)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:779)
at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:617)
at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:651)
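Note the probe statement the server logs: the JDBC source resolves the remote table schema with a WHERE 1=0 query whose table name comes from the dialect's identifier quoting. A minimal sketch reproducing that statement (same URL as above):

```scala
import org.apache.spark.sql.jdbc.JdbcDialects

// The fallback dialect is returned because nothing matches jdbc:hive2.
val dialect = JdbcDialects.get("jdbc:hive2://172.16.10.12:10000/data")

// Build the table name the way the JDBC source does, then ask the dialect
// for its schema-probe query. This prints:
//   SELECT * FROM "data"."data_part" WHERE 1=0
val table = Seq("data", "data_part").map(dialect.quoteIdentifier).mkString(".")
println(dialect.getSchemaQuery(table))
```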
This patch works for me too.
cc @dongjoon-hyun @yaooqinn @HyukjinKwon
I think we shouldn't add HiveDialect as a built-in dialect. Users can add a custom dialect with https://github.com/apache/spark/pull/45626.