
[SPARK-47482] Add HiveDialect to sql module

Open xleoken opened this issue 1 year ago • 5 comments

What changes were proposed in this pull request?

Add HiveDialect to sql module

Why are the changes needed?

In scenarios with multiple Hive catalogs configured as JDBC catalogs, queries fail with a ParseException.

SQL

bin/spark-sql \
  --conf "spark.sql.catalog.aaa=org.apache.spark.sql.execution.datasources.v2.jdbc.JDBCTableCatalog" \
  --conf "spark.sql.catalog.aaa.url=jdbc:hive2://172.16.10.12:10000/data" \
  --conf "spark.sql.catalog.aaa.driver=org.apache.hive.jdbc.HiveDriver" \
  --conf "spark.sql.catalog.bbb=org.apache.spark.sql.execution.datasources.v2.jdbc.JDBCTableCatalog" \
  --conf "spark.sql.catalog.bbb.url=jdbc:hive2://172.16.10.13:10000/data" \
  --conf "spark.sql.catalog.bbb.driver=org.apache.hive.jdbc.HiveDriver"

select count(1) from aaa.data.data_part;

Exception

24/03/19 21:58:25 INFO HiveSessionImpl: Operation log session directory is created: /tmp/root/operation_logs/f15a5434-6356-455b-aa8e-4ce9903c1b81
24/03/19 21:58:25 INFO SparkExecuteStatementOperation: Submitting query 'SELECT * FROM "data"."data_part" WHERE 1=0' with a7459d6d-2a5c-4b56-945c-3159e58d12fd
24/03/19 21:58:25 INFO SparkExecuteStatementOperation: Running query with a7459d6d-2a5c-4b56-945c-3159e58d12fd
24/03/19 21:58:25 INFO DAGScheduler: Asked to cancel job group a7459d6d-2a5c-4b56-945c-3159e58d12fd
24/03/19 21:58:25 ERROR SparkExecuteStatementOperation: Error executing query with a7459d6d-2a5c-4b56-945c-3159e58d12fd, currentState RUNNING, 
org.apache.spark.sql.catalyst.parser.ParseException: 
Syntax error at or near '"data"'(line 1, pos 14)

== SQL ==
SELECT * FROM "data"."data_part" WHERE 1=0
--------------^^^

	at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:306)
	at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:143)
	at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:52)
	at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:89)
	at org.apache.spark.sql.SparkSession.$anonfun$sql$2(SparkSession.scala:620)
	at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111)
	at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:620)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:779)
	at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:617)
	at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:651)

Does this PR introduce any user-facing change?

no

How was this patch tested?

local test

Was this patch authored or co-authored using generative AI tooling?

no

xleoken avatar Mar 21 '24 15:03 xleoken

@xleoken I think you can implement a catalog plugin and register two custom Hive JDBC dialects.

Just FYI, SPARK-47496 makes loading a custom dialect much easier.

This is too heavy for users and there's no need for it.

As Daniel Fernandez said in https://issues.apache.org/jira/browse/SPARK-22016, only two functions need to be overridden (sketched below).

https://issues.apache.org/jira/browse/SPARK-21063 https://issues.apache.org/jira/browse/SPARK-22016 https://issues.apache.org/jira/browse/SPARK-31457
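
For illustration, a minimal user-side dialect covering just those two functions — a sketch assuming the public org.apache.spark.sql.jdbc.JdbcDialect API; the object name is hypothetical and this is not the code proposed in this PR:

import java.util.Locale

import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}

// Hypothetical dialect: only the two methods mentioned above are overridden.
case object HiveJdbcDialect extends JdbcDialect {
  // Claim jdbc:hive2 URLs so this dialect is selected for the Hive JDBC catalogs.
  override def canHandle(url: String): Boolean =
    url.toLowerCase(Locale.ROOT).startsWith("jdbc:hive2")

  // HiveQL quotes identifiers with backticks rather than ANSI double quotes.
  override def quoteIdentifier(colName: String): String = s"`$colName`"
}

// Programmatic registration; must happen before the JDBC catalog is first used.
JdbcDialects.registerDialect(HiveJdbcDialect)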

xleoken avatar Mar 21 '24 15:03 xleoken

Hi @dongjoon-hyun @yaooqinn @HyukjinKwon, please take a serious look at this issue. The old related PRs haven't been active for a long time, so we can discuss it here.

When we hit this issue, the client reported "Table or view not found" while the server reported org.apache.spark.sql.catalyst.parser.ParseException. We spent a lot of time analyzing it before we solved it.

By the way, could Spark throw a "jdbc:hive2 is not supported" exception directly? Or update the docs to tell users that they need a custom dialect.

To summarize:

  • From the exception stacktrace below, it takes a long time to work out that the root cause is JdbcDialects#quoteIdentifier (see the sketch after this list).
  • A fix can be shipped as a third-party library or by implementing a catalog plugin, but both are too heavy for users. As yaooqinn said, it's difficult to register a custom JDBC dialect to use. https://github.com/apache/spark/pull/45626
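
As a concrete illustration of that first point: the default dialect quotes identifiers with ANSI double quotes, which is what produces the rejected SELECT * FROM "data"."data_part" WHERE 1=0 probe query, while Hive-style quoting uses backticks. A self-contained sketch (assumed to mirror the default quoting behaviour; not copied from Spark source):

object QuotingSketch {
  // Default-style quoting: ANSI double quotes, as the generic JDBC dialect does.
  def defaultQuote(name: String): String = s""""$name""""

  // Hive-style quoting: backticks.
  def hiveQuote(name: String): String = s"`$name`"

  def main(args: Array[String]): Unit = {
    val table = Seq("data", "data_part")
    // SELECT * FROM "data"."data_part" WHERE 1=0  -- rejected by the server-side parser
    println(s"SELECT * FROM ${table.map(defaultQuote).mkString(".")} WHERE 1=0")
    // SELECT * FROM `data`.`data_part` WHERE 1=0  -- the form Hive-style parsing accepts
    println(s"SELECT * FROM ${table.map(hiveQuote).mkString(".")} WHERE 1=0")
  }
}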

1. Start the thriftserver

sbin/start-thriftserver.sh

2. Start spark-shell

bin/spark-shell \
--conf spark.sql.catalog.aaa=org.apache.spark.sql.execution.datasources.v2.jdbc.JDBCTableCatalog \
--conf spark.sql.catalog.aaa.url=jdbc:hive2://172.16.10.12:10000/data \
--conf spark.sql.catalog.aaa.driver=org.apache.hive.jdbc.HiveDriver

3. Query

select * from aaa.data.data_part limit 1

4. Client exception (Table or view not found: aaa.data.data_part)

scala> spark.sql("select * from aaa.data.data_part limit 1").show();
24/03/22 08:35:53 WARN HiveConnection: Failed to connect to 172.16.10.12:10000
org.apache.spark.sql.AnalysisException: Table or view not found: aaa.data.data_part; line 1 pos 14;
'GlobalLimit 1
+- 'LocalLimit 1
   +- 'Project [*]
      +- 'UnresolvedRelation [aaa, data, data_part], [], false

  at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$1(CheckAnalysis.scala:131)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$1$adapted(CheckAnalysis.scala:102)
  at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:367)

5. Server exception (org.apache.spark.sql.catalyst.parser.ParseException)

24/03/22 08:45:42 INFO ThriftCLIService: Client protocol version: HIVE_CLI_SERVICE_PROTOCOL_V10
24/03/22 08:45:42 INFO HiveSessionImpl: Operation log session directory is created: /tmp/root/operation_logs/4d373392-cc24-45fd-b9b7-4e27eeb48292
24/03/22 08:45:42 INFO SparkExecuteStatementOperation: Submitting query 'SELECT * FROM "data"."data_part" WHERE 1=0' with b5e0d91c-6d3f-4a79-9bd6-d78233150e56
24/03/22 08:45:42 INFO SparkExecuteStatementOperation: Running query with b5e0d91c-6d3f-4a79-9bd6-d78233150e56
24/03/22 08:45:42 INFO DAGScheduler: Asked to cancel job group b5e0d91c-6d3f-4a79-9bd6-d78233150e56
24/03/22 08:45:42 ERROR SparkExecuteStatementOperation: Error executing query with b5e0d91c-6d3f-4a79-9bd6-d78233150e56, currentState RUNNING, 
org.apache.spark.sql.catalyst.parser.ParseException: 
Syntax error at or near '"data"'(line 1, pos 14)

== SQL ==
SELECT * FROM "data"."data_part" WHERE 1=0
--------------^^^

	at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:306)
	at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:143)
	at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:52)
	at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:89)
	at org.apache.spark.sql.SparkSession.$anonfun$sql$2(SparkSession.scala:620)
	at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111)
	at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:620)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:779)
	at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:617)
	at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:651)
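
Under those assumptions, one way to unblock this repro is to register a Hive-aware dialect in the spark-shell session (step 2) before issuing the query, so the client generates backtick-quoted probe queries — a sketch using the public JdbcDialects API, with catalog and table names taken from the repro above:

// Paste into spark-shell before querying the aaa catalog.
import java.util.Locale
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}

JdbcDialects.registerDialect(new JdbcDialect {
  override def canHandle(url: String): Boolean =
    url.toLowerCase(Locale.ROOT).startsWith("jdbc:hive2")
  override def quoteIdentifier(colName: String): String = s"`$colName`"
})

// The JDBC catalog should now emit probe queries like
// SELECT * FROM `data`.`data_part` WHERE 1=0 instead of the double-quoted form.
spark.sql("select * from aaa.data.data_part limit 1").show()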

xleoken avatar Mar 22 '24 00:03 xleoken

This patch works for me too.

charlesy6 avatar Mar 23 '24 05:03 charlesy6

cc @dongjoon-hyun @yaooqinn @HyukjinKwon

xleoken avatar Mar 26 '24 01:03 xleoken

I think we shouldn't add HiveDialect as a built-in dialect. Users can add a custom dialect with https://github.com/apache/spark/pull/45626.

beliefer avatar Apr 23 '24 03:04 beliefer