
[BUG]: Hive incompatibility when using microsoft-spark-3-1_2.12-2.1.1.jar

Open · dcosta91 opened this issue 1 year ago · 1 comment

Describe the bug

We are trying to upgrade some Spark 2.4 jobs to 3.1, and everything works until the job tries to write into a Hive partition. It then fails because the get_partition_names_ps_req Thrift method cannot be found.

Just as a note: if I use microsoft-spark-2-4_2.11-2.1.1.jar, it works, so it must be some incompatibility between microsoft-spark-3-1_2.12-2.1.1.jar and the cluster I'm using.
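
For reference, a minimal Scala equivalent of the failing write might look like the sketch below (the table and column names are made up, not from the original job); the .NET job ends up in the same DataFrameWriter.insertInto / loadDynamicPartitions path shown in the stack trace further down:

    import org.apache.spark.sql.SparkSession

    object PartitionedInsertRepro {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("hive-partitioned-insert-repro")
          .enableHiveSupport()
          .getOrCreate()

        // Dynamic partition inserts are what end up in
        // HiveClientImpl.loadDynamicPartitions in the stack trace below.
        spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")

        // some_db.some_partitioned_table is a placeholder for an existing
        // Hive table partitioned by `part`.
        val df = spark.range(10)
          .selectExpr("id AS value", "CAST(id % 2 AS STRING) AS part")

        df.write.insertInto("some_db.some_partitioned_table")

        spark.stop()
      }
    }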

Some questions:

  • Where can I find the source code for microsoft-spark-3-1_2.12-2.1.1.jar?
  • Which version of Hive is this jar built against?
  • Does anybody have ideas on how to get this working? (One configuration idea is sketched below.)
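
One thing that might be worth trying (a sketch only, not verified on CDH): Spark 3.1 normally ships a built-in Hive 2.3.x client, but it can be pointed at the Hive client jars that match the cluster's metastore through the standard spark.sql.hive.metastore.* settings. The parcel path below is an assumption, so adjust it to your CDH layout; note that the Spark 3.1 docs list supported metastore versions only up to 3.1.2, which is why 3.1.2 is used here even though the cluster runs Hive 3.1.3:

    spark-submit \
      --class org.apache.spark.deploy.dotnet.DotnetRunner \
      --conf spark.sql.hive.metastore.version=3.1.2 \
      --conf spark.sql.hive.metastore.jars=path \
      --conf spark.sql.hive.metastore.jars.path=/opt/cloudera/parcels/CDH/lib/hive/lib/*.jar \
      microsoft-spark-3-1_2.12-2.1.1.jar \
      ...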

Versions used:

  • microsoft-spark-3-1_2.12-2.1.1.jar
  • Cluster versions:
    • CDH-7.1.7
    • Spark: 3.1.1
    • Hive: 3.1.3

Error message:

[2023-03-03T09:18:44.7972462Z] [ldchio1091] [Error] [JvmBridge] org.apache.spark.sql.AnalysisException: org.apache.thrift.TApplicationException: Invalid method name: 'get_partition_names_ps_req'
    at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$loadDynamicPartitions$1(HiveClientImpl.scala:969)
    at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
    at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$withHiveState$1(HiveClientImpl.scala:316)
    at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:245)
    at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:244)
    at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:298)
    at org.apache.spark.sql.hive.client.HiveClientImpl.loadDynamicPartitions(HiveClientImpl.scala:956)
    at org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$loadDynamicPartitions$1(HiveExternalCatalog.scala:945)
    at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
    at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:102)
    at org.apache.spark.sql.hive.HiveExternalCatalog.loadDynamicPartitions(HiveExternalCatalog.scala:925)
    at org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.loadDynamicPartitions(ExternalCatalogWithListener.scala:189)
    at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.processInsert(InsertIntoHiveTable.scala:221)
    at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.run(InsertIntoHiveTable.scala:103)
    at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:108)
    at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:106)
    at org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:131)
    at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:180)
    at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:218)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:215)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:176)
    at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:133)
    at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:132)
    at org.apache.spark.sql.DataFrameWriter.$anonfun$runCommand$1(DataFrameWriter.scala:997)
    at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
    at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
    at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
    at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:772)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
    at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:997)
    at org.apache.spark.sql.DataFrameWriter.insertInto(DataFrameWriter.scala:542)
    at org.apache.spark.sql.DataFrameWriter.insertInto(DataFrameWriter.scala:496)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:566)
    at org.apache.spark.api.dotnet.DotnetBackendHandler.handleMethodCall(DotnetBackendHandler.scala:165)
    at org.apache.spark.api.dotnet.DotnetBackendHandler.$anonfun$handleBackendRequest$2(DotnetBackendHandler.scala:105)
    at org.apache.spark.api.dotnet.ThreadPool$$anon$1.run(ThreadPool.scala:34)
    at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:829)

dcosta91 · Mar 03 '23 10:03

I have now found that the method get_partition_names_ps_req does not exist in Hive 3.1.3 and was only added in Hive 4.0.0 (which is not released yet...).

Therefore my question is: why is the Microsoft worker not compatible with Hive 3.1.3? If I run the equivalent Java/Scala code on the same Spark version, it works with Hive. What am I missing here?
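
For what it's worth, one quick way to compare the two setups (just a sketch from spark-shell on the same cluster, assuming hive-common is on the classpath) is to print the Hive client version on the driver next to the metastore version Spark SQL is configured to talk to:

    // Hive client libraries actually on the driver classpath
    println(org.apache.hive.common.util.HiveVersionInfo.getVersion)

    // Metastore version Spark SQL is configured to speak
    // (defaults to the built-in client, 2.3.x on stock Spark 3.1)
    println(spark.conf.get("spark.sql.hive.metastore.version"))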

dcosta91 · Mar 06 '23 09:03