Spark SQL queries against a Hudi partitioned table generated by batch sync fail, while Hive and Presto queries work fine
The order_info table is configured with the slashEncodedDay partition strategy on the create_time column, with pt set as the partition field; the triggered sync job completes successfully. Querying the table's partitions and row count with both Hive and Presto returns correct results, but running the same partition query through spark-sql raises the following error:
spark-sql> show partitions order_info;
org.apache.spark.sql.AnalysisException: SHOW PARTITIONS is not allowed on order_info since its partition metadata is not stored in the Hive metastore. To import this information into the metastore, run `msck repair table order_info`;
at org.apache.spark.sql.execution.command.DDLUtils$.verifyPartitionProviderIsHive(ddl.scala:835)
at org.apache.spark.sql.execution.command.ShowPartitionsCommand.run(tables.scala:888)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)
at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:194)
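The check failing here is Spark's partition-provider flag on the table metadata, not the data itself. One way to confirm, as a sketch (the exact output fields vary across Spark versions):

spark-sql> describe formatted order_info;

If the output carries no "Partition Provider: Catalog" row, Spark treats the partition metadata as filesystem-only, which is exactly the state that msck repair table fixes by importing the partitions into the Hive metastore.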
After running msck repair table order_info and re-issuing the command above, the partitions display correctly:
scala> spark.sql("msck repair table order_info").show(100, false)
22/06/08 10:17:05 WARN command.AlterTableRecoverPartitionsCommand: ignore hdfs://namenode/user/admin/default/20220608181019/order_info/hudi/2019
++
||
++
++
spark-sql> show partitions order_info;
+-------------+
|partition |
+-------------+
|pt=2019-11-23|
+-------------+
Note the WARN above: msck skipped the slash-encoded directory order_info/hudi/2019, presumably because MSCK REPAIR only discovers hive-style name=value directory layouts, so only pt=2019-11-23 was recovered. Next, querying the row count:
spark-sql> select count(1) from order_info
Caused by: java.io.IOException: Required column is missing in data file. Col: [pt]
at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initializeInternal(VectorizedParquetRecordReader.java:292)
at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initialize(VectorizedParquetRecordReader.java:132)
at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(ParquetFileFormat.scala:418)
at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(ParquetFileFormat.scala:352)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:124)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:177)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.scan_nextBatch_0$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:255)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:123)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
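This second failure looks like Spark bypassing the table's Hudi input format: with the default spark.sql.hive.convertMetastoreParquet=true, Spark converts the metastore parquet table to its native datasource reader and expects the pt column inside the data files, where Hudi did not write it. A sketch of the workaround recommended in the Hudi docs for querying Hive-synced tables, assuming the table was synced to the metastore (the bundle jar path and version below are placeholders for your environment):

spark-sql \
  --jars /path/to/hudi-spark-bundle_2.11-0.9.0.jar \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.sql.hive.convertMetastoreParquet=false

With convertMetastoreParquet=false, Spark routes the read through the HoodieParquetInputFormat registered in the metastore, which supplies pt from the partition metadata instead of requiring it to be present in the parquet files.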
Received.
complete