gazelle_plugin
Failed to read GBK-encoded Chinese data in Parquet
Describe the bug
We hit this problem when data encoded in GBK has been written to a Parquet file and we then try to read it back. Vanilla Spark does not check whether the bytes are valid UTF-8 and simply returns the data. With gazelle_plugin, the read fails with the following exception:
Caused by: java.lang.RuntimeException: Invalid UTF8 payload
at org.apache.arrow.dataset.jni.JniWrapper.nextRecordBatch(Native Method)
at org.apache.arrow.dataset.jni.NativeScanner$1.hasNext(NativeScanner.java:88)
at scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:43)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:488)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
at org.apache.spark.sql.execution.datasources.v2.arrow.SparkMemoryUtils$UnsafeItr.hasNext(SparkMemoryUtils.scala:330)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:104)
at org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:522)
at com.intel.oap.execution.ColumnarHashAggregateExec$$anon$1.process(ColumnarHashAggregateExec.scala:162)
at com.intel.oap.execution.ColumnarHashAggregateExec$$anon$1.hasNext(ColumnarHashAggregateExec.scala:199)
at com.intel.oap.vectorized.CloseableColumnBatchIterator.hasNext(CloseableColumnBatchIterator.scala:47)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
at org.apache.spark.shuffle.ColumnarShuffleWriter.write(ColumnarShuffleWriter.scala:96)
at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
at org.apache.spark.scheduler.Task.run(Task.scala:131)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:510)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1491)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:513)
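The "Invalid UTF8 payload" message is consistent with GBK bytes simply not being well-formed UTF-8. The following is a minimal sketch of that mismatch in plain Python, independent of Spark, Arrow, and Parquet; the sample string and the helper `is_valid_utf8` are hypothetical and only stand in for whatever validation the native reader performs.

```python
# Sketch only: shows that GBK-encoded Chinese text fails strict UTF-8
# validation. This is NOT the Arrow or gazelle_plugin code path.

def is_valid_utf8(payload: bytes) -> bool:
    """Return True if payload decodes as strict UTF-8 (hypothetical helper)."""
    try:
        payload.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


text = "中文"                       # sample Chinese text (assumption)
gbk_bytes = text.encode("gbk")      # the bytes a GBK writer would store
utf8_bytes = text.encode("utf-8")   # the bytes a UTF-8 writer would store

print(is_valid_utf8(gbk_bytes))     # strict validation rejects the GBK bytes
print(is_valid_utf8(utf8_bytes))    # the UTF-8 bytes pass
print(gbk_bytes.decode("gbk"))      # the data is recoverable if decoded as GBK
```

A reader that returns the raw bytes without validating them, as vanilla Spark is described to do above, would not raise at this step; a reader that validates string columns would.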
To Reproduce
Expected behavior
The read completes without an exception, as it does in vanilla Spark.
Additional context