
Failed to read Chinese data with GBK encoding in Parquet

Open jackylee-ch opened this issue 2 years ago • 0 comments

Describe the bug
We hit this problem when GBK-encoded data has been written into a Parquet string column and we then try to read it back. Vanilla Spark does not validate that string payloads are UTF-8 and simply returns the data, but with gazelle_plugin the native Arrow scanner rejects the batch:

Caused by: java.lang.RuntimeException: Invalid UTF8 payload
        at org.apache.arrow.dataset.jni.JniWrapper.nextRecordBatch(Native Method)
        at org.apache.arrow.dataset.jni.NativeScanner$1.hasNext(NativeScanner.java:88)
        at scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:43)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:488)
        at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
        at org.apache.spark.sql.execution.datasources.v2.arrow.SparkMemoryUtils$UnsafeItr.hasNext(SparkMemoryUtils.scala:330)
        at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:104)
        at org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:522)
        at com.intel.oap.execution.ColumnarHashAggregateExec$$anon$1.process(ColumnarHashAggregateExec.scala:162)
        at com.intel.oap.execution.ColumnarHashAggregateExec$$anon$1.hasNext(ColumnarHashAggregateExec.scala:199)
        at com.intel.oap.vectorized.CloseableColumnBatchIterator.hasNext(CloseableColumnBatchIterator.scala:47)
        at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
        at org.apache.spark.shuffle.ColumnarShuffleWriter.write(ColumnarShuffleWriter.scala:96)
        at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
        at org.apache.spark.scheduler.Task.run(Task.scala:131)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:510)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1491)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:513)
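For context: GBK bytes for common Chinese characters are not valid UTF-8, which is why the native scanner rejects them while vanilla Spark's non-validating path returns them untouched. A minimal standalone check (Scala; this snippet is illustrative and not taken from the plugin itself):

    import java.nio.ByteBuffer
    import java.nio.charset.{CharacterCodingException, CodingErrorAction, StandardCharsets}

    // GBK encodes "中文" as 0xD6 0xD0 0xCE 0xC4, which is not a valid
    // UTF-8 byte sequence, so a strict UTF-8 decoder rejects it.
    val gbkBytes = "中文".getBytes("GBK")
    val decoder = StandardCharsets.UTF_8.newDecoder()
      .onMalformedInput(CodingErrorAction.REPORT)
    try {
      decoder.decode(ByteBuffer.wrap(gbkBytes))
    } catch {
      case e: CharacterCodingException =>
        println(s"Not valid UTF-8: $e") // this branch is taken
    }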

To Reproduce
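A hypothetical minimal reproduction, assuming gazelle_plugin is enabled on the session and a Parquet file whose string column holds raw GBK bytes (the path /tmp/gbk.parquet and the column name "name" are placeholders, not from the report):

    import org.apache.spark.sql.SparkSession

    // Assumes gazelle_plugin is already enabled on this session via
    // spark.sql.extensions / the columnar plugin settings.
    val spark = SparkSession.builder().getOrCreate()
    val df = spark.read.parquet("/tmp/gbk.parquet")
    // A group-by matches the ColumnarHashAggregateExec path in the
    // stack trace above; a plain collect() on the scan should also
    // fail, since the exception is raised while fetching record
    // batches from the native scanner.
    df.groupBy("name").count().show()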

Expected behavior
No exception; the data should be returned as-is, matching vanilla Spark's behavior.

Additional context

jackylee-ch · Jun 20 '22 16:06