[ORC] Encounter bitmap out of bound issue in evaluateFilter
Describe the bug
When running TPC-DS integration testing, we encountered the out-of-bound issue below:
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 9) (vsr532 executor 1): max_bitmap_index 1920799 must be <= maxSupportedValue 65535 in selection vector
at org.apache.arrow.gandiva.evaluator.JniWrapper.evaluateFilter(Native Method)
at org.apache.arrow.gandiva.evaluator.Filter.evaluate(Filter.java:179)
at org.apache.arrow.gandiva.evaluator.Filter.evaluate(Filter.java:131)
at com.intel.oap.expression.ColumnarConditionProjector$$anon$1.hasNext(ColumnarConditionProjector.scala:241)
at com.intel.oap.vectorized.CloseableColumnBatchIterator.hasNext(CloseableColumnBatchIterator.scala:47)
at org.apache.spark.sql.execution.ColumnarBroadcastExchangeExec.$anonfun$relationFuture$2(ColumnarBroadcastExchangeExec.scala:107)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2(RDD.scala:863)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2$adapted(RDD.scala:863)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:131)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
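The 65535 cap comes from Gandiva's 16-bit selection vector: row indices are stored as uint16_t, so the largest representable row index is UINT16_MAX. Below is a minimal sketch of the bound check behind the error message (hypothetical function name, not the actual Gandiva source):

#include <cstdint>
#include <sstream>
#include <stdexcept>

// Hypothetical sketch of the check behind the error above: a 16-bit selection
// vector stores row indices as uint16_t, so the highest addressable row index
// is UINT16_MAX (65535). Any batch whose last set bit exceeds that fails.
void CheckSelectionVectorBound(int64_t max_bitmap_index) {
  constexpr int64_t kMaxSupportedValue = UINT16_MAX;  // 65535
  if (max_bitmap_index > kMaxSupportedValue) {
    std::ostringstream msg;
    msg << "max_bitmap_index " << max_bitmap_index
        << " must be <= maxSupportedValue " << kMaxSupportedValue
        << " in selection vector";
    throw std::runtime_error(msg.str());
  }
}

In the failing task, max_bitmap_index was 1920799, i.e. the filter was evaluated over a record batch of roughly 1.9 million rows.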
By debugging, we have figured out that the cause is in Arrow's file_orc.cc:
Result<RecordBatchIterator> Execute() override {
  ...
  Result<std::shared_ptr<RecordBatch>> Next() {
    if (i_ == num_stripes_) {
      return nullptr;
    }
    std::shared_ptr<RecordBatch> batch;
    // TODO (https://issues.apache.org/jira/browse/ARROW-14153)
    // pass scan_options_->batch_size
    return reader_->ReadStripe(i_++, included_fields_);
  }
  ...
}
The ORC reader in the Arrow dataset layer does not yet honor the ScanOptions batch_size option: Next() returns one whole stripe at a time, so the returned record batch may contain more than 65535 rows.
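One possible direction for the fix, sketched below under the assumption that slicing happens right after ReadStripe (the helper name is hypothetical, not the actual ARROW-14153 patch), is to split each stripe batch into chunks of at most batch_size rows using the zero-copy arrow::RecordBatch::Slice before handing them downstream:

#include <memory>
#include <vector>

#include <arrow/record_batch.h>

// Hypothetical helper, not the actual ARROW-14153 patch: split a full-stripe
// batch into chunks of at most `batch_size` rows. RecordBatch::Slice is
// zero-copy, so chunking a large stripe is cheap.
std::vector<std::shared_ptr<arrow::RecordBatch>> SliceToBatchSize(
    const std::shared_ptr<arrow::RecordBatch>& stripe_batch, int64_t batch_size) {
  std::vector<std::shared_ptr<arrow::RecordBatch>> chunks;
  for (int64_t offset = 0; offset < stripe_batch->num_rows(); offset += batch_size) {
    // Slice clamps the length at the end of the batch, so the last chunk
    // may be shorter than batch_size.
    chunks.push_back(stripe_batch->Slice(offset, batch_size));
  }
  return chunks;
}

As long as batch_size stays at or below 65535, every chunk remains addressable by Gandiva's 16-bit selection vector.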
cc @zhouyuan @zhztheplayer
#556 may help