[HUDI-7769] Fix Hudi CDC read with legacy parquet file format on Spark
Change Logs
The CDC relation expects `InternalRow` records from the base and log files for merging, so we have to explicitly turn off `spark.sql.parquet.enableVectorizedReader`. Otherwise, the following error is thrown for a CDC query with the Hudi legacy parquet file format on Spark:
```
Job aborted due to stage failure: Task 0 in stage 84.0 failed 1 times, most recent failure: Lost task 0.0 in stage 84.0 (TID 122) (fv-az692-999.kaylvc4pbm2utmerkaq2ecni0a.ex.internal.cloudapp.net executor driver): java.lang.AssertionError
    at org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.getLongs(OnHeapColumnVector.java:389)
    at org.apache.spark.sql.vectorized.ColumnarArray.toLongArray(ColumnarArray.java:88)
    at org.apache.spark.sql.vectorized.ColumnarArray.copy(ColumnarArray.java:65)
    at org.apache.spark.sql.vectorized.ColumnarBatchRow.copy(ColumnarBatchRow.java:77)
    at org.apache.hudi.cdc.HoodieCDCRDD$CDCFileGroupIterator.$anonfun$loadCdcFile$1(HoodieCDCRDD.scala:443)
    at scala.collection.Iterator.foreach(Iterator.scala:943)
    at scala.collection.Iterator.foreach$(Iterator.scala:943)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
    at org.apache.hudi.cdc.HoodieCDCRDD$CDCFileGroupIterator.loadCdcFile(HoodieCDCRDD.scala:441)
    at org.apache.hudi.cdc.HoodieCDCRDD$CDCFileGroupIterator.hasNextInternal(HoodieCDCRDD.scala:250)
    at org.apache.hudi.cdc.HoodieCDCRDD$CDCFileGroupIterator.hasNext(HoodieCDCRDD.scala:278)
```
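A minimal sketch of the idea behind the fix (not the actual patch, and the session setup here is purely illustrative): a read path that needs `InternalRow` rather than `ColumnarBatchRow` from parquet base files has to disable Spark's vectorized parquet reader before the scan is planned.

```scala
import org.apache.spark.sql.SparkSession

// Illustrative only: a local session standing in for the one used by the CDC read.
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("cdc-read-sketch")
  .getOrCreate()

// Force the row-based parquet reader so the scan produces InternalRow;
// with the vectorized reader on, copying a ColumnarBatchRow can hit the
// AssertionError shown in the stack trace above.
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")
```

The key point is that the config must be set (or overridden on the per-scan Hadoop configuration) before the base-file reader is built, since the vectorized/row-based decision is made at planning time.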
Impact
Fixes CDC reads with the legacy parquet file format on newer Spark versions.
Risk level
low
Documentation Update
none
Contributor's checklist
- [ ] Read through contributor's guide
- [ ] Change Logs and Impact were stated clearly
- [ ] Adequate tests were added if applicable
- [ ] CI passed
CI report:
- 922efda55e668b992e1b12b873be49c7f1645fba Azure: SUCCESS
Bot commands
@hudi-bot supports the following commands:
- `@hudi-bot run azure` re-run the last Azure build