[HUDI-7769] Fix Hudi CDC read with legacy parquet file format on Spark

Open • yihua opened this issue 1 year ago • 1 comment

Change Logs

The CDC relation expects InternalRow instances from the base and log files for merging, so spark.sql.parquet.enableVectorizedReader must be explicitly turned off. Otherwise, a CDC query using the Hudi legacy parquet file format on Spark fails with the following error:

Job aborted due to stage failure: Task 0 in stage 84.0 failed 1 times, most recent failure: Lost task 0.0 in stage 84.0 (TID 122) (fv-az692-999.kaylvc4pbm2utmerkaq2ecni0a.ex.internal.cloudapp.net executor driver): java.lang.AssertionError
	at org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.getLongs(OnHeapColumnVector.java:389)
	at org.apache.spark.sql.vectorized.ColumnarArray.toLongArray(ColumnarArray.java:88)
	at org.apache.spark.sql.vectorized.ColumnarArray.copy(ColumnarArray.java:65)
	at org.apache.spark.sql.vectorized.ColumnarBatchRow.copy(ColumnarBatchRow.java:77)
	at org.apache.hudi.cdc.HoodieCDCRDD$CDCFileGroupIterator.$anonfun$loadCdcFile$1(HoodieCDCRDD.scala:443)
	at scala.collection.Iterator.foreach(Iterator.scala:943)
	at scala.collection.Iterator.foreach$(Iterator.scala:943)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
	at org.apache.hudi.cdc.HoodieCDCRDD$CDCFileGroupIterator.loadCdcFile(HoodieCDCRDD.scala:441)
	at org.apache.hudi.cdc.HoodieCDCRDD$CDCFileGroupIterator.hasNextInternal(HoodieCDCRDD.scala:250)
	at org.apache.hudi.cdc.HoodieCDCRDD$CDCFileGroupIterator.hasNext(HoodieCDCRDD.scala:278)
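
For readers on a build without this fix, the same setting can be applied from user code. The following is a minimal PySpark sketch, not the change made in this PR: it assumes the session-level value of spark.sql.parquet.enableVectorizedReader is honored by the CDC read path, and the helper names (cdc_read_options, read_cdc), the table path, and the begin instant are hypothetical. The Hudi read options are the standard incremental CDC query options.

```python
# Sketch only: helper names, table path, and begin instant are illustrative
# assumptions, not part of this PR.
VECTORIZED_READER_KEY = "spark.sql.parquet.enableVectorizedReader"


def cdc_read_options(begin_instant):
    """Build the Hudi datasource options for an incremental CDC query."""
    return {
        "hoodie.datasource.query.type": "incremental",
        "hoodie.datasource.query.incremental.format": "cdc",
        "hoodie.datasource.read.begin.instanttime": begin_instant,
    }


def read_cdc(spark, base_path, begin_instant="0"):
    """Turn off the vectorized parquet reader, then build the CDC DataFrame.

    The setting is deliberately not restored here: Spark evaluates lazily,
    so it has to stay in effect until an action runs on the result.
    """
    spark.conf.set(VECTORIZED_READER_KEY, "false")
    return (
        spark.read.format("hudi")
        .options(**cdc_read_options(begin_instant))
        .load(base_path)
    )
```

With an active SparkSession, usage would look like read_cdc(spark, "/path/to/hudi_table").show(), where the path is a placeholder.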

Impact

Fixes CDC reads that use the Hudi legacy parquet file format on newer Spark versions.

Risk level

low

Documentation Update

none

Contributor's checklist

  • [ ] Read through contributor's guide
  • [ ] Change Logs and Impact were stated clearly
  • [ ] Adequate tests were added if applicable
  • [ ] CI passed

yihua • May 16 '24 04:05

CI report:

  • 922efda55e668b992e1b12b873be49c7f1645fba Azure: SUCCESS

Bot commands

@hudi-bot supports the following commands:

  • @hudi-bot run azure: re-run the last Azure build

hudi-bot • May 18 '24 17:05