
[HUDI-4588][HUDI-4472] Fixing `HoodieParquetReader` to properly specify projected schema when reading Parquet file

Open · alexeykudinkin opened this pull request 2 years ago • 3 comments

Change Logs

Currently, HoodieParquetReader does not specify the projected schema properly when reading Parquet files, which ends up failing in cases when the provided schema is not equal to the schema of the file being read (even though it might be a proper projection, i.e. a subset of it).
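For illustration only, here is a minimal sketch (not the actual HoodieParquetReader code) of how a projected schema can be handed to parquet-avro via AvroReadSupport so that reading a subset of the file's columns succeeds; the readProjected helper and its signature are made up for this example:

```scala
import org.apache.avro.Schema
import org.apache.avro.generic.GenericRecord
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.avro.{AvroParquetReader, AvroReadSupport}

// Hypothetical helper: read a Parquet file using a projected (subset) Avro schema.
// Without the requested projection, records are resolved against the full file schema,
// which fails when the supplied schema is only a subset of it.
def readProjected(path: Path, projectedSchema: Schema): Iterator[GenericRecord] = {
  val conf = new Configuration()
  // Restrict which columns are actually read from the file
  AvroReadSupport.setRequestedProjection(conf, projectedSchema)
  // Materialize records in the shape of the projected schema
  AvroReadSupport.setAvroReadSchema(conf, projectedSchema)

  val reader = AvroParquetReader.builder[GenericRecord](path).withConf(conf).build()
  Iterator.continually(reader.read()).takeWhile(_ != null)
}
```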

To address the original issue described in HUDI-4588, we also have to relax the constraints imposed by the TableSchemaResolver.isSchemaCompatible method, which does not allow columns to be evolved by way of dropping columns.

Changes:

  1. Adding missing schema projection when reading Parquet file (using AvroParquetReader)
  2. Relaxing schema evolution constraints to allow columns to be dropped (see the compatibility sketch after this list)
  3. Revisiting schema reconciliation logic to make sure it's consistent
  4. Streamlining schema handling in HoodieSparkSqlWriter to make sure it's uniform for all operations (it isn't applied properly for Bulk-insert at the moment)
  5. Adding comprehensive tests for basic schema evolution
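The sketch below is not the TableSchemaResolver code; it uses Avro's standard SchemaCompatibility utility, with made-up schemas, to illustrate why a batch schema that drops an optional column remains read-compatible with the table schema, which is the kind of evolution item 2 is intended to allow:

```scala
import org.apache.avro.{Schema, SchemaBuilder, SchemaCompatibility}

// Existing table schema with three fields; the incoming batch drops "city".
val tableSchema: Schema = SchemaBuilder.record("rec").fields()
  .requiredLong("id")
  .requiredString("name")
  .optionalString("city")
  .endRecord()

val incomingSchema: Schema = SchemaBuilder.record("rec").fields()
  .requiredLong("id")
  .requiredString("name")
  .endRecord()

// Treat the incoming (narrower) schema as the reader and the table schema as the writer:
// dropping a column still yields a compatible pair, since the reader simply ignores it.
val result = SchemaCompatibility.checkReaderWriterCompatibility(incomingSchema, tableSchema)
println(result.getType) // COMPATIBLE
```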

Impact

Medium

There are a few critical changes brought forward by this PR:

Now, with incoming batches allowed to drop columns (relative to the table's existing schema), unless hoodie.datasource.write.reconcile.schema is enabled:

  • Incoming batch's schema will be taken as Writer's schema (same as before)
  • New/updated base files will be (re)written in the new schema (previously it would have failed)

This subtle change in behavior (dropped columns no longer lead to failures) could open up a new set of problems where data quality issues (for example, a column missing when it shouldn't be) could trickle down into the existing table.

To alleviate that, we should consider flipping hoodie.datasource.write.reconcile.schema to true by default (there's already #6196 for that); see the write sketch below.
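As a usage illustration only (not code from this PR), the sketch below writes a batch whose schema drops a column and explicitly opts into schema reconciliation; the table name, record key, precombine field, and path are placeholders:

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder().appName("reconcile-sketch").master("local[*]").getOrCreate()
import spark.implicits._

// Assume the table was created with columns (id, ts, name, city); this batch drops "city".
val dfMissingColumn = Seq((1L, 100L, "alice"), (2L, 101L, "bob")).toDF("id", "ts", "name")

dfMissingColumn.write
  .format("hudi")
  .option("hoodie.table.name", "schema_evolution_demo")      // placeholder table name
  .option("hoodie.datasource.write.recordkey.field", "id")
  .option("hoodie.datasource.write.precombine.field", "ts")
  // true: reconcile the batch against the table's existing schema;
  // false (the default at the time of this PR): the batch's narrower schema becomes the
  // writer schema and new base files are written without the dropped column.
  .option("hoodie.datasource.write.reconcile.schema", "true")
  .mode(SaveMode.Append)
  .save("/tmp/hudi/schema_evolution_demo")                   // placeholder base path
```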

Contributor's checklist

  • [ ] Read through contributor's guide
  • [ ] Change Logs and Impact were stated clearly
  • [ ] Adequate tests were added if applicable
  • [ ] CI passed

alexeykudinkin · Aug 10 '22 21:08

As outlined in https://github.com/apache/hudi/pull/6196#discussion_r961984500, this PR should go hand in hand with https://github.com/apache/hudi/pull/6196, which flips Schema Reconciliation to be enabled by default (entailing that every incoming batch would be reconciled relative to the table's schema).

alexeykudinkin · Sep 15 '22 20:09

@alexeykudinkin could you check the CI failure?

yihua · Sep 21 '22 15:09

@hudi-bot run azure

alexeykudinkin · Sep 29 '22 20:09

@xiarixiaoyao: can you review the patch as well? Some of the code that you have touched is being updated in this patch, so it would be good to get it reviewed by you.

nsivabalan · Nov 16 '22 15:11

@nsivabalan @alexeykudinkin I will review this PR today. Thanks.

xiarixiaoyao · Nov 17 '22 02:11

Had to disable TestCleaner as it's consistently failing due to HDFS cluster issues.

I ran it locally, and all tests pass.

alexeykudinkin · Nov 21 '22 22:11

> Had to disable TestCleaner as it's consistently failing due to HDFS cluster issues.
>
> I ran it locally, and all tests pass.

@alexeykudinkin this was landed https://github.com/apache/hudi/commit/0e1f9653c0e73287527ae62f75ba6e679cf0c1da

xushiyan · Nov 22 '22 12:11

@xushiyan I rebased on the latest master yesterday, and it was still failing (other tests started to fail). At this point I think we should move the whole of TestCleaner off HDFS.

alexeykudinkin · Nov 22 '22 18:11

CI report:

  • 288d166c49602a4593b1e97763a467811903737d UNKNOWN
  • 8a37a64610ed23294dc48570bbda72aeb0bb00ea UNKNOWN
  • 94c53ba11e5cc8ec4ee40f9f46a553b875ab0d90 Azure: FAILURE
Bot commands
@hudi-bot supports the following commands:
  • @hudi-bot run azure: re-run the last Azure build

hudi-bot · Nov 24 '22 03:11

CI is green:

Screenshot (2022-11-24, 1:31:03 AM): https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build/results?buildId=13222&view=results

alexeykudinkin · Nov 24 '22 09:11

@aditiwari01 @ad1happy2go @alexeykudinkin

Excuse me, I am hitting this error: https://github.com/apache/hudi/issues/8904

I use Hudi 0.13.1 and set hoodie.datasource.write.reconcile.schema=true, but a Spark SQL query on the Hudi table still fails with the following error.

Caused by: org.apache.hudi.exception.HoodieException: Exception when reading log file
  at org.apache.hudi.common.table.log.AbstractHoodieLogRecordReader.scanInternalV1(AbstractHoodieLogRecordReader.java:374)
  at org.apache.hudi.common.table.log.AbstractHoodieLogRecordReader.scanInternal(AbstractHoodieLogRecordReader.java:223)
  at org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner.performScan(HoodieMergedLogRecordScanner.java:198)
  at org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner.<init>(HoodieMergedLogRecordScanner.java:114)
  at org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner.<init>(HoodieMergedLogRecordScanner.java:73)
  at org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner$Builder.build(HoodieMergedLogRecordScanner.java:464)
  at org.apache.hudi.LogFileIterator$.scanLog(Iterators.scala:326)
  at org.apache.hudi.LogFileIterator.<init>(Iterators.scala:92)
  at org.apache.hudi.RecordMergingFileIterator.<init>(Iterators.scala:172)
  at org.apache.hudi.HoodieMergeOnReadRDD.compute(HoodieMergeOnReadRDD.scala:100)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
  at org.apache.spark.scheduler.Task.run(Task.scala:133)
  at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1474)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
  at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.avro.AvroTypeException: Found string, expecting union
  at org.apache.avro.io.ResolvingDecoder.doAction(ResolvingDecoder.java:308)
  at org.apache.avro.io.parsing.Parser.advance(Parser.java:86)
  at org.apache.avro.io.ResolvingDecoder.readIndex(ResolvingDecoder.java:275)
  at org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:187)
  at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:160)
  at org.apache.avro.generic.GenericDatumReader.readField(GenericDatumReader.java:259)
  at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:247)
  at org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:179)
  at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:160)
  at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:153)
  at org.apache.hudi.common.table.log.block.HoodieAvroDataBlock$RecordIterator.next(HoodieAvroDataBlock.java:199)
  at org.apache.hudi.common.table.log.block.HoodieAvroDataBlock$RecordIterator.next(HoodieAvroDataBlock.java:149)
  at org.apache.hudi.common.util.MappingIterator.next(MappingIterator.java:40)
  at org.apache.hudi.common.util.ClosableIteratorWithSchema.next(ClosableIteratorWithSchema.java:53)
  at org.apache.hudi.common.table.log.AbstractHoodieLogRecordReader.processDataBlock(AbstractHoodieLogRecordReader.java:630)
  at org.apache.hudi.common.table.log.AbstractHoodieLogRecordReader.processQueuedBlocksForInstant(AbstractHoodieLogRecordReader.java:670)
  at org.apache.hudi.common.table.log.AbstractHoodieLogRecordReader.scanInternalV1(AbstractHoodieLogRecordReader.java:365)
  ... 25 more

zyclove · Jun 09 '23 06:06