[HUDI-4588][HUDI-4472] Fixing `HoodieParquetReader` to properly specify projected schema when reading Parquet file
Change Logs
Currently, `HoodieParquetReader` is not specifying the projected schema properly when reading Parquet files, which ends up failing in cases when the provided schema is not equal to the schema of the file being read (even though it might be a proper projection, i.e. a subset of it).
To address the original issue described in HUDI-4588, we also have to relax the constraints imposed by the `TableSchemaResolver.isSchemaCompatible` method, which does not allow columns to be evolved by way of dropping columns.
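To illustrate the relaxation, here is a minimal, hypothetical sketch of the compatibility check in Python (the real logic lives in Java inside `TableSchemaResolver.isSchemaCompatible` and operates on Avro schemas; the function name, the dict-based schema model, and the `allow_dropped_columns` flag are all assumptions for illustration):

```python
# Simplified model: a schema is a dict of field name -> type name.
# This is NOT Hudi's actual API, just a sketch of the relaxed rule.

def is_schema_compatible(table_schema: dict, writer_schema: dict,
                         allow_dropped_columns: bool = True) -> bool:
    """With allow_dropped_columns=False (old behavior), the writer schema
    must contain every table column; with True (this PR's behavior), a
    proper projection of the table schema is also accepted."""
    missing = set(table_schema) - set(writer_schema)
    if missing and not allow_dropped_columns:
        return False
    # Fields present in both schemas must still agree on type.
    return all(table_schema[f] == writer_schema[f]
               for f in set(table_schema) & set(writer_schema))

table = {"id": "long", "name": "string", "email": "string"}
batch = {"id": "long", "name": "string"}  # 'email' dropped

print(is_schema_compatible(table, batch, allow_dropped_columns=False))  # False
print(is_schema_compatible(table, batch, allow_dropped_columns=True))   # True
```

Under the old check the dropped `email` column fails the write outright; under the relaxed check the batch is treated as a valid projection as long as the overlapping fields keep their types.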
Changes:
- Adding the missing schema projection when reading Parquet files (using `AvroParquetReader`)
- Relaxing schema evolution constraints to allow columns to be dropped
- Revisiting schema reconciliation logic to make sure it's consistent
- Streamlining schema handling in `HoodieSparkSqlWriter` to make sure it's uniform for all operations (it isn't applied properly for Bulk-insert at the moment)
- Adding comprehensive tests for basic schema evolution
Impact
Medium
There are a few critical changes taken forward by this PR:
Now, with incoming batches being able to drop columns (relative to the table's existing schema), unless `hoodie.datasource.write.reconcile.schema` is enabled:
- The incoming batch's schema will be taken as the writer's schema (same as before)
- New/updated base files will be (re)written in the new schema (previously this would have failed)

This subtle change in behavior (dropped columns will no longer lead to failures) could open up a new set of problems where data quality issues (for example, a column missing when it shouldn't be) could trickle down into an existing table. To alleviate that, we should consider flipping `hoodie.datasource.write.reconcile.schema` to `true` by default (there's already #6196 for that).
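The two modes above can be sketched with a minimal, hypothetical model (the function name and dict-based schema representation are assumptions for illustration; the real reconciliation operates on Avro schemas inside `HoodieSparkSqlWriter`):

```python
# Simplified model of what hoodie.datasource.write.reconcile.schema means
# for the writer schema. A schema is a dict of field name -> type name.
# This is a sketch, not Hudi's actual implementation.

def reconcile_writer_schema(table_schema: dict, batch_schema: dict,
                            reconcile: bool) -> dict:
    if not reconcile:
        # Without reconciliation, the incoming batch's schema is taken
        # as-is, so columns dropped from the batch disappear from newly
        # (re)written base files.
        return dict(batch_schema)
    # With reconciliation, the table's columns are preserved (dropped
    # columns stay in the writer schema) and new columns are appended.
    merged = dict(table_schema)
    for name, ftype in batch_schema.items():
        merged.setdefault(name, ftype)
    return merged

table = {"id": "long", "name": "string", "email": "string"}
batch = {"id": "long", "name": "string", "age": "int"}  # drops 'email', adds 'age'

print(reconcile_writer_schema(table, batch, reconcile=False))
# {'id': 'long', 'name': 'string', 'age': 'int'}  -> 'email' is gone
print(reconcile_writer_schema(table, batch, reconcile=True))
# {'id': 'long', 'name': 'string', 'email': 'string', 'age': 'int'}
```

This is why flipping the config to `true` by default is the safer posture: a column accidentally missing from one upstream batch gets carried forward from the table schema instead of silently vanishing from new base files.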
Contributor's checklist
- [ ] Read through contributor's guide
- [ ] Change Logs and Impact were stated clearly
- [ ] Adequate tests were added if applicable
- [ ] CI passed
As outlined in https://github.com/apache/hudi/pull/6196#discussion_r961984500, this PR should go hand in hand w/ the https://github.com/apache/hudi/pull/6196, which flips Schema Reconciliation to be enabled by default (entailing that every incoming batch would be reconciled relative to the table's schema)
@alexeykudinkin could you check the CI failure?
@hudi-bot run azure
@xiarixiaoyao: can you review the patch as well? Some of the code that you have touched is being updated in this patch. Would be good to get it reviewed by you.
@nsivabalan @alexeykudinkin will review this PR today, thanks
Had to disable `TestCleaner` as it's consistently failing due to HDFS cluster issues. I ran it locally and all tests pass.
@alexeykudinkin this was landed https://github.com/apache/hudi/commit/0e1f9653c0e73287527ae62f75ba6e679cf0c1da
@xushiyan I rebased on the latest master yesterday, and it was still failing (other tests started to fail). At this point I think we should offboard the whole of `TestCleaner` off HDFS.
CI report:
- 288d166c49602a4593b1e97763a467811903737d UNKNOWN
- 8a37a64610ed23294dc48570bbda72aeb0bb00ea UNKNOWN
- 94c53ba11e5cc8ec4ee40f9f46a553b875ab0d90 Azure: FAILURE
Bot commands
@hudi-bot supports the following commands:
- `@hudi-bot run azure`: re-run the last Azure build
CI is green:

Excuse me, I hit this error: https://github.com/apache/hudi/issues/8904
I use Hudi 0.13.1 and set `hoodie.datasource.write.reconcile.schema=true`, but the spark-sql query with Hudi still fails:
Caused by: org.apache.hudi.exception.HoodieException: Exception when reading log file
at org.apache.hudi.common.table.log.AbstractHoodieLogRecordReader.scanInternalV1(AbstractHoodieLogRecordReader.java:374)
at org.apache.hudi.common.table.log.AbstractHoodieLogRecordReader.scanInternal(AbstractHoodieLogRecordReader.java:223)
at org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner.performScan(HoodieMergedLogRecordScanner.java:198)
at org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner.