[HUDI-4898] Presto/Hive: respect payload class when merging parquet file and log file while reading MOR tables
Change Logs
- Presto/Hive now respect the configured payload class when merging the parquet base file and the log file while reading a MOR table.
- Presto/Hive now support reading the timestamp type for MOR tables.
Impact
Risk level: high. This change fixes the TODO at line 115 of RealtimeCompactedRecordReader:

// TODO(NA): Invoke preCombine here by converting arrayWritable to Avro. This is required since the
// deltaRecord may not be a full record and needs values of columns from the parquet
Steps to reproduce
```scala
spark.sql(
  """create table tx_null
    |(id int, comb int, col0 int, col1 bigint, col2 float, col3 double, col4 decimal(10,4),
    | col5 string, col6 date, col7 timestamp, col8 boolean, col9 binary, par date)
    | using hudi
    | partitioned by (par)
    | options(
    | type='mor', primaryKey='id', preCombineField='comb',
    | 'hoodie.index.type' = 'BLOOM', 'hoodie.compaction.payload.class'='org.apache.hudi.common.model.OverwriteNonDefaultsWithLatestAvroPayload')""".stripMargin)

spark.sql(
  s"""
     | insert into tx_null values
     | (1,1,99,1111111,101.01,1001.0001,100001.0001,'x000001','2021-12-25','2021-12-25 12:01:01',true,'a01','2021-12-25'),
     | (2,2,99,1111111,102.02,1002.0002,100002.0002,'x000002','2021-12-25','2021-12-25 12:02:02',true,'a02','2021-12-25'),
     | (3,3,99,1111111,103.03,1003.0003,100003.0003,'x000003','2021-12-25','2021-12-25 12:03:03',false,'a03','2021-12-25'),
     | (4,4,99,1111111,104.04,1004.0004,100004.0004,'x000004','2021-12-26','2021-12-26 12:04:04',true,'a04','2021-12-26'),
     | (5,5,99,1111111,105.05,1005.0005,100005.0005,'x000005','2021-12-26','2021-12-26 12:05:05',false,'a05','2021-12-26')
     |""".stripMargin)

spark.sql(
  s"""
     | insert into tx_null values
     | (1,0,null,100002,101.01,1001.0001,100001.0001,'x000001','2021-12-25','2021-12-25 12:01:01',true,'a01','2021-12-25'),
     | (2,1,null,100003,102.02,1002.0002,100002.0002,'x000002','2021-12-25','2021-12-25 12:02:02',true,'a02','2021-12-25'),
     | (3,2,null,100004,103.03,1003.0003,100003.0003,'x000003','2021-12-25','2021-12-25 12:03:03',false,'a03','2021-12-25'),
     | (4,3,null,100005,104.04,1004.0004,100004.0004,'x000004','2021-12-26','2021-12-26 12:04:04',true,'a04','2021-12-26'),
     | (5,4,null,100006,105.05,1005.0005,100005.0005,'x000005','2021-12-26','2021-12-26 12:05:05',false,'a05','2021-12-26')
     |""".stripMargin)
```
Result of `select col0, col1 from tx_null`.

With Spark SQL / Flink:

```
99  100002
99  100003
99  100001
99  100005
99  100004
```

With Presto/Hive, the query result is:

```
+-------+-------+
|  99   | NULL  |
|  99   | NULL  |
|  99   | NULL  |
|  99   | NULL  |
|  99   | NULL  |
+-------+-------+
```
Other payload classes are likewise not respected.
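The divergence above comes from field-level merge semantics. Below is a minimal, hypothetical sketch of an OverwriteNonDefaultsWithLatestAvroPayload-style merge, using plain Java maps in place of Avro records (the class and method names here are illustrative, not the actual Hudi API, and real payloads also treat schema defaults, not just null, as "default"). It shows why a reader that invokes the payload keeps the base-file value of `col0` (99) while taking the newer `col1`:

```java
import java.util.HashMap;
import java.util.Map;

public class NonDefaultsMergeSketch {
    /**
     * Merge the newer (log file) record over the older (parquet base file)
     * record: each newer field wins unless it is null, in which case the
     * older value is kept. This mirrors the "overwrite non-defaults" idea.
     */
    static Map<String, Object> merge(Map<String, Object> older, Map<String, Object> newer) {
        Map<String, Object> out = new HashMap<>(older);
        newer.forEach((k, v) -> {
            if (v != null) {
                out.put(k, v); // non-default delta value overwrites the base value
            }
        });
        return out;
    }

    public static void main(String[] args) {
        // Base file row for id=1 from the first insert.
        Map<String, Object> base = new HashMap<>();
        base.put("col0", 99);
        base.put("col1", 1111111L);

        // Log file row for id=1 from the second insert: col0 is null.
        Map<String, Object> delta = new HashMap<>();
        delta.put("col0", null);
        delta.put("col1", 100002L);

        Map<String, Object> merged = merge(base, delta);
        System.out.println(merged.get("col0") + " " + merged.get("col1")); // prints "99 100002"
    }
}
```

A reader that ignores the payload and blindly takes one side of the merge cannot produce this row, which is why Spark/Flink (which invoke the payload) and Presto/Hive (which did not, before this PR) disagree.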
Contributor's checklist
- [ ] Read through contributor's guide
- [ ] Change Logs and Impact were stated clearly
- [ ] Adequate tests were added if applicable
- [ ] CI passed
Cancelling all Azure CI runs for now to investigate CI flakiness. Will retrigger the build once we are in a stable state. Sorry about the inconvenience.
@codope @danny0405 @xushiyan @XuQianJin-Stars could you please help review this PR, thanks. The UT failure has nothing to do with this PR.
@xiarixiaoyao There are some CI failures. Can you please fix them and rebase?
Will fix the CI, thanks.
@hudi-bot run azure
Canceling the CI run to prioritize release blocker PRs. Apologies. I will re-trigger once the blockers have finished.
@hudi-bot run azure
@hudi-bot run azure
@codope could you please review again? All comments have been addressed, thanks.
@hudi-bot run azure
@hudi-bot run azure