[HUDI-4612][RFC-59] RFC-59 Materials (RFC Proposal) Submission: "Multiple event_time Fields Latest Verification in a Single Table"
Change Logs
According to the Hudi RFC process, once an RFC number has been assigned, proposers should submit their RFC materials as soon as possible. We are very glad to have finished a high-quality proposal in a short time, and we invite anyone interested in this feature to read it.
The RFC proposal can be read at rfc/rfc-59/rfc-59.md
Impact
This only adds a new Markdown file with some pictures in the folder rfc/rfc-59.
No impact on running code, but it will have a BIG IMPACT on brainstorming.
Risk level: None
Contributor's checklist
- [x] Read through contributor's guide
- [x] Change Logs and Impact were stated clearly
- [x] Adequate tests were added if applicable
- [x] CI passed
Hi Sivabalan @nsivabalan, as we said earlier on the dev mailing list, we have finished writing the RFC proposal and are now submitting the materials, with illustrations, for the new feature "Multiple event-time fields latest verification". They cover both the concept and a high-level code implementation.
Could you please spend some time on a brief review? If you don't have time, may I ask you to recommend someone else? Thanks a lot! :)
Wish you all the best. Xinyao
Hi Shiyan @xushiyan, since you gave us positive feedback on the dev mailing list, may I ask you to be our RFC reviewer? If you don't have time, could you please recommend someone else? Thanks a lot! Xinyao
@XinyaoTian Thanks for the RFC submission. Merge logic is going to be well abstracted and custom-implementable once RFC-46 lands (it is very close to landing). I suggest you rethink the problem after the HoodieMerge interface is in place and see whether this combining logic is generic enough to implement within Hudi.
cc @alexeykudinkin
@prasannarajaperumal Hi Prasanna, thanks for your review :) Following your suggestion, I have read the RFC-46 document carefully.
To my understanding, RFC-46 intends to improve the entire design of HoodieRecordPayload, which is great and will bring many benefits. However, it doesn't give Hudi the ability to verify multiple event-time fields in a single table. The new payload design may make this feature easier to implement, but HoodieRecordPayload is only one part of it. What we would like to achieve is to let Hudi JOIN multiple tables in stream-consuming mode without events getting out of order across multiple event-time fields. Therefore, I think we still need to propose this feature, since verifying multiple event-time fields in a single Hudi table really matters (currently we have ONLY one, i.e. precombine.field='ts'; what we want is precombine.field='ts1, ts2, ts3, ts4').
For your convenience, we can wait for RFC-46 to land and then implement the feature proposed in this RFC. I promise this feature is very important, because people ask for it in many places (including Hudi Slack, e.g. Thread, and the dev mailing list, e.g. Discussion) almost every week. We really need MORE THAN ONE event-time field so that we can ensure the accuracy of events even when many JOIN operations sink to ONE Hudi table.
If there's anything worth noting, please contact me! I look forward to your further feedback.
Hi @yihua, thanks for your review. We have almost finished developing this new feature and intend to submit our code. Should we wait for this RFC proposal to land, or can we submit the related code directly? Since @prasannarajaperumal mentioned that there is a new design of the Payload class (RFC-46), we don't know whether we should build our feature on top of it.
The feature we implemented works as shown below. Here is a simple but useful example that illustrates directly what this RFC does.
Suppose we have a table whose configuration contains multiple event-time fields, like this: hoodie.payload.combine.fields=a_ts,b_ts, rather than only the single field currently given by Hudi, hoodie.payload.combine.field=ts.
We query the table and see that it has one record, whose schema is simple: public_id:int, a_info:string, a_ts:int, b_info:string, b_ts:int, pt:string, ts:int (not used)
spark-sql> select * from test_db.hudi_payload_test_03;
20220622111029695 20220622111029695_0_1 public_id:1 pt=DD 214f6985-fee5-4091-a65d-d52e9eb20634-0_0-67-4219_20220622111029695.parquet 1 a_101 101 b_101 101 0 DD
Time taken: 0.858 seconds, Fetched 1 row(s)
We upsert this record with a larger value in the b_ts field, while all fields related to a are null:
INSERT INTO test_db.hudi_payload_test_03
SELECT 1 AS public_id, null AS a_info, null AS a_ts, 'b_105_New_record' AS b_info, 105 AS b_ts, 0 AS ts, 'DD' AS pt;
The result should look like this: only the columns related to b have been updated, and the a columns remain unchanged.
spark-sql> select * from test_db.hudi_payload_test_03;
20220622111939468 20220622111939468_0_1 public_id:1 pt=DD 214f6985-fee5-4091-a65d-d52e9eb20634-0_0-30-2209_20220622111939468.parquet 1 a_101 101 b_105_New_record 105 0 DD
Time taken: 0.496 seconds, Fetched 1 row(s)
If we upsert a smaller value in the b_ts field, nothing happens: neither the null fields nor the fields containing values change the stored record.
INSERT INTO test_db.hudi_payload_test_03
SELECT 1 AS public_id, null AS a_info, null AS a_ts, 'b_99_Some_Record' AS b_info, 99 AS b_ts, 0 AS ts, 'DD' AS pt;
spark-sql> select * from test_db.hudi_payload_test_03;
20220622112743351 20220622112743351_0_1 public_id:1 pt=DD 214f6985-fee5-4091-a65d-d52e9eb20634-0_0-69-4422_20220622112743351.parquet 1 a_101 101 b_105_New_record 105 0 DD
Time taken: 0.501 seconds, Fetched 1 row(s)
With this feature, a Hudi table gains the ability to upsert only part of a record. This means data developers can combine several tables into one table (and keep everything up to date through streaming ingestion), then use only this table for further work such as ML algorithms, AI training, or big-screen visualization. This feature will make many things really fast and simple.
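For readers who prefer code, the behavior in the example above can be sketched in plain Python. This is only an illustration of the semantics, not the implementation in the RFC: the function name `merge_by_event_time`, the `COLUMN_GROUPS` mapping from each event-time field to the columns it governs, and the choice to ignore ties (equal event times) are all assumptions made for this sketch.

```python
from typing import Any, Dict, List

Record = Dict[str, Any]

# Hypothetical grouping: each event-time field governs a set of columns.
# How columns are associated with event-time fields is not stated in this
# thread; this mapping is an assumption for illustration only.
COLUMN_GROUPS: Dict[str, List[str]] = {
    "a_ts": ["a_info"],
    "b_ts": ["b_info"],
}


def merge_by_event_time(
    current: Record,
    incoming: Record,
    groups: Dict[str, List[str]] = COLUMN_GROUPS,
) -> Record:
    """Merge `incoming` into `current`, one event-time group at a time."""
    merged = dict(current)
    for ts_field, columns in groups.items():
        new_ts = incoming.get(ts_field)
        old_ts = current.get(ts_field)
        # A null event time means the incoming record carries no update
        # for this group, so the stored values are kept.
        if new_ts is None:
            continue
        # Assumption: only a strictly larger event time wins; ties keep
        # the stored values.
        if old_ts is not None and new_ts <= old_ts:
            continue
        merged[ts_field] = new_ts
        for col in columns:
            merged[col] = incoming.get(col)
    return merged


# The three states from the Spark SQL example above.
stored = {"public_id": 1, "a_info": "a_101", "a_ts": 101,
          "b_info": "b_101", "b_ts": 101, "ts": 0, "pt": "DD"}
newer_b = {"public_id": 1, "a_info": None, "a_ts": None,
           "b_info": "b_105_New_record", "b_ts": 105, "ts": 0, "pt": "DD"}
older_b = {"public_id": 1, "a_info": None, "a_ts": None,
           "b_info": "b_99_Some_Record", "b_ts": 99, "ts": 0, "pt": "DD"}

after_first = merge_by_event_time(stored, newer_b)
after_second = merge_by_event_time(after_first, older_b)
```

Here `after_first` keeps the a columns and takes the new b columns, while `after_second` is identical to `after_first` because 99 is smaller than the stored b_ts of 105.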
Hope my example is useful for understanding the feature provided by our RFC :) @yihua @prasannarajaperumal @alexeykudinkin
Let's review/revisit this after 1.1 and see if it is still needed. It may not be.