hudi
hudi copied to clipboard
[HUDI-9332] Pluggable Table Format Support with native Integration
Change Logs
[HUDI-9332] Pluggable Table Format Support with native Integration
- Includes base interface for pluggable table format
- Includes native hudi integration for pluggable table format
- Includes the MDT side integration for puggability.
Impact
Initia Implementation of RFC - 93
Risk level (write none, low medium or high below)
none
Documentation Update
none
Contributor's checklist
- [ ] Read through contributor's guide
- [ ] Change Logs and Impact were stated clearly
- [ ] Adequate tests were added if applicable
- [ ] CI passed
I don't understand costs in terms of performance, that we should pay for this change. There are a lot of work that was already done, and there is still a lot to do down the road. From my point of view, performance is a critical parameter for users when we are talking about Big Data processing.
Is there any benchmark results for this change?
Another question, what will happen, if we wrote Hudi metadata successfully, but failed during writing of Iceberg or Delta metadata? How we will process this case, when Hudi table is fine, but interoperability support is broken at some commit?
@hudi-bot run azure
@hudi-bot run azure
@geserdugarov sorry. missed your comments there.. does the RFC help address some of those
CI report:
- 222d7b08a976998e5416ed6bd7c90abb1d2aee0e Azure: SUCCESS
Bot commands
@hudi-bot supports the following commands:@hudi-bot run azurere-run the last Azure build
@geserdugarov sorry. missed your comments there.. does the RFC help address some of those
@vinothchandar, thanks for your reply. Unfortunately, the RFC doesn't address mentioned questions. My main concern is performance, I really hope we won’t lose it after these changes. Since there are no performance tests in the project, we’ll have to verify it manually. Another issue is the increasing code complexity. It keeps growing, and it’s becoming harder and harder to provide a stable solution. I hope some progress will be done under HUDI-9054.
@geserdugarov what is the performance concern that it may induces in this PR? We should definitely fix it IMO, can you guide me the lines in this patch?
In general, all the changes in this path should not bring in performance issues, some hook/empty invocations shoud be fine and the cost could be negligible.
I agree we should limit the code complexity, it's the way we are evolving the project, we welcome new features always if it is valuable, then try to make the code in good shape as much as we can, but yeah, sometimes it is hard to give a elegant solution based on the amout of work it could take.
Appreciate your points.
@geserdugarov what is the performance concern that it may induces in this PR? We should definitely fix it IMO, can you guide me the lines in this patch?
In general, all the changes in this path should not bring in performance issues, some hook/empty invocations shoud be fine and the cost could be negligible.
I agree we should limit the code complexity, it's the way we are evolving the project, we welcome new features always if it is valuable, then try to make the code in good shape as much as we can, but yeah, sometimes it is hard to give a elegant solution based on the amout of work it could take.
Appreciate your points.
@danny0405 , from quick checking of this patch, you're right, looks like there should be no performance issues. I only see that we started to pass HoodieTableMetaClient way more, and it's a Serializable class, which contains a lot of not transient objects. If there are any frequently called place in the code, where we will start to use Kryo SerDe of this class, (which I don't actually see) it could potentially decrease performance. But again, it's just a guessing, and it would be great to have some continuous performance monitoring to be sure in all big changes. I also will try to think about it in my spare time.
looks like there should be no performance issues. I only see that we started to pass
HoodieTableMetaClientway more
@geserdugarov I also noticed this change, in most of the changes, the context already contains a meta client, and we just need it's light-weight member like "table format", the other logic are almost the same before the patch.
it's just a guessing, and it would be great to have some continuous performance monitoring to be sure in all big changes.
+1. this is a great idea in general.. for this PR, we had a bunch of conversations around this. and @bvaradar explicitly assured me and ensured that we ll not incur overheads on native path.
@danny0405 I plugged the one issue I saw around extra timeline calls. CI is green. I'll land this in few hours, until I hear otherwise.