hudi
[HUDI-9527] Switch to HoodieFileGroupReader in HoodieTableMetadataUtil
Change Logs
- Removes usage of `HoodieMergedLogRecordScanner` in `HoodieTableMetadataUtil` and replaces it with the `HoodieFileGroupReader`
- Fixes handling of deleted records when reading as `HoodieRecord`
Impact
- Uses the new standard way of reading file groups (`HoodieFileGroupReader`) in `HoodieTableMetadataUtil`
Risk level (write none, low medium or high below)
Low
Documentation Update
Describe any necessary documentation update if there is any new feature, config, or user-facing change. If not, put "none".
- The config description must be updated if new configs are added or the default value of the configs are changed
- Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the instruction to make changes to the website.
Contributor's checklist
- [x] Read through contributor's guide
- [x] Change Logs and Impact were stated clearly
- [x] Adequate tests were added if applicable
- [x] CI passed
CI report:
- 3b2cb77f4b40bd4c8250c9aa67f3f22bcf76c173 UNKNOWN
- 57e41ca2b93ffa6c0891673e0a1f72f5cc57da3a UNKNOWN
- 2ca6f2d1ef89f973291cbe2372b28647829faff3 UNKNOWN
- 26f86c3b49c18facba107c80d9fafd2e8e82d62d UNKNOWN
- e75a1af8faa1cc6fba9df9c71cc63467c68547ce UNKNOWN
- 1aa2d2fd21199cf18f8b7ea10e7fe229d5d3264e UNKNOWN
- dab35428d0c43f93330d41d0ad21f218de5dba85 Azure: FAILURE
Bot commands
@hudi-bot supports the following commands:
- `@hudi-bot run azure` re-run the last Azure build
@the-other-tim-brown I think the `emitDeletes` support for the HoodieRecord iterator brings in more overhead than I expected. Can we drop it from this PR? Fetching delete keys should only be used in the legacy code path; now that we have streaming writes to the MDT, that code should be removed in the future anyway (once streaming write is stable).
The `emitDeletes` option was introduced mainly for streaming read scenarios with engine-specific rows.
Also, can we move the size estimation changes into a separate PR to make this one easier to review?
@danny0405 I don't understand what you are recommending here. Streaming write to the MDT is only used for Spark and I don't think there are plans to use it for other engines.
Started a new PR with the same end result of moving off deprecated code, but with a smaller changeset: https://github.com/apache/hudi/pull/13470
> Streaming write to the MDT is only used for Spark and I don't think there are plans to use it for other engines.
That's true. For Flink and Java there is not even a solution or plan to support the RLI, which is why I said that code such as constructing the RLI from files should be deemed legacy.