[HUDI-9527] Switch to HoodieFileGroupReader in HoodieTableMetadataUtil

Open the-other-tim-brown opened this issue 5 months ago • 2 comments

Change Logs

  • Removes usage of HoodieMergedLogRecordScanner in HoodieTableMetadataUtil and replaces it with HoodieFileGroupReader
  • Fixes handling of deleted records when reading as HoodieRecord
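
To make the delete-handling point concrete, here is a minimal sketch of merging base-file records with log records, with deletes either dropped or emitted as markers. It is an illustration only, not the Hudi implementation: `MergeSketch`, `Rec`, and the `emitDeletes` parameter of `merge` are hypothetical stand-ins and do not reflect the actual HoodieFileGroupReader API.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified model; none of these types are Hudi classes.
class MergeSketch {

    /** A record identified by its key; a null value marks a delete. */
    static final class Rec {
        final String key;
        final String value; // null means the record was deleted

        Rec(String key, String value) {
            this.key = key;
            this.value = value;
        }

        boolean isDelete() {
            return value == null;
        }
    }

    /**
     * Merges base-file records with log records; the latest log entry wins per key.
     * When emitDeletes is false, deleted keys are dropped from the output.
     * When emitDeletes is true, deleted keys are returned as delete markers.
     */
    static List<Rec> merge(List<Rec> base, List<Rec> log, boolean emitDeletes) {
        Map<String, Rec> merged = new LinkedHashMap<>();
        for (Rec r : base) {
            merged.put(r.key, r);
        }
        for (Rec r : log) {
            merged.put(r.key, r); // an update or a delete overrides the base record
        }
        List<Rec> out = new ArrayList<>();
        for (Rec r : merged.values()) {
            if (!r.isDelete() || emitDeletes) {
                out.add(r);
            }
        }
        return out;
    }
}
```

In this sketch, a caller that only needs live records passes `emitDeletes = false`, while a caller that must learn which keys were deleted passes `true` and checks `isDelete()` on each result.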

Impact

  • Uses HoodieFileGroupReader, the new standard way of reading, in place of the older log record scanner

Risk level (write none, low medium or high below)

Low

Documentation Update

Describe any necessary documentation update if there is any new feature, config, or user-facing change. If not, put "none".

  • The config description must be updated if new configs are added or the default value of the configs are changed
  • Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the instruction to make changes to the website.

Contributor's checklist

  • [x] Read through contributor's guide
  • [x] Change Logs and Impact were stated clearly
  • [x] Adequate tests were added if applicable
  • [x] CI passed

the-other-tim-brown, Jun 16 '25 16:06

CI report:

  • 3b2cb77f4b40bd4c8250c9aa67f3f22bcf76c173 UNKNOWN
  • 57e41ca2b93ffa6c0891673e0a1f72f5cc57da3a UNKNOWN
  • 2ca6f2d1ef89f973291cbe2372b28647829faff3 UNKNOWN
  • 26f86c3b49c18facba107c80d9fafd2e8e82d62d UNKNOWN
  • e75a1af8faa1cc6fba9df9c71cc63467c68547ce UNKNOWN
  • 1aa2d2fd21199cf18f8b7ea10e7fe229d5d3264e UNKNOWN
  • dab35428d0c43f93330d41d0ad21f218de5dba85 Azure: FAILURE

Bot commands

@hudi-bot supports the following commands:

  • @hudi-bot run azure: re-run the last Azure build

hudi-bot, Jun 18 '25 23:06

@the-other-tim-brown I think the emitDeletes support for the HoodieRecord iterator brings in more overhead than I expected. Can we drop it from this PR? Fetching delete keys should only be needed in the legacy code path. Now that we have streaming write to the MDT, that code should be removed in the future anyway (once the streaming write is stable).

emitDeletes was introduced mainly for streaming read scenarios with engine-specific rows.

Also, can we move the size estimation changes into a separate PR to make this one easier to review?

danny0405, Jun 19 '25 01:06

@danny0405 I don't understand what you are recommending here. Streaming write to the MDT is only used for Spark and I don't think there are plans to use it for other engines.

the-other-tim-brown, Jun 20 '25 16:06

Started a new PR with the same end result of moving off deprecated code, but with a smaller changeset: https://github.com/apache/hudi/pull/13470

the-other-tim-brown, Jun 20 '25 19:06

Streaming write to the MDT is only used for Spark and I don't think there are plans to use it for other engines.

That's true. For Flink and Java, there is not even a solution or plan to support the RLI, which is why I said that code such as constructing the RLI from files should be deemed legacy.

danny0405, Jun 21 '25 01:06