
[HUDI-9340] Add MDT streaming write support for secondary index

Open lokeshj1703 opened this issue 5 months ago • 2 comments

Change Logs

This PR adds streaming write support for the secondary index in the metadata table (MDT). Secondary index updates are generated in the write handles (HoodieCreateHandle, HoodieAppendHandle and HoodieMergeHandle) and returned in each handle's write status. The metadata writer reads these updates from the write status and applies them to the corresponding secondary index partition.

The PR reuses the existing utility methods created for secondary index generation. Each handle produces secondary index stats as follows:

  • HoodieCreateHandle: creates a secondary index stat for every record written to the file.
  • HoodieMergeHandle: creates a secondary index stat for every record by comparing it with the previous version of the record in the file.
  • HoodieAppendHandle: creates secondary index stats by reading the file slice without the new log files and comparing the result with the file slice read with the new log files written by the handle.
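The per-handle behaviour described above can be pictured as a before/after comparison of record key to secondary key mappings. The following is a minimal sketch of that idea only; `SecondaryIndexDiffSketch`, `IndexStat`, `diff` and `forCreate` are hypothetical names chosen for illustration and are not the classes or utility methods used in the PR.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Objects;

// Hypothetical illustration; these are not Hudi's actual classes.
final class SecondaryIndexDiffSketch {

  /** One secondary index update: a (secondaryKey, recordKey) mapping is added or removed. */
  static final class IndexStat {
    final String recordKey;
    final String secondaryKey;
    final boolean isDeleted;

    IndexStat(String recordKey, String secondaryKey, boolean isDeleted) {
      this.recordKey = recordKey;
      this.secondaryKey = secondaryKey;
      this.isDeleted = isDeleted;
    }

    @Override
    public String toString() {
      return (isDeleted ? "DELETE " : "INSERT ") + secondaryKey + " -> " + recordKey;
    }
  }

  /**
   * Compares two views of the same file slice (record key -> secondary key).
   * Assumes map values are non-null; a record missing from a view has no entry.
   * Merge-style: the views are the previous and new versions of each record.
   * Append-style: the views are the slice without and with the new log files.
   */
  static List<IndexStat> diff(Map<String, String> before, Map<String, String> after) {
    List<IndexStat> stats = new ArrayList<>();
    // Old mappings that changed or disappeared must be removed from the index.
    for (Map.Entry<String, String> e : before.entrySet()) {
      if (!Objects.equals(e.getValue(), after.get(e.getKey()))) {
        stats.add(new IndexStat(e.getKey(), e.getValue(), true));
      }
    }
    // New or changed mappings must be added to the index.
    for (Map.Entry<String, String> e : after.entrySet()) {
      if (!Objects.equals(e.getValue(), before.get(e.getKey()))) {
        stats.add(new IndexStat(e.getKey(), e.getValue(), false));
      }
    }
    return stats;
  }

  /** Create-style: there is no previous version, so every written record is an insert. */
  static List<IndexStat> forCreate(Map<String, String> written) {
    return diff(Collections.<String, String>emptyMap(), written);
  }
}
```

In this sketch an unchanged secondary key produces no stat at all, which keeps the updates handed to the metadata writer proportional to the number of records whose secondary key actually changed.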

To ensure secondary index stats are not computed for compaction and clustering, the corresponding handles were changed to skip secondary index computation. The affected classes are SparkSingleFileSortExecutionStrategy, SparkSortAndSizeExecutionStrategy, JavaSortAndSizeExecutionStrategy and HoodieSparkFileGroupReaderBasedMergeHandle. The first three are used for clustering execution; the PR ensures they instantiate the create handle with secondary index stats disabled. The same is done for HoodieSparkFileGroupReaderBasedMergeHandle, which is used for compaction.
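One way to picture the "stats disabled" instantiation is a flag passed to the handle at construction time. This is a sketch under that assumption; `StatsToggleSketch`, `CreateHandleSketch`, `forIngestion` and `forClustering` are hypothetical names, and the actual constructor parameters used in the PR may differ.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; these are not Hudi's actual classes or signatures.
final class StatsToggleSketch {

  /** Stand-in for a create handle whose secondary index tracking can be switched off. */
  static final class CreateHandleSketch {
    private final boolean trackSecondaryIndexStats;
    private final List<String> secondaryIndexStats = new ArrayList<>();
    private int recordsWritten = 0;

    CreateHandleSketch(boolean trackSecondaryIndexStats) {
      this.trackSecondaryIndexStats = trackSecondaryIndexStats;
    }

    void write(String recordKey, String secondaryKey) {
      recordsWritten++; // the record itself is always written
      if (trackSecondaryIndexStats) {
        // Only regular writes contribute updates to the secondary index partition.
        secondaryIndexStats.add(secondaryKey + " -> " + recordKey);
      }
    }

    List<String> getSecondaryIndexStats() {
      return secondaryIndexStats;
    }

    int getRecordsWritten() {
      return recordsWritten;
    }
  }

  /** Regular ingestion path: stats are collected. */
  static CreateHandleSketch forIngestion() {
    return new CreateHandleSketch(true);
  }

  /** Clustering-style path: same handle, stats disabled. */
  static CreateHandleSketch forClustering() {
    return new CreateHandleSketch(false);
  }
}
```

Keeping the decision at construction time means the write path itself stays identical for ingestion and table services; only the caller decides whether stats are collected.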

Impact

Adds streaming support for secondary index metadata partition

Risk level (write none, low, medium or high below)

low

Documentation Update

NA

Contributor's checklist

  • [ ] Read through contributor's guide
  • [ ] Change Logs and Impact were stated clearly
  • [ ] Adequate tests were added if applicable
  • [ ] CI passed

lokeshj1703 avatar Jun 17 '25 09:06 lokeshj1703

and do follow up on CI failures @lokeshj1703

nsivabalan avatar Jun 18 '25 02:06 nsivabalan

CI report:

  • 92f9f740ce9a33ed3c2ddd36f6903dd4aa9545b3 UNKNOWN
  • 41116736014cb24859320394681538181ddabccc UNKNOWN
  • f2025e07572e7635e4cebbde68eca6f5c7af7d69 Azure: SUCCESS

Bot commands

@hudi-bot supports the following commands:

  • @hudi-bot run azure: re-run the last Azure build

hudi-bot avatar Jun 25 '25 05:06 hudi-bot