hudi
[HUDI-9340] Add MDT streaming write support for secondary index
Change Logs
This PR adds streaming write support for the secondary index. Secondary index updates are now generated inside the write handles (CreateHandle, AppendHandle and MergeHandle). The metadata writer fetches these updates from the write status returned by each handle and uses them to update the corresponding secondary index partition.
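The flow from handles to the metadata writer can be pictured with a minimal sketch. All class, method and partition names below (`IndexStatSketch`, `WriteStatusSketch`, `MetadataWriterSketch`, `idx_city`) are hypothetical stand-ins for illustration, not Hudi's actual classes or signatures. The sketch only assumes that each write status carries stats grouped by index partition and that the metadata writer merges them per partition.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for one secondary index update; NOT a Hudi class.
final class IndexStatSketch {
  final String recordKey;
  final String secondaryKey;
  final boolean deleted;

  IndexStatSketch(String recordKey, String secondaryKey, boolean deleted) {
    this.recordKey = recordKey;
    this.secondaryKey = secondaryKey;
    this.deleted = deleted;
  }
}

// Hypothetical write status: what a handle returns, with stats grouped
// by the secondary index partition they belong to.
final class WriteStatusSketch {
  final Map<String, List<IndexStatSketch>> statsByIndexPartition = new LinkedHashMap<>();

  void addStat(String indexPartition, IndexStatSketch stat) {
    statsByIndexPartition.computeIfAbsent(indexPartition, k -> new ArrayList<>()).add(stat);
  }
}

// Hypothetical metadata-writer side: merge the stats of all write statuses
// so that each index partition can be updated in one pass.
final class MetadataWriterSketch {
  static Map<String, List<IndexStatSketch>> collect(List<WriteStatusSketch> statuses) {
    Map<String, List<IndexStatSketch>> merged = new LinkedHashMap<>();
    for (WriteStatusSketch status : statuses) {
      for (Map.Entry<String, List<IndexStatSketch>> e : status.statsByIndexPartition.entrySet()) {
        merged.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).addAll(e.getValue());
      }
    }
    return merged;
  }
}
```

The point of the design is that the index updates travel with the write status, so the metadata writer does not need to re-read the data files to compute them.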
This PR reuses the existing utility methods created for secondary index generation. Each handle produces secondary index stats as follows:
- HoodieCreateHandle: creates a secondary index stat for every record written to the file.
- HoodieMergeHandle: creates a secondary index stat for every record by comparing it with the previous version of the record in the file.
- HoodieAppendHandle: creates secondary index stats by reading the file slice without the new log files and comparing the result with the file slice read with the new log files written by the handle.
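The comparison step described for the merge and append handles can be sketched as a diff between the old and new secondary key of a record. This is a minimal illustration under assumptions, not the PR's code: the names `SecondaryKeyDiffSketch`, `Stat` and `diff` are hypothetical, and `null` is assumed to mean "no previous version" (as on the create path) or "record deleted".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;

// Hypothetical sketch; NOT Hudi's actual utility methods.
final class SecondaryKeyDiffSketch {

  // One secondary index entry to add (deleted = false) or remove (deleted = true).
  static final class Stat {
    final String recordKey;
    final String secondaryKey;
    final boolean deleted;

    Stat(String recordKey, String secondaryKey, boolean deleted) {
      this.recordKey = recordKey;
      this.secondaryKey = secondaryKey;
      this.deleted = deleted;
    }
  }

  // Compare the previous and new secondary key of one record and emit
  // the index updates needed to move from the old state to the new one.
  static List<Stat> diff(String recordKey, String oldSecondaryKey, String newSecondaryKey) {
    List<Stat> stats = new ArrayList<>();
    if (Objects.equals(oldSecondaryKey, newSecondaryKey)) {
      return stats; // unchanged, or absent on both sides: nothing to update
    }
    if (oldSecondaryKey != null) {
      stats.add(new Stat(recordKey, oldSecondaryKey, true)); // drop the stale entry
    }
    if (newSecondaryKey != null) {
      stats.add(new Stat(recordKey, newSecondaryKey, false)); // add the new entry
    }
    return stats;
  }
}
```

Under these assumptions, a newly created record yields one insert stat, an update that changes the secondary key yields a delete followed by an insert, and an update that leaves the key unchanged yields nothing.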
To ensure secondary index stats are not computed for compaction and clustering, the corresponding handles now skip secondary index computation. The affected classes are SparkSingleFileSortExecutionStrategy, SparkSortAndSizeExecutionStrategy, JavaSortAndSizeExecutionStrategy and HoodieSparkFileGroupReaderBasedMergeHandle. The first three are used for clustering execution; this PR makes them instantiate the create handle with secondary index stats disabled. HoodieSparkFileGroupReaderBasedMergeHandle, which is used for compaction, is treated the same way.
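One way to picture the opt-out is a flag passed when the handle is constructed. The sketch below is an assumption about the shape of the change, not the PR's code: `CreateHandleSketch`, `trackSecondaryIndexStats`, `forIngestion` and `forClustering` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a create handle with an opt-out flag; NOT Hudi's class.
final class CreateHandleSketch {
  private final boolean trackSecondaryIndexStats; // hypothetical flag name
  private final List<String> secondaryIndexStats = new ArrayList<>();
  private int recordsWritten = 0;

  CreateHandleSketch(boolean trackSecondaryIndexStats) {
    this.trackSecondaryIndexStats = trackSecondaryIndexStats;
  }

  // Regular ingestion path: stats are collected for the metadata writer.
  static CreateHandleSketch forIngestion() {
    return new CreateHandleSketch(true);
  }

  // Clustering path: the record is still written, but no stats are produced.
  static CreateHandleSketch forClustering() {
    return new CreateHandleSketch(false);
  }

  void write(String recordKey, String secondaryKey) {
    recordsWritten++;
    if (trackSecondaryIndexStats) {
      secondaryIndexStats.add(recordKey + "->" + secondaryKey);
    }
  }

  int recordsWritten() {
    return recordsWritten;
  }

  List<String> secondaryIndexStats() {
    return secondaryIndexStats;
  }
}
```

With this shape, the decision is made once by the caller (for example, a clustering execution strategy), and the write path itself stays identical for both cases.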
Impact
Adds streaming write support for the secondary index metadata partition.
Risk level (write none, low, medium or high below)
low
Documentation Update
NA
Contributor's checklist
- [ ] Read through contributor's guide
- [ ] Change Logs and Impact were stated clearly
- [ ] Adequate tests were added if applicable
- [ ] CI passed
and do follow up on CI failures @lokeshj1703
CI report:
- 92f9f740ce9a33ed3c2ddd36f6903dd4aa9545b3 UNKNOWN
- 41116736014cb24859320394681538181ddabccc UNKNOWN
- f2025e07572e7635e4cebbde68eca6f5c7af7d69 Azure: SUCCESS
Bot commands
@hudi-bot supports the following commands:
- @hudi-bot run azure: re-run the last Azure build