[HUDI-7384] Secondary index support
Change Logs
Initial commit. Supports the following features:
- Modify schema to add secondary index to metadata
- New partition type in the metadata table to store secondary_keys-to-record_keys mapping
- Various options to support secondary index enablement, column mappings (for secondary keys), etc.
- Initialization of secondary keys
- Update secondary keys on inserts/upserts/deletes
- Add hooks in HoodieFileIndex to prune candidate files (to scan) based on secondary key column filters.
- Add ability in HoodieMergedLogRecordScanner to buffer non-unique key (i.e. secondary key) records and merge 'similar' records
- Add support for merging secondary index records (from delta log files and base files)
- Ability to merge secondary index records across a group of log-files and across log-file/base-file
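To illustrate the merge behaviour described above, here is a minimal Python sketch. It is not the code in this PR: `IndexRecord`, `merge_index_records`, and the tombstone flag `is_deleted` are hypothetical names, and the sketch assumes each index record carries one (secondary key, record key) pair and that later log files override earlier entries for the same pair.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Set


class IndexRecord(NamedTuple):
    """Hypothetical secondary index record: one secondary-key/record-key pair."""
    secondary_key: str
    record_key: str
    is_deleted: bool = False  # assumed tombstone marker for removed mappings


def merge_index_records(
    base: Iterable[IndexRecord],
    log_files: List[Iterable[IndexRecord]],
) -> Dict[str, Set[str]]:
    """Merge base-file records with log-file records, oldest to newest."""
    # Track liveness per (secondary key, record key) pair, because a
    # secondary key is non-unique and may map to many record keys.
    live: Dict[tuple, bool] = {}
    for rec in base:
        live[(rec.secondary_key, rec.record_key)] = not rec.is_deleted
    for log in log_files:
        for rec in log:
            # A later entry for the same pair replaces the earlier one.
            live[(rec.secondary_key, rec.record_key)] = not rec.is_deleted

    merged: Dict[str, Set[str]] = defaultdict(set)
    for (secondary_key, record_key), alive in live.items():
        if alive:
            merged[secondary_key].add(record_key)
    return dict(merged)


base_file = [
    IndexRecord("SF", "r1"),
    IndexRecord("SF", "r2"),
    IndexRecord("NY", "r3"),
]
# Log 1: r2 is upserted and its secondary key changes from SF to NY.
log_1 = [IndexRecord("SF", "r2", is_deleted=True), IndexRecord("NY", "r2")]
# Log 2: r3 is deleted.
log_2 = [IndexRecord("NY", "r3", is_deleted=True)]

merged_index = merge_index_records(base_file, [log_1, log_2])
```

Keying the buffer on the pair rather than on the secondary key alone is what keeps several record keys under one secondary key from overwriting each other during the merge.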
Limitations:
- Supports only one secondary index at the moment.
- Scanning of the secondary index partition is done sequentially (both on the query side and the index-maintenance side)
Pending items:
- Integrate with compaction
- Handle rollback
- Clean up existing tests and add more
Impact
Support secondary index on columns (similar to record index, but for non-unique columns)
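As a rough illustration of how such an index can narrow a scan, the Python sketch below resolves a secondary key filter to candidate files. It is an assumption-laden sketch, not this PR's API: `prune_candidate_files` is a hypothetical helper, the secondary index is modelled as a plain dict from secondary key to record keys, and record keys are assumed to be resolved to files through a separate record-key-to-file mapping (such as the record index).

```python
from typing import Dict, List, Set


def prune_candidate_files(
    filter_value: str,
    secondary_index: Dict[str, Set[str]],
    record_index: Dict[str, str],
    all_files: List[str],
) -> List[str]:
    """Return only the files that may hold rows matching the filter."""
    # Step 1: a non-unique secondary key maps to zero or more record keys.
    record_keys = secondary_index.get(filter_value, set())
    # Step 2: each record key maps to exactly one file (assumed lookup).
    matching = {record_index[rk] for rk in record_keys if rk in record_index}
    # Step 3: keep the original file order, dropping files with no match.
    return [f for f in all_files if f in matching]


secondary_index = {"SF": {"r1", "r4"}, "NY": {"r2"}}
record_index = {"r1": "file-a", "r2": "file-b", "r4": "file-c"}
all_files = ["file-a", "file-b", "file-c"]

sf_files = prune_candidate_files("SF", secondary_index, record_index, all_files)
la_files = prune_candidate_files("LA", secondary_index, record_index, all_files)
```

Unlike a record index lookup, which yields at most one file per key, a secondary key lookup can fan out to several files, so the result is a set of candidates rather than a single location.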
Risk level (write none, low, medium or high below)
Medium. Covered by new and existing tests.
Documentation Update
NA. Will be done later
Contributor's checklist
- [ ] Read through contributor's guide
- [ ] Change Logs and Impact were stated clearly
- [ ] Adequate tests were added if applicable
- [ ] CI passed
Rebased and resolved conflicts. Fixed a bug related to MOR tables with a secondary index.
Moved away from using HoodieUnMergedLogRecordScanner. Added a new buffer in HoodieMergedLogRecordScanner (based on SpillableDiskMap) to handle non-unique keys (secondary keys).
CI report:
- 32d4469a9f0f7aaa87b510a936a5b74c3d734711 Azure: FAILURE
Bot commands
@hudi-bot supports the following commands:
- @hudi-bot run azure: re-run the last Azure build
Hi @bhat-vinay! Is this design of a secondary index through the MDT the only one to be implemented, or are there plans to add other index types? As I remember, there was an RFC for a Lucene index, and maybe some other types will come in the future?
Please get in touch with @codope for the latest update on this. AFAIK, a Lucene-based secondary index is not planned at this time, and the MDT-based secondary index is the one being developed.
Hi @skyshineb, we do plan to add more index types. If you are interested in contributing to a Lucene-based secondary index, I can help you get started with the multi-modal indexing framework.
Hi @codope! I planned to test this MDT implementation and the Lucene one (which I took from a previous SI attempt and finished myself), and figure out whether it is worthwhile to use Lucene or not. But why did this PR get closed?