[HUDI-7384] Secondary index support

Open bhat-vinay opened this issue 1 year ago • 3 comments

Initial commit. Supports the following features:

  1. Modify schema to add secondary index to metadata
  2. New partition type in the metadata table to store secondary_keys-to-record_keys mapping
  3. Various options to support secondary index enablement, column mappings (for secondary keys), etc.
  4. Initialization of secondary keys
  5. Update secondary keys on inserts/upserts/deletes
  6. Add hooks in HoodieFileIndex to prune candidate files (to scan) based on secondary key column filters.
  7. Add ability in HoodieMergedLogRecordScanner to buffer non-unique key (i.e., secondary key) records and merge 'similar' records
  8. Add support for merging secondary index records (from delta log files and base files)
  9. Ability to merge secondary index records across a group of log-files and across log-file/base-file
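
To make items 2, 5 and 6 above more concrete, here is a minimal, hypothetical sketch of the idea: a one-to-many mapping from secondary keys to record keys, combined with a record-key-to-file lookup, narrows down the files a query has to scan. This is not Hudi code or Hudi's API. `SecondaryIndexSketch`, `record_locations`, and the file names are illustrative assumptions, and the in-memory dictionaries only stand in for metadata table partitions.

```python
from collections import defaultdict


class SecondaryIndexSketch:
    """Toy in-memory model of a secondary index (illustrative, not Hudi code)."""

    def __init__(self):
        # secondary key -> set of record keys (non-unique, so one-to-many)
        self.secondary_to_record_keys = defaultdict(set)
        # record key -> (secondary key, file name); stands in for a record-level index
        self.record_locations = {}

    def upsert(self, record_key, secondary_key, file_name):
        # If the record already exists, drop its stale secondary-key mapping first.
        old = self.record_locations.get(record_key)
        if old is not None:
            self._unlink(record_key, old[0])
        self.secondary_to_record_keys[secondary_key].add(record_key)
        self.record_locations[record_key] = (secondary_key, file_name)

    def delete(self, record_key):
        old = self.record_locations.pop(record_key, None)
        if old is not None:
            self._unlink(record_key, old[0])

    def _unlink(self, record_key, secondary_key):
        keys = self.secondary_to_record_keys.get(secondary_key)
        if keys is None:
            return
        keys.discard(record_key)
        if not keys:
            # Remove empty entries so lookups do not see dangling secondary keys.
            del self.secondary_to_record_keys[secondary_key]

    def candidate_files(self, secondary_key):
        # Files that may contain rows matching `column = secondary_key`.
        record_keys = self.secondary_to_record_keys.get(secondary_key, set())
        return sorted({self.record_locations[k][1] for k in record_keys})


index = SecondaryIndexSketch()
index.upsert("r1", "nyc", "file-a.parquet")
index.upsert("r2", "nyc", "file-b.parquet")
index.upsert("r3", "sf", "file-a.parquet")
index.upsert("r2", "sf", "file-b.parquet")  # r2's secondary key changes: nyc -> sf
index.delete("r3")

print(index.candidate_files("nyc"))  # ['file-a.parquet']
print(index.candidate_files("sf"))   # ['file-b.parquet']
```

In this sketch an upsert that changes a record's secondary key must remove the old mapping as well as add the new one, which is why updates and deletes need explicit handling in the index.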

Limitations:

  1. Supports only one secondary index at the moment.
  2. Scanning of the secondary index partition is done sequentially (both on the query side and the index-maintenance side)

Pending items:

  1. Integrate with compaction
  2. Handle rollback
  3. Clean up existing tests and add more

Impact

Support secondary index on columns (similar to record index, but for non-unique columns)

Risk level (write none, low, medium or high below)

Medium. Covered by new and existing tests.

Documentation Update

NA. Will be done later

Contributor's checklist

  • [ ] Read through contributor's guide
  • [ ] Change Logs and Impact were stated clearly
  • [ ] Adequate tests were added if applicable
  • [ ] CI passed

bhat-vinay, Feb 05 '24 15:02

Rebase and resolve conflicts. Fix a bug related to MOR tables with secondary index.

bhat-vinay, Feb 22 '24 06:02

Moved away from using HoodieUnMergedLogRecordScanner. Added a new buffer in HoodieMergedLogRecordScanner (based on SpillableDiskMap) to handle non-unique keys (secondary keys).

bhat-vinay, Feb 22 '24 19:02
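
As a rough illustration of what buffering non-unique keys involves, the sketch below groups index records by secondary key and keeps the latest entry per record key, so that log-file entries (including deletes) override base-file entries. It rests on assumptions not stated in the thread: that "similar" records are those sharing the same (secondary key, record key) pair, and that later entries win. `IndexRecord` and `merge_secondary_index_records` are hypothetical names, and a plain dict stands in for the spillable map mentioned in the comment.

```python
from collections import defaultdict
from typing import NamedTuple


class IndexRecord(NamedTuple):
    """One secondary-index entry (hypothetical shape, not Hudi's record format)."""
    secondary_key: str
    record_key: str
    is_deleted: bool = False


def merge_secondary_index_records(base_records, log_records):
    """Merge base-file and log-file index entries; later entries win."""
    # Buffer keyed by the non-unique secondary key. Each bucket maps
    # record key -> latest entry, so one secondary key can hold many records.
    buffer = defaultdict(dict)
    for record in list(base_records) + list(log_records):
        buffer[record.secondary_key][record.record_key] = record

    merged = {}
    for secondary_key, by_record_key in buffer.items():
        # Drop entries whose latest version is a delete marker.
        live = sorted(k for k, r in by_record_key.items() if not r.is_deleted)
        if live:
            merged[secondary_key] = live
    return merged


base = [IndexRecord("nyc", "r1"), IndexRecord("nyc", "r2")]
logs = [
    IndexRecord("nyc", "r2", is_deleted=True),  # r2 no longer maps to "nyc"
    IndexRecord("sf", "r2"),                    # r2 now maps to "sf"
    IndexRecord("nyc", "r3"),                   # new record under "nyc"
]

merged = merge_secondary_index_records(base, logs)
print(merged)  # {'nyc': ['r1', 'r3'], 'sf': ['r2']}
```

A scanner that keeps only one record per key would lose `r1` or `r3` under `"nyc"` here, which is why a separate buffer keyed by the non-unique secondary key is needed.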

CI report:

  • 32d4469a9f0f7aaa87b510a936a5b74c3d734711 Azure: FAILURE
Bot commands

@hudi-bot supports the following commands:
  • @hudi-bot run azure: re-run the last Azure build

hudi-bot, Mar 05 '24 12:03

Hi @bhat-vinay! Is this design of a secondary index through MDT the only one to be implemented, or are there plans to add other index types? As I remember, there was an RFC for a Lucene index, and maybe some other types in the future?

skyshineb, Jun 08 '24 10:06

> Hi @bhat-vinay! Is this design of a secondary index through MDT the only one to be implemented, or are there plans to add other index types? As I remember, there was an RFC for a Lucene index, and maybe some other types in the future?

Please get in touch with @codope for the latest update on this. AFAIK, a Lucene-based secondary index is not planned at this time, and the MDT-based secondary index is the one being developed.

bhat-vinay, Jun 11 '24 13:06

Hi @skyshineb, we do plan to add more index types. If you are interested in contributing to a Lucene-based secondary index, I can help you get started with the multi-modal indexing framework.

codope, Jun 11 '24 17:06

Hi @codope! I planned to test this MDT implementation and the Lucene one (which I took from the previous SI attempt and finished myself), and figure out whether it is worthwhile to use Lucene. But why did this PR get closed?

skyshineb, Jun 23 '24 09:06