[HUDI-4373] Flink Consistent hashing bucket index write path code

Open YuweiXiao opened this issue 3 years ago • 1 comments

Change Logs

Implement the consistent hashing bucket index for Flink. This PR covers only the core write path of the index; the resizing implementation will follow in a separate PR.

There are three main changes:

  • Extract the common code of the consistent hashing bucket index so that it serves both the Spark and Flink engines.
  • Adapt the Flink write path to the consistent hashing bucket index, e.g., by introducing ConsistentBucketStreamWriteOperator.
  • Introduce the basic UpdateStrategy framework for Flink to handle conflicts between concurrent clustering and updates.
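
For readers unfamiliar with the idea, the lookup behind a consistent hashing bucket index can be sketched as below. This is a minimal illustration, not the code in this PR: the class name, the 16-bit hash space, the CRC32 hash, and the `bucket-N` ids are all assumptions made for the example. Each bucket owns a contiguous hash range, and a record key is routed to the bucket whose range contains its hash. The `split` method only shows why resizing moves few records; resizing itself is out of scope for this PR.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.TreeMap;
import java.util.zip.CRC32;

/** Illustrative sketch only; names and constants are hypothetical, not Hudi's classes. */
final class ConsistentHashRingSketch {
    /** Assumed size of the hash space for this example. */
    static final int HASH_SPACE = 1 << 16;

    /** Maps the inclusive upper bound of each hash range to a bucket id. */
    private final TreeMap<Integer, String> ring = new TreeMap<>();

    ConsistentHashRingSketch(int initialBuckets) {
        int step = HASH_SPACE / initialBuckets;
        for (int i = 1; i <= initialBuckets; i++) {
            // The last bucket always ends at the top of the hash space.
            int upper = (i == initialBuckets) ? HASH_SPACE - 1 : i * step - 1;
            ring.put(upper, "bucket-" + i);
        }
    }

    /** Hashes a record key into [0, HASH_SPACE). CRC32 is an arbitrary choice here. */
    static int hash(String recordKey) {
        CRC32 crc = new CRC32();
        crc.update(recordKey.getBytes(StandardCharsets.UTF_8));
        return (int) (crc.getValue() % HASH_SPACE);
    }

    String bucketFor(String recordKey) {
        return bucketForHash(hash(recordKey));
    }

    /** Finds the bucket whose range upper bound is the smallest one >= h. */
    String bucketForHash(int h) {
        Map.Entry<Integer, String> owner = ring.ceilingEntry(h);
        return owner.getValue();
    }

    /**
     * Splits the range containing {@code at}: hashes from the range's lower end
     * up to {@code at} move to the new bucket; every other range is untouched.
     */
    void split(int at, String newBucketId) {
        if (at < 0 || at >= HASH_SPACE || ring.containsKey(at)) {
            throw new IllegalArgumentException("invalid split point: " + at);
        }
        ring.put(at, newBucketId);
    }

    public static void main(String[] args) {
        ConsistentHashRingSketch ring = new ConsistentHashRingSketch(4);
        System.out.println(ring.bucketFor("uuid-0001"));
        ring.split(40000, "bucket-5");
        System.out.println(ring.bucketForHash(35000));
    }
}
```

With four initial buckets the range bounds are 16383, 32767, 49151, and 65535, so splitting at 40000 reassigns only hashes 32768 through 40000; keys in the other three buckets keep their placement.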

Impact

No public API change.

Risk level: none | low | medium | high

Low

Contributor's checklist

  • [x] Read through contributor's guide
  • [x] Change Logs and Impact were stated clearly
  • [ ] Adequate tests were added if applicable
  • [ ] CI passed

YuweiXiao avatar Sep 22 '22 02:09 YuweiXiao

CI report:

  • 5de4d1e47173545289d97627fd1c97d2d9da5059 Azure: SUCCESS
Bot commands

@hudi-bot supports the following commands:

  • @hudi-bot run azure: re-run the last Azure build

hudi-bot avatar Oct 12 '22 11:10 hudi-bot

4373.patch.zip

Thanks, I have reviewed and applied a patch. Let's first move the clustering update strategy logic into subclasses of StreamWriteFunction.

danny0405 avatar Oct 14 '22 03:10 danny0405

Thanks for the patch, Danny! Moving the update strategy into subclasses will introduce some duplicate code (e.g., the flushing logic). Is that OK?

Moving the update strategy logic down into the consistent hashing subclasses would limit its scope of influence. We can bring it into the standard stream write pipeline once we are certain it is stable.
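
One way to picture the trade-off in this thread is the sketch below. It is only an illustration under stated assumptions: the classes are hypothetical stand-ins rather than Hudi's StreamWriteFunction, records are plain strings, and the "write to both old and new bucket" behavior is an assumed example of what an update strategy might do during clustering. The base class keeps the flushing logic and exposes a single overridable hook, so the consistent hashing subclass changes routing without copying the flush loop.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Hypothetical stand-in for a generic stream write function. */
class BaseWriteFunctionSketch {
    protected final List<String> buffer = new ArrayList<>();
    protected final List<String> flushed = new ArrayList<>();

    void processElement(String record) {
        buffer.add(record);
    }

    /** Shared flushing logic: every subclass reuses this loop unchanged. */
    void flush() {
        for (String record : buffer) {
            flushed.addAll(route(record));
        }
        buffer.clear();
    }

    /** Hook: the standard pipeline writes each record exactly once. */
    protected List<String> route(String record) {
        return List.of(record);
    }
}

/** Hypothetical subclass that confines the update strategy to consistent hashing. */
class ConsistentBucketWriteFunctionSketch extends BaseWriteFunctionSketch {
    // Assumption for the example: records whose bucket is currently being clustered.
    private final Set<String> recordsUnderClustering;

    ConsistentBucketWriteFunctionSketch(Set<String> recordsUnderClustering) {
        this.recordsUnderClustering = recordsUnderClustering;
    }

    @Override
    protected List<String> route(String record) {
        if (recordsUnderClustering.contains(record)) {
            // Assumed strategy: write to both the old and the new bucket.
            return List.of(record + "@old", record + "@new");
        }
        return List.of(record);
    }
}
```

Under these assumptions the standard pipeline is unaffected by the subclass, which matches the goal of limiting the scope of influence until the strategy is proven stable.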

YuweiXiao avatar Oct 14 '22 03:10 YuweiXiao