[Extension] oak-incremental-index: Low resource (RAM and CPU) incremental-index implementation using off-heap key/value map (OakMap)
Fixes #9967.
Description
This PR introduces an extension that improves Druid's ingestion memory and CPU efficiency. It uses 60% less memory and 50% less CPU time to achieve the same performance. This translates to nearly double the system's ingestion throughput with the same memory budget, and a 75% increase in throughput with the same CPU-time budget. The experimental setup and the full results are available here.
To understand the motivation and rationale behind some of the proposed changes below, it is necessary to read the related issue: #9967.
Introduce OakIncrementalIndex
We add a new incremental index implementation: OakIncrementalIndex as a Druid extension. The implementation is mostly borrowed from OnheapIncrementalIndex and OffheapIncrementalIndex, but has a few notable differences:
- It stores both keys and values off-heap (as opposed to the off-heap implementation, which stores only the values off-heap)
- It is based on `OakMap` instead of Java's `ConcurrentSkipList` (CSL)
- It does not need to keep a mapping from row index to an actual row
- It is always ordered (as expected by `FactsHolder.persistIterable()`), even in plain mode
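To make the off-heap layout and ordering concrete, here is a minimal, hypothetical Java sketch (it does not use the `OakMap` API): keys are kept serialized in a direct `ByteBuffer` and compared in their serialized form, so the index stays ordered without materializing key objects on the heap. Unlike Oak, the skip-list metadata in this sketch still lives on-heap; the class name, key layout, and buffer size are assumptions made for illustration only.

```java
import java.nio.ByteBuffer;
import java.util.Comparator;
import java.util.concurrent.ConcurrentSkipListMap;

public class OffHeapOrderedIndexSketch
{
  // Hypothetical fixed key layout: 8-byte timestamp followed by a 4-byte dimension id.
  private static final int KEY_SIZE = Long.BYTES + Integer.BYTES;

  // Keys live in a direct (off-heap) buffer; growth and bounds checks are omitted.
  private final ByteBuffer keyBuffer = ByteBuffer.allocateDirect(1 << 20);
  private int nextOffset = 0;

  // The comparator works directly on the serialized bytes; no key object is deserialized.
  private final Comparator<Integer> serializedKeyComparator = (leftOffset, rightOffset) -> {
    int cmp = Long.compare(keyBuffer.getLong(leftOffset), keyBuffer.getLong(rightOffset));
    if (cmp != 0) {
      return cmp;
    }
    return Integer.compare(
        keyBuffer.getInt(leftOffset + Long.BYTES),
        keyBuffer.getInt(rightOffset + Long.BYTES)
    );
  };

  // The ordered structure holds only int offsets into the off-heap buffer.
  private final ConcurrentSkipListMap<Integer, Integer> index =
      new ConcurrentSkipListMap<>(serializedKeyComparator);

  public synchronized void add(long timestamp, int dimensionId, int rowIndex)
  {
    int offset = nextOffset;
    keyBuffer.putLong(offset, timestamp);
    keyBuffer.putInt(offset + Long.BYTES, dimensionId);
    nextOffset += KEY_SIZE;
    index.put(offset, rowIndex);
  }

  public Iterable<Integer> orderedRowIndexes()
  {
    // Always ordered, which is what FactsHolder.persistIterable() expects.
    return index.values();
  }

  public static void main(String[] args)
  {
    OffHeapOrderedIndexSketch sketch = new OffHeapOrderedIndexSketch();
    sketch.add(2000L, 7, 0);
    sketch.add(1000L, 3, 1);
    sketch.orderedRowIndexes().forEach(System.out::println); // prints 1, then 0
  }
}
```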
To get the best performance out of our implementation, we had to refactor some of the `IncrementalIndexRow` and `IncrementalIndex` interfaces. This refactoring is explained in #12122.
Key changed/added classes in this commit
- Added everything under `druid/extensions-contrib/oak-incremental-index`. Most notable additions:
  - `OakIncrementalIndex`: follows the `IncrementalIndex` API
  - `OakIncrementalIndexRow`: follows the `IncrementalIndexRow` API
  - `OakKey`: handles the serialization, deserialization, and comparison of keys (see the sketch after this list)
- Updated benchmarks to evaluate our implementation.
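As a rough illustration of the three responsibilities attributed to `OakKey` above (serialization, deserialization, and comparison of keys in their serialized form), here is a hypothetical Java sketch. The on-buffer layout (timestamp, dimension count, dimension values) and the class name are assumptions for the example and do not reflect the extension's actual format.

```java
import java.nio.ByteBuffer;

public final class KeySerdeSketch
{
  private KeySerdeSketch() {}

  /** Writes timestamp + dimension values at the given offset; returns the number of bytes written. */
  public static int serialize(ByteBuffer buf, int offset, long timestamp, int[] dims)
  {
    buf.putLong(offset, timestamp);
    buf.putInt(offset + Long.BYTES, dims.length);
    int pos = offset + Long.BYTES + Integer.BYTES;
    for (int dim : dims) {
      buf.putInt(pos, dim);
      pos += Integer.BYTES;
    }
    return pos - offset;
  }

  /** Reads the dimension values back; the timestamp can be read the same way with getLong. */
  public static int[] deserializeDims(ByteBuffer buf, int offset)
  {
    int count = buf.getInt(offset + Long.BYTES);
    int[] dims = new int[count];
    int pos = offset + Long.BYTES + Integer.BYTES;
    for (int i = 0; i < count; i++) {
      dims[i] = buf.getInt(pos);
      pos += Integer.BYTES;
    }
    return dims;
  }

  /** Compares two serialized keys in place, without deserializing them into objects. */
  public static int compare(ByteBuffer buf, int leftOffset, int rightOffset)
  {
    int cmp = Long.compare(buf.getLong(leftOffset), buf.getLong(rightOffset));
    if (cmp != 0) {
      return cmp;
    }
    int leftCount = buf.getInt(leftOffset + Long.BYTES);
    int rightCount = buf.getInt(rightOffset + Long.BYTES);
    int common = Math.min(leftCount, rightCount);
    for (int i = 0; i < common; i++) {
      int pos = Long.BYTES + Integer.BYTES + i * Integer.BYTES;
      cmp = Integer.compare(buf.getInt(leftOffset + pos), buf.getInt(rightOffset + pos));
      if (cmp != 0) {
        return cmp;
      }
    }
    return Integer.compare(leftCount, rightCount);
  }
}
```

Because comparison happens on the serialized bytes, equal keys can be detected, and the map kept ordered, without allocating per-row key objects on the heap.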
This PR has:
- [X] been self-reviewed.
- [X] added documentation for new or modified features or behaviors.
- [X] added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
- [X] added or updated version, license, or notice information in licenses.yaml
- [X] added comments explaining the "why" and the intent of the code wherever it would not be obvious for an unfamiliar reader.
- [X] added unit tests or modified existing tests to cover new code paths.
- [X] added integration tests.
- [X] been tested in a test Druid cluster.
This pull request introduces 3 alerts when merging eed0606a67f89b653e8cc402eb0fe36af511de30 into 45b699fa4a1330a7c002fb3dabc99c05fb4fc4cd - view on LGTM.com
new alerts:
- 3 for Result of multiplication cast to wider type
Hi @jihoonson, @leventov, @gianm, @ebortnik, @b-slim, and @jon-wei.
Since you had an interest in our previous discussions regarding OakIncrementalIndex, we wanted to update you that our new and improved implementation (presented in this PR) reduces Druid's CPU and RAM consumption by over 50% during the ingestion process. This translates to nearly double the system's ingestion-throughput with the same CPU and RAM budget. Please check out this PR and its related issue (#9967), as well as our detailed system-level experiments.
Glad you guys are still working on this! I will take a look soon.
@jihoonson Thank you. We are eager to hear your valuable feedback.
@clintropolis We noticed that your recent commits modified the same benchmarks as we did in this PR (commit #a607c95). Since you are familiar with this part of Druid, we would appreciate it if you could take the time to review the benchmark part of this PR. This commit alone is a contribution to Druid, since it resolves issues we encountered with the benchmarks. You can find a summary of the modifications here.
We noticed that a lot of Druid users run their workloads on Amazon EC2. We want to point out that this PR will not only improve performance but will also reduce operational costs by allowing users to choose more affordable EC2 instances without sacrificing performance.
The figure below shows the operational cost on Amazon EC2 for different required ingestion throughputs.

Thanks to @yuanlihan for helping us find bugs on the realtime query side. Our system experiments focused on batch ingestion, so his contribution is highly appreciated.
Hi @jihoonson, have you had a chance to check out our issue/PR? We will be happy to answer any questions you might have.
Updates
We are working with our production teams at Verizon Media toward testing our incremental-index implementation on actual production data. As part of this effort, we discovered some issues with (1) sketches, and (2) scenarios where multiple indexes are used during ingestion in a single Peon. We have just updated the PR to solve these issues. Please let us know if you encounter similar issues (or others) and whether this update resolves them.
We continue to evaluate the Oak incremental index on our (Yahoo) production workload and are seeing interesting results. For example, we ran Druid's MiddleManager on two identical VMs that consumed events from the same Kafka feed for an entire day: one with Oak and one with "vanilla" Druid (the on-heap incremental index).
Oak vs. on-heap ("vanilla") incremental-index comparison:
- Oak required 35% fewer flush operations over the entire duration
- These flushes produced 50% larger partitions
- As a result, Oak spent 50 minutes less in flush operations, an almost 40% reduction compared to the on-heap implementation
- This also reduced the time it took to merge and push the segments by almost a minute.
At the latest Druid Summit, we saw that these kinds of optimizations are vital. Companies invest resources in post-ingestion compaction, so creating larger partitions during ingestion can reduce the cost of these efforts.
We appreciate any effort by the community toward moving this forward, starting with a review of our design proposal (#10321).
I have merged your code into the 0.18.1 branch; I am testing it.
This pull request introduces 1 alert when merging cd022f79b46fd2bb82a11fc5a2397c7249a55a07 into 1061faa6ba97106aec79d48aac158e88faa965c8 - view on LGTM.com
new alerts:
- 1 for Implicit narrowing conversion in compound assignment
@wangxiaobaidu11 I'm looking forward to hearing your results. If you need any help, please reach out.
This looks like an amazing improvement. Can't wait for it to be merged in.
Hi @a2l007, @jihoonson, @leventov, @gianm, @ebortnik, @b-slim, and @jon-wei. We appreciate your support and the time you have invested so far in this extension. After introducing a new extension point for the incremental index in Druid in previous PRs, we are now ready to proceed with including our extension as part of Druid. We would appreciate your review of this PR.
You are welcome to check out the related issue (#9967), as well as our detailed system-level experiments, for more information.
Apologies that it has been so slow getting traction on this PR; I have started to look things over and will try to do a full review soon.
@clintropolis Did you have a chance to review this PR? Is there any blockage? Anything I can help with?
Sorry I haven't yet, only blockage has been on my side for finding time to dig in, I think I should have some time later this week though.
Thanks!
Hi @clintropolis, @a2l007, and all. It has been nearly a year since I opened this PR. Although we made some progress by merging #10335 and #10593, this PR remains without any significant review or a concrete plan toward merging. I understand that this PR is big and that you are probably swamped with important work. To expedite the process, I would love to arrange an online meeting where I can guide you through the changes. Just pick a time.
Hi @a2l007, @jihoonson, @leventov, @gianm, @ebortnik, @b-slim, and @jon-wei. I appreciate your support and the time you have committed to this extension. To remove any blockers to the adoption of this improvement, I have published a new PR (#12122) that contains all the changes to Druid internals from this PR, without the additional extension classes. I hope this illustrates that the extension does not alter Druid's logic or behavior; therefore, it poses no threat to Druid's stability and robustness and can be safely merged. Review PR #12122, and you will find that it is easy to read and review because there are no logic changes.
This pull request has been marked as stale due to 60 days of inactivity. It will be closed in 4 weeks if no further activity occurs. If you think that's incorrect or this pull request should instead be reviewed, please simply write any comment. Even if closed, you can still revive the PR at any time or discuss it on the [email protected] list. Thank you for your contributions.
Please don't close this PR.
This issue is no longer marked as stale.
Index performance is very critical for us. Too many tasks mean too many segments.
@exherb For batch ingestion, each task can ingest different time periods. Those can be merged when the task is done, similar to how we merge the results from periodic flushes. So we can benefit from better throughput without sacrificing the number of segments.
It takes about 3 hours to merge for us. Is this pull request ready?
@exherb In my experiments, merging fewer but larger segments takes about the same amount of time as merging many small segments. Even so, this extension enables more data to be ingested into memory before it is flushed to disk, resulting in larger segments.
@exherb Yes, it is ready, except for some conflicts that I can resolve on request.
This pull request has been marked as stale due to 60 days of inactivity. It will be closed in 4 weeks if no further activity occurs. If you think that's incorrect or this pull request should instead be reviewed, please simply write any comment. Even if closed, you can still revive the PR at any time or discuss it on the [email protected] list. Thank you for your contributions.
This pull request/issue has been closed due to lack of activity. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.