[Extension] oak-incremental-index: Low resource (RAM and CPU) incremental-index implementation using off-heap key/value map (OakMap)
Fixes #9967.
Description
This PR introduces an extension that improves Druid's ingestion memory and CPU efficiency. It uses 60% less memory and 50% less CPU time to achieve the same performance. This translates to nearly double the system's ingestion throughput with the same memory budget, and a 75% increase in throughput with the same CPU-time budget. The experimental setup and the full results are available here.
To understand the motivation and rationale behind some of the proposed changes below, it is necessary to read the related issue: #9967.
Introduce OakIncrementalIndex
We add a new incremental index implementation: OakIncrementalIndex as a Druid extension. The implementation is mostly borrowed from OnheapIncrementalIndex and OffheapIncrementalIndex, but has a few notable differences:
- It stores both keys and values off-heap (as opposed to the off-heap implementation, which stores only the values off-heap)
- It is based on `OakMap` instead of Java's `ConcurrentSkipList` (CSL)
- It does not need to keep a mapping from row index to an actual row
- It is always ordered (as expected by `FactsHolder.persistIterable()`), even in plain mode
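To make the off-heap layout and ordering concrete, here is a minimal, hypothetical Java sketch (it does not use the `OakMap` API): keys are kept serialized in a direct `ByteBuffer` and compared in their serialized form, so the index stays ordered without materializing key objects on the heap. Unlike Oak, the skip-list metadata in this sketch still lives on-heap; the class name, key layout, and buffer size are assumptions made for illustration only.

```java
import java.nio.ByteBuffer;
import java.util.Comparator;
import java.util.concurrent.ConcurrentSkipListMap;

public class OffHeapOrderedIndexSketch
{
  // Hypothetical fixed key layout: 8-byte timestamp followed by a 4-byte dimension id.
  private static final int KEY_SIZE = Long.BYTES + Integer.BYTES;

  // Keys live in a direct (off-heap) buffer; growth and bounds checks are omitted.
  private final ByteBuffer keyBuffer = ByteBuffer.allocateDirect(1 << 20);
  private int nextOffset = 0;

  // The comparator works directly on the serialized bytes; no key object is deserialized.
  private final Comparator<Integer> serializedKeyComparator = (leftOffset, rightOffset) -> {
    int cmp = Long.compare(keyBuffer.getLong(leftOffset), keyBuffer.getLong(rightOffset));
    if (cmp != 0) {
      return cmp;
    }
    return Integer.compare(
        keyBuffer.getInt(leftOffset + Long.BYTES),
        keyBuffer.getInt(rightOffset + Long.BYTES)
    );
  };

  // The ordered structure holds only int offsets into the off-heap buffer.
  private final ConcurrentSkipListMap<Integer, Integer> index =
      new ConcurrentSkipListMap<>(serializedKeyComparator);

  public synchronized void add(long timestamp, int dimensionId, int rowIndex)
  {
    int offset = nextOffset;
    keyBuffer.putLong(offset, timestamp);
    keyBuffer.putInt(offset + Long.BYTES, dimensionId);
    nextOffset += KEY_SIZE;
    index.put(offset, rowIndex);
  }

  public Iterable<Integer> orderedRowIndexes()
  {
    // Always ordered, which is what FactsHolder.persistIterable() expects.
    return index.values();
  }

  public static void main(String[] args)
  {
    OffHeapOrderedIndexSketch sketch = new OffHeapOrderedIndexSketch();
    sketch.add(2000L, 7, 0);
    sketch.add(1000L, 3, 1);
    sketch.orderedRowIndexes().forEach(System.out::println); // prints 1, then 0
  }
}
```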
To get the best performance out of our implementation, we had to refactor some of the `IncrementalIndexRow` and `IncrementalIndex` interfaces. This refactoring is explained in #12122.
Key changed/added classes in this commit
- Added everything under `druid/extensions-contrib/oak-incremental-index`. Most notable additions:
  - `OakIncrementalIndex`: follows the `IncrementalIndex` API
  - `OakIncrementalIndexRow`: follows the `IncrementalIndexRow` API
  - `OakKey`: handles the serialization, deserialization, and comparison of keys (see the sketch after this list)
- Updated benchmarks to evaluate our implementation.
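As a rough illustration of the three responsibilities attributed to `OakKey` above (serialization, deserialization, and comparison of keys in their serialized form), here is a hypothetical Java sketch. The on-buffer layout (timestamp, dimension count, dimension values) and the class name are assumptions for the example and do not reflect the extension's actual format.

```java
import java.nio.ByteBuffer;

public final class KeySerdeSketch
{
  private KeySerdeSketch() {}

  /** Writes timestamp + dimension values at the given offset; returns the number of bytes written. */
  public static int serialize(ByteBuffer buf, int offset, long timestamp, int[] dims)
  {
    buf.putLong(offset, timestamp);
    buf.putInt(offset + Long.BYTES, dims.length);
    int pos = offset + Long.BYTES + Integer.BYTES;
    for (int dim : dims) {
      buf.putInt(pos, dim);
      pos += Integer.BYTES;
    }
    return pos - offset;
  }

  /** Reads the dimension values back; the timestamp can be read the same way with getLong. */
  public static int[] deserializeDims(ByteBuffer buf, int offset)
  {
    int count = buf.getInt(offset + Long.BYTES);
    int[] dims = new int[count];
    int pos = offset + Long.BYTES + Integer.BYTES;
    for (int i = 0; i < count; i++) {
      dims[i] = buf.getInt(pos);
      pos += Integer.BYTES;
    }
    return dims;
  }

  /** Compares two serialized keys in place, without deserializing them into objects. */
  public static int compare(ByteBuffer buf, int leftOffset, int rightOffset)
  {
    int cmp = Long.compare(buf.getLong(leftOffset), buf.getLong(rightOffset));
    if (cmp != 0) {
      return cmp;
    }
    int leftCount = buf.getInt(leftOffset + Long.BYTES);
    int rightCount = buf.getInt(rightOffset + Long.BYTES);
    int common = Math.min(leftCount, rightCount);
    for (int i = 0; i < common; i++) {
      int pos = Long.BYTES + Integer.BYTES + i * Integer.BYTES;
      cmp = Integer.compare(buf.getInt(leftOffset + pos), buf.getInt(rightOffset + pos));
      if (cmp != 0) {
        return cmp;
      }
    }
    return Integer.compare(leftCount, rightCount);
  }
}
```

Because comparison happens on the serialized bytes, equal keys can be detected, and the map kept ordered, without allocating per-row key objects on the heap.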
This PR has:
- [X] been self-reviewed.
- [X] added documentation for new or modified features or behaviors.
- [X] added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
- [X] added or updated version, license, or notice information in licenses.yaml
- [X] added comments explaining the "why" and the intent of the code wherever it would not be obvious for an unfamiliar reader.
- [X] added unit tests or modified existing tests to cover new code paths.
- [X] added integration tests.
- [X] been tested in a test Druid cluster.
This pull request introduces 3 alerts when merging eed0606a67f89b653e8cc402eb0fe36af511de30 into 45b699fa4a1330a7c002fb3dabc99c05fb4fc4cd - view on LGTM.com
new alerts:
- 3 for Result of multiplication cast to wider type
Hi @jihoonson, @leventov, @gianm, @ebortnik, @b-slim, and @jon-wei.
Since you had an interest in our previous discussions regarding OakIncrementalIndex, we wanted to update you that our new and improved implementation (presented in this PR) reduces Druid's CPU and RAM consumption by over 50% during the ingestion process. This translates to nearly double the system's ingestion-throughput with the same CPU and RAM budget. Please check out this PR and its related issue (#9967), as well as our detailed system-level experiments.
Glad you guys are still working on this! I will take a look soon.
@jihoonson Thank you. We are eager to hear your valuable feedback.
@clintropolis We noticed that your recent commits modified the same benchmarks as we did in this PR (commit #a607c95). Since you are familiar with this part of Druid, we would appreciate it if you could take the time to review the benchmark part of this PR. This commit alone is a contribution to Druid, since it resolves issues we encountered with the benchmarks. You can find a summary of the modifications here.
We noticed that a lot of Druid users run their workloads on Amazon EC2. We want to point out that this PR will not only improve performance but will also reduce operational costs by allowing users to choose more affordable EC2 instances without sacrificing performance.
The figure below shows the operational cost on Amazon EC2 for different required ingestion throughputs.

Thanks to @yuanlihan for helping us find bugs on the realtime query side. Our system experiments focused on batch ingestion, so his contribution is highly appreciated.
Hi @jihoonson, have you had a chance to check out our issue/PR? We will be happy to answer any questions you might have.
Updates
We are working with our production teams at Verizon Media toward testing our incremental-index implementation on actual production data. As part of this effort, we discovered some issues with (1) sketches, and (2) scenarios where multiple indexes are used during ingestion in a single Peon. We have just updated the PR to solve these issues. Please let us know if you encounter similar issues (or others) and whether this update resolves them.
We continue to evaluate the Oak incremental index on our (Yahoo) production workload and are seeing interesting results. For example, we ran Druid's MiddleManager on two identical VMs that consumed events from the same Kafka feed for an entire day: one with Oak and one with "vanilla" Druid (the on-heap incremental index).
Oak vs. on-heap ("vanilla") incremental-index comparison:
- Oak required 35% fewer flush operations over the entire duration
- These flushes produced 50% larger partitions
- As a result, Oak spent 50 minutes less in flush operations, an almost 40% reduction compared to the on-heap implementation
- This also reduced the time it took to merge and push the segments by almost a minute.
At the latest Druid Summit, we saw that these kinds of optimizations are vital. Companies invest resources in post-ingestion compaction, so creating larger partitions during ingestion can reduce the cost of these efforts.
We appreciate any effort by the community toward moving this forward, starting with a review of our design proposal (#10321).
I have merged your code into the 0.18.1 branch; I am testing it.
This pull request introduces 1 alert when merging cd022f79b46fd2bb82a11fc5a2397c7249a55a07 into 1061faa6ba97106aec79d48aac158e88faa965c8 - view on LGTM.com
new alerts:
- 1 for Implicit narrowing conversion in compound assignment
@wangxiaobaidu11 I'm looking forward to hearing your results. If you need any help, please reach out.
This looks like an amazing improvement. Can't wait for it to be merged in.
Hi @a2l007, @jihoonson, @leventov, @gianm, @ebortnik, @b-slim, and @jon-wei. We appreciate your support and the time you have invested so far in this extension. After introducing a new extension point for the incremental index in Druid in previous PRs, we are now ready to proceed with including our extension as part of Druid. We would appreciate your review of this PR.
You are welcome to check out the related issue (#9967), as well as our detailed system-level experiments, for more information.
Apologies that it has been so slow getting traction on this PR; I have started to look things over and will try to do a full review soon.
@clintropolis Did you have a chance to review this PR? Is there any blockage? Anything I can help with?
Sorry I haven't yet, only blockage has been on my side for finding time to dig in, I think I should have some time later this week though.
Thanks!
Hi @clintropolis, @a2l007, and all. It has been nearly a year since I opened this PR. Although we made some progress by merging #10335 and #10593, this PR remains without any significant review or a concrete plan toward merging. I understand that this PR is big and that you are probably swamped with important work. To expedite the process, I would love to arrange an online meeting where I can guide you through the changes. Just pick a time.
Hi @a2l007, @jihoonson, @leventov, @gianm, @ebortnik, @b-slim, and @jon-wei. I appreciate your support and the time you have committed to this extension. To remove any blockers to the adoption of this improvement, I have published a new PR (#12122) that contains all the changes to Druid internals from this PR, without the additional extension classes. I hope this illustrates that the extension does not alter Druid's logic or behavior; therefore, it poses no threat to Druid's stability and robustness and can be safely merged. Review PR #12122, and you will find that it is easy to read and review because there are no logic changes.
This pull request has been marked as stale due to 60 days of inactivity. It will be closed in 4 weeks if no further activity occurs. If you think that's incorrect or this pull request should instead be reviewed, please simply write any comment. Even if closed, you can still revive the PR at any time or discuss it on the [email protected] list. Thank you for your contributions.
Please don't close this PR.
This issue is no longer marked as stale.
Index performance is very critical for us. Too many tasks mean too many segments.
@exherb For batch ingestion, each task can ingest different time periods. Those can be merged when the task is done, similar to how we merge the results from periodic flushes. So we can benefit from better throughput without sacrificing the number of segments.
It takes about 3 hours to merge for us. Is this pull request ready?
@exherb In my experiments, merging fewer but larger segments takes about the same amount of time as merging many small segments. Even so, this extension enables more data to be ingested into memory before it is flushed to disk, resulting in larger segments.
@exherb Yes, it is ready, except for some conflicts that I can resolve on request.
This pull request has been marked as stale due to 60 days of inactivity. It will be closed in 4 weeks if no further activity occurs. If you think that's incorrect or this pull request should instead be reviewed, please simply write any comment. Even if closed, you can still revive the PR at any time or discuss it on the [email protected] list. Thank you for your contributions.
This pull request/issue has been closed due to lack of activity. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.