Improve realtime Lucene text index freshness/cpu/disk io usage
This PR improves freshness and reduces CPU/disk IO usage for the realtime Lucene text index.
User facing changes:
- Add config `pinot.server.lucene.min.refresh.interval.ms` (default is 10, as a 10ms delay was the previous behavior)
- Add config `pinot.server.lucene.max.refresh.threads` (default is 1, as a single refresh thread was the previous behavior); see the example below
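For example, a deployment that wants fresher text indexes could raise the thread count, while one that wants less refresh overhead could raise the interval. A minimal server-config snippet (the values shown are illustrative only, not recommendations):

```properties
# Minimum interval between refreshes of a Lucene NRT index, in milliseconds (default: 10)
pinot.server.lucene.min.refresh.interval.ms=20

# Maximum number of background refresh threads (default: 1)
pinot.server.lucene.max.refresh.threads=4
```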
Implementation changes:
- Use a scale-first `ScalingThreadPoolExecutor` to allow multiple background refresh threads to refresh Lucene indexes (a simplified refresh-loop sketch follows this list)
- All `RealtimeLuceneTextIndex._searcherManager`s are evenly distributed between the background refresh threads
  - The refresh thread pool scales 1 thread : 1 `RealtimeLuceneTextIndex` up to the configured max threads; beyond that, each thread handles multiple `RealtimeLuceneTextIndex`es
  - If tables are deleted or a consuming segment rebalance leaves a thread without a `RealtimeLuceneTextIndex` to refresh, the thread is removed
- Refactor `RealtimeLuceneTextIndex`-specific logic out of `MutableSegmentImpl`: the index registers itself with the refresh manager and is removed once closed
- Add `LuceneNRTCachingMergePolicy` to perform best-effort merging of in-memory Lucene segments, since each refresh causes a flush and more frequent refreshes would otherwise produce huge numbers of small files
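As a rough illustration of the partitioning scheme (a hypothetical sketch, not the PR's actual classes; `RefreshRunnable`, the field names, and the timing logic below are invented for the example), each background thread owns a subset of the Lucene `SearcherManager`s and refreshes them no more often than the configured minimum interval:

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import org.apache.lucene.search.SearcherManager;

// Hypothetical sketch: each background refresh thread owns a partition of
// SearcherManagers and refreshes them at most once per minRefreshIntervalMs.
class RefreshRunnable implements Runnable {
  private final List<SearcherManager> _partition = new CopyOnWriteArrayList<>();
  private final long _minRefreshIntervalMs;

  RefreshRunnable(long minRefreshIntervalMs) {
    _minRefreshIntervalMs = minRefreshIntervalMs;
  }

  void add(SearcherManager searcherManager) {
    _partition.add(searcherManager);
  }

  @Override
  public void run() {
    while (!Thread.currentThread().isInterrupted()) {
      long start = System.currentTimeMillis();
      for (SearcherManager searcherManager : _partition) {
        try {
          // Reopens the NRT reader only if new documents were indexed; otherwise a cheap no-op
          searcherManager.maybeRefresh();
        } catch (Exception e) {
          // Log and continue with the remaining indexes in this partition
        }
      }
      long sleepMs = _minRefreshIntervalMs - (System.currentTimeMillis() - start);
      if (sleepMs > 0) {
        try {
          Thread.sleep(sleepMs);
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
        }
      }
    }
  }
}
```

In the PR itself, the assignment of indexes to threads and the thread lifecycle (scaling up to the configured max, removing idle threads) are handled by the refresh manager and the scale-first `ScalingThreadPoolExecutor` described above.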
With the configs unset (default settings), we see lower CPU/disk IO and slightly better index freshness. With more aggressive configs, we see much better index freshness (we have many tables with text indexes) at the same or similar resource usage.
For testing, we've had this deployed in some of our prod clusters for a bit without issues.
Codecov Report
Attention: Patch coverage is 80.00000% with 24 lines in your changes missing coverage. Please review.
Project coverage is 62.00%. Comparing base (59551e4) to head (624eae6). Report is 795 commits behind head on master.
Additional details and impacted files
@@ Coverage Diff @@
## master #13503 +/- ##
============================================
+ Coverage 61.75% 62.00% +0.24%
+ Complexity 207 198 -9
============================================
Files 2436 2554 +118
Lines 133233 140552 +7319
Branches 20636 21868 +1232
============================================
+ Hits 82274 87145 +4871
- Misses 44911 46765 +1854
- Partials 6048 6642 +594
| Flag | Coverage Δ | |
|---|---|---|
| custom-integration1 | <0.01% <0.00%> (-0.01%) | :arrow_down: |
| integration | <0.01% <0.00%> (-0.01%) | :arrow_down: |
| integration1 | <0.01% <0.00%> (-0.01%) | :arrow_down: |
| integration2 | 0.00% <0.00%> (ø) | |
| java-11 | 61.94% <80.00%> (+0.23%) | :arrow_up: |
| java-21 | 61.87% <80.00%> (+0.25%) | :arrow_up: |
| skip-bytebuffers-false | 61.96% <80.00%> (+0.22%) | :arrow_up: |
| skip-bytebuffers-true | 61.85% <80.00%> (+34.12%) | :arrow_up: |
| temurin | 62.00% <80.00%> (+0.24%) | :arrow_up: |
| unittests | 61.99% <80.00%> (+0.25%) | :arrow_up: |
| unittests1 | 46.43% <0.00%> (-0.46%) | :arrow_down: |
| unittests2 | 27.78% <80.00%> (+0.05%) | :arrow_up: |
Flags with carried forward coverage won't be shown.
Is it possible to share some numbers on how this improvement has helped with freshness, along with the corresponding impact on CPU / mem / IO utilization / GC, etc.?
I think it may be useful to analyze these numbers for a prod-like setting with a steady-state workload.
IIRC, apart from freshness, there has also been a correctness concern with the way Lucene NRT works and the whole snapshot refresh business. Are we fixing that too?
> Is it possible to share some numbers on how this improvement has helped with freshness, along with the corresponding impact on CPU / mem / IO utilization / GC, etc.?
> I think it may be useful to analyze these numbers for a prod-like setting with a steady-state workload.
Yes. For CPU/mem/GC, we found the queue poll/offer pattern in such a tight loop caused ~1.2 GB/s of allocations per thread (realizing now this was with less than a 10ms delay). We use Generational ZGC, and this had an impact on the % of CPU spent on GC, especially when increasing the refresh thread count. I can't find the flame graphs for this, but the simple change to an ArrayList solves that, and the reduced allocations should be apparent even when profiling locally.
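To illustrate where the allocations come from (hypothetical code; the PR's actual classes and collections differ), the old poll/offer pattern allocates a new queue node for every element on every pass, while iterating a stable list does not:

```java
import java.io.IOException;
import java.util.List;
import java.util.Queue;
import org.apache.lucene.search.SearcherManager;

// Hypothetical before/after to show the allocation difference in the hot loop.
class RefreshLoops {
  // Before: poll/offer on a linked queue allocates a new node per element, per pass.
  static void refreshViaQueue(Queue<SearcherManager> queue) throws IOException {
    int size = queue.size();
    for (int i = 0; i < size; i++) {
      SearcherManager searcherManager = queue.poll();
      if (searcherManager == null) {
        break;
      }
      searcherManager.maybeRefresh();
      queue.offer(searcherManager);  // e.g. ConcurrentLinkedQueue allocates a node here every time
    }
  }

  // After: iterating a stable list has no per-element allocation in the hot loop.
  static void refreshViaList(List<SearcherManager> searcherManagers) throws IOException {
    for (SearcherManager searcherManager : searcherManagers) {
      searcherManager.maybeRefresh();
    }
  }
}
```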
The disk IO improvement mostly comes from taking advantage of the `LuceneNRTCachingMergePolicy`. We make a best-effort attempt to merge only segments that are entirely in memory, which reduces file descriptors and avoids most of the IO (see the sketch below).
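The merge-policy idea, roughly: only propose merges over segments whose files are still entirely in the `NRTCachingDirectory`'s RAM cache, so the many small segments produced by frequent flushes get merged before they ever reach disk. Below is a simplified sketch built on Lucene's `FilterMergePolicy` and `NRTCachingDirectory.listCachedFiles()`; the PR's actual `LuceneNRTCachingMergePolicy` is more involved (e.g. it still has to allow normal merges of on-disk segments):

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import org.apache.lucene.index.FilterMergePolicy;
import org.apache.lucene.index.MergePolicy;
import org.apache.lucene.index.MergeTrigger;
import org.apache.lucene.index.SegmentCommitInfo;
import org.apache.lucene.index.SegmentInfos;
import org.apache.lucene.store.NRTCachingDirectory;

// Simplified sketch: merge only segments that are still fully held in the NRT cache.
class InMemoryOnlyMergePolicy extends FilterMergePolicy {
  private final NRTCachingDirectory _nrtCachingDirectory;

  InMemoryOnlyMergePolicy(MergePolicy in, NRTCachingDirectory nrtCachingDirectory) {
    super(in);
    _nrtCachingDirectory = nrtCachingDirectory;
  }

  @Override
  public MergeSpecification findMerges(MergeTrigger mergeTrigger, SegmentInfos segmentInfos,
      MergeContext mergeContext) throws IOException {
    Set<String> cachedFiles = new HashSet<>(Arrays.asList(_nrtCachingDirectory.listCachedFiles()));
    List<SegmentCommitInfo> inMemorySegments = new ArrayList<>();
    for (SegmentCommitInfo segmentCommitInfo : segmentInfos) {
      // A segment is an in-memory merge candidate only if every one of its files is still cached
      if (cachedFiles.containsAll(segmentCommitInfo.files())) {
        inMemorySegments.add(segmentCommitInfo);
      }
    }
    if (inMemorySegments.size() < 2) {
      // Nothing worth merging while fully in memory; a fuller version would
      // also consult the wrapped policy here (e.g. super.findMerges(...))
      return null;
    }
    MergeSpecification mergeSpecification = new MergeSpecification();
    mergeSpecification.add(new OneMerge(inMemorySegments));
    return mergeSpecification;
  }
}
```

Merging while the files are still only in the cache is what keeps the more frequent refreshes from producing large numbers of small on-disk files and open file descriptors.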
Here's an example of reduced delay with 10 threads per server. For reference, this is in a production cluster with hundreds of consuming partitions per node. The narrow spikes are mostly due to server restarts, and the wide periods of narrow spikes are due to rollouts (I need to make a change to avoid emitting delay metrics if the server is still catching up). With a single queue, we see that all tables are sensitive to ingestion spikes/data pattern changes in a single table; partitioning helps reduce the 'noisy neighbor' indexes.
Here are some host metrics around the same time frame, showing no significant change in heap, a slight disk IO reduction, and increased CPU usage (since we went from 1 to 10 threads).
> IIRC, apart from freshness, there has also been a correctness concern with the way Lucene NRT works and the whole snapshot refresh business. Are we fixing that too?
I think this is mostly a separate effort. As I understand it, the snapshot refresh approach exists because it's inherently expensive to build and use Lucene-like structures in memory (especially since the input is not necessarily ordered). For an entire segment this is prohibitive, and it's part of the reason why the native text index's true real-time indexing is relatively resource intensive. By reducing the indexing delay, I think we can reduce the scope of the problem so that we only need to build and hold such a structure in memory for a very small portion of the data (i.e., the portion that has not been refreshed yet).
I opened an issue to track this and will share a doc there with more details, pending further testing. For now, I think this is a standalone feature that is good to have regardless, as it can reduce the amount of incorrect data. If you have any thoughts on this, I would love to continue the discussion there.