
Improve realtime Lucene text index freshness/cpu/disk io usage

Open itschrispeck opened this issue 1 year ago • 3 comments

This PR improves index freshness and reduces CPU/disk IO usage for the realtime Lucene text index.

User facing changes:

  1. Add config pinot.server.lucene.min.refresh.interval.ms (default is 10, as 10ms was the previous behavior)
  2. Add config pinot.server.lucene.max.refresh.threads (default is 1, as a single thread was the previous behavior)
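For example, a server config that keeps the default refresh interval but adds refresh parallelism might look like this (values are illustrative, not recommendations):

```properties
# pinot-server instance config (illustrative values; the defaults preserve the previous behavior)
pinot.server.lucene.min.refresh.interval.ms=10
pinot.server.lucene.max.refresh.threads=4
```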

Implementation changes:

  1. Use a scale-first ScalingThreadPoolExecutor so multiple background refresh threads can refresh Lucene indexes
    1. All RealtimeLuceneTextIndex._searcherManagers are evenly distributed across the background refresh threads.
    2. The refresh pool scales one thread per RealtimeLuceneTextIndex up to the configured max threads; beyond that, each thread handles multiple RealtimeLuceneTextIndex instances (a rough sketch follows this list)
    3. If a table deletion or consuming segment rebalance leaves a thread with no RealtimeLuceneTextIndex to refresh, that thread is removed
  2. Refactor RealtimeLuceneTextIndex-specific logic out of MutableSegmentImpl - the index registers itself with the refresh manager and is removed once it is closed
  3. Add LuceneNRTCachingMergePolicy to perform best-effort merging of in-memory Lucene segments - each refresh causes a flush, so more frequent refreshes would otherwise produce huge numbers of small files.
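To illustrate the partitioning idea, here is a rough sketch (not the PR's code: the PR uses the scale-first ScalingThreadPoolExecutor and manages RealtimeLuceneTextIndex instances via RealtimeLuceneIndexRefreshManager, while this sketch uses a plain fixed pool and operates directly on Lucene SearcherManagers; the class and method names are hypothetical):

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import org.apache.lucene.search.SearcherManager;

public class PartitionedRefreshSketch {
  private final List<List<SearcherManager>> _partitions;
  private final ExecutorService _refreshPool;
  private final long _minRefreshIntervalMs;

  public PartitionedRefreshSketch(int maxRefreshThreads, long minRefreshIntervalMs) {
    _partitions = new CopyOnWriteArrayList<>();
    _refreshPool = Executors.newFixedThreadPool(maxRefreshThreads);
    _minRefreshIntervalMs = minRefreshIntervalMs;
    for (int i = 0; i < maxRefreshThreads; i++) {
      List<SearcherManager> partition = new CopyOnWriteArrayList<>();
      _partitions.add(partition);
      _refreshPool.submit(() -> refreshLoop(partition));
    }
  }

  // New indexes are assigned to the smallest partition so work stays evenly spread
  public void register(SearcherManager searcherManager) {
    List<SearcherManager> smallest = _partitions.get(0);
    for (List<SearcherManager> partition : _partitions) {
      if (partition.size() < smallest.size()) {
        smallest = partition;
      }
    }
    smallest.add(searcherManager);
  }

  private void refreshLoop(List<SearcherManager> partition) {
    while (!Thread.currentThread().isInterrupted()) {
      try {
        for (SearcherManager searcherManager : partition) {
          // maybeRefresh() makes recently indexed docs visible to newly acquired searchers
          searcherManager.maybeRefresh();
        }
        // honor the configured minimum interval between refresh passes
        TimeUnit.MILLISECONDS.sleep(_minRefreshIntervalMs);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      } catch (Exception e) {
        // a real implementation would log and keep refreshing the remaining indexes
      }
    }
  }
}
```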

With the configs unset (default settings), we see lower CPU/disk IO and slightly better index freshness. With more aggressive configs, we see much better index freshness (we have many tables with text indexes) at the same or similar resource usage.

For testing, we've had this deployed in some of our prod clusters for a bit without issues.

itschrispeck avatar Jun 27 '24 21:06 itschrispeck

Codecov Report

Attention: Patch coverage is 80.00000% with 24 lines in your changes missing coverage. Please review.

Project coverage is 62.00%. Comparing base (59551e4) to head (624eae6). Report is 795 commits behind head on master.

| Files | Patch % | Lines |
|---|---|---|
| ...vertedindex/RealtimeLuceneIndexRefreshManager.java | 81.39% | 9 Missing and 7 partials :warning: |
| .../pinot/server/starter/helix/BaseServerStarter.java | 0.00% | 7 Missing :warning: |
| ...mpl/invertedindex/LuceneNRTCachingMergePolicy.java | 92.30% | 0 Missing and 1 partial :warning: |
Additional details and impacted files
```diff
@@             Coverage Diff              @@
##             master   #13503      +/-   ##
============================================
+ Coverage     61.75%   62.00%   +0.24%
+ Complexity      207      198       -9
============================================
  Files          2436     2554     +118
  Lines        133233   140552    +7319
  Branches      20636    21868    +1232
============================================
+ Hits          82274    87145    +4871
- Misses        44911    46765    +1854
- Partials       6048     6642     +594
```

| Flag | Coverage Δ |
|---|---|
| custom-integration1 | <0.01% <0.00%> (-0.01%) :arrow_down: |
| integration | <0.01% <0.00%> (-0.01%) :arrow_down: |
| integration1 | <0.01% <0.00%> (-0.01%) :arrow_down: |
| integration2 | 0.00% <0.00%> (ø) |
| java-11 | 61.94% <80.00%> (+0.23%) :arrow_up: |
| java-21 | 61.87% <80.00%> (+0.25%) :arrow_up: |
| skip-bytebuffers-false | 61.96% <80.00%> (+0.22%) :arrow_up: |
| skip-bytebuffers-true | 61.85% <80.00%> (+34.12%) :arrow_up: |
| temurin | 62.00% <80.00%> (+0.24%) :arrow_up: |
| unittests | 61.99% <80.00%> (+0.25%) :arrow_up: |
| unittests1 | 46.43% <0.00%> (-0.46%) :arrow_down: |
| unittests2 | 27.78% <80.00%> (+0.05%) :arrow_up: |

Flags with carried forward coverage won't be shown.


codecov-commenter avatar Jun 27 '24 22:06 codecov-commenter

Is it possible to share some numbers on how this improvement has helped with freshness, and at the same time the corresponding impact on CPU / mem / IO utilization / GC, etc.?

I think it may be useful to analyze these numbers in a prod-like setting with a steady-state workload.

IIRC, apart from freshness, there has also been a correctness concern with the way Lucene NRT works and the whole snapshot refresh business. Are we fixing that too?

siddharthteotia avatar Jul 05 '24 19:07 siddharthteotia

> Is it possible to share some numbers on how this improvement has helped with freshness, and at the same time the corresponding impact on CPU / mem / IO utilization / GC, etc.?
>
> I think it may be useful to analyze these numbers in a prod-like setting with a steady-state workload.

Yes. For CPU/mem/GC, we found that the queue poll/offer pattern in such a tight loop caused ~1.2gbps of allocations per thread (realizing now this was with less than a 10ms delay). We use Generational ZGC, and this had an impact on the % of CPU spent on GC, especially when increasing the refresh thread count. I can't find the flame graphs for this, but the simple change to an ArrayList solves that, and the reduced allocations should be apparent even when profiling locally.

[image]
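A rough illustration of the allocation difference (a sketch only, not the actual diff; the element type and method names are simplified to Lucene's SearcherManager):

```java
import java.io.IOException;
import java.util.List;
import java.util.Queue;
import org.apache.lucene.search.SearcherManager;

public final class RefreshLoopSketch {
  private RefreshLoopSketch() {
  }

  // Old shape: poll and re-offer in a tight loop, allocating a queue node per element per pass
  static void refreshViaQueue(Queue<SearcherManager> queue) throws IOException {
    for (int i = 0, n = queue.size(); i < n; i++) {
      SearcherManager searcherManager = queue.poll();
      searcherManager.maybeRefresh();
      queue.offer(searcherManager); // re-enqueueing allocates on most queue implementations
    }
  }

  // New shape: iterate a stable list, no per-element allocation per pass
  static void refreshViaList(List<SearcherManager> searcherManagers) throws IOException {
    for (SearcherManager searcherManager : searcherManagers) {
      searcherManager.maybeRefresh();
    }
  }
}
```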

The disk IO improvement mostly comes from the LuceneNRTCachingMergePolicy. We make a best-effort attempt to merge only segments that are entirely in memory, which reduces FDs and avoids most of the IO.
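The core idea can be sketched roughly like this (a minimal sketch assuming the index writes through an NRTCachingDirectory; this is not the PR's actual implementation, and the class name here is made up):

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import org.apache.lucene.index.MergeTrigger;
import org.apache.lucene.index.SegmentCommitInfo;
import org.apache.lucene.index.SegmentInfos;
import org.apache.lucene.index.TieredMergePolicy;
import org.apache.lucene.store.NRTCachingDirectory;

public class InMemoryOnlyMergePolicySketch extends TieredMergePolicy {
  private final NRTCachingDirectory _nrtCachingDirectory;

  public InMemoryOnlyMergePolicySketch(NRTCachingDirectory nrtCachingDirectory) {
    _nrtCachingDirectory = nrtCachingDirectory;
  }

  @Override
  public MergeSpecification findMerges(MergeTrigger mergeTrigger, SegmentInfos segmentInfos,
      MergeContext mergeContext) throws IOException {
    // Files still held in the NRTCachingDirectory's in-memory cache
    Set<String> cachedFiles = new HashSet<>(Arrays.asList(_nrtCachingDirectory.listCachedFiles()));

    // Only consider segments whose files are all still in memory: merging them is cheap
    // and avoids touching segments that have already been written to disk.
    SegmentInfos inMemorySegments = new SegmentInfos(segmentInfos.getIndexCreatedVersionMajor());
    for (SegmentCommitInfo segmentCommitInfo : segmentInfos) {
      if (cachedFiles.containsAll(segmentCommitInfo.files())) {
        inMemorySegments.add(segmentCommitInfo);
      }
    }
    return super.findMerges(mergeTrigger, inMemorySegments, mergeContext);
  }
}
```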

Here's an example of the reduced delay with 10 threads/server. For reference, this is a production cluster with hundreds of consuming partitions per node. The narrow spikes are mostly due to server restarts; the wider periods of repeated spikes are due to rollouts (I need to make a change to avoid emitting the delay metric while a server is still catching up). With a single queue, all tables are sensitive to ingestion spikes or data pattern changes in any one table. Partitioning helps contain these 'noisy neighbor' indexes.

[image]

Here are some host metrics from around the same time frame, showing no significant change in heap usage, a slight disk IO reduction, and increased CPU usage (since we went from 1 to 10 threads).

[image]

> IIRC, apart from freshness, there has also been a correctness concern with the way Lucene NRT works and the whole snapshot refresh business. Are we fixing that too?

I think this is mostly a separate effort. As I understand it, the snapshot refresh approach exists because it is inherently expensive to build and use Lucene-like structures in memory (especially since the input is not necessarily ordered). For an entire segment this is prohibitive, and it is part of the reason why the native text index's true real-time indexing is relatively resource intensive. By reducing the indexing delay, I think we can shrink the scope of the problem so that we only need to build and hold such a structure in memory for a very small portion of the data (i.e., the portion that has not been refreshed yet).

I opened an issue to track this and will share a doc there with more details, pending further testing. For now, I think this is a standalone feature that is good to have regardless, as it can reduce the amount of incorrect data. If you have any thoughts on this, I would love to continue the discussion there.

itschrispeck avatar Jul 05 '24 22:07 itschrispeck