Improve realtime Lucene text index freshness/cpu/disk io usage
This PR improves index freshness and reduces CPU and disk I/O usage for the realtime Lucene text index.
User facing changes:
- Add config `pinot.server.lucene.min.refresh.interval.ms` (default is 10, as 10ms was the previous behavior)
- Add config `pinot.server.lucene.max.refresh.threads` (default is 1, as a single thread was the previous behavior)
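As a rough illustration, the two settings could be written out explicitly as below. The key names and default values come from this description; placing them in the server configuration file is an assumption based on the `pinot.server.` prefix.

```properties
# Defaults, equivalent to the previous behavior:
# refresh at most every 10 ms, using a single background refresh thread.
pinot.server.lucene.min.refresh.interval.ms=10
pinot.server.lucene.max.refresh.threads=1
```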
Implementation changes:
- Use scale-first `ScalingThreadPoolExecutor` to allow for multiple background refresh threads to refresh Lucene indexes
  - All `RealtimeLuceneTextIndex._searcherManager`s are evenly distributed between background refresh threads.
  - The refresh thread pool is 1 thread:1 `RealtimeLuceneTextIndex`, up to max threads configured, then each thread handles multiple `RealtimeLuceneTextIndex`
  - If tables are deleted/consuming segment rebalance occurs leaving a thread without a `RealtimeLuceneTextIndex` to refresh, the thread will be removed
- Refactor `RealtimeLuceneTextIndex` specific logic out of `MutableSegmentImpl`
  - the index itself registers itself with the refresh manager, and is removed once closed
- Add `LuceneNRTCachingMergePolicy` to perform best effort merging of in-memory Lucene segments
  - each refresh causes a flush, and making refreshes more common will cause huge numbers of small files.
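The thread-to-index assignment described above can be sketched roughly as follows. This is a minimal, hypothetical model and not the PR's actual refresh manager: the class `RefreshPartitioner` and its methods are invented for illustration, each inner list stands in for the set of indexes one refresh thread would handle, and no real threads or Lucene objects are involved.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of the assignment policy; names are illustrative only. */
final class RefreshPartitioner<T> {
  private final int maxThreads;
  // One inner list per conceptual refresh thread.
  private final List<List<T>> partitions = new ArrayList<>();

  RefreshPartitioner(int maxThreads) {
    if (maxThreads < 1) {
      throw new IllegalArgumentException("maxThreads must be >= 1");
    }
    this.maxThreads = maxThreads;
  }

  /** Registers an index: a new partition while under the cap, else the least-loaded one. */
  synchronized void add(T index) {
    if (partitions.size() < maxThreads) {
      List<T> fresh = new ArrayList<>();
      fresh.add(index);
      partitions.add(fresh);
      return;
    }
    List<T> smallest = partitions.get(0);
    for (List<T> p : partitions) {
      if (p.size() < smallest.size()) {
        smallest = p;
      }
    }
    smallest.add(index);
  }

  /** Unregisters an index; a partition left with nothing to refresh is dropped. */
  synchronized void remove(T index) {
    for (int i = 0; i < partitions.size(); i++) {
      List<T> p = partitions.get(i);
      if (p.remove(index)) {
        if (p.isEmpty()) {
          partitions.remove(i);
        }
        return;
      }
    }
  }

  /** Number of partitions, i.e. refresh threads that would currently be needed. */
  synchronized int threadCount() {
    return partitions.size();
  }

  /** Largest number of indexes assigned to any single partition. */
  synchronized int maxLoad() {
    int max = 0;
    for (List<T> p : partitions) {
      max = Math.max(max, p.size());
    }
    return max;
  }
}
```

With a cap of 2, adding three indexes yields two partitions with a maximum load of 2; removing the only index in a partition shrinks the count back to 1. The sketch does not rebalance after removals, so it only approximates the "evenly distributed" behavior described above.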
With the configs left unset (default settings), we see lower CPU and disk I/O usage and slightly better index freshness. With more aggressive configs, we see much better index freshness (we have many tables with text indexes) at the same or similar resource usage.
For testing, we've had this deployed in some of our prod clusters for a bit without issues.