Make ingestion delay configurable: with concurrency fixes
We ran into some errors with the previous version (#14074), so it was reverted in #14127. Here's the new PR with multiple fixes applied after the revert.
The ingestion offset lag metric currently makes a request to the Kafka broker to fetch the latest offset on every update.
However, this has led to an increasing amount of load on Kafka brokers for multiple users.
This PR adds a way to enable/disable the metric and to configure its update interval using the following properties in the server or cluster configs:
pinot.server.instance.offset.lag.tracking.enable: true
pinot.server.instance.offset.lag.tracking.update.interval: 60000
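For illustration, a minimal sketch of how these properties might be set in a Pinot server configuration file (standard Java-properties syntax; the exact file name and placement depend on the deployment):

```properties
# Enable offset lag tracking; the update interval is assumed to be in milliseconds (60000 = 1 minute)
pinot.server.instance.offset.lag.tracking.enable=true
pinot.server.instance.offset.lag.tracking.update.interval=60000
```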
Codecov Report
Attention: Patch coverage is 0% with 68 lines in your changes missing coverage. Please review.
Project coverage is 34.18%. Comparing base (59551e4) to head (979c83c). Report is 1264 commits behind head on master.
| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...e/data/manager/realtime/IngestionDelayTracker.java | 0.00% | 67 Missing :warning: |
| ...a/manager/realtime/RealtimeSegmentDataManager.java | 0.00% | 1 Missing :warning: |
:exclamation: There is a different number of reports uploaded between BASE (59551e4) and HEAD (979c83c).
HEAD has 9 uploads less than BASE:

| Flag | BASE (59551e4) | HEAD (979c83c) |
|---|---|---|
| integration2 | 3 | 2 |
| temurin | 12 | 11 |
| skip-bytebuffers-false | 7 | 5 |
| unittests | 5 | 3 |
| unittests1 | 2 | 0 |
| java-11 | 5 | 4 |
Additional details and impacted files
@@ Coverage Diff @@
## master #14142 +/- ##
=============================================
- Coverage 61.75% 34.18% -27.57%
- Complexity 207 765 +558
=============================================
Files 2436 2660 +224
Lines 133233 145739 +12506
Branches 20636 22297 +1661
=============================================
- Hits 82274 49826 -32448
- Misses 44911 91894 +46983
+ Partials 6048 4019 -2029
| Flag | Coverage Δ | |
|---|---|---|
| custom-integration1 | 100.00% <ø> (+99.99%) | :arrow_up: |
| integration | 100.00% <ø> (+99.99%) | :arrow_up: |
| integration1 | 100.00% <ø> (+99.99%) | :arrow_up: |
| integration2 | 0.00% <ø> (ø) | |
| java-11 | 34.18% <0.00%> (-27.53%) | :arrow_down: |
| java-21 | 34.18% <0.00%> (-27.45%) | :arrow_down: |
| skip-bytebuffers-false | 34.18% <0.00%> (-27.56%) | :arrow_down: |
| skip-bytebuffers-true | 34.16% <0.00%> (+6.43%) | :arrow_up: |
| temurin | 34.18% <0.00%> (-27.57%) | :arrow_down: |
| unittests | 34.18% <0.00%> (-27.57%) | :arrow_down: |
| unittests1 | ? | |
| unittests2 | 34.18% <0.00%> (+6.45%) | :arrow_up: |
Flags with carried forward coverage won't be shown.
@swaminathanmanish Please help review this
Before this PR, the logic is:
- For every message batch, we make a separate metadata fetch request to get the latest offset

This PR:
- Moves the latest offset fetching logic into `IngestionDelayTracker` and periodically refreshes `IngestionInfo._latestOffset`
- Still uses `_latestOffset - _currentOffset` to compute the ingestion lag
- The problem with this approach is that `_latestOffset` and `_currentOffset` are not updated together

To solve the above problem:
- Move the latest offset fetching logic into `IngestionDelayTracker`
- Fetch the latest offset within `updateIngestionMetrics()`, but with the given minimum interval (not on every invocation)
- Directly store `_offsetLag` instead of `_latestOffset` in the `IngestionInfo` (see the sketch after this list)
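A minimal sketch of the approach in the list above, assuming a hypothetical `OffsetLagTracker` class and a `LongSupplier` wrapping the Kafka metadata fetch; names and signatures are illustrative, not the actual Pinot `IngestionDelayTracker` API:

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

/** Illustrative sketch: fetch the latest offset at most once per interval and store the lag directly. */
public class OffsetLagTracker {
  private final LongSupplier _latestOffsetSupplier; // wraps the broker metadata fetch (assumption)
  private final long _updateIntervalMs;             // minimum interval between fetches
  private final AtomicLong _lastFetchTimeMs = new AtomicLong(0);
  private volatile long _offsetLag = 0;

  public OffsetLagTracker(LongSupplier latestOffsetSupplier, long updateIntervalMs) {
    _latestOffsetSupplier = latestOffsetSupplier;
    _updateIntervalMs = updateIntervalMs;
  }

  /** Called on every metrics update; only hits the broker when the interval has elapsed. */
  public void updateIngestionMetrics(long currentOffset) {
    long now = System.currentTimeMillis();
    long lastFetch = _lastFetchTimeMs.get();
    // CAS on the fetch timestamp so concurrent updates don't trigger duplicate broker requests.
    if (now - lastFetch >= _updateIntervalMs && _lastFetchTimeMs.compareAndSet(lastFetch, now)) {
      long latestOffset = _latestOffsetSupplier.getAsLong();
      // Store the lag itself (not the latest offset), so both offsets come from the same update.
      _offsetLag = Math.max(0, latestOffset - currentOffset);
    }
  }

  public long getOffsetLag() {
    return _offsetLag;
  }
}
```

The compare-and-set on the fetch timestamp is one simple way to ensure only a single thread performs the broker request per interval.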
Alternatively, we can directly calculate the lag in `RealtimeSegmentDataManager` with the given minimum interval (putting null when the latest offset is not fetched). This way we don't need to manage the metadata fetch within `IngestionDelayTracker`.
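For contrast, a sketch of this second alternative under the same assumptions: a hypothetical helper used by `RealtimeSegmentDataManager` computes the lag itself and returns null when the minimum interval has not elapsed, so `IngestionDelayTracker` never manages the metadata fetch:

```java
import java.util.function.LongSupplier;

/** Illustrative sketch of the consumer-side alternative; not the actual Pinot API. */
public class PartitionLagCalculator {
  private final LongSupplier _latestOffsetSupplier; // wraps the broker metadata fetch (assumption)
  private final long _minFetchIntervalMs;
  private long _lastFetchTimeMs;

  public PartitionLagCalculator(LongSupplier latestOffsetSupplier, long minFetchIntervalMs) {
    _latestOffsetSupplier = latestOffsetSupplier;
    _minFetchIntervalMs = minFetchIntervalMs;
  }

  /** Returns the offset lag, or null when the minimum interval has not elapsed (no fetch issued). */
  public synchronized Long maybeComputeOffsetLag(long currentOffset) {
    long now = System.currentTimeMillis();
    if (now - _lastFetchTimeMs < _minFetchIntervalMs) {
      return null; // caller keeps the previous metric value or skips the update
    }
    _lastFetchTimeMs = now;
    return Math.max(0, _latestOffsetSupplier.getAsLong() - currentOffset);
  }
}
```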
@Jackie-Jiang went with the first alternative
I kind of prefer the second way, so that we don't need to manage the metadata fetch within `IngestionDelayTracker`. It should greatly simplify the logic. Can you give it a try?
Please also take a look at the test failure.
Closing this in favor of #15831