pinot icon indicating copy to clipboard operation
pinot copied to clipboard

Disable OOM protection when IHOP JVM params not active (server/broker)

Open xiangfu0 opened this issue 5 months ago • 3 comments

This change adds a JVM capability guard for OOM protection on both Server and Broker.

  • During startup, after enabling thread memory allocation tracking, we verify whether the JVM actually has thread-allocated-bytes measurement enabled via ThreadResourceUsageProvider.isThreadMemoryMeasurementEnabled().
  • If not enabled (IHOP JVM parameters not active), we proactively set pinot.query.scheduler.accounting.oom.enable.killing.query=false in the effective config, disabling OOM query killing to avoid false positives.

Files changed:

  • server: pinot-server/src/main/java/org/apache/pinot/server/starter/helix/BaseServerStarter.java
  • broker: pinot-broker/src/main/java/org/apache/pinot/broker/broker/helix/BaseBrokerStarter.java

Motivation:

  • OOM protection relies on accurate per-thread allocated-bytes telemetry. When the JVM isn’t configured to expose it, keeping OOM protection enabled can lead to incorrect throttling or query killing decisions.

Notes:

  • No behavioral change when IHOP JVM params are present and the capability is enabled.
  • Only deprecation warnings observed in lints; no new errors introduced.

xiangfu0 avatar Sep 28 '25 11:09 xiangfu0

Codecov Report

:x: Patch coverage is 52.38095% with 40 lines in your changes missing coverage. Please review. :white_check_mark: Project coverage is 63.25%. Comparing base (57364b1) to head (c51a5f0). :white_check_mark: All tests successful. No failed tests found.

Files with missing lines Patch % Lines
.../apache/pinot/common/utils/OomProtectionUtils.java 53.94% 24 Missing and 11 partials :warning:
.../pinot/server/starter/helix/BaseServerStarter.java 0.00% 4 Missing :warning:
...e/pinot/broker/broker/helix/BaseBrokerStarter.java 75.00% 0 Missing and 1 partial :warning:
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #16912      +/-   ##
============================================
- Coverage     63.28%   63.25%   -0.04%     
  Complexity     1433     1433              
============================================
  Files          3134     3135       +1     
  Lines        186304   186388      +84     
  Branches      28457    28475      +18     
============================================
- Hits         117898   117893       -5     
- Misses        59312    59398      +86     
- Partials       9094     9097       +3     
Flag Coverage Δ
custom-integration1 100.00% <ø> (ø)
integration 100.00% <ø> (ø)
integration1 100.00% <ø> (ø)
integration2 0.00% <ø> (ø)
java-11 55.66% <52.63%> (-7.56%) :arrow_down:
java-21 63.22% <52.38%> (-0.03%) :arrow_down:
temurin 63.25% <52.38%> (-0.04%) :arrow_down:
unittests 63.24% <52.38%> (-0.04%) :arrow_down:
unittests1 55.69% <52.63%> (+0.01%) :arrow_up:
unittests2 33.85% <8.33%> (-0.05%) :arrow_down:

Flags with carried forward coverage won't be shown. Click here to find out more.

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

:rocket: New features to boost your workflow:
  • :package: JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

codecov-commenter avatar Sep 28 '25 15:09 codecov-commenter

Trying to understand the change. When -XX:-G1UseAdaptiveIHOP is not set, does it prevent per-thread memory/CPU utilization from being reported? Per the argument name, seems they are unrelated

@Jackie-Jiang I think the PR description is wrong. -G1UseAdaptiveIHOP doesn't affect whether we can estimate the amount of allocated memory. The reason we need it is that if the % of used memory is below IHOP, the JVM won't collect old generation. The OOM protection algorithm assumes the % of used memory is mostly alive memory, but as long as the % of memory is below IHOP, the used memory may be mostly garbage. As a result, if our thresholds are smaller than IHOP, we may be throttling or killing queries while the % of alive memory is actually very low. With adaptive IHOP, the actual % of memory that fires old GCs is undefined.

I think what we should do here is to verify that the IHOP is lower than our thresholds. This means:

  • If the JVM uses adaptive IHOP, the check should fail
  • If the JVM doesn't use IHOP:
    • If IHOP is higher than our thresholds, the check should fail
    • In other case, the check should succeed.

If the check fails, we should log an error message and continue working as before (ie if the check was executed during startup, run without OOM protection, if the check failed when a config was changed, log an error and continue using the old config)

gortiz avatar Dec 05 '25 15:12 gortiz

Trying to understand the change. When -XX:-G1UseAdaptiveIHOP is not set, does it prevent per-thread memory/CPU utilization from being reported? Per the argument name, seems they are unrelated

@Jackie-Jiang I think the PR description is wrong. -G1UseAdaptiveIHOP doesn't affect whether we can estimate the amount of allocated memory. The reason we need it is that if the % of used memory is below IHOP, the JVM won't collect old generation. The OOM protection algorithm assumes the % of used memory is mostly alive memory, but as long as the % of memory is below IHOP, the used memory may be mostly garbage. As a result, if our thresholds are smaller than IHOP, we may be throttling or killing queries while the % of alive memory is actually very low. With adaptive IHOP, the actual % of memory that fires old GCs is undefined.

I think what we should do here is to verify that the IHOP is lower than our thresholds. This means:

  • If the JVM uses adaptive IHOP, the check should fail

  • If the JVM doesn't use IHOP:

    • If IHOP is higher than our thresholds, the check should fail
    • In other case, the check should succeed.

If the check fails, we should log an error message and continue working as before (ie if the check was executed during startup, run without OOM protection, if the check failed when a config was changed, log an error and continue using the old config)

  • If adaptive IHOP is on (missing -XX:-G1UseAdaptiveIHOP), the check fails and we disable OOM protection.
  • With static IHOP, we parse -XX:InitiatingHeapOccupancyPercent; if it’s at/above the lowest OOM threshold, the check fails and OOM protection is disabled. Otherwise it passes.
  • Failures now log an explicit error with the detected IHOP and the OOM thresholds, but we keep running (startup continues without OOM protection or retains the prior config on change).

xiangfu0 avatar Dec 06 '25 23:12 xiangfu0