Disable OOM protection when IHOP JVM params not active (server/broker)
This change adds a JVM capability guard for OOM protection on both Server and Broker.
- During startup, after enabling thread memory allocation tracking, we verify whether the JVM actually has thread-allocated-bytes measurement enabled via
ThreadResourceUsageProvider.isThreadMemoryMeasurementEnabled(). - If not enabled (IHOP JVM parameters not active), we proactively set
pinot.query.scheduler.accounting.oom.enable.killing.query=falsein the effective config, disabling OOM query killing to avoid false positives.
Files changed:
- server:
pinot-server/src/main/java/org/apache/pinot/server/starter/helix/BaseServerStarter.java - broker:
pinot-broker/src/main/java/org/apache/pinot/broker/broker/helix/BaseBrokerStarter.java
Motivation:
- OOM protection relies on accurate per-thread allocated-bytes telemetry. When the JVM isn’t configured to expose it, keeping OOM protection enabled can lead to incorrect throttling or query killing decisions.
Notes:
- No behavioral change when IHOP JVM params are present and the capability is enabled.
- Only deprecation warnings observed in lints; no new errors introduced.
Codecov Report
:x: Patch coverage is 52.38095% with 40 lines in your changes missing coverage. Please review.
:white_check_mark: Project coverage is 63.25%. Comparing base (57364b1) to head (c51a5f0).
:white_check_mark: All tests successful. No failed tests found.
Additional details and impacted files
@@ Coverage Diff @@
## master #16912 +/- ##
============================================
- Coverage 63.28% 63.25% -0.04%
Complexity 1433 1433
============================================
Files 3134 3135 +1
Lines 186304 186388 +84
Branches 28457 28475 +18
============================================
- Hits 117898 117893 -5
- Misses 59312 59398 +86
- Partials 9094 9097 +3
| Flag | Coverage Δ | |
|---|---|---|
| custom-integration1 | 100.00% <ø> (ø) |
|
| integration | 100.00% <ø> (ø) |
|
| integration1 | 100.00% <ø> (ø) |
|
| integration2 | 0.00% <ø> (ø) |
|
| java-11 | 55.66% <52.63%> (-7.56%) |
:arrow_down: |
| java-21 | 63.22% <52.38%> (-0.03%) |
:arrow_down: |
| temurin | 63.25% <52.38%> (-0.04%) |
:arrow_down: |
| unittests | 63.24% <52.38%> (-0.04%) |
:arrow_down: |
| unittests1 | 55.69% <52.63%> (+0.01%) |
:arrow_up: |
| unittests2 | 33.85% <8.33%> (-0.05%) |
:arrow_down: |
Flags with carried forward coverage won't be shown. Click here to find out more.
:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.
:rocket: New features to boost your workflow:
- :package: JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.
Trying to understand the change. When -XX:-G1UseAdaptiveIHOP is not set, does it prevent per-thread memory/CPU utilization from being reported? Per the argument name, seems they are unrelated
@Jackie-Jiang I think the PR description is wrong. -G1UseAdaptiveIHOP doesn't affect whether we can estimate the amount of allocated memory. The reason we need it is that if the % of used memory is below IHOP, the JVM won't collect old generation. The OOM protection algorithm assumes the % of used memory is mostly alive memory, but as long as the % of memory is below IHOP, the used memory may be mostly garbage. As a result, if our thresholds are smaller than IHOP, we may be throttling or killing queries while the % of alive memory is actually very low. With adaptive IHOP, the actual % of memory that fires old GCs is undefined.
I think what we should do here is to verify that the IHOP is lower than our thresholds. This means:
- If the JVM uses adaptive IHOP, the check should fail
- If the JVM doesn't use IHOP:
- If IHOP is higher than our thresholds, the check should fail
- In other case, the check should succeed.
If the check fails, we should log an error message and continue working as before (ie if the check was executed during startup, run without OOM protection, if the check failed when a config was changed, log an error and continue using the old config)
Trying to understand the change. When -XX:-G1UseAdaptiveIHOP is not set, does it prevent per-thread memory/CPU utilization from being reported? Per the argument name, seems they are unrelated
@Jackie-Jiang I think the PR description is wrong.
-G1UseAdaptiveIHOPdoesn't affect whether we can estimate the amount of allocated memory. The reason we need it is that if the % of used memory is below IHOP, the JVM won't collect old generation. The OOM protection algorithm assumes the % of used memory is mostly alive memory, but as long as the % of memory is below IHOP, the used memory may be mostly garbage. As a result, if our thresholds are smaller than IHOP, we may be throttling or killing queries while the % of alive memory is actually very low. With adaptive IHOP, the actual % of memory that fires old GCs is undefined.I think what we should do here is to verify that the IHOP is lower than our thresholds. This means:
If the JVM uses adaptive IHOP, the check should fail
If the JVM doesn't use IHOP:
- If IHOP is higher than our thresholds, the check should fail
- In other case, the check should succeed.
If the check fails, we should log an error message and continue working as before (ie if the check was executed during startup, run without OOM protection, if the check failed when a config was changed, log an error and continue using the old config)
- If adaptive IHOP is on (missing -XX:-G1UseAdaptiveIHOP), the check fails and we disable OOM protection.
- With static IHOP, we parse -XX:InitiatingHeapOccupancyPercent; if it’s at/above the lowest OOM threshold, the check fails and OOM protection is disabled. Otherwise it passes.
- Failures now log an explicit error with the detected IHOP and the OOM thresholds, but we keep running (startup continues without OOM protection or retains the prior config on change).