YARN-11878. AsyncDispatcher event queue backlog with millions of STAT…
Description of PR
JIRA: YARN-11878. AsyncDispatcher event queue backlog with millions of STATUS_UPDATE events
Avoid costly ContainerStatusPBImpl.getCapability() calls in STATUS_UPDATE when Opportunistic containers are disabled
Background
This behavior was introduced by YARN-11003 to support the Opportunistic containers optimization in the ResourceManager.
To implement that optimization, StatusUpdateWhenHealthyTransition calls ContainerStatusPBImpl.getCapability() during every STATUS_UPDATE event.
This ensures container resource capability info is always available for scheduling decisions
when opportunistic containers are enabled.
However, in clusters where opportunistic containers are disabled,
retrieving capability in every STATUS_UPDATE becomes unnecessary,
since the capability value is not used in most workflows.
Currently
NodeManager heartbeat: frequent STATUS_UPDATE events sent to the ResourceManager
Each STATUS_UPDATE processing: triggers ContainerStatusPBImpl.getCapability()
Problem: Even when the opportunistic container feature is off, the same costly protobuf parsing and ResourcePBImpl object construction still happens for each event. This leads to:
- High CPU usage in the AsyncDispatcher event processing thread
- Millions of repeated, unused protobuf parses in large clusters
- Increased event queue latency and slower scheduling decisions
Impact
In clusters with thousands of nodes, STATUS_UPDATE events can account for >90% of the AsyncDispatcher queue.
Profiling shows that getCapability() calls consume >90% of CPU time in StatusUpdateWhenHealthyTransition.transition() when opportunistic containers are disabled.
The overhead is pure waste under these conditions and can be entirely skipped.
Proposed Changes
- Skip capability retrieval logic when `opportunisticContainersEnabled` is false.
- Cache the `remoteContainer.getCapability()` result in a local variable to prevent multiple protobuf parsing calls within the same STATUS_UPDATE handling.
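The two proposed changes can be sketched in isolation as follows. This is a minimal illustration, not the actual patch: `FakeContainerStatus`, `Handler`, and `getCapabilityMemoryMb()` are hypothetical stand-ins for the real RM classes, and the call counter only simulates the cost of the protobuf-backed getter.

```java
import java.util.List;

public class StatusUpdateSketch {

  /** Hypothetical stand-in for a protobuf-backed container status. */
  static final class FakeContainerStatus {
    private final long memoryMb;
    private int capabilityCalls = 0;

    FakeContainerStatus(long memoryMb) {
      this.memoryMb = memoryMb;
    }

    /** Stand-in for the costly getter; counts how often it is invoked. */
    long getCapabilityMemoryMb() {
      capabilityCalls++;
      return memoryMb;
    }

    int getCapabilityCalls() {
      return capabilityCalls;
    }
  }

  /** Hypothetical handler for the container statuses of one STATUS_UPDATE. */
  static final class Handler {
    private final boolean opportunisticContainersEnabled;
    private int processed = 0;
    private long trackedMemoryMb = 0;

    Handler(boolean opportunisticContainersEnabled) {
      this.opportunisticContainersEnabled = opportunisticContainersEnabled;
    }

    void handle(List<FakeContainerStatus> statuses) {
      for (FakeContainerStatus status : statuses) {
        processed++; // work that happens regardless of the flag
        if (!opportunisticContainersEnabled) {
          continue; // change 1: never read the capability when the feature is off
        }
        // change 2: read the capability once and reuse the local value
        long capabilityMb = status.getCapabilityMemoryMb();
        if (capabilityMb > 0) {
          trackedMemoryMb += capabilityMb;
        }
      }
    }

    int getProcessed() {
      return processed;
    }

    long getTrackedMemoryMb() {
      return trackedMemoryMb;
    }
  }

  public static void main(String[] args) {
    for (boolean enabled : new boolean[] {false, true}) {
      FakeContainerStatus status = new FakeContainerStatus(2048);
      Handler handler = new Handler(enabled);
      handler.handle(List.of(status));
      System.out.println("enabled=" + enabled
          + " getterCalls=" + status.getCapabilityCalls()
          + " trackedMemoryMb=" + handler.getTrackedMemoryMb());
    }
  }
}
```

With the flag off, the getter is never invoked; with the flag on, it is invoked exactly once per container status even though the value is used twice.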
How was this patch tested?
CI
For code changes:
- [x] Does the title of this PR start with the corresponding JIRA issue id (e.g. 'YARN-11878. Your PR title ...')?
- [ ] Object storage: have the integration tests been executed and the endpoint declared according to the connector-specific documentation?
- [ ] If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?
- [ ] If applicable, have you updated the `LICENSE`, `LICENSE-binary`, `NOTICE-binary` files?
Performance Verification in Production
We tested this patch in a production YARN cluster and used Arthas to monitor RM node event handling performance via:
monitor -c 5 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$NodeEventDispatcher handle
Result:
- Before patch (original YARN-11003 behavior): average NM heartbeat handling time ≈ 1.10 ms
- After patch (getCapability() skipped or cached when Opportunistic containers are disabled): average NM heartbeat handling time ≈ 0.09 ms

This is an over 12× improvement in heartbeat event processing latency, which significantly reduces the load on the RM AsyncDispatcher thread and improves scheduling responsiveness in large clusters.
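As a quick check of the speedup figure, using only the two averages reported above:

```math
\frac{1.10\ \text{ms}}{0.09\ \text{ms}} \approx 12.2, \qquad 1 - \frac{0.09}{1.10} \approx 0.918
```

That is, roughly a 12× speedup, or about a 92% reduction in average per-heartbeat handling time.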
Conclusion:
The patch removes unnecessary getCapability() calls when the Opportunistic container feature is disabled, reducing CPU overhead and improving event queue turnover rate. This optimization has already proven effective in production with substantial gains in RM performance.
:broken_heart: -1 overall
| Vote | Subsystem | Runtime | Logfile | Comment |
|---|---|---|---|---|
| +0 :ok: | reexec | 0m 58s | Docker mode activated. | |
| _ Prechecks _ | ||||
| +1 :green_heart: | dupname | 0m 0s | No case conflicting files found. | |
| +0 :ok: | codespell | 0m 1s | codespell was not available. | |
| +0 :ok: | detsecrets | 0m 1s | detect-secrets was not available. | |
| +1 :green_heart: | @author | 0m 0s | The patch does not contain any @author tags. | |
| -1 :x: | test4tests | 0m 0s | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. | |
| _ trunk Compile Tests _ | ||||
| +1 :green_heart: | mvninstall | 37m 23s | trunk passed | |
| +1 :green_heart: | compile | 1m 4s | trunk passed with JDK Ubuntu-17.0.15+6-Ubuntu-0ubuntu120.04 | |
| +1 :green_heart: | compile | 1m 14s | trunk passed with JDK Ubuntu-21.0.7+6-Ubuntu-0ubuntu120.04 | |
| +1 :green_heart: | checkstyle | 1m 1s | trunk passed | |
| +1 :green_heart: | mvnsite | 1m 11s | trunk passed | |
| +1 :green_heart: | javadoc | 0m 56s | trunk passed with JDK Ubuntu-17.0.15+6-Ubuntu-0ubuntu120.04 | |
| +1 :green_heart: | javadoc | 0m 54s | trunk passed with JDK Ubuntu-21.0.7+6-Ubuntu-0ubuntu120.04 | |
| -1 :x: | spotbugs | 1m 15s | /branch-spotbugs-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt | hadoop-yarn-server-resourcemanager in trunk failed. |
| +1 :green_heart: | shadedclient | 33m 36s | branch has no errors when building and testing our client artifacts. | |
| _ Patch Compile Tests _ | ||||
| +1 :green_heart: | mvninstall | 1m 0s | the patch passed | |
| +1 :green_heart: | compile | 0m 57s | the patch passed with JDK Ubuntu-17.0.15+6-Ubuntu-0ubuntu120.04 | |
| +1 :green_heart: | javac | 0m 57s | the patch passed | |
| +1 :green_heart: | compile | 0m 56s | the patch passed with JDK Ubuntu-21.0.7+6-Ubuntu-0ubuntu120.04 | |
| +1 :green_heart: | javac | 0m 56s | the patch passed | |
| +1 :green_heart: | blanks | 0m 0s | The patch has no blanks issues. | |
| +1 :green_heart: | checkstyle | 0m 35s | the patch passed | |
| +1 :green_heart: | mvnsite | 1m 18s | the patch passed | |
| +1 :green_heart: | javadoc | 0m 45s | the patch passed with JDK Ubuntu-17.0.15+6-Ubuntu-0ubuntu120.04 | |
| +1 :green_heart: | javadoc | 0m 45s | the patch passed with JDK Ubuntu-21.0.7+6-Ubuntu-0ubuntu120.04 | |
| -1 :x: | spotbugs | 1m 2s | /patch-spotbugs-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt | hadoop-yarn-server-resourcemanager in the patch failed. |
| +1 :green_heart: | shadedclient | 33m 45s | patch has no errors when building and testing our client artifacts. | |
| _ Other Tests _ | ||||
| -1 :x: | unit | 111m 44s | /patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt | hadoop-yarn-server-resourcemanager in the patch passed. |
| +1 :green_heart: | asflicense | 0m 35s | The patch does not generate ASF License warnings. | |
| 224m 27s |
| Reason | Tests |
|---|---|
| Failed junit tests | hadoop.yarn.server.resourcemanager.scheduler.TestAbstractYarnScheduler |
| | hadoop.yarn.server.resourcemanager.resourcetracker.TestNMExpiry |
| | hadoop.yarn.server.resourcemanager.resourcetracker.TestNMReconnect |
| | hadoop.yarn.server.resourcemanager.TestResourceTrackerService |
| | hadoop.yarn.server.resourcemanager.resourcetracker.TestRMNMRPCResponseId |
| | hadoop.yarn.server.resourcemanager.TestRMNodeTransitions |
| | hadoop.yarn.server.resourcemanager.logaggregationstatus.TestRMAppLogAggregationStatus |
| Subsystem | Report/Notes |
|---|---|
| Docker | ClientAPI=1.51 ServerAPI=1.51 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-8026/1/artifact/out/Dockerfile |
| GITHUB PR | https://github.com/apache/hadoop/pull/8026 |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
| uname | Linux bc2c114e91d1 5.15.0-156-generic #166-Ubuntu SMP Sat Aug 9 00:02:46 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/bin/hadoop.sh |
| git revision | trunk / c5764a91c2dd89e5c6971b6f5b3cee7100da3b72 |
| Default Java | Ubuntu-21.0.7+6-Ubuntu-0ubuntu120.04 |
| Multi-JDK versions | /usr/lib/jvm/java-17-openjdk-amd64:Ubuntu-17.0.15+6-Ubuntu-0ubuntu120.04 /usr/lib/jvm/java-21-openjdk-amd64:Ubuntu-21.0.7+6-Ubuntu-0ubuntu120.04 |
| Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-8026/1/testReport/ |
| Max. process+thread count | 930 (vs. ulimit of 5500) |
| modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager |
| Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-8026/1/console |
| versions | git=2.25.1 maven=3.9.11 |
| Powered by | Apache Yetus 0.14.0 https://yetus.apache.org |
This message was automatically generated.
:broken_heart: -1 overall
| Vote | Subsystem | Runtime | Logfile | Comment |
|---|---|---|---|---|
| +0 :ok: | reexec | 0m 56s | Docker mode activated. | |
| _ Prechecks _ | ||||
| +1 :green_heart: | dupname | 0m 0s | No case conflicting files found. | |
| +0 :ok: | codespell | 0m 1s | codespell was not available. | |
| +0 :ok: | detsecrets | 0m 1s | detect-secrets was not available. | |
| +1 :green_heart: | @author | 0m 0s | The patch does not contain any @author tags. | |
| -1 :x: | test4tests | 0m 0s | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. | |
| _ trunk Compile Tests _ | ||||
| +1 :green_heart: | mvninstall | 38m 27s | trunk passed | |
| +1 :green_heart: | compile | 1m 5s | trunk passed with JDK Ubuntu-17.0.15+6-Ubuntu-0ubuntu120.04 | |
| +1 :green_heart: | compile | 1m 4s | trunk passed with JDK Ubuntu-21.0.7+6-Ubuntu-0ubuntu120.04 | |
| +1 :green_heart: | checkstyle | 0m 47s | trunk passed | |
| +1 :green_heart: | mvnsite | 1m 8s | trunk passed | |
| +1 :green_heart: | javadoc | 0m 56s | trunk passed with JDK Ubuntu-17.0.15+6-Ubuntu-0ubuntu120.04 | |
| +1 :green_heart: | javadoc | 0m 54s | trunk passed with JDK Ubuntu-21.0.7+6-Ubuntu-0ubuntu120.04 | |
| -1 :x: | spotbugs | 3m 4s | /branch-spotbugs-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager-warnings.html | hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager in trunk has 1298 extant spotbugs warnings. |
| +1 :green_heart: | shadedclient | 29m 43s | branch has no errors when building and testing our client artifacts. | |
| _ Patch Compile Tests _ | ||||
| +1 :green_heart: | mvninstall | 0m 56s | the patch passed | |
| +1 :green_heart: | compile | 0m 55s | the patch passed with JDK Ubuntu-17.0.15+6-Ubuntu-0ubuntu120.04 | |
| +1 :green_heart: | javac | 0m 55s | the patch passed | |
| +1 :green_heart: | compile | 0m 55s | the patch passed with JDK Ubuntu-21.0.7+6-Ubuntu-0ubuntu120.04 | |
| +1 :green_heart: | javac | 0m 55s | the patch passed | |
| +1 :green_heart: | blanks | 0m 0s | The patch has no blanks issues. | |
| +1 :green_heart: | checkstyle | 0m 36s | the patch passed | |
| +1 :green_heart: | mvnsite | 0m 59s | the patch passed | |
| +1 :green_heart: | javadoc | 0m 43s | the patch passed with JDK Ubuntu-17.0.15+6-Ubuntu-0ubuntu120.04 | |
| +1 :green_heart: | javadoc | 0m 47s | the patch passed with JDK Ubuntu-21.0.7+6-Ubuntu-0ubuntu120.04 | |
| +1 :green_heart: | spotbugs | 3m 4s | the patch passed | |
| +1 :green_heart: | shadedclient | 29m 25s | patch has no errors when building and testing our client artifacts. | |
| _ Other Tests _ | ||||
| -1 :x: | unit | 111m 9s | /patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt | hadoop-yarn-server-resourcemanager in the patch passed. |
| +1 :green_heart: | asflicense | 0m 35s | The patch does not generate ASF License warnings. | |
| 225m 44s |
| Reason | Tests |
|---|---|
| Failed junit tests | hadoop.yarn.server.resourcemanager.scheduler.TestAbstractYarnScheduler |
| | hadoop.yarn.server.resourcemanager.resourcetracker.TestNMExpiry |
| | hadoop.yarn.server.resourcemanager.resourcetracker.TestNMReconnect |
| | hadoop.yarn.server.resourcemanager.TestResourceTrackerService |
| | hadoop.yarn.server.resourcemanager.resourcetracker.TestRMNMRPCResponseId |
| | hadoop.yarn.server.resourcemanager.TestRMNodeTransitions |
| | hadoop.yarn.server.resourcemanager.logaggregationstatus.TestRMAppLogAggregationStatus |
| Subsystem | Report/Notes |
|---|---|
| Docker | ClientAPI=1.51 ServerAPI=1.51 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-8026/3/artifact/out/Dockerfile |
| GITHUB PR | https://github.com/apache/hadoop/pull/8026 |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
| uname | Linux 1136ef71a13e 5.15.0-156-generic #166-Ubuntu SMP Sat Aug 9 00:02:46 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/bin/hadoop.sh |
| git revision | trunk / c5764a91c2dd89e5c6971b6f5b3cee7100da3b72 |
| Default Java | Ubuntu-21.0.7+6-Ubuntu-0ubuntu120.04 |
| Multi-JDK versions | /usr/lib/jvm/java-17-openjdk-amd64:Ubuntu-17.0.15+6-Ubuntu-0ubuntu120.04 /usr/lib/jvm/java-21-openjdk-amd64:Ubuntu-21.0.7+6-Ubuntu-0ubuntu120.04 |
| Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-8026/3/testReport/ |
| Max. process+thread count | 951 (vs. ulimit of 5500) |
| modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager |
| Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-8026/3/console |
| versions | git=2.25.1 maven=3.9.11 spotbugs=4.9.7 |
| Powered by | Apache Yetus 0.14.0 https://yetus.apache.org |
This message was automatically generated.