flink
[FLINK-38703][runtime] Update slot manager metrics in a thread-safe manner
What is the purpose of the change
This PR resolves a race condition that leads to a ConcurrentModificationException when OpenTelemetry (OTel) metrics are collected simultaneously with updates to the tracked task managers in the SlotManager.
Brief change log
- New fields for the metrics are introduced
- These fields are updated periodically on the main thread, every 1s
- OpenTelemetry reads these static fields
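The approach in the change log can be illustrated with a minimal sketch. It is not the code from this PR: the class `SlotMetricsSketch`, its fields, and its methods are hypothetical names, and the use of `volatile` is an assumption made for the example. The idea is that the reporter thread only ever reads plain snapshot fields, so it never iterates the collection that the main thread mutates.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

/** Hypothetical sketch of the snapshot-field pattern; not Flink's actual SlotManager code. */
class SlotMetricsSketch {
    // Owned by the main thread only; a plain HashMap is not safe to iterate concurrently.
    private final Map<String, Integer> freeSlotsByTaskManager = new HashMap<>();

    // Snapshot fields: written on the main thread, read by the metrics reporter thread.
    // volatile (an assumption of this sketch) makes each write visible to the reader.
    private volatile long registeredTaskManagers;
    private volatile long freeSlots;

    /** Called on the main thread when a task manager registers. */
    void registerTaskManager(String id, int slots) {
        freeSlotsByTaskManager.put(id, slots);
    }

    /** Recomputes the snapshots; meant to be scheduled on the main thread, e.g. every 1s. */
    void updateMetrics() {
        long free = 0;
        for (int slots : freeSlotsByTaskManager.values()) {
            free += slots;
        }
        registeredTaskManagers = freeSlotsByTaskManager.size();
        freeSlots = free;
    }

    // Gauges only read the snapshot fields and never touch the map.
    Supplier<Long> registeredTaskManagersGauge() {
        return () -> registeredTaskManagers;
    }

    Supplier<Long> freeSlotsGauge() {
        return () -> freeSlots;
    }

    public static void main(String[] args) {
        SlotMetricsSketch sketch = new SlotMetricsSketch();
        sketch.registerTaskManager("tm-1", 4);
        sketch.registerTaskManager("tm-2", 2);

        // The snapshot is stale until the next periodic update runs.
        System.out.println(sketch.registeredTaskManagersGauge().get()); // prints 0

        sketch.updateMetrics();
        System.out.println(sketch.registeredTaskManagersGauge().get()); // prints 2
        System.out.println(sketch.freeSlotsGauge().get());              // prints 6
    }
}
```

The trade-off of this design is that a reader can observe values up to one update interval old, which is why a gauge read immediately after a registration may still report the previous value.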
Verifying this change
This change added tests and can be verified as follows:
- A new test was added that verifies the fields are periodically updated and that the values are propagated to the OpenTelemetry collector.
Does this pull request potentially affect one of the following parts:
- Dependencies (does it add or upgrade a dependency): no
- The public API, i.e., is any changed class annotated with @Public(Evolving): no
- The serializers: no
- The runtime per-record code paths (performance sensitive): no
- Anything that affects deployment or recovery: no
- The S3 file system connector: no
Documentation
- Does this pull request introduce a new feature? no
- If yes, how is the feature documented? not applicable
CI report:
- 065a7f85eed832cbc44af1ac4eb054cdfa0908f5 Azure: FAILURE
Bot commands
The @flinkbot bot supports the following commands:
- @flinkbot run azure: re-run the last Azure build
CI is not passing due to a failing integration test.
Failures:
Nov 20 17:36:29 17:36:29.834 [ERROR] JobManagerMetricsITCase.testJobManagerMetrics:136 expected:<1> but was:<0>
The reason is that the failing test reads the metrics before the fields have been updated (the update interval is 1s). Update: I added a sleep to the test.
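For illustration, a test can also wait for a periodically updated value by polling with a deadline rather than sleeping for a fixed time. The sketch below is only an assumption about how such a wait could look; `MetricWait` and `waitUntil` are hypothetical names and are not part of this PR or of Flink's test utilities.

```java
import java.time.Duration;
import java.util.function.BooleanSupplier;

/** Hypothetical helper: polls a condition until it holds or a timeout expires. */
class MetricWait {
    static boolean waitUntil(BooleanSupplier condition, Duration timeout, Duration pollInterval) {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (true) {
            if (condition.getAsBoolean()) {
                return true; // the metric reached the expected value
            }
            if (System.nanoTime() - deadline >= 0) {
                return false; // timed out without seeing the expected value
            }
            try {
                Thread.sleep(pollInterval.toMillis());
            } catch (InterruptedException e) {
                // Restore the interrupt flag and give up waiting.
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }
}
```

With a timeout comfortably above the 1s update interval, such a wait returns as soon as the updated value is visible and fails only if the update never arrives.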
@flinkbot run azure
FYI: to get a green CI, rebase onto the latest master; the e2e test was fixed in FLINK-38700.