[FLINK-38703][runtime] Update slot manager metrics in a thread-safe manner

Open ztison opened this issue 1 month ago • 4 comments

What is the purpose of the change

This PR resolves a race condition that leads to a ConcurrentModificationException when OpenTelemetry (OTel) metrics are collected simultaneously with updates to the tracked task managers in the SlotManager.

Brief change log

  • New fields are introduced to hold the metric values
  • These fields are updated periodically (every 1s) on the main thread
  • OpenTelemetry reads these static fields
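The change log above can be illustrated with a minimal sketch. It is not the code from this PR: the class name, field names, and methods below are hypothetical, and a single-threaded scheduled executor stands in for the main thread. The idea shown is that only the main thread touches the mutable map, while the metrics reader only ever sees plain volatile snapshot values, so it never iterates a collection that is being modified.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch; names do not come from the Flink code base.
public class SlotMetricsSnapshotSketch {

    // Mutable state, accessed by the main thread only
    // (assumed stand-in for the tracked task managers).
    private final Map<String, Integer> slotsPerTaskManager = new HashMap<>();

    // Snapshot fields: written by the main thread, read by the metrics thread.
    private volatile long registeredSlots;
    private volatile long registeredTaskManagers;

    // Must be called from the main thread.
    public void registerTaskManager(String id, int slots) {
        slotsPerTaskManager.put(id, slots);
    }

    // Must be called from the main thread; copies derived values into the snapshot fields.
    public void refreshSnapshot() {
        long slots = 0;
        for (int s : slotsPerTaskManager.values()) {
            slots += s;
        }
        registeredSlots = slots;
        registeredTaskManagers = slotsPerTaskManager.size();
    }

    // Safe to call from any thread: reads a volatile field, never the map.
    public long getRegisteredSlots() {
        return registeredSlots;
    }

    public long getRegisteredTaskManagers() {
        return registeredTaskManagers;
    }

    public static void main(String[] args) throws Exception {
        SlotMetricsSnapshotSketch sm = new SlotMetricsSnapshotSketch();
        // Single-threaded executor standing in for the main thread.
        ScheduledExecutorService mainThread = Executors.newSingleThreadScheduledExecutor();
        mainThread.execute(() -> sm.registerTaskManager("tm-1", 4));
        // Refresh the snapshot every second, as in the change log.
        mainThread.scheduleAtFixedRate(sm::refreshSnapshot, 0, 1, TimeUnit.SECONDS);
        Thread.sleep(200);
        // A metrics reporter thread would call the getters like this.
        System.out.println("registered slots: " + sm.getRegisteredSlots());
        mainThread.shutdownNow();
    }
}
```

The trade-off in this kind of design is staleness: a reader can see values up to one refresh interval old, which is usually acceptable for gauges.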

Verifying this change

This change added tests and can be verified as follows:

  • A new test was added that verifies the fields are updated periodically and that the values are propagated to the OpenTelemetry collector.

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): no
  • The public API, i.e., is any changed class annotated with @Public(Evolving): no
  • The serializers: no
  • The runtime per-record code paths (performance sensitive): no
  • Anything that affects deployment or recovery: no
  • The S3 file system connector: no

Documentation

  • Does this pull request introduce a new feature? no
  • If yes, how is the feature documented? not applicable

ztison avatar Nov 20 '25 15:11 ztison

CI report:

  • 065a7f85eed832cbc44af1ac4eb054cdfa0908f5 Azure: FAILURE

Bot commands

The @flinkbot bot supports the following commands:

  • @flinkbot run azure: re-run the last Azure build

flinkbot avatar Nov 20 '25 15:11 flinkbot

CI is not passing due to a failing integration test.

Failures: 
Nov 20 17:36:29 17:36:29.834 [ERROR]   JobManagerMetricsITCase.testJobManagerMetrics:136 expected:<1> but was:<0>

The test fails because it reads the metrics before the fields have been updated (the update interval is 1s). Update: I added a sleep to the test.
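For illustration only, here is one way a test could wait for a periodically refreshed metric. This is a sketch under assumptions, not the change made in this PR (which adds a sleep): the class and method names are hypothetical, and the metric is modelled as a plain `LongSupplier`. Instead of sleeping for a fixed time, it polls until the expected value appears or a deadline passes.

```java
import java.util.function.LongSupplier;

// Hypothetical test helper; not part of Flink.
public class MetricWaitSketch {

    // Polls the metric until it equals the expected value or the timeout elapses.
    // Returns true if the expected value was observed, false otherwise.
    public static boolean waitForValue(LongSupplier metric, long expected, long timeoutMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (System.nanoTime() - deadline < 0) {
            if (metric.getAsLong() == expected) {
                return true;
            }
            try {
                Thread.sleep(50); // poll interval, chosen arbitrarily for this sketch
            } catch (InterruptedException e) {
                // Restore the interrupt flag and give up waiting.
                Thread.currentThread().interrupt();
                return false;
            }
        }
        // One last check after the deadline.
        return metric.getAsLong() == expected;
    }
}
```

With a 1s refresh interval, a timeout of a few seconds would leave room for at least one refresh while returning as soon as the value shows up.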

ztison avatar Nov 21 '25 11:11 ztison

@flinkbot run azure

ztison avatar Nov 21 '25 14:11 ztison

FYI: to get a green CI, rebase onto the latest master; the e2e test was fixed in FLINK-38700.

snuyanzin avatar Nov 21 '25 17:11 snuyanzin