[FLINK-38584][metrics] Support checkpoint path as Prometheus info-style metric

Open sohurdc opened this issue 1 month ago • 1 comment

What is the purpose of the change

This pull request enhances the Prometheus reporter to export the lastCheckpointExternalPath metric as an info-style metric, making it compatible with Prometheus and VictoriaMetrics.

Current Problem:

  • The lastCheckpointExternalPath metric is currently exported as a string-valued Gauge
  • Prometheus and VictoriaMetrics store only numeric sample values, so a string-valued gauge such as the checkpoint path cannot be stored
  • Users must use additional storage systems (e.g., InfluxDB) to track checkpoint paths, increasing operational complexity

Solution:

  • Export lastCheckpointExternalPath as a Prometheus info-style metric with _info suffix
  • Store the checkpoint path in a path label instead of as a metric value
  • Set the metric value to 1.0 (following Prometheus convention for info metrics)
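
As a rough illustration of the three points above, the following sketch builds one exposition line for an info-style metric. It is not the code in this PR: InfoMetricLineSketch, formatInfoLine and escapeLabelValue are hypothetical names, and the escaping shown (backslash, double quote, newline) is the label-value escaping of the Prometheus text exposition format, which the real reporter delegates to the Prometheus client library.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper for illustration only; not part of the Flink code base.
final class InfoMetricLineSketch {

    /** Escapes backslash, double quote and newline, as the text exposition format requires. */
    static String escapeLabelValue(String value) {
        return value.replace("\\", "\\\\").replace("\"", "\\\"").replace("\n", "\\n");
    }

    /** Builds "<baseName>_info{<labels>,path="<path>"} 1.0". */
    static String formatInfoLine(String baseName, Map<String, String> labels, String path) {
        StringJoiner joined = new StringJoiner(",", "{", "}");
        for (Map.Entry<String, String> label : labels.entrySet()) {
            joined.add(label.getKey() + "=\"" + escapeLabelValue(label.getValue()) + "\"");
        }
        // The checkpoint path travels as a label; the sample value is the constant 1.0.
        joined.add("path=\"" + escapeLabelValue(path) + "\"");
        return baseName + "_info" + joined + " 1.0";
    }

    public static void main(String[] args) {
        Map<String, String> labels = new LinkedHashMap<>();
        labels.put("job_id", "abc");      // made-up label values
        labels.put("job_name", "demo");
        System.out.println(formatInfoLine(
                "flink_jobmanager_job_lastCheckpointExternalPath", labels, "hdfs://nn/checkpoints/chk-42"));
    }
}
```

Keeping the value fixed at 1.0 means that multiplying another series by this one in PromQL leaves the other series' value unchanged while attaching the path label.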

This approach follows Prometheus best practices (similar to node_uname_info from node_exporter) and enables users to:

  1. Store checkpoint paths directly in Prometheus/VictoriaMetrics
  2. Join checkpoint paths with other checkpoint metrics via PromQL
  3. Create monitoring dashboards and alerts based on checkpoint paths

Brief change log

  • Added CHECKPOINT_PATH_METRIC_NAME constant to identify the checkpoint path metric
  • Modified createCollector() method in AbstractPrometheusReporter to detect and handle checkpoint path metrics specially
  • Added CheckpointPathInfoCollector inner class to export checkpoint path as an info-style metric
    • Appends _info suffix to the metric name
    • Stores checkpoint path in a path label
    • Sets metric value to 1.0
    • Handles null and empty path values gracefully
  • Added comprehensive unit tests in CheckpointPathInfoCollectorTest with 4 test cases
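
The behavior listed for CheckpointPathInfoCollector can be sketched roughly as follows. This is a simplified stand-in rather than the class added by this PR: it uses a java.util.function.Supplier in place of Flink's string-valued gauge and a small hypothetical Sample class in place of the Prometheus client's sample types, so that it compiles without extra dependencies.

```java
import java.util.Collections;
import java.util.List;
import java.util.function.Supplier;

// Hypothetical stand-in for the PR's CheckpointPathInfoCollector; not the actual Flink code.
final class CheckpointPathCollectorSketch {

    /** One exported sample: metric name, the path label value, and the sample value. */
    static final class Sample {
        final String name;
        final String pathLabel;
        final double value;

        Sample(String name, String pathLabel, double value) {
            this.name = name;
            this.pathLabel = pathLabel;
            this.value = value;
        }
    }

    private final String metricName;
    private final Supplier<String> pathGauge; // stands in for a string-valued Flink gauge

    CheckpointPathCollectorSketch(String metricName, Supplier<String> pathGauge) {
        this.metricName = metricName;
        this.pathGauge = pathGauge;
    }

    /** Returns no samples while the path is null or empty, otherwise one info sample. */
    List<Sample> collect() {
        String path = pathGauge.get();
        if (path == null || path.isEmpty()) {
            return Collections.emptyList();
        }
        return Collections.singletonList(new Sample(metricName + "_info", path, 1.0));
    }

    public static void main(String[] args) {
        CheckpointPathCollectorSketch collector =
                new CheckpointPathCollectorSketch("lastCheckpointExternalPath", () -> "hdfs://nn/chk-7");
        Sample sample = collector.collect().get(0);
        System.out.println(sample.name + " path=" + sample.pathLabel + " value=" + sample.value);
    }
}
```

Returning an empty list for a missing path means that nothing is exported until the first checkpoint path is available, rather than a sample with an empty path label.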

Verifying this change

This change added tests and can be verified as follows:

Unit Tests:

  • Added CheckpointPathInfoCollectorTest with 4 test cases:
    • testCheckpointPathExportedAsInfoMetric: Verifies checkpoint path is correctly exported as an info metric with path in label
    • testNullCheckpointPathReturnsEmptyList: Verifies null path values are handled correctly (returns empty list)
    • testEmptyCheckpointPathReturnsEmptyList: Verifies empty string path values are handled correctly
    • testCheckpointPathWithSpecialCharacters: Verifies special characters in paths (e.g., S3 URLs with query parameters) are preserved correctly

Integration Verification: All Prometheus reporter tests pass (27/27), including the 4 new tests:

  • PrometheusReporterTest: 14 tests
  • PrometheusReporterTaskScopeTest: 5 tests
  • PrometheusPushGatewayReporterTest: 4 tests
  • CheckpointPathInfoCollectorTest: 4 tests (new)

Manual Verification: The change can be manually verified by:

  1. Starting a Flink cluster with Prometheus reporter enabled
  2. Running a job with checkpointing enabled
  3. Querying the Prometheus metrics endpoint (e.g., curl http://localhost:9249/metrics)
  4. Verifying the output contains:
    flink_jobmanager_job_lastCheckpointExternalPath_info{job_id="...",job_name="...",path="hdfs://..."} 1.0
    
  5. Using PromQL to join with other metrics:
    flink_jobmanager_job_lastCheckpointSize 
      * on(job_id) group_left(path) 
      flink_jobmanager_job_lastCheckpointExternalPath_info
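
The check in step 4 could also be scripted. The sketch below takes the text returned by the metrics endpoint and extracts the path label from the info metric line. CheckpointPathScrapeSketch and its methods are hypothetical helpers, not part of this PR, and the metric name assumes the default scope format shown in step 4.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical verification helper; assumes the metric name shown in step 4.
final class CheckpointPathScrapeSketch {

    private static final String METRIC = "flink_jobmanager_job_lastCheckpointExternalPath_info";

    // Matches path="..." at a label boundary, allowing backslash-escaped characters inside.
    private static final Pattern PATH_LABEL =
            Pattern.compile("[{,]path=\"((?:[^\"\\\\]|\\\\.)*)\"");

    /** Returns the path label of the first info metric line in the scraped text, if any. */
    static Optional<String> extractPath(String metricsBody) {
        for (String line : metricsBody.split("\n")) {
            if (!line.startsWith(METRIC + "{")) {
                continue;
            }
            Matcher matcher = PATH_LABEL.matcher(line);
            if (matcher.find()) {
                return Optional.of(unescape(matcher.group(1)));
            }
        }
        return Optional.empty();
    }

    /** Reverses the exposition-format escaping of backslash, double quote and newline. */
    static String unescape(String escaped) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < escaped.length(); i++) {
            char c = escaped.charAt(i);
            if (c == '\\' && i + 1 < escaped.length()) {
                char next = escaped.charAt(++i);
                out.append(next == 'n' ? '\n' : next);
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String body = METRIC + "{job_id=\"abc\",path=\"hdfs://nn/chk-1\"} 1.0\n"; // made-up sample
        System.out.println(extractPath(body).orElse("<no checkpoint path exported>"));
    }
}
```

An empty result matches the null and empty path handling described in the change log: no info line is expected before the first checkpoint path exists.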
    

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): no
  • The public API, i.e., is any changed class annotated with @Public(Evolving): no
  • The serializers: no
  • The runtime per-record code paths (performance sensitive): no
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: no (only affects metric reporting, not checkpoint functionality)
  • The S3 file system connector: no

Documentation

  • Does this pull request introduce a new feature? yes
  • If yes, how is the feature documented? JavaDocs

Documentation Details:

  • Comprehensive JavaDoc added to CheckpointPathInfoCollector class explaining:
    • Purpose: Export checkpoint path as Prometheus info-style metric
    • Behavior: Path stored in label, value always 1.0
    • Example output format
  • Inline code comments explaining the special handling logic
  • Unit test documentation demonstrating usage patterns

Additional Documentation (if requested): If the community requests it, I can extend docs/content/docs/deployment/metric_reporters.md to explain:

  • The info-style metric format for checkpoint paths
  • PromQL query examples for joining with other metrics
  • Use cases for monitoring and alerting

sohurdc • Oct 29 '25 08:10

CI report:

  • aaea4c7f1699015c2bb855b0ee2143404fa76c78 Azure: FAILURE

Bot commands

The @flinkbot bot supports the following commands:

  • @flinkbot run azure: re-run the last Azure build

flinkbot • Oct 29 '25 08:10