[FLINK-38584][metrics] Support checkpoint path as Prometheus info-style metric
What is the purpose of the change
This pull request enhances the Prometheus reporter to export the lastCheckpointExternalPath metric as an info-style metric, making it compatible with Prometheus and VictoriaMetrics.
Current Problem:
- The lastCheckpointExternalPath metric is currently exported as a string-valued Gauge
- Prometheus and VictoriaMetrics only support numeric values, making it impossible to store checkpoint paths
- Users must use additional storage systems (e.g., InfluxDB) to track checkpoint paths, increasing operational complexity
Solution:
- Export lastCheckpointExternalPath as a Prometheus info-style metric with an _info suffix
- Store the checkpoint path in a path label instead of as a metric value
- Set the metric value to 1.0 (following Prometheus convention for info metrics)
This approach follows Prometheus best practices (similar to node_uname_info from node_exporter) and enables users to:
- Store checkpoint paths directly in Prometheus/VictoriaMetrics
- Join checkpoint paths with other checkpoint metrics via PromQL
- Create monitoring dashboards and alerts based on checkpoint paths
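As a rough illustration of the resulting output shape, the sketch below renders one info-style sample line using only the JDK. It is not the PR's CheckpointPathInfoCollector: the class InfoMetricSketch, its methods, and the sample job_id and path values are hypothetical. It assumes the usual text exposition escaping for label values (backslash, double quote, newline) and returns nothing for null or empty paths.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical helper for illustration; not the collector added by this PR. */
final class InfoMetricSketch {

    /** Escapes backslash, double quote, and newline in a label value. */
    static String escapeLabelValue(String value) {
        // Backslashes must be escaped first so later escapes are not doubled.
        return value.replace("\\", "\\\\").replace("\"", "\\\"").replace("\n", "\\n");
    }

    /** Renders one info-style sample line, or empty when there is no path to report. */
    static Optional<String> render(String baseName, Map<String, String> labels, String path) {
        if (path == null || path.isEmpty()) {
            return Optional.empty();
        }
        // Copy the existing labels and carry the path as an extra label.
        Map<String, String> all = new LinkedHashMap<>(labels);
        all.put("path", path);

        StringBuilder sb = new StringBuilder(baseName).append("_info{");
        boolean first = true;
        for (Map.Entry<String, String> e : all.entrySet()) {
            if (!first) {
                sb.append(',');
            }
            sb.append(e.getKey()).append("=\"").append(escapeLabelValue(e.getValue())).append('"');
            first = false;
        }
        // The value is a constant 1.0; the information lives in the labels.
        return Optional.of(sb.append("} 1.0").toString());
    }

    public static void main(String[] args) {
        Map<String, String> labels = new LinkedHashMap<>();
        labels.put("job_id", "abc"); // made-up job id
        render("flink_jobmanager_job_lastCheckpointExternalPath", labels, "hdfs://nn/cp/chk-7")
                .ifPresent(System.out::println);
    }
}
```

Keeping the value constant means the series can be multiplied into other metrics without changing their values, which is what makes the label join practical.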
Brief change log
- Added CHECKPOINT_PATH_METRIC_NAME constant to identify the checkpoint path metric
- Modified createCollector() method in AbstractPrometheusReporter to detect and handle checkpoint path metrics specially
- Added CheckpointPathInfoCollector inner class to export checkpoint path as an info-style metric
  - Appends _info suffix to the metric name
  - Stores checkpoint path in a path label
  - Sets metric value to 1.0
  - Handles null and empty path values gracefully
- Added comprehensive unit tests in CheckpointPathInfoCollectorTest with 4 test cases
Verifying this change
This change added tests and can be verified as follows:
Unit Tests:
- Added CheckpointPathInfoCollectorTest with 4 test cases:
  - testCheckpointPathExportedAsInfoMetric: Verifies checkpoint path is correctly exported as an info metric with path in label
  - testNullCheckpointPathReturnsEmptyList: Verifies null path values are handled correctly (returns empty list)
  - testEmptyCheckpointPathReturnsEmptyList: Verifies empty string path values are handled correctly
  - testCheckpointPathWithSpecialCharacters: Verifies special characters in paths (e.g., S3 URLs with query parameters) are preserved correctly
Integration Verification: All Prometheus reporter tests pass, including the new ones (27/27 tests):
- PrometheusReporterTest: 14 tests
- PrometheusReporterTaskScopeTest: 5 tests
- PrometheusPushGatewayReporterTest: 4 tests
- CheckpointPathInfoCollectorTest: 4 tests (new)
Manual Verification: The change can be manually verified by:
- Starting a Flink cluster with Prometheus reporter enabled
- Running a job with checkpointing enabled
- Querying the Prometheus metrics endpoint (e.g., curl http://localhost:9249/metrics)
- Verifying the output contains:
  flink_jobmanager_job_lastCheckpointExternalPath_info{job_id="...",job_name="...",path="hdfs://..."} 1.0
- Using PromQL to join with other metrics:
  flink_jobmanager_job_lastCheckpointSize * on(job_id) group_left(path) flink_jobmanager_job_lastCheckpointExternalPath_info
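To make the join in the last step concrete, here is a small in-memory model of what `* on(job_id) group_left(path)` does. It is only a sketch, not Prometheus code: GroupLeftSketch, Sample, and the sample values are hypothetical, and it assumes at most one info series per job_id. Each left-hand sample is matched on job_id, multiplied by the info value (1.0, so the size is unchanged), and gains the path label.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical in-memory model of the PromQL join; for illustration only. */
final class GroupLeftSketch {

    /** One sample: a label set plus a numeric value. */
    static final class Sample {
        final Map<String, String> labels;
        final double value;

        Sample(Map<String, String> labels, double value) {
            this.labels = labels;
            this.value = value;
        }
    }

    /** Builds a label map from alternating key/value arguments. */
    static Map<String, String> labels(String... kv) {
        Map<String, String> m = new LinkedHashMap<>();
        for (int i = 0; i + 1 < kv.length; i += 2) {
            m.put(kv[i], kv[i + 1]);
        }
        return m;
    }

    /** Mimics: left * on(job_id) group_left(path) info */
    static List<Sample> joinOnJobId(List<Sample> left, List<Sample> info) {
        // Index info samples by job_id (assumes at most one per job).
        Map<String, Sample> infoByJob = new LinkedHashMap<>();
        for (Sample s : info) {
            infoByJob.put(s.labels.get("job_id"), s);
        }
        List<Sample> out = new ArrayList<>();
        for (Sample l : left) {
            Sample match = infoByJob.get(l.labels.get("job_id"));
            if (match == null) {
                continue; // samples without a matching info series are dropped
            }
            Map<String, String> joined = new LinkedHashMap<>(l.labels);
            joined.put("path", match.labels.get("path")); // label copied by group_left(path)
            out.add(new Sample(joined, l.value * match.value)); // info value is 1.0
        }
        return out;
    }

    public static void main(String[] args) {
        List<Sample> size = new ArrayList<>();
        size.add(new Sample(labels("job_id", "a"), 2048.0));
        List<Sample> info = new ArrayList<>();
        info.add(new Sample(labels("job_id", "a", "path", "hdfs://nn/cp/chk-7"), 1.0));
        for (Sample s : joinOnJobId(size, info)) {
            System.out.println(s.labels + " " + s.value);
        }
    }
}
```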
Does this pull request potentially affect one of the following parts:
- Dependencies (does it add or upgrade a dependency): no
- The public API, i.e., is any changed class annotated with @Public(Evolving): no
- The serializers: no
- The runtime per-record code paths (performance sensitive): no
- Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: no (only affects metric reporting, not checkpoint functionality)
- The S3 file system connector: no
Documentation
- Does this pull request introduce a new feature? yes
- If yes, how is the feature documented? JavaDocs
Documentation Details:
- Comprehensive JavaDoc added to CheckpointPathInfoCollector class explaining:
  - Purpose: Export checkpoint path as Prometheus info-style metric
  - Behavior: Path stored in label, value always 1.0
  - Example output format
- Inline code comments explaining the special handling logic
- Unit test documentation demonstrating usage patterns
Additional Documentation (if requested):
If the community requires, I can add documentation to docs/content/docs/deployment/metric_reporters.md explaining:
- The info-style metric format for checkpoint paths
- PromQL query examples for joining with other metrics
- Use cases for monitoring and alerting
CI report:
- aaea4c7f1699015c2bb855b0ee2143404fa76c78 Azure: FAILURE