[#8912] feat(iceberg-rest): Expose Iceberg client metrics through Gravitino MetricsSystem
What changes were proposed in this pull request?
This PR adds observability for Iceberg client operations by bridging Iceberg's metrics reporting to Gravitino's MetricsSystem.
Key Changes:
- IcebergClientMetricsSource: a new metrics source under the iceberg-client namespace, kept separate from the iceberg-rest-server HTTP metrics.
- IcebergRestMetricsStore: implements MetricsStore to parse and record Iceberg commit and scan metrics using Iceberg's public APIs.
- Configuration: enabled with metricsStore = rest.
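A minimal sketch of the commit-report half of this flow, under simplifying assumptions. CommitSample and ClientMetricsRegistry are hypothetical stand-ins, not the PR's IcebergRestMetricsStore or Gravitino's MetricsSystem API. The real store reads Iceberg's report classes (such as CommitReport), not plain Long fields. Each report increments a report counter, adds its per-commit values to cumulative counters, and records one duration sample. All keys use the iceberg-client.iceberg.* names that appear in the verification output.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical, simplified stand-in for the values carried by one Iceberg commit report.
// Fields are nullable because a client may omit a metric it did not measure.
final class CommitSample {
  final Long addedDataFiles;
  final Long addedFilesSizeBytes;
  final Long addedRecords;
  final Long attempts;
  final Long totalDurationMillis;

  CommitSample(Long addedDataFiles, Long addedFilesSizeBytes, Long addedRecords,
      Long attempts, Long totalDurationMillis) {
    this.addedDataFiles = addedDataFiles;
    this.addedFilesSizeBytes = addedFilesSizeBytes;
    this.addedRecords = addedRecords;
    this.attempts = attempts;
    this.totalDurationMillis = totalDurationMillis;
  }
}

// Hypothetical registry: cumulative counters and raw histogram samples,
// keyed by "iceberg-client.iceberg.<metric name>".
final class ClientMetricsRegistry {
  private static final String PREFIX = "iceberg-client.iceberg.";
  private final Map<String, Long> counters = new TreeMap<>();
  private final Map<String, List<Long>> histograms = new TreeMap<>();

  void inc(String name, Long delta) {
    if (delta == null) {
      return; // metric absent from the report: leave the counter untouched
    }
    counters.merge(PREFIX + name, delta, Long::sum);
  }

  void observe(String name, Long value) {
    if (value == null) {
      return;
    }
    histograms.computeIfAbsent(PREFIX + name, k -> new ArrayList<>()).add(value);
  }

  // Called once per commit report received on the metrics endpoint.
  void recordCommit(CommitSample s) {
    inc("reports.commit", 1L);
    inc("added-data-files", s.addedDataFiles);
    inc("added-files-size-bytes", s.addedFilesSizeBytes);
    inc("added-records", s.addedRecords);
    inc("attempts", s.attempts);
    observe("total-duration", s.totalDurationMillis);
  }

  long counter(String name) {
    return counters.getOrDefault(PREFIX + name, 0L);
  }

  int histogramCount(String name) {
    return histograms.getOrDefault(PREFIX + name, List.of()).size();
  }
}

final class CommitMetricsSketchDemo {
  public static void main(String[] args) {
    ClientMetricsRegistry registry = new ClientMetricsRegistry();
    registry.recordCommit(new CommitSample(1L, 960L, 1L, 1L, 12L));
    registry.recordCommit(new CommitSample(null, null, null, 1L, 8L));
    System.out.println(registry.counter("reports.commit"));          // 2
    System.out.println(registry.counter("added-data-files"));        // 1
    System.out.println(registry.histogramCount("total-duration"));   // 2
  }
}
```

Recording one histogram sample per report makes a histogram's count track the number of reports that carried that timer. This is consistent with the verification output, where total-duration has count 3 alongside reports.commit = 3, and total-planning-duration has count 9 alongside reports.scan = 9.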
Why are the changes needed?
Metrics sent to /v1/{prefix}/namespaces/{namespace}/tables/{table}/metrics are silently dropped when the dummy store is used. This PR enables monitoring of:
- Iceberg table operations (commits, scans)
- Data file operations (added/removed files, sizes)
- Query performance metrics sent through the metrics API
Fix: #(issue)
Does this PR introduce any user-facing change?
Yes, new configuration and metrics:
# Server configuration
gravitino.iceberg-rest.metricsStore = rest
# Client configuration (Spark)
spark.sql.catalog.<catalog-name>.rest-metrics-impl = org.apache.iceberg.rest.RESTMetricsReporter
Exposed metrics (under the iceberg-client namespace): commit and scan report counts, data files added/removed, file sizes, scan/commit durations, and 27+ additional metrics.
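A minimal sketch of the store selection this configuration implies, using a hypothetical ReportStore interface and factory rather than the PR's actual interface. The value dummy maps to a store that accepts reports and discards them, which is the silent drop described earlier. The value rest maps to a store that keeps them. Treating an unset value as dummy and matching case-insensitively are both assumptions of this sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical store interface; Gravitino's real interface and method names may differ.
interface ReportStore {
  void record(String report);
}

// Mirrors the "dummy" behaviour: reports are accepted and thrown away.
final class DroppingStore implements ReportStore {
  public void record(String report) {
    // intentionally empty
  }
}

// Stand-in for a recording store such as the one enabled by metricsStore = rest.
final class RecordingStore implements ReportStore {
  final List<String> received = new ArrayList<>();

  public void record(String report) {
    received.add(report);
  }
}

final class ReportStores {
  // Assumption: an unset value falls back to "dummy"; unknown values fail fast.
  static ReportStore fromConfig(String metricsStore) {
    String v = metricsStore == null ? "dummy" : metricsStore.trim().toLowerCase(Locale.ROOT);
    switch (v) {
      case "rest":
        return new RecordingStore();
      case "dummy":
        return new DroppingStore();
      default:
        throw new IllegalArgumentException("Unknown metricsStore: " + metricsStore);
    }
  }
}
```

Failing fast on an unrecognized value makes a typo in the server configuration visible at startup. Otherwise it would quietly fall back to dropping reports.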
How was this patch tested?
- Unit tests:
./gradlew :iceberg:iceberg-rest-server:test --tests TestIcebergRestMetricsStore
- Production verification: deployed to K8s with a Spark SQL workload and confirmed that 32 metrics are tracked correctly:
curl -s http://localhost:9001/metrics | jq '.histograms | with_entries(select(.key | startswith("iceberg-client")))'
{
  "iceberg-client.iceberg.total-duration": {
    "count": 3,
    "max": 0,
    "mean": 0,
    "min": 0,
    "p50": 0,
    "p75": 0,
    "p95": 0,
    "p98": 0,
    "p99": 0,
    "p999": 0,
    "stddev": 0
  },
  "iceberg-client.iceberg.total-planning-duration": {
    "count": 9,
    "max": 0,
    "mean": 0,
    "min": 0,
    "p50": 0,
    "p75": 0,
    "p95": 0,
    "p98": 0,
    "p99": 0,
    "p999": 0,
    "stddev": 0
  }
}
curl -s http://localhost:9001/metrics | jq '.counters | with_entries(select(.key | startswith("iceberg-client")))'
{
  "iceberg-client.iceberg.added-data-files": {
    "count": 1
  },
  "iceberg-client.iceberg.added-files-size-bytes": {
    "count": 960
  },
  "iceberg-client.iceberg.added-records": {
    "count": 1
  },
  "iceberg-client.iceberg.attempts": {
    "count": 3
  },
  "iceberg-client.iceberg.dvs": {
    "count": 0
  },
  "iceberg-client.iceberg.equality-delete-files": {
    "count": 0
  },
  "iceberg-client.iceberg.indexed-delete-files": {
    "count": 0
  },
  "iceberg-client.iceberg.positional-delete-files": {
    "count": 0
  },
  "iceberg-client.iceberg.removed-data-files": {
    "count": 1
  },
  "iceberg-client.iceberg.removed-files-size-bytes": {
    "count": 923
  },
  "iceberg-client.iceberg.removed-records": {
    "count": 1
  },
  "iceberg-client.iceberg.reports.commit": {
    "count": 3
  },
  "iceberg-client.iceberg.reports.scan": {
    "count": 9
  },
  "iceberg-client.iceberg.result-data-files": {
    "count": 5
  },
  "iceberg-client.iceberg.result-delete-files": {
    "count": 0
  },
  "iceberg-client.iceberg.scanned-data-manifests": {
    "count": 5
  },
  "iceberg-client.iceberg.scanned-delete-manifests": {
    "count": 0
  },
  "iceberg-client.iceberg.skipped-data-files": {
    "count": 0
  },
  "iceberg-client.iceberg.skipped-data-manifests": {
    "count": 2
  },
  "iceberg-client.iceberg.skipped-delete-files": {
    "count": 0
  },
  "iceberg-client.iceberg.skipped-delete-manifests": {
    "count": 0
  },
  "iceberg-client.iceberg.total-data-files": {
    "count": 1
  },
  "iceberg-client.iceberg.total-data-manifests": {
    "count": 7
  },
  "iceberg-client.iceberg.total-delete-file-size-in-bytes": {
    "count": 0
  },
  "iceberg-client.iceberg.total-delete-files": {
    "count": 0
  },
  "iceberg-client.iceberg.total-delete-manifests": {
    "count": 0
  },
  "iceberg-client.iceberg.total-equality-deletes": {
    "count": 0
  },
  "iceberg-client.iceberg.total-file-size-in-bytes": {
    "count": 4615
  },
  "iceberg-client.iceberg.total-files-size-bytes": {
    "count": 960
  },
  "iceberg-client.iceberg.total-positional-deletes": {
    "count": 0
  },
  "iceberg-client.iceberg.total-records": {
    "count": 1
  }
}
@FANNG1, could you please help review this?
@bharos, thanks for the PR. The current implementation exports Iceberg client metrics through IRC, which may drop detailed information. Have you considered an alternative that pushes the IRC metrics to a Prometheus Pushgateway?
Is a push gateway necessary, or would tagged metrics be enough, with labels such as table_name?
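A sketch of what tagged metrics could look like. TaggedCounters is a hypothetical class, not a Gravitino or Dropwizard API. Series keys follow the Prometheus-style name{label="value"} form, and the label names namespace and table_name are illustrative. Each distinct label set is its own series, so per-table values are preserved. Summing across label sets recovers the flat totals that the current iceberg-client counters report.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical tagged counter store: one series per (metric name, label set).
final class TaggedCounters {
  private final Map<String, Long> series = new ConcurrentHashMap<>();

  // Series identity: the name plus labels sorted by key, so label order does not matter.
  static String seriesKey(String name, Map<String, String> labels) {
    StringBuilder sb = new StringBuilder(name).append('{');
    boolean first = true;
    for (Map.Entry<String, String> e : new TreeMap<>(labels).entrySet()) {
      if (!first) {
        sb.append(',');
      }
      sb.append(e.getKey()).append("=\"").append(e.getValue()).append('"');
      first = false;
    }
    return sb.append('}').toString();
  }

  void inc(String name, Map<String, String> labels, long delta) {
    series.merge(seriesKey(name, labels), delta, Long::sum);
  }

  long get(String name, Map<String, String> labels) {
    return series.getOrDefault(seriesKey(name, labels), 0L);
  }

  // Summing every label set of one metric yields the untagged (flat) total.
  long total(String name) {
    long sum = 0;
    for (Map.Entry<String, Long> e : series.entrySet()) {
      if (e.getKey().startsWith(name + "{")) {
        sum += e.getValue();
      }
    }
    return sum;
  }
}
```

One trade-off to weigh: each distinct table adds a series for every metric, so label cardinality grows with the number of tables reporting.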