
[#8912] feat(iceberg-rest): Expose Iceberg client metrics through Gravitino MetricsSystem

Open bharos opened this issue 2 months ago • 3 comments

What changes were proposed in this pull request?

This PR adds observability for Iceberg client operations by bridging Iceberg's metrics reporting to Gravitino's MetricsSystem.

Key Changes:

  • IcebergClientMetricsSource: New metrics source with iceberg-client namespace (separate from iceberg-rest-server HTTP metrics)
  • IcebergRestMetricsStore: Implements MetricsStore to parse and record Iceberg commit/scan metrics using Iceberg's public APIs
  • Configuration: Enable with metricsStore = rest
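To illustrate the parse-and-record idea, here is a minimal Python sketch. It is not the Java code in this PR: the class name SketchMetricsStore is hypothetical, and the report layout (a report-type field plus a metrics map of counter results with a value and timer results with a total-duration) is an assumption based on the Iceberg REST metrics payload. Only the key prefix is taken from the sample output in this description.

```python
from collections import defaultdict

# Key prefix copied from the sample output in this PR description.
PREFIX = "iceberg-client.iceberg."


class SketchMetricsStore:
    """Hypothetical stand-in for a metrics store; not Gravitino's Java class."""

    def __init__(self):
        self.counters = defaultdict(int)    # metric key -> running total
        self.durations = defaultdict(list)  # metric key -> recorded durations

    def record(self, report):
        # "commit-report" -> "commit", "scan-report" -> "scan"
        kind = report["report-type"].replace("-report", "")
        self.counters[PREFIX + "reports." + kind] += 1
        for name, result in report.get("metrics", {}).items():
            if "total-duration" in result:
                # Timer-style result: keep the duration for a histogram.
                self.durations[PREFIX + name].append(result["total-duration"])
            elif "value" in result:
                # Counter-style result: add to the running total.
                self.counters[PREFIX + name] += result["value"]


store = SketchMetricsStore()
store.record({
    "report-type": "commit-report",
    "metrics": {
        "added-data-files": {"unit": "count", "value": 1},
        "added-files-size-bytes": {"unit": "bytes", "value": 960},
        "total-duration": {"count": 1, "time-unit": "nanoseconds", "total-duration": 1200},
    },
})
print(dict(store.counters))
```

Each incoming report bumps a per-type report counter and then folds its individual metrics into counters or duration samples, which is the shape of the counters and histograms shown in the verification output.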

Why are the changes needed?

Metrics sent to /v1/{prefix}/namespaces/{namespace}/tables/{table}/metrics are silently dropped when the dummy store is used. This PR enables monitoring of:
  • Iceberg table operations (commits, scans)
  • Data file operations (added/removed files, sizes)
  • Query performance metrics sent through the metrics API

Fix: #(issue)

Does this PR introduce any user-facing change?

Yes, new configuration and metrics:

# Server configuration
gravitino.iceberg-rest.metricsStore = rest
# Client configuration (Spark)
spark.sql.catalog.<catalog-name>.rest-metrics-impl = org.apache.iceberg.rest.RESTMetricsReporter

Exposed metrics (under iceberg-client namespace): commit reports, scan reports, data files added/removed, file sizes, scan/commit durations, and 27+ additional metrics.

How was this patch tested?

  • Unit tests:
./gradlew :iceberg:iceberg-rest-server:test --tests TestIcebergRestMetricsStore
  • Production verification: Deployed to K8s with Spark SQL workload, confirmed 32 metrics tracked correctly
 curl -s http://localhost:9001/metrics | jq '.histograms | with_entries(select(.key | startswith("iceberg-client")))'
{
  "iceberg-client.iceberg.total-duration": {
    "count": 3,
    "max": 0,
    "mean": 0,
    "min": 0,
    "p50": 0,
    "p75": 0,
    "p95": 0,
    "p98": 0,
    "p99": 0,
    "p999": 0,
    "stddev": 0
  },
  "iceberg-client.iceberg.total-planning-duration": {
    "count": 9,
    "max": 0,
    "mean": 0,
    "min": 0,
    "p50": 0,
    "p75": 0,
    "p95": 0,
    "p98": 0,
    "p99": 0,
    "p999": 0,
    "stddev": 0
  }
}
curl -s http://localhost:9001/metrics | jq '.counters | with_entries(select(.key | startswith("iceberg-client")))'
{
  "iceberg-client.iceberg.added-data-files": {
    "count": 1
  },
  "iceberg-client.iceberg.added-files-size-bytes": {
    "count": 960
  },
  "iceberg-client.iceberg.added-records": {
    "count": 1
  },
  "iceberg-client.iceberg.attempts": {
    "count": 3
  },
  "iceberg-client.iceberg.dvs": {
    "count": 0
  },
  "iceberg-client.iceberg.equality-delete-files": {
    "count": 0
  },
  "iceberg-client.iceberg.indexed-delete-files": {
    "count": 0
  },
  "iceberg-client.iceberg.positional-delete-files": {
    "count": 0
  },
  "iceberg-client.iceberg.removed-data-files": {
    "count": 1
  },
  "iceberg-client.iceberg.removed-files-size-bytes": {
    "count": 923
  },
  "iceberg-client.iceberg.removed-records": {
    "count": 1
  },
  "iceberg-client.iceberg.reports.commit": {
    "count": 3
  },
  "iceberg-client.iceberg.reports.scan": {
    "count": 9
  },
  "iceberg-client.iceberg.result-data-files": {
    "count": 5
  },
  "iceberg-client.iceberg.result-delete-files": {
    "count": 0
  },
  "iceberg-client.iceberg.scanned-data-manifests": {
    "count": 5
  },
  "iceberg-client.iceberg.scanned-delete-manifests": {
    "count": 0
  },
  "iceberg-client.iceberg.skipped-data-files": {
    "count": 0
  },
  "iceberg-client.iceberg.skipped-data-manifests": {
    "count": 2
  },
  "iceberg-client.iceberg.skipped-delete-files": {
    "count": 0
  },
  "iceberg-client.iceberg.skipped-delete-manifests": {
    "count": 0
  },
  "iceberg-client.iceberg.total-data-files": {
    "count": 1
  },
  "iceberg-client.iceberg.total-data-manifests": {
    "count": 7
  },
  "iceberg-client.iceberg.total-delete-file-size-in-bytes": {
    "count": 0
  },
  "iceberg-client.iceberg.total-delete-files": {
    "count": 0
  },
  "iceberg-client.iceberg.total-delete-manifests": {
    "count": 0
  },
  "iceberg-client.iceberg.total-equality-deletes": {
    "count": 0
  },
  "iceberg-client.iceberg.total-file-size-in-bytes": {
    "count": 4615
  },
  "iceberg-client.iceberg.total-files-size-bytes": {
    "count": 960
  },
  "iceberg-client.iceberg.total-positional-deletes": {
    "count": 0
  },
  "iceberg-client.iceberg.total-records": {
    "count": 1
  }
}
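
The same check can be scripted instead of piping curl into jq. The following Python sketch mirrors the jq filter. The function names fetch_metrics and select_by_prefix and the some-other-source.requests entry are hypothetical; only the URL and the iceberg-client keys come from the commands and output above.

```python
import json
import urllib.request


def fetch_metrics(url="http://localhost:9001/metrics"):
    """Fetch and parse the JSON metrics document (URL taken from the curl call)."""
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)


def select_by_prefix(section, prefix="iceberg-client"):
    """Keep only entries whose key starts with prefix, like the jq filter."""
    return {k: v for k, v in section.items() if k.startswith(prefix)}


# Small offline sample shaped like the output above; no server is needed.
sample = {
    "counters": {
        "iceberg-client.iceberg.reports.commit": {"count": 3},
        "iceberg-client.iceberg.reports.scan": {"count": 9},
        "some-other-source.requests": {"count": 42},  # hypothetical unrelated metric
    }
}
selected = select_by_prefix(sample["counters"])
print(sorted(selected))
```

Against a running server, select_by_prefix(fetch_metrics()["counters"]) would return the same subset as the second curl command.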

bharos avatar Oct 25 '25 00:10 bharos

@FANNG1 can you please help review this?

jerryshao avatar Oct 29 '25 09:10 jerryshao

@bharos, thanks for the PR. The current implementation exports Iceberg client metrics through IRC, which may drop detailed information. Have you considered another solution, pushing the IRC metrics to a Prometheus push gateway?

FANNG1 avatar Oct 31 '25 01:10 FANNG1

Thanks for the PR. The current implementation exports Iceberg client metrics through IRC, which may drop detailed information. Have you considered another solution, pushing the IRC metrics to a Prometheus push gateway?

Is a push gateway necessary? Or is it enough to use tagged metrics, with labels including table_name, etc.?
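
A rough Python sketch of what tagged metrics could look like, purely as an illustration of the idea and not existing Gravitino code. The helper names inc and render and the table names db.orders and db.users are hypothetical; the output imitates the Prometheus text format, where metric names cannot contain '-' or '.'.

```python
import re
from collections import defaultdict

# (metric name, sorted label pairs) -> counter value
counters = defaultdict(int)


def inc(name, amount=1, **labels):
    """Increment the counter identified by its name plus its label set."""
    counters[(name, tuple(sorted(labels.items())))] += amount


def render():
    """Render all counters as Prometheus-style text lines."""
    lines = []
    for (name, labels), value in sorted(counters.items()):
        # Replace characters that are not valid in a Prometheus metric name.
        safe = re.sub(r"[^a-zA-Z0-9_:]", "_", name)
        label_str = ",".join(f'{k}="{v}"' for k, v in labels)
        lines.append(f"{safe}{{{label_str}}} {value}")
    return "\n".join(lines)


inc("iceberg-client.iceberg.reports.commit", table_name="db.orders")
inc("iceberg-client.iceberg.reports.commit", table_name="db.orders")
inc("iceberg-client.iceberg.reports.commit", table_name="db.users")
print(render())
```

With labels, the per-table detail stays attached to each series, so one metric name can cover every table without a separate push path.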

bharos avatar Nov 05 '25 22:11 bharos