Autoscaler is reporting stale metrics for services/revisions

Open norman465 opened this issue 2 months ago • 3 comments

Since the metrics backend was switched, the autoscaler has been reporting stale metrics for deleted services/revisions. We run with:

  metrics-protocol: prometheus
  request-metrics-protocol: prometheus

What version of Knative?

v1.19.6

Expected Behavior

Metrics of deleted revisions should no longer be reported.

Actual Behavior

Metrics of deleted revisions are reported with their last value.

Steps to Reproduce the Problem

Create a revision and delete it. The services/revisions below no longer exist, yet metrics such as kn_revision_pods_count and kn_revision_pods_requested are still reported for them. All of these series are stale:

kn_revision_pods_count{k8s_namespace_name="21uj92vxm973",kn_configuration_name="my-app",kn_revision_name="my-app-00001",kn_service_name="my-app",otel_scope_name="knative.dev/serving/pkg/autoscaler",otel_scope_schema_url="",otel_scope_version=""} 1
kn_revision_pods_count{k8s_namespace_name="21ujeqd9uia7",kn_configuration_name="my-app",kn_revision_name="my-app-00001",kn_service_name="my-app",otel_scope_name="knative.dev/serving/pkg/autoscaler",otel_scope_schema_url="",otel_scope_version=""} 1
kn_revision_pods_count{k8s_namespace_name="21ujiqdz3bpr",kn_configuration_name="my-app",kn_revision_name="my-app-00001",kn_service_name="my-app",otel_scope_name="knative.dev/serving/pkg/autoscaler",otel_scope_schema_url="",otel_scope_version=""} 1
kn_revision_pods_count{k8s_namespace_name="21uklm90kaa7",kn_configuration_name="my-app",kn_revision_name="my-app-00001",kn_service_name="my-app",otel_scope_name="knative.dev/serving/pkg/autoscaler",otel_scope_schema_url="",otel_scope_version=""} 1
kn_revision_pods_requested{k8s_namespace_name="21uh7qgrjzf3",kn_configuration_name="my-app",kn_revision_name="my-app-00001",kn_service_name="my-app",otel_scope_name="knative.dev/serving/pkg/autoscaler",otel_scope_schema_url="",otel_scope_version=""} 1
kn_revision_pods_requested{k8s_namespace_name="21uhb52o4rnz",kn_configuration_name="my-app",kn_revision_name="my-app-00001",kn_service_name="my-app",otel_scope_name="knative.dev/serving/pkg/autoscaler",otel_scope_schema_url="",otel_scope_version=""} 1

norman465, Oct 23 '25 12:10

FYI - I marked this 1.21, but we'll want to cherry-pick it into the 1.20 & 1.19 release branches.

Also asked the OTel folks what's the best practice here to purge older metrics - https://cloud-native.slack.com/archives/C01NPAXACKT/p1761226563692549

dprotaso, Oct 23 '25 13:10

What's timely is there is upstream work to remove stale metrics/attributes

ref: https://github.com/open-telemetry/opentelemetry-go/pull/7541

If this is imminent, we might be better off removing the metrics for a release or two and adding them back when OTel changes.

dprotaso, Oct 28 '25 00:10

> What's timely is there is upstream work to remove stale metrics/attributes
>
> ref: open-telemetry/opentelemetry-go#7541
>
> If this is imminent, we might be better off removing the metrics for a release or two and adding them back when OTel changes.

Are you thinking of removing all metrics? Would this be covered by an opt-in/opt-out flag?

We depend on those metrics and would rather keep them until a fix is available, restarting the components regularly as a mitigation in the meantime.

norman465, Oct 28 '25 10:10