
queue-proxy not sending some tags on metrics when using opencensus

jasonaliyetti opened this issue 4 years ago • 12 comments

What version of Knative?

0.25.0

Expected Behavior

OpenCensus telemetry from the queue-proxy should include the configuration_name, revision_name, and service_name tags, as documented.

Actual Behavior

When I follow the documented setup for OpenTelemetry Collector and export the metrics using a prometheus exporter, the resulting timeseries do not include these expected labels.

Steps to Reproduce the Problem

  • Set up knative-serving to use OpenCensus for request metrics
  • Send a request and wait for the metric to show up on the OTEL collector's Prometheus exporter endpoint
  • Observe that these labels are missing from the resulting timeseries on the Prometheus exporter endpoint.

Note that I am only using OpenCensus for the request metrics, but I wouldn't expect that to impact this:

metrics.backend-destination: prometheus
metrics.request-metrics-backend-destination: opencensus
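
For reference, a minimal config-observability ConfigMap wiring the request metrics to OpenCensus might look like the sketch below; the metrics.opencensus-address value (collector service name, namespace, and port) is an assumption and needs to point at your own OTEL collector deployment:

apiVersion: v1
kind: ConfigMap
metadata:
  name: config-observability
  namespace: knative-serving
data:
  # Control-plane metrics stay on the Prometheus backend.
  metrics.backend-destination: prometheus
  # Queue-proxy request metrics are shipped via OpenCensus.
  metrics.request-metrics-backend-destination: opencensus
  # Assumed address of the OpenCensus receiver on the OTEL collector.
  metrics.opencensus-address: otel-collector.metrics:55678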

jasonaliyetti avatar Oct 01 '21 18:10 jasonaliyetti

cc @skonto

julz avatar Oct 13 '21 12:10 julz

@jasonaliyetti This is an issue I faced before; check here, along with the rest of the issues I found in the past when evaluating OTEL for Knative. Since then there have been some improvements: https://github.com/open-telemetry/opentelemetry-collector/pull/2899, https://github.com/open-telemetry/opentelemetry-collector/issues/2465. Could you check whether the following config fixes the issue?

exporters:
  prometheus:
    ...
    resource_to_telemetry_conversion:
      enabled: true
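
For context, a minimal collector configuration with this option in place might look like the following sketch; the receiver and exporter endpoints are assumptions, and the resource_to_telemetry_conversion block is the only part relevant to this issue:

receivers:
  opencensus:
    endpoint: 0.0.0.0:55678   # queue-proxy ships request metrics here

exporters:
  prometheus:
    endpoint: 0.0.0.0:8889    # scrape target exposed by the collector
    resource_to_telemetry_conversion:
      enabled: true           # copy resource attributes onto each timeseries as labels

service:
  pipelines:
    metrics:
      receivers: [opencensus]
      exporters: [prometheus]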

skonto avatar Oct 13 '21 18:10 skonto

I did some debugging on the collector side: by default we ship the resource labels as expected, but the Prometheus exporter on the OTEL collector side does not expose them by default (the resource labels are still there, though):

# TYPE knative_dev_internal_serving_revision_app_request_count counter
knative_dev_internal_serving_revision_app_request_count{container_name="queue-proxy",pod_name="helloworld-go-00001-deployment-7c577d85bc-x664b",response_code="200",response_code_class="2xx"} 24

2021-10-14T13:42:04.348Z	DEBUG	loggingexporter/logging_exporter.go:66	ResourceMetrics #0
Resource labels:
     -> service.name: STRING(revision)
     -> opencensus.starttime: STRING(2021-10-14T13:21:51.036876613Z)
     -> host.name: STRING(helloworld-go-00001-deployment-7c577d85bc-x664b)
     -> process.pid: INT(1)
     -> telemetry.sdk.version: STRING(0.23.0)
     -> opencensus.exporterversion: STRING(0.0.1)
     -> telemetry.sdk.language: STRING(go)
     -> namespace_name: STRING(default)
     -> service_name: STRING(helloworld-go)
     -> configuration_name: STRING(helloworld-go)
     -> revision_name: STRING(helloworld-go-00001)
     -> opencensus.resourcetype: STRING(knative_revision)
InstrumentationLibraryMetrics #0
InstrumentationLibrary  
Metric #0
Descriptor:
     -> Name: knative.dev/internal/serving/revision/scrape_time
     -> Description: The time to scrape metrics in milliseconds
     -> Unit: ms
     -> DataType: Histogram
     -> AggregationTemporality: AGGREGATION_TEMPORALITY_CUMULATIVE
Metric #1
Descriptor:
     -> Name: knative.dev/internal/serving/revision/app_request_count
     -> Description: The number of requests that are routed to user-container
     -> Unit: 1
     -> DataType: Sum
     -> IsMonotonic: true
     -> AggregationTemporality: AGGREGATION_TEMPORALITY_CUMULATIVE
NumberDataPoints #0
Data point attributes:
     -> container_name: STRING(queue-proxy)
     -> pod_name: STRING(helloworld-go-00001-deployment-7c577d85bc-x664b)
     -> response_code: STRING(200)
     -> response_code_class: STRING(2xx)
StartTimestamp: 2021-10-14 13:40:02.346089686 +0000 UTC
Timestamp: 2021-10-14 13:42:02.346635646 +0000 UTC

However, if I enable the config above, I get an error about duplicate labels:

2021-10-14T12:54:21.803Z error [email protected]/collector.go:220 failed to convert metric knative.dev/internal/serving/revision/request_count: duplicate label names {"kind": "exporter", "name": "prometheus"}

This is due to label sanitization (I can provide the full call graph if needed): https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/f89d7a10d861bf584302511f7db463247cdc3fca/exporter/prometheusexporter/sanitize.go#L25

sanitize replaces non-alphanumeric characters with underscores in s.

As you can see, service.name (which comes from OTEL) and service_name (which comes from Knative) end up as the same metric label after sanitization.

skonto avatar Oct 14 '21 13:10 skonto

By renaming our tag here to knative_service_name and enabling the resource_to_telemetry_conversion config above, I was able to get all the resource labels:

# TYPE knative_dev_internal_serving_revision_app_request_count counter
knative_dev_internal_serving_revision_app_request_count{configuration_name="helloworld-go",container_name="queue-proxy",host_name="helloworld-go-00001-deployment-868846d854-hschn",knative_service_name="helloworld-go",namespace_name="default",opencensus_exporterversion="0.0.1",opencensus_resourcetype="knative_revision",opencensus_starttime="2021-10-14T14:26:30.253916233Z",pod_name="helloworld-go-00001-deployment-868846d854-hschn",process_pid="1",response_code="200",response_code_class="2xx",revision_name="helloworld-go-00001",service_name="revision",telemetry_sdk_language="go",telemetry_sdk_version="0.23.0"} 16

2021-10-14T14:27:37.410Z	DEBUG	loggingexporter/logging_exporter.go:66	ResourceMetrics #0
Resource labels:
     -> service.name: STRING(revision)
     -> opencensus.starttime: STRING(2021-10-14T14:26:30.253916233Z)
     -> host.name: STRING(helloworld-go-00001-deployment-868846d854-hschn)
     -> process.pid: INT(1)
     -> telemetry.sdk.version: STRING(0.23.0)
     -> opencensus.exporterversion: STRING(0.0.1)
     -> telemetry.sdk.language: STRING(go)
     -> knative_service_name: STRING(helloworld-go)
     -> configuration_name: STRING(helloworld-go)
     -> revision_name: STRING(helloworld-go-00001)
     -> namespace_name: STRING(default)
     -> opencensus.resourcetype: STRING(knative_revision)
InstrumentationLibraryMetrics #0
InstrumentationLibrary  
Metric #0
Descriptor:
     -> Name: knative.dev/internal/serving/revision/scrape_time
     -> Description: The time to scrape metrics in milliseconds
     -> Unit: ms
     -> DataType: Histogram
     -> AggregationTemporality: AGGREGATION_TEMPORALITY_CUMULATIVE
Metric #1
Descriptor:
     -> Name: knative.dev/internal/serving/revision/app_request_count
     -> Description: The number of requests that are routed to user-container
     -> Unit: 1
     -> DataType: Sum
     -> IsMonotonic: true
     -> AggregationTemporality: AGGREGATION_TEMPORALITY_CUMULATIVE
NumberDataPoints #0
Data point attributes:
     -> container_name: STRING(queue-proxy)
     -> pod_name: STRING(helloworld-go-00001-deployment-868846d854-hschn)
     -> response_code: STRING(200)
     -> response_code_class: STRING(2xx)


skonto avatar Oct 14 '21 14:10 skonto

Another path would be to omit the component name on the Knative side when OpenCensus is used (it is written here with component="revision" for the queue-proxy), since OTEL only sets the service.name attribute if the shipped metric contains a name. I am not sure whether the default OTEL resource attributes can be omitted, but if we want to comply with the OTEL spec, metrics need to have a service.name attribute, and we should rename the label for the service instance (e.g. helloworld-go) to avoid confusion. /cc @dprotaso @evankanderson
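
A collector-side alternative (a sketch, assuming the contrib resource processor is available in your collector distribution) would be to move Knative's colliding service_name resource attribute out of the way before the Prometheus exporter runs, mirroring the rename above without rebuilding queue-proxy:

processors:
  resource:
    attributes:
      # Copy Knative's service_name into a non-colliding key, then drop the original
      # so it no longer clashes with OTEL's service.name after label sanitization.
      - key: knative_service_name
        from_attribute: service_name
        action: insert
      - key: service_name
        action: delete

service:
  pipelines:
    metrics:
      receivers: [opencensus]
      processors: [resource]
      exporters: [prometheus]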

skonto avatar Oct 14 '21 15:10 skonto

This issue is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen the issue with /reopen. Mark the issue as fresh by adding the comment /remove-lifecycle stale.

github-actions[bot] avatar Jan 13 '22 01:01 github-actions[bot]

/reopen

skonto avatar Nov 08 '22 10:11 skonto

@skonto: Reopened this issue.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

knative-prow[bot] avatar Nov 08 '22 10:11 knative-prow[bot]

This issue is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen the issue with /reopen. Mark the issue as fresh by adding the comment /remove-lifecycle stale.

github-actions[bot] avatar Feb 10 '23 01:02 github-actions[bot]

Is there any workaround that can solve this issue now? This bug makes the most important metrics of the system useless because they don't carry any of these tags. I tested the first fix that you suggested (changing service_name to knative_service_name) and it works nicely. Is there any reason why that change hasn't been implemented yet?

davidggz avatar Mar 24 '23 12:03 davidggz

/reopen

skonto avatar Sep 10 '24 08:09 skonto

@skonto: Reopened this issue.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

knative-prow[bot] avatar Sep 10 '24 08:09 knative-prow[bot]

This issue is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen the issue with /reopen. Mark the issue as fresh by adding the comment /remove-lifecycle stale.

github-actions[bot] avatar Jan 06 '25 01:01 github-actions[bot]