queue-proxy not sending some tags on metrics when using opencensus
What version of Knative?
0.25.0
Expected Behavior
Opencensus telemetry from the queue-proxy should include the configuration_name, revision_name, and service_name tags, as documented.
Actual Behavior
When I follow the documented setup for the OpenTelemetry Collector and export the metrics using a Prometheus exporter, the resulting time series do not include these expected labels.
Steps to Reproduce the Problem
- Set up knative-serving to use opencensus for request metrics
- Send a request and wait for the metric to show up on the otel collector's prometheus exporter endpoint
- Observe on the prometheus exporter endpoint that these labels are missing
I would note that I am only using opencensus for the request metrics, but I wouldn't expect that to impact this:
metrics.backend-destination: prometheus
metrics.request-metrics-backend-destination: opencensus
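For anyone reproducing this, here is a minimal sketch of where those two keys live, in the config-observability ConfigMap in the knative-serving namespace; the metrics.opencensus-address value is just a placeholder for your collector's OpenCensus receiver:

apiVersion: v1
kind: ConfigMap
metadata:
  name: config-observability
  namespace: knative-serving
data:
  metrics.backend-destination: prometheus
  metrics.request-metrics-backend-destination: opencensus
  # placeholder address; point this at your OTel collector's OpenCensus receiver
  metrics.opencensus-address: otel-collector.metrics:55678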
cc @skonto
@jasonaliyetti This is an issue I faced before; check here, along with the rest of the issues I found in the past when evaluating OTEL for Knative. Since then there have been some improvements: https://github.com/open-telemetry/opentelemetry-collector/pull/2899, https://github.com/open-telemetry/opentelemetry-collector/issues/2465. Could you check if the following config fixes the issue?
exporters:
  prometheus:
    ...
    resource_to_telemetry_conversion:
      enabled: true
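For completeness, a sketch of where that option sits in a full collector config; the opencensus receiver, the logging exporter, and the Prometheus endpoint below are illustrative choices, not taken from this issue:

receivers:
  opencensus: {}
exporters:
  logging:
    loglevel: debug
  prometheus:
    endpoint: "0.0.0.0:8889"
    resource_to_telemetry_conversion:
      enabled: true
service:
  pipelines:
    metrics:
      receivers: [opencensus]
      exporters: [logging, prometheus]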
I did some debugging on the collector side. By default we do ship the resource labels as expected, but the Prometheus exporter on the OTEL collector side does not expose them by default (the resource labels are still there, though):
# TYPE knative_dev_internal_serving_revision_app_request_count counter
knative_dev_internal_serving_revision_app_request_count{container_name="queue-proxy",pod_name="helloworld-go-00001-deployment-7c577d85bc-x664b",response_code="200",response_code_class="2xx"} 24
2021-10-14T13:42:04.348Z DEBUG loggingexporter/logging_exporter.go:66 ResourceMetrics #0
Resource labels:
-> service.name: STRING(revision)
-> opencensus.starttime: STRING(2021-10-14T13:21:51.036876613Z)
-> host.name: STRING(helloworld-go-00001-deployment-7c577d85bc-x664b)
-> process.pid: INT(1)
-> telemetry.sdk.version: STRING(0.23.0)
-> opencensus.exporterversion: STRING(0.0.1)
-> telemetry.sdk.language: STRING(go)
-> namespace_name: STRING(default)
-> service_name: STRING(helloworld-go)
-> configuration_name: STRING(helloworld-go)
-> revision_name: STRING(helloworld-go-00001)
-> opencensus.resourcetype: STRING(knative_revision)
InstrumentationLibraryMetrics #0
InstrumentationLibrary
Metric #0
Descriptor:
-> Name: knative.dev/internal/serving/revision/scrape_time
-> Description: The time to scrape metrics in milliseconds
-> Unit: ms
-> DataType: Histogram
-> AggregationTemporality: AGGREGATION_TEMPORALITY_CUMULATIVE
Metric #1
Descriptor:
-> Name: knative.dev/internal/serving/revision/app_request_count
-> Description: The number of requests that are routed to user-container
-> Unit: 1
-> DataType: Sum
-> IsMonotonic: true
-> AggregationTemporality: AGGREGATION_TEMPORALITY_CUMULATIVE
NumberDataPoints #0
Data point attributes:
-> container_name: STRING(queue-proxy)
-> pod_name: STRING(helloworld-go-00001-deployment-7c577d85bc-x664b)
-> response_code: STRING(200)
-> response_code_class: STRING(2xx)
StartTimestamp: 2021-10-14 13:40:02.346089686 +0000 UTC
Timestamp: 2021-10-14 13:42:02.346635646 +0000 UTC
However, if I enable that config above, I get an error about duplicate labels:
2021-10-14T12:54:21.803Z error [email protected]/collector.go:220 failed to convert metric knative.dev/internal/serving/revision/request_count: duplicate label names {"kind": "exporter", "name": "prometheus"}
This is due to label sanitization (I can provide the full call graph if needed): https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/f89d7a10d861bf584302511f7db463247cdc3fca/exporter/prometheusexporter/sanitize.go#L25
sanitize replaces non-alphanumeric characters with underscores in s.
As you can see, service.name (which comes from otel) and service_name (which comes from knative) end up as the same metric label.
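A minimal sketch of the collision, mirroring the linked sanitization logic rather than the exact exporter code:

package main

import (
	"fmt"
	"unicode"
)

// sanitize mimics the Prometheus exporter's label sanitization:
// any rune that is not a letter or digit becomes an underscore.
func sanitize(s string) string {
	out := []rune(s)
	for i, r := range out {
		if !unicode.IsLetter(r) && !unicode.IsDigit(r) {
			out[i] = '_'
		}
	}
	return string(out)
}

func main() {
	// service.name comes from the OTel resource attributes and
	// service_name comes from the Knative tag; after sanitization
	// both collapse into the same Prometheus label name.
	fmt.Println(sanitize("service.name")) // service_name
	fmt.Println(sanitize("service_name")) // service_name
}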
By renaming our tag here to knative_service_name and enabling the resource_to_telemetry_conversion config above, I was able to get all the resource labels:
# TYPE knative_dev_internal_serving_revision_app_request_count counter
knative_dev_internal_serving_revision_app_request_count{configuration_name="helloworld-go",container_name="queue-proxy",host_name="helloworld-go-00001-deployment-868846d854-hschn",knative_service_name="helloworld-go",namespace_name="default",opencensus_exporterversion="0.0.1",opencensus_resourcetype="knative_revision",opencensus_starttime="2021-10-14T14:26:30.253916233Z",pod_name="helloworld-go-00001-deployment-868846d854-hschn",process_pid="1",response_code="200",response_code_class="2xx",revision_name="helloworld-go-00001",service_name="revision",telemetry_sdk_language="go",telemetry_sdk_version="0.23.0"} 16
2021-10-14T14:27:37.410Z DEBUG loggingexporter/logging_exporter.go:66 ResourceMetrics #0
Resource labels:
-> service.name: STRING(revision)
-> opencensus.starttime: STRING(2021-10-14T14:26:30.253916233Z)
-> host.name: STRING(helloworld-go-00001-deployment-868846d854-hschn)
-> process.pid: INT(1)
-> telemetry.sdk.version: STRING(0.23.0)
-> opencensus.exporterversion: STRING(0.0.1)
-> telemetry.sdk.language: STRING(go)
-> knative_service_name: STRING(helloworld-go)
-> configuration_name: STRING(helloworld-go)
-> revision_name: STRING(helloworld-go-00001)
-> namespace_name: STRING(default)
-> opencensus.resourcetype: STRING(knative_revision)
InstrumentationLibraryMetrics #0
InstrumentationLibrary
Metric #0
Descriptor:
-> Name: knative.dev/internal/serving/revision/scrape_time
-> Description: The time to scrape metrics in milliseconds
-> Unit: ms
-> DataType: Histogram
-> AggregationTemporality: AGGREGATION_TEMPORALITY_CUMULATIVE
Metric #1
Descriptor:
-> Name: knative.dev/internal/serving/revision/app_request_count
-> Description: The number of requests that are routed to user-container
-> Unit: 1
-> DataType: Sum
-> IsMonotonic: true
-> AggregationTemporality: AGGREGATION_TEMPORALITY_CUMULATIVE
NumberDataPoints #0
Data point attributes:
-> container_name: STRING(queue-proxy)
-> pod_name: STRING(helloworld-go-00001-deployment-868846d854-hschn)
-> response_code: STRING(200)
-> response_code_class: STRING(2xx)
Another path would be to omit the component name on the knative side when opencensus is used (written here with component="revision" for the queue-proxy), as it seems otel only sets the service.name attribute if the shipped metric contains a name. I am not sure whether the default otel resource attributes can be omitted, but I guess if we want to comply with the otel spec, metrics need to have a service.name attribute, and we should rename the label for the service instance (e.g. helloworld-go) to avoid confusion.
/cc @dprotaso @evankanderson
This issue is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen the issue with /reopen. Mark the issue as fresh by adding the comment /remove-lifecycle stale.
/reopen
@skonto: Reopened this issue.
In response to this:
/reopen
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
This issue is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen the issue with /reopen. Mark the issue as fresh by adding the comment /remove-lifecycle stale.
Is there any workaround now that can solve this issue? This bug makes the most important metrics of the system useless because they don't carry any tags. I tested the first fix you suggested (renaming service_name to knative_service_name) and it works nicely. Is there any reason why that change hasn't been implemented yet?
/reopen
@skonto: Reopened this issue.
In response to this:
/reopen
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
This issue is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen the issue with /reopen. Mark the issue as fresh by adding the comment /remove-lifecycle stale.