Switch tracing service name to be ServingRevision rather than ServingPod in queue-proxy

ajp-lsq opened this issue May 26 '22 · 15 comments

Describe the feature

Currently, queue-proxy uses the pod name as the service name in the traces it sends out. This happens here: https://github.com/knative/serving/blob/33dbc31dd93d4ec1aed2d026d3763fb81a1cc807/cmd/queue/main.go#L357

This is hard-coded and unchangeable. In a realistic APM scenario, this creates a proliferation of thousands of services in the APM UI, which creates a lot of noise. It would be better to use a more descriptive service name that reduces noise and yields a more useful list of Knative services.

I see three options:

  1. Change the service name to be ServingRevision or ServingService, probably ServingRevision.
  2. Do number 1, but also make the value overridable via a label/env/config/etc. so the kservice can dictate its own service name.
  3. Use "queue-proxy" as the service name and let the user-container be the only one that outputs traces with the actual service name.

Open to discussion on what people prefer.
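
For illustration, a minimal sketch of how options 1 and 2 could compose. SERVING_REVISION and SERVING_POD are env vars the queue-proxy container already receives; the TRACING_SERVICE_NAME override and the tracingServiceName helper are hypothetical names for this sketch, not existing queue-proxy code:

package main

import (
	"fmt"
	"os"
)

// tracingServiceName picks the service name reported in traces.
// TRACING_SERVICE_NAME is a made-up override for illustration only.
func tracingServiceName() string {
	if name := os.Getenv("TRACING_SERVICE_NAME"); name != "" {
		return name // option 2: the kservice dictates its own name
	}
	if rev := os.Getenv("SERVING_REVISION"); rev != "" {
		return rev // option 1: one traced service per revision
	}
	return os.Getenv("SERVING_POD") // current behavior: one service per pod
}

func main() {
	fmt.Println(tracingServiceName())
}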

ajp-lsq · May 26 '22

> this creates a proliferation of thousands of services in the APM UI which creates a lot of noise

Do you have a sample of this you could post, to help illustrate the issue? (not questioning whether this is an issue, it would just help me see the problem a bit more clearly)

psschwei · May 27 '22

I have the same issue, and it looks like this in Datadog: [screenshot: Datadog APM service list showing a separate service entry per pod]

essentially, it sees each pod as a new service.

edude03 · May 30 '22

What I'm curious to know is how these APM tools break down / roll up services and their instances. An analogous example: how would you map traces from a Deployment that has multiple pods?

I'm leaning towards making the service name the name of the revision, with the pod name being an 'instance' (not sure exactly how you'd add this to the trace and have it appear).
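
For what it's worth, a minimal sketch of that service/instance mapping using OpenTelemetry resource attributes. This is illustrative only (queue-proxy actually uses OpenCensus/Zipkin today); the env vars are ones the queue-proxy container already receives:

package main

import (
	"context"
	"fmt"
	"os"

	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/sdk/resource"
)

func main() {
	// service.name and service.instance.id are standard OTel resource
	// attribute keys; every span exported under this resource carries them,
	// so backends can group by revision and still drill down to the pod.
	res, err := resource.New(context.Background(),
		resource.WithAttributes(
			attribute.String("service.name", os.Getenv("SERVING_REVISION")),
			attribute.String("service.instance.id", os.Getenv("SERVING_POD")),
		),
	)
	if err != nil {
		panic(err)
	}
	fmt.Println(res)
}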

Secondly, we probably need to broadcast this (maybe on the mailing list) and see if anyone is using queue-proxy tracing as-is - we don't want to break existing users.

dprotaso · May 30 '22

/triage needs-user-input

dprotaso · May 30 '22

Hi, one concern with aggregation is the following: in debugging scenarios, people need to link pod info to specific traces. If one in 100 pods fails and some request traces also fail, would people be able to make the connection? I have actually seen customers dealing with this kind of disconnect between infra and app observability info. As a side note, downstream in OpenShift Serverless people still use per-pod tracing, integrating with Jaeger and other tools directly. It is noisy, however, and the same applies to metrics (high-cardinality issues).

skonto · May 30 '22

Information about the specific pod/IP/instance can be added to the span/trace attributes. The OTel auto-instrumentation Kubernetes operator, for example, adds k8s.container.name, k8s.deployment.name, k8s.namespace.name, k8s.node.name, k8s.pod.name, etc. to all spans it creates. This makes it possible to identify the pod that created a span, but also allows grouping and filtering spans into different topologies without creating noise in the service name. Grafana and Jaeger also use the service name to group traces, and currently it is not possible to group by Knative service because each replica is its own group.

Legion2 · Jul 15 '22

This issue is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen the issue with /reopen. Mark the issue as fresh by adding the comment /remove-lifecycle stale.

github-actions[bot] · Oct 14 '22

I created an OpenTelemetry Collector transform processor config to truncate the service name, in case somebody is interested:

processors:
  transform/knative:
    traces:
      queries:
      # strip the "-<5-digit revision>-deployment-<replicaset-hash>-<pod-suffix>"
      # tail from the pod-derived service name, leaving the configuration name
      - replace_pattern(resource.attributes["service.name"], "-[0-9]{5}-deployment-[a-fA-F0-9]{8,10}-[a-zA-Z0-9]{5}", "")
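
To show what the pattern does, here is the same regex applied in Go to a made-up pod name that follows Knative's naming scheme:

package main

import (
	"fmt"
	"regexp"
)

func main() {
	// Same pattern as the collector config above; the pod name is fabricated.
	re := regexp.MustCompile(`-[0-9]{5}-deployment-[a-fA-F0-9]{8,10}-[a-zA-Z0-9]{5}`)
	fmt.Println(re.ReplaceAllString("hello-00001-deployment-9d6f8c7b5-x7k2p", ""))
	// prints "hello", so all of a revision's pods collapse into one service
}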

Legion2 · Oct 14 '22

This issue is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen the issue with /reopen. Mark the issue as fresh by adding the comment /remove-lifecycle stale.

github-actions[bot] · Jan 14 '23

I agree with the previous comments. I think service.name should be the name of the service as returned by kn service list. The revision and pod info are metadata that should be attached to the spans belonging to the trace.

This issue makes it more difficult for non-Knative users to search for their traces, because the Jaeger service dropdown is polluted with lots of Knative noise.

paulgrav · Jan 17 '23

/lifecycle frozen

dprotaso · Jan 19 '23

/triage accepted

dprotaso · Jan 19 '23

We currently have multiple teams looking into using Knative. I've had one team view this issue as a blocker to adopting Knative and subsequently stop investigating.

paulgrav · Aug 25 '23

I have the same issue; I fixed it with an OTel Collector processor chain:

processors:
  k8sattributes/knative:
    auth_type: "serviceAccount"
    passthrough: false
    filter:
      node_from_env_var: KUBE_NODE_NAME
    extract:
      # pull the Knative configuration name off the pod's labels
      labels:
        - tag_name: knative.serving.configuration
          key: serving.knative.dev/configuration
          from: pod
  resource/knative:
    attributes:
      # overwrite the pod-derived service.name with the configuration name
      - key: service.name
        from_attribute: knative.serving.configuration
        action: upsert
service:
  pipelines:
    traces/knative:
      receivers: [zipkin]
      exporters: [datadog]
      processors: [k8sattributes/knative, resource/knative]

VladislavGeneDx · Apr 04 '24