Switch tracing service name to be ServingRevision rather than ServingPod in queue-proxy
Describe the feature
Currently, queue-proxy uses the pod name as the service name in the traces it emits; this is where it happens: https://github.com/knative/serving/blob/33dbc31dd93d4ec1aed2d026d3763fb81a1cc807/cmd/queue/main.go#L357
This is hard-coded and cannot be changed. In a realistic APM scenario it leads to a proliferation of thousands of services in the APM UI, which is mostly noise. It would be better to provide a more descriptive service name that reduces the noise and yields a more useful list of Knative services.
I see three options:
- Change the service name to be ServingRevision or ServingService, probably ServingRevision.
- Do number 1, but also make the value overridable via a label/env/config/etc. so the kservice can dictate its own service name (see the sketch after this list)
- Use "queue-proxy" as the service name and let the user-container be the only one that outputs traces with the actual service name
Open to discussion on what people prefer.
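For option 2, a purely hypothetical sketch of what an override could look like on a Knative Service (the tracing.knative.dev/service-name label key does not exist today; it only illustrates the idea):

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: hello
  labels:
    # hypothetical label, not a current Knative API
    tracing.knative.dev/service-name: hello
spec:
  template:
    spec:
      containers:
        - image: registry.example.com/hello:latest  # placeholder image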
this creates a proliferation of thousands of services in the APM UI which creates a lot of noise
Do you have a sample of this you could post, to help illustrate the issue? (not questioning whether this is an issue, it would just help me see the problem a bit more clearly)
I have the same issue and it looks like this in datadog:
essentially, it sees each pod as a new service.
What I'm curious to know is how these APM tools break down and roll up services and their instances. An analogous question: how would you map traces from a Deployment that has multiple pods?
I'm leaning towards making the service name the name of the revision, with the pod name being an 'instance' (I'm not sure exactly how you add this to the trace and have it appear).
Secondly, we should probably broadcast this (maybe on the mailing list) to see if anyone is using queue-proxy tracing as is - we don't want to break existing users.
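For reference, the OpenTelemetry resource semantic conventions already model this split: service.name identifies the logical service, while service.instance.id identifies the individual instance (here, the pod). Under that model, the resource attributes on a queue-proxy span could look roughly like this (values are illustrative):

service.name: hello-00001                                     # revision name
service.instance.id: hello-00001-deployment-5c9f8b7d9f-abcde  # pod name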
/triage needs-user-input
Hi, one concern with aggregation is the following: in debugging scenarios, people need to link pod info to specific traces. If one in 100 pods fails and some request traces also fail, would people be able to make the connection? I have seen customers struggle with exactly this disconnect between infrastructure and application observability data. As a side note, downstream in OpenShift Serverless people still use per-pod tracing, integrating with Jaeger and other tools directly. It is noisy, however, and the same applies to metrics (high-cardinality issues).
Information about the specific pod/IP/instance can be added to the span/trace attributes. The OTel auto-instrumentation Kubernetes operator, for example, adds k8s.container.name, k8s.deployment.name, k8s.namespace.name, k8s.node.name, k8s.pod.name, etc. to all spans it creates. This makes it possible to identify the pod that created the span, but also allows spans to be grouped and filtered into different topologies without creating noise in the service name. Grafana and Jaeger also use the service name to group traces, and currently it is not possible to group by Knative service because each replica is its own group.
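For illustration, the collector-side equivalent of what the operator does is the k8sattributes processor; a minimal sketch using some of its documented default metadata keys:

processors:
  k8sattributes:
    extract:
      metadata:
        - k8s.namespace.name
        - k8s.deployment.name
        - k8s.node.name
        - k8s.pod.name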
This issue is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen the issue with /reopen. Mark the issue as fresh by adding the comment /remove-lifecycle stale.
I created an OpenTelemetry transform processor config to truncate the service name, if somebody is interested:

processors:
  transform/knative:
    traces:
      queries:
        - replace_pattern(resource.attributes["service.name"], "-[0-9]{5}-deployment-[a-fA-F0-9]{8,10}-[a-zA-Z0-9]{5}", "")
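For example, assuming the default Knative naming scheme, this rewrites a queue-proxy service name like hello-00001-deployment-5c9f8b7d9f-abcde back to just hello.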
This issue is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen the issue with /reopen. Mark the issue as fresh by adding the comment /remove-lifecycle stale.
I agree with previous comments. I think the service.name should be the name of the service as returned by kn service list.
The revision and pod info are metadata that should be attached to the spans belonging to the trace.
This issue also makes it more difficult for non-Knative users to search for their traces, because the Jaeger service dropdown is polluted with lots of Knative noise.
/lifecycle frozen
/triage accepted
We currently have multiple teams looking into using Knative. One team viewed this issue as a blocker to adopting Knative and subsequently stopped investigating it.
I have the same issue; I fixed it with the following OTel Collector processors:

processors:
  k8sattributes/knative:
    auth_type: "serviceAccount"
    passthrough: false
    filter:
      node_from_env_var: KUBE_NODE_NAME
    extract:
      # copy the Knative configuration label from the pod onto the span resource
      labels:
        - tag_name: knative.serving.configuration
          key: serving.knative.dev/configuration
          from: pod
  resource/knative:
    attributes:
      # use the extracted label as the service name
      - key: service.name
        from_attribute: knative.serving.configuration
        action: upsert
service:
  pipelines:
    traces/knative:
      receivers: [zipkin]
      exporters: [datadog]
      processors: [k8sattributes/knative, resource/knative]
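The same approach should also work with the serving.knative.dev/service or serving.knative.dev/revision pod labels instead, if you prefer grouping by the Knative service or revision name rather than the configuration.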