gloo
gloo copied to clipboard
OTEL Telemetry Exporter isn't setting service.name
Gloo Edge Version
1.13.x (latest stable)
Kubernetes Version
1.23.x
Describe the bug
When using the opentelemetry configuration with gloo the resource attribute isn't having the service.name
set. It's currently being set as unknown
Steps to reproduce the bug
Relevant configuration:
httpGateway:
options:
httpConnectionManagerSettings:
tracing:
openTelemetryConfig:
collectorUpstreamRef:
name: <otel-collector>
namespace: <namespace>
Expected Behavior
Spans and traces should have the resource attribute of service.name
properly.
Additional Context
No response
Potentially related to #8494
Perhaps I'm over-simplifying this, so please correct me if I'm wrong.
Per https://www.envoyproxy.io/docs/envoy/latest/api-v3/config/trace/v3/opentelemetry.proto.html:
The name for the service. This will be populated in the ResourceSpan Resource attributes. If it is not provided, it will default to “unknown_service:envoy”.
We should be populating the service.name
field with the name of the gateway-proxy service that is being run, correct?
Should this be the name of the service or the name of the gateway that this is defined on?
I'd say the gateway name fits well for this value as it's defined in the Gateway
resource. In case sharing is used (https://docs.solo.io/gloo-edge/main/introduction/architecture/deployment_arch/#sharded-api-gateway), the value would be different for each Gateway
resources which makes sense IMO.
@sam-heilbron Feedback from another customer is that setting this to the gateway name also fits their use-case and requirements. So I agree with @anessi, let's set this to the gateway name, keeping in mind that we need to support a shared environment with multiple gateways.
I want to confirm the expected behavior with a specific example, using the example from https://docs.solo.io/gloo-edge/latest/guides/observability/tracing/otel/
This is the gateway config:
apiVersion: gateway.solo.io/v1
kind: Gateway
metadata:
labels:
app: gloo
name: gateway-proxy
namespace: gloo-system
spec:
bindAddress: '::'
bindPort: 8080
httpGateway:
options:
httpConnectionManagerSettings:
tracing:
openTelemetryConfig:
collectorUpstreamRef:
namespace: "gloo-system"
name: "opentelemetry-collector"
Where right now we are displaying
Resource attributes:
-> service.name: Str(unknown_service:envoy)
We would instead display:
Resource attributes:
-> service.name: Str(gateway-proxy)
Should namespace
be included in any way?
I have a solution that depends upon the gateway populating some "parent" metadata with the gateway name.
This happens in the "classic" gloo gateway, and I suspect it is not happening in ggv2 (investigation in progress). I also think that since no one is on ggv2 yet, there's value in this approach, and we can add the translation to ggv2 at the appropriate time (if its not currently implemented).
I expect to have a PR for this approach ready Monday.
Also some questions for how to handle some edge cases (which shouldn't occur, but we should protect against). The edge cases:
- We are not able to find the gateway name because of how the how the metadata is formatted a. The deprecated, opaque data format is used (deprecated in v11.0) b. No gateway metadata is passed
- Multiple gateways are defined
These edge cases wouldn't be expected to be a matter of bad user configuration (at least for classic gateway), and would be more of a matter of bad translation/application logic.
The two main approaches to handling this would be:
- Stop translation of the plugin and return an error. This
- Log an error/warning and use either an empty string (which would result in the current
unknown_service:envoy
) or a string reflecting the data situation ("deprecated_metadata", "gateway_not_defined", "multiple_gateways_defined" or "gateway_1, gateway_2...")
Any opinions on which approach to use?
IMO a namespace
is not needed as this is commonly added as generic attributes to the traces (see resource.attributes.k8s@namespace@name
in the example below).
See example-traces.json for two example traces, one from rate-limit and one from gateway-proxy.
If a trace can't be clearly related to a gateway, then a more sensible values like the type of service should be used. Stop translating is not an option IMO. It should be possible to always put a sensible value there (hopefully :) ).
Maybe rate-limiting is a good example if multiple gateways are using the same rate limit server, I think it would be perfectly fine to have a value like rate-limit
in the serviceName
field. Basically the field is there to understand where the trace comes from.
@anessi , thanks for the feedback. I agree we shouldn't stop translating on these edge cases.
However, in most of these cases I don't know that there will be a way to find a more sensible value. At the point in translation where we create the envoy Open Telemetry Configuration, we have already translated Gateways and VirtualServices to Proxys and VirtualHosts, and we don't have access to the runtime information/span definitions. The internal OpenTelemetry configuration only refers to the upstream ref.
These edge cases can't be reached by updating configuration, and would only be encountered if changes to the gateway (or a new gateway) were introduced. If I wanted to create them in a running cluster, I would start by intentionally writing some bad gateway translation code, I don't think I could do it with just yaml and out-of-the-box gloo. So if the metadata for the gateway is wrong, all of our metadata is likely wrong (as in case 1a/1b). The "multiple gateways" use case would also not occur when multiple gateways refer to the same upstream collector. Current translation would create separate Otel Configurations that would each refer to a single gateway.
This might be something easier discussed in a quick call/huddle with interested parties.
@sheidkamp : thanks for the details. I don't know the internals, so if it's not possible to set any meaningful value, setting a string as you proposed above and logging additional details is the best that can be done I guess. If those are edge cases, it's maybe not such a big deal anyway.
~~We're going to update the API by adding serviceNameSource
to the openTelemetryConfig
that will accept gatewayName: {}
as a value, which will be used as the default if omitted.~~
openTelemetryConfig:
collectorUpstreamRef:
name: opentelemetry-collector
namespace: gloo-system
serviceNameSource:
gatewayName: {}
Since there was only one option, we decided to wait to introduce the API until there was a second option to choose from.
Merged into 1.17/main
Zendesk ticket #3478 has been linked to this issue.