gloo icon indicating copy to clipboard operation
gloo copied to clipboard

OTEL Telemetry Exporter isn't setting service.name

Open knechtionscoding opened this issue 2 years ago • 2 comments

Gloo Edge Version

1.13.x (latest stable)

Kubernetes Version

1.23.x

Describe the bug

When using the opentelemetry configuration with gloo the resource attribute isn't having the service.name set. It's currently being set as unknown

Steps to reproduce the bug

Relevant configuration:

  httpGateway:
    options:
      httpConnectionManagerSettings:
        tracing:
          openTelemetryConfig:
            collectorUpstreamRef:
              name: <otel-collector>
              namespace: <namespace>

Expected Behavior

Spans and traces should have the resource attribute of service.name properly.

Additional Context

No response

knechtionscoding avatar Feb 03 '23 20:02 knechtionscoding

Potentially related to #8494

SantoDE avatar Jul 27 '23 15:07 SantoDE

Perhaps I'm over-simplifying this, so please correct me if I'm wrong.

Per https://www.envoyproxy.io/docs/envoy/latest/api-v3/config/trace/v3/opentelemetry.proto.html:

The name for the service. This will be populated in the ResourceSpan Resource attributes. If it is not provided, it will default to “unknown_service:envoy”.

We should be populating the service.name field with the name of the gateway-proxy service that is being run, correct?

Should this be the name of the service or the name of the gateway that this is defined on?

sam-heilbron avatar Feb 23 '24 15:02 sam-heilbron

I'd say the gateway name fits well for this value as it's defined in the Gateway resource. In case sharing is used (https://docs.solo.io/gloo-edge/main/introduction/architecture/deployment_arch/#sharded-api-gateway), the value would be different for each Gateway resources which makes sense IMO.

anessi avatar Feb 29 '24 12:02 anessi

@sam-heilbron Feedback from another customer is that setting this to the gateway name also fits their use-case and requirements. So I agree with @anessi, let's set this to the gateway name, keeping in mind that we need to support a shared environment with multiple gateways.

DuncanDoyle avatar Mar 01 '24 13:03 DuncanDoyle

I want to confirm the expected behavior with a specific example, using the example from https://docs.solo.io/gloo-edge/latest/guides/observability/tracing/otel/

This is the gateway config:

apiVersion: gateway.solo.io/v1
kind: Gateway
metadata:
  labels:
    app: gloo
  name: gateway-proxy
  namespace: gloo-system
spec:
  bindAddress: '::'
  bindPort: 8080
  httpGateway:
    options:
      httpConnectionManagerSettings:
        tracing:
          openTelemetryConfig:
            collectorUpstreamRef:
              namespace: "gloo-system"
              name: "opentelemetry-collector"

Where right now we are displaying

Resource attributes:
     -> service.name: Str(unknown_service:envoy)

We would instead display:

Resource attributes:
     -> service.name: Str(gateway-proxy)

Should namespace be included in any way?

sheidkamp avatar Mar 13 '24 17:03 sheidkamp

I have a solution that depends upon the gateway populating some "parent" metadata with the gateway name.

This happens in the "classic" gloo gateway, and I suspect it is not happening in ggv2 (investigation in progress). I also think that since no one is on ggv2 yet, there's value in this approach, and we can add the translation to ggv2 at the appropriate time (if its not currently implemented).

I expect to have a PR for this approach ready Monday.

Also some questions for how to handle some edge cases (which shouldn't occur, but we should protect against). The edge cases:

  1. We are not able to find the gateway name because of how the how the metadata is formatted a. The deprecated, opaque data format is used (deprecated in v11.0) b. No gateway metadata is passed
  2. Multiple gateways are defined

These edge cases wouldn't be expected to be a matter of bad user configuration (at least for classic gateway), and would be more of a matter of bad translation/application logic.

The two main approaches to handling this would be:

  • Stop translation of the plugin and return an error. This
  • Log an error/warning and use either an empty string (which would result in the current unknown_service:envoy) or a string reflecting the data situation ("deprecated_metadata", "gateway_not_defined", "multiple_gateways_defined" or "gateway_1, gateway_2...")

Any opinions on which approach to use?

sheidkamp avatar Mar 15 '24 16:03 sheidkamp

IMO a namespace is not needed as this is commonly added as generic attributes to the traces (see resource.attributes.k8s@namespace@name in the example below).

See example-traces.json for two example traces, one from rate-limit and one from gateway-proxy.

If a trace can't be clearly related to a gateway, then a more sensible values like the type of service should be used. Stop translating is not an option IMO. It should be possible to always put a sensible value there (hopefully :) ). Maybe rate-limiting is a good example if multiple gateways are using the same rate limit server, I think it would be perfectly fine to have a value like rate-limit in the serviceName field. Basically the field is there to understand where the trace comes from.

anessi avatar Mar 15 '24 16:03 anessi

@anessi , thanks for the feedback. I agree we shouldn't stop translating on these edge cases.

However, in most of these cases I don't know that there will be a way to find a more sensible value. At the point in translation where we create the envoy Open Telemetry Configuration, we have already translated Gateways and VirtualServices to Proxys and VirtualHosts, and we don't have access to the runtime information/span definitions. The internal OpenTelemetry configuration only refers to the upstream ref.

These edge cases can't be reached by updating configuration, and would only be encountered if changes to the gateway (or a new gateway) were introduced. If I wanted to create them in a running cluster, I would start by intentionally writing some bad gateway translation code, I don't think I could do it with just yaml and out-of-the-box gloo. So if the metadata for the gateway is wrong, all of our metadata is likely wrong (as in case 1a/1b). The "multiple gateways" use case would also not occur when multiple gateways refer to the same upstream collector. Current translation would create separate Otel Configurations that would each refer to a single gateway.

This might be something easier discussed in a quick call/huddle with interested parties.

sheidkamp avatar Mar 17 '24 17:03 sheidkamp

@sheidkamp : thanks for the details. I don't know the internals, so if it's not possible to set any meaningful value, setting a string as you proposed above and logging additional details is the best that can be done I guess. If those are edge cases, it's maybe not such a big deal anyway.

anessi avatar Mar 18 '24 07:03 anessi

~~We're going to update the API by adding serviceNameSource to the openTelemetryConfig that will accept gatewayName: {} as a value, which will be used as the default if omitted.~~

  openTelemetryConfig:
    collectorUpstreamRef:
      name: opentelemetry-collector
      namespace: gloo-system
    serviceNameSource:
      gatewayName: {}

Since there was only one option, we decided to wait to introduce the API until there was a second option to choose from.

sheidkamp avatar Mar 22 '24 14:03 sheidkamp

Merged into 1.17/main

sheidkamp avatar Mar 27 '24 13:03 sheidkamp

Zendesk ticket #3478 has been linked to this issue.

soloio-bot avatar Apr 02 '24 20:04 soloio-bot