
OTEL exporter isn't setting the Status code and Span name is no longer being set to the API path

Open sadieleob opened this issue 10 months ago • 2 comments

Gloo Edge Product

Enterprise

Gloo Edge Version

1.16.3

Kubernetes Version

1.27

Describe the bug

The collector is not setting the Status code and Name (the name is set to ingress instead of the API path).

In the example below, the Span Status code does not reflect the response codes 200, 404, or 503 returned for the request.

ScopeSpans #0
ScopeSpans SchemaURL:
InstrumentationScope
Span #0
    Trace ID       : f8ec84b4621ca212283fb21538d20148
    Parent ID      :
    ID             : 9ff651b21612cc51
    Name           : ingress
    Kind           : Server
    Start time     : 2024-04-02 20:04:10.917227 +0000 UTC
    End time       : 2024-04-02 20:04:10.918204 +0000 UTC
    Status code    : Unset
    Status message :
Attributes:
     -> node_id: Str(gateway-proxy-66658d8644-27j5t.gloo-system-new)
     -> zone: Str()
     -> guid:x-request-id: Str(89d85fd0-346c-9381-b38f-3ad4fe78fa80)
     -> http.url: Str(http://localhost:8080/helloworld)
     -> http.method: Str(GET)
     -> downstream_cluster: Str(-)
     -> user_agent: Str(curl/8.4.0)
     -> http.protocol: Str(HTTP/1.1)
     -> peer.address: Str(127.0.0.1)
     -> request_size: Str(0)
     -> response_size: Str(11)
     -> component: Str(proxy)
     -> http.status_code: Str(404)
     -> response_flags: Str(-)
{"kind": "exporter", "data_type": "traces", "name": "logging"}

Expected Behavior

Span "Status code" should reflect the Status code in the response (OK or Error when appropriate). Span "Name" should be set to the API path.

Steps to reproduce the bug

Follow the docs: https://docs.solo.io/gloo-edge/latest/guides/observability/tracing/otel/

Deploy OTel and Zipkin, configure a VirtualService with directResponseAction, and try different statuses (200, 404, and 503) to confirm that the Span Status code stays Unset instead of Ok or Error, as described in https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/trace/api.md#set-status
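For reference, a minimal VirtualService sketch that exercises all three statuses (the route prefixes and response bodies are illustrative, not part of the original setup):

apiVersion: gateway.solo.io/v1
kind: VirtualService
metadata:
  name: status-repro
  namespace: gloo-system
spec:
  virtualHost:
    domains:
      - '*'
    routes:
      # Each route returns a fixed status via directResponseAction so the span
      # Status code reported by the collector can be compared to the response code.
      - matchers:
          - prefix: /ok
        directResponseAction:
          status: 200
          body: 'ok'
      - matchers:
          - prefix: /missing
        directResponseAction:
          status: 404
          body: 'not found'
      - matchers:
          - prefix: /unavailable
        directResponseAction:
          status: 503
          body: 'unavailable'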

Additional Environment Detail

No response

Additional Context

We are not adversely impacted by staying on Zipkin, but it slows down our adoption of OTLP which is our standard for distributed tracing. We may get pushback from teams for us staying on Zipkin for an extended period of time, but this is not causing any major production issues. At most, it is considered a blocker for enhancements to observability tools.

sadieleob Apr 03 '24 00:04

Zendesk ticket #3478 has been linked to this issue.

soloio-bot Apr 03 '24 00:04

This bug seems to have already been fixed in Envoy: https://github.com/envoyproxy/envoy/pull/33674

asayah Jun 28 '24 22:06

@nfuden to look at this issue for scope/estimation.

nrjpoddar Jul 02 '24 19:07

@nfuden to confirm: after backporting the fix, we still need to work on the span name, which currently defaults to ingress.

asayah Jul 11 '24 16:07

@asayah We cherry-picked the patch you mentioned in this comment from upstream Envoy into envoy-gloo, and it has been released in OSS Gloo version 1.17.1. Would it be possible for you to test again against that version and let me know whether the issue still persists?

Thanks!

ashishb-solo Jul 24 '24 14:07

We've had a lot of discussion on this in multiple locations, but to summarize where we stand, we merged the patch in Gloo and released it in OSS Gloo 1.17.1.

I used the following opentelemetry configuration:

apiVersion: v1
kind: ConfigMap
metadata:
  name: otel-agent-conf
  labels:
    app: opentelemetry
    component: otel-agent-conf
data:
  otel-agent-config: |
    receivers:
      otlp:
        protocols:
          grpc:
          http:
    exporters:
      otlp:
        endpoint: "otel-collector.default:4317"
        tls:
          insecure: true
        sending_queue:
          num_consumers: 4
          queue_size: 100
        retry_on_failure:
          enabled: true
    processors:
      batch:
      memory_limiter:
        # 80% of maximum memory up to 2G
        limit_mib: 400
        # 25% of limit up to 2G
        spike_limit_mib: 100
        check_interval: 5s
    extensions:
      zpages: {}
      memory_ballast:
        # Memory Ballast size should be max 1/3 to 1/2 of memory.
        size_mib: 165
    service:
      extensions: [zpages, memory_ballast]
      pipelines:
        traces:
          receivers: [otlp]
          processors: [memory_limiter, batch]
          exporters: [otlp]
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: otel-agent
  labels:
    app: opentelemetry
    component: otel-agent
spec:
  selector:
    matchLabels:
      app: opentelemetry
      component: otel-agent
  template:
    metadata:
      labels:
        app: opentelemetry
        component: otel-agent
    spec:
      containers:
      - command:
          - "/otelcol"
          - "--config=/conf/otel-agent-config.yaml"
        image: otel/opentelemetry-collector:0.68.0
        name: otel-agent
        resources:
          limits:
            cpu: 500m
            memory: 500Mi
          requests:
            cpu: 100m
            memory: 100Mi
        ports:
        - containerPort: 55679 # ZPages endpoint.
        - containerPort: 4317 # Default OpenTelemetry receiver port.
        - containerPort: 8888  # Metrics.
        volumeMounts:
        - name: otel-agent-config-vol
          mountPath: /conf
      volumes:
        - configMap:
            name: otel-agent-conf
            items:
              - key: otel-agent-config
                path: otel-agent-config.yaml
          name: otel-agent-config-vol
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: otel-collector-conf
  labels:
    app: opentelemetry
    component: otel-collector-conf
data:
  otel-collector-config: |
    receivers:
      otlp:
        protocols:
          grpc:
          http:
    processors:
      batch:
      memory_limiter:
        # 80% of maximum memory up to 2G
        limit_mib: 1500
        # 25% of limit up to 2G
        spike_limit_mib: 512
        check_interval: 5s
    extensions:
      zpages: {}
      memory_ballast:
        # Memory Ballast size should be max 1/3 to 1/2 of memory.
        size_mib: 683
    exporters:
      logging:
        loglevel: debug
      zipkin:
        endpoint: "http://zipkin:9411/api/v2/spans"
        tls:
          insecure: true
    service:
      extensions: [zpages, memory_ballast]
      pipelines:
        metrics:
          receivers: [otlp]
          processors: [memory_limiter, batch]
          exporters: [logging]
        traces:
          receivers: [otlp]
          processors: [memory_limiter, batch]
          exporters: [logging, zipkin]
---
apiVersion: v1
kind: Service
metadata:
  name: otel-collector
  labels:
    app: opentelemetry
    component: otel-collector
spec:
  ports:
  - name: otlp-grpc # Default endpoint for OpenTelemetry gRPC receiver.
    port: 4317
    protocol: TCP
    targetPort: 4317
  - name: otlp-http # Default endpoint for OpenTelemetry HTTP receiver.
    port: 4318
    protocol: TCP
    targetPort: 4318
  - name: metrics # Default endpoint for querying metrics.
    port: 8888
  selector:
    component: otel-collector
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: otel-collector
  labels:
    app: opentelemetry
    component: otel-collector
spec:
  selector:
    matchLabels:
      app: opentelemetry
      component: otel-collector
  minReadySeconds: 5
  progressDeadlineSeconds: 120
  replicas: 1 #TODO - adjust this to your own requirements
  template:
    metadata:
      labels:
        app: opentelemetry
        component: otel-collector
    spec:
      containers:
      - command:
          - "/otelcol"
          - "--config=/conf/otel-collector-config.yaml"
        image: otel/opentelemetry-collector:0.68.0
        name: otel-collector
        resources:
          limits:
            cpu: 1
            memory: 2Gi
          requests:
            cpu: 200m
            memory: 400Mi
        ports: # Comment out ports for platforms as needed.
        - containerPort: 55679 # Default endpoint for ZPages.
        - containerPort: 4317 # Default endpoint for OpenTelemetry receiver.
        - containerPort: 14250 # Default endpoint for Jaeger gRPC receiver.
        - containerPort: 14268 # Default endpoint for Jaeger HTTP receiver.
        - containerPort: 9411 # Default endpoint for Zipkin receiver.
        - containerPort: 8888  # Default endpoint for querying metrics.
        volumeMounts:
        - name: otel-collector-config-vol
          mountPath: /conf
      volumes:
        - configMap:
            name: otel-collector-conf
            items:
              - key: otel-collector-config
                path: otel-collector-config.yaml
          name: otel-collector-config-vol

And this gloo config:

apiVersion: gloo.solo.io/v1
kind: Upstream
metadata:
  name: "opentelemetry-collector"
  namespace: gloo-system
spec:
  # REQUIRED FOR OPENTELEMETRY COLLECTION
  useHttp2: true
  static:
    hosts:
      - addr: "otel-collector"
        port: 4317
---
apiVersion: gateway.solo.io/v1
kind: Gateway
metadata:
  labels:
    app: gloo
  name: gateway-proxy
  namespace: gloo-system
spec:
  bindAddress: '::'
  bindPort: 8080
  httpGateway:
    options:
      httpConnectionManagerSettings:
        tracing:
          openTelemetryConfig:
            collectorUpstreamRef:
              namespace: "gloo-system"
              name: "opentelemetry-collector"
---
apiVersion: gateway.solo.io/v1
kind: VirtualService
metadata:
  name: default
  namespace: gloo-system
spec:
  virtualHost:
    domains:
      - '*'
    routes:
      - matchers:
         - prefix: /
        directResponseAction:
          status: 500
          body: 'hello world'

With the patch, the status does get set to Error if the direct response action is 500. However, all other requests (i.e. status code = 200) still result in Unset behaviour. This is by design, as documented in the OpenTelemetry patch (see here), and the semantic conventions for Status corroborate this behaviour (see here and here). They state:

For HTTP status codes in the 4xx range span status MUST be left unset in case of SpanKind.SERVER and MUST be set to Error in case of SpanKind.CLIENT.

For HTTP status codes in the 5xx range, as well as any other code the client failed to interpret, span status MUST be set to Error.
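Applied to the server-side spans emitted by the gateway, those rules work out roughly as follows (an illustrative restatement of the spec text above, not collector output):

- HTTP 200: span Status code stays Unset
- HTTP 404: span Status code stays Unset (4xx is left Unset for SpanKind.SERVER)
- HTTP 503: span Status code is set to Error (5xx is always Error)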

Regarding the Name, the patch in question does not alter that at all, so picking up that piece of work would require a fairly non-trivial patch to upstream Envoy. We need to make sure that this is in scope for the issue because it appears to be more work than we had initially planned to take on here. There doesn't seem to be an open issue addressing this in upstream Envoy at this time.

ashishb-solo Jul 25 '24 16:07

From user feedback (see this thread), it seems that the existing behaviour regarding Status is consistent with other observability products, so this fix seems to fit user needs and should be applied.

From the same comment thread, it seems that we do not yet know exactly what kind of behaviour we expect from the Name field. But it seems that existing workarounds will suffice for the time being, so in my opinion, it will probably be okay to simply release Gloo with the patch that fixes the Status behaviour and then we can close this. We can create a new issue once we've established exactly what we desire regarding the Name field.
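For reference, one possible interim workaround on the Name side is to label routes via Gloo's route-level tracing settings. The sketch below assumes that the routeDescriptor option behaves the same way with the OTel driver as with other tracers (not verified as part of this issue), and the descriptor value is purely illustrative:

apiVersion: gateway.solo.io/v1
kind: VirtualService
metadata:
  name: default
  namespace: gloo-system
spec:
  virtualHost:
    domains:
      - '*'
    routes:
      - matchers:
          - prefix: /helloworld
        options:
          tracing:
            # Illustrative descriptor; intended to make spans for this route
            # identifiable instead of relying on the default "ingress" name.
            routeDescriptor: helloworld-route
        directResponseAction:
          status: 200
          body: 'hello world'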

ashishb-solo Jul 26 '24 16:07

Since we have successfully backported the patch in question and released it (OSS Gloo 1.17.1 and Gloo-EE 1.17.0), I feel that it's safe to close this issue. The Status code behaviour is fixed by the patch, and while the Name field remains unfixed, we can open a new issue once we have a clearer direction on how we want to do it.

I am in the process of updating this in our documentation, but outside of that, I feel that it is safe to mark this as resolved.

ashishb-solo Jul 26 '24 19:07