OTEL exporter isn't setting the Status code and Span name is no longer being set to the API path
Gloo Edge Product
Enterprise
Gloo Edge Version
1.16.3
Kubernetes Version
1.27
Describe the bug
The collector is not setting the Status code, and the span Name is set to ingress instead of the API path.
In the example below, the span's Status code does not reflect the response code of the request (200, 404, or 503).
ScopeSpans #0
ScopeSpans SchemaURL:
InstrumentationScope
Span #0
    Trace ID       : f8ec84b4621ca212283fb21538d20148
    Parent ID      :
    ID             : 9ff651b21612cc51
    Name           : ingress
    Kind           : Server
    Start time     : 2024-04-02 20:04:10.917227 +0000 UTC
    End time       : 2024-04-02 20:04:10.918204 +0000 UTC
    Status code    : Unset
    Status message :
Attributes:
     -> node_id: Str(gateway-proxy-66658d8644-27j5t.gloo-system-new)
     -> zone: Str()
     -> guid:x-request-id: Str(89d85fd0-346c-9381-b38f-3ad4fe78fa80)
     -> http.url: Str(http://localhost:8080/helloworld)
     -> http.method: Str(GET)
     -> downstream_cluster: Str(-)
     -> user_agent: Str(curl/8.4.0)
     -> http.protocol: Str(HTTP/1.1)
     -> peer.address: Str(127.0.0.1)
     -> request_size: Str(0)
     -> response_size: Str(11)
     -> component: Str(proxy)
     -> http.status_code: Str(404)
     -> response_flags: Str(-)
{"kind": "exporter", "data_type": "traces", "name": "logging"}
Expected Behavior
Span "Status code" should reflect the Status code in the response (OK or Error when appropriate). Span "Name" should be set to the API path.
Steps to reproduce the bug
Follow the docs: https://docs.solo.io/gloo-edge/latest/guides/observability/tracing/otel/
Deploy OTel and Zipkin, configure a VirtualService with a directResponseAction, and try different statuses (200, 404, and 503) to confirm that the span Status code stays Unset instead of Ok or Error, per https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/trace/api.md#set-status
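For convenience, a minimal VirtualService sketch for this step (based on the full configuration shared later in this thread; swap the status value to exercise each case):
apiVersion: gateway.solo.io/v1
kind: VirtualService
metadata:
  name: default
  namespace: gloo-system
spec:
  virtualHost:
    domains:
      - '*'
    routes:
      - matchers:
          - prefix: /
        directResponseAction:
          status: 503   # swap in 200 or 404 to test the other cases
          body: 'hello world'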
Additional Environment Detail
No response
Additional Context
We are not adversely impacted by staying on Zipkin, but it slows down our adoption of OTLP, which is our standard for distributed tracing. We may get pushback from teams for staying on Zipkin for an extended period, but this is not causing any major production issues. At most, it is considered a blocker for enhancements to our observability tooling.
Zendesk ticket #3478 has been linked to this issue.
This bug seems to have already been fixed in upstream Envoy: https://github.com/envoyproxy/envoy/pull/33674
@nfuden to look at this issue for scope/estimation.
@nfuden to confirm whether, after backporting the fix, we still need to work on the span name, which currently defaults to ingress.
@asayah We cherry-picked the patch you mentioned in this comment from upstream Envoy into envoy-gloo, and it has been released in OSS Gloo 1.17.1. Would it be possible for you to test again against that version and let me know whether the issue persists?
Thanks!
We've had a lot of discussion on this in multiple locations, but to summarize where we stand, we merged the patch in Gloo and released it in OSS Gloo 1.17.1.
I used the following opentelemetry configuration:
apiVersion: v1
kind: ConfigMap
metadata:
  name: otel-agent-conf
  labels:
    app: opentelemetry
    component: otel-agent-conf
data:
  otel-agent-config: |
    receivers:
      otlp:
        protocols:
          grpc:
          http:
    exporters:
      otlp:
        endpoint: "otel-collector.default:4317"
        tls:
          insecure: true
        sending_queue:
          num_consumers: 4
          queue_size: 100
        retry_on_failure:
          enabled: true
    processors:
      batch:
      memory_limiter:
        # 80% of maximum memory up to 2G
        limit_mib: 400
        # 25% of limit up to 2G
        spike_limit_mib: 100
        check_interval: 5s
    extensions:
      zpages: {}
      memory_ballast:
        # Memory Ballast size should be max 1/3 to 1/2 of memory.
        size_mib: 165
    service:
      extensions: [zpages, memory_ballast]
      pipelines:
        traces:
          receivers: [otlp]
          processors: [memory_limiter, batch]
          exporters: [otlp]
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: otel-agent
  labels:
    app: opentelemetry
    component: otel-agent
spec:
  selector:
    matchLabels:
      app: opentelemetry
      component: otel-agent
  template:
    metadata:
      labels:
        app: opentelemetry
        component: otel-agent
    spec:
      containers:
        - command:
            - "/otelcol"
            - "--config=/conf/otel-agent-config.yaml"
          image: otel/opentelemetry-collector:0.68.0
          name: otel-agent
          resources:
            limits:
              cpu: 500m
              memory: 500Mi
            requests:
              cpu: 100m
              memory: 100Mi
          ports:
            - containerPort: 55679 # ZPages endpoint.
            - containerPort: 4317  # Default OpenTelemetry receiver port.
            - containerPort: 8888  # Metrics.
          volumeMounts:
            - name: otel-agent-config-vol
              mountPath: /conf
      volumes:
        - configMap:
            name: otel-agent-conf
            items:
              - key: otel-agent-config
                path: otel-agent-config.yaml
          name: otel-agent-config-vol
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: otel-collector-conf
  labels:
    app: opentelemetry
    component: otel-collector-conf
data:
  otel-collector-config: |
    receivers:
      otlp:
        protocols:
          grpc:
          http:
    processors:
      batch:
      memory_limiter:
        # 80% of maximum memory up to 2G
        limit_mib: 1500
        # 25% of limit up to 2G
        spike_limit_mib: 512
        check_interval: 5s
    extensions:
      zpages: {}
      memory_ballast:
        # Memory Ballast size should be max 1/3 to 1/2 of memory.
        size_mib: 683
    exporters:
      logging:
        loglevel: debug
      zipkin:
        endpoint: "http://zipkin:9411/api/v2/spans"
        tls:
          insecure: true
    service:
      extensions: [zpages, memory_ballast]
      pipelines:
        metrics:
          receivers: [otlp]
          processors: [memory_limiter, batch]
          exporters: [logging]
        traces:
          receivers: [otlp]
          processors: [memory_limiter, batch]
          exporters: [logging, zipkin]
---
apiVersion: v1
kind: Service
metadata:
  name: otel-collector
  labels:
    app: opentelemetry
    component: otel-collector
spec:
  ports:
    - name: otlp-grpc # Default endpoint for OpenTelemetry gRPC receiver.
      port: 4317
      protocol: TCP
      targetPort: 4317
    - name: otlp-http # Default endpoint for OpenTelemetry HTTP receiver.
      port: 4318
      protocol: TCP
      targetPort: 4318
    - name: metrics # Default endpoint for querying metrics.
      port: 8888
  selector:
    component: otel-collector
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: otel-collector
  labels:
    app: opentelemetry
    component: otel-collector
spec:
  selector:
    matchLabels:
      app: opentelemetry
      component: otel-collector
  minReadySeconds: 5
  progressDeadlineSeconds: 120
  replicas: 1 # TODO - adjust this to your own requirements
  template:
    metadata:
      labels:
        app: opentelemetry
        component: otel-collector
    spec:
      containers:
        - command:
            - "/otelcol"
            - "--config=/conf/otel-collector-config.yaml"
          image: otel/opentelemetry-collector:0.68.0
          name: otel-collector
          resources:
            limits:
              cpu: 1
              memory: 2Gi
            requests:
              cpu: 200m
              memory: 400Mi
          ports: # Comment out ports for platforms as needed.
            - containerPort: 55679 # Default endpoint for ZPages.
            - containerPort: 4317  # Default endpoint for OpenTelemetry receiver.
            - containerPort: 14250 # Default endpoint for Jaeger gRPC receiver.
            - containerPort: 14268 # Default endpoint for Jaeger HTTP receiver.
            - containerPort: 9411  # Default endpoint for Zipkin receiver.
            - containerPort: 8888  # Default endpoint for querying metrics.
          volumeMounts:
            - name: otel-collector-config-vol
              mountPath: /conf
      volumes:
        - configMap:
            name: otel-collector-conf
            items:
              - key: otel-collector-config
                path: otel-collector-config.yaml
          name: otel-collector-config-vol
And this Gloo config:
apiVersion: gloo.solo.io/v1
kind: Upstream
metadata:
  name: "opentelemetry-collector"
  namespace: gloo-system
spec:
  # REQUIRED FOR OPENTELEMETRY COLLECTION
  useHttp2: true
  static:
    hosts:
      - addr: "otel-collector"
        port: 4317
---
apiVersion: gateway.solo.io/v1
kind: Gateway
metadata:
  labels:
    app: gloo
  name: gateway-proxy
  namespace: gloo-system
spec:
  bindAddress: '::'
  bindPort: 8080
  httpGateway:
    options:
      httpConnectionManagerSettings:
        tracing:
          openTelemetryConfig:
            collectorUpstreamRef:
              namespace: "gloo-system"
              name: "opentelemetry-collector"
---
apiVersion: gateway.solo.io/v1
kind: VirtualService
metadata:
  name: default
  namespace: gloo-system
spec:
  virtualHost:
    domains:
      - '*'
    routes:
      - matchers:
          - prefix: /
        directResponseAction:
          status: 500
          body: 'hello world'
With the patch, the status does get set to Error if the direct response action is 500. However, all other requests (e.g. status code = 200) still result in Unset. This is intended by design, as documented in the OpenTelemetry patch (see here), and the semantic conventions for Status also corroborate this behaviour (see here and here). They state:
For HTTP status codes in the 4xx range span status MUST be left unset in case of SpanKind.SERVER and MUST be set to Error in case of SpanKind.CLIENT.
For HTTP status codes in the 5xx range, as well as any other code the client failed to interpret, span status MUST be set to Error.
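Applied to the gateway's server-side spans, a hypothetical route fragment like the one below (paths and bodies are made up; it slots into the VirtualService shared above) should therefore yield Unset for the 200 and 404 routes and Error for the 503 route:
    routes:
      - matchers:
          - prefix: /ok
        directResponseAction:
          status: 200   # 2xx on a SERVER span => Status stays Unset
          body: 'ok'
      - matchers:
          - prefix: /missing
        directResponseAction:
          status: 404   # 4xx on a SERVER span => Status MUST be left Unset
          body: 'not found'
      - matchers:
          - prefix: /broken
        directResponseAction:
          status: 503   # 5xx => Status MUST be set to Error
          body: 'unavailable'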
Regarding the Name, the patch in question does not alter that at all, so picking up that piece of work would require a fairly non-trivial patch to upstream Envoy. We need to make sure that this is in scope for the issue, because it appears to be more work than we had initially planned to take on here. There doesn't seem to be an open issue addressing this in upstream Envoy at this time.
From user feedback (see this thread), it seems that the existing behaviour regarding Status is consistent with other observability products, so this fix fits user needs and should be applied.
From the same comment thread, it seems that we do not yet know exactly what behaviour we expect from the Name field. But existing workarounds should suffice for the time being, so in my opinion it will probably be okay to simply release Gloo with the patch that fixes the Status behaviour and then close this. We can create a new issue once we've established exactly what we want for the Name field.
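As a stopgap for the span name, one workaround worth trying is a route-level descriptor. This is an assumption on my part, not something verified here: Gloo's routeDescriptor option maps to Envoy's route decorator, which conventionally overrides the span's operation name, but whether the OTel tracer honours it should be confirmed. A sketch:
    routes:
      - matchers:
          - prefix: /helloworld
        options:
          tracing:
            routeDescriptor: helloworld-route   # hypothetical descriptor; verify the OTel tracer honours it
        directResponseAction:
          status: 200
          body: 'hello world'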
Since we have successfully backported the patch in question and released it (OSS Gloo 1.17.1 and Gloo-EE 1.17.0), I feel that it's safe to close this issue. The Status code behaviour is fixed by the patch, and while the Name field remains unfixed, we can open a new issue once we have a clearer direction on how we want to handle it.
I am in the process of updating this in our documentation, but outside of that, I feel that it is safe to mark this as resolved.