opentelemetry-operator
Add serviceName Field to OpenTelemetryCollector CRD for StatefulSet Deployments
Component(s)
collector
Is your feature request related to a problem? Please describe.
When deploying the OpenTelemetry Collector as a StatefulSet using the OpenTelemetry Operator in a GKE cluster, the OpenTelemetryCollector CRD does not allow setting the StatefulSet's serviceName to match the headless Service name. Kubernetes requires statefulset.serviceName to match the headless Service name for pods to register DNS hostnames (e.g., pod-0.service-name.namespace.svc.cluster.local). As a result, the loadbalancing exporter fails with

error: couldn't find the exporter for the endpoint ""

when configured as:
```yaml
exporters:
  loadbalancing:
    routing_key: "traceID"
    protocol:
      otlp:
        timeout: 1s
        tls:
          insecure: true
    resolver:
      k8s:
        service: opentelemetry-backend-collector-headless.plat-observe-dev
        timeout: 3s
        return_hostnames: true
```
For example, my headless Service (opentelemetry-backend-collector-headless.plat-observe-dev) doesn’t match the StatefulSet’s default serviceName (opentelemetry-backend-collector), preventing hostname resolution:
```shell
$ kubectl get statefulset opentelemetry-backend-collector -n plat-observe-dev -o jsonpath='{.spec.serviceName}'
opentelemetry-backend-collector

$ nslookup opentelemetry-backend-collector-0.plat-observe-dev.svc.cluster.local
** server can't find ...: NXDOMAIN
```
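To illustrate the requirement, here is a minimal, self-contained headless Service / StatefulSet pair (all names here are placeholders, not from my setup): pods only get per-pod DNS records when serviceName equals the headless Service's name.

```yaml
# Headless Service: clusterIP must be None for per-pod DNS records.
apiVersion: v1
kind: Service
metadata:
  name: example-headless        # placeholder name
spec:
  clusterIP: None
  selector:
    app: example
  ports:
    - port: 4317
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: example
spec:
  # Must equal the headless Service's name above, or pods never get
  # <pod>.<service>.<namespace>.svc.cluster.local DNS records.
  serviceName: example-headless
  replicas: 2
  selector:
    matchLabels:
      app: example
  template:
    metadata:
      labels:
        app: example
    spec:
      containers:
        - name: collector
          image: otel/opentelemetry-collector:latest
          ports:
            - containerPort: 4317
```

With this pairing, example-0.example-headless.&lt;namespace&gt;.svc.cluster.local resolves; with a mismatched serviceName, it does not.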
This breaks traceID routing, critical for tail-based sampling in my setup.
Describe the solution you'd like
Add a serviceName field to the OpenTelemetryCollector CRD to specify the headless Service name when mode: statefulset. The Operator should set statefulset.serviceName to this value, ensuring DNS hostname registration.
```yaml
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: otel-backend
  namespace: plat-observe-dev
spec:
  name: "opentelemetry-backend-collector"
  serviceName: "opentelemetry-backend-collector-headless"
  mode: statefulset
  replicas: 9
  autoscaling:
    enabled: true
    minReplicas: 3
    maxReplicas: 10
  config:
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318
    processors:
      batch:
        send_batch_size: 20000
        timeout: 5s
        send_batch_max_size: 25000
    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [batch]
```
Describe alternatives you've considered
- Manual workaround: deploy the StatefulSet outside the Operator (via the official collector Helm chart) with a matching serviceName. This works, but loses the Operator's benefits.
- Static resolver: failed due to an incorrect DNS domain (plat-observe.dev vs. svc.cluster.local).
- cert-manager: issued certificates with the Service DNS name, resolving TLS errors, but doesn't fix hostname resolution.
- Istio Gateway: considered for ingress, but doesn't support the pod-specific routing needed for traceID.
Additional context
- Environment: GKE, Istio 1.19.10-asm.33 (CSM auto mode), mTLS STRICT, OpenTelemetry Collector 0.126.0, Helm chart open-telemetry/opentelemetry-collector.
- Setup: StatefulSet with replicas: 9, autoscaling (minReplicas: 3, maxReplicas: 10), headless Service (opentelemetry-backend-collector-headless), loadbalancing exporter (routing_key: "traceID", kubernetes resolver).
- Issue: mismatch between statefulset.serviceName (opentelemetry-backend-collector) and the headless Service name prevents DNS hostname registration, breaking the kubernetes resolver.
- TLS: certificates lack the Service DNS name; addressed with cert-manager, but hostname resolution remains critical.
serviceName being set to the normal Service name instead of the headless Service we create is definitely a bug. This should be very straightforward to fix.
Exposing serviceName as a field for statefulset collectors also doesn't sound like a problem to me.
I'm not sure I understand your ask about injecting the serviceName into the collector configuration though. Could you elaborate on that? I assume you want this to be the DNS name of a different collector CR than the one you're configuring, in which case there's no way for the operator to know which one you want.
Sorry for the confusion, I believe the only requirement is the ability to specify statefulset.serviceName in the collector CR. I've updated the requested solution to match.
Thanks for confirming the bug!
@swiatekm is this something we're aiming to fix? I can work on adding the new serviceName field.
It would be great if you could assign this issue to me.
> Kubernetes requires statefulset.serviceName to match the headless Service name for pods to register DNS hostnames

Could we make it implicit as opposed to asking people to configure it correctly?
> Kubernetes requires statefulset.serviceName to match the headless Service name for pods to register DNS hostnames
>
> Could we make it implicit as opposed to asking people to configure it correctly?

Currently in mode: deployment the statefulset.serviceName appears to inherit the name of the StatefulSet, so if this were to be specified automatically, maybe it's as simple as: when mode: statefulset, use the k8s headless Service name for statefulset.serviceName.
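The implicit default suggested above could be sketched as a small defaulting helper. This is an illustrative sketch only, not the operator's actual code; the function names are made up, and the `<name>-collector-headless` convention is taken from the Service names shown later in this thread.

```go
package main

import "fmt"

// headlessServiceName mirrors the operator's observed naming convention:
// a collector CR named "simplest" gets a "simplest-collector-headless" Service.
// (Hypothetical helper; name and signature are not from the operator.)
func headlessServiceName(collectorName string) string {
	return fmt.Sprintf("%s-collector-headless", collectorName)
}

// statefulSetServiceName returns what the operator would set on
// statefulset.spec.serviceName: the user-supplied serviceName if present,
// otherwise the generated headless Service name.
func statefulSetServiceName(collectorName, serviceNameOverride string) string {
	if serviceNameOverride != "" {
		return serviceNameOverride
	}
	return headlessServiceName(collectorName)
}

func main() {
	fmt.Println(statefulSetServiceName("simplest", ""))    // default: simplest-collector-headless
	fmt.Println(statefulSetServiceName("simplest", "foo")) // explicit override: foo
}
```

The key property is that the default is correct on its own, and an explicit serviceName in the CR always wins.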
> Kubernetes requires statefulset.serviceName to match the headless Service name for pods to register DNS hostnames
>
> Could we make it implicit as opposed to asking people to configure it correctly?

We should make the default correct, and we can also make it configurable; the user may want to supply their own headless Service.
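Supplying your own headless Service might then look like the following sketch. All names here are illustrative; it assumes the proposed spec.serviceName field, and the selector assumes the operator's app.kubernetes.io/instance: &lt;namespace&gt;.&lt;name&gt; pod label.

```yaml
# User-managed headless Service (illustrative names).
apiVersion: v1
kind: Service
metadata:
  name: my-collector-headless
spec:
  clusterIP: None
  selector:
    app.kubernetes.io/instance: default.my-collector  # assumption: operator's pod label
  ports:
    - name: otlp-grpc
      port: 4317
---
apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: my-collector
spec:
  mode: statefulset
  serviceName: my-collector-headless  # proposed field: points at the Service above
  config:
    receivers:
      otlp:
        protocols:
          grpc: {}
    exporters:
      debug: {}
    service:
      pipelines:
        traces:
          receivers: [otlp]
          exporters: [debug]
```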
There seems to be a bug. With the following CR
```shell
kubectl apply -f - <<EOF
apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: simplest
spec:
  mode: statefulset
  serviceName: foo
  config:
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318
    processors:
      memory_limiter:
        check_interval: 1s
        limit_percentage: 75
        spike_limit_percentage: 15
    exporters:
      debug: {}
    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [memory_limiter]
          exporters: [debug]
EOF
```
It creates
```shell
$ kubectl get svc
NAME                            TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)             AGE
kubernetes                      ClusterIP   10.96.0.1       <none>        443/TCP             14m
simplest-collector              ClusterIP   10.96.139.157   <none>        4317/TCP,4318/TCP   5m3s
simplest-collector-headless     ClusterIP   None            <none>        4317/TCP,4318/TCP   5m3s
simplest-collector-monitoring   ClusterIP   10.96.129.34    <none>        8888/TCP            5m3s
```
In other words, PR https://github.com/open-telemetry/opentelemetry-operator/pull/4041 allows setting serviceName on the StatefulSet, but the name of the generated headless Service is not changed to match.
I don't consider that a bug, but I'm also fine not creating the headless Service if the user sets serviceName.