
Add serviceName Field to OpenTelemetryCollector CRD for StatefulSet Deployments

Open rewt opened this issue 6 months ago • 6 comments

Component(s)

collector

Is your feature request related to a problem? Please describe.

When deploying the OpenTelemetry Collector as a StatefulSet with the OpenTelemetry Operator in a GKE cluster, the OpenTelemetryCollector CRD does not allow setting the StatefulSet's serviceName to match the headless Service name. Kubernetes requires statefulset.spec.serviceName to match the headless Service name for pods to register DNS hostnames (e.g., <pod-name>.<serviceName>.<namespace>.svc.cluster.local). Without this, the loadbalancing exporter's kubernetes resolver (return_hostnames: true) fails to resolve pod hostnames, resulting in:

error: couldn't find the exporter for the endpoint ""

when configured as

exporters:
  loadbalancing:
    routing_key: "traceID"
    protocol:
      otlp:
        timeout: 1s
        tls:
          insecure: true
    resolver:
      k8s:
        service: opentelemetry-backend-collector-headless.plat-observe-dev
        timeout: 3s
        return_hostnames: true

For example, my headless Service (opentelemetry-backend-collector-headless.plat-observe-dev) doesn’t match the StatefulSet’s default serviceName (opentelemetry-backend-collector), preventing hostname resolution:

kubectl get statefulset opentelemetry-backend-collector -n plat-observe-dev -o jsonpath='{.spec.serviceName}'
opentelemetry-backend-collector

nslookup opentelemetry-backend-collector-0.plat-observe-dev.svc.cluster.local
** server can't find ...: NXDOMAIN

This breaks traceID routing, critical for tail-based sampling in my setup.
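For illustration, here is a minimal Go sketch (the function name is hypothetical, not operator code) of how Kubernetes composes StatefulSet pod DNS records, showing why the mismatch leaves the resolver's hostnames unresolvable:

```go
package main

import "fmt"

// podFQDN composes the DNS record Kubernetes registers for a StatefulSet pod:
// <pod>.<serviceName>.<namespace>.svc.<clusterDomain>. The record only exists
// when the StatefulSet's serviceName matches a headless Service selecting the pod.
func podFQDN(pod, serviceName, namespace, clusterDomain string) string {
	return fmt.Sprintf("%s.%s.%s.svc.%s", pod, serviceName, namespace, clusterDomain)
}

func main() {
	// Hostname the k8s resolver returns (built from the headless Service name):
	fmt.Println(podFQDN("opentelemetry-backend-collector-0",
		"opentelemetry-backend-collector-headless", "plat-observe-dev", "cluster.local"))
	// Hostname DNS would actually register today (StatefulSet's defaulted serviceName):
	fmt.Println(podFQDN("opentelemetry-backend-collector-0",
		"opentelemetry-backend-collector", "plat-observe-dev", "cluster.local"))
}
```

The two names differ only in the Service segment, which is exactly the segment statefulset.serviceName controls.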

Describe the solution you'd like

Add a serviceName field to the OpenTelemetryCollector CRD to specify the headless Service name when mode: statefulset. The Operator should set statefulset.serviceName to this value, ensuring DNS hostname registration.

apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: otel-backend
  namespace: plat-observe-dev
spec:
  name: "opentelemetry-backend-collector"
  serviceName: "opentelemetry-backend-collector-headless"
  mode: statefulset
  replicas: 9
  autoscaling:
    enabled: true
    minReplicas: 3
    maxReplicas: 10
  config:
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318
    processors:
      batch:
        send_batch_size: 20000
        timeout: 5s
        send_batch_max_size: 25000
    exporters:
      loadbalancing:
        routing_key: "traceID"
        protocol:
          otlp:
            timeout: 1s
            tls:
              insecure: true
        resolver:
          k8s:
            service: opentelemetry-backend-collector-headless.plat-observe-dev
            timeout: 3s
            return_hostnames: true
    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [batch]
          exporters: [loadbalancing]
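The operator-side defaulting this proposal implies could look roughly like the following sketch (simplified, hypothetical types — the real operator works with the apps/v1 and OpenTelemetryCollector API structs): honor an explicit serviceName, otherwise fall back to the headless Service the operator already creates.

```go
package main

import "fmt"

// CollectorSpec is a hypothetical, simplified stand-in for the CR spec.
type CollectorSpec struct {
	Name        string // metadata.name of the OpenTelemetryCollector CR
	ServiceName string // proposed field: headless Service to bind the StatefulSet to
}

// statefulSetServiceName sketches the requested behavior: use an explicit
// spec.serviceName when set, otherwise default to the headless Service the
// operator generates (<name>-collector-headless).
func statefulSetServiceName(cr CollectorSpec) string {
	if cr.ServiceName != "" {
		return cr.ServiceName
	}
	return cr.Name + "-collector-headless"
}

func main() {
	fmt.Println(statefulSetServiceName(CollectorSpec{
		Name:        "otel-backend",
		ServiceName: "opentelemetry-backend-collector-headless",
	}))
	fmt.Println(statefulSetServiceName(CollectorSpec{Name: "simplest"}))
}
```

Either way, statefulset.spec.serviceName would name a headless Service that actually selects the pods, so the DNS records get registered.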

Describe alternatives you've considered

Manual Workaround: Deploy the StatefulSet outside the Operator (using the official collector Helm chart) with a matching serviceName. This works, but gives up the Operator's benefits.

Static Resolver: Failed due to incorrect DNS domain (plat-observe.dev vs. svc.cluster.local).

cert-manager: Issued certificates with Service DNS name, resolving TLS errors, but doesn’t fix hostname resolution.

Istio Gateway: Considered for ingress, but doesn’t support pod-specific routing needed for traceID.

Additional context

Environment: GKE, Istio 1.19.10-asm.33 (CSM auto mode), mTLS STRICT, OpenTelemetry Collector 0.126.0, Helm chart open-telemetry/opentelemetry-collector.

Setup: StatefulSet with replicas: 9, autoscaling (minReplicas: 3, maxReplicas: 10), headless Service (opentelemetry-backend-collector-headless), loadbalancing exporter (routing_key: "traceID", kubernetes resolver).

Issue: Mismatch between statefulset.serviceName (opentelemetry-backend-collector) and headless Service name prevents DNS hostname registration, breaking kubernetes resolver.

TLS: Certificates lack Service DNS name, addressed with cert-manager, but hostname resolution remains critical.

rewt avatar May 24 '25 18:05 rewt

serviceName being set to the normal Service name instead of the headless Service we create is definitely a bug. This should be very straightforward to fix.

Exposing serviceName as a field for statefulset collectors also doesn't sound like a problem to me.

I'm not sure I understand your ask about injecting the serviceName into the collector configuration though. Could you elaborate on that? I assume you want this to be the DNS name of a different collector CR than the one you're configuring, in which case there's no way for the operator to know which one you want.

swiatekm avatar May 25 '25 12:05 swiatekm

Sorry for the confusion; I believe the only requirement is the ability to specify statefulset.serviceName in the collector CR. I've updated the requested solution to match.

Thanks for confirming the bug!

rewt avatar May 25 '25 13:05 rewt

@swiatekm is this something we're aiming to fix? I can work on adding this new serviceName field.

Would be great if you could assign this issue to me

vignesh-codes avatar May 26 '25 00:05 vignesh-codes

Kubernetes requires statefulset.serviceName to match the headless Service name for pods to register DNS hostnames

Could we make it implicit as opposed to ask people to configure it correctly?

pavolloffay avatar May 26 '25 12:05 pavolloffay

Kubernetes requires statefulset.serviceName to match the headless Service name for pods to register DNS hostnames

Could we make it implicit as opposed to ask people to configure it correctly?

Currently, in mode: statefulset, statefulset.serviceName appears to inherit the name of the StatefulSet, so if this were to be set automatically, maybe it's as simple as:

when mode: statefulset, use the Kubernetes headless Service name for statefulset.serviceName

rewt avatar May 26 '25 13:05 rewt

Kubernetes requires statefulset.serviceName to match the headless Service name for pods to register DNS hostnames

Could we make it implicit as opposed to ask people to configure it correctly?

We should make the default correct, and we can also make it configurable - the user may want to supply their own headless Service.

swiatekm avatar May 26 '25 13:05 swiatekm

There still seems to be a bug. With the following CR:

kubectl apply -f - <<EOF
apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: simplest
spec:
  mode: statefulset
  serviceName: foo
  config:
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318
    processors:
      memory_limiter:
        check_interval: 1s
        limit_percentage: 75
        spike_limit_percentage: 15

    exporters:
      debug: {}

    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [memory_limiter]
          exporters: [debug]
EOF

It creates

k get svc
NAME                            TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)             AGE
kubernetes                      ClusterIP   10.96.0.1       <none>        443/TCP             14m
simplest-collector              ClusterIP   10.96.139.157   <none>        4317/TCP,4318/TCP   5m3s
simplest-collector-headless     ClusterIP   None            <none>        4317/TCP,4318/TCP   5m3s
simplest-collector-monitoring   ClusterIP   10.96.129.34    <none>        8888/TCP            5m3s

In other words, the PR https://github.com/open-telemetry/opentelemetry-operator/pull/4041 allows setting the serviceName on the StatefulSet, but the name of the headless Service is not changed.
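To make the gap concrete, a small hypothetical check (not operator code) that a StatefulSet's serviceName actually names one of the Services the operator created:

```go
package main

import "fmt"

// hasMatchingService reports whether stsServiceName names one of the given
// Services; a hypothetical helper illustrating the inconsistency above.
func hasMatchingService(stsServiceName string, services []string) bool {
	for _, s := range services {
		if s == stsServiceName {
			return true
		}
	}
	return false
}

func main() {
	created := []string{
		"simplest-collector",
		"simplest-collector-headless",
		"simplest-collector-monitoring",
	}
	// serviceName: foo from the CR above; no Service named "foo" exists,
	// so no pod DNS records are registered under it.
	fmt.Println(hasMatchingService("foo", created))
}
```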

pavolloffay avatar Sep 30 '25 09:09 pavolloffay

In other words, the PR #4041 allows setting the serviceName on the statefulset but the name of the headless Service is not changed.

I don't consider that a bug, but I'm also fine not creating the headless Service if the user sets serviceName.

swiatekm avatar Sep 30 '25 10:09 swiatekm