opentelemetry-operator

Enable TLS between the collector and application pods

Open lisguo opened this issue 1 year ago • 7 comments

Component(s)

auto-instrumentation

Is your feature request related to a problem? Please describe.

The OpenTelemetry Collector's otlpreceiver supports a TLS configuration which can be configured like so:

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: mysite.local:55690
        tls:
          cert_file: server.crt
          key_file: server.key

The OpenTelemetry instrumentation libraries accept certificates that can be configured via environment variables:

OTEL_EXPORTER_OTLP_CERTIFICATE
OTEL_EXPORTER_OTLP_TRACES_CERTIFICATE
OTEL_EXPORTER_OTLP_METRICS_CERTIFICATE
OTEL_EXPORTER_OTLP_LOGS_CERTIFICATE

Can we have a feature in the opentelemetry-operator to enable TLS between the collector and my application pods using the above information?

Documentation for otel specification, java, python

Describe the solution you'd like

It would be great if I could install the operator via the Helm chart and provide a certificate resource. The operator could then inject the TLS configuration into the collector's OTLP receiver and inject the required environment variables into my application pods.

If my certificate gets refreshed or rotated in the resource, then the operator would be able to update the certificates used for TLS on my cluster.

Describe alternatives you've considered

Currently this can be enabled manually by setting the collector config to use a volume mount with the certificate, and then setting the environment variable in my application pod pointing to the same volume mount. Was looking to see if there is an easier way of enabling this without modifying my application spec.
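For reference, the application side of that manual setup might look roughly like the sketch below. The pod name, image, endpoint, Secret name, and mount path are all hypothetical placeholders; only the environment variable names come from the OpenTelemetry specification, and the Secret is assumed to contain a ca.crt key.

```yaml
# Application pod: mounts the CA certificate and points the SDK exporter at it.
apiVersion: v1
kind: Pod
metadata:
  name: my-app                            # hypothetical
spec:
  containers:
    - name: app
      image: example.com/my-app:latest    # hypothetical
      env:
        - name: OTEL_EXPORTER_OTLP_ENDPOINT
          value: https://otel-collector.observability.svc:4317   # hypothetical
        - name: OTEL_EXPORTER_OTLP_CERTIFICATE
          value: /etc/otel/certs/ca.crt   # path inside the mount below
      volumeMounts:
        - name: otel-ca
          mountPath: /etc/otel/certs
          readOnly: true
  volumes:
    - name: otel-ca
      secret:
        secretName: otel-collector-ca     # hypothetical; assumed to hold key ca.crt
```

The collector would mount its own key pair in the same way and reference the mounted files in the receiver's tls block.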

Additional context

Not sure what the right approach here would be. There might be some complexities with refreshing the certificate and potentially having to restart application pods? Open to ideas and suggestions.

lisguo, May 29 '24 16:05

We're actually currently working on this for the target allocator and collector connections (pr here). I think once that lands, it would be simpler to evaluate the lift to enable it for the instrumentations and collectors. Definitely a great idea, and something that would be valuable for users.

Do you expect this would only work automatically if the operator's user has cert-manager installed? Would you like a solution where a user can supply a certificate to be used for that connection?

jaronoff97, Jun 06 '24 16:06

Yes, I think the optimal solution is for a way to automatically refresh certificates for clusters using cert-manager. Otherwise customers would have to supply a certificate and manage the rotation themselves.

lisguo, Jun 24 '24 19:06

And to confirm, you wouldn't want a way for a user to supply and manage the rotation themselves (at least not initially)?

jaronoff97, Jun 24 '24 19:06

Yes, I think for clusters with cert manager, we can be opinionated to have the certificates rotate automatically -- maybe have a default expiry on the cert.

lisguo, Jun 24 '24 19:06

sounds good! Any chance you would be able to / want to work on this? If not, I can ask around the SIG / slack and see if anyone has some cycles.

jaronoff97, Jun 24 '24 20:06

Thanks for the help @jaronoff97. I would be open to working on this, however I am not sure how the certificate refresh would work with the auto instrumentation SDKs without restarting application pods (or maybe that's the only approach?)

It would be great to see if anyone with more knowledge can work on this.

lisguo, Jun 25 '24 18:06

I think we could set the rotation time to be 1/2 the lifetime of a cert (this is what Istio did). From there we could probably give feedback to users via webhooks about workloads that need rotation. We've had a few conversations about restarting application pods and I think we've decided that for now we won't be in the business of doing that (it can be a concern for some users).
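As an illustration of the half-lifetime idea for clusters that use cert-manager, a Certificate can express it with duration and renewBefore. This is only a sketch of what such a resource could look like, not something the operator creates today; the resource name, Secret name, DNS name, and issuer are hypothetical.

```yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: otel-collector-serving             # hypothetical
spec:
  secretName: otel-collector-tls           # hypothetical; cert-manager writes the key pair here
  duration: 2160h                          # 90-day lifetime
  renewBefore: 1080h                       # renew once half the lifetime remains
  dnsNames:
    - otel-collector.observability.svc     # hypothetical collector service name
  issuerRef:
    name: my-issuer                        # hypothetical Issuer in the same namespace
    kind: Issuer
```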

@pavolloffay @swiatekm-sumo what do you both think?

jaronoff97, Jun 25 '24 18:06

I think we could set the rotation time to be 1/2 the lifetime of a cert

I think the SDKs should watch the cert for changes and reload it on change; we should avoid restarting the workloads. I am not sure if this is implemented in OTEL, but I will double check on this.

For the collector, reloading is done via reload_interval: https://github.com/open-telemetry/opentelemetry-collector/blob/main/config/configtls/README.md
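A minimal sketch of what that could look like on the collector's OTLP receiver, assuming the key pair is mounted under a hypothetical /etc/otel/certs directory and that an hourly reload is acceptable:

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
        tls:
          cert_file: /etc/otel/certs/tls.crt   # hypothetical mount path
          key_file: /etc/otel/certs/tls.key    # hypothetical mount path
          reload_interval: 1h                  # re-read the files from disk every hour
```

Per the configtls README, the certificate is never reloaded if reload_interval is unset.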

I am starting to work on this. I will probably roll this out in phases:

  1. custom cert injection - allow specifying secrets/configmaps in the instrumentation CR. The operator will mount them to the application container and configure the SDK to use them
  2. auto-provisioning of certs by OTEL operator. The OTEL operator would provision certs and configure mTLS between instrumentation and collector

Phase 1 use-cases:

1. Certs are provided in a file already present in the app container

/var/run/secrets/kubernetes.io/serviceaccount/service-ca.crt

2. Certs are provided in secret/config map

The CA cert can be in a configmap or a secret; the key and cert should be in a secret.

OpenShift service CA

  • The ca is provisioned to a config map via service.beta.openshift.io/inject-cabundle: true
  • The key and cert are provisioned to a secret via service.beta.openshift.io/serving-cert-secret-name: <secret name>
  • Exporter TLS docs https://github.com/open-telemetry/opentelemetry-collector/blob/main/config/configtls/README.md
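For context, the two OpenShift annotations above are typically applied as in the sketch below. The resource names, selector, and port are hypothetical. Note that annotation values must be strings, so true is quoted.

```yaml
# ConfigMap that the service CA operator fills with the CA bundle (key: service-ca.crt).
apiVersion: v1
kind: ConfigMap
metadata:
  name: otel-service-ca                    # hypothetical
  annotations:
    service.beta.openshift.io/inject-cabundle: "true"
---
# Service whose serving key pair is written to the named Secret (keys: tls.crt, tls.key).
apiVersion: v1
kind: Service
metadata:
  name: otel-collector                     # hypothetical
  annotations:
    service.beta.openshift.io/serving-cert-secret-name: otel-collector-tls   # hypothetical
spec:
  selector:
    app: otel-collector                    # hypothetical
  ports:
    - name: otlp-grpc
      port: 4317
```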

Other docs

  • https://opentelemetry.io/docs/specs/otel/protocol/exporter/
  • https://opentelemetry.io/docs/specs/otel/configuration/sdk-environment-variables/

Instrumentation docs

  • java - https://github.com/open-telemetry/opentelemetry-java/blob/cdcc58cb87cd9ae546f285aa9e53428db1c7f9d8/sdk-extensions/autoconfigure/README.md?plain=1#L101
  • python https://opentelemetry-python.readthedocs.io/en/latest/sdk/environment_variables.html#opentelemetry.sdk.environment_variables.OTEL_EXPORTER_OTLP_CERTIFICATE

pavolloffay, Oct 07 '24 15:10