opentelemetry-operator icon indicating copy to clipboard operation
opentelemetry-operator copied to clipboard

fix(target-allocator): Extend target allocator CA certificate duration to prevent renewal race

Open schahal opened this issue 1 month ago • 6 comments

Description:

See https://github.com/open-telemetry/opentelemetry-operator/issues/4441 for details

The root cause (from the 3 separate accounts in the thread) seems to be that all three TargetAllocator certificates (CA, server, client) were created at the same time with the same 90-day duration.

When they all came up for renewal simultaneously (around day 60), there was a race condition where, *for example:

  • One of either the Server or Client cert gets renewed and signed by the original CA
  • The CA gets renewed second (new CA cert + key)
  • The other (e.g., Client) cert gets renewed and signed by the new CA
  • This causes "certificate signed by unknown authority" errors

By giving the CA a 1-year duration while client/server certs keep the default 90-day duration, we ensure:

  • The CA remains stable (e.g., longer period between renewals) while client/server certs renew
  • Client and server certs are always signed by the same CA (unless there's a very tiny chance that they renew at same time as CA's 11-month mark)

Link to tracking Issue(s):

  • Resolves: https://github.com/open-telemetry/opentelemetry-operator/issues/4441

Testing:

  • Updated test to validate duration
go test ./internal/manifests/targetallocator/ -run Certificate -v
=== RUN   TestCACertificate
=== RUN   TestCACertificate/Default_CA_Certificate
--- PASS: TestCACertificate (0.00s)
    --- PASS: TestCACertificate/Default_CA_Certificate (0.00s)
=== RUN   TestServingCertificate
=== RUN   TestServingCertificate/Default_Serving_Certificate
--- PASS: TestServingCertificate (0.00s)
    --- PASS: TestServingCertificate/Default_Serving_Certificate (0.00s)
=== RUN   TestClientCertificate
=== RUN   TestClientCertificate/Default_Client_Certificate
--- PASS: TestClientCertificate (0.00s)
    --- PASS: TestClientCertificate/Default_Client_Certificate (0.00s)
PASS
ok  	github.com/open-telemetry/opentelemetry-operator/internal/manifests/targetallocator	(cached)

Documentation: <Describe the documentation added.>

Added a changelog gen entry

schahal avatar Nov 24 '25 21:11 schahal

E2E Test Results

 34 files  ±0  227 suites  ±0   1h 58m 24s ⏱️ - 1m 23s  90 tests ±0   90 ✅ ±0  0 💤 ±0  0 ❌ ±0  231 runs  ±0  231 ✅ ±0  0 💤 ±0  0 ❌ ±0 

Results for commit 1186ee27. ± Comparison against base commit a8ba3092.

:recycle: This comment has been updated with latest results.

github-actions[bot] avatar Nov 24 '25 21:11 github-actions[bot]

Client and server certs are always signed by the same CA (unless there's a very tiny chance that they renew at same time as CA's 11-month mark)

Would it make sense to set the CA certificate grace period (spec.renewBefore) to be longer than the client certficiate duration? Then it'll be impossible to renew a client certificate with a CA certificate with a shorter remaining duration.

swiatekm avatar Nov 25 '25 15:11 swiatekm

Updated!

I'd accept if you compile the operator with these values set very low and verify that it works in a test cluster.

Looks like cert-manager has a minimum duration of 1 hour enforced by its admission webhook:

"admission webhook \"webhook.cert-manager.io\" denied the request: spec.duration: Invalid value: 24m0s: certificate duration must be greater than 1h0m0s"

So made it 1h, updated the image and side-loaded into my KIND cluster... not waiting for renewal because of the time, but validated certs created with expected renewal dates:

$ kubectl get certificate -n cert-test -o json | jq -r '.items[] | "\(.metadata.name):\n  Duration: \(.spec.duration)\n  RenewBefore: \(.spec.renewBefore // "not set")\n  NotBefore: \(.status.notBefore)\n  NotAfter: \(.status.notAfter)\n  RenewalTime: \(.status.renewalTime)\n"'
cert-test-collector-ca-cert:
  Duration: 12h0m0s
  RenewBefore: 4h0m0s
  NotBefore: 2025-11-30T16:46:04Z
  NotAfter: 2025-12-01T04:46:04Z
  RenewalTime: 2025-12-01T00:46:04Z

cert-test-collector-ta-client-cert:
  Duration: 1h30m0s
  RenewBefore: not set
  NotBefore: 2025-11-30T16:46:07Z
  NotAfter: 2025-11-30T18:16:07Z
  RenewalTime: 2025-11-30T17:46:07Z

cert-test-collector-ta-server-cert:
  Duration: 1h30m0s
  RenewBefore: not set
  NotBefore: 2025-11-30T16:46:07Z
  NotAfter: 2025-11-30T18:16:07Z
  RenewalTime: 2025-11-30T17:46:07Z

schahal avatar Nov 30 '25 17:11 schahal

Thanks, all! Anything extra I need to get this merged?

(I was personally affected by this in September and then again couple weeks ago, and would love for this to be in the next version of the operator... the manual workaround is quite disruptive)

schahal avatar Dec 02 '25 17:12 schahal

I'd like @pavolloffay to approve too before we merge.

swiatekm avatar Dec 02 '25 17:12 swiatekm

@pavolloffay 👍 ? (see this comment for summary to your original comment)

schahal avatar Dec 05 '25 20:12 schahal