fix(target-allocator): Extend target allocator CA certificate duration to prevent renewal race
Description:
See https://github.com/open-telemetry/opentelemetry-operator/issues/4441 for details
The root cause (based on the three separate reports in the thread) seems to be that all three TargetAllocator certificates (CA, server, client) were created at the same time with the same 90-day duration.
When they all came up for renewal simultaneously (around day 60), there was a race condition where, for example:
- Either the server or the client cert is renewed first and signed by the original CA
- The CA is renewed second (new CA cert + key)
- The other cert (e.g., the client cert) is renewed and signed by the new CA
- This causes "certificate signed by unknown authority" errors
By giving the CA a 1-year duration while client/server certs keep the default 90-day duration (sketched below), we ensure:
- The CA remains stable (i.e., there is a much longer period between its renewals) while client/server certs renew
- Client and server certs are always signed by the same CA (barring the small chance that a client/server renewal coincides with the CA's own renewal around its 11-month mark)
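For illustration only, here is a minimal sketch of the idea using the cert-manager v1 Go API; the constructor, names, and issuer reference are assumptions, not the operator's actual code:

```go
package targetallocator // illustrative package name

import (
	"time"

	certmanagerv1 "github.com/cert-manager/cert-manager/pkg/apis/certmanager/v1"
	cmmeta "github.com/cert-manager/cert-manager/pkg/apis/meta/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// Illustrative durations: a long-lived CA and default-length leaf certs.
const (
	caDuration   = 365 * 24 * time.Hour // 1 year for the CA
	leafDuration = 90 * 24 * time.Hour  // default 90 days for client/server certs
)

// caCertificate sketches a cert-manager Certificate for the TargetAllocator CA.
// The name/namespace wiring and the issuer are hypothetical.
func caCertificate(name, namespace string) *certmanagerv1.Certificate {
	return &certmanagerv1.Certificate{
		ObjectMeta: metav1.ObjectMeta{Name: name + "-ca-cert", Namespace: namespace},
		Spec: certmanagerv1.CertificateSpec{
			IsCA:       true,
			CommonName: name + "-ca",
			SecretName: name + "-ca-cert",
			// Longer duration than the client/server certs, so the CA does not
			// come up for renewal at the same time as the leaf certificates.
			Duration: &metav1.Duration{Duration: caDuration},
			IssuerRef: cmmeta.ObjectReference{
				Name: name + "-self-signed-issuer",
				Kind: "Issuer",
			},
		},
	}
}
```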
Link to tracking Issue(s):
- Resolves: https://github.com/open-telemetry/opentelemetry-operator/issues/4441
Testing:
- Updated the test to validate the new duration
go test ./internal/manifests/targetallocator/ -run Certificate -v
=== RUN TestCACertificate
=== RUN TestCACertificate/Default_CA_Certificate
--- PASS: TestCACertificate (0.00s)
--- PASS: TestCACertificate/Default_CA_Certificate (0.00s)
=== RUN TestServingCertificate
=== RUN TestServingCertificate/Default_Serving_Certificate
--- PASS: TestServingCertificate (0.00s)
--- PASS: TestServingCertificate/Default_Serving_Certificate (0.00s)
=== RUN TestClientCertificate
=== RUN TestClientCertificate/Default_Client_Certificate
--- PASS: TestClientCertificate (0.00s)
--- PASS: TestClientCertificate/Default_Client_Certificate (0.00s)
PASS
ok github.com/open-telemetry/opentelemetry-operator/internal/manifests/targetallocator (cached)
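For reference, the kind of assertion added looks roughly like the following sketch; it is not the actual test code and reuses the hypothetical caCertificate constructor from the earlier sketch:

```go
import (
	"testing"
	"time"
)

func TestCACertificateDuration(t *testing.T) {
	// caCertificate is the hypothetical constructor sketched earlier.
	cert := caCertificate("my-collector", "default")

	want := 365 * 24 * time.Hour
	if cert.Spec.Duration == nil || cert.Spec.Duration.Duration != want {
		t.Fatalf("expected CA certificate duration %s, got %v", want, cert.Spec.Duration)
	}
}
```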
Documentation:
- Added a changelog (chloggen) entry
E2E Test Results
34 files ±0, 227 suites ±0, 1h 58m 24s ⏱️ (-1m 23s)
90 tests ±0: 90 ✅ ±0, 0 💤 ±0, 0 ❌ ±0
231 runs ±0: 231 ✅ ±0, 0 💤 ±0, 0 ❌ ±0
Results for commit 1186ee27. ± Comparison against base commit a8ba3092.
> Client and server certs are always signed by the same CA (barring the small chance that a client/server renewal coincides with the CA's own renewal around its 11-month mark)
Would it make sense to set the CA certificate grace period (spec.renewBefore) to be longer than the client certificate duration? Then it would be impossible to renew a client certificate with a CA certificate that has a shorter remaining duration.
Updated!
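Concretely, the constraint is renewBefore(CA) > duration(client/server): at the moment a leaf cert is signed, the CA still has at least renewBefore of lifetime left, so the leaf can never outlive its issuing CA. A minimal sketch of that relationship (imports as in the earlier sketch; the helper and the concrete values are illustrative, not the merged defaults):

```go
// applyCertDurations sketches the duration relationship; the certificates are
// hypothetical cert-manager v1 objects, and the concrete values are examples.
func applyCertDurations(caCert, clientCert, serverCert *certmanagerv1.Certificate) {
	caCert.Spec.Duration = &metav1.Duration{Duration: 365 * 24 * time.Hour}
	// RenewBefore is longer than the leaf duration, so any client/server cert
	// is signed by a CA with more remaining lifetime than the leaf itself.
	caCert.Spec.RenewBefore = &metav1.Duration{Duration: 120 * 24 * time.Hour}

	clientCert.Spec.Duration = &metav1.Duration{Duration: 90 * 24 * time.Hour}
	serverCert.Spec.Duration = &metav1.Duration{Duration: 90 * 24 * time.Hour}
}
```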
I'd accept this if you compiled the operator with these values set very low and verified that it works in a test cluster.
Looks like cert-manager has a minimum duration of 1 hour enforced by its admission webhook:
"admission webhook \"webhook.cert-manager.io\" denied the request: spec.duration: Invalid value: 24m0s: certificate duration must be greater than 1h0m0s"
So I made it 1h, updated the image, and side-loaded it into my KIND cluster. I didn't wait for an actual renewal because of the time involved, but I validated that the certs are created with the expected renewal dates:
$ kubectl get certificate -n cert-test -o json | jq -r '.items[] | "\(.metadata.name):\n Duration: \(.spec.duration)\n RenewBefore: \(.spec.renewBefore // "not set")\n NotBefore: \(.status.notBefore)\n NotAfter: \(.status.notAfter)\n RenewalTime: \(.status.renewalTime)\n"'
cert-test-collector-ca-cert:
Duration: 12h0m0s
RenewBefore: 4h0m0s
NotBefore: 2025-11-30T16:46:04Z
NotAfter: 2025-12-01T04:46:04Z
RenewalTime: 2025-12-01T00:46:04Z
cert-test-collector-ta-client-cert:
Duration: 1h30m0s
RenewBefore: not set
NotBefore: 2025-11-30T16:46:07Z
NotAfter: 2025-11-30T18:16:07Z
RenewalTime: 2025-11-30T17:46:07Z
cert-test-collector-ta-server-cert:
Duration: 1h30m0s
RenewBefore: not set
NotBefore: 2025-11-30T16:46:07Z
NotAfter: 2025-11-30T18:16:07Z
RenewalTime: 2025-11-30T17:46:07Z
Thanks, all! Is there anything else I need to do to get this merged?
(I was personally affected by this in September and then again a couple of weeks ago, and would love for this fix to be in the next version of the operator... the manual workaround is quite disruptive.)
I'd like @pavolloffay to approve too before we merge.
@pavolloffay 👍 ? (see this comment for a summary addressing your original comment)