calico-apiserver TLS errors due to cert not being reissued after the namespace change in the 3.31 upgrade
We use the tigera-operator to manage our Calico installation. After upgrading our production environment from Calico 3.30.3 to 3.31.0 (tigera-operator 1.38.6 to 1.40.0), we began seeing TLS errors from the calico-apiserver. Here is one of the related errors from the kube-apiserver log:
loading OpenAPI spec for "v3.projectcalico.org" failed with: failed to download v3.projectcalico.org: failed to retrieve openAPI spec, http error: ResponseCode: 503, Body: error trying to reach service: tls: failed to verify certificate: x509: certificate is valid for calico-api, calico-api.calico-apiserver, calico-api.calico-apiserver.svc, calico-api.calico-apiserver.svc.cluster.local, not calico-api.calico-system.svc
As noted in the release notes, this upgrade moved the calico-apiserver from the calico-apiserver namespace to the calico-system namespace.
I looked at the cert in the calico-apiserver-certs secret, in both the tigera-operator namespace and the calico-system namespace, and verified that the Subject Alternative Names in the certificate still referenced the old calico-apiserver namespace, as indicated by the error message. (The secret in calico-system had just been created and, I believe, was copied from the one in tigera-operator.)
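In case it helps anyone checking the same thing, this is roughly how I inspected the certificate (a sketch using plain kubectl and openssl; the data key inside the secret is assumed here to be tls.crt and may be named differently depending on the operator version):

```bash
# List the data keys in the secret first, in case the certificate key
# is not named tls.crt in your version
kubectl describe secret calico-apiserver-certs -n tigera-operator

# Decode the certificate and print its Subject Alternative Names
kubectl get secret calico-apiserver-certs -n tigera-operator \
  -o jsonpath='{.data.tls\.crt}' | base64 -d \
  | openssl x509 -noout -text | grep -A1 'Subject Alternative Name'

# Repeat for the copy in calico-system to compare
kubectl get secret calico-apiserver-certs -n calico-system \
  -o jsonpath='{.data.tls\.crt}' | base64 -d \
  | openssl x509 -noout -text | grep -A1 'Subject Alternative Name'
```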
To force the certificate to be reissued, I deleted the calico-apiserver-certs secret in the tigera-operator namespace. A short time later it was recreated with a new certificate whose Subject Alternative Names correctly contain the calico-system namespace.
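For reference, the workaround amounted to the following (sketch; the operator recreated the secret on its own shortly after the deletion):

```bash
# Delete the stale secret; the operator recreates it with SANs for the
# new calico-system namespace and copies it into calico-system
kubectl delete secret calico-apiserver-certs -n tigera-operator

# Watch for the secret to be recreated
kubectl get secret calico-apiserver-certs -n tigera-operator -w
```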
This issue did not occur in our dev environment. That environment is destroyed and recreated from scratch on a regular basis, so our upgrade test there took a fresh Calico 3.30.3 installation to 3.31.0. In our production environment, the Calico installation is several years old and has been upgraded repeatedly over time. At least once before, we hit a certificate-related problem that turned out to stem from certificates having been created differently in the past, and my guess is that could be the case here as well (though I could certainly be mistaken). The CA that signed the old certificate was tigera-operator-signer@xxxxxxxxxx, but after forcing reissuance it is now just tigera-operator-signer (which I point out only to show that our certificates were the older ones generated differently in the past, as also discussed on that other issue).
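(For completeness, the issuer can be checked the same way as the SANs above, again assuming the tls.crt key name:)

```bash
# Print the certificate issuer: an older certificate shows a signer of the
# form tigera-operator-signer@..., a newly issued one just tigera-operator-signer
kubectl get secret calico-apiserver-certs -n tigera-operator \
  -o jsonpath='{.data.tls\.crt}' | base64 -d | openssl x509 -noout -issuer
```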
Now that our production environment is upgraded, I unfortunately can no longer reproduce this state, but wanted to post in case someone else also runs into this.
Your Environment
- Calico version: 3.31.0 open source edition
- Calico dataplane (bpf, nftables, iptables, windows etc.): iptables
- Orchestrator version (e.g. kubernetes, openshift, etc.): EKS 1.34
@kashook thanks for raising this, and sorry you hit it. I'll get someone to take a look.
Thank you for raising this issue, you are helping us make the code more resilient for yourself and other users.
When we developed the operator, we decided we do not want to delete secrets that users bring to the cluster themselves; we refer to those as BYO. We call certificates legacy if they were signed before a signer was introduced to the operator; those certificates have a signer like this: tigera-operator-signer@xxxxxxxxxx.
We do have a check for whether the required DNS names are present: https://github.com/tigera/operator/blob/b4a96ae0c7c3817dff827e3e54d72e19bccfc830/pkg/controller/certificatemanager/certificatemanager.go#L331
I really thought we would recycle legacy certs if DNS names were missing, but it looks like legacy certs are treated the same way as BYO certs: the operator logs that DNS names are missing, but does nothing else. I have made a ticket so we can reproduce and fix this on our end. Similar to the code for missing key usages, we should recycle the legacy cert.
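In the meantime, affected users can confirm they are in this state by checking the operator logs for the missing-DNS-names message and then using the secret-deletion workaround described above (sketch; assumes a default install with the operator running as deployment/tigera-operator in the tigera-operator namespace, and the exact log wording may differ between versions):

```bash
# Look for the operator's warning about missing DNS names in the certificate
# (exact message text may vary between operator versions)
kubectl logs -n tigera-operator deployment/tigera-operator | grep -i dns
```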