percona-server-mongodb-operator
K8SPSMDB-1014: update cert-manager certs and issuers
https://jira.percona.com/browse/K8SPSMDB-1014
DESCRIPTION
Problem:
After updating crVersion from 1.14.0 to 1.15.0 with `.spec.updateStrategy` set to `SmartUpdate`, the operator gets stuck on smart update once the certificates are renewed.
Cause: In version 1.15.0 we switched to a new certificate schema (for more info, see the description of https://github.com/percona/percona-server-mongodb-operator/pull/1287), but that PR did not implement updating existing certificates to the new schema.
Because the certificates are not updated, we still have the same problem we had in https://jira.percona.com/browse/K8SPSMDB-956.
Solution: First of all, the operator should update the certificates. To do that, it should check whether cert-manager is installed and, if it is, apply our changes.
Even after these changes the operator will still face issues with SmartUpdate, so if there are any changes to the CA we should add a migration mechanism along the lines of this guide: https://docs.percona.com/percona-operator-for-mongodb/TLS.html#update-certificates-without-downtime.
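For illustration, here is a minimal Go sketch of the cert-manager check, assuming a controller-runtime client whose scheme includes the apiextensions.k8s.io/v1 types; the function name and the approach (looking up the Certificate CRD) are illustrative, not necessarily what the operator actually does:

```go
// Minimal sketch (not the operator's actual code): detect cert-manager by
// looking for its Certificate CRD.
package tls

import (
	"context"

	apiextv1 "k8s.io/apiextensions-apiserver/pkg/apis/apiextensions/v1"
	k8serrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// isCertManagerInstalled reports whether the certificates.cert-manager.io CRD exists.
func isCertManagerInstalled(ctx context.Context, cl client.Client) (bool, error) {
	crd := new(apiextv1.CustomResourceDefinition)
	err := cl.Get(ctx, types.NamespacedName{Name: "certificates.cert-manager.io"}, crd)
	if k8serrors.IsNotFound(err) {
		return false, nil
	}
	if err != nil {
		return false, err
	}
	return true, nil
}
```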
So, the migration will consist of the following actions:
- Check if cert-manager exists.
- If true, check if any changes will be applied to the certificates.
- If true, create copies of the `cluster1-ssl` and `cluster1-ssl-internal` secrets named `cluster1-ssl-old` and `cluster1-ssl-internal-old`.
- Apply the changes to the certificates and wait for the new secrets.
- Get `ca.crt` from both old secrets and merge them into the new secrets. Set the values of `tls.key` and `tls.crt` from the old secrets in the new ones (see the sketch after this list).
- Wait until the next reconcile.
- On the next reconcile, check again if any changes will be applied to the certificates.
- If the certificates remain untouched, check if `ca.crt` was merged from the old secrets.
- If true, delete the old secrets.
- Wait until all statefulsets are ready.
- Compare the `ca.crt` of the current secrets with the `ca.crt` from `cluster1-ca-cert`.
- If it's different, recreate the secrets by deleting them. Cert-manager will recreate them.
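For illustration, a minimal Go sketch of the CA-merge step above, assuming a controller-runtime client and the `cluster1-ssl`/`cluster1-ssl-old` secret naming from the list; the helper name and details are illustrative, not the operator's actual code:

```go
// Hypothetical sketch of the CA-merge step: append the old ca.crt to the
// renewed secret and carry over the old tls.crt/tls.key.
package tls

import (
	"bytes"
	"context"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// mergeOldCA merges the old CA into the renewed secret so both CAs stay
// trusted during the rollout, and keeps the old leaf certificate in place
// until SmartUpdate restarts the pods.
func mergeOldCA(ctx context.Context, cl client.Client, ns, newName, oldName string) error {
	newSecret := new(corev1.Secret)
	if err := cl.Get(ctx, types.NamespacedName{Namespace: ns, Name: newName}, newSecret); err != nil {
		return err
	}
	oldSecret := new(corev1.Secret)
	if err := cl.Get(ctx, types.NamespacedName{Namespace: ns, Name: oldName}, oldSecret); err != nil {
		return err
	}

	// Merge CA bundles only if the old CA is not already part of the new one.
	if !bytes.Contains(newSecret.Data["ca.crt"], oldSecret.Data["ca.crt"]) {
		newSecret.Data["ca.crt"] = append(newSecret.Data["ca.crt"], oldSecret.Data["ca.crt"]...)
	}
	// Keep serving with the old key pair until the rolling restart finishes.
	newSecret.Data["tls.crt"] = oldSecret.Data["tls.crt"]
	newSecret.Data["tls.key"] = oldSecret.Data["tls.key"]

	return cl.Update(ctx, newSecret)
}
```

The idea behind this step is that the merged `ca.crt` lets pods that still hold the old certificate and pods that already picked up the new one verify each other while the statefulsets are restarted one by one.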
CHECKLIST
Jira
- [x] Is the Jira ticket created and referenced properly?
- [x] Does the Jira ticket have the proper statuses for documentation (`Needs Doc`) and QA (`Needs QA`)?
- [x] Does the Jira ticket link to the proper milestone (Fix Version field)?
Tests
- [x] Is an E2E test/test case added for the new feature/change?
- [x] Are unit tests added where appropriate?
- [x] Are OpenShift compare files changed for E2E tests (`compare/*-oc.yml`)?
Config/Logging/Testability
- [x] Are all needed new/changed options added to default YAML files?
- [x] Are the manifests (crd/bundle) regenerated if needed?
- [x] Did we add proper logging messages for operator actions?
- [x] Did we ensure compatibility with the previous version or cluster upgrade process?
- [x] Does the change support oldest and newest supported MongoDB version?
- [x] Does the change support oldest and newest supported Kubernetes version?
It seems that the https://docs.percona.com/percona-operator-for-mongodb/TLS.html#update-certificates-without-downtime approach doesn't work with mongos.
After the final recreation of secrets (step 12), the operator updates the cfg pods with new secrets. After all cfg pods have been updated, all mongos pods become unready with the following error in the logs:
{"t":{"$date":"2024-04-22T07:35:34.003+00:00"},"s":"W","c":"NETWORK","id":23235,"ctx":"conn2449","msg":"SSL peer certificate validation failed","attr":{"reason":"self-signed certificate"}}
This is why I removed the lines discussed here: https://github.com/percona/percona-server-mongodb-operator/pull/1383#discussion_r1567191104. It turns out we shouldn't remove them, but we still need to find a way to update mongos correctly.
My guess is that mongos only accepts the first part of the CA.
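To check that guess, one way is to dump every certificate in the merged bundle and confirm both CAs are actually present in the secret. A rough, hypothetical Go helper (not part of this PR) that lists the certificates in a `ca.crt` file:

```go
// Hypothetical debugging helper: print every certificate in a merged ca.crt
// bundle (e.g. extracted from the cluster1-ssl secret).
package main

import (
	"crypto/x509"
	"encoding/pem"
	"fmt"
	"log"
	"os"
)

func main() {
	bundle, err := os.ReadFile("ca.crt")
	if err != nil {
		log.Fatal(err)
	}
	for i := 0; ; i++ {
		var block *pem.Block
		block, bundle = pem.Decode(bundle)
		if block == nil {
			break
		}
		cert, err := x509.ParseCertificate(block.Bytes)
		if err != nil {
			log.Fatalf("certificate %d: %v", i, err)
		}
		fmt.Printf("certificate %d: subject=%s notAfter=%s\n", i, cert.Subject, cert.NotAfter)
	}
}
```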
The issue mentioned here: https://github.com/percona/percona-server-mongodb-operator/pull/1383#issuecomment-2068705855 has been fixed in https://github.com/percona/percona-server-mongodb-operator/pull/1383/commits/82643909bf55fb76717a939fed4a229fc6811aac
Description has been updated.
| Test name | Status |
|---|---|
| arbiter | passed |
| balancer | passed |
| custom-replset-name | passed |
| cross-site-sharded | passed |
| data-at-rest-encryption | passed |
| data-sharded | passed |
| demand-backup | passed |
| demand-backup-eks-credentials | passed |
| demand-backup-physical | passed |
| demand-backup-physical-sharded | passed |
| demand-backup-sharded | passed |
| expose-sharded | passed |
| ignore-labels-annotations | passed |
| init-deploy | passed |
| finalizer | passed |
| ldap | passed |
| ldap-tls | passed |
| limits | passed |
| liveness | passed |
| mongod-major-upgrade | passed |
| mongod-major-upgrade-sharded | passed |
| monitoring-2-0 | passed |
| multi-cluster-service | passed |
| non-voting | passed |
| one-pod | passed |
| operator-self-healing-chaos | passed |
| pitr | passed |
| pitr-sharded | passed |
| pitr-physical | passed |
| pvc-resize | passed |
| recover-no-primary | passed |
| rs-shard-migration | passed |
| scaling | passed |
| scheduled-backup | passed |
| security-context | passed |
| self-healing-chaos | passed |
| service-per-pod | passed |
| serviceless-external-nodes | passed |
| smart-update | passed |
| split-horizon | passed |
| storage | passed |
| tls-issue-cert-manager | passed |
| upgrade | passed |
| upgrade-consistency | passed |
| upgrade-consistency-sharded-tls | passed |
| upgrade-sharded | passed |
| users | passed |
| version-service | passed |
We ran 48 out of 48 tests.
commit: https://github.com/percona/percona-server-mongodb-operator/pull/1383/commits/f7f2d8d27a1e472eec79713e30a4aef45907395a
image: perconalab/percona-server-mongodb-operator:PR-1383-f7f2d8d2