percona-server-mongodb-operator
K8SPSMDB-1014: update cert-manager certs and issuers
https://jira.percona.com/browse/K8SPSMDB-1014
DESCRIPTION
Problem:
After updating crVersion from 1.14.0 to 1.15.0 with `.spec.updateStrategy` set to `SmartUpdate`, the operator gets stuck on smart update once the certificates are renewed.
Cause: In version 1.15.0 we switched to a new certificate schema (for more info, see the description of https://github.com/percona/percona-server-mongodb-operator/pull/1287), but that PR did not implement updating existing certificates to the new schema.
Because the certificates are not updated, we still have the same problem we had in https://jira.percona.com/browse/K8SPSMDB-956.
Solution: First of all, the operator should update the certificates. To do that, it should check whether cert-manager is installed and, if it is, apply our changes.
Even after these changes the operator will still face issues with SmartUpdate, so if there are any changes to the CA we should add a migration mechanism along the lines of this guide: https://docs.percona.com/percona-operator-for-mongodb/TLS.html#update-certificates-without-downtime.
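For illustration, here is a minimal Go sketch of the cert-manager check, assuming a controller-runtime client whose scheme includes the apiextensions.k8s.io/v1 types; the function name and the approach (looking up the Certificate CRD) are illustrative, not necessarily what the operator actually does:

```go
// Minimal sketch (not the operator's actual code): detect cert-manager by
// looking for its Certificate CRD.
package tls

import (
	"context"

	apiextv1 "k8s.io/apiextensions-apiserver/pkg/apis/apiextensions/v1"
	k8serrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// isCertManagerInstalled reports whether the certificates.cert-manager.io CRD exists.
func isCertManagerInstalled(ctx context.Context, cl client.Client) (bool, error) {
	crd := new(apiextv1.CustomResourceDefinition)
	err := cl.Get(ctx, types.NamespacedName{Name: "certificates.cert-manager.io"}, crd)
	if k8serrors.IsNotFound(err) {
		return false, nil
	}
	if err != nil {
		return false, err
	}
	return true, nil
}
```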
So, the migration will consist of the following actions:
- Check if cert-manager exists.
- If true, check if any changes will be applied to the certificates.
- If true, create copies of the `cluster1-ssl` and `cluster1-ssl-internal` secrets named `cluster1-ssl-old` and `cluster1-ssl-internal-old`.
- Apply the changes to the certificates and wait for the new secrets.
- Get `ca.crt` from both old secrets and merge them into the new secrets. Set the values of `tls.key` and `tls.crt` from the old secrets in the new ones (see the sketch after this list).
- Wait until the next reconcile.
- On the next reconcile, check again if any changes will be applied to the certificates.
- If the certificates remain untouched, check if `ca.crt` was merged from the old secrets.
- If true, delete the old secrets.
- Wait until all statefulsets are ready.
- Compare the `ca.crt` of the current secrets with the `ca.crt` from `cluster1-ca-cert`.
- If it's different, recreate the secrets by deleting them. Cert-manager will recreate them.
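For illustration, a minimal Go sketch of the CA-merge step above, assuming a controller-runtime client and the `cluster1-ssl`/`cluster1-ssl-old` secret naming from the list; the helper name and details are illustrative, not the operator's actual code:

```go
// Hypothetical sketch of the CA-merge step: append the old ca.crt to the
// renewed secret and carry over the old tls.crt/tls.key.
package tls

import (
	"bytes"
	"context"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// mergeOldCA merges the old CA into the renewed secret so both CAs stay
// trusted during the rollout, and keeps the old leaf certificate in place
// until SmartUpdate restarts the pods.
func mergeOldCA(ctx context.Context, cl client.Client, ns, newName, oldName string) error {
	newSecret := new(corev1.Secret)
	if err := cl.Get(ctx, types.NamespacedName{Namespace: ns, Name: newName}, newSecret); err != nil {
		return err
	}
	oldSecret := new(corev1.Secret)
	if err := cl.Get(ctx, types.NamespacedName{Namespace: ns, Name: oldName}, oldSecret); err != nil {
		return err
	}

	// Merge CA bundles only if the old CA is not already part of the new one.
	if !bytes.Contains(newSecret.Data["ca.crt"], oldSecret.Data["ca.crt"]) {
		newSecret.Data["ca.crt"] = append(newSecret.Data["ca.crt"], oldSecret.Data["ca.crt"]...)
	}
	// Keep serving with the old key pair until the rolling restart finishes.
	newSecret.Data["tls.crt"] = oldSecret.Data["tls.crt"]
	newSecret.Data["tls.key"] = oldSecret.Data["tls.key"]

	return cl.Update(ctx, newSecret)
}
```

The idea behind this step is that the merged `ca.crt` lets pods that still hold the old certificate and pods that already picked up the new one verify each other while the statefulsets are restarted one by one.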
CHECKLIST
Jira
- [x] Is the Jira ticket created and referenced properly?
- [x] Does the Jira ticket have the proper statuses for documentation (`Needs Doc`) and QA (`Needs QA`)?
- [x] Does the Jira ticket link to the proper milestone (Fix Version field)?
Tests
- [x] Is an E2E test/test case added for the new feature/change?
- [x] Are unit tests added where appropriate?
- [x] Are OpenShift compare files changed for E2E tests (`compare/*-oc.yml`)?
Config/Logging/Testability
- [x] Are all needed new/changed options added to default YAML files?
- [x] Are the manifests (crd/bundle) regenerated if needed?
- [x] Did we add proper logging messages for operator actions?
- [x] Did we ensure compatibility with the previous version or cluster upgrade process?
- [x] Does the change support oldest and newest supported MongoDB version?
- [x] Does the change support oldest and newest supported Kubernetes version?
It seems that the https://docs.percona.com/percona-operator-for-mongodb/TLS.html#update-certificates-without-downtime approach doesn't work with mongos.
After the final recreation of secrets (step 12), the operator updates the cfg pods with new secrets. After all cfg pods have been updated, all mongos pods become unready with the following error in the logs:
{"t":{"$date":"2024-04-22T07:35:34.003+00:00"},"s":"W","c":"NETWORK","id":23235,"ctx":"conn2449","msg":"SSL peer certificate validation failed","attr":{"reason":"self-signed certificate"}}
This is why I removed the lines discussed here: https://github.com/percona/percona-server-mongodb-operator/pull/1383#discussion_r1567191104. It turns out we shouldn't remove them, but we still need to find a way to update mongos correctly.
My guess is that mongos only accepts the first part of the CA.
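To check that guess, one way is to dump every certificate in the merged bundle and confirm both CAs are actually present in the secret. A rough, hypothetical Go helper (not part of this PR) that lists the certificates in a `ca.crt` file:

```go
// Hypothetical debugging helper: print every certificate in a merged ca.crt
// bundle (e.g. extracted from the cluster1-ssl secret).
package main

import (
	"crypto/x509"
	"encoding/pem"
	"fmt"
	"log"
	"os"
)

func main() {
	bundle, err := os.ReadFile("ca.crt")
	if err != nil {
		log.Fatal(err)
	}
	for i := 0; ; i++ {
		var block *pem.Block
		block, bundle = pem.Decode(bundle)
		if block == nil {
			break
		}
		cert, err := x509.ParseCertificate(block.Bytes)
		if err != nil {
			log.Fatalf("certificate %d: %v", i, err)
		}
		fmt.Printf("certificate %d: subject=%s notAfter=%s\n", i, cert.Subject, cert.NotAfter)
	}
}
```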
The issue mentioned here: https://github.com/percona/percona-server-mongodb-operator/pull/1383#issuecomment-2068705855 has been fixed in https://github.com/percona/percona-server-mongodb-operator/pull/1383/commits/82643909bf55fb76717a939fed4a229fc6811aac
Description has been updated.
| Test name | Status |
|---|---|
| arbiter | passed |
| balancer | passed |
| custom-replset-name | passed |
| cross-site-sharded | passed |
| data-at-rest-encryption | passed |
| data-sharded | passed |
| demand-backup | passed |
| demand-backup-eks-credentials | passed |
| demand-backup-physical | passed |
| demand-backup-physical-sharded | passed |
| demand-backup-sharded | passed |
| expose-sharded | passed |
| ignore-labels-annotations | passed |
| init-deploy | passed |
| finalizer | passed |
| ldap | passed |
| ldap-tls | passed |
| limits | passed |
| liveness | passed |
| mongod-major-upgrade | passed |
| mongod-major-upgrade-sharded | passed |
| monitoring-2-0 | passed |
| multi-cluster-service | passed |
| non-voting | passed |
| one-pod | passed |
| operator-self-healing-chaos | passed |
| pitr | passed |
| pitr-sharded | passed |
| pitr-physical | passed |
| pvc-resize | passed |
| recover-no-primary | passed |
| rs-shard-migration | passed |
| scaling | passed |
| scheduled-backup | passed |
| security-context | passed |
| self-healing-chaos | passed |
| service-per-pod | passed |
| serviceless-external-nodes | passed |
| smart-update | passed |
| split-horizon | passed |
| storage | passed |
| tls-issue-cert-manager | passed |
| upgrade | passed |
| upgrade-consistency | passed |
| upgrade-consistency-sharded-tls | passed |
| upgrade-sharded | passed |
| users | passed |
| version-service | passed |
We ran 48 out of 48 tests.
commit: https://github.com/percona/percona-server-mongodb-operator/pull/1383/commits/f7f2d8d27a1e472eec79713e30a4aef45907395a
image: perconalab/percona-server-mongodb-operator:PR-1383-f7f2d8d2