Random TLS certificate verification failure when calling the percona xtradb cluster validating webhook
Report
Random TLS certificate verification failure when calling the percona xtradb cluster validating webhook
More about the problem
When we deploy the pxc operator in cluster-wide mode (watchAllNamespaces=true) with more than one replica (replicaCount>1), TLS certificate verification failures appear at random on validating webhook calls. These errors can be seen when a user tries to apply or edit the CR definition of a pxc cluster, or in the operator logs during reconciliation operations. The logs look like the following:
"Internal error occured: failed calling webhook "validationwebhook.pxc.percona.com": failed to call webhook: Post "[https://percona-xtradb-cluster-operator.namespace.svc:443/validate-percona-xtradbcluster?timeout=10s](https://percona-xtradb-cluster-operator.namespace.svc/validate-percona-xtradbcluster?timeout=10s)": tls: failed to verify certificate: x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "Root CA")
After some investigation, I noticed that the CA bundle configured in the validating webhook changes each time a pxc-operator replica pod takes over as leader, and only this leader pod has a TLS certificate that matches it.
This can be checked by recovering the caBundle from the validating webhook and the tls.crt from the pxc-operator leader pod, then verifying the signature with openssl:
kubectl get validatingwebhookconfiguration percona-xtradbcluster-webhook -o jsonpath='{.webhooks[0].clientConfig.caBundle}' | base64 -d > ca-bundle.crt
kubectl exec pxc-operator-6bc5fb656b-2grl7 -- cat /tmp/k8s-webhook-server/serving-certs/tls.crt > leader-tls.crt
openssl verify -CAfile ca-bundle.crt leader-tls.crt
leader-tls.crt: OK
But if we extract the tls.crt from another pxc-operator replica pod, the verification fails:
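(As a sketch, assuming a non-leader replica pod named pxc-operator-6bc5fb656b-xxxxx, a placeholder name, the certificate can be recovered the same way as for the leader before running the same openssl check:)
kubectl exec pxc-operator-6bc5fb656b-xxxxx -- cat /tmp/k8s-webhook-server/serving-certs/tls.crt > replica-tls.crt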
openssl verify -CAfile ca-bundle.crt replica-tls.crt
error 7 at 0 depth lookup:certificate signature failure
139771764295568:error:0407008A:rsa routines:RSA_padding_check_PKCS1_type_1:invalid padding:rsa_pk1.c:116:
139771764295568:error:04067072:rsa routines:RSA_EAY_PUBLIC_DECRYPT:padding check failed:rsa_eay.c:761:
139771764295568:error:0D0C5006:asn1 encoding routines:ASN1_item_verify:EVP lib:a_verify.c:249:
And if we delete the leader pod, the caBundle configured in the validating webhook changes to match the certificate of the new leader. As the percona-xtradb-cluster-operator k8s service points to any of the pxc-operator replica pods, this explains why the error appears at random: it happens whenever the validation webhook call is routed to a non-leader pxc-operator replica pod. This was also confirmed by the fact that the problem disappears when we scale the operator down to a single replica.
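A rough way to observe this rotation, reusing the leader pod name from above (the caBundle hash should change once a new leader is elected):
kubectl get validatingwebhookconfiguration percona-xtradbcluster-webhook -o jsonpath='{.webhooks[0].clientConfig.caBundle}' | sha256sum
kubectl delete pod pxc-operator-6bc5fb656b-2grl7
kubectl get validatingwebhookconfiguration percona-xtradbcluster-webhook -o jsonpath='{.webhooks[0].clientConfig.caBundle}' | sha256sum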
Steps to reproduce
- Deploy the pxc operator in cluster-wide mode with more than one replica (helm values watchAllNamespaces=true and replicaCount>1); the more replicas, the easier the bug is to reproduce (see the install sketch after this list).
- Deploy a pxc cluster with any valid configuration
- Wait a bit and check the operator logs: TLS verification failures should appear at random during reconciliation operations. You can also check that the caBundle signature is valid for only one of the operator pods' certificates.
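As a rough sketch, assuming the operator is installed from the Percona Helm chart (repo URL, chart, release, and namespace names below are examples), the reproduction setup looks like:
helm repo add percona https://percona.github.io/percona-helm-charts/
helm install pxc-operator percona/pxc-operator \
  --namespace pxc-operator --create-namespace \
  --set watchAllNamespaces=true \
  --set replicaCount=3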
Versions
- Kubernetes - v1.27.6
- Operator - Percona Operator for MySQL based on Percona XtraDB Cluster 1.13.0
Anything else?
No response
We have the exact same behaviour with just 1 operator pod.
2024-09-12T08:20:50.978Z ERROR Update status {"controller": "pxc-controller", "namespace": "helpdesk", "name": "sys-stat-db-cluster", "reconcileID": "00dbb033-c1fa-46c8-8683-f217ed05fd9d", "error": "write status: Internal error occurred: failed calling webhook \"validationwebhook.pxc.percona.com\": failed to call webhook: Post \"https://percona-xtradb-cluster-operator.pxc-operator.svc:443/validate-percona-xtradbcluster?timeout=10s\": tls: failed to verify certificate: x509: certificate signed by unknown authority (possibly because of \"crypto/rsa: verification error\" while trying to verify candidate authority certificate \"Root CA\")", "errorVerbose": "Internal error occurred: failed calling webhook \"validationwebhook.pxc.percona.com\": failed to call webhook: Post \"https://percona-xtradb-cluster-operator.pxc-operator.svc:443/validate-percona-xtradbcluster?timeout=10s\": tls: failed to verify certificate: x509: certificate signed by unknown authority (possibly because of \"crypto/rsa: verification error\" while trying to verify candidate authority certificate \"Root CA\")\nwrite status\ngithub.com/percona/percona-xtradb-cluster-operator/pkg/controller/pxc.(*ReconcilePerconaXtraDBCluster).writeStatus\n\t/go/src/github.com/percona/percona-xtradb-cluster-operator/pkg/controller/pxc/status.go:158\ngithub.com/percona/percona-xtradb-cluster-operator/pkg/controller/pxc.(*ReconcilePerconaXtraDBCluster).updateStatus\n\t/go/src/github.com/percona/percona-xtradb-cluster-operator/pkg/controller/pxc/status.go:43\ngithub.com/percona/percona-xtradb-cluster-operator/pkg/controller/pxc.(*ReconcilePerconaXtraDBCluster).Reconcile.func1\n\t/go/src/github.com/percona/percona-xtradb-cluster-operator/pkg/controller/pxc/controller.go:204\ngithub.com/percona/percona-xtradb-cluster-operator/pkg/controller/pxc.(*ReconcilePerconaXtraDBCluster).Reconcile\n\t/go/src/github.com/percona/percona-xtradb-cluster-operator/pkg/controller/pxc/controller.go:327\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:114\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:311\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:261\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:222\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1695"}
github.com/percona/percona-xtradb-cluster-operator/pkg/controller/pxc.(*ReconcilePerconaXtraDBCluster).Reconcile.func1
/go/src/github.com/percona/percona-xtradb-cluster-operator/pkg/controller/pxc/controller.go:206
github.com/percona/percona-xtradb-cluster-operator/pkg/controller/pxc.(*ReconcilePerconaXtraDBCluster).Reconcile
/go/src/github.com/percona/percona-xtradb-cluster-operator/pkg/controller/pxc/controller.go:327
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:114
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:311
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:261
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:222
kind: Deployment
apiVersion: apps/v1
metadata:
  name: percona-xtradb-cluster-operator
  namespace: pxc-operator
  annotations:
    deployment.kubernetes.io/revision: '1'
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/component: operator
      app.kubernetes.io/instance: percona-xtradb-cluster-operator
      app.kubernetes.io/name: percona-xtradb-cluster-operator
      app.kubernetes.io/part-of: percona-xtradb-cluster-operator
  template:
    metadata:
      creationTimestamp: null
      labels:
        app.kubernetes.io/component: operator
        app.kubernetes.io/instance: percona-xtradb-cluster-operator
        app.kubernetes.io/name: percona-xtradb-cluster-operator
        app.kubernetes.io/part-of: percona-xtradb-cluster-operator
    spec:
      containers:
        - resources:
            limits:
              cpu: 200m
              memory: 500Mi
            requests:
              cpu: 100m
              memory: 20Mi
          terminationMessagePath: /dev/termination-log
          name: percona-xtradb-cluster-operator
          command:
            - percona-xtradb-cluster-operator
          livenessProbe:
            httpGet:
              path: /metrics
              port: metrics
              scheme: HTTP
            timeoutSeconds: 1
            periodSeconds: 10
            successThreshold: 1
            failureThreshold: 3
          env:
            - name: LOG_STRUCTURED
              value: 'false'
            - name: LOG_LEVEL
              value: INFO
            - name: WATCH_NAMESPACE
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  apiVersion: v1
                  fieldPath: metadata.name
            - name: OPERATOR_NAME
              value: percona-xtradb-cluster-operator
            - name: DISABLE_TELEMETRY
              value: 'false'
          ports:
            - name: metrics
              containerPort: 8080
              protocol: TCP
          imagePullPolicy: Always
          terminationMessagePolicy: File
          image: 'perconalab/percona-xtradb-cluster-operator:main'
      restartPolicy: Always
      terminationGracePeriodSeconds: 600
      dnsPolicy: ClusterFirst
      serviceAccountName: percona-xtradb-cluster-operator
      serviceAccount: percona-xtradb-cluster-operator
      securityContext: {}
      schedulerName: default-scheduler
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
      maxSurge: 25%
  revisionHistoryLimit: 10
  progressDeadlineSeconds: 600
me too
Hi, thanks for reporting the problem.
Delayed, but I have created a Jira issue for the developers to explore this in more detail: https://perconadev.atlassian.net/browse/K8SPXC-1637
We have the same problem. We have scaled from 3 to 1 replica, and it seems that this solves the problem. We have not had any more bad certificate logs.
We are monitoring.
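For reference, the scale-down workaround described above is just the following (assuming the deployment name and namespace from the manifest posted earlier):
kubectl -n pxc-operator scale deployment percona-xtradb-cluster-operator --replicas=1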