
Random TLS certificate verification failure when calling the percona xtradb cluster validating webhook

Open konoox opened this issue 1 year ago • 4 comments

Report

Random TLS certificate verification failure when calling the percona xtradb cluster validating webhook

More about the problem

When we deploy the pxc operator in cluster-wide mode (watchAllNamespaces=true) and with more than one replica (replicaCount>1), TLS certificate verification failures appear at random on validating webhook calls. These errors can be seen when a user tries to apply or edit a CR definition of a pxc cluster, or in the operator logs during reconciliation operations. The logs look like the following:

"Internal error occured: failed calling webhook "validationwebhook.pxc.percona.com": failed to call webhook: Post "[https://percona-xtradb-cluster-operator.namespace.svc:443/validate-percona-xtradbcluster?timeout=10s](https://percona-xtradb-cluster-operator.namespace.svc/validate-percona-xtradbcluster?timeout=10s)": tls: failed to verify certificate: x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "Root CA")

After some investigation, I noticed that the CA bundle configured in the validating webhook changes each time a pxc-operator replica pod takes the lead of the operations, and only that pod has a valid TLS certificate.
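
A minimal sketch to observe this (assumes only standard kubectl and coreutils): fingerprint the webhook's caBundle in a loop, then delete the current leader pod and watch the fingerprint change.

while true; do
  # hash the CA bundle currently configured in the validating webhook
  kubectl get validatingwebhookconfiguration percona-xtradbcluster-webhook \
    -o jsonpath='{.webhooks[0].clientConfig.caBundle}' | sha256sum
  sleep 5
done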

This can be checked by recovering the ca-bundle from the validating webhook and the tls.crt from the pxc-operator leader pod, and verifying the signature with openssl:

kubectl get validatingwebhookconfiguration percona-xtradbcluster-webhook -o jsonpath='{.webhooks[0].clientConfig.caBundle}' | base64 -d > ca-bundle.crt
kubectl exec pxc-operator-6bc5fb656b-2grl7 -- cat /tmp/k8s-webhook-server/serving-certs/tls.crt > leader-tls.crt
openssl verify -CAfile ca-bundle.crt leader-tls.crt
leader-tls.crt: OK

But if we extract the tls.crt from another pxc-operator replica pod in the same way, the verification fails.
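
For reference, a sketch of the extraction step for the replica (the replica pod name here is hypothetical):

kubectl exec pxc-operator-6bc5fb656b-xxxxx -- cat /tmp/k8s-webhook-server/serving-certs/tls.crt > replica-tls.crt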

openssl verify -CAfile ca-bundle.crt replica-tls.crt
error 7 at 0 depth lookup:certificate signature failure
139771764295568:error:0407008A:rsa routines:RSA_padding_check_PKCS1_type_1:invalid padding:rsa_pk1.c:116:
139771764295568:error:04067072:rsa routines:RSA_EAY_PUBLIC_DECRYPT:padding check failed:rsa_eay.c:761:
139771764295568:error:0D0C5006:asn1 encoding routines:ASN1_item_verify:EVP lib:a_verify.c:249:

And if we delete the leader pod, the ca-bundle configured in the validating webhook changes to match the certificate of the new leader. As the percona-xtradb-cluster-operator k8s service points to any of the pxc-operator replica pods, this explains why the error appears at random: the validating webhook call may be routed to a non-leader pxc-operator replica pod. This was also confirmed by the fact that the problem disappears when we scale the operator down to only one replica.
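
To confirm that the Service indeed load-balances across all replicas, its endpoints can be listed; every replica pod IP should appear (the namespace placeholder is whatever the operator was installed into):

kubectl get endpoints percona-xtradb-cluster-operator -n <operator-namespace>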

Steps to reproduce

  1. Deploy the pxc operator in cluster-wide mode with more than one replica (helm values watchAllNamespaces=true and replicaCount>1); the more replicas, the easier the bug is to reproduce (see the Helm sketch after this list).
  2. Deploy a pxc cluster with any valid configuration
  3. Wait a bit and check the operator logs: TLS verification failures should appear at random during reconciliation operations. You can also check that the signature is valid for only one of the operator pods' certificates.
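
For step 1, a minimal sketch using Percona's published Helm chart (repo URL and chart name as published by Percona; the two values are the ones named above):

helm repo add percona https://percona.github.io/percona-helm-charts/
helm install pxc-operator percona/pxc-operator \
  --namespace pxc-operator --create-namespace \
  --set watchAllNamespaces=true \
  --set replicaCount=3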

Versions

  1. Kubernetes - v1.27.6
  2. Operator - Percona Operator for MySQL based on Percona XtraDB Cluster 1.13.0

Anything else?

No response

konoox avatar Mar 15 '24 08:03 konoox

We have the exact same behaviour with just one operator pod.

2024-09-12T08:20:50.978Z	ERROR	Update status	{"controller": "pxc-controller", "namespace": "helpdesk", "name": "sys-stat-db-cluster", "reconcileID": "00dbb033-c1fa-46c8-8683-f217ed05fd9d", "error": "write status: Internal error occurred: failed calling webhook \"validationwebhook.pxc.percona.com\": failed to call webhook: Post \"https://percona-xtradb-cluster-operator.pxc-operator.svc:443/validate-percona-xtradbcluster?timeout=10s\": tls: failed to verify certificate: x509: certificate signed by unknown authority (possibly because of \"crypto/rsa: verification error\" while trying to verify candidate authority certificate \"Root CA\")", "errorVerbose": "Internal error occurred: failed calling webhook \"validationwebhook.pxc.percona.com\": failed to call webhook: Post \"https://percona-xtradb-cluster-operator.pxc-operator.svc:443/validate-percona-xtradbcluster?timeout=10s\": tls: failed to verify certificate: x509: certificate signed by unknown authority (possibly because of \"crypto/rsa: verification error\" while trying to verify candidate authority certificate \"Root CA\")\nwrite status\ngithub.com/percona/percona-xtradb-cluster-operator/pkg/controller/pxc.(*ReconcilePerconaXtraDBCluster).writeStatus\n\t/go/src/github.com/percona/percona-xtradb-cluster-operator/pkg/controller/pxc/status.go:158\ngithub.com/percona/percona-xtradb-cluster-operator/pkg/controller/pxc.(*ReconcilePerconaXtraDBCluster).updateStatus\n\t/go/src/github.com/percona/percona-xtradb-cluster-operator/pkg/controller/pxc/status.go:43\ngithub.com/percona/percona-xtradb-cluster-operator/pkg/controller/pxc.(*ReconcilePerconaXtraDBCluster).Reconcile.func1\n\t/go/src/github.com/percona/percona-xtradb-cluster-operator/pkg/controller/pxc/controller.go:204\ngithub.com/percona/percona-xtradb-cluster-operator/pkg/controller/pxc.(*ReconcilePerconaXtraDBCluster).Reconcile\n\t/go/src/github.com/percona/percona-xtradb-cluster-operator/pkg/controller/pxc/controller.go:327\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:114\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:311\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:261\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:222\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1695"}
github.com/percona/percona-xtradb-cluster-operator/pkg/controller/pxc.(*ReconcilePerconaXtraDBCluster).Reconcile.func1
	/go/src/github.com/percona/percona-xtradb-cluster-operator/pkg/controller/pxc/controller.go:206
github.com/percona/percona-xtradb-cluster-operator/pkg/controller/pxc.(*ReconcilePerconaXtraDBCluster).Reconcile
	/go/src/github.com/percona/percona-xtradb-cluster-operator/pkg/controller/pxc/controller.go:327
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:114
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:311
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:261
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:222
kind: Deployment
apiVersion: apps/v1
metadata:
  name: percona-xtradb-cluster-operator
  namespace: pxc-operator
  annotations:
    deployment.kubernetes.io/revision: '1'
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/component: operator
      app.kubernetes.io/instance: percona-xtradb-cluster-operator
      app.kubernetes.io/name: percona-xtradb-cluster-operator
      app.kubernetes.io/part-of: percona-xtradb-cluster-operator
  template:
    metadata:
      creationTimestamp: null
      labels:
        app.kubernetes.io/component: operator
        app.kubernetes.io/instance: percona-xtradb-cluster-operator
        app.kubernetes.io/name: percona-xtradb-cluster-operator
        app.kubernetes.io/part-of: percona-xtradb-cluster-operator
    spec:
      containers:
        - resources:
            limits:
              cpu: 200m
              memory: 500Mi
            requests:
              cpu: 100m
              memory: 20Mi
          terminationMessagePath: /dev/termination-log
          name: percona-xtradb-cluster-operator
          command:
            - percona-xtradb-cluster-operator
          livenessProbe:
            httpGet:
              path: /metrics
              port: metrics
              scheme: HTTP
            timeoutSeconds: 1
            periodSeconds: 10
            successThreshold: 1
            failureThreshold: 3
          env:
            - name: LOG_STRUCTURED
              value: 'false'
            - name: LOG_LEVEL
              value: INFO
            - name: WATCH_NAMESPACE
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  apiVersion: v1
                  fieldPath: metadata.name
            - name: OPERATOR_NAME
              value: percona-xtradb-cluster-operator
            - name: DISABLE_TELEMETRY
              value: 'false'
          ports:
            - name: metrics
              containerPort: 8080
              protocol: TCP
          imagePullPolicy: Always
          terminationMessagePolicy: File
          image: 'perconalab/percona-xtradb-cluster-operator:main'
      restartPolicy: Always
      terminationGracePeriodSeconds: 600
      dnsPolicy: ClusterFirst
      serviceAccountName: percona-xtradb-cluster-operator
      serviceAccount: percona-xtradb-cluster-operator
      securityContext: {}
      schedulerName: default-scheduler
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
      maxSurge: 25%
  revisionHistoryLimit: 10
  progressDeadlineSeconds: 600

Elyytscha avatar Sep 12 '24 08:09 Elyytscha

me too

wuxinchao011 avatar Nov 07 '24 11:11 wuxinchao011

Hi, thanks for reporting the problem.

Delayed, but I have created a Jira issue for the developers to explore this in more detail: https://perconadev.atlassian.net/browse/K8SPXC-1637

dbazhenov avatar Apr 29 '25 13:04 dbazhenov

We have the same problem. Scaling from 3 replicas down to 1 (as sketched below) seems to solve it; we have not had any more bad certificate logs since.

We are monitoring.
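
A sketch of the same workaround, assuming the deployment name and namespace from the manifest posted above (if the operator was installed via Helm, setting replicaCount=1 there instead would avoid drift on upgrade):

kubectl scale deployment percona-xtradb-cluster-operator -n pxc-operator --replicas=1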

dev-ago avatar Oct 09 '25 11:10 dev-ago