istio MariaDB Galera cluster crashes with istio mTLS mode STRICT

Bug Description

Hello,

I'm trying to run a mariadb galera cluster with istio. I have to mention that i'm not using bitnami chart but i have my own mariadb image in a statefulset.

galera config:

    [galera]
    # Galera cluster configuration
    wsrep_on                 = ON
    wsrep_provider           = /usr/local/mysql/lib/libgalera_smm.so
    wsrep_cluster_address    = "gcomm://mariadb-0,mariadb-1,mariadb-2"
    wsrep_cluster_name       = "mariadb-cluster"
    wsrep_sst_method         = "rsync"
    binlog_format            = row
    default_storage_engine   = InnoDB
    innodb_autoinc_lock_mode = 2
    
    # Cluster node configuration
    bind-address         = 0.0.0.0
    wsrep_log_conflicts  = ON
    wsrep_debug          = NONE

I also have set services for each mariadb pod, according to "wsrep_cluster_address":

apiVersion: v1
kind: Service
metadata:
  name: mariadb
spec:
  type: ClusterIP
  ports:
    - name: tcp-mariadb
      port: 3306
      protocol: TCP
      targetPort: 3306
    - name: tcp-grp
      port: 4567
      protocol: TCP
      targetPort: 4567
    - name: tcp-ist
      port: 4568
      protocol: TCP
      targetPort: 4568
    - name: tcp-sst
      port: 4444
      protocol: TCP
      targetPort: 4444
  selector:
    app: mariadb

---
apiVersion: v1
kind: Service
metadata:
  name: mariadb-0
spec:
  type: ClusterIP
  ports:
    - name: tcp-mariadb
      port: 3306
      protocol: TCP
      targetPort: 3306
    - name: tcp-grp
      port: 4567
      protocol: TCP
      targetPort: 4567
    - name: tcp-ist
      port: 4568
      protocol: TCP
      targetPort: 4568
    - name: tcp-sst
      port: 4444
      protocol: TCP
      targetPort: 4444
  selector:
    statefulset.kubernetes.io/pod-name: mariadb-0
---
apiVersion: v1
kind: Service
metadata:
  name: mariadb-1
spec:
  type: ClusterIP
  ports:
    - name: tcp-mariadb
      port: 3306
      protocol: TCP
      targetPort: 3306
    - name: tcp-grp
      port: 4567
      protocol: TCP
      targetPort: 4567
    - name: tcp-ist
      port: 4568
      protocol: TCP
      targetPort: 4568
    - name: tcp-sst
      port: 4444
      protocol: TCP
      targetPort: 4444
  selector:
    statefulset.kubernetes.io/pod-name: mariadb-1
---
apiVersion: v1
kind: Service
metadata:
  name: mariadb-2
spec:
  type: ClusterIP
  ports:
    - name: tcp-mariadb
      port: 3306
      protocol: TCP
      targetPort: 3306
    - name: tcp-grp
      port: 4567
      protocol: TCP
      targetPort: 4567
    - name: tcp-ist
      port: 4568
      protocol: TCP
      targetPort: 4568
    - name: tcp-sst
      port: 4444
      protocol: TCP
      targetPort: 4444
  selector:
    statefulset.kubernetes.io/pod-name: mariadb-2

If i deploy this statefulset in a namespace without the label istio-injection=enabled, everything works, the pods connect each to another and they form a fully functioning cluster.

If i set istio-injection=enabled AND i set PeerAuthentication to mtls: mode: STRICT to the namespace, then the cluster fails to form. I even tried to set DestinationRule for each service but with no luck:

apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: strictauth-all
spec:
  mtls:
    mode: STRICT
---
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: mariadb0
spec:
  host: mariadb-0
  trafficPolicy:
    tls:
      mode: ISTIO_MUTUAL
---
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: mariadb1
spec:
  host: mariadb-1
  trafficPolicy:
    tls:
      mode: ISTIO_MUTUAL
---
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: mariadb2
spec:
  host: mariadb-2
  trafficPolicy:
    tls:
      mode: ISTIO_MUTUAL

The problem is that for some reason rsync (cluster SST) is not working through istio-proxy:

Logs on mariadb-0 (initial pod):

2022-04-28 19:58:22 0 [Note] WSREP: Shifting SYNCED -> DONOR/DESYNCED (TO: 2)
2022-04-28 19:58:22 1 [Note] WSREP: Detected STR version: 1, req_len: 112, req: STRv1
2022-04-28 19:58:22 1 [Note] WSREP: Cert index preload: 2 -> 2
2022-04-28 19:58:22 1 [Note] WSREP: Server status change synced -> donor
2022-04-28 19:58:22 0 [Note] WSREP: async IST sender starting to serve tcp://10.10.3.125:4568 sending 2-2, preload starts from 2
2022-04-28 19:58:22 [Note] WSREP_NOTIFY: /usr/local/mysql/bin/wsrep_notify.sh was called with: --status donor
2022-04-28 19:58:22 [Note] WSREP_NOTIFY: /usr/local/mysql/bin/wsrep_notify.sh was called with: --status donor
2022-04-28 19:58:22 0 [Note] WSREP: Donor monitor thread started to monitor
2022-04-28 19:58:22 0 [Note] WSREP: Running: 'wsrep_sst_rsync --role 'donor' --address '10.10.3.125:4444/rsync_sst' --local-port '3306' --socket '/var/run/mysqld/mysqld.sock' --datadir '/data/' --gtid '78930c3b-c72d-11ec-8c72-6f41e51e0c30:2' --gtid-domain-id '0' --mysqld-args --wsrep-new-cluster'
2022-04-28 19:58:22 1 [Note] WSREP: sst_donor_thread signaled with 0
2022-04-28 19:58:22 0 [Note] WSREP: Flushing tables for SST...
2022-04-28 19:58:22 0 [Note] WSREP: pause
2022-04-28 19:58:22 0 [Note] WSREP: Provider paused at 78930c3b-c72d-11ec-8c72-6f41e51e0c30:2 (5)
2022-04-28 19:58:22 0 [Note] WSREP: Server paused at: 2
2022-04-28 19:58:22 0 [Note] WSREP: Tables flushed.
rsync: did not see server greeting
rsync error: error starting client-server protocol (code 5) at main.c(1675) [sender=3.1.3]
WSREP_SST: [ERROR] rsync returned code 5: (20220428 19:58:22.903)
2022-04-28 19:58:22 0 [ERROR] WSREP: Failed to read from: wsrep_sst_rsync --role 'donor' --address '10.10.3.125:4444/rsync_sst' --local-port '3306' --socket '/var/run/mysqld/mysqld.sock' --datadir '/data/' --gtid '78930c3b-c72d-11ec-8c72-6f41e51e0c30:2' --gtid-domain-id '0' --mysqld-args --wsrep-new-cluster
2022-04-28 19:58:22 0 [ERROR] WSREP: Process completed with error: wsrep_sst_rsync --role 'donor' --address '10.10.3.125:4444/rsync_sst' --local-port '3306' --socket '/var/run/mysqld/mysqld.sock' --datadir '/data/' --gtid '78930c3b-c72d-11ec-8c72-6f41e51e0c30:2' --gtid-domain-id '0' --mysqld-args --wsrep-new-cluster: 255 (Unknown error 255)
2022-04-28 19:58:22 0 [Note] WSREP: resume
2022-04-28 19:58:22 0 [Note] WSREP: resuming provider at 5
2022-04-28 19:58:22 0 [Note] WSREP: Provider resumed.
2022-04-28 19:58:22 0 [Note] WSREP: SST sending failed: -255
2022-04-28 19:58:22 0 [Note] WSREP: Server status change donor -> joined
2022-04-28 19:58:22 [Note] WSREP_NOTIFY: /usr/local/mysql/bin/wsrep_notify.sh was called with: --status joined
2022-04-28 19:58:22 0 [ERROR] WSREP: Command did not run: wsrep_sst_rsync --role 'donor' --address '10.10.3.125:4444/rsync_sst' --local-port '3306' --socket '/var/run/mysqld/mysqld.sock' --datadir '/data/' --gtid '78930c3b-c72d-11ec-8c72-6f41e51e0c30:2' --gtid-domain-id '0' --mysqld-args --wsrep-new-cluster
2022-04-28 19:58:22 0 [Note] WSREP: Donor monitor thread ended with total time 0 sec
2022-04-28 19:58:22 0 [Warning] WSREP: 0.0 (mariadb-0): State transfer to 1.0 (mariadb-1) failed: -255 (Unknown error 255)
2022-04-28 19:58:22 0 [Note] WSREP: Shifting DONOR/DESYNCED -> JOINED (TO: 2)
2022-04-28 19:58:22 0 [Note] WSREP: Member 0.0 (mariadb-0) synced with group.
2022-04-28 19:58:22 0 [Note] WSREP: Shifting JOINED -> SYNCED (TO: 2)

and istio-proxy logs from mariadb-1 (or mariadb-2, both look the same):

2022-04-28T20:34:11.112266Z	info	JWT policy is third-party-jwt
2022-04-28T20:34:11.120934Z	warn	HTTP request unsuccessful with status: 400 Bad Request
2022-04-28T20:34:11.140093Z	info	CA Endpoint istiod.istio-system.svc:15012, provider Citadel
2022-04-28T20:34:11.140299Z	info	Using CA istiod.istio-system.svc:15012 cert with certs: var/run/secrets/istio/root-cert.pem
2022-04-28T20:34:11.140119Z	info	Opening status port 15020
2022-04-28T20:34:11.140510Z	info	citadelclient	Citadel client using custom root cert: istiod.istio-system.svc:15012
2022-04-28T20:34:11.168819Z	info	ads	All caches have been synced up in 61.101236ms, marking server ready
2022-04-28T20:34:11.169231Z	info	sds	SDS server for workload certificates started, listening on "etc/istio/proxy/SDS"
2022-04-28T20:34:11.169253Z	info	xdsproxy	Initializing with upstream address "istiod.istio-system.svc:15012" and cluster "Kubernetes"
2022-04-28T20:34:11.171267Z	info	Pilot SAN: [istiod.istio-system.svc]
2022-04-28T20:34:11.173499Z	info	Starting proxy agent
2022-04-28T20:34:11.173534Z	info	Epoch 0 starting
2022-04-28T20:34:11.173550Z	info	Envoy command: [-c etc/istio/proxy/envoy-rev0.json --restart-epoch 0 --drain-time-s 45 --drain-strategy immediate --parent-shutdown-time-s 60 --local-address-ip-version v4 --bootstrap-version 3 --file-flush-interval-msec 1000 --disable-hot-restart --log-format %Y-%m-%dT%T.%fZ	%l	envoy %n	%v -l warning --component-log-level misc:error --concurrency 2]
2022-04-28T20:34:11.175140Z	info	sds	Starting SDS grpc server
2022-04-28T20:34:11.175340Z	info	starting Http service at 127.0.0.1:15004
2022-04-28T20:34:11.278886Z	info	xdsproxy	connected to upstream XDS server: istiod.istio-system.svc:15012
2022-04-28T20:34:11.341603Z	info	ads	ADS: new connection for node:mariadb-1.cmanea-1
2022-04-28T20:34:11.342517Z	info	ads	ADS: new connection for node:mariadb-1.cmanea-2
2022-04-28T20:34:11.600609Z	info	cache	generated new workload certificate	latency=431.381267ms ttl=23h59m59.399404475s
2022-04-28T20:34:11.600657Z	info	cache	Root cert has changed, start rotating root cert
2022-04-28T20:34:11.600692Z	info	ads	XDS: Incremental Pushing:0 ConnectedEndpoints:2 Version:
2022-04-28T20:34:11.600813Z	info	cache	returned workload trust anchor from cache	ttl=23h59m59.399190674s
2022-04-28T20:34:11.600827Z	info	cache	returned workload trust anchor from cache	ttl=23h59m59.399174174s
2022-04-28T20:34:11.601021Z	info	cache	returned workload certificate from cache	ttl=23h59m59.398987473s
2022-04-28T20:34:11.601238Z	info	ads	SDS: PUSH request for node:mariadb-1.cmanea resources:1 size:1.1kB resource:ROOTCA
2022-04-28T20:34:11.601318Z	info	cache	returned workload trust anchor from cache	ttl=23h59m59.398684971s
2022-04-28T20:34:11.601379Z	info	ads	SDS: PUSH for node:mariadb-1.cmanea resources:1 size:1.1kB resource:ROOTCA
2022-04-28T20:34:11.601241Z	info	ads	SDS: PUSH request for node:mariadb-1.cmanea resources:1 size:4.0kB resource:default
[2022-04-28T20:34:11.927Z] "- - -" 0 - - - "-" 5 130 27 - "-" "-" "-" "-" "10.10.3.119:3306" outbound|3306||mariadb.cmanea.svc.cluster.local 10.10.1.152:60914 10.216.114.69:3306 10.10.1.152:58574 - -
[2022-04-28T20:34:12.003Z] "- - -" 0 UH - - "-" 0 0 7 - "-" "-" "-" "-" "-" outbound|4567||mariadb-2.cmanea.svc.cluster.local - 10.216.10.33:4567 10.10.1.152:50472 - -
2022-04-28T20:34:13.429735Z	info	Initialization took 2.320903835s
2022-04-28T20:34:13.429996Z	info	Envoy proxy is ready
[2022-04-28T20:34:14.656Z] "- - -" 0 NR filter_chain_not_found - "-" 0 0 0 - "-" "-" "-" "-" "-" - - 10.10.1.152:4444 10.10.3.119:48960 - -
[2022-04-28T20:34:12.004Z] "- - -" 0 - - - "-" 3037 4184 3693 - "-" "-" "-" "-" "10.10.3.119:4567" outbound|4567||mariadb-0.cmanea.svc.cluster.local 10.10.1.152:49860 10.216.225.145:4567 10.10.1.152:47238 - -
[2022-04-28T20:34:16.918Z] "- - -" 0 - - - "-" 5 130 5 - "-" "-" "-" "-" "10.10.3.119:3306" outbound|3306||mariadb.cmanea.svc.cluster.local 10.10.1.152:60986 10.216.114.69:3306 10.10.1.152:58646 - -
[2022-04-28T20:34:17.185Z] "- - -" 0 UH - - "-" 0 0 0 - "-" "-" "-" "-" "-" outbound|4567||mariadb-2.cmanea.svc.cluster.local - 10.216.10.33:4567 10.10.1.152:50544 - -
[2022-04-28T20:34:18.185Z] "- - -" 0 UH - - "-" 0 0 0 - "-" "-" "-" "-" "-" outbound|4567||mariadb-2.cmanea.svc.cluster.local - 10.216.10.33:4567 10.10.1.152:50562 - -
[2022-04-28T20:34:20.344Z] "- - -" 0 NR filter_chain_not_found - "-" 0 0 0 - "-" "-" "-" "-" "-" - - 10.10.1.152:4444 10.10.3.119:49028 - -
[2022-04-28T20:34:17.185Z] "- - -" 0 - - - "-" 3553 4424 4219 - "-" "-" "-" "-" "10.10.3.119:4567" outbound|4567||mariadb-0.cmanea.svc.cluster.local 10.10.1.152:49932 10.216.225.145:4567 10.10.1.152:47310 - -
[2022-04-28T20:34:34.637Z] "- - -" 0 - - - "-" 5 130 18 - "-" "-" "-" "-" "10.10.3.119:3306" outbound|3306||mariadb.cmanea.svc.cluster.local 10.10.1.152:32980 10.216.114.69:3306 10.10.1.152:58872 - -
[2022-04-28T20:34:34.814Z] "- - -" 0 UH - - "-" 0 0 0 - "-" "-" "-" "-" "-" outbound|4567||mariadb-2.cmanea.svc.cluster.local - 10.216.10.33:4567 10.10.1.152:50774 - -

IP Addresses for easy identification: mariadb-0 - pod - 10.10.3.119 - service - 10.216.225.145 mariadb-1: - pod - 10.10.1.152 - service - 10.216.94.132 mariadb-2: - pod - 10.10.3.136 - service - 10.216.10.33 mariadb service: 10.216.114.69

Version

kubectl version --short
Client Version: v1.19.4
Server Version: v1.20.15

istioctl version
client version: 1.11.3
control plane version: 1.11.3
data plane version: 1.11.3 (14 proxies)

Additional Information

No response

Apr 28 '22 20:04 crrazyman

I'm having exactly the same behaviour.

I enable Istio and no node can connect to the initial one. Furthermore, there are 0 traffic logs inside the istio-proxy containers.
I disable Istio and all works just fine.

So currently I can not use istio only because I am not able to start a galera cluster... I also tried to switch to another service mesh (the linkerd one). With that one all worked fine from the start. I also found some docs related to server first protocol. So it seems that I'm missing some port related hidden config.

Jun 29 '22 10:06 radubuciuceanu

Yes, i've also found the docs related to server first protocol and i've tried to set mTLS mode STRICT and it does not work. If i set mTLS mode DISABLE well that only disables mTLS, that's not what i want :)

Jun 29 '22 10:06 crrazyman

Ok. I figured it out. It's all related to server-first protocol. In order to make it working you have to:

Set the PILOT_ENABLE_MYSQL_FILTER env variable equals to '1' in your pods.
Configure mutual tls for all the traffic between your nodes.
Setting up the appProtocol in your services to mysql as described here.

Jun 29 '22 10:06 radubuciuceanu

What do you mean by the first 2 bullets?

Set the PILOT_ENABLE_MYSQL_FILTER env variable equals to '1' in your pods.

Where? In mariadb pods?

Configure mutual tls for all the traffic between your nodes.

How? With PeerAuthentication like below?

apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: strictauth-all
spec:
  mtls:
    mode: STRICT

Jun 29 '22 11:06 crrazyman

What do you mean by the first 2 bullets?

Set the PILOT_ENABLE_MYSQL_FILTER env variable equals to '1' in your pods.

Where? In mariadb pods?

Configure mutual tls for all the traffic between your nodes.

How? With PeerAuthentication like below?
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: strictauth-all
spec:
  mtls:
    mode: STRICT

Yes and yes.

Jun 30 '22 16:06 radubuciuceanu

🚧 This issue or pull request has been closed due to not having had activity from an Istio team member since 2022-04-28. If you feel this issue or pull request deserves attention, please reopen the issue. Please see this wiki page for more information. Thank you for your contributions.

Created by the issue and PR lifecycle manager.

Aug 12 '22 05:08 istio-policy-bot