mysql-operator
MySQL doesn't start completely and the readiness probe fails
I notice that sometimes, when the pod restarts for any reason, it never comes back up. For example:
NAME READY STATUS RESTARTS AGE
db-mysql-0 3/4 Running 0 5m29s
Pod events:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 2m2s default-scheduler Successfully assigned 6c7951pozpb68x3w3es6pbodqe/db-mysql-0 to ipxxxxxxxx
Normal Pulled 103s kubelet, ipxxxxxxxx Container image "quay.io/presslabs/mysql-operator-sidecar:0.3.2" already present on machine
Normal Created 103s kubelet, ipxxxxxxxx Created container
Normal Started 103s kubelet, ipxxxxxxxx Started container
Normal Pulled 102s kubelet, ipxxxxxxxx Container image "percona@sha256:713c1817615b333b17d0fbd252b0ccc53c48a665d4cfcb42178167435a957322" already present on machine
Normal Created 102s kubelet, ipxxxxxxxx Created container
Normal Started 102s kubelet, ipxxxxxxxx Started container
Normal Pulled 101s kubelet, ipxxxxxxxx Container image "prom/mysqld-exporter:v0.11.0" already present on machine
Normal Created 101s kubelet, ipxxxxxxxx Created container
Normal Started 101s kubelet, ipxxxxxxxx Started container
Normal Pulled 101s kubelet, ipxxxxxxxx Container image "percona@sha256:713c1817615b333b17d0fbd252b0ccc53c48a665d4cfcb42178167435a957322" already present on machine
Normal Created 101s kubelet, ipxxxxxxxx Created container
Normal Started 101s kubelet, ipxxxxxxxx Started container
Normal Pulled 101s kubelet, ipxxxxxxxx Container image "quay.io/presslabs/mysql-operator-sidecar:0.3.2" already present on machine
Normal Created 100s kubelet, ipxxxxxxxx Created container
Normal Started 100s kubelet, ipxxxxxxxx Started container
Normal Pulled 100s kubelet, ipxxxxxxxx Container image "quay.io/presslabs/mysql-operator-sidecar:0.3.2" already present on machine
Normal Created 100s kubelet, ipxxxxxxxx Created container
Normal Started 100s kubelet, ipxxxxxxxx Started container
Warning Unhealthy 84s (x7 over 96s) kubelet, ipxxxxxxxx Readiness probe failed:
Logs from the pod's containers:
Create rclone.conf file.
2019-10-24T15:47:32.823Z INFO sidecar environment is not set {"key": "INIT_BUCKET_URI"}
2019-10-24T15:47:32.823Z INFO sidecar cloning command {"host": "db-mysql-0"}
2019-10-24T15:47:32.823Z INFO sidecar data already exists! Remove manually PVC to cleanup or to reinitialize.
2019-10-24T15:47:32.824Z INFO sidecar configuring server {"host": "db-mysql-0"}
2019-10-24T15:47:32.824Z INFO sidecar error while reading PURGE GTID from xtrabackup info file {"error": "open /var/lib/mysql/xtrabackup_binlog_info: no such file or directory"}
Initialization complete, now exiting!
2019-10-24T15:47:34.526691Z 0 [Warning] TIMESTAMP with implicit DEFAULT value is deprecated. Please use --explicit_defaults_for_timestamp server option (see documentation for more details).
2019-10-24T15:47:34.528701Z 0 [Note] mysqld (mysqld 5.7.26-29-log) starting as process 1 ...
2019-10-24T15:47:34.538056Z 0 [Note] InnoDB: PUNCH HOLE support available
2019-10-24T15:47:34.538155Z 0 [Note] InnoDB: Mutexes and rw_locks use GCC atomic builtins
2019-10-24T15:47:34.538178Z 0 [Note] InnoDB: Uses event mutexes
2019-10-24T15:47:34.538215Z 0 [Note] InnoDB: GCC builtin __atomic_thread_fence() is used for memory barrier
2019-10-24T15:47:34.538233Z 0 [Note] InnoDB: Compressed tables use zlib 1.2.7
2019-10-24T15:47:34.538251Z 0 [Note] InnoDB: Using Linux native AIO
2019-10-24T15:47:34.538754Z 0 [Note] InnoDB: Number of pools: 1
2019-10-24T15:47:34.538969Z 0 [Note] InnoDB: Using CPU crc32 instructions
2019-10-24T15:47:34.545153Z 0 [Note] InnoDB: Initializing buffer pool, total size = 512M, instances = 1, chunk size = 128M
2019-10-24T15:47:34.569645Z 0 [Note] InnoDB: Completed initialization of buffer pool
2019-10-24T15:47:34.578085Z 0 [Note] InnoDB: If the mysqld execution user is authorized, page cleaner thread priority can be changed. See the man page of setpriority().
2019-10-24T15:47:34.595271Z 0 [Note] InnoDB: Crash recovery did not find the parallel doublewrite buffer at /var/lib/mysql/xb_doublewrite
2019-10-24T15:47:34.597393Z 0 [Note] InnoDB: Highest supported file format is Barracuda.
2019-10-24T15:47:34.633116Z 0 [Note] InnoDB: Created parallel doublewrite buffer at /var/lib/mysql/xb_doublewrite, size 3932160 bytes
2019-10-24T15:47:34.867432Z 0 [Note] InnoDB: Creating shared tablespace for temporary tables
2019-10-24T15:47:34.867512Z 0 [Note] InnoDB: Setting file './ibtmp1' size to 12 MB. Physically writing the file full; Please wait ...
2019-10-24T15:47:35.011558Z 0 [Note] InnoDB: File './ibtmp1' size is now 12 MB.
2019-10-24T15:47:35.012721Z 0 [Note] InnoDB: 96 redo rollback segment(s) found. 96 redo rollback segment(s) are active.
2019-10-24T15:47:35.012737Z 0 [Note] InnoDB: 32 non-redo rollback segment(s) are active.
2019-10-24T15:47:35.013453Z 0 [Note] InnoDB: Waiting for purge to start
2019-10-24T15:47:35.064136Z 0 [Note] InnoDB: Percona XtraDB (http://www.percona.com) 5.7.26-29 started; log sequence number 203893653
2019-10-24T15:47:35.064470Z 0 [Note] InnoDB: Loading buffer pool(s) from /var/lib/mysql/ib_buffer_pool
2019-10-24T15:47:35.064657Z 0 [Note] Plugin 'FEDERATED' is disabled.
2019-10-24T15:47:35.095022Z 0 [Note] Found ca.pem, server-cert.pem and server-key.pem in data directory. Trying to enable SSL support using them.
2019-10-24T15:47:35.095166Z 0 [Note] Skipping generation of SSL certificates as certificate files are present in data directory.
2019-10-24T15:47:35.099564Z 0 [Warning] CA certificate ca.pem is self signed.
2019-10-24T15:47:35.099626Z 0 [Note] Skipping generation of RSA key pair as key files are present in data directory.
2019-10-24T15:47:35.100921Z 0 [Note] Server hostname (bind-address): '*'; port: 3306
2019-10-24T15:47:35.100962Z 0 [Note] IPv6 is available.
2019-10-24T15:47:35.100976Z 0 [Note] - '::' resolves to '::';
2019-10-24T15:47:35.100998Z 0 [Note] Server socket created on IP: '::'.
2019-10-24T15:47:35.174622Z 0 [Note] InnoDB: Buffer pool(s) load completed at 191024 15:47:35
2019-10-24T15:47:35.214181Z 0 [Note] Event Scheduler: Loaded 0 events
2019-10-24T15:47:35.214376Z 0 [Note] Execution of init_file '/etc/mysql/conf.d/operator-init.sql' started.
2019-10-24T15:47:35.224302Z 0 [Note] Execution of init_file '/etc/mysql/conf.d/operator-init.sql' ended.
2019-10-24T15:47:35.224498Z 0 [Note] mysqld: ready for connections.
Version: '5.7.26-29-log' socket: '/var/lib/mysql/mysql.sock' port: 3306 Percona Server (GPL), Release 29, Revision 11ad961
Create rclone.conf file.
2019-10-24T15:47:34.787Z INFO sidecar environment is not set {"key": "INIT_BUCKET_URI"}
2019-10-24T15:47:34.788Z INFO sidecar start http server for backups
time="2019-10-24T15:47:35Z" level=info msg="Starting mysqld_exporter (version=0.11.0, branch=HEAD, revision=5d7179615695a61ecc3b5bf90a2a7c76a9592cdd)" source="mysqld_exporter.go:206"
time="2019-10-24T15:47:35Z" level=info msg="Build context (go=go1.10.3, user=root@3d3ff666b0e4, date=20180629-15:00:35)" source="mysqld_exporter.go:207"
time="2019-10-24T15:47:35Z" level=info msg="Enabled scrapers:" source="mysqld_exporter.go:218"
time="2019-10-24T15:47:35Z" level=info msg=" --collect.global_status" source="mysqld_exporter.go:222"
time="2019-10-24T15:47:35Z" level=info msg=" --collect.global_variables" source="mysqld_exporter.go:222"
time="2019-10-24T15:47:35Z" level=info msg=" --collect.slave_status" source="mysqld_exporter.go:222"
time="2019-10-24T15:47:35Z" level=info msg=" --collect.info_schema.tables" source="mysqld_exporter.go:222"
time="2019-10-24T15:47:35Z" level=info msg=" --collect.heartbeat" source="mysqld_exporter.go:222"
time="2019-10-24T15:47:35Z" level=info msg="Listening on 0.0.0.0:9125" source="mysqld_exporter.go:232"
Create rclone.conf file.
Usage: /usr/local/bin/sidecar-entrypoint.sh {files-config|clone|config-and-serve}
Now runs your command.
pt-heartbeat --update --replace --check-read-only --create-table --database sys_operator --table heartbeat --defaults-file /etc/mysql/heartbeat.conf --fail-successive-errors=20
Any advice on how to recover from this?
/cc @AMecea @jwilander
Hi @cpanato, can you please describe the pod that is not ready (by running kubectl describe <pod>) so I can see which container is not ready? If it's the mysql container, then I also need some logs from the operator, or at least check the operator logs for reconciling failures.
The operator sets the pod ready by updating the configured key under the OperatorStatusTableName table; if that update fails, the pod will not become ready. See the code here.
Hope this helps,
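For reference, the readiness check effectively boils down to reading that flag back from MySQL. A rough sketch of checking it by hand, assuming the sys_operator.status table with name/value columns (taken from the workaround quoted later in this thread) and hypothetical namespace/pod names:
kubectl -n db exec db-mysql-0 -c mysql -- \
  mysql -uroot -p'<root password>' -NB \
  -e 'SELECT value FROM sys_operator.status WHERE name="configured";'
# anything other than "1" (or an error) means the readiness probe will keep failing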
I have a similar issue with the readiness probe. The workaround has been to delete the mysql-operator pod. Any advice? The mysql operator is on 0.3.8. Thanks.
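For what it's worth, that workaround just forces the operator's controller to recreate its pod and re-run reconciliation. A sketch, assuming the operator is installed in namespace db and its pod carries an app=mysql-operator label (both are assumptions; adjust to your install):
kubectl -n db delete pod -l app=mysql-operator
# the operator's controller recreates the pod, which re-reconciles the existing MysqlCluster resources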
Similar issue here. If a mysql pod is killed and crash recovery takes too long, then the readiness probe fails for the pod and it is again killed before it can finish crash recovery. The result is a pod that just keeps restarting and will never finish recovery and rejoin the cluster.
Warning Unhealthy 5m43s (x5 over 9m43s) kubelet, gke-btc2-dev-btc2-dev-db-pods-ef5dc68d-9jzf Readiness probe failed: ERROR 2003 (HY000): Can't connect to MySQL server on '127.0.0.1' (111)
/bin/sh: line 0: test: -eq: unary operator expected
Normal Killing 5m18s kubelet, gke-btc2-dev-btc2-dev-db-pods-ef5dc68d-9jzf Container mysql failed liveness probe, will be restarted
helm ls -n db
NAME NAMESPACE REVISION UPDATED STATUS CHART APP VERSION
btc2dev db 31 2020-08-19 12:36:02.215646 -0600 MDT deployed mysql-cluster-0.2.0 1.0
mysql-operator db 10 2020-08-19 13:08:29.740631 -0600 MDT deployed mysql-operator-0.4.0 v0.4.0
resolved?
still happening.
Because we are unable to configure the readinessProbe, we cannot provide cluster stability when a mysql pod requires crash recovery. As a result we are unable to use this solution in a production environment. We really like the operator model and have been eager to use it in production. If we could set readinessProbe configuration in the mysql podSpec it would get us there, but as-is we can only use this for dev/test environments.
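As far as this thread goes there is no supported knob for the probe in the MysqlCluster podSpec, so the only stopgap would be patching the probe on the StatefulSet the operator generates, and the operator will most likely revert that on its next reconcile. A sketch with assumed namespace, resource name, and container index:
kubectl -n db patch statefulset db-mysql --type=json \
  -p '[{"op": "replace", "path": "/spec/template/spec/containers/1/readinessProbe/failureThreshold", "value": 60}]'
# container index 1 and the threshold are illustrative; check the container order first with
# kubectl -n db get sts db-mysql -o jsonpath='{.spec.template.spec.containers[*].name}'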
This is a serious bug, which is not resolved yet. In my case, the mysql container was not ready. After ssh-ing into the mysql container and manually executing UPDATE sys_operator.status SET value="1" WHERE name="configured", the pod turned ready for me.
One side note: we had two instances with more than 10 MB of data and one with approx. 1.5 MB of data. The one with 1.5 MB of data was the only one that passed the readiness check.
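A sketch of that workaround without an interactive shell, with hypothetical namespace/pod names and a placeholder for the root password (take it from whatever secret your cluster uses):
kubectl -n db exec db-mysql-0 -c mysql -- \
  mysql -uroot -p'<root password>' \
  -e 'UPDATE sys_operator.status SET value="1" WHERE name="configured";'
# this only flips the flag the readiness probe checks; it does not fix whatever kept the sidecar from setting it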
Same here and it's easy to reproduce:
kubectl delete pod ha-mysql-db-mysql-0 --grace-period=0 --force
After ssh-ing into the mysql container and manually executing UPDATE sys_operator.status SET value="1" WHERE name="configured", the pod turned ready for me...
@HendrikRoehm Can configured have different values in different Pods of the same MySQL cluster? Are these from different (local) databases? In other words, do I need to apply this workaround in multiple Pods?
@haslersn I don't really know as I am no expert on this. In my case this was a small test mysql instance with no replication and thus only 1 pod. I would guess that you should set it on the master instance of the mysql cluster if there are multiple instances.
We ran into this issue today when running a helm upgrade.
Is there any workaround for this issue? I managed to make my cluster work by updating sys_operator.status manually from inside the mysql pod, and everything came up, but this is not the right way (if you have to move the pod to another node, the issue comes back). Does the operator manage this value?
Can @calind take a look at this? This bug rules out any serious use of this operator.
Just encountered this with a cluster whose node was recycled. The mysql pod reports itself as unready despite MySQL showing as healthy, and MySQL is not accessible from outside. This killed my plans to use this operator in production; I had to move on.
https://github.com/bitpoke/mysql-operator/issues/777#issuecomment-997704788
Are you sure it has nothing to do with calico? In my case, the problem with this
@tebaly Our production environment is using flannel; same problem.
This could be a bug related to this table deletion, where the code assumes the file does not exist while the file actually exists: https://github.com/bitpoke/mysql-operator/blob/18a6031621733203b43374b835eaa2fe59d35795/pkg/sidecar/appconf.go#L201-L205
I was able to make the MySQL cluster become ready by deleting this file and then deleting the pod:
/var/lib/mysql/sys_operator/status.ibd
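A sketch of that sequence with hypothetical namespace/pod names (deleting files under the datadir is invasive; this only mirrors what worked for the commenter above):
kubectl -n db exec db-mysql-0 -c mysql -- rm /var/lib/mysql/sys_operator/status.ibd
kubectl -n db delete pod db-mysql-0
# per the theory above, removing the stale .ibd lets the sidecar recreate the status table on the next start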
Looking forward to a fix for this bug so I can reconsider this operator for use in production.
It seems like performing a manual failover, i.e. promoting one of the read replicas to be the new master, fixes this issue as well. At least that did the trick in my case.
This issue occurred in a freshly set up test cluster where I had shut down (gracefully, using ACPI powerdown) all three nodes at the same time on purpose. After starting all three up again, I ended up in that very state: basically everything worked fine, except Kubernetes didn't activate the MySQL service because the master node was stuck with the mysql container in state NotReady, and manual intervention was necessary. (Also using flannel here, by the way.)
My only workaround is the following (a rough sketch as commands follows below):
- scale the mysql-operator replicas to 0
- delete the mysqlcluster statefulset
- scale the mysql-operator replicas back to 1
- apply the mysql-cluster yaml
Not pretty though.
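A sketch of those steps as commands, assuming the operator release and the cluster live in namespace db, the StatefulSet is db-mysql, and the cluster manifest is mysql-cluster.yaml (all names are placeholders):
kubectl -n db scale deployment mysql-operator --replicas=0
# note: depending on the chart version the operator may be a StatefulSet rather than a Deployment; adjust the resource kind
kubectl -n db delete statefulset db-mysql
kubectl -n db scale deployment mysql-operator --replicas=1
kubectl -n db apply -f mysql-cluster.yaml
# once scaled back up, the operator recreates the StatefulSet from the MysqlCluster spec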
It is happening to me on kind.sigs.k8s.io v1.23.5 with mysql-operator 0.6.2.
Looks like node_controller somehow "forgets" to reconcile. Any edit on the database pod triggers a reconcile and eventually sets status=1.
I still don't understand why the reconcile was not happening. The only way I can reproduce the issue is by shutting down my PC and starting it again, and even then it only triggers the issue occasionally, at random.
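Based on the observation above that any edit to the database pod triggers a reconcile, one low-impact way to nudge the controller is to bump an annotation on the pod (namespace, pod name, and the annotation key/value here are arbitrary and purely illustrative):
kubectl -n db annotate pod db-mysql-0 force-reconcile="$(date +%s)" --overwrite
# this just generates an update event for the pod so node_controller reconciles it again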
readinessProbe/livenessProbe configs would be most helpful... this is basic.