mysql-operator
MySQL doesn't start completely and the readiness probe fails
I notice that sometimes, when the pod restarts for any reason, it never comes back up. For example:
NAME READY STATUS RESTARTS AGE
db-mysql-0 3/4 Running 0 5m29s
Pod events:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 2m2s default-scheduler Successfully assigned 6c7951pozpb68x3w3es6pbodqe/db-mysql-0 to ipxxxxxxxx
Normal Pulled 103s kubelet, ipxxxxxxxx Container image "quay.io/presslabs/mysql-operator-sidecar:0.3.2" already present on machine
Normal Created 103s kubelet, ipxxxxxxxx Created container
Normal Started 103s kubelet, ipxxxxxxxx Started container
Normal Pulled 102s kubelet, ipxxxxxxxx Container image "percona@sha256:713c1817615b333b17d0fbd252b0ccc53c48a665d4cfcb42178167435a957322" already present on machine
Normal Created 102s kubelet, ipxxxxxxxx Created container
Normal Started 102s kubelet, ipxxxxxxxx Started container
Normal Pulled 101s kubelet, ipxxxxxxxx Container image "prom/mysqld-exporter:v0.11.0" already present on machine
Normal Created 101s kubelet, ipxxxxxxxx Created container
Normal Started 101s kubelet, ipxxxxxxxx Started container
Normal Pulled 101s kubelet, ipxxxxxxxx Container image "percona@sha256:713c1817615b333b17d0fbd252b0ccc53c48a665d4cfcb42178167435a957322" already present on machine
Normal Created 101s kubelet, ipxxxxxxxx Created container
Normal Started 101s kubelet, ipxxxxxxxx Started container
Normal Pulled 101s kubelet, ipxxxxxxxx Container image "quay.io/presslabs/mysql-operator-sidecar:0.3.2" already present on machine
Normal Created 100s kubelet, ipxxxxxxxx Created container
Normal Started 100s kubelet, ipxxxxxxxx Started container
Normal Pulled 100s kubelet, ipxxxxxxxx Container image "quay.io/presslabs/mysql-operator-sidecar:0.3.2" already present on machine
Normal Created 100s kubelet, ipxxxxxxxx Created container
Normal Started 100s kubelet, ipxxxxxxxx Started container
Warning Unhealthy 84s (x7 over 96s) kubelet, ipxxxxxxxx Readiness probe failed:
Logs from the pod's containers:
Create rclone.conf file.
2019-10-24T15:47:32.823Z INFO sidecar environment is not set {"key": "INIT_BUCKET_URI"}
2019-10-24T15:47:32.823Z INFO sidecar cloning command {"host": "db-mysql-0"}
2019-10-24T15:47:32.823Z INFO sidecar data already exists! Remove manually PVC to cleanup or to reinitialize.
2019-10-24T15:47:32.824Z INFO sidecar configuring server {"host": "db-mysql-0"}
2019-10-24T15:47:32.824Z INFO sidecar error while reading PURGE GTID from xtrabackup info file {"error": "open /var/lib/mysql/xtrabackup_binlog_info: no such file or directory"}
Initialization complete, now exiting!
2019-10-24T15:47:34.526691Z 0 [Warning] TIMESTAMP with implicit DEFAULT value is deprecated. Please use --explicit_defaults_for_timestamp server option (see documentation for more details).
2019-10-24T15:47:34.528701Z 0 [Note] mysqld (mysqld 5.7.26-29-log) starting as process 1 ...
2019-10-24T15:47:34.538056Z 0 [Note] InnoDB: PUNCH HOLE support available
2019-10-24T15:47:34.538155Z 0 [Note] InnoDB: Mutexes and rw_locks use GCC atomic builtins
2019-10-24T15:47:34.538178Z 0 [Note] InnoDB: Uses event mutexes
2019-10-24T15:47:34.538215Z 0 [Note] InnoDB: GCC builtin __atomic_thread_fence() is used for memory barrier
2019-10-24T15:47:34.538233Z 0 [Note] InnoDB: Compressed tables use zlib 1.2.7
2019-10-24T15:47:34.538251Z 0 [Note] InnoDB: Using Linux native AIO
2019-10-24T15:47:34.538754Z 0 [Note] InnoDB: Number of pools: 1
2019-10-24T15:47:34.538969Z 0 [Note] InnoDB: Using CPU crc32 instructions
2019-10-24T15:47:34.545153Z 0 [Note] InnoDB: Initializing buffer pool, total size = 512M, instances = 1, chunk size = 128M
2019-10-24T15:47:34.569645Z 0 [Note] InnoDB: Completed initialization of buffer pool
2019-10-24T15:47:34.578085Z 0 [Note] InnoDB: If the mysqld execution user is authorized, page cleaner thread priority can be changed. See the man page of setpriority().
2019-10-24T15:47:34.595271Z 0 [Note] InnoDB: Crash recovery did not find the parallel doublewrite buffer at /var/lib/mysql/xb_doublewrite
2019-10-24T15:47:34.597393Z 0 [Note] InnoDB: Highest supported file format is Barracuda.
2019-10-24T15:47:34.633116Z 0 [Note] InnoDB: Created parallel doublewrite buffer at /var/lib/mysql/xb_doublewrite, size 3932160 bytes
2019-10-24T15:47:34.867432Z 0 [Note] InnoDB: Creating shared tablespace for temporary tables
2019-10-24T15:47:34.867512Z 0 [Note] InnoDB: Setting file './ibtmp1' size to 12 MB. Physically writing the file full; Please wait ...
2019-10-24T15:47:35.011558Z 0 [Note] InnoDB: File './ibtmp1' size is now 12 MB.
2019-10-24T15:47:35.012721Z 0 [Note] InnoDB: 96 redo rollback segment(s) found. 96 redo rollback segment(s) are active.
2019-10-24T15:47:35.012737Z 0 [Note] InnoDB: 32 non-redo rollback segment(s) are active.
2019-10-24T15:47:35.013453Z 0 [Note] InnoDB: Waiting for purge to start
2019-10-24T15:47:35.064136Z 0 [Note] InnoDB: Percona XtraDB (http://www.percona.com) 5.7.26-29 started; log sequence number 203893653
2019-10-24T15:47:35.064470Z 0 [Note] InnoDB: Loading buffer pool(s) from /var/lib/mysql/ib_buffer_pool
2019-10-24T15:47:35.064657Z 0 [Note] Plugin 'FEDERATED' is disabled.
2019-10-24T15:47:35.095022Z 0 [Note] Found ca.pem, server-cert.pem and server-key.pem in data directory. Trying to enable SSL support using them.
2019-10-24T15:47:35.095166Z 0 [Note] Skipping generation of SSL certificates as certificate files are present in data directory.
2019-10-24T15:47:35.099564Z 0 [Warning] CA certificate ca.pem is self signed.
2019-10-24T15:47:35.099626Z 0 [Note] Skipping generation of RSA key pair as key files are present in data directory.
2019-10-24T15:47:35.100921Z 0 [Note] Server hostname (bind-address): '*'; port: 3306
2019-10-24T15:47:35.100962Z 0 [Note] IPv6 is available.
2019-10-24T15:47:35.100976Z 0 [Note] - '::' resolves to '::';
2019-10-24T15:47:35.100998Z 0 [Note] Server socket created on IP: '::'.
2019-10-24T15:47:35.174622Z 0 [Note] InnoDB: Buffer pool(s) load completed at 191024 15:47:35
2019-10-24T15:47:35.214181Z 0 [Note] Event Scheduler: Loaded 0 events
2019-10-24T15:47:35.214376Z 0 [Note] Execution of init_file '/etc/mysql/conf.d/operator-init.sql' started.
2019-10-24T15:47:35.224302Z 0 [Note] Execution of init_file '/etc/mysql/conf.d/operator-init.sql' ended.
2019-10-24T15:47:35.224498Z 0 [Note] mysqld: ready for connections.
Version: '5.7.26-29-log' socket: '/var/lib/mysql/mysql.sock' port: 3306 Percona Server (GPL), Release 29, Revision 11ad961
Create rclone.conf file.
2019-10-24T15:47:34.787Z INFO sidecar environment is not set {"key": "INIT_BUCKET_URI"}
2019-10-24T15:47:34.788Z INFO sidecar start http server for backups
time="2019-10-24T15:47:35Z" level=info msg="Starting mysqld_exporter (version=0.11.0, branch=HEAD, revision=5d7179615695a61ecc3b5bf90a2a7c76a9592cdd)" source="mysqld_exporter.go:206"
time="2019-10-24T15:47:35Z" level=info msg="Build context (go=go1.10.3, user=root@3d3ff666b0e4, date=20180629-15:00:35)" source="mysqld_exporter.go:207"
time="2019-10-24T15:47:35Z" level=info msg="Enabled scrapers:" source="mysqld_exporter.go:218"
time="2019-10-24T15:47:35Z" level=info msg=" --collect.global_status" source="mysqld_exporter.go:222"
time="2019-10-24T15:47:35Z" level=info msg=" --collect.global_variables" source="mysqld_exporter.go:222"
time="2019-10-24T15:47:35Z" level=info msg=" --collect.slave_status" source="mysqld_exporter.go:222"
time="2019-10-24T15:47:35Z" level=info msg=" --collect.info_schema.tables" source="mysqld_exporter.go:222"
time="2019-10-24T15:47:35Z" level=info msg=" --collect.heartbeat" source="mysqld_exporter.go:222"
time="2019-10-24T15:47:35Z" level=info msg="Listening on 0.0.0.0:9125" source="mysqld_exporter.go:232"
Create rclone.conf file.
Usage: /usr/local/bin/sidecar-entrypoint.sh {files-config|clone|config-and-serve}
Now runs your command.
pt-heartbeat --update --replace --check-read-only --create-table --database sys_operator --table heartbeat --defaults-file /etc/mysql/heartbeat.conf --fail-successive-errors=20
Any advice on how to recover from this?
/cc @AMecea @jwilander
Hi @cpanato, can you please describe the pod that is not ready (by running kubectl describe <pod>) so I can see which container is not ready? If it's the mysql container, then I also need some logs from the operator, or at least check the operator logs for reconciling failures.
The operator sets the pod ready by updating the configured key under the OperatorStatusTableName table; if that update fails, the pod will not become ready. See the code here.
Hope this helps,
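For reference, the readiness check effectively boils down to reading that flag back from MySQL. A rough sketch of checking it by hand, assuming the sys_operator.status table with name/value columns (taken from the workaround quoted later in this thread) and hypothetical namespace/pod names:
kubectl -n db exec db-mysql-0 -c mysql -- \
  mysql -uroot -p'<root password>' -NB \
  -e 'SELECT value FROM sys_operator.status WHERE name="configured";'
# anything other than "1" (or an error) means the readiness probe will keep failing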
I have a similar issue with the readiness probe. The workaround has been to delete the mysql-operator pod. Any advice? The mysql operator is on 0.3.8. Thanks.
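For what it's worth, that workaround just forces the operator's controller to recreate its pod and re-run reconciliation. A sketch, assuming the operator is installed in namespace db and its pod carries an app=mysql-operator label (both are assumptions; adjust to your install):
kubectl -n db delete pod -l app=mysql-operator
# the operator's controller recreates the pod, which re-reconciles the existing MysqlCluster resources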
Similar issue here. If a mysql pod is killed and crash recovery takes too long, then the readiness probe fails for the pod and it is again killed before it can finish crash recovery. The result is a pod that just keeps restarting and will never finish recovery and rejoin the cluster.
Warning Unhealthy 5m43s (x5 over 9m43s) kubelet, gke-btc2-dev-btc2-dev-db-pods-ef5dc68d-9jzf Readiness probe failed: ERROR 2003 (HY000): Can't connect to MySQL server on '127.0.0.1' (111)
/bin/sh: line 0: test: -eq: unary operator expected
Normal Killing 5m18s kubelet, gke-btc2-dev-btc2-dev-db-pods-ef5dc68d-9jzf Container mysql failed liveness probe, will be restarted
helm ls -n db
NAME NAMESPACE REVISION UPDATED STATUS CHART APP VERSION
btc2dev db 31 2020-08-19 12:36:02.215646 -0600 MDT deployed mysql-cluster-0.2.0 1.0
mysql-operator db 10 2020-08-19 13:08:29.740631 -0600 MDT deployed mysql-operator-0.4.0 v0.4.0
resolved?
still happening.
Because we are unable to configure the readinessProbe, we cannot provide cluster stability when a mysql pod requires crash recovery. As a result we are unable to use this solution in a production environment. We really like the operator model and have been eager to use it in production. If we could set readinessProbe configuration in the mysql podSpec it would get us there, but as-is we can only use this for dev/test environments.
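As far as this thread goes there is no supported knob for the probe in the MysqlCluster podSpec, so the only stopgap would be patching the probe on the StatefulSet the operator generates, and the operator will most likely revert that on its next reconcile. A sketch with assumed namespace, resource name, and container index:
kubectl -n db patch statefulset db-mysql --type=json \
  -p '[{"op": "replace", "path": "/spec/template/spec/containers/1/readinessProbe/failureThreshold", "value": 60}]'
# container index 1 and the threshold are illustrative; check the container order first with
# kubectl -n db get sts db-mysql -o jsonpath='{.spec.template.spec.containers[*].name}'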
This is a serious bug, which is not resolved yet. In my case, the mysql container was not ready. After ssh-ing into the mysql container and manually executing UPDATE sys_operator.status SET value="1" WHERE name="configured", the pod turned ready for me.
One side note: we had two instances with more than 10 MB of data and one with approx. 1.5 MB of data. The one with 1.5 MB of data was the only one that passed the readiness check.
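A sketch of that workaround without an interactive shell, with hypothetical namespace/pod names and a placeholder for the root password (take it from whatever secret your cluster uses):
kubectl -n db exec db-mysql-0 -c mysql -- \
  mysql -uroot -p'<root password>' \
  -e 'UPDATE sys_operator.status SET value="1" WHERE name="configured";'
# this only flips the flag the readiness probe checks; it does not fix whatever kept the sidecar from setting it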
Same here and it's easy to reproduce:
kubectl delete pod ha-mysql-db-mysql-0 --grace-period=0 --force
After ssh-ing into the mysql container and manually executing UPDATE sys_operator.status SET value="1" WHERE name="configured", the pod turned ready for me...
@HendrikRoehm Can configured have different values in different Pods of the same MySQL cluster? Are these from different (local) databases? In other words, do I need to apply this workaround in multiple Pods?
@haslersn I don't really know as I am no expert on this. In my case this was a small test mysql instance with no replication and thus only 1 pod. I would guess that you should set it on the master instance of the mysql cluster if there are multiple instances.
We ran into this issue today when running a helm upgrade.
Is there any workaround for this issue? I managed to make my cluster work by updating sys_operator.status manually from inside the mysql pod, and everything came up, but this is not the right way (if you have to move the pod to another node, the issue comes back). Does the operator manage this value?
Can @calind take a look at this? This bug rules out any serious use of this operator.
Just encountered this with a cluster whose node was recycled. The mysql pod reports itself as unready despite MySQL showing as healthy, and MySQL is not accessible from outside. This killed my plans to use this operator in production; I had to move on.
https://github.com/bitpoke/mysql-operator/issues/777#issuecomment-997704788
Are you sure it has nothing to do with calico? In my case, the problem with this
@tebaly Our production environment is using flannel; same problem.
This could be a bug related to this table deletion, where the code assumes the file does not exist while the file actually exists: https://github.com/bitpoke/mysql-operator/blob/18a6031621733203b43374b835eaa2fe59d35795/pkg/sidecar/appconf.go#L201-L205
I was able to make the MySQL cluster become ready by deleting this file and then deleting the pod:
/var/lib/mysql/sys_operator/status.ibd
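A sketch of that sequence with hypothetical namespace/pod names (deleting files under the datadir is invasive; this only mirrors what worked for the commenter above):
kubectl -n db exec db-mysql-0 -c mysql -- rm /var/lib/mysql/sys_operator/status.ibd
kubectl -n db delete pod db-mysql-0
# per the theory above, removing the stale .ibd lets the sidecar recreate the status table on the next start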
Looking forward to a fix for this bug so I can reconsider this operator for use in production.
It seems like performing a manual failover, i.e. promoting one of the read replicas to be the new master, fixes this issue as well. At least that did the trick in my case.
This issue occurred in a freshly set up test cluster where I had shut down (gracefully, using ACPI powerdown) all three nodes at the same time on purpose. After starting all three up again, I ended up in that very state: basically everything worked fine, except Kubernetes didn't activate the MySQL service because the master node was stuck with the mysql container in state NotReady, and manual intervention was necessary. (Also using flannel here, by the way.)
My only workaround is the following (a rough sketch as commands follows below):
- scale the mysql-operator replicas to 0
- delete the mysqlcluster statefulset
- scale the mysql-operator replicas back to 1
- apply the mysql-cluster yaml
Not pretty though.
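A sketch of those steps as commands, assuming the operator release and the cluster live in namespace db, the StatefulSet is db-mysql, and the cluster manifest is mysql-cluster.yaml (all names are placeholders):
kubectl -n db scale deployment mysql-operator --replicas=0
# note: depending on the chart version the operator may be a StatefulSet rather than a Deployment; adjust the resource kind
kubectl -n db delete statefulset db-mysql
kubectl -n db scale deployment mysql-operator --replicas=1
kubectl -n db apply -f mysql-cluster.yaml
# once scaled back up, the operator recreates the StatefulSet from the MysqlCluster spec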
It is happening to me on kind.sigs.k8s.io v1.23.5 with mysql-operator 0.6.2.
Looks like node_controller somehow "forgets" to reconcile. Any edit on the database pod triggers a reconcile and eventually sets status=1.
I still don't understand why the reconcile was not happening. The only way I can reproduce the issue is by shutting down my PC and starting it again, and even then it only triggers the issue occasionally, at random.
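Based on the observation above that any edit to the database pod triggers a reconcile, one low-impact way to nudge the controller is to bump an annotation on the pod (namespace, pod name, and the annotation key/value here are arbitrary and purely illustrative):
kubectl -n db annotate pod db-mysql-0 force-reconcile="$(date +%s)" --overwrite
# this just generates an update event for the pod so node_controller reconciles it again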
readinessProbe/livenessProbe configs would be most helpful... this is basic.