[Bug] MariaDB fails to restart after the whole cluster rebooted
Describe the bug
I get the following error after all nodes where MariaDB instances exist have been rebooted:
mariadb 2023-10-27 9:13:02 30477 [ERROR] Master 'mariadb-operator': Error reading packet from server: Lost connection to server during query (server_errno=2013)
mariadb 2023-10-27 9:13:02 30477 [Note] Master 'mariadb-operator': Slave I/O thread exiting, read up to log 'mariadb-bin.000001', position 912848; GTID position 0-12-61, master mariadb-2.mariadb-internal.seichi-minecraft.svc.cluster.local:3306
mariadb 2023-10-27 9:13:02 30484 [Note] Master 'mariadb-operator': Slave I/O thread: Start semi-sync replication to master 'repl@mariadb-2.mariadb-internal.seichi-minecraft.svc.cluster.local:3306' in log 'mariadb-bin.000001' at position 1680
mariadb 2023-10-27 9:13:02 30485 [Note] Master 'mariadb-operator': Slave SQL thread initialized, starting replication in log 'mariadb-bin.000001' at position 1680, relay log './[email protected]' position: 4; GTID position ''
mariadb 2023-10-27 9:13:02 30484 [Note] Master 'mariadb-operator': Slave I/O thread: connected to master 'repl@mariadb-2.mariadb-internal.seichi-minecraft.svc.cluster.local:3306',replication starts at GTID position ''
mariadb 2023-10-27 9:13:03 30485 [ERROR] Master 'mariadb-operator': Slave SQL: Could not execute Write_rows_v1 event on table seichi-timed-stats-conifers.break_count_diff_point; Duplicate entry '2099' for key 'PRIMARY', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log mariadb-bin.000001, end_log_pos 2067, Gtid 0-12-10, Internal MariaDB error code: 1062
mariadb 2023-10-27 9:13:03 30485 [Warning] Master 'mariadb-operator': Slave: Duplicate entry '2099' for key 'PRIMARY' Error_code: 1062
mariadb 2023-10-27 9:13:03 30485 [ERROR] Master 'mariadb-operator': Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'mariadb-bin.000001' position 1680; GTID position '0-12-9'
mariadb 2023-10-27 9:13:03 30485 [Note] Master 'mariadb-operator': Slave SQL thread exiting, replication stopped in log 'mariadb-bin.000001' at position 1680; GTID position '0-12-9', master: mariadb-2.mariadb-internal.seichi-minecraft.svc.cluster.local:3306
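In case it helps with triage, the state of the operator's named replication connection can be inspected from inside the affected pod (a diagnostic sketch, assuming shell access to the pod and the root credentials; 'mariadb-operator' is the connection name that appears in the logs above):

-- Run in the mysql client inside the broken replica pod.
-- SHOW SLAVE '<name>' STATUS is MariaDB's multi-source syntax; the output
-- includes Master_Host, Slave_IO_Running, Slave_SQL_Running and Last_SQL_Error.
SHOW SLAVE 'mariadb-operator' STATUS\G

-- GTID state of this server: applied replica position, own binlog position,
-- and the combined current position.
SELECT @@gtid_slave_pos, @@gtid_binlog_pos, @@gtid_current_pos;

Here is the MariaDB resource I am using: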
apiVersion: mariadb.mmontes.io/v1alpha1
kind: MariaDB
metadata:
  namespace: seichi-minecraft
  name: mariadb
spec:
  rootPasswordSecretKeyRef:
    name: mariadb
    key: root-password
  database: seichiassist
  username: mcserver
  passwordSecretKeyRef:
    name: mariadb
    key: mcserver-password
  image:
    repository: mariadb
    tag: "10.11.4"
    pullPolicy: IfNotPresent
  port: 3306
  replicas: 3
  replication:
    enabled: true
    primary:
      podIndex: 0
      automaticFailover: true
    replica:
      waitPoint: AfterSync
      connectionTimeout: 10s
      connectionRetries: 10
      syncTimeout: 10s
  primaryService:
    type: ClusterIP
  primaryConnection:
    secretName: primary-mariadb-conn
    secretTemplate:
      key: dsn
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - topologyKey: "kubernetes.io/hostname"
  tolerations:
    - key: "mariadb.mmontes.io/ha"
      operator: "Exists"
      effect: "NoSchedule"
  podDisruptionBudget:
    minAvailable: "66%"
  myCnf: |
    [mysqld]
    bind-address=0.0.0.0
    default_storage_engine=InnoDB
    binlog_format=row
    innodb_autoinc_lock_mode=2
    max_allowed_packet=256M
  volumeClaimTemplate:
    resources:
      requests:
        storage: 100Gi
    storageClassName: synology-iscsi-storage
    accessModes:
      - ReadWriteOnce
  livenessProbe:
    exec:
      command:
        - bash
        - -c
        - mysql -u root -p"${MARIADB_ROOT_PASSWORD}" -e "SELECT 1;"
    initialDelaySeconds: 20
    periodSeconds: 10
    timeoutSeconds: 5
  readinessProbe:
    exec:
      command:
        - bash
        - -c
        - mysql -u root -p"${MARIADB_ROOT_PASSWORD}" -e "SELECT 1;"
    initialDelaySeconds: 20
    periodSeconds: 10
    timeoutSeconds: 5
  metrics:
    exporter:
      image:
        repository: prom/mysqld-exporter
        tag: "v0.14.0"
        pullPolicy: IfNotPresent
      resources:
        requests:
          cpu: 300m
          memory: 512Mi
        limits:
          cpu: 300m
          memory: 512Mi
    serviceMonitor:
      prometheusRelease: prometheus
      interval: 10s
      scrapeTimeout: 10s
Is the bug related to Galera?
- [x] I have checked the Galera documentation
- [x] I have checked the Galera troubleshooting guide
Additional context
Environment details:
- Kubernetes version: 1.27.6
- mariadb-operator version: 0.20.0
- Install method: helm
- Install flavour: minimal
This issue is stale because it has been open 30 days with no activity.
This issue was closed because it has been stalled for 5 days with no activity.
@mmontes11 can you reopen this? :(
I might have run into the same issue. I'm not sure if the whole cluster rebooted, but two replicas are out of sync now. This is currently the case in both MariaDB clusters I run on the same Kubernetes cluster.
I had to delete one of the replica pods today as it was promoted to master by the operator, which brought the issue to light. When I deleted the pod, the most up-to-date replica became master again, restoring the correct state of the cluster. When the deleted pod came back online, the logs showed:
2023-12-13 14:49:21 0 [Note] InnoDB: Compressed tables use zlib 1.2.11
2023-12-13 14:49:21 0 [Note] InnoDB: Number of pools: 1
2023-12-13 14:49:21 0 [Note] InnoDB: Using crc32 + pclmulqdq instructions
2023-12-13 14:49:21 0 [Note] mariadbd: O_TMPFILE is not supported on /tmp (disabling future attempts)
2023-12-13 14:49:21 0 [Note] InnoDB: Using Linux native AIO
2023-12-13 14:49:21 0 [Note] InnoDB: Initializing buffer pool, total size = 8589934592, chunk size = 134217728
2023-12-13 14:49:21 0 [Note] InnoDB: Completed initialization of buffer pool
2023-12-13 14:49:21 0 [Note] InnoDB: 128 rollback segments are active.
2023-12-13 14:49:21 0 [Note] InnoDB: Creating shared tablespace for temporary tables
2023-12-13 14:49:21 0 [Note] InnoDB: Setting file './ibtmp1' size to 12 MB. Physically writing the file full; Please wait ...
2023-12-13 14:49:21 0 [Note] InnoDB: File './ibtmp1' size is now 12 MB.
2023-12-13 14:49:21 0 [Note] InnoDB: 10.6.13 started; log sequence number 16155782456; transaction id 13287581
2023-12-13 14:49:21 0 [Note] InnoDB: Loading buffer pool(s) from /var/lib/mysql/ib_buffer_pool
2023-12-13 14:49:21 0 [Note] Plugin 'FEEDBACK' is disabled.
2023-12-13 14:49:21 0 [Warning] 'innodb-buffer-pool-instances' was removed. It does nothing now and exists only for compatibility with old my.cnf files.
2023-12-13 14:49:21 0 [Note] Server socket created on IP: '0.0.0.0'.
2023-12-13 14:49:21 0 [Note] Server socket created on IP: '::'.
2023-12-13 14:49:21 3 [Note] Reading Master_info: '[email protected]' Relay_info:'[email protected]'
2023-12-13 14:49:21 3 [Note] Initialized Master_info from '[email protected]'
2023-12-13 14:49:21 4 [Note] Master 'mariadb-operator': Slave I/O thread: Start asynchronous replication to master 'repl@mariadb-0.mariadb-internal.analytics.svc.cluster.local:3306' in log 'mariadb-bin.000004' at position 13297925
2023-12-13 14:49:21 3 [Note] Started replication for 'mariadb-operator'
2023-12-13 14:49:21 5 [Note] Master 'mariadb-operator': Slave SQL thread initialized, starting replication in log 'mariadb-bin.000004' at position 13297925, relay log './[email protected]' position: 4; GTID position ''
2023-12-13 14:49:21 0 [Note] mariadbd: ready for connections.
Version: '10.6.13-MariaDB-1:10.6.13+maria~ubu2004-log' socket: '/run/mysqld/mysqld.sock' port: 3306 mariadb.org binary distribution
2023-12-13 14:49:21 4 [Note] Master 'mariadb-operator': Slave I/O thread: connected to master 'repl@mariadb-0.mariadb-internal.analytics.svc.cluster.local:3306',replication starts at GTID position ''
2023-12-13 14:49:32 0 [Note] InnoDB: Buffer pool(s) load completed at 231213 14:49:32
2023-12-13 14:49:49 5 [Note] Master 'mariadb-operator': Slave SQL thread exiting, replication stopped in log 'mariadb-bin.000001' at position 33348783; GTID position '0-10-30947', master: mariadb-0.mariadb-internal.analytics.svc.cluster.local:3306
2023-12-13 14:49:49 4 [Note] Master 'mariadb-operator': Slave I/O thread exiting, read up to log 'mariadb-bin.000002', position 374705570; GTID position 0-10-1355222, master mariadb-0.mariadb-internal.analytics.svc.cluster.local:3306
2023-12-13 14:49:49 491 [Note] Master connection name: 'mariadb-operator' Master_info_file: '[email protected]' Relay_info_file: '[email protected]'
2023-12-13 14:49:49 491 [Note] 'CHANGE MASTER TO executed'. Previous state master_host='mariadb-0.mariadb-internal.analytics.svc.cluster.local', master_port='3306', master_log_file='mariadb-bin.000002', master_log_pos='374705570'. New state master_host='mariadb-0.mariadb-internal.analytics.svc.cluster.local', master_port='3306', master_log_file='', master_log_pos='4'.
2023-12-13 14:49:49 491 [Note] Previous Using_Gtid=Current_Pos. New Using_Gtid=Current_Pos
2023-12-13 14:49:49 492 [Note] Master 'mariadb-operator': Slave I/O thread: Start semi-sync replication to master 'repl@mariadb-0.mariadb-internal.analytics.svc.cluster.local:3306' in log '' at position 4
2023-12-13 14:49:49 493 [Note] Master 'mariadb-operator': Slave SQL thread initialized, starting replication in log 'FIRST' at position 4, relay log './[email protected]' position: 4; GTID position '0-10-30947'
2023-12-13 14:49:49 492 [Note] Master 'mariadb-operator': Slave I/O thread: connected to master 'repl@mariadb-0.mariadb-internal.analytics.svc.cluster.local:3306',replication starts at GTID position '0-10-30947'
2023-12-13 14:50:31 493 [ERROR] Master 'mariadb-operator': mariadbd: Can't find record in 'view_track_stats'
2023-12-13 14:50:31 493 [ERROR] Master 'mariadb-operator': Slave SQL: Could not execute Update_rows_v1 event on table analytics.view_track_stats; Can't find record in 'view_track_stats', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log mariadb-bin.000001, end_log_pos 128851519, Gtid 0-10-118186, Internal MariaDB error code: 1032
2023-12-13 14:50:31 493 [Warning] Master 'mariadb-operator': Slave: Can't find record in 'view_track_stats' Error_code: 1032
2023-12-13 14:50:31 493 [ERROR] Master 'mariadb-operator': Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'mariadb-bin.000001' position 128850327; GTID position '0-10-118185'
2023-12-13 14:50:31 493 [Note] Master 'mariadb-operator': Slave SQL thread exiting, replication stopped in log 'mariadb-bin.000001' at position 128850327; GTID position '0-10-118185', master: mariadb-0.mariadb-internal.analytics.svc.cluster.local:3306
2023-12-13 14:51:42 1865 [Warning] Aborted connection 1865 to db: 'analytics' user: 'root' host: '127.0.0.1' (Got an error reading communication packets)
The second outdated replica has a bit more logs to go on:
2023-12-03 14:09:07 24 [Note] Master 'mariadb-operator': Slave SQL thread initialized, starting replication in log 'FIRST' at position 4, relay log './[email protected]' position: 4; GTID position '0-10-9'
2023-12-03 14:09:07 23 [Note] Master 'mariadb-operator': Slave I/O thread: connected to master 'repl@mariadb-0.mariadb-internal.analytics.svc.cluster.local:3306',replication starts at GTID position '0-10-9'
2023-12-03 14:10:06 24 [ERROR] Master 'mariadb-operator': Slave SQL: Could not execute Write_rows_v1 event on table analytics.events; Duplicate entry '132079351' for key 'PRIMARY', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log mariadb-bin.000001, end_log_pos 2220, Gtid 0-10-10, Internal MariaDB error code: 1062
2023-12-03 14:10:06 24 [Warning] Master 'mariadb-operator': Slave: Duplicate entry '132079351' for key 'PRIMARY' Error_code: 1062
2023-12-03 14:10:06 24 [ERROR] Master 'mariadb-operator': Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'mariadb-bin.000001' position 1681; GTID position '0-10-9'
2023-12-03 14:10:06 24 [Note] Master 'mariadb-operator': Slave SQL thread exiting, replication stopped in log 'mariadb-bin.000001' at position 1681; GTID position '0-10-9', master: mariadb-0.mariadb-internal.analytics.svc.cluster.local:3306
2023-12-10 8:50:53 23 [Note] Master 'mariadb-operator': Slave I/O thread: Failed reading log event, reconnecting to retry, log 'mariadb-bin.000004' at position 13297925; GTID position '0-10-3011382'
2023-12-10 8:50:54 23 [ERROR] Master 'mariadb-operator': Slave I/O: error reconnecting to master 'repl@mariadb-0.mariadb-internal.analytics.svc.cluster.local:3306' - retry-time: 10 maximum-retries: 100000 message: Lost connection to server at 'reading initial communication packet', system error: 0 "Internal error/check (Not system error)", Internal MariaDB error code: 2013
2023-12-10 8:51:37 23 [Note] Master 'mariadb-operator': Slave: connected to master 'repl@mariadb-0.mariadb-internal.analytics.svc.cluster.local:3306',replication resumed in log 'mariadb-bin.000004' at position 13297925
2023-12-10 8:51:40 23 [ERROR] Master 'mariadb-operator': Error reading packet from server: Lost connection to server during query (server_errno=2013)
2023-12-10 8:51:40 23 [Note] Master 'mariadb-operator': Slave I/O thread exiting, read up to log 'mariadb-bin.000004', position 13666032; GTID position 0-10-3011653, master mariadb-0.mariadb-internal.analytics.svc.cluster.local:3306
2023-12-10 8:51:40 117175 [Note] Master connection name: 'mariadb-operator' Master_info_file: '[email protected]' Relay_info_file: '[email protected]'
2023-12-10 8:51:40 117175 [Note] 'CHANGE MASTER TO executed'. Previous state master_host='mariadb-0.mariadb-internal.analytics.svc.cluster.local', master_port='3306', master_log_file='mariadb-bin.000004', master_log_pos='13666032'. New state master_host='mariadb-2.mariadb-internal.analytics.svc.cluster.local', master_port='3306', master_log_file='', master_log_pos='4'.
2023-12-10 8:51:40 117175 [Note] Previous Using_Gtid=Current_Pos. New Using_Gtid=Current_Pos
2023-12-10 8:51:40 117176 [Note] Master 'mariadb-operator': Slave I/O thread: Start semi-sync replication to master 'repl@mariadb-2.mariadb-internal.analytics.svc.cluster.local:3306' in log '' at position 4
2023-12-10 8:51:40 117177 [Note] Master 'mariadb-operator': Slave SQL thread initialized, starting replication in log 'FIRST' at position 4, relay log './[email protected]' position: 4; GTID position ''
2023-12-10 8:51:40 117176 [Note] Master 'mariadb-operator': Slave I/O thread: connected to master 'repl@mariadb-2.mariadb-internal.analytics.svc.cluster.local:3306',replication starts at GTID position ''
2023-12-13 4:04:41 117176 [Note] Master 'mariadb-operator': Slave I/O thread: Failed reading log event, reconnecting to retry, log 'mariadb-bin.000001' at position 1681; GTID position '0-12-3011391'
2023-12-13 4:05:01 117176 [ERROR] Master 'mariadb-operator': Slave I/O: error reconnecting to master 'repl@mariadb-2.mariadb-internal.analytics.svc.cluster.local:3306' - retry-time: 10 maximum-retries: 100000 message: Unknown server host 'mariadb-2.mariadb-internal.analytics.svc.cluster.local' (-3), Internal MariaDB error code: 2005
2023-12-13 4:05:31 117176 [Note] Master 'mariadb-operator': Slave: connected to master 'repl@mariadb-2.mariadb-internal.analytics.svc.cluster.local:3306',replication resumed in log 'mariadb-bin.000001' at position 1681
2023-12-13 4:09:02 117176 [Note] Master 'mariadb-operator': Slave I/O thread: Failed reading log event, reconnecting to retry, log 'mariadb-bin.000001' at position 1681; GTID position '0-12-3011391'
2023-12-13 14:49:07 117176 [Note] Master 'mariadb-operator': Slave: received end packet from server, apparent master shutdown:
2023-12-13 14:49:07 117176 [Note] Master 'mariadb-operator': Slave I/O thread: Failed reading log event, reconnecting to retry, log 'mariadb-bin.000001' at position 1681; GTID position '0-12-3011391'
2023-12-13 14:49:07 117176 [ERROR] Master 'mariadb-operator': Slave I/O: error reconnecting to master 'repl@mariadb-2.mariadb-internal.analytics.svc.cluster.local:3306' - retry-time: 10 maximum-retries: 100000 message: Lost connection to server at 'reading initial communication packet', system error: 0 "Internal error/check (Not system error)", Internal MariaDB error code: 2013
2023-12-13 14:49:17 117177 [Note] Master 'mariadb-operator': Error reading relay log event: slave SQL thread was killed
2023-12-13 14:49:17 117177 [Note] Master 'mariadb-operator': Slave SQL thread exiting, replication stopped in log 'mariadb-bin.000001' at position 1681; GTID position '0-12-3011391', master: mariadb-2.mariadb-internal.analytics.svc.cluster.local:3306
2023-12-13 14:49:17 117176 [Note] Master 'mariadb-operator': Slave I/O thread killed during or after a reconnect done to recover from failed read
2023-12-13 14:49:17 117176 [Note] Master 'mariadb-operator': Slave I/O thread exiting, read up to log 'mariadb-bin.000001', position 1681; GTID position 0-12-3011391, master mariadb-2.mariadb-internal.analytics.svc.cluster.local:3306
2023-12-13 14:49:18 117176 [Note] Master 'mariadb-operator': cannot connect to master to kill slave io_thread's connection
2023-12-13 14:49:18 173310 [Note] Master connection name: 'mariadb-operator' Master_info_file: '[email protected]' Relay_info_file: '[email protected]'
2023-12-13 14:49:18 173310 [Note] 'CHANGE MASTER TO executed'. Previous state master_host='mariadb-2.mariadb-internal.analytics.svc.cluster.local', master_port='3306', master_log_file='mariadb-bin.000001', master_log_pos='1681'. New state master_host='mariadb-0.mariadb-internal.analytics.svc.cluster.local', master_port='3306', master_log_file='', master_log_pos='4'.
2023-12-13 14:49:18 173310 [Note] Previous Using_Gtid=Current_Pos. New Using_Gtid=Current_Pos
2023-12-13 14:49:18 173311 [Note] Master 'mariadb-operator': Slave I/O thread: Start semi-sync replication to master 'repl@mariadb-0.mariadb-internal.analytics.svc.cluster.local:3306' in log '' at position 4
2023-12-13 14:49:18 173312 [Note] Master 'mariadb-operator': Slave SQL thread initialized, starting replication in log 'FIRST' at position 4, relay log './[email protected]' position: 4; GTID position ''
2023-12-13 14:49:18 173311 [Note] Master 'mariadb-operator': Slave I/O thread: connected to master 'repl@mariadb-0.mariadb-internal.analytics.svc.cluster.local:3306',replication starts at GTID position ''
2023-12-13 14:49:18 173312 [ERROR] Master 'mariadb-operator': mariadbd: Can't find record in 'view_user_stats'
2023-12-13 14:49:18 173312 [ERROR] Master 'mariadb-operator': Slave SQL: Could not execute Update_rows_v1 event on table analytics.view_user_stats; Can't find record in 'view_user_stats', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log mariadb-bin.000001, end_log_pos 22844, Gtid 0-10-20, Internal MariaDB error code: 1032
2023-12-13 14:49:18 173312 [Warning] Master 'mariadb-operator': Slave: Can't find record in 'view_user_stats' Error_code: 1032
2023-12-13 14:49:18 173312 [ERROR] Master 'mariadb-operator': Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'mariadb-bin.000001' position 21411; GTID position '0-10-19'
2023-12-13 14:49:18 173312 [Note] Master 'mariadb-operator': Slave SQL thread exiting, replication stopped in log 'mariadb-bin.000001' at position 21411; GTID position '0-10-19', master: mariadb-0.mariadb-internal.analytics.svc.cluster.local:3306
For now I'll purge my replicas and re-add them using this method: https://github.com/mariadb-operator/mariadb-operator/issues/141#issuecomment-1804760231
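In plain MariaDB terms, the rebuild boils down to restoring a physical copy of the primary's data directory on the replica and re-pointing the named connection (a generic sketch, not necessarily the exact steps from the linked comment; the host and the GTID value are illustrative, taken from the logs above):

-- After restoring a physical backup (e.g. mariabackup) of the primary on the replica:
SET GLOBAL gtid_slave_pos = '0-10-30947';  -- GTID recorded at backup time (illustrative)
CHANGE MASTER 'mariadb-operator' TO
  MASTER_HOST = 'mariadb-0.mariadb-internal.analytics.svc.cluster.local',
  MASTER_USER = 'repl',
  -- MASTER_PASSWORD omitted here; supply the repl user's password
  MASTER_USE_GTID = slave_pos;
START SLAVE 'mariadb-operator';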
Thanks for the detailed information!
This most likely means that the replicas are missing some binary log operations from the primary and are therefore unable to apply some further operations.
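One way to confirm this is to compare the GTID position the replica has applied against the binary logs the primary still has (a sketch; 'mariadb-bin.000001' is taken from the logs above):

-- On the primary: which binlog files are still available for replicas to catch up from
SHOW BINARY LOGS;

-- On the replica: the GTID position it has applied and will request on reconnect
SELECT @@gtid_slave_pos;

-- On the primary: the GTID position at the start of the oldest binlog; if the
-- replica's position is older than this, the missing events are gone and the
-- replica has to be rebuilt from a physical backup
SELECT BINLOG_GTID_POS('mariadb-bin.000001', 4);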
We are looking into fixing this situation automatically, starting with support for physical backups:
- https://github.com/mariadb-operator/mariadb-operator/pull/273
The idea is that the operator would detect when the replica Pods are unable to fully replicate and would automatically perform the following steps to bring them back:
- https://github.com/mariadb-operator/mariadb-operator/issues/141#issuecomment-1804760231
The same applies to adding new replicas to the cluster: we would need to set up the new replica from a physical backup of the primary, as some binlog operations may already have expired on the primary, e.g. by configuring expire_logs_days (see the retention check sketched after these links). See:
- https://mariadb.com/kb/en/using-and-maintaining-the-binary-log/
- https://mariadb.com/kb/en/setting-up-a-replica-with-mariabackup/
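For reference, binlog retention on the primary can be checked and adjusted at runtime (a sketch; which variable applies depends on the MariaDB version):

-- How long binary logs are kept before being purged automatically
SHOW VARIABLES LIKE 'expire_logs_days';
SHOW VARIABLES LIKE 'binlog_expire_logs_seconds';  -- MariaDB >= 10.6

-- Binlogs currently on disk; a replica that fell behind the oldest file listed
-- here can no longer catch up from the binlog alone
SHOW BINARY LOGS;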
Related to:
- https://github.com/mariadb-operator/mariadb-operator/issues/141
Hey there!
With the release of v0.0.25, primary failover done by the operator has been superseded by MaxScale: a sophisticated database proxy, router, and load balancer designed specifically for and by MariaDB. It provides more resilient primary failover based on MariaDB internals; I encourage you to take a look at the docs:
https://github.com/mariadb-operator/mariadb-operator/blob/main/docs/MAXSCALE.md
Hello, I have the same problem. I am testing the mariadb-operator (one master and two replicas) with ProxySQL in front. I am familiar with ProxySQL and do not wish to move away from it and switch to MaxScale.
During my tests:
- When I kill the master
- a replica is elected master
- the old master can't connect to the new master
I get the same errors (reconnecting to the master with the repl user). If I kill the same pod again, the result is still the same.
error reconnecting to master 'repl@mariadb-0.mariadb-internal.mariadb.svc.cluster.local:3306' - retry-time: 10 maximum-retries: 100000 message: Can't connect to server on 'mariadb-0.mariadb-internal.mariadb.svc.cluster.local' (111 "Connection refused")
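To see which primary each pod is actually following after the failover, the named connection can be queried on every pod (a sketch; 'mariadb-operator' is the connection name the operator configures):

-- Run on each pod: Master_Host shows which primary this server follows;
-- Slave_IO_Running and Last_IO_Error show why reconnection fails.
SELECT @@hostname;
SHOW SLAVE 'mariadb-operator' STATUS\G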
Thank you in advance for your help
This issue is stale because it has been open 30 days with no activity.
Not stale. This issue is still ongoing.
This issue is stale because it has been open 60 days with no activity.