
[Bug] MariaDB fails to restart after the whole cluster rebooted

Open outductor opened this issue 2 years ago • 15 comments

Describe the bug

I get the following error after all of the nodes hosting MariaDB instances were rebooted:

mariadb 2023-10-27  9:13:02 30477 [ERROR] Master 'mariadb-operator': Error reading packet from server: Lost connection to server during query (server_errno=2013)
mariadb 2023-10-27  9:13:02 30477 [Note] Master 'mariadb-operator': Slave I/O thread exiting, read up to log 'mariadb-bin.000001', position 912848; GTID position 0-12-61, master mariadb-2.mariadb-internal.seichi-minecraft.svc.cluster.local:3306
mariadb 2023-10-27  9:13:02 30484 [Note] Master 'mariadb-operator': Slave I/O thread: Start semi-sync replication to master 'repl@mariadb-2.mariadb-internal.seichi-minecraft.svc.cluster.local:3306' in log 'mariadb-bin.000001' at position 1680
mariadb 2023-10-27  9:13:02 30485 [Note] Master 'mariadb-operator': Slave SQL thread initialized, starting replication in log 'mariadb-bin.000001' at position 1680, relay log './[email protected]' position: 4; GTID position ''
mariadb 2023-10-27  9:13:02 30484 [Note] Master 'mariadb-operator': Slave I/O thread: connected to master 'repl@mariadb-2.mariadb-internal.seichi-minecraft.svc.cluster.local:3306',replication starts at GTID position ''
mariadb 2023-10-27  9:13:03 30485 [ERROR] Master 'mariadb-operator': Slave SQL: Could not execute Write_rows_v1 event on table seichi-timed-stats-conifers.break_count_diff_point; Duplicate entry '2099' for key 'PRIMARY', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log mariadb-bin.000001, end_log_pos 2067, Gtid 0-12-10, Internal MariaDB error code: 1062
mariadb 2023-10-27  9:13:03 30485 [Warning] Master 'mariadb-operator': Slave: Duplicate entry '2099' for key 'PRIMARY' Error_code: 1062
mariadb 2023-10-27  9:13:03 30485 [ERROR] Master 'mariadb-operator': Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'mariadb-bin.000001' position 1680; GTID position '0-12-9'
mariadb 2023-10-27  9:13:03 30485 [Note] Master 'mariadb-operator': Slave SQL thread exiting, replication stopped in log 'mariadb-bin.000001' at position 1680; GTID position '0-12-9', master: mariadb-2.mariadb-internal.seichi-minecraft.svc.cluster.local:3306

apiVersion: mariadb.mmontes.io/v1alpha1
kind: MariaDB
metadata:
  namespace: seichi-minecraft
  name: mariadb
spec:
  rootPasswordSecretKeyRef:
    name: mariadb
    key: root-password

  database: seichiassist
  username: mcserver
  passwordSecretKeyRef:
    name: mariadb
    key: mcserver-password

  image:
    repository: mariadb
    tag: "10.11.4"
    pullPolicy: IfNotPresent

  port: 3306

  replicas: 3

  replication:
    enabled: true
    primary:
      podIndex: 0
      automaticFailover: true
    replica:
      waitPoint: AfterSync
      connectionTimeout: 10s
      connectionRetries: 10
      syncTimeout: 10s

  primaryService:
    type: ClusterIP
  primaryConnection:
    secretName: primary-mariadb-conn
    secretTemplate:
      key: dsn

  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - topologyKey: "kubernetes.io/hostname"

  tolerations:
    - key: "mariadb.mmontes.io/ha"
      operator: "Exists"
      effect: "NoSchedule"

  podDisruptionBudget:
    minAvailable: "66%"

  myCnf: |
    [mysqld]
    bind-address=0.0.0.0
    default_storage_engine=InnoDB
    binlog_format=row
    innodb_autoinc_lock_mode=2
    max_allowed_packet=256M

  volumeClaimTemplate:
    resources:
      requests:
        storage: 100Gi
    storageClassName: synology-iscsi-storage
    accessModes:
      - ReadWriteOnce

  livenessProbe:
    exec:
      command:
        - bash
        - -c
        - mysql -u root -p"${MARIADB_ROOT_PASSWORD}" -e "SELECT 1;"
    initialDelaySeconds: 20
    periodSeconds: 10
    timeoutSeconds: 5

  readinessProbe:
    exec:
      command:
        - bash
        - -c
        - mysql -u root -p"${MARIADB_ROOT_PASSWORD}" -e "SELECT 1;"
    initialDelaySeconds: 20
    periodSeconds: 10
    timeoutSeconds: 5

  metrics:
    exporter:
      image:
        repository: prom/mysqld-exporter
        tag: "v0.14.0"
        pullPolicy: IfNotPresent
      resources:
        requests:
          cpu: 300m
          memory: 512Mi
        limits:
          cpu: 300m
          memory: 512Mi
    serviceMonitor:
      prometheusRelease: prometheus
      interval: 10s
      scrapeTimeout: 10s

Is the bug related to Galera?

Expected behaviour

Steps to reproduce the bug

Additional context

Environment details:

  • Kubernetes version: 1.27.6
  • mariadb-operator version: 0.20.0
  • Install method: helm
  • Install flavour: minimal

outductor · Oct 27 '23

This issue is stale because it has been open 30 days with no activity.

github-actions[bot] · Nov 27 '23

This issue was closed because it has been stalled for 5 days with no activity.

github-actions[bot] · Dec 03 '23

@mmontes11 can you reopen this? :(

outductor · Dec 03 '23

I might have run into the same issue. I'm not sure whether the whole cluster rebooted, but two replicas are out of sync now. This is currently the case in both MariaDB clusters I run on the same Kubernetes cluster.

I had to delete one of the replica pods today as it was promoted to master by the operator, which brought the issue to light. When I deleted the pod, the most up-to-date replica became master again, restoring the correct state of the cluster. When the deleted pod came back online, the logs showed:

2023-12-13 14:49:21 0 [Note] InnoDB: Compressed tables use zlib 1.2.11
2023-12-13 14:49:21 0 [Note] InnoDB: Number of pools: 1
2023-12-13 14:49:21 0 [Note] InnoDB: Using crc32 + pclmulqdq instructions
2023-12-13 14:49:21 0 [Note] mariadbd: O_TMPFILE is not supported on /tmp (disabling future attempts)
2023-12-13 14:49:21 0 [Note] InnoDB: Using Linux native AIO
2023-12-13 14:49:21 0 [Note] InnoDB: Initializing buffer pool, total size = 8589934592, chunk size = 134217728
2023-12-13 14:49:21 0 [Note] InnoDB: Completed initialization of buffer pool
2023-12-13 14:49:21 0 [Note] InnoDB: 128 rollback segments are active.
2023-12-13 14:49:21 0 [Note] InnoDB: Creating shared tablespace for temporary tables
2023-12-13 14:49:21 0 [Note] InnoDB: Setting file './ibtmp1' size to 12 MB. Physically writing the file full; Please wait ...
2023-12-13 14:49:21 0 [Note] InnoDB: File './ibtmp1' size is now 12 MB.
2023-12-13 14:49:21 0 [Note] InnoDB: 10.6.13 started; log sequence number 16155782456; transaction id 13287581
2023-12-13 14:49:21 0 [Note] InnoDB: Loading buffer pool(s) from /var/lib/mysql/ib_buffer_pool
2023-12-13 14:49:21 0 [Note] Plugin 'FEEDBACK' is disabled.
2023-12-13 14:49:21 0 [Warning] 'innodb-buffer-pool-instances' was removed. It does nothing now and exists only for compatibility with old my.cnf files.
2023-12-13 14:49:21 0 [Note] Server socket created on IP: '0.0.0.0'.
2023-12-13 14:49:21 0 [Note] Server socket created on IP: '::'.
2023-12-13 14:49:21 3 [Note] Reading Master_info: '[email protected]'  Relay_info:'[email protected]'
2023-12-13 14:49:21 3 [Note] Initialized Master_info from '[email protected]'
2023-12-13 14:49:21 4 [Note] Master 'mariadb-operator': Slave I/O thread: Start asynchronous replication to master '[email protected]:3306' in log 'mariadb-bin.000004' at position 13297925
2023-12-13 14:49:21 3 [Note] Started replication for 'mariadb-operator'
2023-12-13 14:49:21 5 [Note] Master 'mariadb-operator': Slave SQL thread initialized, starting replication in log 'mariadb-bin.000004' at position 13297925, relay log './[email protected]' position: 4; GTID position ''
2023-12-13 14:49:21 0 [Note] mariadbd: ready for connections.
Version: '10.6.13-MariaDB-1:10.6.13+maria~ubu2004-log'  socket: '/run/mysqld/mysqld.sock'  port: 3306  mariadb.org binary distribution
2023-12-13 14:49:21 4 [Note] Master 'mariadb-operator': Slave I/O thread: connected to master '[email protected]:3306',replication starts at GTID position ''
2023-12-13 14:49:32 0 [Note] InnoDB: Buffer pool(s) load completed at 231213 14:49:32
2023-12-13 14:49:49 5 [Note] Master 'mariadb-operator': Slave SQL thread exiting, replication stopped in log 'mariadb-bin.000001' at position 33348783; GTID position '0-10-30947', master: mariadb-0.mariadb-internal.analytics.svc.cluster.local:3306
2023-12-13 14:49:49 4 [Note] Master 'mariadb-operator': Slave I/O thread exiting, read up to log 'mariadb-bin.000002', position 374705570; GTID position 0-10-1355222, master mariadb-0.mariadb-internal.analytics.svc.cluster.local:3306
2023-12-13 14:49:49 491 [Note] Master connection name: 'mariadb-operator'  Master_info_file: '[email protected]'  Relay_info_file: '[email protected]'
2023-12-13 14:49:49 491 [Note] 'CHANGE MASTER TO executed'. Previous state master_host='mariadb-0.mariadb-internal.analytics.svc.cluster.local', master_port='3306', master_log_file='mariadb-bin.000002', master_log_pos='374705570'. New state master_host='mariadb-0.mariadb-internal.analytics.svc.cluster.local', master_port='3306', master_log_file='', master_log_pos='4'.
2023-12-13 14:49:49 491 [Note] Previous Using_Gtid=Current_Pos. New Using_Gtid=Current_Pos
2023-12-13 14:49:49 492 [Note] Master 'mariadb-operator': Slave I/O thread: Start semi-sync replication to master '[email protected]:3306' in log '' at position 4
2023-12-13 14:49:49 493 [Note] Master 'mariadb-operator': Slave SQL thread initialized, starting replication in log 'FIRST' at position 4, relay log './[email protected]' position: 4; GTID position '0-10-30947'
2023-12-13 14:49:49 492 [Note] Master 'mariadb-operator': Slave I/O thread: connected to master '[email protected]:3306',replication starts at GTID position '0-10-30947'
2023-12-13 14:50:31 493 [ERROR] Master 'mariadb-operator': mariadbd: Can't find record in 'view_track_stats'
2023-12-13 14:50:31 493 [ERROR] Master 'mariadb-operator': Slave SQL: Could not execute Update_rows_v1 event on table analytics.view_track_stats; Can't find record in 'view_track_stats', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log mariadb-bin.000001, end_log_pos 128851519, Gtid 0-10-118186, Internal MariaDB error code: 1032
2023-12-13 14:50:31 493 [Warning] Master 'mariadb-operator': Slave: Can't find record in 'view_track_stats' Error_code: 1032
2023-12-13 14:50:31 493 [ERROR] Master 'mariadb-operator': Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'mariadb-bin.000001' position 128850327; GTID position '0-10-118185'
2023-12-13 14:50:31 493 [Note] Master 'mariadb-operator': Slave SQL thread exiting, replication stopped in log 'mariadb-bin.000001' at position 128850327; GTID position '0-10-118185', master: mariadb-0.mariadb-internal.analytics.svc.cluster.local:3306
2023-12-13 14:51:42 1865 [Warning] Aborted connection 1865 to db: 'analytics' user: 'root' host: '127.0.0.1' (Got an error reading communication packets)

The second outdated replica has a bit more logs to go on:

2023-12-03 14:09:07 24 [Note] Master 'mariadb-operator': Slave SQL thread initialized, starting replication in log 'FIRST' at position 4, relay log './[email protected]' position: 4; GTID position '0-10-9'
2023-12-03 14:09:07 23 [Note] Master 'mariadb-operator': Slave I/O thread: connected to master '[email protected]:3306',replication starts at GTID position '0-10-9'
2023-12-03 14:10:06 24 [ERROR] Master 'mariadb-operator': Slave SQL: Could not execute Write_rows_v1 event on table analytics.events; Duplicate entry '132079351' for key 'PRIMARY', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log mariadb-bin.000001, end_log_pos 2220, Gtid 0-10-10, Internal MariaDB error code: 1062
2023-12-03 14:10:06 24 [Warning] Master 'mariadb-operator': Slave: Duplicate entry '132079351' for key 'PRIMARY' Error_code: 1062
2023-12-03 14:10:06 24 [ERROR] Master 'mariadb-operator': Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'mariadb-bin.000001' position 1681; GTID position '0-10-9'
2023-12-03 14:10:06 24 [Note] Master 'mariadb-operator': Slave SQL thread exiting, replication stopped in log 'mariadb-bin.000001' at position 1681; GTID position '0-10-9', master: mariadb-0.mariadb-internal.analytics.svc.cluster.local:3306
2023-12-10  8:50:53 23 [Note] Master 'mariadb-operator': Slave I/O thread: Failed reading log event, reconnecting to retry, log 'mariadb-bin.000004' at position 13297925; GTID position '0-10-3011382'
2023-12-10  8:50:54 23 [ERROR] Master 'mariadb-operator': Slave I/O: error reconnecting to master '[email protected]:3306' - retry-time: 10  maximum-retries: 100000  message: Lost connection to server at 'reading initial communication packet', system error: 0 "Internal error/check (Not system error)", Internal MariaDB error code: 2013
2023-12-10  8:51:37 23 [Note] Master 'mariadb-operator': Slave: connected to master '[email protected]:3306',replication resumed in log 'mariadb-bin.000004' at position 13297925
2023-12-10  8:51:40 23 [ERROR] Master 'mariadb-operator': Error reading packet from server: Lost connection to server during query (server_errno=2013)
2023-12-10  8:51:40 23 [Note] Master 'mariadb-operator': Slave I/O thread exiting, read up to log 'mariadb-bin.000004', position 13666032; GTID position 0-10-3011653, master mariadb-0.mariadb-internal.analytics.svc.cluster.local:3306
2023-12-10  8:51:40 117175 [Note] Master connection name: 'mariadb-operator'  Master_info_file: '[email protected]'  Relay_info_file: '[email protected]'
2023-12-10  8:51:40 117175 [Note] 'CHANGE MASTER TO executed'. Previous state master_host='mariadb-0.mariadb-internal.analytics.svc.cluster.local', master_port='3306', master_log_file='mariadb-bin.000004', master_log_pos='13666032'. New state master_host='mariadb-2.mariadb-internal.analytics.svc.cluster.local', master_port='3306', master_log_file='', master_log_pos='4'.
2023-12-10  8:51:40 117175 [Note] Previous Using_Gtid=Current_Pos. New Using_Gtid=Current_Pos
2023-12-10  8:51:40 117176 [Note] Master 'mariadb-operator': Slave I/O thread: Start semi-sync replication to master '[email protected]:3306' in log '' at position 4
2023-12-10  8:51:40 117177 [Note] Master 'mariadb-operator': Slave SQL thread initialized, starting replication in log 'FIRST' at position 4, relay log './[email protected]' position: 4; GTID position ''
2023-12-10  8:51:40 117176 [Note] Master 'mariadb-operator': Slave I/O thread: connected to master '[email protected]:3306',replication starts at GTID position ''
2023-12-13  4:04:41 117176 [Note] Master 'mariadb-operator': Slave I/O thread: Failed reading log event, reconnecting to retry, log 'mariadb-bin.000001' at position 1681; GTID position '0-12-3011391'
2023-12-13  4:05:01 117176 [ERROR] Master 'mariadb-operator': Slave I/O: error reconnecting to master '[email protected]:3306' - retry-time: 10  maximum-retries: 100000  message: Unknown server host 'mariadb-2.mariadb-internal.analytics.svc.cluster.local' (-3), Internal MariaDB error code: 2005
2023-12-13  4:05:31 117176 [Note] Master 'mariadb-operator': Slave: connected to master '[email protected]:3306',replication resumed in log 'mariadb-bin.000001' at position 1681
2023-12-13  4:09:02 117176 [Note] Master 'mariadb-operator': Slave I/O thread: Failed reading log event, reconnecting to retry, log 'mariadb-bin.000001' at position 1681; GTID position '0-12-3011391'
2023-12-13 14:49:07 117176 [Note] Master 'mariadb-operator': Slave: received end packet from server, apparent master shutdown: 
2023-12-13 14:49:07 117176 [Note] Master 'mariadb-operator': Slave I/O thread: Failed reading log event, reconnecting to retry, log 'mariadb-bin.000001' at position 1681; GTID position '0-12-3011391'
2023-12-13 14:49:07 117176 [ERROR] Master 'mariadb-operator': Slave I/O: error reconnecting to master '[email protected]:3306' - retry-time: 10  maximum-retries: 100000  message: Lost connection to server at 'reading initial communication packet', system error: 0 "Internal error/check (Not system error)", Internal MariaDB error code: 2013
2023-12-13 14:49:17 117177 [Note] Master 'mariadb-operator': Error reading relay log event: slave SQL thread was killed
2023-12-13 14:49:17 117177 [Note] Master 'mariadb-operator': Slave SQL thread exiting, replication stopped in log 'mariadb-bin.000001' at position 1681; GTID position '0-12-3011391', master: mariadb-2.mariadb-internal.analytics.svc.cluster.local:3306
2023-12-13 14:49:17 117176 [Note] Master 'mariadb-operator': Slave I/O thread killed during or after a reconnect done to recover from failed read
2023-12-13 14:49:17 117176 [Note] Master 'mariadb-operator': Slave I/O thread exiting, read up to log 'mariadb-bin.000001', position 1681; GTID position 0-12-3011391, master mariadb-2.mariadb-internal.analytics.svc.cluster.local:3306
2023-12-13 14:49:18 117176 [Note] Master 'mariadb-operator': cannot connect to master to kill slave io_thread's connection
2023-12-13 14:49:18 173310 [Note] Master connection name: 'mariadb-operator'  Master_info_file: '[email protected]'  Relay_info_file: '[email protected]'
2023-12-13 14:49:18 173310 [Note] 'CHANGE MASTER TO executed'. Previous state master_host='mariadb-2.mariadb-internal.analytics.svc.cluster.local', master_port='3306', master_log_file='mariadb-bin.000001', master_log_pos='1681'. New state master_host='mariadb-0.mariadb-internal.analytics.svc.cluster.local', master_port='3306', master_log_file='', master_log_pos='4'.
2023-12-13 14:49:18 173310 [Note] Previous Using_Gtid=Current_Pos. New Using_Gtid=Current_Pos
2023-12-13 14:49:18 173311 [Note] Master 'mariadb-operator': Slave I/O thread: Start semi-sync replication to master '[email protected]:3306' in log '' at position 4
2023-12-13 14:49:18 173312 [Note] Master 'mariadb-operator': Slave SQL thread initialized, starting replication in log 'FIRST' at position 4, relay log './[email protected]' position: 4; GTID position ''
2023-12-13 14:49:18 173311 [Note] Master 'mariadb-operator': Slave I/O thread: connected to master '[email protected]:3306',replication starts at GTID position ''
2023-12-13 14:49:18 173312 [ERROR] Master 'mariadb-operator': mariadbd: Can't find record in 'view_user_stats'
2023-12-13 14:49:18 173312 [ERROR] Master 'mariadb-operator': Slave SQL: Could not execute Update_rows_v1 event on table analytics.view_user_stats; Can't find record in 'view_user_stats', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log mariadb-bin.000001, end_log_pos 22844, Gtid 0-10-20, Internal MariaDB error code: 1032
2023-12-13 14:49:18 173312 [Warning] Master 'mariadb-operator': Slave: Can't find record in 'view_user_stats' Error_code: 1032
2023-12-13 14:49:18 173312 [ERROR] Master 'mariadb-operator': Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'mariadb-bin.000001' position 21411; GTID position '0-10-19'
2023-12-13 14:49:18 173312 [Note] Master 'mariadb-operator': Slave SQL thread exiting, replication stopped in log 'mariadb-bin.000001' at position 21411; GTID position '0-10-19', master: mariadb-0.mariadb-internal.analytics.svc.cluster.local:3306

For now I'll purge my replicas and re-add them using this method: https://github.com/mariadb-operator/mariadb-operator/issues/141#issuecomment-1804760231

boedy · Dec 13 '23

Thanks for the detailed information!

This most likely means that the replicas are missing some binary log operations from the primary and are therefore unable to apply further operations.
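
A quick way to confirm this kind of divergence from inside the Pods (a sketch using standard MariaDB GTID variables; 'mariadb-operator' is the replication connection name from the logs above, and the client invocation mirrors the probes in the manifest):

# On the primary: the latest GTID written to its binary log
mysql -u root -p"${MARIADB_ROOT_PASSWORD}" -e "SELECT @@gtid_binlog_pos;"

# On each replica: what has been applied so far, and why the SQL thread stopped
mysql -u root -p"${MARIADB_ROOT_PASSWORD}" -e "SELECT @@gtid_slave_pos, @@gtid_current_pos;"
mysql -u root -p"${MARIADB_ROOT_PASSWORD}" -e "SHOW SLAVE 'mariadb-operator' STATUS\G" \
  | grep -E 'Slave_(IO|SQL)_Running|Gtid_IO_Pos|Last_SQL_Error'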

We are looking into automatically fixing this situation, with support for physical backups as the first step:

  • https://github.com/mariadb-operator/mariadb-operator/pull/273

The idea is that the operator would detect when the replica Pods are unable to fully replicate and would automatically perform the following steps to bring them back:

  • https://github.com/mariadb-operator/mariadb-operator/issues/141#issuecomment-1804760231

The same applies to adding new replicas to the cluster: we would need to set up the new replica from a physical backup of the primary, as some binary log events may have already expired on the primary, e.g. when expire_logs_days is configured. See the links below (a rough sketch of the manual procedure follows them):

  • https://mariadb.com/kb/en/using-and-maintaining-the-binary-log/
  • https://mariadb.com/kb/en/setting-up-a-replica-with-mariabackup/
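
For reference, the manual procedure from the mariabackup article boils down to something like the sketch below. It is not an official procedure of this project, and normally the operator would drive it; hosts, paths, credentials and the GTID value are placeholders taken from the logs above:

# 1. Take and prepare a physical backup on the current primary
mariabackup --backup --target-dir=/backup --user=root --password="${MARIADB_ROOT_PASSWORD}"
mariabackup --prepare --target-dir=/backup

# 2. Move the backup to the broken replica, restore it into an empty datadir,
#    and note the GTID recorded in /backup/xtrabackup_binlog_info
mariabackup --copy-back --target-dir=/backup
chown -R mysql:mysql /var/lib/mysql

# 3. On the restored replica, point replication at the primary from that GTID
mysql -u root -p"${MARIADB_ROOT_PASSWORD}" <<'SQL'
STOP ALL SLAVES;
SET GLOBAL gtid_slave_pos = '0-10-118185';  -- placeholder: value from xtrabackup_binlog_info
CHANGE MASTER 'mariadb-operator' TO
  MASTER_HOST='mariadb-0.mariadb-internal.analytics.svc.cluster.local',
  MASTER_PORT=3306,
  MASTER_USER='repl',
  MASTER_PASSWORD='<repl-password>',
  MASTER_USE_GTID=slave_pos;
START SLAVE 'mariadb-operator';
SQL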

Related to:

  • https://github.com/mariadb-operator/mariadb-operator/issues/141

mmontes11 · Dec 16 '23

This issue is stale because it has been open 30 days with no activity.

github-actions[bot] · Jan 16 '24

Hey there!

With the release of v0.0.25, primary failover done by the operator has been superseded by MaxScale: a sophisticated database proxy, router, and load balancer designed specifically for and by MariaDB. It provides more resilient primary failover based on MariaDB internals. I encourage you to take a look at the docs:

https://github.com/mariadb-operator/mariadb-operator/blob/main/docs/MAXSCALE.md

mmontes11 · Feb 11 '24

Hello, I have the same problem. I am testing the mariadb-operator (one master and 2 replicas) with ProxySQL in front. I am familiar with ProxySQL and do not wish to move away from it and switch to MaxScale.

During my tests:

  • When I kill the master
  • a replica is elected master
  • the old master can't connect to the new master

I see the same errors (reconnecting to the master with the repl user). If I kill the same pod again, the result is still the same.

error reconnecting to master '[email protected]:3306' - retry-time: 10  maximum-retries: 100000  message: Can't connect to server on 'mariadb-0.mariadb-internal.mariadb.svc.cluster.local' (111 "Connection refused")
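
For anyone triaging this, a couple of checks from inside the failing pod (a sketch; the hostname is the one from the error above and 'mariadb-operator' is the connection name used by the operator):

# Which master is this node currently configured against, and why is the I/O thread failing?
mysql -u root -p"${MARIADB_ROOT_PASSWORD}" -e "SHOW SLAVE 'mariadb-operator' STATUS\G" \
  | grep -E 'Master_Host|Slave_IO_Running|Last_IO_Error'

# Is that master actually reachable on port 3306 from this pod?
mysql -h mariadb-0.mariadb-internal.mariadb.svc.cluster.local -P 3306 -u root -p -e "SELECT 1;"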

Thank you in advance for your help

marc-cenon · Feb 26 '24

This issue is stale because it has been open 30 days with no activity.

github-actions[bot] · Mar 28 '24

Not stale. This is still an issue.

kvaps · Mar 28 '24

This issue is stale because it has been open 30 days with no activity.

github-actions[bot] · Apr 28 '24

This issue is stale because it has been open 30 days with no activity.

github-actions[bot] · May 30 '24

This issue is stale because it has been open 60 days with no activity.

github-actions[bot] · Jul 30 '24