mysql-operator
Cluster never gets Ready
Hello!
I'm trying to spin up a cluster, but it never gets ready.
I can't identify why. Can you help me?
Some information:
kubectl get MysqlCluster -n transaction
NAME READY REPLICAS AGE
contract-api-cluster False 29m
kubectl describe MysqlCluster contract-api-cluster -n transaction
Name: contract-api-cluster
Namespace: transaction
Labels: <none>
Annotations: kubectl.kubernetes.io/last-applied-configuration:
{"apiVersion":"mysql.presslabs.org/v1alpha1","kind":"MysqlCluster","metadata":{"annotations":{},"name":"contract-api-cluster","namespace":...
mysql.presslabs.org/version: 300
API Version: mysql.presslabs.org/v1alpha1
Kind: MysqlCluster
Metadata:
Creation Timestamp: 2020-03-26T21:13:05Z
Finalizers:
mysql.presslabs.org/registered-in-orchestrator
Generation: 2
Resource Version: 458095091
Self Link: /apis/mysql.presslabs.org/v1alpha1/namespaces/transaction/mysqlclusters/contract-api-cluster
UID: 46692ed0-7564-4e0e-ae3a-9418e3ad6f82
Spec:
Pod Spec:
Resources:
Secret Name: contract-api-db
Volume Spec:
Persistent Volume Claim:
Access Modes:
ReadWriteOnce
Resources:
Requests:
Storage: 1Gi
Storage Class Name: us-east-1a
Status:
Conditions:
Last Transition Time: 2020-03-26T21:13:07Z
Message: cluster is in read only
Reason: ClusterReadOnlyTrue
Status: True
Type: ReadOnly
Last Transition Time: 2020-03-26T21:13:07Z
Message: Node contract-api-cluster-mysql-0.mysql.transaction is part of topology and not replicating
Reason: NotReplicating
Status: False
Type: Ready
Last Transition Time: 2020-03-26T21:13:07Z
Message: no pending ack
Reason: NoPendingFailoverAckExists
Status: False
Type: PendingFailoverAck
Nodes:
Conditions:
Last Transition Time: 2020-03-26T21:15:27Z
Status: Unknown
Type: Lagged
Last Transition Time: 2020-03-26T21:15:27Z
Status: Unknown
Type: Replicating
Last Transition Time: 2020-03-26T21:15:27Z
Status: Unknown
Type: Master
Last Transition Time: 2020-03-26T21:15:27Z
Status: Unknown
Type: ReadOnly
Name: contract-api-cluster-mysql-0.mysql.transaction
Ready Nodes: 1
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal ConfigMapSyncSuccessfull 33m controller.mysqlcluster *v1.ConfigMap transaction/contract-api-cluster-mysql created successfully
Normal OperatedSecretSyncSuccessfull 33m controller.mysqlcluster *v1.Secret transaction/contract-api-cluster-mysql-operated created successfully
Normal MasterSVCSyncSuccessfull 33m controller.mysqlcluster *v1.Service transaction/contract-api-cluster-mysql-master created successfully
Normal HealthySVCSyncSuccessfull 33m controller.mysqlcluster *v1.Service transaction/contract-api-cluster-mysql created successfully
Normal StatefulSetSyncSuccessfull 33m controller.mysqlcluster *v1.StatefulSet transaction/contract-api-cluster-mysql created successfully
Normal StatefulSetSyncSuccessfull 31m (x13 over 33m) controller.mysqlcluster *v1.StatefulSet transaction/contract-api-cluster-mysql updated successfully
kubectl logs mysql-operator-0 -c orchestrator
2020/03/26 20:20:51 [INFO] raft: Restored from snapshot 49-178780-1585251793307
2020/03/26 20:20:51 [INFO] raft: Node at 172.20.8.4:10008 [Follower] entering Follower state (Leader: "")
[martini] Started GET /api/raft-health for 10.160.105.250:58832
[martini] Completed 200 OK in 28.29883ms
2020/03/26 20:20:53 [WARN] raft: Heartbeat timeout from "" reached, starting election
2020/03/26 20:20:53 [INFO] raft: Node at 172.20.8.4:10008 [Candidate] entering Candidate state
2020/03/26 20:20:53 [DEBUG] raft: Votes needed: 1
2020/03/26 20:20:53 [DEBUG] raft: Vote granted from 172.20.8.4:10008. Tally: 1
2020/03/26 20:20:53 [INFO] raft: Election won. Tally: 1
2020/03/26 20:20:53 [INFO] raft: Node at 172.20.8.4:10008 [Leader] entering Leader state
[martini] Started GET /api/raft-follower-health-report/d5a03335/mysql-operator-0/mysql-operator-0-svc for 10.160.105.250:46200
2020-03-26 20:20:53 ERROR Raft health report: unknown token d5a03335
[martini] Completed 500 Internal Server Error in 2.924568ms
2020-03-26 20:20:53 ERROR HttpGetLeader: got 500 status on http://mysql-operator-0-svc:80/api/raft-follower-health-report/d5a03335/mysql-operator-0/mysql-operator-0-svc
2020/03/26 20:20:53 [DEBUG] raft: Node 172.20.8.4:10008 updated peer set (2): [172.20.8.4:10008]
[martini] Started GET /api/raft-follower-health-report/7887f81f/mysql-operator-0/mysql-operator-0-svc for 10.160.105.250:46200
[martini] Completed 200 OK in 1.018483ms
[martini] Started GET /api/raft-health for 10.160.105.250:58968
[martini] Completed 200 OK in 1.525997ms
[martini] Started GET /api/raft-follower-health-report/17394bc5/mysql-operator-0/mysql-operator-0-svc for 10.160.105.250:46200
[martini] Completed 200 OK in 1.687689ms
[martini] Started GET /api/cluster/contract-api-cluster.transaction for 127.0.0.1:45908
2020-03-26 20:21:12 ERROR Unable to determine cluster name. clusterHint=contract-api-cluster.transaction
[martini] Completed 500 Internal Server Error in 6.671548ms
[martini] Started GET /api/discover/contract-api-cluster-mysql-0.mysql.transaction/3306 for 127.0.0.1:45908
2020-03-26 20:21:12 ERROR x509: certificate is valid for MySQL_Server_5.7.26-29_Auto_Generated_Server_Certificate, not contract-api-cluster-mysql-0.mysql.transaction
2020-03-26 20:21:12 ERROR ReadTopologyInstance(contract-api-cluster-mysql-0.mysql.transaction:3306) show variables like 'maxscale%': x509: certificate is valid for MySQL_Server_5.7.26-29_Auto_Generated_Server_Certificate, not contract-api-cluster-mysql-0.mysql.transaction
[martini] Completed 500 Internal Server Error in 95.750111ms
[martini] Started GET /api/audit-recovery/contract-api-cluster.transaction for 127.0.0.1:45908
[martini] Completed 200 OK in 2.705527ms
[martini] Started GET /api/raft-health for 10.160.105.250:59128
[martini] Completed 200 OK in 3.488913ms
[martini] Started GET /api/cluster/contract-api-cluster.transaction for 127.0.0.1:45908
2020-03-26 20:21:17 ERROR Unable to determine cluster name. clusterHint=contract-api-cluster.transaction
[martini] Completed 500 Internal Server Error in 6.019458ms
[martini] Started GET /api/discover/contract-api-cluster-mysql-0.mysql.transaction/3306 for 127.0.0.1:45908
2020-03-26 20:21:17 ERROR x509: certificate is valid for MySQL_Server_5.7.26-29_Auto_Generated_Server_Certificate, not contract-api-cluster-mysql-0.mysql.transaction
[martini] Completed 500 Internal Server Error in 19.872942ms
[martini] Started GET /api/audit-recovery/contract-api-cluster.transaction for 127.0.0.1:45908
[martini] Completed 200 OK in 2.83317ms
[martini] Started GET /api/raft-follower-health-report/1096224b/mysql-operator-0/mysql-operator-0-svc for 10.160.105.250:46200
[martini] Completed 200 OK in 590.396µs
[martini] Started GET /api/cluster/contract-api-cluster.transaction for 127.0.0.1:45908
2020-03-26 20:21:22 ERROR Unable to determine cluster name. clusterHint=contract-api-cluster.transaction
[martini] Completed 500 Internal Server Error in 5.708816ms
[martini] Started GET /api/discover/contract-api-cluster-mysql-0.mysql.transaction/3306 for 127.0.0.1:45908
2020-03-26 20:21:22 ERROR x509: certificate is valid for MySQL_Server_5.7.26-29_Auto_Generated_Server_Certificate, not contract-api-cluster-mysql-0.mysql.transaction
[martini] Completed 500 Internal Server Error in 18.506516ms
[martini] Started GET /api/audit-recovery/contract-api-cluster.transaction for 127.0.0.1:45908
[martini] Completed 200 OK in 1.456536ms
[martini] Started GET /api/raft-health for 10.160.105.250:59298
[martini] Completed 200 OK in 1.268885ms
[martini] Started GET /api/cluster/contract-api-cluster.transaction for 127.0.0.1:45908
2020-03-26 20:21:27 ERROR Unable to determine cluster name. clusterHint=contract-api-cluster.transaction
[martini] Completed 500 Internal Server Error in 8.430831ms
Hi @gnoguchi !
I need more details. Please share the logs from the mysql container:
kubectl -n transaction logs contract-api-cluster-mysql-0 -c mysql
I suspect that node contract-api-cluster-mysql-0.mysql.transaction
is misconfigured; perhaps the orchestrator user was not created.
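For example, a quick way to check whether that user exists (just a sketch; the 'orc%' user-name pattern and the root password variable are assumptions, adjust to your setup):
# hypothetical check: list MySQL accounts whose name starts with 'orc'
kubectl -n transaction exec contract-api-cluster-mysql-0 -c mysql -- \
  mysql -uroot -p"$MYSQL_ROOT_PASSWORD" -e "SELECT user, host FROM mysql.user WHERE user LIKE 'orc%';"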
Thank you for reporting this and helping to debug! Best regards!
Hello @AMecea ! Thanks for your support!
kubectl -n transaction logs contract-api-cluster-mysql-0 -c mysql
2020-03-27T15:34:34.300577Z 0 [Warning] TIMESTAMP with implicit DEFAULT value is deprecated. Please use --explicit_defaults_for_timestamp server option (see documentation for more details).
2020-03-27T15:34:34.303072Z 0 [Note] mysqld (mysqld 5.7.26-29-log) starting as process 1 ...
2020-03-27T15:34:34.309942Z 0 [Note] InnoDB: PUNCH HOLE support available
2020-03-27T15:34:34.309980Z 0 [Note] InnoDB: Mutexes and rw_locks use GCC atomic builtins
2020-03-27T15:34:34.309989Z 0 [Note] InnoDB: Uses event mutexes
2020-03-27T15:34:34.309996Z 0 [Note] InnoDB: GCC builtin __atomic_thread_fence() is used for memory barrier
2020-03-27T15:34:34.310004Z 0 [Note] InnoDB: Compressed tables use zlib 1.2.7
2020-03-27T15:34:34.310011Z 0 [Note] InnoDB: Using Linux native AIO
2020-03-27T15:34:34.310639Z 0 [Note] InnoDB: Number of pools: 1
2020-03-27T15:34:34.310813Z 0 [Note] InnoDB: Using CPU crc32 instructions
2020-03-27T15:34:34.319797Z 0 [Note] InnoDB: Initializing buffer pool, total size = 512M, instances = 1, chunk size = 128M
2020-03-27T15:34:34.344040Z 0 [Note] InnoDB: Completed initialization of buffer pool
2020-03-27T15:34:34.351542Z 0 [Note] InnoDB: If the mysqld execution user is authorized, page cleaner thread priority can be changed. See the man page of setpriority().
2020-03-27T15:34:34.374202Z 0 [Note] InnoDB: Crash recovery did not find the parallel doublewrite buffer at /var/lib/mysql/xb_doublewrite
2020-03-27T15:34:34.379578Z 0 [Note] InnoDB: Highest supported file format is Barracuda.
2020-03-27T15:34:34.423917Z 0 [Note] InnoDB: Created parallel doublewrite buffer at /var/lib/mysql/xb_doublewrite, size 3932160 bytes
2020-03-27T15:34:34.722985Z 0 [Note] InnoDB: Creating shared tablespace for temporary tables
2020-03-27T15:34:34.723110Z 0 [Note] InnoDB: Setting file './ibtmp1' size to 12 MB. Physically writing the file full; Please wait ...
2020-03-27T15:34:34.769249Z 0 [Note] InnoDB: File './ibtmp1' size is now 12 MB.
2020-03-27T15:34:34.770217Z 0 [Note] InnoDB: 96 redo rollback segment(s) found. 96 redo rollback segment(s) are active.
2020-03-27T15:34:34.770228Z 0 [Note] InnoDB: 32 non-redo rollback segment(s) are active.
2020-03-27T15:34:34.771619Z 0 [Note] InnoDB: Percona XtraDB (http://www.percona.com) 5.7.26-29 started; log sequence number 11908284
2020-03-27T15:34:34.771827Z 0 [Note] InnoDB: Loading buffer pool(s) from /var/lib/mysql/ib_buffer_pool
2020-03-27T15:34:34.772126Z 0 [Note] Plugin 'FEDERATED' is disabled.
2020-03-27T15:34:34.812097Z 0 [Note] Found ca.pem, server-cert.pem and server-key.pem in data directory. Trying to enable SSL support using them.
2020-03-27T15:34:34.812121Z 0 [Note] Skipping generation of SSL certificates as certificate files are present in data directory.
2020-03-27T15:34:34.818902Z 0 [Warning] CA certificate ca.pem is self signed.
2020-03-27T15:34:34.818960Z 0 [Note] Skipping generation of RSA key pair as key files are present in data directory.
2020-03-27T15:34:34.824909Z 0 [Note] Server hostname (bind-address): '*'; port: 3306
2020-03-27T15:34:34.825054Z 0 [Note] IPv6 is available.
2020-03-27T15:34:34.825074Z 0 [Note] - '::' resolves to '::';
2020-03-27T15:34:34.825107Z 0 [Note] Server socket created on IP: '::'.
2020-03-27T15:34:34.969725Z 0 [Note] InnoDB: Buffer pool(s) load completed at 200327 15:34:34
2020-03-27T15:34:35.152793Z 0 [Note] Event Scheduler: Loaded 0 events
2020-03-27T15:34:35.152958Z 0 [Note] Execution of init_file '/etc/mysql/conf.d/operator-init.sql' started.
2020-03-27T15:34:35.172012Z 0 [Note] Execution of init_file '/etc/mysql/conf.d/operator-init.sql' ended.
2020-03-27T15:34:35.172156Z 0 [Note] mysqld: ready for connections.
Version: '5.7.26-29-log' socket: '/var/lib/mysql/mysql.sock' port: 3306 Percona Server (GPL), Release 29, Revision 11ad961
Hey guys! Any news?
Thanks!
Hi, I am observing a similar issue: the cluster status is stuck in False. There was a re-election and the original leader is stuck in a not-replicating state. Is there a way to recover from this situation without losing data? I tried rebooting the MySQL pods and the operator pod, but that didn't help. I'd appreciate any help/fix. Below is a dump of the various logs and commands.
Using operator version 0.3.9.
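For reference, the restarts mentioned above were roughly the following (pod names follow the naming used in this report; the operator pod's namespace is an assumption):
# restart the MySQL pods of the cluster
kubectl -n infra delete pod metadata-cluster-0-mysql-0 metadata-cluster-0-mysql-1 metadata-cluster-0-mysql-2
# restart the operator pod
kubectl delete pod mysql-operator-0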
kubectl describe mysql metadata-cluster-0 -n infra
Name: metadata-cluster-0
Namespace: infra
Labels: active=true
name=metadata-cluster
Annotations: kubectl.kubernetes.io/last-applied-configuration:
{"apiVersion":"mysql.presslabs.org/v1alpha1","kind":"MysqlCluster","metadata":{"annotations":{},"labels":{"active":"true","name":"metadata...
mysql.presslabs.org/version: 300
API Version: mysql.presslabs.org/v1alpha1
Kind: MysqlCluster
Spec:
Pod Spec:
Resources:
Replicas: 3
Secret Name: mysql-secret
Volume Spec:
Persistent Volume Claim:
Access Modes:
ReadWriteOnce
Resources:
Requests:
Storage: 10Gi
Status:
Conditions:
Last Transition Time: 2020-04-17T09:59:51Z
Message: cluster is writable
Reason: ClusterReadOnlyFalse
Status: False
Type: ReadOnly
Last Transition Time: 2020-04-16T18:47:04Z
Message: Node metadata-cluster-0-mysql-0.mysql.infra is part of topology and not replicating
Reason: NotReplicating
Status: False
Type: Ready
Last Transition Time: 2020-04-17T05:45:50Z
Message: no pending ack
Reason: NoPendingFailoverAckExists
Status: False
Type: PendingFailoverAck
Last Transition Time: 2020-04-17T05:36:30Z
Message: Master is healthy in orchestrator
Reason: ClusterMasterHealthy
Status: False
Type: FailoverInProgress
Nodes:
Conditions:
Last Transition Time: 2020-04-17T05:36:35Z
Status: Unknown
Type: Lagged
Last Transition Time: 2020-04-17T05:35:45Z
Status: False
Type: Replicating
Last Transition Time: 2020-04-17T05:35:45Z
Status: False
Type: Master
Last Transition Time: 2020-04-16T19:13:19Z
Status: True
Type: ReadOnly
Name: metadata-cluster-0-mysql-0.mysql.infra
Logs from MySQL Operator
{"level":"info","ts":1587116536.3045983,"logger":"controller.mysqlcluster","msg":"cluster status","status":{"readyNodes":3,"conditions":[{"type":"ReadOnly","status":"True","lastTransitionTime":"2020-04-17T09:42:16Z","reason":"ClusterReadOnlyTrue","message":"cluster is in read only"},{"type":"Ready","status":"False","lastTransitionTime":"2020-04-16T18:47:04Z","reason":"NotReplicating","message":"Node metadata-cluster-0-mysql-0.mysql.infra is part of topology and not replicating"},
Logs from Leader pod
2020-04-17T05:35:58.208160Z 42 [ERROR] Slave I/O for channel '': error connecting to master '[email protected]:3306' - retry-time: 10 retries: 1, Error_code: 2005
2020-04-17T05:36:08.610914Z 42 [ERROR] Slave I/O for channel '': error connecting to master '[email protected]:3306' - retry-time: 10 retries: 2, Error_code: 2005
2020-04-17T05:36:19.013254Z 42 [ERROR] Slave I/O for channel '': error connecting to master '[email protected]:3306' - retry-time: 10 retries: 3, Error_code: 2005
2020-04-17T05:36:29.348391Z 42 [Note] Slave I/O thread for channel '': connected to master '[email protected]:3306', replication started in log 'FIRST' at position 4
2020-04-17T05:36:30.046037Z 43 [ERROR] Slave SQL for channel '': Error executing row event: 'Table 'sys_operator.heartbeat' doesn't exist', Error_code: 1146
2020-04-17T05:36:30.046060Z 43 [Warning] Slave: Table 'sys_operator.heartbeat' doesn't exist Error_code: 1146
2020-04-17T05:36:30.046065Z 43 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'mysql-bin.000001' position 154.
MySQL dump - Heartbeat table is missing.
mysql> show databases;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| mysql              |
| performance_schema |
| sys                |
| sys_operator       |
+--------------------+
5 rows in set (0.01 sec)
mysql> use sys_operator
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
mysql> show tables;
+------------------------+
| Tables_in_sys_operator |
+------------------------+
| status                 |
+------------------------+
1 row in set (0.00 sec)
mysql>
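A sketch of how the broken replication can be inspected from inside the stuck pod (assuming the cluster secret exposes a ROOT_PASSWORD key, as in the operator's examples; adjust names to your setup):
# read the root password from the cluster secret (key name is an assumption)
ROOT_PW=$(kubectl -n infra get secret mysql-secret -o jsonpath='{.data.ROOT_PASSWORD}' | base64 -d)
# check the replication threads and the last SQL error on the node that is not replicating
kubectl -n infra exec metadata-cluster-0-mysql-0 -c mysql -- \
  mysql -uroot -p"$ROOT_PW" -e "SHOW SLAVE STATUS\G" | grep -E 'Slave_IO_Running|Slave_SQL_Running|Last_SQL_Error'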
[martini] Started GET /api/discover/contract-api-cluster-mysql-0.mysql.transaction/3306 for 127.0.0.1:45908
2020-03-26 20:21:12 ERROR x509: certificate is valid for MySQL_Server_5.7.26-29_Auto_Generated_Server_Certificate, not contract-api-cluster-mysql-0.mysql.transaction
Do you notice this error in the log? And in the mysql error log:
[Note] Bad handshake
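One way to confirm which name the server certificate was actually issued for (a sketch; it assumes openssl is available in the mysql container and uses the data-directory certificate path shown in the log above):
# print the subject of the certificate MySQL is serving
kubectl -n transaction exec contract-api-cluster-mysql-0 -c mysql -- \
  openssl x509 -in /var/lib/mysql/server-cert.pem -noout -subject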
Hi, can you try with the freshly released v0.5.0?
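If the operator was installed from the presslabs Helm chart, the upgrade might look roughly like this (release name, namespace, and chart version are assumptions; check the chart repo for the version matching v0.5.0):
helm repo add presslabs https://presslabs.github.io/charts
helm repo update
# release name and namespace are placeholders for your own install
helm upgrade mysql-operator presslabs/mysql-operator --namespace default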
It was quite scary because I thought my cluster backup was corrupt, but that was not the case. In my case, the first node had everything ready except the mysql pod, which reported Ready as false even though the container itself was ready.
I scaled my cluster's StatefulSet to zero, which triggered the cluster to terminate the first node and recreate it with the existing PVC and the other resources created by the previous run.
The second time, it started the first and second nodes without any problem.
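For anyone landing here in the same state, a rough sketch of those recovery steps (the StatefulSet name follows the operator's <cluster>-mysql convention; <namespace>, <cluster>, and the replica count are placeholders):
# scale the cluster's StatefulSet down so the stuck first node is terminated
kubectl -n <namespace> scale statefulset <cluster>-mysql --replicas=0
# then scale back up (or let the operator reconcile); the existing PVCs are reused
kubectl -n <namespace> scale statefulset <cluster>-mysql --replicas=2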