
All MySQL clusters down. orchestrator: ERROR database disk image is malformed

haslersn opened this issue 4 years ago • 2 comments

We had been running the MySQL operator for a few months. We run a single operator Pod, i.e. the orchestrator is not HA.

Now, suddenly, all of our MySQL clusters have no master anymore. All replicas are labelled with role=replica. The orchestrator shows the following log; this specific log was taken right after a fresh operator Pod restart:

presslabs-mysql-operator-0 orchestrator 2021-04-27 19:57:32 ERROR NewResolveInstanceKey: Empty hostname
presslabs-mysql-operator-0 orchestrator 2021/04/27 19:57:33 [INFO] raft: Restored from snapshot 69-924273-1619552283145
presslabs-mysql-operator-0 orchestrator 2021/04/27 19:57:33 [INFO] raft: Node at 10.96.56.68:10008 [Follower] entering Follower state (Leader: "")
presslabs-mysql-operator-0 orchestrator 2021-04-27 19:57:33 ERROR NewResolveInstanceKey: Empty hostname
presslabs-mysql-operator-0 orchestrator 2021-04-27 19:57:33 ERROR NewResolveInstanceKey: Empty hostname
presslabs-mysql-operator-0 orchestrator 2021-04-27 19:57:33 ERROR dial tcp: lookup engelsystem-mysql-2.mysql.helfer-test on 10.96.0.10:53: no such host
presslabs-mysql-operator-0 orchestrator 2021-04-27 19:57:33 ERROR ReadTopologyInstance(engelsystem-mysql-2.mysql.helfer-test:3306) show variables like 'maxscale%': QueryRowsMap unexpected error: runtime error: invalid memory address or nil pointer dereference
presslabs-mysql-operator-0 orchestrator 2021-04-27 19:57:33 ERROR database disk image is malformed
presslabs-mysql-operator-0 orchestrator 2021-04-27 19:57:33 ERROR database disk image is malformed
presslabs-mysql-operator-0 orchestrator 2021-04-27 19:57:33 ERROR ReadTopologyInstance(test-mysql-0.mysql.sven-test:3306) ReplicationLagQuery: Error 1146: Table 'sys_operator.heartbeat' doesn't exist
presslabs-mysql-operator-0 orchestrator 2021-04-27 19:57:34 ERROR NewResolveInstanceKey: Empty hostname
presslabs-mysql-operator-0 orchestrator 2021-04-27 19:57:34 ERROR NewResolveInstanceKey: Empty hostname
presslabs-mysql-operator-0 orchestrator 2021-04-27 19:57:34 ERROR database disk image is malformed
presslabs-mysql-operator-0 orchestrator 2021-04-27 19:57:34 ERROR database disk image is malformed
presslabs-mysql-operator-0 orchestrator 2021/04/27 19:57:34 [WARN] raft: Heartbeat timeout from "" reached, starting election
presslabs-mysql-operator-0 orchestrator 2021/04/27 19:57:34 [INFO] raft: Node at 10.96.56.68:10008 [Candidate] entering Candidate state
presslabs-mysql-operator-0 orchestrator 2021/04/27 19:57:34 [DEBUG] raft: Votes needed: 1
presslabs-mysql-operator-0 orchestrator 2021/04/27 19:57:34 [DEBUG] raft: Vote granted from 10.96.56.68:10008. Tally: 1
presslabs-mysql-operator-0 orchestrator 2021/04/27 19:57:34 [INFO] raft: Election won. Tally: 1
presslabs-mysql-operator-0 orchestrator 2021/04/27 19:57:34 [INFO] raft: Node at 10.96.56.68:10008 [Leader] entering Leader state
presslabs-mysql-operator-0 orchestrator 2021-04-27 19:57:35 ERROR NewResolveInstanceKey: Empty hostname
presslabs-mysql-operator-0 orchestrator 2021-04-27 19:57:35 ERROR NewResolveInstanceKey: Empty hostname
presslabs-mysql-operator-0 orchestrator [martini] Started GET /api/raft-health for 192.168.21.102:33714
presslabs-mysql-operator-0 orchestrator [martini] Completed 200 OK in 647.97µs
presslabs-mysql-operator-0 orchestrator 2021-04-27 19:57:36 ERROR NewResolveInstanceKey: Empty hostname
presslabs-mysql-operator-0 orchestrator 2021-04-27 19:57:36 ERROR NewResolveInstanceKey: Empty hostname
presslabs-mysql-operator-0 orchestrator 2021-04-27 19:57:37 ERROR NewResolveInstanceKey: Empty hostname
presslabs-mysql-operator-0 orchestrator 2021-04-27 19:57:37 ERROR NewResolveInstanceKey: Empty hostname
presslabs-mysql-operator-0 orchestrator 2021-04-27 19:57:37 ERROR database disk image is malformed
presslabs-mysql-operator-0 orchestrator 2021-04-27 19:57:37 ERROR database disk image is malformed

Questions:

  • How can we debug this?
  • In case the persistent data of the (single-Pod) orchestrator is irrevocably corrupted, can I safely delete the orchestrator's persistent data? What would be the implications? Can I continue using my existing MySQL clusters?
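For what it's worth, "database disk image is malformed" is the error SQLite reports for a corrupted database file, and orchestrator keeps its own state in a SQLite file. A minimal sketch of checking that file from inside the operator Pod; the container name and database path are assumptions (the real path is whatever SQLite3DataFile is set to in the orchestrator config), and DRY_RUN=echo only prints the command instead of running it:

```shell
DRY_RUN=echo                                  # clear this to actually run the command
POD=presslabs-mysql-operator-0                # taken from the log prefix above
DB=/var/lib/orchestrator/orchestrator.sqlite3 # assumption: check SQLite3DataFile in the config

# Run SQLite's built-in corruption check against orchestrator's state file.
# A healthy database prints "ok"; a corrupted one lists the damaged pages.
$DRY_RUN kubectl exec "$POD" -c orchestrator -- sqlite3 "$DB" "PRAGMA integrity_check;"
```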

haslersn avatar Apr 30 '21 15:04 haslersn

Hi,

Can you post the issues to https://github.com/openark/orchestrator as well?

You should be safe to delete the orchestrator data (but keep a backup around). Also, before doing this, scale the operator to 1 replica.
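One way to carry those steps out (a sketch, not the operator's documented procedure): scale to 0 first so nothing writes to the volume while it is deleted, then come back up with a single replica. All names here (namespace, StatefulSet, PVC) are assumptions; adjust them to your deployment. DRY_RUN=echo prints each command instead of applying it:

```shell
DRY_RUN=echo                            # clear this to actually apply the commands
NS=default                              # assumption: namespace the operator runs in
STS=presslabs-mysql-operator            # assumption: operator StatefulSet name
PVC=data-presslabs-mysql-operator-0     # assumption: PVC holding orchestrator state

# 1. Stop the operator so nothing writes to the orchestrator data.
$DRY_RUN kubectl -n "$NS" scale statefulset "$STS" --replicas=0

# 2. Keep a backup around before deleting anything (copy out the PV contents too).
$DRY_RUN kubectl -n "$NS" get pvc "$PVC" -o yaml

# 3. Delete the PVC that holds the corrupted orchestrator state.
$DRY_RUN kubectl -n "$NS" delete pvc "$PVC"

# 4. Scale back up to a single replica; orchestrator rediscovers the clusters.
$DRY_RUN kubectl -n "$NS" scale statefulset "$STS" --replicas=1
```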

Btw. have you managed to solve this?

calind avatar Oct 11 '21 11:10 calind

I think that, back then, deleting the persistent data of the orchestrator "solved" the issue.

haslersn avatar Dec 01 '22 14:12 haslersn