Orchestrator performs an incorrect switch, causing database service failure (problem analysis)

Open: duanhui8 opened this issue 3 years ago • 1 comment

Hi, I may have found a bug; please help. When I execute the change master command on 192.168.73.128:4307, the DistributePairs function writes the information of the MySQL instance that is down (192.168.73.128:4308) to Consul, so the database can no longer serve requests.

1. This is my database topology (image).

2. The database_instance table in the SQLite database (the MySQL instance at 192.168.73.128:4308 is down).

3. This code filters out the real MySQL master (192.168.73.128:3307) and retains the MySQL instance that is down (192.168.73.128:4308), which is then written to Consul by the code in step 4.

4. This code writes the information of the downed MySQL instance (192.168.73.128:4308) to Consul.

5. consul-template detects that the Consul key has changed and writes the downed MySQL instance into haproxy.cfg, causing the failure.
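To make the failure mode in steps 3 to 5 concrete, here is a minimal Go sketch of the kind of guard that would avoid it: skip any candidate whose last health check failed before building the key/value pairs that get pushed to Consul. This is not orchestrator's actual code. `Instance`, `KVPair`, `MasterPairs`, the `LastCheckOK` field, the `mysql/master/` key prefix, and the `mycluster` alias are all assumptions made up for illustration.

```go
package main

import "fmt"

// Instance is a hypothetical, simplified stand-in for one row of the
// database_instance table. It is NOT orchestrator's real struct.
type Instance struct {
	Host        string
	Port        int
	LastCheckOK bool // assumed flag: did the most recent health check succeed?
}

// Addr returns the instance address in host:port form.
func (i Instance) Addr() string {
	return fmt.Sprintf("%s:%d", i.Host, i.Port)
}

// KVPair is a hypothetical key/value pair destined for Consul.
type KVPair struct {
	Key   string
	Value string
}

// MasterPairs builds KV pairs only for candidates whose last check
// succeeded, so an instance that is down is never published as master.
// The "mysql/master/" prefix is an assumption for this sketch.
func MasterPairs(clusterAlias string, candidates []Instance) []KVPair {
	pairs := []KVPair{}
	for _, c := range candidates {
		if !c.LastCheckOK {
			continue // skip instances that are down
		}
		pairs = append(pairs, KVPair{
			Key:   "mysql/master/" + clusterAlias,
			Value: c.Addr(),
		})
	}
	return pairs
}

func main() {
	// Addresses taken from the report: 4308 is down, 3307 is the real master.
	candidates := []Instance{
		{Host: "192.168.73.128", Port: 4308, LastCheckOK: false},
		{Host: "192.168.73.128", Port: 3307, LastCheckOK: true},
	}
	for _, p := range MasterPairs("mycluster", candidates) {
		fmt.Printf("%s = %s\n", p.Key, p.Value)
	}
}
```

With a guard like this, consul-template would only ever see a reachable instance in the key it watches, so haproxy.cfg would not be rewritten to point at the downed server.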

If you need additional information from me, please contact me. If this is a bug, please help fix it. Thank you.

duanhui8, Oct 21 '22 13:10

I believe I have the same issue using mariadb. I shut down the server via systemd to simulate a crash. The mysqlorchestrator UI shows the replica in red and unable to connect, and the database table shows the new master value, but consul still shows the dead server as the master. Even if I delete the consul values and force a master repopulation via the cli, or just wait long enough, orc populates the dead server back into consul as the current master. So it's not a stale data issue; it seems orc is simply wrong about which box should be the master, but only when writing to the KV store.

This is unfortunate, as this was the exact use case I was hoping to use it for. It also clearly didn't use to be a problem, since there are many blogs from many companies online describing its use in this capacity, so I assume this bug was mistakenly introduced in a later version. The project also seems to have gone stale since the primary maintainer stepped back from development. I hope this project does not die; it seems wildly useful.

alnet, Feb 14 '23 02:02