mysql-cluster
Horizontal scaling not functional after binlog expiry has passed
Hi,
We have been using a MySQL cluster (master-master) for testing purposes for more than a month now, and over the past few days we have been testing horizontal scaling of the MySQL nodes. What we found is that when we scale through the UI, a slave node is created (which is fine), but it immediately logs a replication error and is still added to the ProxySQL balancing... which causes query errors in the application.
What we expected is that horizontal scaling with zero downtime would be possible by simply adding a node through the topology UI.
We think the error happens because the newly created slave starts catching up by reading the master's binlogs, but since the oldest binlogs have already been purged, it misses any operations older than 30 days (the default binlog expiration period).
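For reference, this retention window can be checked directly on the master; in MySQL 8.0 it is controlled by binlog_expire_logs_seconds (default 2592000 seconds, i.e. 30 days). The 90-day value below is only an illustrative override, not something recommended in this thread:

-- Current retention window for binary logs (MySQL 8.0 default: 2592000 s = 30 days)
SHOW VARIABLES LIKE 'binlog_expire_logs_seconds';
-- Binlogs still available on the master; anything older has already been purged
SHOW BINARY LOGS;
-- Optionally extend retention so a freshly created slave can still catch up (illustrative: 90 days)
SET PERSIST binlog_expire_logs_seconds = 7776000;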
Here is the slave log:
2020-09-28T16:32:50.385084Z 0 [System] [MY-010931] [Server] /usr/sbin/mysqld: ready for connections. Version: '8.0.21' socket: '/var/lib/mysql/mysql.sock' port: 3306 MySQL Community Server - GPL.
2020-09-28T16:33:01.263173Z 13 [Warning] [MY-010604] [Repl] Neither --relay-log nor --relay-log-index were used; so replication may break when this MySQL server acts as a slave and has his hostname changed!! Please use '--relay-log=node43424-oa-mysql-dev-relay-bin' to avoid this problem.
2020-09-28T16:33:01.267862Z 13 [System] [MY-010597] [Repl] 'CHANGE MASTER TO FOR CHANNEL '' executed'. Previous state master_host='', master_port= 3306, master_log_file='', master_log_pos= 4, master_bind=''. New state master_host='node41185', master_port= 3306, master_log_file='', master_log_pos= 4, master_bind=''.
2020-09-28T16:33:01.274050Z 14 [Warning] [MY-010897] [Repl] Storing MySQL user name or password information in the master info repository is not secure and is therefore not recommended. Please consider using the USER and PASSWORD connection options for START SLAVE; see the 'START SLAVE Syntax' in the MySQL Manual for more information.
2020-09-28T16:33:01.275945Z 14 [System] [MY-010562] [Repl] Slave I/O thread for channel '': connected to master 'repl-7086965@node41185:3306',replication started in log 'FIRST' at position 4
2020-09-28T16:33:01.285289Z 15 [ERROR] [MY-010584] [Repl] Slave SQL for channel '': Error 'Unknown database 'oadev'' on query. Default database: 'oadev'. Query: 'update `knex_migrations_lock` set `is_locked` = 1 where `is_locked` = 0', Error_code: MY-001049
2020-09-28T16:33:01.285329Z 15 [Warning] [MY-010584] [Repl] Slave: Unknown database 'oadev' Error_code: MY-001049
2020-09-28T16:33:01.285340Z 15 [ERROR] [MY-010586] [Repl] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'mysql-bin.000016' position 156
We are looking for the right approach to be able to scale horizontally at any time, for example by making a dump available somewhere so that the slave automatically loads it at first boot.
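The seeding approach suggested above would look roughly like this on a newly provisioned slave, assuming a dump taken from the master with mysqldump --master-data so the binlog coordinates are recorded; the host, user, password and coordinates below are placeholders, not values from this cluster:

-- On the new slave, after importing the dump
STOP SLAVE;
CHANGE MASTER TO
  MASTER_HOST = 'master-host',           -- placeholder
  MASTER_USER = 'repl_user',             -- placeholder
  MASTER_PASSWORD = 'repl_password',     -- placeholder
  MASTER_LOG_FILE = 'mysql-bin.000123',  -- coordinates recorded in the dump
  MASTER_LOG_POS = 4;
START SLAVE;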
@kaore, thank you for the request. We've discussed it internally with the team and agree that it is a real, major issue. We have also found a way to fix it properly by changing the slave scaling algorithm; we are going to check whether these changes are safe and will keep you posted on the results. Have a good day. )
Hi @ihorman, any progress on this? Let me know if you need any further info regarding our cluster implementation.
Hi @kaore, we have a concept of how to solve your issue. We'll get back to you once we have the result. Thanks.
Basically, the solution is to create a new slave by cloning an existing one. This approach eliminates the load on the master node and the dependency on the binlogs. As for the ETA, please expect it by the end of next week.
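The exact implementation is not shown in this thread, but MySQL 8.0.17+ ships a clone plugin that supports this kind of donor-based provisioning; a minimal sketch with hypothetical hosts and credentials:

-- On the donor (an existing slave): install the plugin and allow a clone user
INSTALL PLUGIN clone SONAME 'mysql_clone.so';
CREATE USER 'clone_user'@'%' IDENTIFIED BY 'clone_pass';  -- hypothetical credentials
GRANT BACKUP_ADMIN ON *.* TO 'clone_user'@'%';

-- On the recipient (the new node): install the plugin, whitelist the donor, clone
INSTALL PLUGIN clone SONAME 'mysql_clone.so';
SET GLOBAL clone_valid_donor_list = 'donor-host:3306';    -- hypothetical hostname
CLONE INSTANCE FROM 'clone_user'@'donor-host':3306 IDENTIFIED BY 'clone_pass';
-- After the automatic restart, point the new node at its master with CHANGE MASTER TO / START SLAVE.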
Hi @ihorman, OK, happy to hear this is moving forward. Thanks for your feedback.
Hi @kaore, apologies for the delay; we are at the QA stage.
Hi @ihorman @sych74 any update on this? Cheers
Hi @kaore, we are testing the even distribution of slaves between the masters during scaling in the master-master topology. After that, the DB cluster will be updated in the marketplace and will be available to everyone.
Hi @kaore, the change you requested is finally completed and available in production (both via the marketplace and the topology wizard). In order to check it out, you need to create a new environment. Later, after you perform the tests, you will be able to re-import your data from the previous installation into the new one. A few words about the changes: the master-slave and master-master topologies have a new horizontal scaling algorithm in which a new cluster member is created by cloning an existing slave node. Right after the cloning procedure is completed, the database on the new cluster member catches up via binlog replay, which is short enough in time that the binlogs cannot expire, unlike the previous case where the newly added node was created from scratch. Looking forward to your feedback.
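A simple way to confirm on the new member that the short binlog catch-up described above has finished is standard replication status (nothing specific to this implementation):

-- On the newly added cluster member
SHOW SLAVE STATUS\G
-- Slave_IO_Running: Yes, Slave_SQL_Running: Yes and Seconds_Behind_Master: 0
-- indicate that the remaining binlog events have been replayed.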
Hi @sych74, OK, great, thanks. We'll test it shortly; I'll get back to you as soon as we have something up and running.
Hi @sych74, we have deployed and tested the scaling, and it seems to work OK, but there is one thing that will be troublesome when we need to scale up without downtime: from what we saw, when a new node is added it immediately appears in ProxySQL's mysql_servers in ONLINE mode, which triggers query timeouts every time the new instance is hit, at least until it has finished importing data.
Hi @kaore, thanks a lot for your question and the investigation you conducted. Please note that it is not correct to monitor the status of cluster members via mysql_servers: that table describes all the DB nodes that have been added to the ProxySQL configuration. Active nodes that participate in the distribution must be monitored in the runtime_mysql_servers table; if a node is unavailable or its replication is lagging, its status is changed to SHUNNED and requests to it are not proxied. Check the status of active nodes via the runtime_mysql_servers table.
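For example, on the ProxySQL admin interface (port 6032 by default, which is an assumption about this setup):

-- Live view used for routing decisions: ONLINE / SHUNNED / OFFLINE_* statuses
SELECT hostgroup_id, hostname, port, status FROM runtime_mysql_servers;
-- Configured view only; does not reflect the current runtime status
SELECT hostgroup_id, hostname, port, status FROM mysql_servers;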
Hi @sych74, thanks for your reply. Once replication started, we ran queries from outside the cluster against the proxy (proxy.jcloud...) to see how it was balancing the load while a new node was being added: select @@server_id; select count(id) from agenda; With this we encountered frequent timeouts. Repeatedly executing the first part of the query only confirmed that the new slave was being hit. Only when we noticed this issue did we start to look into what was going on in the proxy.
The problem is that right after a node becomes unavailable, requests still reach it via ProxySQL until the monitor changes the node status to SHUNNED; the status only changes once the node-unavailability timeout has run out. The solution is to take the node to be cloned out of ProxySQL before scaling. After scaling, we set the status of the new node to SHUNNED and it starts catching up on the data. As soon as the data is synchronized on the new node, ProxySQL automatically switches it from SHUNNED to ONLINE. We'll add this next week and let you know. Thanks.
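In ProxySQL admin terms, that flow maps onto statements roughly like the following; the hostgroup id, hostnames and lag threshold are placeholders rather than the actual implementation used here:

-- Before cloning: drain the donor node so it receives no new traffic
UPDATE mysql_servers SET status = 'OFFLINE_SOFT' WHERE hostname = 'donor-host';
LOAD MYSQL SERVERS TO RUNTIME;

-- After cloning: add the new node as SHUNNED with a replication-lag limit, so the
-- monitor only promotes it to ONLINE once it has caught up
INSERT INTO mysql_servers (hostgroup_id, hostname, port, status, max_replication_lag)
VALUES (1, 'new-node-host', 3306, 'SHUNNED', 10);
LOAD MYSQL SERVERS TO RUNTIME;
SAVE MYSQL SERVERS TO DISK;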
Hi @kaore, the fix for the reported issues is available in production (both via the marketplace and the topology wizard). In order to check it, you need to create a new environment. Looking forward to your feedback. Thanks.
Hi @sych74, we created a cluster and scaled it up and down without getting any query errors. Looks like it works! Thank you for this!
Hi @kaore, thank you for getting back with the confirmation that it works.