orchestrator
Orchestrator Not Detecting Failure with Full Disk and Writes
I experienced an issue where the disk filled up on a master. The server was still up, but any type of write would fail. Orchestrator did not see this as a problem and did not fail the server over to a slave.
As an example, I took a MySQL server and filled up the drive. SELECT queries still worked, but creating a schema failed with the error below. I would consider this server down and would want to fail over.
root@mysql1:/home/vagrant# mysql -h 192.168.56.101 -u my_app2_user -p -e "SELECT @@hostname"
Warning: Using a password on the command line interface can be insecure.
+------------+
| @@hostname |
+------------+
| mysql2     |
+------------+
root@mysql1:/home/vagrant# mysql -h 192.168.56.101 -u my_app2_user -p -e "CREATE DATABASE testing_space"
ERROR 1006 (HY000) at line 1: Can't create database 'testing_space' (errno: 28)
Would it be possible to add a write test for servers that are configured as writeable, or to add a config option that overrides the default check with a custom check to determine whether a server or master is up and writeable? That way I could run the following SQL as a writability check:
CREATE TABLE mysql.test_write(test INT); DROP TABLE mysql.test_write
While I see your point, this failure scenario does not fall well under orchestrator's realm.
In the eyes of orchestrator, and probably in the eyes of the replicas, the master box is still up. While I agree it's not functional, orchestrator wishes to avoid premature promotion, and the method you illustrate may cause premature promotion: it lets orchestrator fail over based on a test initiated by orchestrator and orchestrator only.
Can't create database 'testing_space' (errno: 28) could be caused by reasons other than disk space, and I'm hesitant to have orchestrator issue a failover: that same reason may occur on the replacement master, too!
If there is a definitive way (I don't think there is) to know that a write failed due to exhausted disk space, I'm willing to consider it.
Otherwise I think this particular problem had better be mitigated by disk space monitoring.
I do fully agree that the issue should have been resolved by disk space monitoring.
Do the other causes you mention for this scenario also generate the same errno 28? I thought this was an OS-generated error number specifically for disk space.
Thanks for your time in reviewing; it's much appreciated.
Do the other methods that you mention that can cause this scenario also generate the same errno 28?
Hmmm, I'm not sure. I hadn't considered that this was an OS-generated error number specifically for disk space. If that is the case, it's worth considering letting orchestrator handle this as a special failover scenario -- though I still find it a bit scary.
I believe that errno 28 is specific to disk space issues, as defined in the Linux kernel:
#define ENOSPC 28 /* No space left on device */
However, the value is used in over 1000 locations within the code base (mostly in CPU-arch-specific files), so it would take some effort to exhaustively validate that it's only ever used in cases where the disk has run out of space.
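For anyone who wants to check the mapping from userspace, here is a minimal Python sketch. It pulls the OS errno out of a MySQL error message like the one quoted earlier in this thread and compares it to the platform's ENOSPC constant. The function names and the regular expression are my own assumptions for illustration, not anything orchestrator or MySQL provides, and a match only says the OS reported ENOSPC; it does not prove that a full disk is the only possible cause.

```python
import errno
import os
import re

# Assumed pattern for the "(errno: N" suffix seen in the MySQL error above.
_ERRNO_RE = re.compile(r"\(errno: (\d+)")


def os_errno_from_message(message):
    """Return the OS errno embedded in a MySQL error message, or None."""
    match = _ERRNO_RE.search(message)
    return int(match.group(1)) if match else None


def is_out_of_space(message):
    """True if the embedded errno equals this platform's ENOSPC."""
    return os_errno_from_message(message) == errno.ENOSPC


if __name__ == "__main__":
    msg = ("ERROR 1006 (HY000) at line 1: "
           "Can't create database 'testing_space' (errno: 28)")
    code = os_errno_from_message(msg)
    # Print the numeric code, its symbolic name, and the OS description.
    print(code, errno.errorcode.get(code), os.strerror(code))
    print("out of space:", is_out_of_space(msg))
```

Comparing against `errno.ENOSPC` rather than a literal 28 keeps the check tied to whatever the local platform defines.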
Personally, I'm working around this issue with a local script that watches disk space and, taking some other variables into account, shuts down MySQL after a graceful failover if free space gets too low.
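The disk space part of such a watcher could look roughly like the sketch below. This is not the actual script: the 5% threshold, the data directory path, and the function names are all assumptions, and the "other variables" and the failover/shutdown steps are deliberately left out. It only decides whether free space has dropped below a limit.

```python
import shutil


def free_fraction(path):
    """Fraction of the filesystem holding `path` that is still free."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total


def disk_too_low(path, min_free_fraction=0.05, usage_fn=free_fraction):
    """True when free space on `path` is below the (assumed) threshold.

    `usage_fn` is injectable so the decision can be exercised without
    actually filling a disk.
    """
    return usage_fn(path) < min_free_fraction


if __name__ == "__main__":
    datadir = "/var/lib/mysql"  # assumed MySQL data directory
    if disk_too_low(datadir):
        # A real script would trigger a graceful failover here and only
        # then stop MySQL; both steps are omitted from this sketch.
        print("free space below threshold on", datadir)
    else:
        print("free space ok on", datadir)
```

Running something like this from cron or a systemd timer would keep the decision on the database host itself, where the free space is known for certain.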