ansible-mariadb-galera-cluster
Need to add a delay between restarts in "setup_cluster | restarting node to apply config changes (other nodes)"
Is your feature request related to a problem? Please describe.
When a configuration change is applied to our Galera cluster, the Ansible task setup_cluster | restarting node to apply config changes (other nodes) runs too quickly from one node to the next, so the cluster is not in a good state for about a dozen seconds.
For example, the restart happened at 5h28m14s on the second node and at 5h28m24s on the third node. Ten seconds between them is too short in our case, because the second node is not yet Synced in the Galera cluster.
throttle: 1 is working as expected, but we need to add a wait of X seconds (or, even better, wait for the server to rejoin the cluster pool and then wait X seconds).
Describe the solution you'd like
I'm thinking of moving this specific task into an included task file and adding a wait_for delay to it.
What is your point of view?
I have replaced this code:
- name: setup_cluster | restarting node to apply config changes (other nodes)
  service: # noqa 503
    name: "{{ mariadb_systemd_service_name }}"
    state: "restarted"
  become: true
  throttle: 1
  when: >
    galera_cluster_configured.stat.exists and
    _mariadb_galera_cluster_reconfigured.changed and
    inventory_hostname != galera_mysql_first_node
With:
- name: Per server, wait and restarting node
  include_tasks: restart_node.yml
  with_items: "{{ ansible_play_hosts }}"
  when:
    - inventory_hostname == item
    - galera_cluster_configured.stat.exists
    - _mariadb_galera_cluster_reconfigured.changed
    - inventory_hostname != galera_mysql_first_node
And the new file restart_node.yml:
---
- name: Sleep for 60 seconds and timeout
  wait_for:
    delay: 60
    timeout: 0
  check_mode: no
- name: setup_cluster | restarting node to apply config changes (other nodes)
  service: # noqa 503
    name: "{{ mariadb_systemd_service_name }}"
    state: "restarted"
  become: true
and it works as expected.
PS: check_mode: no is only there for testing with ansible-playbook --check --diff
PS2: it's possible to add a variable for the delay, like delay: "{{ galera_delay_between_restart | default('60') }}"
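A minimal sketch of the "wait for the server to rejoin the cluster pool, then wait X seconds" idea from this request. It is not the role's implementation. It assumes the mysql client on the node can authenticate without a prompt (for example through the root socket or a ~/.my.cnf file); galera_delay_between_restart is the hypothetical variable suggested in PS2, and the retry values are arbitrary.

```yaml
---
# restart_node.yml (illustrative sketch only)
- name: Restart the node
  ansible.builtin.service:
    name: "{{ mariadb_systemd_service_name }}"
    state: restarted
  become: true

# Assumption: `mysql` can log in non-interactively on this node.
- name: Wait until the node reports Synced
  ansible.builtin.command:
    cmd: mysql -N -B -e "SHOW STATUS LIKE 'wsrep_local_state_comment'"
  register: wsrep_state
  until: "'Synced' in wsrep_state.stdout"
  retries: 30   # arbitrary: 30 attempts, 10 seconds apart
  delay: 10
  changed_when: false
  become: true

# Hypothetical variable; falls back to 60 seconds when unset.
- name: Extra pause before the next node is restarted
  ansible.builtin.wait_for:
    timeout: "{{ galera_delay_between_restart | default(60) }}"
```

Because the file is included once per host, the status check and the pause both complete before the next node is touched.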
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Keepalive
I like the idea of extracting the code for restarting nodes into a separate task file. I think there are two places where nodes are restarted, so I would reuse it in both. We have a better way to determine whether the node is in the synced state (look at https://github.com/mrlesmithjr/ansible-mariadb-galera-cluster/blob/master/tasks/setup_cluster.yml#L98-L112): we use the systemd status of the mariadb service. This service is designed so that it is not in a running state until the node is synced with the cluster. We can also add a small constant timeout to be sure, but this should not be necessary.
@roumano please test the latest code and check whether it resolves your issue
@elcomtik, thank you. I reviewed the code and it looks like exactly what is needed. Sadly, I don't have a Galera cluster test environment, so I can't test it easily (or quickly). I will only be able to test from week 45 (starting Monday, 7 November).
Would it be possible to add a variable to allow a custom sleep timeout (for the task setup_cluster | sleep for 15 seconds to wait for node WSREP prepared state), since on a bigger or heavily loaded cluster it can take longer than 15 seconds to reach the READY state?
It should not be necessary to have any timeout, as we have a check there which looks at the active state of the systemd service. This service is written so that it is not active until the node is synced with the cluster. We have a small timeout there to let the system load return to normal, as it may be a little high after starting/syncing a Galera node.
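For illustration, a check of this kind could look like the sketch below. This is not the role's actual task (the linked setup_cluster.yml has that). It polls systemctl is-active until the unit reports active, relying on the behaviour described above that the unit only becomes active once the node is synced. The retry values and the variable example_galera_restart_cooldown are hypothetical.

```yaml
- name: Wait for the MariaDB unit to become active (sketch)
  ansible.builtin.command:
    cmd: "systemctl is-active {{ mariadb_systemd_service_name }}"
  register: mariadb_unit_state
  until: mariadb_unit_state.stdout == "active"
  retries: 60   # arbitrary: 60 attempts, 5 seconds apart
  delay: 5
  changed_when: false

# Hypothetical variable so larger clusters can extend the cooldown.
- name: Let the system load settle after the node has synced
  ansible.builtin.wait_for:
    timeout: "{{ example_galera_restart_cooldown | default(15) }}"
```

Exposing the cooldown as a variable with a default of 15 would keep the current behaviour while allowing a longer wait on busy clusters.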
@roumano - did you have a chance to test this?
@eRadical, yes, I updated our Ansible role to this version and we found a major issue (fortunately, I stopped the deployment before it broke our whole MariaDB cluster).
The regression is linked to this change: https://github.com/mrlesmithjr/ansible-mariadb-galera-cluster/pull/120
The error in the log is:
WSREP: handshake with remote endpoint ssl://XXXX:4567 failed: asio.ssl:337047686: 'certificate verify failed' ( 337047686: 'error:1416F086:SSL routines:tls_process_server_certificate:certificate verify failed')
With this configuration, it fails to start:
wsrep_provider_options = "ist.recv_addr=XXXX:4568; ist.recv_bind=XXXX; socket.ssl_cert=/etc/mysql/certificates/server-cert.pem; socket.ssl_key=/etc/mysql/certificates/server-key.pem; socket.ssl_ca=/etc/mysql/certificates/ca.pem"
The certificate files need to be set first, then the recv_addr, on two different lines:
wsrep_provider_options = "socket.ssl_cert=/etc/mysql/certificates/server-cert.pem; socket.ssl_key=/etc/mysql/certificates/server-key.pem; socket.ssl_ca=/etc/mysql/certificates/ca.pem"
wsrep_provider_options = "ist.recv_addr=XXXX:4568; ist.recv_bind=XXXX"
wsrep_provider_options is not additive, even though it is a dynamic variable.
As per MariaDB documentation:
Options need to be provided as a semicolon (;) separated list on a single line.
This is also emphasized on Codership website where they say:
All wsrep_provider_options settings need to be specified on a single line.
In case of multiple instances of wsrep_provider_options,
only the last one is used.
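A minimal sketch of how a task could keep every provider option on one line, following the two quotes above. The variable example_provider_options and the config file path are hypothetical; the option values are the ones posted earlier in this thread, with the XXXX placeholders left as they are. It assumes the file already contains exactly one wsrep_provider_options line in the right section, since lineinfile only rewrites the last match and appends at the end of the file when nothing matches.

```yaml
- name: Write every provider option on a single line (sketch)
  ansible.builtin.lineinfile:
    path: /etc/mysql/conf.d/galera.cnf   # hypothetical path
    regexp: '^wsrep_provider_options\s*='
    line: >-
      wsrep_provider_options = "{{ example_provider_options | join('; ') }}"
  vars:
    example_provider_options:   # hypothetical variable
      - "socket.ssl_cert=/etc/mysql/certificates/server-cert.pem"
      - "socket.ssl_key=/etc/mysql/certificates/server-key.pem"
      - "socket.ssl_ca=/etc/mysql/certificates/ca.pem"
      - "ist.recv_addr=XXXX:4568"
      - "ist.recv_bind=XXXX"
  become: true
```

Building the line from a list means adding an option cannot accidentally create a second wsrep_provider_options entry that overrides the first.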
What kind of SSL certs are you using?
You can test this by moving the task "WSREP TLS encryption settings" to be first in the "block:". That change will put the SSL declarations at the beginning.
Also, can you check on the running servers what the value of "wsrep_provider_options" is by issuing:
SELECT REPLACE(@@wsrep_provider_options, ';', '\n')\G
What kind of SSL certs are you using?
- server-cert.pem: it's a wildcard domain certificate:
openssl x509 -in /etc/mysql/certificates/server-cert.pem -text -noout
Validity
Not Before: Jun 7 00:00:00 2022 GMT
Not After : Jul 8 23:59:59 2023 GMT
Subject: CN = *.REPLACED_DOMAIN.fr
- /etc/mysql/certificates/server-key.pem: it's the private key of the SSL certificate
-----BEGIN PRIVATE KEY-----
...
...
-----END PRIVATE KEY-----
- /etc/mysql/certificates/ca.pem: it's the CA certificate:
openssl x509 -in /etc/mysql/certificates/ca.pem -text -noout
...
Issuer: C = US, ST = New Jersey, L = Jersey City, O = The USERTRUST Network, CN = USERTrust RSA Certification Authority
Validity
Not Before: Nov 2 00:00:00 2018 GMT
Not After : Dec 31 23:59:59 2030 GMT
Subject: C = GB, ST = Greater Manchester, L = Salford, O = Sectigo Limited, CN = Sectigo RSA Domain Validation Secure Server CA
My exact old configuration was as follows (taken from ansible-playbook --check --diff output, which is why there is a '-' at the beginning of each line):
-# WSREP TLS encryption settings
-wsrep_provider_options="socket.ssl_cert=/etc/mysql/certificates/server-cert.pem;socket.ssl_key=/etc/mysql/certificates/server-key.pem;socket.ssl_ca=/etc/mysql/certificates/ca.pem"
-wsrep_provider_options="ist.recv_addr=XXXX:4568"
-wsrep_provider_options="ist.recv_bind=XXXX"
-wsrep_provider_options = ""
- On my running servers, I don't find any wsrep_provider_options with your command (or with SHOW STATUS LIKE 'wsrep%';)
- Currently I have 2 servers with the old configuration:
-# WSREP TLS encryption settings
-wsrep_provider_options="socket.ssl_cert=/etc/mysql/certificates/server-cert.pem;socket.ssl_key=/etc/mysql/certificates/server-key.pem;socket.ssl_ca=/etc/mysql/certificates/ca.pem"
-wsrep_provider_options="ist.recv_addr=XXXX:4568"
-wsrep_provider_options="ist.recv_bind=XXXX"
-wsrep_provider_options = ""
- 1 server with
wsrep_provider_options = "socket.ssl_cert=/etc/mysql/certificates/server-cert.pem; socket.ssl_key=/etc/mysql/certificates/server-key.pem; socket.ssl_ca=/etc/mysql/certificates/ca.pem"
wsrep_provider_options = "ist.recv_addr=XXXX:4568; ist.recv_bind=XXXX"
Then:
SHOW VARIABLES LIKE 'wsrep_provider_options'
I'm not interested in what's in the file, but rather in what the server is actually running with.
There is also this: https://github.com/codership/galera/issues/571
But first, let's be sure the old servers are actually using the cert part.
I also don't think that the entries in wsrep_provider_options need to be in a particular order.
I will look deeper into wsrep_provider_options and the SSL setup. However, I confirm the role now restarts one node after another, so that part is fine.
Additionally, to make the role more robust, if this task fails:
TASK [mariadb-galera-cluster : manage_node_state | make node systemd service restarted] *************************************************
fatal: [sql3]: FAILED! => {"changed": false, "msg": "Unable to restart service mysql: Job for mariadb.service failed because the control process exited with error code.\nSee \"systemctl status mariadb.service\" and \"journalctl -xe\" for details.\n"}
it should stop and not continue on the other nodes (fortunately I pressed Ctrl-C to stop it):
TASK [mariadb-galera-cluster : setup_cluster | cluster rolling restart - apply config changes (other nodes)] ****************************
included: ansible/roles/mariadb-galera-cluster/tasks/manage_node_state.yml for sql4, sql5 => (item=sql4)
included: ansible/roles/mariadb-galera-cluster/tasks/manage_node_state.yml for sql4, sql5 => (item=sql5)
^C [ERROR]: User interrupted execution
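One way to get this stop-on-first-failure behaviour without changing the role might be Ansible's any_errors_fatal play keyword, which ends the play for every host as soon as a task fails on any of them. The sketch below assumes a hypothetical inventory group named galera_cluster; the role name is taken from the task output above.

```yaml
- name: Deploy the Galera cluster (sketch)
  hosts: galera_cluster     # hypothetical inventory group
  become: true
  any_errors_fatal: true    # a failure on one host ends the play for all hosts
  roles:
    - role: mariadb-galera-cluster
```

With this set, a failed restart on sql3 should abort the run before the included restart tasks for sql4 and sql5 are executed.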