ansible-mariadb-galera-cluster

Need to add a delay between "setup_cluster | restarting node to apply config changes (other nodes)" restarts

Open roumano opened this issue 2 years ago • 4 comments

Is your feature request related to a problem? Please describe. When a configuration change is applied to our Galera cluster, the Ansible task setup_cluster | restarting node to apply config changes (other nodes) runs on the remaining nodes in too quick succession, so the cluster is not in a good state for a dozen seconds or so.

For example, the restart happened at 5h28m14s for the second node and at 5h28m24s for the third node. Ten seconds between them is too short in our case, as the second node is not yet Synced in the Galera cluster.

The throttle: 1 is working as expected, but we need to add a wait of X seconds (or, even better, wait for the server to rejoin the cluster pool and then wait X seconds).

Describe the solution you'd like. I'm thinking of moving this specific task into an included task file and adding a wait_for delay to it.

What is your point of view?
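A minimal sketch of the second option (waiting for the node to rejoin the cluster before moving on), assuming the mysql client can log in as root over the local socket without a password. The task name, the `_wsrep_state` variable, and the retry values are hypothetical and not part of the role; `wsrep_local_state_comment` is the Galera status variable that reads `Synced` once the node has caught up.

```yaml
# Hypothetical task: poll the local node until Galera reports it as Synced.
- name: wait for node to report Synced
  ansible.builtin.command:
    argv:
      - mysql
      - --batch
      - --skip-column-names
      - --execute
      - "SHOW GLOBAL STATUS LIKE 'wsrep_local_state_comment'"
  register: _wsrep_state          # hypothetical variable name
  until: "'Synced' in (_wsrep_state.stdout | default(''))"
  retries: 30                     # assumed value: up to 30 attempts
  delay: 10                       # assumed value: 10 seconds between attempts
  changed_when: false             # read-only query, never reports a change
  become: true
```

The task fails once the retries are exhausted, so a node that never reaches Synced is reported as an error rather than silently skipped.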

roumano avatar Aug 24 '22 07:08 roumano

I have replaced this code:

- name: setup_cluster | restarting node to apply config changes (other nodes)
  service: # noqa 503
    name: "{{ mariadb_systemd_service_name }}"
    state: "restarted"
  become: true
  throttle: 1
  when: >
    galera_cluster_configured.stat.exists and
    _mariadb_galera_cluster_reconfigured.changed and
    inventory_hostname != galera_mysql_first_node

With:

- name: Per server, wait and restarting node
  include_tasks: restart_node.yml
  with_items: "{{ ansible_play_hosts }}"
  when:
  - inventory_hostname == item
  - galera_cluster_configured.stat.exists
  - _mariadb_galera_cluster_reconfigured.changed
  - inventory_hostname != galera_mysql_first_node

And the new file restart_node.yml:

---
- name: Sleep for 60 seconds and timeout
  wait_for:
    delay: 60
    timeout: 0
  check_mode: no

- name: setup_cluster | restarting node to apply config changes (other nodes)
  service: # noqa 503
    name: "{{ mariadb_systemd_service_name }}"
    state: "restarted"
  become: true

And it's working as expected.

PS: the check_mode: no is only used for testing with ansible-playbook --check --diff.

PS2: it's possible to add a variable for the delay, like delay: "{{ galera_delay_between_restart | default('60') }}"

roumano avatar Aug 24 '22 15:08 roumano

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Oct 23 '22 16:10 stale[bot]

Keepalive

elcomtik avatar Oct 23 '22 16:10 elcomtik

I like the idea of extracting the code for restarting nodes into a separate task file. I think there are two places where nodes are restarted, so I would reuse it twice. We have a better way to determine whether the node is in a synced state (look at https://github.com/mrlesmithjr/ansible-mariadb-galera-cluster/blob/master/tasks/setup_cluster.yml#L98-L112): we use the systemd status of the mariadb service. This service is designed in such a way that it is not in a running state until the node is synced with the cluster. We can also have a small constant timeout to be sure, but this should not be necessary.
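A systemd-based check of this kind could look roughly like the following. This is an illustrative sketch, not the code at the linked lines: the task name, the `_mariadb_unit_state` variable, and the retry values are assumptions; only `mariadb_systemd_service_name` comes from the role.

```yaml
# Hypothetical task: poll systemd until the unit is reported as "active".
- name: wait for mariadb systemd unit to become active
  ansible.builtin.command:
    argv:
      - systemctl
      - is-active
      - "{{ mariadb_systemd_service_name }}"
  register: _mariadb_unit_state   # hypothetical variable name
  until: (_mariadb_unit_state.stdout | default('')) == 'active'
  retries: 30                     # assumed value: up to 30 attempts
  delay: 10                       # assumed value: 10 seconds between attempts
  changed_when: false             # status query only, never reports a change
  become: true
```

While the unit is still starting, `systemctl is-active` prints a state such as `activating` and exits non-zero, so the task keeps retrying until the state is exactly `active`.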

elcomtik avatar Oct 24 '22 19:10 elcomtik

@roumano please test the latest code and check whether it resolves your issue.

elcomtik avatar Oct 25 '22 19:10 elcomtik

@elcomtik, thank you. I reviewed the code and it looks like exactly what is needed. Sadly, I don't have a Galera cluster test environment, so I can't test it easily (or quickly). I will only be able to test from week 45 (from Monday 7 November).

Is it possible to add a variable to allow a custom sleep timeout (for the task setup_cluster | sleep for 15 seconds to wait for node WSREP prepared state)? On a bigger or heavily loaded cluster, it can take longer than 15 seconds to reach the READY state.

roumano avatar Oct 27 '22 07:10 roumano

It should not be necessary to have any timeout, as we have a check there which looks at the active state of the systemd service. This service is written in such a way that it is not active until the node is synced with the cluster. We have a small timeout there to let the system load return to normal, as it may be a little high after starting/syncing a Galera node.

elcomtik avatar Oct 27 '22 08:10 elcomtik

@roumano - did you have any chance to test this?

eRadical avatar Nov 28 '22 11:11 eRadical

@eRadical, yes, I updated our Ansible role to this version and we found a major issue (fortunately I stopped the deployment before it broke our whole MariaDB cluster).

The regression is linked to this change: https://github.com/mrlesmithjr/ansible-mariadb-galera-cluster/pull/120

The error in the log is:

WSREP: handshake with remote endpoint ssl://XXXX:4567 failed: asio.ssl:337047686: 'certificate verify failed' ( 337047686: 'error:1416F086:SSL routines:tls_process_server_certificate:certificate verify failed')

With this configuration, it fails to start:

wsrep_provider_options = "ist.recv_addr=XXXX:4568; ist.recv_bind=XXXX; socket.ssl_cert=/etc/mysql/certificates/server-cert.pem; socket.ssl_key=/etc/mysql/certificates/server-key.pem; socket.ssl_ca=/etc/mysql/certificates/ca.pem"

The certificate files need to be set first, and then the recv_addr, on two separate lines:

wsrep_provider_options = "socket.ssl_cert=/etc/mysql/certificates/server-cert.pem; socket.ssl_key=/etc/mysql/certificates/server-key.pem; socket.ssl_ca=/etc/mysql/certificates/ca.pem"
wsrep_provider_options = "ist.recv_addr=XXXX:4568; ist.recv_bind=XXXX"

roumano avatar Nov 28 '22 12:11 roumano

wsrep_provider_options is not additive even though it is a dynamic variable.

As per MariaDB documentation: Options need to be provided as a semicolon (;) separated list on a single line.

This is also emphasized on the Codership website, where they say:

All wsrep_provider_options settings need to be specified on a single line.
In case of multiple instances of wsrep_provider_options,
only the last one is used.
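One way to avoid emitting several wsrep_provider_options lines is to collect the individual options in a list and join them into a single string before templating. A minimal sketch, assuming hypothetical variable names (`_wsrep_provider_options_list`, `_wsrep_provider_options_line`) that are not part of the role; the option values are the ones quoted in this thread, with XXXX left as the redacted address.

```yaml
# Hypothetical tasks: join all provider options into one semicolon-separated value.
- name: build a single wsrep_provider_options value
  ansible.builtin.set_fact:
    _wsrep_provider_options_line: "{{ _wsrep_provider_options_list | join('; ') }}"
  vars:
    _wsrep_provider_options_list:   # hypothetical list, one option per entry
      - "socket.ssl_cert=/etc/mysql/certificates/server-cert.pem"
      - "socket.ssl_key=/etc/mysql/certificates/server-key.pem"
      - "socket.ssl_ca=/etc/mysql/certificates/ca.pem"
      - "ist.recv_addr=XXXX:4568"
      - "ist.recv_bind=XXXX"

# Print the line as it would appear in the MariaDB configuration file.
- name: show the resulting configuration line
  ansible.builtin.debug:
    msg: 'wsrep_provider_options = "{{ _wsrep_provider_options_line }}"'
```

Because the file then contains only one wsrep_provider_options entry, none of the options can be discarded by a later entry.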

eRadical avatar Nov 28 '22 13:11 eRadical

What kind of SSL certs are you using?

eRadical avatar Nov 28 '22 13:11 eRadical

You can test this by moving the task "WSREP TLS encryption settings" to be first in the "block:". The change will put the SSL declarations at the beginning.

Also, can you check on the running servers what the value of "wsrep_provider_options" is by issuing:

SELECT REPLACE(@@wsrep_provider_options, ';', '\n')\G

eRadical avatar Nov 28 '22 13:11 eRadical

What kind of SSL certs are you using?

  • server-cert.pem: it's a wildcard domain certificate:
openssl x509 -in /etc/mysql/certificates/server-cert.pem -text -noout
        Validity
            Not Before: Jun  7 00:00:00 2022 GMT
            Not After : Jul  8 23:59:59 2023 GMT
        Subject: CN = *.REPLACED_DOMAIN.fr        
  • /etc/mysql/certificates/server-key.pem: it's the private key of the SSL certificate:
-----BEGIN PRIVATE KEY-----
...
...
-----END PRIVATE KEY-----
  • /etc/mysql/certificates/ca.pem: it's the CA certificate:
openssl x509 -in /etc/mysql/certificates/ca.pem -text -noout
...
        Issuer: C = US, ST = New Jersey, L = Jersey City, O = The USERTRUST Network, CN = USERTrust RSA Certification Authority
        Validity
            Not Before: Nov  2 00:00:00 2018 GMT
            Not After : Dec 31 23:59:59 2030 GMT
        Subject: C = GB, ST = Greater Manchester, L = Salford, O = Sectigo Limited, CN = Sectigo RSA Domain Validation Secure Server CA

My exact old configuration was as follows (taken from ansible-playbook --check --diff output, hence the '-' at the beginning of the lines):

-# WSREP TLS encryption settings
-wsrep_provider_options="socket.ssl_cert=/etc/mysql/certificates/server-cert.pem;socket.ssl_key=/etc/mysql/certificates/server-key.pem;socket.ssl_ca=/etc/mysql/certificates/ca.pem"
-wsrep_provider_options="ist.recv_addr=XXXX:4568"
-wsrep_provider_options="ist.recv_bind=XXXX"
-wsrep_provider_options = ""
  • On my running servers, I don't find any wsrep_provider_options with your command (or with SHOW STATUS LIKE 'wsrep%';)
    • Currently I have 2 servers with the old configuration:
-# WSREP TLS encryption settings
-wsrep_provider_options="socket.ssl_cert=/etc/mysql/certificates/server-cert.pem;socket.ssl_key=/etc/mysql/certificates/server-key.pem;socket.ssl_ca=/etc/mysql/certificates/ca.pem"
-wsrep_provider_options="ist.recv_addr=XXXX:4568"
-wsrep_provider_options="ist.recv_bind=XXXX"
-wsrep_provider_options = ""
  • and 1 server with:
wsrep_provider_options = "socket.ssl_cert=/etc/mysql/certificates/server-cert.pem; socket.ssl_key=/etc/mysql/certificates/server-key.pem; socket.ssl_ca=/etc/mysql/certificates/ca.pem"
wsrep_provider_options = "ist.recv_addr=XXXX:4568; ist.recv_bind=XXXX"

roumano avatar Nov 28 '22 13:11 roumano

Then:

SHOW VARIABLES LIKE 'wsrep_provider_options'

I'm not interested in what's in the file, but rather in what the server is running with.

eRadical avatar Nov 28 '22 14:11 eRadical

There is also this: https://github.com/codership/galera/issues/571

eRadical avatar Nov 28 '22 14:11 eRadical

But first, let's be sure the old servers are actually using the cert part.

I also don't think that the entries in wsrep_provider_options need to be in a particular order.

eRadical avatar Nov 28 '22 14:11 eRadical

I will look deeper into wsrep_provider_options and the SSL setup. However, I confirm the role is now restarting one node after another, so that part is fine.

Possibly, to make the role more robust: if this task fails:

TASK [mariadb-galera-cluster : manage_node_state | make node systemd service restarted] *************************************************
fatal: [sql3]: FAILED! => {"changed": false, "msg": "Unable to restart service mysql: Job for mariadb.service failed because the control process exited with error code.\nSee \"systemctl status mariadb.service\" and \"journalctl -xe\" for details.\n"}

it should stop and not continue on the other nodes (fortunately I pressed Ctrl-C to stop it):

TASK [mariadb-galera-cluster : setup_cluster | cluster rolling restart - apply config changes (other nodes)] ****************************
included:  ansible/roles/mariadb-galera-cluster/tasks/manage_node_state.yml for sql4, sql5 => (item=sql4)
included:  ansible/roles/mariadb-galera-cluster/tasks/manage_node_state.yml for sql4, sql5 => (item=sql5)
^C [ERROR]: User interrupted execution
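One possible way to get this behaviour is the play-level any_errors_fatal keyword, which aborts the play for all hosts once a task fails on any host. A minimal sketch, assuming a hypothetical inventory group `galera_cluster`; whether it interacts well with the role's include-based rolling restart would need testing.

```yaml
# Hypothetical playbook: stop the whole play as soon as one node fails.
- hosts: galera_cluster          # hypothetical inventory group name
  become: true
  any_errors_fatal: true         # a failure on one node stops the play for all nodes
  roles:
    - mariadb-galera-cluster     # role name as it appears in the task output
```

With this set, a failed restart on sql3 should end the run before the restart is attempted on sql4 and sql5.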

roumano avatar Nov 28 '22 16:11 roumano