Unstable connection between master and workers according to cluster.log in reliability test
| Wazuh version | Component |
|---|---|
| 4.6.0 | Wazuh server cluster |
Description
An error in the automated tests is produced by the output generated in the cluster.log file. The log shows that the worker nodes lose their connection to the master node, and the same message appears consistently throughout the whole file.
```
Running reliability/test_cluster/test_cluster_logs/test_cluster_connection/test_cluster_connection.py::test_cluster_connection
if disconnected_nodes:
> pytest.fail(f'The following nodes disconnected from master at any point:\n- ' + '\n- '.join(disconnected_nodes))
E Failed: The following nodes disconnected from master at any point:
E - worker_21
E - worker_13
E - worker_4
E - worker_6
E - worker_18
E - worker_20
E - worker_11
E - worker_2
E - worker_14
E - worker_3
E - worker_9
E - worker_17
E - worker_22
E - worker_7
E - worker_5
E - worker_19
E - worker_12
E - worker_25
E - worker_8
E - worker_24
E - worker_23
E - worker_10
E - worker_16
E - worker_15
E - worker_1
test_cluster_logs/test_cluster_connection/test_cluster_connection.py:47: Failed
```
Steps to reproduce
It can be reproduced by running the Workload benchmark test pipeline and then running the manual reliability tests from wazuh-qa 4.6.0. The pipeline should be run with the same parameters as https://ci.wazuh.info/job/CLUSTER-Workload_benchmarks_metrics/274/
The report is attached here.
Update
It seems related to https://github.com/wazuh/wazuh/issues/17928
All the reliability tests:
- test_cluster_connection
- test_cluster_error_logs
- test_check_logs_order_workers
- test_check_logs_order_master
- test_cluster_sync
- test_cluster_task_order
are not currently prepared to analyze logs in which one of the managers has been restarted. They should be adapted to work correctly when this happens, since it is a common scenario in the workload benchmarks when API performance is being tested.
For example, test_check_logs_order_workers checks whether the logs of every cluster task (Agent-info sync, Integrity check, etc.) appear in the expected order. However, tasks are reinitialized when the connection is broken because of a cluster restart, so the test should also reset the expected order when it finds a connection-broken log, as sketched below.
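A minimal sketch of that reset logic, assuming the lines have already been filtered to one worker and one task (plus the connection messages). The class, the expected sequence, and the regex below are illustrative and not the actual wazuh-qa implementation:

```python
import re

# Hypothetical pattern for "connection broken" events; the real test may need to
# match additional messages.
DISCONNECTION_RE = re.compile(r'The master closed the connection')


class TaskLogOrderChecker:
    """Check that the log lines of one cluster task appear in the expected order."""

    def __init__(self, expected_order):
        self.expected_order = expected_order
        self.position = 0

    def check_line(self, line):
        """Return True if the line is acceptable, False if it is out of order."""
        if DISCONNECTION_RE.search(line):
            # The connection was broken (e.g. by a cluster restart), so the task
            # is reinitialized: restart the expected sequence instead of failing.
            self.position = 0
            return True
        if self.expected_order[self.position] in line:
            self.position = (self.position + 1) % len(self.expected_order)
            return True
        return False


checker = TaskLogOrderChecker(['Permission to synchronize granted', 'Starting.',
                               'Obtained', 'Sending chunks', 'Finished in'])
sample = [
    '[Agent-info sync] Permission to synchronize granted.',
    '[Agent-info sync] Starting.',
    '[Agent-info sync] Obtained 29 chunks of data in 0.235s.',
    '[Agent-info sync] Sending chunks.',
    '[Main] The master closed the connection',
    # Without the reset above, this restart of the sequence would be flagged
    # as an out-of-order log.
    '[Agent-info sync] Permission to synchronize granted.',
]
assert all(checker.check_line(line) for line in sample)
```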
It was reported in the issue below that some tests did not run in the last build. We should fix this as part of this issue if the problem keeps happening:
- https://github.com/wazuh/wazuh-qa/issues/5281
Update
I have been reading the test code and taking the first steps toward a resolution.
Update
I was debugging the test_check_logs_order_workers test and got it working for a particular log sample:
```
2023/08/14 19:42:19 DEBUG: [Worker CLUSTER-Workload_benchmarks_metrics_B283_manager_23] [Agent-info sync] Permission to synchronize granted.
2023/08/14 19:42:19 INFO: [Worker CLUSTER-Workload_benchmarks_metrics_B283_manager_23] [Agent-info sync] Starting.
2023/08/14 19:42:19 DEBUG: [Worker CLUSTER-Workload_benchmarks_metrics_B283_manager_23] [Agent-info sync] Obtained 29 chunks of data in 0.235s.
2023/08/14 19:42:19 DEBUG: [Worker CLUSTER-Workload_benchmarks_metrics_B283_manager_23] [Agent-info sync] Sending chunks.
2023/08/14 19:42:20 INFO: [Worker CLUSTER-Workload_benchmarks_metrics_B283_manager_23] [Main] The master closed the connection
2023/08/14 19:43:50 DEBUG: [Worker CLUSTER-Workload_benchmarks_metrics_B283_manager_23] [Agent-info sync] Permission to synchronize granted.
2023/08/14 19:43:50 INFO: [Worker CLUSTER-Workload_benchmarks_metrics_B283_manager_23] [Agent-info sync] Starting.
2023/08/14 19:43:50 DEBUG: [Worker CLUSTER-Workload_benchmarks_metrics_B283_manager_23] [Agent-info sync] Obtained 29 chunks of data in 0.232s.
2023/08/14 19:43:52 DEBUG: [Worker CLUSTER-Workload_benchmarks_metrics_B283_manager_23] [Agent-info sync] Sending chunks.
2023/08/14 19:43:54 DEBUG: [Worker CLUSTER-Workload_benchmarks_metrics_B283_manager_23] [Main] Command received: 'b'dapi''
2023/08/14 19:43:54 DEBUG: [Worker CLUSTER-Workload_benchmarks_metrics_B283_manager_23] [D API] Received request: b'master*124633 {"f": {"__callable__": {"__name__": "restart", "__module__": "wazuh.manager", "__qualname__": "restart", "__type__": "function"}}, "f_kwargs": {}, "request_type": "distributed_master", "wait_for_complete": false, "from_cluster": true, "is_async": false, "local_client_arg": null, "basic_services": ["wazuh-modulesd", "wazuh-analysisd", "wazuh-execd", "wazuh-db", "wazuh-remoted"], "rbac_permissions": {"agent:create": {"*:*:*": "allow"}, "group:create": {"*:*:*": "allow"}, "agent:read": {"agent:id:*": "allow", "agent:group:*": "allow"}, "agent:delete": {"agent:id:*": "allow", "agent:group:*": "allow"}, "agent:modify_group": {"agent:id:*": "allow", "agent:group:*": "allow"}, "agent:reconnect": {"agent:id:*": "allow", "agent:group:*": "allow"}, "agent:restart": {"agent:id:*": "allow", "agent:group:*": "allow"}, "agent:upgrade": {"agent:id:*": "allow", "agent:group:*": "allow"}, "group:read": {"group:id:*": "allow"}, "group:delete": {"group:id:*": "allow"}, "group:update_config": {"group:id:*": "allow"}, "group:modify_assignments": {"group:id:*": "allow"}, "active-response:command": {"agent:id:*": "allow"}, "security:create": {"*:*:*": "allow"}, "security:create_user": {"*:*:*": "allow"}, "security:read_config": {"*:*:*": "allow"}, "security:update_config": {"*:*:*": "allow"}, "security:revoke": {"*:*:*": "allow"}, "security:edit_run_as": {"*:*:*": "allow"}, "security:read": {"role:id:*": "allow", "policy:id:*": "allow", "user:id:*": "allow", "rule:id:*": "allow"}, "security:update": {"role:id:*": "allow", "policy:id:*": "allow", "user:id:*": "allow", "rule:id:*": "allow"}, "security:delete": {"role:id:*": "allow", "policy:id:*": "allow", "user:id:*": "allow", "rule:id:*": "allow"}, "cluster:status": {"*:*:*": "allow"}, "manager:read": {"*:*:*": "allow"}, "manager:read_api_config": {"*:*:*": "allow"}, "manager:update_config": {"*:*:*": "allow"}, "manager:restart": {"*:*:*": "allow"}, "cluster:read_api_config": {"node:id:*": "allow"}, "cluster:read": {"node:id:*": "allow"}, "cluster:restart": {"node:id:*": "allow"}, "cluster:update_config": {"node:id:*": "allow"}, "ciscat:read": {"agent:id:*": "allow"}, "decoders:read": {"decoder:file:*": "allow"}, "decoders:delete": {"decoder:file:*": "allow"}, "decoders:update": {"*:*:*": "allow"}, "lists:read": {"list:file:*": "allow"}, "lists:delete": {"list:file:*": "allow"}, "lists:update": {"*:*:*": "allow"}, "rootcheck:clear": {"agent:id:*": "allow"}, "rootcheck:read": {"agent:id:*": "allow"}, "rootcheck:run": {"agent:id:*": "allow"}, "rules:read": {"rule:file:*": "allow"}, "rules:delete": {"rule:file:*": "allow"}, "rules:update": {"*:*:*": "allow"}, "mitre:read": {"*:*:*": "allow"}, "sca:read": {"agent:id:*": "allow"}, "syscheck:clear": {"agent:id:*": "allow"}, "syscheck:read": {"agent:id:*": "allow"}, "syscheck:run": {"agent:id:*": "allow"}, "syscollector:read": {"agent:id:*": "allow"}, "logtest:run": {"*:*:*": "allow"}, "task:status": {"*:*:*": "allow"}, "vulnerability:read": {"agent:id:*": "allow"}, "vulnerability:run": {"*:*:*": "allow"}, "rbac_mode": "white"}, "current_user": "", "broadcasting": true, "nodes": ["master", "CLUSTER-Workload_benchmarks_metrics_B283_manager_15", "CLUSTER-Workload_benchmarks_metrics_B283_manager_1", "CLUSTER-Workload_benchmarks_metrics_B283_manager_10", "CLUSTER-Workload_benchmarks_metrics_B283_manager_25", "CLUSTER-Workload_benchmarks_metrics_B283_manager_13", 
"CLUSTER-Workload_benchmarks_metrics_B283_manager_2", "CLUSTER-Workload_benchmarks_metrics_B283_manager_19", "CLUSTER-Workload_benchmarks_metrics_B283_manager_11", "CLUSTER-Workload_benchmarks_metrics_B283_manager_6", "CLUSTER-Workload_benchmarks_metrics_B283_manager_20", "CLUSTER-Workload_benchmarks_metrics_B283_manager_8", "CLUSTER-Workload_benchmarks_metrics_B283_manager_4", "CLUSTER-Workload_benchmarks_metrics_B283_manager_7", "CLUSTER-Workload_benchmarks_metrics_B283_manager_9", "CLUSTER-Workload_benchmarks_metrics_B283_manager_22", "CLUSTER-Workload_benchmarks_metrics_B283_manager_12", "CLUSTER-Workload_benchmarks_metrics_B283_manager_16", "CLUSTER-Workload_benchmarks_metrics_B283_manager_17", "CLUSTER-Workload_benchmarks_metrics_B283_manager_14", "CLUSTER-Workload_benchmarks_metrics_B283_manager_24", "CLUSTER-Workload_benchmarks_metrics_B283_manager_23", "CLUSTER-Workload_benchmarks_metrics_B283_manager_5", "CLUSTER-Workload_benchmarks_metrics_B283_manager_18", "CLUSTER-Workload_benchmarks_metrics_B283_manager_21", "CLUSTER-Workload_benchmarks_metrics_B283_manager_3"], "api_timeout": 10}'
2023/08/14 19:43:54 INFO: [Worker CLUSTER-Workload_benchmarks_metrics_B283_manager_23] [D API] Receiving request: restart from master (124633)
2023/08/14 19:43:54 DEBUG: [Worker CLUSTER-Workload_benchmarks_metrics_B283_manager_23] [D API] Receiving parameters {}
2023/08/14 19:43:54 DEBUG: [Worker CLUSTER-Workload_benchmarks_metrics_B283_manager_23] [D API] Starting to execute request locally
2023/08/14 19:43:54 DEBUG: [Worker CLUSTER-Workload_benchmarks_metrics_B283_manager_23] [D API] Finished executing request locally
2023/08/14 19:43:54 DEBUG: [Worker CLUSTER-Workload_benchmarks_metrics_B283_manager_23] [D API] Time calculating request result: 0.024s
2023/08/14 19:43:55 DEBUG: [Worker CLUSTER-Workload_benchmarks_metrics_B283_manager_23] [Main] Command received: 'b'syn_m_a_e''
2023/08/14 19:43:55 INFO: [Worker CLUSTER-Workload_benchmarks_metrics_B283_manager_23] [Agent-info sync] Finished in 5.318s. Updated 29 chunks.
2023/08/14 19:43:58 DEBUG: [Worker CLUSTER-Workload_benchmarks_metrics_B283_manager_23] [Integrity check] Permission to synchronize granted.
2023/08/14 19:43:58 INFO: [Worker CLUSTER-Workload_benchmarks_metrics_B283_manager_23] [Integrity check] Starting.
2023/08/14 19:43:58 DEBUG: [Worker CLUSTER-Workload_benchmarks_metrics_B283_manager_23] [Integrity check] Compressing 'files_metadata.json' of 34 files.
2023/08/14 19:43:58 DEBUG: [Worker CLUSTER-Workload_benchmarks_metrics_B283_manager_23] [Integrity check] Sending zip file to master.
2023/08/14 19:43:58 DEBUG: [Worker CLUSTER-Workload_benchmarks_metrics_B283_manager_23] [Integrity check] Zip file sent to master.
2023/08/14 19:43:58 DEBUG: [Worker CLUSTER-Workload_benchmarks_metrics_B283_manager_23] [Main] Command received: 'b'syn_m_c_ok''
2023/08/14 19:43:58 INFO: [Worker CLUSTER-Workload_benchmarks_metrics_B283_manager_23] [Integrity check] Finished in 0.051s. Sync not required.
2023/08/14 19:43:59 INFO: [Worker CLUSTER-Workload_benchmarks_metrics_B283_manager_23] [Main] The master closed the connection
2023/08/14 19:43:59 INFO: [Cluster] [Main] SIGNAL [(15)-(SIGTERM)] received. Exit...
2023/08/14 19:44:18 DEBUG: [Cluster] [Main] Removing '/var/ossec/queue/cluster/'.
2023/08/14 19:44:18 DEBUG: [Cluster] [Main] Removed '/var/ossec/queue/cluster/'.
2023/08/14 19:44:19 INFO: [Local Server] [Main] Serving on /var/ossec/queue/cluster/c-internal.sock
2023/08/14 19:44:19 DEBUG: [Local Server] [Keep alive] Calculating.
2023/08/14 19:44:19 DEBUG: [Local Server] [Keep alive] Calculated.
```
```
(wqa) ➜ wazuh-qa git:(fix/4385-unstable-realiability-tests) ✗ pytest -s --artifacts_path=/home/nstefani/Downloads/283/ tests/reliability/test_cluster/test_cluster_logs/test_cluster_worker_logs_order/test_cluster_worker_logs_order.py
================================================================================================ test session starts ================================================================================================
platform linux -- Python 3.9.16, pytest-7.1.2, pluggy-0.13.1
rootdir: /home/nstefani/git/wazuh-qa/tests, configfile: pytest.ini
plugins: html-3.1.1, testinfra-5.0.0, metadata-2.0.4
collected 1 item
tests/reliability/test_cluster/test_cluster_logs/test_cluster_worker_logs_order/test_cluster_worker_logs_order.py .
================================================================================================= 1 passed in 0.05s =================================================================================================
```
The next step is to generalize this solution to cover more cases.
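One way to generalize it, stated as an assumption rather than the final implementation, is to split each worker's cluster.log into sections delimited by the disconnection/restart markers seen in the sample above, and then validate the task order inside each section with a fresh state:

```python
# The marker strings come from the log sample above; the real tests may need a
# more complete list (connection retries, other signals, etc.).
RESTART_MARKERS = (
    'The master closed the connection',
    'SIGNAL [(15)-(SIGTERM)] received',
)


def split_into_sections(log_lines):
    """Yield groups of log lines delimited by disconnection/restart markers."""
    section = []
    for line in log_lines:
        if any(marker in line for marker in RESTART_MARKERS):
            if section:
                yield section
            section = []
            continue
        section.append(line)
    if section:
        yield section


# Usage idea: check each section independently, so a task that was cut short by
# a restart does not make the following sections fail.
# for section in split_into_sections(open('cluster.log')):
#     check_task_order(section)  # hypothetical per-section check
```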
Update
I've been working on the fix for test_cluster_connection. I still need to investigate the behavior of test_check_logs_order_workers with the changes introduced in https://github.com/wazuh/wazuh/pull/19888.
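As for test_cluster_connection, the approach I am evaluating (an assumption on my side; the tolerance window and the data shapes below are illustrative) is to only fail on disconnections that are not explained by an intentional restart of the same node close in time:

```python
from datetime import datetime, timedelta

# Illustrative tolerance; the real value should come from the test configuration.
RESTART_WINDOW = timedelta(minutes=2)


def unexpected_disconnections(disconnections, restarts):
    """Both arguments map node names to lists of datetimes parsed from cluster.log."""
    result = {}
    for node, times in disconnections.items():
        node_restarts = restarts.get(node, [])
        unexplained = [t for t in times
                       if not any(abs(t - r) <= RESTART_WINDOW for r in node_restarts)]
        if unexplained:
            result[node] = unexplained
    return result


# A disconnection that coincides with a restart of the same worker is tolerated:
disc = {'worker_23': [datetime(2023, 8, 14, 19, 43, 59)]}
rest = {'worker_23': [datetime(2023, 8, 14, 19, 43, 59)]}
assert unexpected_disconnections(disc, rest) == {}
```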
Update
I've been working on the requested changes.