Performance for Vulnerability Detection module in clustered environments

Open Rebits opened this issue 9 months ago • 6 comments

Description

This issue is dedicated to conducting a thorough performance analysis of two proposed development approaches:

  • @wazuh/devel-framework: https://github.com/wazuh/wazuh/issues/23058
  • @wazuh/devel-core2 development: https://github.com/wazuh/wazuh/issues/22867

The objective is to perform performance tests and compare the results of both approaches. This comparative analysis will provide a comprehensive understanding of the potential impact on the product.

Test environment

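| Component | Quantity | Operating System | CPU (cores) | RAM (GB) | Disk (GB) |
|---|---|---|---|---|---|
| Master | 1 | Ubuntu 22 | 4 | 8 | 50 |
| Workers | 2 | Ubuntu 22 | 4 | 8 | 50 |
| Agent 1 | 1 | Ubuntu 22 | 2 | 4 | 30 |
| Agent 2 | 1 | Windows 11 | 2 | 4 | 30 |
| Load Balancer | 1 | Ubuntu 22 | 4 | 8 | 50 |
| Indexers | 2 | Ubuntu 22 | 2 | 4 | 30 |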

[!NOTE] The load balancer is located on the master node.

23058 Development Packages

| Architecture | Framework development package URL |
|---|---|
| DEB | 4.8.0-python.vd.spike.deb.1 |
| RPM | 4.8.0-python.vd.spike.rpm.1 |

22867 Development Packages

| Architecture | Core development package URL |
|---|---|
| DEB | 4.8.0-0.commitd31b277 |
| RPM | 4.8.0-0.commitd31b277 |

Test Cases

Testing

Automatic

Methodology

The CLUSTER-Workload_benchmarks_metrics pipeline will execute the specified test cases automatically. Results will be manually analyzed and shared with the development team for validation and adjustments.

Test Cases

| Case | Description | Number of Agents | EPS | Frequency | Number of Vulnerable Packages | Time |
|---|---|---|---|---|---|---|
| Minimum Activity | Simulate a small, stable environment with low activity | 10 | 10 | 600 | 100 | 3h |
| Medium Activity | Simulate a medium-sized environment with moderate activity | 50 | 10 | 300 | 100 | 3h |
| High Activity | Simulate a large-scale environment with significant activity | 200 | 50 | 60 | 100 | 3h |
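
For reference, the aggregate load implied by each automatic case can be estimated with a few lines of Python. This is only a sketch: it assumes the EPS column is a per-agent rate and that the load balancer spreads agents evenly across the two workers. The `WorkloadCase` class and its fields are hypothetical names, not part of the pipeline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkloadCase:
    """One row of the automatic test case table (hypothetical helper)."""
    name: str
    agents: int
    eps_per_agent: int  # assumption: the EPS column is a per-agent rate

    @property
    def total_eps(self) -> int:
        # Aggregate events per second reaching the cluster
        return self.agents * self.eps_per_agent

    def eps_per_worker(self, workers: int = 2) -> float:
        # Assumption: agents are distributed evenly across the workers
        return self.total_eps / workers


CASES = [
    WorkloadCase("Minimum Activity", agents=10, eps_per_agent=10),
    WorkloadCase("Medium Activity", agents=50, eps_per_agent=10),
    WorkloadCase("High Activity", agents=200, eps_per_agent=50),
]

for case in CASES:
    print(f"{case.name}: {case.total_eps} EPS total, "
          f"{case.eps_per_worker():.0f} EPS per worker")
```

Under these assumptions the Medium Activity case works out to 500 events per second for the whole cluster.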

Manual

Methodology

Customizing the set of vulnerable packages is not feasible in automatic testing. Therefore, manual testing will utilize a larger set of 10,000 vulnerabilities to identify any potential instability in environments with a high vulnerability count. The following Wazuh-QA tools will be employed for manual performance analysis:

  • Monitor class for resource measurement of Wazuh central components
  • Statistics class for Wazuh data analysis
  • Simulate agents script for Wazuh agent simulation

Test Cases

| Case | Description | Number of Agents | EPS | Frequency | Number of Vulnerable Packages | Time |
|---|---|---|---|---|---|---|
| High Vulnerability Environment | Simulate an intermediate-sized environment with high vulnerability | 10 | 10 | 60 | 10,000 | 3h |

Conclusion :red_circle:

New Issues

  • https://github.com/wazuh/wazuh-jenkins/issues/6474
  • https://github.com/wazuh/wazuh-jenkins/issues/6473
  • https://github.com/wazuh/wazuh-jenkins/issues/6203
  • https://github.com/wazuh/wazuh-jenkins/issues/6475
  • https://github.com/wazuh/wazuh/issues/23202
  • https://github.com/wazuh/wazuh/issues/22847

Known issues

  • https://github.com/wazuh/wazuh-jenkins/issues/6203

[!NOTE] Manual performance testing, Minimum Activity, and High Activity have not been performed. More information in https://github.com/wazuh/wazuh-qa/issues/5313#issuecomment-2100349272

Rebits avatar Apr 30 '24 09:04 Rebits

Automatic

  • Minimum Activity: https://ci.wazuh.info/job/CLUSTER-Workload_benchmarks_metrics/510/
  • Medium Activity: https://ci.wazuh.info/job/CLUSTER-Workload_benchmarks_metrics/511/
  • High Activity: https://ci.wazuh.info/job/CLUSTER-Workload_benchmarks_metrics/512/

Rebits avatar May 06 '24 17:05 Rebits

The Minimum Activity and High Activity performance tests failed with a "No space left on device" error. Reported in https://github.com/wazuh/wazuh-jenkins/issues/6475

22:03:52  
22:03:52  TASK [Copy ossec.log file to data files] ***************************************
22:03:52  fatal: [CLUSTER-Workload_benchmarks_metrics_B510_manager_2]: UNREACHABLE! => {
22:03:52      "changed": false,
22:03:52      "unreachable": true
22:03:52  }
22:03:52  
22:03:52  MSG:
22:03:52  
22:03:52  Warning: Permanently added '172.31.3.110' (ECDSA) to the list of known hosts.

22:03:52  mkdir: cannot create directory ‘/tmp/ansible-tmp-1715115832.7137516-30912-167679972105845’: No space left on device
22:03:52  
22:03:53  fatal: [CLUSTER-Workload_benchmarks_metrics_B510_manager_1]: UNREACHABLE! => {
22:03:53      "changed": false,
22:03:53      "unreachable": true
22:03:53  }
22:03:53  
22:03:53  MSG:
22:03:53  
22:03:53  Warning: Permanently added '172.31.4.31' (ECDSA) to the list of known hosts.

22:03:53  mkdir: cannot create directory ‘/tmp/ansible-tmp-1715115832.724964-30911-242038256013694’: No space left on device
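
A pre-flight check along these lines could catch the condition before the evidence-collection tasks run. It is a minimal sketch, not something the pipeline currently does: the `has_free_space` helper and the 1 GiB threshold are assumptions chosen for illustration.

```python
import shutil


def has_free_space(path: str = "/tmp", min_free_bytes: int = 1024 ** 3) -> bool:
    """Return True if the filesystem holding `path` has at least
    `min_free_bytes` available (hypothetical helper, 1 GiB by default)."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free
    return usage.free >= min_free_bytes


if __name__ == "__main__":
    # "." is used here so the example runs on any platform
    if not has_free_space(".", min_free_bytes=1024 ** 3):
        print("Not enough free space, aborting before collecting evidence")
```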

Only the Medium Activity performance tests finished successfully. Build: https://ci.wazuh.info/job/CLUSTER-Workload_benchmarks_metrics/511/

Rebits avatar May 08 '24 08:05 Rebits

Medium Activity :red_circle:

Build: https://ci.wazuh.info/job/CLUSTER-Workload_benchmarks_metrics/511/ Report: Artifact.zip

Logs :red_circle:

Summary

  • Worker logs indicate the same database error reported in https://github.com/wazuh/wazuh/issues/22847
  • No errors present in the master node
  • No errors present in the indexer nodes

Master :yellow_circle:

  • The master node was started before the correct indexer configuration was set. This is expected:
2024/05/07 21:14:30 indexer-connector: WARNING: No username and password found in the keystore, using default values.
2024/05/07 21:14:30 indexer-connector: WARNING: IndexerConnector initialization failed for index 'wazuh-states-vulnerabilities', retrying until the connection is successful.
2024/05/07 21:16:52 indexer-connector: WARNING: Failed to sync agent '000' with the indexer.

Worker 1 :red_circle:

  • The worker node was started before the correct indexer configuration was set. This is expected:
2024/05/07 21:14:30 indexer-connector: WARNING: No username and password found in the keystore, using default values.
2024/05/07 21:14:30 indexer-connector: WARNING: IndexerConnector initialization failed for index 'wazuh-states-vulnerabilities', retrying until the connection is successful.
2024/05/07 21:16:52 indexer-connector: WARNING: Failed to sync agent '000' with the indexer.
  • Multiple database errors reported in https://github.com/wazuh/wazuh/issues/22847
2024/05/07 21:24:24 wazuh-remoted: INFO: (1409): Authentication file changed. Updating.
2024/05/07 21:24:24 wazuh-remoted: INFO: (1410): Reading authentication keys file.
2024/05/07 21:24:48 wazuh-db: ERROR: DB(004) sqlite3_prepare_v2() : no such table: sys_osinfo
2024/05/07 21:24:48 wazuh-db: ERROR: (5214): Null statement on internal cache.
2024/05/07 21:24:48 wazuh-db: ERROR: DB(004) sqlite3_prepare_v2() : no such table: sys_programs
2024/05/07 21:24:48 wazuh-db: ERROR: (5214): Null statement on internal cache.
2024/05/07 21:24:48 wazuh-db: ERROR: DB(004) sqlite3_prepare_v2() : no such table: sys_programs
2024/05/07 21:24:48 wazuh-db: ERROR: (5214): Null statement on internal cache.
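
To triage logs like these faster, the lines can be grouped by daemon and level. The sketch below assumes the `date time daemon: LEVEL: message` layout seen in the excerpts above; the regular expression and the `count_by_daemon_and_level` function are illustrative, not an existing Wazuh-QA tool.

```python
import re
from collections import Counter

# Assumed layout: "YYYY/MM/DD HH:MM:SS daemon: LEVEL: message"
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<daemon>[\w-]+): (?P<level>[A-Z]+): (?P<msg>.*)$"
)


def count_by_daemon_and_level(lines):
    """Count log lines per (daemon, level); lines that do not match are skipped."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:
            counts[(match["daemon"], match["level"])] += 1
    return counts


SAMPLE = [
    "2024/05/07 21:24:24 wazuh-remoted: INFO: (1409): Authentication file changed. Updating.",
    "2024/05/07 21:24:48 wazuh-db: ERROR: DB(004) sqlite3_prepare_v2() : no such table: sys_osinfo",
    "2024/05/07 21:24:48 wazuh-db: ERROR: (5214): Null statement on internal cache.",
    "2024/05/07 21:24:48 wazuh-db: ERROR: DB(004) sqlite3_prepare_v2() : no such table: sys_programs",
]

for (daemon, level), total in sorted(count_by_daemon_and_level(SAMPLE).items()):
    print(f"{daemon} {level}: {total}")
```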

Worker 2 :yellow_circle:

  • The worker node was started before the correct indexer configuration was set. This is expected:
2024/05/07 21:14:30 indexer-connector: WARNING: No username and password found in the keystore, using default values.
2024/05/07 21:14:30 indexer-connector: WARNING: IndexerConnector initialization failed for index 'wazuh-states-vulnerabilities', retrying until the connection is successful.
2024/05/07 21:16:52 indexer-connector: WARNING: Failed to sync agent '000' with the indexer.

Indexer 1 :green_circle:

No warnings or errors

Indexer 2 :green_circle:

No warnings or errors


Metrics :red_circle:

Summary

  • Low resource usage in the master node
  • Possible file descriptor leaks. Reported in https://github.com/wazuh/wazuh/issues/23202
  • Worker nodes show high CPU and memory usage. This is unsurprising: the test generates an unrealistic level of activity, with an expected influx of 500 syscollector messages per second in a two-node cluster environment
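
Regarding the possible file descriptor leak, one simple way to flag it automatically is to fit a line to the FD samples and check whether the slope stays positive. This is a rough sketch under assumptions: samples are taken at a fixed interval, and the `min_slope` threshold of 0.1 descriptors per sample is arbitrary.

```python
def slope(samples):
    """Least-squares slope of `samples` against their index (0, 1, 2, ...)."""
    n = len(samples)
    if n < 2:
        raise ValueError("at least two samples are required")
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    numerator = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(samples))
    denominator = sum((i - mean_x) ** 2 for i in range(n))
    return numerator / denominator


def looks_like_fd_leak(fd_samples, min_slope=0.1):
    # Assumption: a sustained upward trend in open descriptors suggests a leak;
    # the threshold is illustrative and would need tuning per daemon.
    return slope(fd_samples) > min_slope


print(looks_like_fd_leak([10, 12, 14, 16]))      # steadily growing series
print(looks_like_fd_leak([20, 21, 20, 21, 20]))  # stable series
```

A trend check like this only hints at a leak; a daemon that legitimately opens more files as agents connect would also show a positive slope during ramp-up.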

Master :green_circle:

Metrics

[Plots: CPU, Disk_Read, Disk_Read_Speed, Disk_Write_Speed, Disk_Written, FD, PSS, Read_Ops, RSS, SWAP, USS, VMS, Write_Ops]

Worker 1 :red_circle:

Metrics

[Plots: CPU, Disk_Read, Disk_Read_Speed, Disk_Write_Speed, Disk_Written, FD, PSS, Read_Ops, RSS, SWAP, USS, VMS, Write_Ops]

Worker 2 :red_circle:

Metrics

[Plots: CPU, Read_Ops, RSS, SWAP, USS, VMS, Write_Ops, Disk_Read, Disk_Read_Speed, Disk_Write_Speed, Disk_Written, FD, PSS]

Indexer 1 :green_circle:

No abnormal behavior detected

Metrics

[Plots: CPU, Disk_Read, Disk_Read_Speed, Disk_Write_Speed, Disk_Written, FD, PSS, Read_Ops, RSS, SWAP, USS, VMS, Write_Ops]

Indexer 2 :green_circle:

No abnormal behavior detected

Metrics

[Plots: CPU, Disk_Read, Disk_Read_Speed, Disk_Write_Speed, Disk_Written, FD, PSS, Read_Ops, RSS, SWAP, USS, VMS, Write_Ops]


Statistics :green_circle:

Vulnerabilities State :green_circle:

The vulnerability generator module, used by the simulate agents script, is designed to send 100 vulnerable packages to the manager and then confirm their removal. In the plot this appears as a wave pattern that peaks on each repetition, once all vulnerabilities have been processed.

The plot shows that the values reported by the indexer connector do not match the ideal expected curve. The simulator itself, however, is behaving as intended.

[Plot: total_vulnerabilities]

It would be useful to add checks that verify whether the number of indexed vulnerabilities matches the expected value at specific points during the test.
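
One possible shape for such a check is sketched below. It rests on an assumption the test data does not confirm: that every vulnerable package on every agent yields exactly one indexed vulnerability, so the peak should approach agents × packages. The function names and the 5% tolerance are hypothetical.

```python
def expected_peak(agents: int, vulnerable_packages: int) -> int:
    # Assumption: one indexed vulnerability per vulnerable package per agent
    return agents * vulnerable_packages


def peak_matches(observed, agents, vulnerable_packages, tolerance=0.05):
    """Return True if the highest observed count is within `tolerance`
    (as a fraction) of the expected peak."""
    target = expected_peak(agents, vulnerable_packages)
    return abs(max(observed) - target) <= tolerance * target


# Medium Activity: 50 agents, 100 vulnerable packages each
samples = [0, 2500, 4900, 0]  # made-up values, for illustration only
print(peak_matches(samples, agents=50, vulnerable_packages=100))
```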


Alerts :green_circle:

The alerts generated by the workers and the manager are expected to match the indexed alert values. However, there is a discrepancy:

[Plot: combined_and_new_total_alerts]

Due to the high activity levels, some variance between the written alerts and indexed alerts is expected. However, it would be advantageous to incorporate testing methods to gradually mitigate this, thereby stabilizing the environment over time.
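
As a starting point, the gap between written and indexed alerts could be tracked as a fraction and checked for convergence. This is a sketch only: the function names are hypothetical, and it assumes both series are sampled at the same moments.

```python
def indexing_gap(written: int, indexed: int) -> float:
    """Fraction of written alerts that have not been indexed yet."""
    if written == 0:
        return 0.0
    return (written - indexed) / written


def gap_is_shrinking(written_series, indexed_series) -> bool:
    """True if the relative gap never grows between consecutive samples."""
    gaps = [indexing_gap(w, i) for w, i in zip(written_series, indexed_series)]
    return all(later <= earlier for earlier, later in zip(gaps, gaps[1:]))


# Made-up cumulative counts, for illustration only
written = [1000, 2000, 3000]
indexed = [800, 1800, 2900]
print(gap_is_shrinking(written, indexed))
```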


Evidence collection :red_circle:

The following errors were detected in the evidence-collection capabilities of the pipeline:

  • Vulnerabilities and alerts indexed metrics do not contain timestamps. Including the timestamp will make it easy to compare these values with the rest of the graphics. Reported in https://github.com/wazuh/wazuh-jenkins/issues/6474
  • Indexer statistics were present in the logcollector directory. Reported in https://github.com/wazuh/wazuh-jenkins/issues/6473
  • Statistics values for analysis are not correctly plotted. Reported in https://github.com/wazuh/wazuh-jenkins/issues/6203
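
For the missing timestamps, recording each indexed-metric sample together with a UTC timestamp would be enough to align it with the other plots. The sketch below uses only the standard library; the `append_sample` helper, the column order, and the metric name are assumptions, not the pipeline's actual format.

```python
import csv
import io
from datetime import datetime, timezone


def append_sample(writer, metric: str, value: int, now: datetime = None) -> None:
    """Write one row as: ISO 8601 UTC timestamp, metric name, value."""
    timestamp = (now or datetime.now(timezone.utc)).isoformat()
    writer.writerow([timestamp, metric, value])


buffer = io.StringIO()  # stands in for the real CSV file
writer = csv.writer(buffer)
writer.writerow(["timestamp", "metric", "value"])
append_sample(writer, "indexed_vulnerabilities", 4900)  # hypothetical metric name
print(buffer.getvalue())
```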

Rebits avatar May 08 '24 08:05 Rebits

Following a discussion with @juliamagan, we've decided not to re-run the failed High Activity and Minimum Activity performance tests. Instead, these tests will be re-launched in RC2.

Rebits avatar May 08 '24 11:05 Rebits

GJ, but the graphs of the indexer 1 metrics cannot be displayed, perhaps because of an error in writing the comment.

MARCOSD4 avatar May 08 '24 12:05 MARCOSD4

LGTM

MARCOSD4 avatar May 08 '24 13:05 MARCOSD4

LGTM

juliamagan avatar May 09 '24 09:05 juliamagan