
[BUG] OpenSearch Benchmark produces inconsistent results over time

alexmsu75 opened this issue 3 years ago

Describe the bug I'm seeing significantly different results in various measurements during consecutive runs.

To Reproduce

  • Set up an OpenSearch instance/cluster
  • Set up OpenSearch Benchmark
  • Create a workload parameters file, let's say workload-params-plain-niofs.json
  • Run a workload in a loop, let's say 10 iterations:

    export eshost=172.31.31.54:9200 && for j in plain-niofs; do for k in {0..9}; do for i in pmc http_logs; do opensearch-benchmark execute_test --workload $i --target-hosts=$eshost --workload-params=./workload-params-$j.json --results-format=csv --results-file=results-os124remote-20220516-v5-$j-$i-v$k.csv --pipeline=benchmark-only ; done ; done ; done

  • Collect the CSV results and compare them
  • Observe large (>50%) variations on some measurements

Expected behavior I expect to see consistent test results when testing the same setup.

Logs

  • compare-oc124remote-http_logs-niofs-20220508 - compared 10 iterations for the http_logs workload
  • compare-oc124remote-pmc-niofs-20220508 - compared 10 iterations for the pmc workload
  • results-oc124remote-baremetal-20220508.zip - original CSV results collected
  • workload-params-plain-hybridfs.json.txt - workload parameters file used, renamed from workload-params-plain-hybridfs.json
  • workload-params-plain-niofs.json.txt - workload parameters file used, renamed from workload-params-plain-niofs.json

More Context (please complete the following information):

  • Workload: pmc and http_logs
  • Service: OpenSearch 1.2.4
  • Version: 0.0.2

Additional context I observed similar behavior for both standalone and 3-node cluster OpenSearch installations. I tested OpenSearch Benchmark running locally and on a separate instance and observed the same discrepancy.

alexmsu75 avatar May 19 '22 01:05 alexmsu75

Thanks for sharing this issue @alexmsu75.

@treddeni-amazon do you know who would have an idea of what is happening? It doesn't seem like there should be this much variation over consecutive runs with the same configurations.

elfisher avatar May 19 '22 13:05 elfisher

Hi @alexmsu75 thanks for your interest in OpenSearch Benchmark!

I looked through the results that you provided and it seems that in general there are large discrepancies in the p100 (max) service_time/latency for query operations like 200s-in-range, 400s-in-range, range, etc. These should not be of particular concern because they only represent a single request which had high latency. This high latency was likely caused by an intermittent event on the cluster like a GC pause.

The latency metric is defined as the service_time metric plus any additional time that a request may have sat in the request queue prior to being sent in order to meet the target throughput for a given query; generally service_time and latency are nearly identical. This means that variance in service_time also shows up as a nearly identical variance in latency, effectively doubling the number of high-variance metrics.
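
For illustration, here is a minimal sketch of how the two metrics relate under a throughput-throttled schedule. The timestamps are made-up values, not OSB internals: a request that waits in the client-side queue before being sent accumulates that wait in latency but not in service_time.

    # Minimal sketch with made-up timestamps (seconds), not OSB internals.
    scheduled_at = 10.000   # when the request should be sent to meet the target throughput
    sent_at = 10.250        # when the client actually sent it (0.250 s spent queued)
    completed_at = 10.300   # when the response arrived

    service_time = completed_at - sent_at       # 0.050 s
    latency = completed_at - scheduled_at       # 0.300 s = queue wait + service_time
    print(f"service_time={service_time:.3f}s latency={latency:.3f}s")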

Additionally, wait-until-merges-finish operations show large variance, but each of these is only run once per iteration of the workload, so large variance is not necessarily surprising.

In the pmc test runs, p100 service_time/latency seems to be the only query metric that shows any significant variance.


There are a few queries for http_logs, like asc_sort_with_after_timestamp and desc_sort_timestamp, where there is high variance across all percentiles. These seem particularly odd to me since similar queries like desc_sort_with_after_timestamp and asc_sort_timestamp were also run and saw little variance outside of p100 service_time/latency.

This makes me wonder if there were other activities going on with the cluster during these operations. Resource-intensive activities like snapshots could result in increased latency. Are you aware of anything else that was going on with the cluster during this time @alexmsu75? It would also be worth getting more information on what kind of machines the OpenSearch test clusters were run on, along with any JVM or GC settings you might have been using.


It is expected to see some amount of variation between tests. Our typical practice is to run 5 total iterations of OpenSearch Benchmark and average the results of the last 3, treating the first two iterations as "warmups" to better simulate steady-state conditions. I've written more about that in OpenSearch#2461.
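
As a rough sketch of that practice against CSV results like the ones produced above (the file names below are hypothetical, and the Metric/Task/Value/Unit column layout is an assumption about the CSV results format; adjust both to match your actual output):

    # Sketch: treat the first 2 of 5 iterations as warmups and average the rest.
    # Assumes each results CSV has Metric, Task, Value and Unit columns -- verify
    # against your own files before relying on this.
    import pandas as pd

    files = [f"results-run-v{k}.csv" for k in range(5)]   # hypothetical file names
    last_three = [pd.read_csv(f) for f in files[2:]]      # skip v0 and v1 (warmups)

    combined = pd.concat(last_three)
    combined["Value"] = pd.to_numeric(combined["Value"], errors="coerce")  # drop non-numeric rows
    averaged = (combined
                .groupby(["Metric", "Task"], dropna=False)["Value"]
                .mean()
                .reset_index())
    print(averaged.to_string(index=False))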

travisbenedict avatar May 19 '22 15:05 travisbenedict

Hi @travisbenedict and thank you for the feedback,

I collected these specific results from the following setup:

  • single-instance OpenSearch 1.2.4 running on an AWS z1d.metal; 48 vCPU; 384 GB RAM; local NVMe storage; Ubuntu 20; OpenJDK 11.0.15 with a 64 GB heap; security is disabled
  • single-client OpenSearch Benchmark 0.0.2 on an AWS c6i.4large
  • no other load on these machines
  • OpenSearch installed from the tar archive and started with the following command:

    /opt/opensearch/dev/opensearch-1.2.4/bin/opensearch -Ecluster.name=opensearch-cluster -Enode.name=opensearch-node1 -Ehttp.host=0.0.0.0 -Ediscovery.type=single-node -Epath.data=/data/opensearch-1.2.4/data -Epath.logs=/data/opensearch-1.2.4/logs

I saw similar discrepancies on a very simple clustered setup and with the client/benchmark running on the same machine as the OpenSearch instance.

Please let me know if you have any additional questions about the setup, but I feel this issue can be reproduced at any time.

As for the discrepancies, here is what I see: for the http_logs workload, 80 out of the 267 measurements produced by the tool show >30% fluctuation; for the pmc workload, 37 out of 127 show a >30% difference.

I agree that single-run operations and p100/max values could be discarded, but the rest are still fluctuating. So, what am I doing wrong here?

alexmsu75 avatar May 19 '22 21:05 alexmsu75