opensearch-benchmark [BUG] Benchmark tool got stuck without any error message

Describe the bug Benchmark tool got stuck without any error message when it runs against custom workload

To Reproduce Change http_logs/operations.default.json file as follow

    {
      "name": "create-http-log-geoip-pipeline",
      "operation-type": "put-pipeline",
      "id": "http-log-geoip-pipeline",
      "body": {
        "description": "Enrich the data with the geo-ip filter. Requires --opensearch-plugins='ingest-geoip'",
        "processors": [
          {
            "ip2geo": {
               "field": "clientip",
               "target_field": "geoip",
               "datasource": "my-datasource",
               "properties": [
                  "city_name",
                  "country_name",
                  "location"
                ]
            }
          }
        ]
      }
    },

Run

opensearch-benchmark execute-test \
--target-hosts localhost:9200 \
--workload=http_logs \
--test-procedure=append-index-only-with-ingest-pipeline \
--workload-params="ingest_pipeline:'geoip'" \
--pipeline=benchmark-only

Expected behavior Throw an error

Logs

2023-05-30 17:47:18,991 ActorAddr-(T|:36117)/PID:67597 osbenchmark.worker_coordinator.worker_coordinator INFO Total run duration: 0.035839 seconds.
2023-05-30 17:47:18,992 ActorAddr-(T|:36117)/PID:67597 osbenchmark.worker_coordinator.worker_coordinator INFO Total time to shutdown asyncgens: 0.000273 seconds.
2023-05-30 17:47:18,992 ActorAddr-(T|:36117)/PID:67597 osbenchmark.worker_coordinator.worker_coordinator INFO Total time to close transports: -0.000679 seconds.
2023-05-30 17:47:23,959 ActorAddr-(T|:36117)/PID:67597 osbenchmark.actor INFO Worker[0] is ready for the next task.
2023-05-30 17:47:23,959 ActorAddr-(T|:36117)/PID:67597 osbenchmark.actor INFO Worker[0] reached join point at index [18].
2023-05-30 17:47:23,960 ActorAddr-(T|:33151)/PID:67578 osbenchmark.worker_coordinator.worker_coordinator INFO [8/8] workers reached join point [9/9].
2023-05-30 17:47:23,960 ActorAddr-(T|:33151)/PID:67578 osbenchmark.worker_coordinator.worker_coordinator INFO All workers completed their tasks until join point [9/9].
2023-05-30 17:47:23,967 ActorAddr-(T|:33151)/PID:67578 osbenchmark.telemetry INFO Gathering indices stats for all primaries on benchmark stop.
2023-05-30 17:47:23,975 ActorAddr-(T|:33151)/PID:67578 osbenchmark.worker_coordinator.worker_coordinator INFO All steps completed.
2023-05-30 17:47:23,976 ActorAddr-(T|:33151)/PID:67578 osbenchmark.metrics INFO Closing metrics store.

More Context (please complete the following information):

Workload: workload.zip
Service: OpenSearch
Version: 3.x

Additional context Add any other context about the problem here.

May 30 '23 18:05 heemin32

Trying this out, OSB did not get "stuck", but reported an 100% error rate. Is this different from your test run?

Also, please indicate the type of error message you would expect.

Running delete-index                                                           [100% done]
Running create-index                                                           [100% done]
Running check-cluster-health                                                   [100% done]
Running create-http-log-geoip-pipeline                                         [100% done]
Running index-append-with-ingest-geoip-pipeline                                [100% done]
Running refresh-after-index                                                    [100% done]
Running force-merge                                                            [100% done]
Running refresh-after-force-merge                                              [100% done]
Running wait-until-merges-finish                                               [100% done]

------------------------------------------------------
    _______             __   _____
   / ____(_)___  ____ _/ /  / ___/_________  ________
  / /_  / / __ \/ __ `/ /   \__ \/ ___/ __ \/ ___/ _ \
 / __/ / / / / / /_/ / /   ___/ / /__/ /_/ / /  /  __/
/_/   /_/_/ /_/\__,_/_/   /____/\___/\____/_/   \___/
------------------------------------------------------

|                                                         Metric |                                    Task |       Value |   Unit |
|---------------------------------------------------------------:|----------------------------------------:|------------:|-------:|
|                     Cumulative indexing time of primary shards |                                         |      0.0038 |    min |
|             Min cumulative indexing time across primary shards |                                         |           0 |    min |
|          Median cumulative indexing time across primary shards |                                         |           0 |    min |
|             Max cumulative indexing time across primary shards |                                         |     0.00155 |    min |
|            Cumulative indexing throttle time of primary shards |                                         |           0 |    min |
|    Min cumulative indexing throttle time across primary shards |                                         |           0 |    min |
| Median cumulative indexing throttle time across primary shards |                                         |           0 |    min |
|    Max cumulative indexing throttle time across primary shards |                                         |           0 |    min |
|                        Cumulative merge time of primary shards |                                         |           0 |    min |
|                       Cumulative merge count of primary shards |                                         |           0 |        |
|                Min cumulative merge time across primary shards |                                         |           0 |    min |
|             Median cumulative merge time across primary shards |                                         |           0 |    min |
|                Max cumulative merge time across primary shards |                                         |           0 |    min |
|               Cumulative merge throttle time of primary shards |                                         |           0 |    min |
|       Min cumulative merge throttle time across primary shards |                                         |           0 |    min |
|    Median cumulative merge throttle time across primary shards |                                         |           0 |    min |
|       Max cumulative merge throttle time across primary shards |                                         |           0 |    min |
|                      Cumulative refresh time of primary shards |                                         |  0.00128333 |    min |
|                     Cumulative refresh count of primary shards |                                         |         898 |        |
|              Min cumulative refresh time across primary shards |                                         |           0 |    min |
|           Median cumulative refresh time across primary shards |                                         |           0 |    min |
|              Max cumulative refresh time across primary shards |                                         |     0.00035 |    min |
|                        Cumulative flush time of primary shards |                                         |           0 |    min |
|                       Cumulative flush count of primary shards |                                         |          12 |        |
|                Min cumulative flush time across primary shards |                                         |           0 |    min |
|             Median cumulative flush time across primary shards |                                         |           0 |    min |
|                Max cumulative flush time across primary shards |                                         |           0 |    min |
|                                        Total Young Gen GC time |                                         |       0.027 |      s |
|                                       Total Young Gen GC count |                                         |           1 |        |
|                                          Total Old Gen GC time |                                         |           0 |      s |
|                                         Total Old Gen GC count |                                         |           0 |        |
|                                                     Store size |                                         |     90.6359 |     GB |
|                                                  Translog size |                                         | 2.76603e-06 |     GB |
|                                         Heap used for segments |                                         |           0 |     MB |
|                                       Heap used for doc values |                                         |           0 |     MB |
|                                            Heap used for terms |                                         |           0 |     MB |
|                                            Heap used for norms |                                         |           0 |     MB |
|                                           Heap used for points |                                         |           0 |     MB |
|                                    Heap used for stored fields |                                         |           0 |     MB |
|                                                  Segment count |                                         |         163 |        |
|                                       100th percentile latency |          create-http-log-geoip-pipeline |     18.8267 |     ms |
|                                  100th percentile service time |          create-http-log-geoip-pipeline |     18.8267 |     ms |
|                                                     error rate |          create-http-log-geoip-pipeline |         100 |      % |
|                                                 Min Throughput | index-append-with-ingest-geoip-pipeline |     12874.7 | docs/s |
|                                                Mean Throughput | index-append-with-ingest-geoip-pipeline |     12874.7 | docs/s |
|                                              Median Throughput | index-append-with-ingest-geoip-pipeline |     12874.7 | docs/s |
|                                                 Max Throughput | index-append-with-ingest-geoip-pipeline |     12874.7 | docs/s |
|                                        50th percentile latency | index-append-with-ingest-geoip-pipeline |     27.8619 |     ms |
|                                        90th percentile latency | index-append-with-ingest-geoip-pipeline |     59.1429 |     ms |
|                                       100th percentile latency | index-append-with-ingest-geoip-pipeline |      96.267 |     ms |
|                                   50th percentile service time | index-append-with-ingest-geoip-pipeline |     27.8619 |     ms |
|                                   90th percentile service time | index-append-with-ingest-geoip-pipeline |     59.1429 |     ms |
|                                  100th percentile service time | index-append-with-ingest-geoip-pipeline |      96.267 |     ms |
|                                                     error rate | index-append-with-ingest-geoip-pipeline |         100 |      % |
|                                                 Min Throughput |                wait-until-merges-finish |       12.14 |  ops/s |
|                                                Mean Throughput |                wait-until-merges-finish |       12.14 |  ops/s |
|                                              Median Throughput |                wait-until-merges-finish |       12.14 |  ops/s |
|                                                 Max Throughput |                wait-until-merges-finish |       12.14 |  ops/s |
|                                       100th percentile latency |                wait-until-merges-finish |     34.7581 |     ms |
|                                  100th percentile service time |                wait-until-merges-finish |     34.7581 |     ms |
|                                                     error rate |                wait-until-merges-finish |           0 |      % |

[WARNING] Error rate is 100.0 for operation 'create-http-log-geoip-pipeline'. Please check the logs.
[WARNING] No throughput metrics available for [create-http-log-geoip-pipeline]. Likely cause: Error rate is 100.0%. Please check the logs.
[WARNING] Error rate is 100.0 for operation 'index-append-with-ingest-geoip-pipeline'. Please check the logs.

Jun 01 '23 21:06 gkamat

That looks okay. I ran opensearch cluster and benchmark tool in the same machine and it never generated such report.

Jun 01 '23 21:06 heemin32

@heemin32 Have you tried running this same setup with an external cluster (on a separate machine) instead of a local cluster? It may help us pinpoint the issue. You could leverage this repository to quickly provision an vanilla OpenSearch cluster.

Let us know if you were able to successfully do this.

Jun 08 '23 21:06 IanHoang

~I think, if might happen when OpenSearch cluster does not respond to some of requests made by benchmark tool. Then, the benchmark tool wait for the response forever instead of timing out. Timing out on request might help to avoid the benchmark tool from getting stuck?~

I ran the benchmark tool against an external cluster.

In the log, each worker continued their work and finished but the reporting agent got stuck somewhere around 30% and never generated a report.

For example, I could see from the log that this four workers completed and I saw the percent to reach benchmark.log

2023-06-09 20:51:15,189 ActorAddr-(T|:37769)/PID:89522 osbenchmark.actor DEBUG Worker[2] is executing [index-append-with-ingest-geoip-pipeline] (99.03% complete).
2023-06-09 20:51:15,231 ActorAddr-(T|:39675)/PID:89520 osbenchmark.actor DEBUG Worker[0] is executing [index-append-with-ingest-geoip-pipeline] (99.61% complete).
2023-06-09 20:51:15,330 ActorAddr-(T|:36595)/PID:89521 osbenchmark.actor DEBUG Worker[1] is executing [index-append-with-ingest-geoip-pipeline] (99.13% complete).
2023-06-09 20:51:15,353 ActorAddr-(T|:39307)/PID:89523 osbenchmark.actor DEBUG Worker[3] is executing [index-append-with-ingest-geoip-pipeline] (99.51% complete).

console

Running index-append-with-ingest-geoip-pipeline                                [ 5Running index-append-with-ingest-geoip-pipeline                                [ 30% done]
Running index-append-with-ingest-geoip-pipeline                                [ 5Running index-append-with-ingest-geoip-pipeline                                [ 30% done]

Jun 08 '23 22:06 heemin32

Trying this out, OSB did not get "stuck", but reported an 100% error rate. Is this different from your test run?

You need to run without --test-mode parameter to see it gets stuck.

Jun 09 '23 20:06 heemin32

@gkamat @IanHoang see if we can reproduce and if issue still exists.

Oct 01 '24 20:10 IanHoang

opensearch-benchmark opensearch-benchmark copied to clipboard

[BUG] Benchmark tool got stuck without any error message

opensearch-benchmark
opensearch-benchmark copied to clipboard