
[BUG] Unable to run opensearch-benchmark [OSB] execute-test against OSB provisioned cluster due to failing health check

Open cgchinmay opened this issue 2 years ago • 5 comments

Describe the bug: opensearch-benchmark execute-test cannot run because the health check against the provisioned cluster never succeeds.

Tried running the opensearch-benchmark execute-test command as below:

opensearch-benchmark execute-test --pipeline=benchmark-only --workload=geonames --target-host=127.0.0.1:9200 --test-mode --kill-running-processes

The above command gets stuck, and the logs show failing health checks:

tail -f ~/.benchmark/logs/benchmark.log
2023-09-28 23:07:35,325 -not-actor-/PID:31934 opensearch WARNING GET http://127.0.0.1:9200/_cluster/health?wait_for_nodes=%3E%3D1 [status:503 request:30.013s]
2023-09-28 23:08:05,341 -not-actor-/PID:31934 opensearch WARNING GET http://127.0.0.1:9200/_cluster/health?wait_for_nodes=%3E%3D1 [status:503 request:30.015s]
2023-09-28 23:08:38,359 -not-actor-/PID:31934 opensearch WARNING GET http://127.0.0.1:9200/_cluster/health?wait_for_nodes=%3E%3D1 [status:503 request:30.017s]
2023-09-28 23:09:54,761 -not-actor-/PID:31934 opensearch WARNING GET http://127.0.0.1:9200/_cluster/health?wait_for_nodes=%3E%3D1 [status:503 request:76.402s]
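As an aside, the %3E%3D1 in those URLs is simply the URL-encoded form of >=1; OSB is asking the cluster to wait until at least one node has joined. A minimal Python sketch of the equivalent request URL (host and port taken from this issue; this is illustrative, not OSB's internal code):

```python
from urllib.parse import unquote, urlencode

# Build the same query string seen in the failing log lines.
params = urlencode({"wait_for_nodes": ">=1"})
url = f"http://127.0.0.1:9200/_cluster/health?{params}"
print(url)                 # http://127.0.0.1:9200/_cluster/health?wait_for_nodes=%3E%3D1
print(unquote("%3E%3D1"))  # >=1
```

So the repeated 503s mean the cluster never reached even a single discovered node from the health API's point of view.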

To Reproduce: Provision the cluster using the opensearch-benchmark PyPI utility:

opensearch-benchmark install --provision-config-instance=defaults --distribution-version=2.10.0 --node-name="osb-node-1" --network-host="127.0.0.1" --http-port=9200 --master-nodes="opensearch-node-1" --seed-hosts="127.0.0.1:9300" --quiet
{
  "installation-id": "7bedc677-82d4-48e8-866c-bf62250eca9d"
}

Started a single-node cluster using opensearch-benchmark:

opensearch-benchmark start --installation-id=7bedc677-82d4-48e8-866c-bf62250eca9d --test-execution-id=benchmark

Validated the cluster status:

curl localhost:9200
{
  "name" : "osb-node-1",
  "cluster_name" : "benchmark-provisioned-cluster",
  "cluster_uuid" : "_na_",
  "version" : {
    "distribution" : "opensearch",
    "number" : "2.10.0",
    "build_type" : "tar",
    "build_hash" : "eee49cb340edc6c4d489bcd9324dda571fc8dc03",
    "build_date" : "2023-09-20T23:54:29.889267151Z",
    "build_snapshot" : false,
    "lucene_version" : "9.7.0",
    "minimum_wire_compatibility_version" : "7.10.0",
    "minimum_index_compatibility_version" : "7.0.0"
  },
  "tagline" : "The OpenSearch Project: https://opensearch.org/"
}

However, when I try to access the cluster health endpoint, it returns an error with a 503 status code:

curl "localhost:9200/_cluster/health?pretty"
{
  "error" : {
    "root_cause" : [
      {
        "type" : "cluster_manager_not_discovered_exception",
        "reason" : null
      }
    ],
    "type" : "cluster_manager_not_discovered_exception",
    "reason" : null
  },
  "status" : 503
}

Expected behavior: The health status API should return success, and opensearch-benchmark execute-test should run successfully.

Logs

-not-actor-/PID:31923 osbenchmark.test_execution_orchestrator INFO Test Execution id [6ecd84cc-a9c1-448c-9349-1e75928755f0]
2023-09-28 23:04:02,180 -not-actor-/PID:31923 osbenchmark.test_execution_orchestrator INFO User specified pipeline [benchmark-only].
2023-09-28 23:04:02,181 -not-actor-/PID:31923 osbenchmark.test_execution_orchestrator INFO Using configured hosts [{'host': '127.0.0.1', 'port': 9200}]
2023-09-28 23:04:02,182 -not-actor-/PID:31923 osbenchmark.actor INFO Joining already running actor system with system base [multiprocTCPBase].
2023-09-28 23:04:02,186 ActorAddr-(T|:1900)/PID:31932 osbenchmark.actor INFO Capabilities [{'coordinator': True, 'ip': '127.0.0.1', 'Convention Address.IPv4': '127.0.0.1:1900', 'Thespian ActorSystem Name': 'multiprocTCPBase', 'Thespian ActorSystem Version': 2, 'Thespian Watch Supported': True, 'Python Version': (3, 8, 12, 'final', 0), 'Thespian Generation': (3, 10), 'Thespian Version': '1695942242144'}] match requirements [{'coordinator': True}].
2023-09-28 23:04:32,261 -not-actor-/PID:31934 opensearch WARNING GET http://127.0.0.1:9200/_cluster/health?wait_for_nodes=%3E%3D1 [status:503 request:30.012s]
2023-09-28 23:05:02,276 -not-actor-/PID:31934 opensearch WARNING GET http://127.0.0.1:9200/_cluster/health?wait_for_nodes=%3E%3D1 [status:503 request:30.014s]

More Context (please complete the following information):

  • Workload (share link for custom workloads): geonames
  • Service (e.g. OpenSearch): OpenSearch
  • Version (e.g. 1.0): 2.10.0


cgchinmay avatar Sep 29 '23 00:09 cgchinmay

@cgchinmay OSB won't be able to start the test (unless we skip the cluster health check, which we don't recommend) because this is an issue with the cluster that was set up. This can be seen from the logs, which show a 503 status (a server-side error):

2023-09-28 23:07:35,325 -not-actor-/PID:31934 opensearch WARNING GET http://127.0.0.1:9200/_cluster/health?wait_for_nodes=%3E%3D1 [status:503 request:30.013s]

This can also be confirmed by curling the cluster:

curl "localhost:9200/_cluster/health?pretty"
{
  "error" : {
    "root_cause" : [
      {
        "type" : "cluster_manager_not_discovered_exception",
        "reason" : null
      }
    ],
    "type" : "cluster_manager_not_discovered_exception",
    "reason" : null
  },
  "status" : 503
}
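The error type is the actionable part of that body: cluster_manager_not_discovered_exception means the node is reachable but no cluster manager has been elected, so cluster state (and therefore health) is unavailable. A small illustrative sketch of detecting this case (the JSON literal is copied from the curl output above; nothing here is OSB code):

```python
import json

# Response body copied from the failing curl output above.
body = """
{
  "error": {
    "root_cause": [
      {"type": "cluster_manager_not_discovered_exception", "reason": null}
    ],
    "type": "cluster_manager_not_discovered_exception",
    "reason": null
  },
  "status": 503
}
"""

resp = json.loads(body)
if (resp.get("status") == 503
        and resp.get("error", {}).get("type") == "cluster_manager_not_discovered_exception"):
    # The node is up but the cluster never formed: check node.name against
    # cluster.initial_master_nodes in opensearch.yml.
    print("cluster never bootstrapped")
```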

Do you have only one cluster running or did you provision others on your local host as well? I recommend checking to see if there are any other clusters you might have provisioned.

IanHoang avatar Oct 02 '23 17:10 IanHoang

@IanHoang I retried the above steps and there is a problem with the cluster provisioned using OSB. The health checks keep failing, which prevents the test from running.

However, I was able to execute the test against a cluster I provisioned myself using the Docker Compose instructions given here.

I also checked that there is a process listening on ports 9300 and 9200 after provisioning the cluster with OSB:

lsof -i :9300        
COMMAND   PID    USER   FD   TYPE             DEVICE SIZE/OFF NODE NAME
java    38919 chinmay  609u  IPv6 0x45b963dcff4dad67      0t0  TCP localhost:vrace (LISTEN)

For your reference, here is the output of the stats API for the cluster provisioned with OSB. I have stripped unnecessary details from the output.

curl -X GET "http://localhost:9200/_cluster/stats?pretty"
{
  "_nodes" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "cluster_name" : "benchmark-provisioned-cluster",
  "cluster_uuid" : "_na_",
  "timestamp" : 1696293835154,
  "indices" : {
    "count" : 0,
    "shards" : { },
    "docs" : {
      "count" : 0,
      "deleted" : 0
    },
    ...
  },
  "nodes" : {
    "count" : {
      "total" : 1,
      "cluster_manager" : 1,
      "coordinating_only" : 0,
      "data" : 1,
      "ingest" : 1,
      "master" : 1,
      "remote_cluster_client" : 1,
      "search" : 0
    },
    "versions" : [
      "2.10.0"
    ],
    "os" : {
      "available_processors" : 8,
      "allocated_processors" : 8,
      "names" : [
        {
          "name" : "Mac OS X",
          "count" : 1
        }
      ],
      "pretty_names" : [
        {
          "pretty_name" : "Mac OS X",
          "count" : 1
        }
      ],
      "mem" : {
        "total_in_bytes" : 8589934592,
        "free_in_bytes" : 106086400,
        "used_in_bytes" : 8483848192,
        "free_percent" : 1,
        "used_percent" : 99
      }
    },
    "process" : {
      "cpu" : {
        "percent" : 0
      },
      "open_file_descriptors" : {
        "min" : 613,
        "max" : 613,
        "avg" : 613
      }
    },
    "jvm" : {
      "max_uptime_in_millis" : 408743,
      "versions" : [
        {
          "version" : "17.0.8",
          "vm_name" : "OpenJDK 64-Bit Server VM",
          "vm_version" : "17.0.8+7",
          "vm_vendor" : "Eclipse Adoptium",
          "bundled_jdk" : true,
          "using_bundled_jdk" : false,
          "count" : 1
        }
      ],
      "mem" : {
        "heap_used_in_bytes" : 147504128,
        "heap_max_in_bytes" : 1073741824
      },
      "threads" : 32
    },
    "fs" : {
      "total_in_bytes" : 245107195904,
      "free_in_bytes" : 159130832896,
      "available_in_bytes" : 159130832896,
      "cache_reserved_in_bytes" : 0
    },
    "plugins" : [
     ....
    ],
    "ingest" : {
      "number_of_pipelines" : 0,
      "processor_stats" : { }
    }
  }
}

cgchinmay avatar Oct 03 '23 00:10 cgchinmay

I was facing the same issue with the cluster setup and followed the instructions given by @rishabh6788 in Slack. Posting them here for visibility.

  1. Download the zip/tar.gz from https://opensearch.org/downloads.html
  2. Extract it, go into the opensearch folder, and open the config/opensearch.yml file.
  3. Add the following settings: discovery.type: single-node and plugins.security.disabled: true, then save and close the file.
  4. Run the opensearch-install.bat script (on Linux, it's opensearch-tar-install.sh) inside the opensearch folder.
  5. Check the output of curl.exe "http://localhost:9200/_cluster/health?pretty". The output should be similar to:
{
  "cluster_name" : "<name>",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 1,
  "number_of_data_nodes" : 1,
  "discovered_master" : true,
  "discovered_cluster_manager" : true,
  "active_primary_shards" : 3,
  "active_shards" : 3,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 100.0
}
  6. Run the following command to try geonames out: opensearch-benchmark execute-test --pipeline=benchmark-only --workload=geonames --target-host=127.0.0.1:9200 --test-mode --workload-params '{"number_of_shards":"1","number_of_replicas":"0"}'
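Steps 2 and 3 above can also be applied programmatically; a minimal sketch, where opensearch.yml in the current directory is a placeholder for the real config path inside your extracted distribution:

```python
from pathlib import Path

# Placeholder for <extracted-distribution>/config/opensearch.yml.
config = Path("opensearch.yml")
config.touch(exist_ok=True)

# Append the two single-node settings from steps 2-3.
with config.open("a") as f:
    f.write("discovery.type: single-node\n")
    f.write("plugins.security.disabled: true\n")
```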

AkshathRaghav avatar Oct 04 '23 00:10 AkshathRaghav

Will take another look at this and update

cgchinmay avatar Oct 12 '23 21:10 cgchinmay

Thanks @AkshathRaghav for the steps; I was able to see it working. So I looked into the OSB-provisioned cluster, and it's using the following opensearch.yml file, which does not have discovery.type: single-node set by default. I updated the yaml file to include this setting, but then the cluster doesn't get provisioned at all, and I don't see any error in the logs either.

cc: @rishabh6788, @IanHoang any suggestions on how to debug this?

Here is the updated yaml file

cat ~/.benchmark/benchmarks/test_executions/7fc066bb-6cdb-46a6-be78-e6dd4d5e305d/osb-node-1/install/opensearch-2.10.0/config/opensearch.yml 
# ======================== OpenSearch Configuration =========================
#
# NOTE: OpenSearch comes with reasonable defaults for most settings.
#       Before you set out to tweak and tune the configuration, make sure you
#       understand what are you trying to accomplish and the consequences.
#
# The primary way of configuring a node is via this file. This template lists
# the most important settings you may want to configure for a production cluster.
#
# ---------------------------------- Cluster -----------------------------------
#
# Use a descriptive name for your cluster:
#
cluster.name: benchmark-provisioned-cluster
#
# ------------------------------------ Node ------------------------------------
#
# Use a descriptive name for the node:
#
node.name: osb-node-1
#
# Add custom attributes to the node:
#
#node.attr.rack: r1
#
# ----------------------------------- Paths ------------------------------------
#
# Path to directory where to store the data (separate multiple locations by comma):
#
path.data: ['/Users/chinmay/.benchmark/benchmarks/test_executions/7fc066bb-6cdb-46a6-be78-e6dd4d5e305d/osb-node-1/install/opensearch-2.10.0/data']
#
# Path to log files:
#
path.logs: /Users/chinmay/.benchmark/benchmarks/test_executions/7fc066bb-6cdb-46a6-be78-e6dd4d5e305d/osb-node-1/logs/server
#
# ----------------------------------- Memory -----------------------------------
#
# Lock the memory on startup:
#
#bootstrap.memory_lock: true
#
# Make sure that the heap size is set to about half the memory available
# on the system and that the owner of the process is allowed to use this
# limit.
#
# OpenSearch performs poorly when the system is swapping the memory.
#
# ---------------------------------- Network -----------------------------------
#
# Set the bind address to a specific IP (IPv4 or IPv6):
#
network.host: 127.0.0.1
#
# Set a custom port for HTTP:
#
http.port: 9200

transport.tcp.port: 9300
# --------------------------------- Discovery ----------------------------------
#
# Pass an initial list of hosts to perform discovery when this node is started:
# The default list of hosts is ["127.0.0.1", "[::1]"]
#
discovery.seed_hosts: ["127.0.0.1:9300"]
# Prevent split brain by specifying the initial master nodes.
cluster.initial_master_nodes: ["opensearch-node-1"]
discovery.type: single-node
#
#
# ---------------------------------- Gateway -----------------------------------
#
# Block initial recovery after a full cluster restart until N nodes are started:
#
#gateway.recover_after_nodes: 3
#
# ---------------------------------- Various -----------------------------------
#
# Require explicit names when deleting indices:
#
#action.destructive_requires_name: true
plugins.security.disabled: true

cgchinmay avatar Oct 25 '23 06:10 cgchinmay

Hey @cgchinmay, I was looking into this a bit and narrowed it down to a potential problem with how OSB is populating the cluster.initial_master_nodes setting in the opensearch.yml config file.

Right now, if I hardcode the node name as cluster.initial_master_nodes: ["osb-node-1"] before provisioning and starting the cluster using the commands you provided, I get healthy responses from the cluster:

curl localhost:9200
{
  "name" : "osb-node-1",
  "cluster_name" : "benchmark-provisioned-cluster",
  "cluster_uuid" : "rtOEAzCRR0G-aLRjTE791g",
  "version" : {
    "distribution" : "opensearch",
    "number" : "2.10.0",
    "build_type" : "tar",
    "build_hash" : "eee49cb340edc6c4d489bcd9324dda571fc8dc03",
    "build_date" : "2023-09-20T23:54:29.889267151Z",
    "build_snapshot" : false,
    "lucene_version" : "9.7.0",
    "minimum_wire_compatibility_version" : "7.10.0",
    "minimum_index_compatibility_version" : "7.0.0"
  },
  "tagline" : "The OpenSearch Project: https://opensearch.org/"
}

and

curl "http://localhost:9200/_cluster/health?pretty" 
{
  "cluster_name" : "benchmark-provisioned-cluster",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 1,
  "number_of_data_nodes" : 1,
  "discovered_master" : true,
  "discovered_cluster_manager" : true,
  "active_primary_shards" : 3,
  "active_shards" : 3,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 100.0
}

And I can run the benchmark as intended. cc: @IanHoang @gkamat

OVI3D0 avatar Aug 15 '24 18:08 OVI3D0

@OVI3D0 that's an interesting find. Will try this out

cgchinmay avatar Aug 15 '24 18:08 cgchinmay

For a little more context, this came after investigating the logs under ~/.benchmark/benchmarks/test_executions/<installation-id>/osb-node-1/logs/server after replicating your issue, and seeing the following warning:

tail -f benchmark-provisioned-cluster.log 
[2024-08-15T17:09:04,353][WARN ][o.o.c.c.ClusterFormationFailureHelper] [osb-node-1] cluster-manager not discovered yet, this node has not previously joined a bootstrapped cluster, and this node must discover cluster-manager-eligible nodes [opensearch-node-1] to bootstrap a cluster: have discovered [{osb-node-1}{gULLiOVDS565rl8liTw5pw}{puMy3v0nS3KZIbZz_huXuQ}{127.0.0.1}{127.0.0.1:9300}{dimr}{shard_indexing_pressure_enabled=true}]; discovery will continue using [] from hosts providers and [{osb-node-1}{gULLiOVDS565rl8liTw5pw}{puMy3v0nS3KZIbZz_huXuQ}{127.0.0.1}{127.0.0.1:9300}{dimr}{shard_indexing_pressure_enabled=true}] from last-known cluster state; node term 0, last-accepted version 0 in term 0


OVI3D0 avatar Aug 15 '24 19:08 OVI3D0

Great find @OVI3D0! Are there any follow-up action items we can take to streamline this for our users (e.g. help text when users encounter the issue that Chinmay reported above, or automatically setting cluster.initial_master_nodes to "osb-node-1" by default)?

IanHoang avatar Aug 16 '24 16:08 IanHoang

Isn't the OpenSearch distribution installation command provided in the description incorrect?

opensearch-benchmark install --provision-config-instance=defaults --distribution-version=2.10.0 --node-name="osb-node-1" --network-host="127.0.0.1" --http-port=9200 --master-nodes="opensearch-node-1" --seed-hosts="127.0.0.1:9300" --quiet

The master node specification should match the node name. Changing it to osb-node-1 appears to get the cluster working properly. Perhaps @OVI3D0 can check if further action is needed. Thanks.

gkamat avatar Aug 16 '24 16:08 gkamat

Isn't the OpenSearch distribution installation command provided in the description incorrect?

opensearch-benchmark install --provision-config-instance=defaults --distribution-version=2.10.0 --node-name="osb-node-1" --network-host="127.0.0.1" --http-port=9200 --master-nodes="opensearch-node-1" --seed-hosts="127.0.0.1:9300" --quiet

The master node specification should match the node name. Changing it to osb-node-1 appears to get the cluster working properly. Perhaps @OVI3D0 can check if further action is needed. Thanks.

You're right, this worked for me as well. If these values need to match, then I think we can streamline this by adding a check before building the cluster, to make sure they always match: https://github.com/opensearch-project/opensearch-benchmark/pull/621
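Such a check could be as simple as refusing to provision when the names diverge. A hypothetical sketch (function name and error text are illustrative, not the actual code in the linked PR):

```python
def validate_master_nodes(node_name: str, master_nodes: list) -> None:
    """Fail fast if this node can never be elected cluster manager."""
    if node_name not in master_nodes:
        raise ValueError(
            f"--node-name {node_name!r} does not appear in --master-nodes "
            f"{master_nodes!r}; the cluster will never bootstrap "
            "(cluster_manager_not_discovered_exception)."
        )

validate_master_nodes("osb-node-1", ["osb-node-1"])  # OK: names match
```

With the original command's values ("osb-node-1" vs ["opensearch-node-1"]), a check like this would have failed immediately instead of letting the health check time out.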

OVI3D0 avatar Aug 19 '24 22:08 OVI3D0

I will close this issue. Here is the correct command based on the above observations; the node name and the master-nodes name must match:

opensearch-benchmark  install --provision-config-instance=defaults --distribution-version=2.10.0 --node-name="osb-node-1" --network-host="127.0.0.1" --http-port=9200 --master-nodes="osb-node-1" --seed-hosts="127.0.0.1:9300"

cgchinmay avatar Aug 26 '24 13:08 cgchinmay