
[BUG] Securityadmin error: exits with node reported failures

Open prudhvigodithi opened this issue 3 years ago • 37 comments

What is the bug? Executing /usr/share/opensearch/plugins/opensearch-security/tools/securityadmin.sh fails with FAIL: Expected 2 nodes to return response, but got 0. Full error:

**************************************************************************
** This tool will be deprecated in the next major release of OpenSearch **
** https://github.com/opensearch-project/security/issues/1755           **
**************************************************************************
Security Admin v7
Will connect to my-first-cluster.default.svc.cluster.local:9200 ... done
Connected as "CN=admin,OU=my-first-cluster"
OpenSearch Version: 2.0.1
Contacting opensearch cluster 'opensearch' and wait for YELLOW clusterstate ...
Clustername: my-first-cluster
Clusterstate: GREEN
Number of nodes: 2
Number of data nodes: 1
.opendistro_security index already exists, so we do not need to create one.
Legacy index '.opendistro_security' (ES 6) detected (or forced). You should migrate the configuration!
Populate config from /usr/share/opensearch/config/opensearch-security/
Will update '/config' with /usr/share/opensearch/config/opensearch-security/config.yml (legacy mode)
   SUCC: Configuration for 'config' created or updated
Will update '/roles' with /usr/share/opensearch/config/opensearch-security/roles.yml (legacy mode)
   SUCC: Configuration for 'roles' created or updated
Will update '/rolesmapping' with /usr/share/opensearch/config/opensearch-security/roles_mapping.yml (legacy mode)
   SUCC: Configuration for 'rolesmapping' created or updated
Will update '/internalusers' with /usr/share/opensearch/config/opensearch-security/internal_users.yml (legacy mode)
   SUCC: Configuration for 'internalusers' created or updated
Will update '/actiongroups' with /usr/share/opensearch/config/opensearch-security/action_groups.yml (legacy mode)
   SUCC: Configuration for 'actiongroups' created or updated
Will update '/nodesdn' with /usr/share/opensearch/config/opensearch-security/nodes_dn.yml (legacy mode)
   SUCC: Configuration for 'nodesdn' created or updated
Will update '/whitelist' with /usr/share/opensearch/config/opensearch-security/whitelist.yml (legacy mode)
   SUCC: Configuration for 'whitelist' created or updated
Will update '/audit' with /usr/share/opensearch/config/opensearch-security/audit.yml (legacy mode)
   SUCC: Configuration for 'audit' created or updated
FAIL: 2 nodes reported failures. Failure is /{"_nodes":{"total":2,"successful":0,"failed":2,"failures":[{"type":"failed_node_exception","reason":"Failed node [E_Dyk7VUR_ee4wykVYJSoA]","node_id":"E_Dyk7VUR_ee4wykVYJSoA","caused_by":{"type":"static_resource_exception","reason":"static_resource_exception: Unable to load static tenants"}},{"type":"failed_node_exception","reason":"Failed node [G4U098vuRCGF8RTI3KPRPA]","node_id":"G4U098vuRCGF8RTI3KPRPA","caused_by":{"type":"static_resource_exception","reason":"Unable to load static tenants"}}]},"cluster_name":"my-first-cluster","configupdate_response":{"nodes":{},"node_size":0,"has_failures":true,"failures_size":2}}
FAIL: Expected 2 nodes to return response, but got 0
Done with failures

How can one reproduce the bug? Start the docker container with persistent storage; executing /usr/share/opensearch/plugins/opensearch-security/tools/securityadmin.sh then throws this error.

What is the expected behavior? Executing the securityadmin script should create the security index as expected; when it works, it logs a success message such as:

**************************************************************************
** This tool will be deprecated in the next major release of OpenSearch **
** https://github.com/opensearch-project/security/issues/1755           **
**************************************************************************
Security Admin v7
Will connect to my-first-cluster.default.svc.cluster.local:9200 ... done
Connected as "CN=admin,OU=my-first-cluster"
OpenSearch Version: 2.0.1
Contacting opensearch cluster 'opensearch' and wait for YELLOW clusterstate ...
Clustername: my-first-cluster
Clusterstate: YELLOW
Number of nodes: 2
Number of data nodes: 1
.opendistro_security index already exists, so we do not need to create one.
Legacy index '.opendistro_security' (ES 6) detected (or forced). You should migrate the configuration!
Populate config from /usr/share/opensearch/config/opensearch-security/
Will update '/config' with /usr/share/opensearch/config/opensearch-security/config.yml (legacy mode)
   SUCC: Configuration for 'config' created or updated
Will update '/roles' with /usr/share/opensearch/config/opensearch-security/roles.yml (legacy mode)
   SUCC: Configuration for 'roles' created or updated
Will update '/rolesmapping' with /usr/share/opensearch/config/opensearch-security/roles_mapping.yml (legacy mode)
   SUCC: Configuration for 'rolesmapping' created or updated
Will update '/internalusers' with /usr/share/opensearch/config/opensearch-security/internal_users.yml (legacy mode)
   SUCC: Configuration for 'internalusers' created or updated
Will update '/actiongroups' with /usr/share/opensearch/config/opensearch-security/action_groups.yml (legacy mode)
   SUCC: Configuration for 'actiongroups' created or updated
Will update '/nodesdn' with /usr/share/opensearch/config/opensearch-security/nodes_dn.yml (legacy mode)
   SUCC: Configuration for 'nodesdn' created or updated
Will update '/whitelist' with /usr/share/opensearch/config/opensearch-security/whitelist.yml (legacy mode)
   SUCC: Configuration for 'whitelist' created or updated
Will update '/audit' with /usr/share/opensearch/config/opensearch-security/audit.yml (legacy mode)
   SUCC: Configuration for 'audit' created or updated
SUCC: Expected 7 config types for node {"updated_config_types":["config","roles","rolesmapping","internalusers","actiongroups","nodesdn","audit"],"updated_config_size":7,"message":null} is 7 (["config","roles","rolesmapping","internalusers","actiongroups","nodesdn","audit"]) due to: null
SUCC: Expected 7 config types for node {"updated_config_types":["config","roles","rolesmapping","internalusers","actiongroups","nodesdn","audit"],"updated_config_size":7,"message":null} is 7 (["config","roles","rolesmapping","internalusers","actiongroups","nodesdn","audit"]) due to: null
Done with success

What is your host/environment?

  • OS: 2.0.1
  • Plugins: Docker container docker.io/opensearchproject/opensearch:2.0.1

Do you have any additional context? Following the earlier issue https://github.com/opensearch-project/helm-charts/issues/158, this was not resolved with config_version: 2 in action_groups.yml. Is there a correlation with config_version: 2?

This issue is raised to help make the OpenSearch Kubernetes Operator compatible with the 2.0.0 series of OpenSearch. https://github.com/Opster/opensearch-k8s-operator/issues/176

prudhvigodithi avatar Jun 21 '22 22:06 prudhvigodithi

This could be an issue

.opendistro_security index does not exists, attempt to create it ... ERR: An unexpected SocketTimeoutException occured: 30,000 milliseconds timeout on connection http-outgoing-5 [ACTIVE]
Trace:
java.net.SocketTimeoutException: 30,000 milliseconds timeout on connection http-outgoing-5 [ACTIVE]
	at org.opensearch.client.RestClient.extractAndWrapCause(RestClient.java:905)
	at org.opensearch.client.RestClient.performRequest(RestClient.java:307)
	at org.opensearch.client.RestClient.performRequest(RestClient.java:295)
	at org.opensearch.client.RestHighLevelClient.internalPerformRequest(RestHighLevelClient.java:1762)
	at org.opensearch.client.RestHighLevelClient.performRequest(RestHighLevelClient.java:1745)
	at org.opensearch.client.RestHighLevelClient.performRequestAndParseEntity(RestHighLevelClient.java:1709)
	at org.opensearch.client.IndicesClient.create(IndicesClient.java:159)
	at org.opensearch.security.tools.SecurityAdmin.createConfigIndex(SecurityAdmin.java:1171)
	at org.opensearch.security.tools.SecurityAdmin.execute(SecurityAdmin.java:677)
	at org.opensearch.security.tools.SecurityAdmin.main(SecurityAdmin.java:161)
Caused by: java.net.SocketTimeoutException: 30,000 milliseconds timeout on connection http-outgoing-5 [ACTIVE]
	at org.apache.http.nio.protocol.HttpAsyncRequestExecutor.timeout(HttpAsyncRequestExecutor.java:387)
	at org.apache.http.impl.nio.client.InternalIODispatch.onTimeout(InternalIODispatch.java:92)
	at org.apache.http.impl.nio.client.InternalIODispatch.onTimeout(InternalIODispatch.java:39)
	at org.apache.http.impl.nio.reactor.AbstractIODispatch.timeout(AbstractIODispatch.java:175)
	at org.apache.http.impl.nio.reactor.BaseIOReactor.sessionTimedOut(BaseIOReactor.java:261)
	at org.apache.http.impl.nio.reactor.AbstractIOReactor.timeoutCheck(AbstractIOReactor.java:502)
	at org.apache.http.impl.nio.reactor.BaseIOReactor.validate(BaseIOReactor.java:211)
	at org.apache.http.impl.nio.reactor.AbstractIOReactor.execute(AbstractIOReactor.java:280)
	at org.apache.http.impl.nio.reactor.BaseIOReactor.execute(BaseIOReactor.java:104)
	at org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor$Worker.run(AbstractMultiworkerIOReactor.java:591)
	at java.base/java.lang.Thread.run(Thread.java:833)

The error comes from trying to create the index too early; after a certain period, a retry gives me Done with success. It would be helpful if /usr/share/opensearch/plugins/opensearch-security/tools/securityadmin.sh could handle this itself.
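The retry-until-success workaround described in this thread can be sketched as a small shell helper. This is a minimal sketch, not part of the security plugin; `retry_with_backoff` is a hypothetical name, and the securityadmin.sh path and flags in the example are the ones from this issue:

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a command, retrying with a fixed delay until it
# succeeds or the attempt budget is exhausted.
retry_with_backoff() {
  local max_attempts=$1 delay=$2; shift 2
  local attempt=1
  until "$@"; do
    if (( attempt >= max_attempts )); then
      echo "giving up after ${attempt} attempts" >&2
      return 1
    fi
    sleep "$delay"
    attempt=$(( attempt + 1 ))
  done
}

# Example invocation (paths and host taken from this issue):
# ADMIN=/usr/share/opensearch/plugins/opensearch-security/tools/securityadmin.sh
# retry_with_backoff 20 20 "$ADMIN" -cacert /certs/ca.crt -cert /certs/tls.crt \
#   -key /certs/tls.key -cd /usr/share/opensearch/config/opensearch-security \
#   -icl -nhnv -h my-first-cluster.default.svc.cluster.local -p 9200
```

This is essentially what the `until $ADMIN ... || (( count++ >= 20 )); do sleep 20; done` loop in the Job manifest later in this thread does, factored into a reusable function.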

prudhvigodithi avatar Jun 22 '22 00:06 prudhvigodithi

Can you please share a full docker-compose.yaml reproducing this error?

smlx avatar Jun 24 '22 13:06 smlx

Can you please share a full docker-compose.yaml reproducing this error?

Hey @smlx, I can try with a docker-compose.yaml, but I have seen this with a k8s StatefulSet when executing /usr/share/opensearch/plugins/opensearch-security/tools/securityadmin.sh -cacert /certs/ca.crt -cert /certs/tls.crt -key /certs/tls.key -cd /usr/share/opensearch/config/opensearch-security -icl -nhnv -h my-first-cluster.default.svc.cluster.local -p 9200 immediately after the cluster is launched. That said, it works after a sleep 60 followed by the same command again, so from the error below I think .opendistro_security was not properly created, hence it fails with:

FAIL: 2 nodes reported failures. Failure is /{"_nodes":{"total":2,"successful":0,"failed":2,"failures":[{"type":"failed_node_exception","reason":"Failed node [E_Dyk7VUR_ee4wykVYJSoA]","node_id":"E_Dyk7VUR_ee4wykVYJSoA","caused_by":{"type":"static_resource_exception","reason":"static_resource_exception: Unable to load static tenants"}},{"type":"failed_node_exception","reason":"Failed node [G4U098vuRCGF8RTI3KPRPA]","node_id":"G4U098vuRCGF8RTI3KPRPA","caused_by":{"type":"static_resource_exception","reason":"Unable to load static tenants"}}]},"cluster_name":"my-first-cluster","configupdate_response":{"nodes":{},"node_size":0,"has_failures":true,"failures_size":2}}
FAIL: Expected 2 nodes to return response, but got 0
Done with failures

Additionally, even without the sleep 60 I have not seen this when using emptyDir: {}, only with persistent storage, as provisioning the persistent storage and writing its contents adds a little more time (pre-warming).

prudhvigodithi avatar Jun 24 '22 14:06 prudhvigodithi

Additionally, once I get this java.net.SocketTimeoutException: 30,000 milliseconds timeout on connection http-outgoing-5 [ACTIVE] , next time when I run the securityadmin.sh I get the message as .opendistro_security index already exists, so we do not need to create one. and not as .opendistro_security index does not exists, attempt to create it ... done (0-all replicas)

prudhvigodithi avatar Jun 24 '22 14:06 prudhvigodithi

I ran into this same issue with securityadmin.sh timing out (java.net.SocketTimeoutException: 30,000 milliseconds) when trying to deploy an OpenSearch 2.0.0 cluster on Kubernetes. I run securityadmin.sh as a k8s Job, which fails with the timeout, re-creates the pod, and yields the same .opendistro_security index already exists, so we do not need to create one. and Done with failures on the next run(s). My cluster health was green and I was running three manager nodes in a StatefulSet. I deleted the namespace the OpenSearch cluster was running in and rebuilt the cluster using emptyDir: {} as @prudhvigodithi mentioned, which yielded a successful securityadmin.sh run! My suspicion is that the StorageClass I am using in my Kubernetes environment is too slow to run OpenSearch. I do know the StorageClass is backed by slow spinning disks, whereas the nodes are on SSDs, which would explain why emptyDir: {} worked without timeouts while an actual persistent volume failed.

daxxog avatar Jun 27 '22 12:06 daxxog

Hey @daxxog, just to add: as a hack, the sleep approach should work temporarily, but there is still some chance this can re-occur. Also, yes, this is caused by slow disk startup when using a StorageClass. CC @bbarani @peterzhuamazon

prudhvigodithi avatar Jun 27 '22 12:06 prudhvigodithi

Even with the sleep I was unable to get it to work, as my cluster storage was just slow in general (not just at startup). What I ended up doing:

  • Created a 3-node manager cluster using emptyDir: {}
  • Ran securityadmin.sh, without errors
  • Added 6 manager-eligible data nodes to the cluster, using my slow StorageClass
  • Waited for cluster to be green
  • Deleted the original 3 manager nodes
  • Waited for cluster to be green
  • Added 3 manager nodes back, with slow StorageClass
  • Waited for cluster to be green
  • Deleted one data node at a time, waiting for cluster to be green in-between each deletion
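The repeated "waited for cluster to be green" steps above can be scripted. A minimal sketch, assuming the cluster health endpoint is reachable at the given URL; `wait_for_green` is a hypothetical helper, not part of any OpenSearch tooling:

```shell
#!/usr/bin/env bash
# Hypothetical helper: poll _cluster/health until the status is "green",
# giving up after a timeout (seconds).
wait_for_green() {
  local base_url=$1 timeout=${2:-300} interval=${3:-5} elapsed=0 status
  while (( elapsed < timeout )); do
    # -k skips TLS verification, in the spirit of the -nhnv flag used here
    status=$(curl -sk "${base_url}/_cluster/health" |
      sed -n 's/.*"status":"\([a-z]*\)".*/\1/p')
    [ "$status" = "green" ] && return 0
    sleep "$interval"
    elapsed=$(( elapsed + interval ))
  done
  return 1
}

# Example (host taken from this issue):
# wait_for_green https://my-first-cluster.default.svc.cluster.local:9200
```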

daxxog avatar Jun 27 '22 12:06 daxxog

[Triage] Hey @prudhvigodithi, it looks like you got this resolved. Anyone with an issue, please open a new GitHub issue. Anyone looking for support, please open an issue on the forum so we can track it.

DarshitChanpura avatar Jun 28 '22 19:06 DarshitChanpura

Hey @DarshitChanpura, I used a hack to make it work, but it looks like it's not yet resolved for @daxxog, who ended up with another hack. The idea here is for securityadmin to handle the connection to the cluster and create the .opendistro_security index once the cluster is fully ready. Could we consider reopening this so it can be tracked here? Thank you

prudhvigodithi avatar Jun 28 '22 19:06 prudhvigodithi

Hey @prudhvigodithi, would you mind providing more details on how to reproduce this, as it seems like a migration issue from an older version? What steps are needed to reproduce the issue independently, and which previous version were you using?

DarshitChanpura avatar Jun 28 '22 19:06 DarshitChanpura

Hey @DarshitChanpura, this is an installation from scratch of version 2.0.1; here are the steps to reproduce it in Kubernetes. For 1.x versions this error somehow does not show up, as the connection is initiated via the transport client on port 9300; I have seen it with port 9200 (or another port set via http.port) over the HTTP connection.

  1. Create a k8s Job running securityadmin.sh that connects to the cluster as soon as it has started:
apiVersion: batch/v1
kind: Job
metadata:
  generation: 1
  labels:
    controller-uid: 0881c5cd-a44a-4d34-938f-2f05a14807de
    job-name: my-first-cluster-securityconfig-update
  name: my-first-cluster-securityconfig-update
  namespace: default
spec:
  backoffLimit: 0
  completionMode: NonIndexed
  completions: 1
  parallelism: 1
  selector:
    matchLabels:
      controller-uid: 0881c5cd-a44a-4d34-938f-2f05a14807de
  suspend: false
  template:
    metadata:
      creationTimestamp: null
      labels:
        controller-uid: 0881c5cd-a44a-4d34-938f-2f05a14807de
        job-name: my-first-cluster-securityconfig-update
      name: my-first-cluster-securityconfig-update
    spec:
      containers:
      - args:
        - ADMIN=/usr/share/opensearch/plugins/opensearch-security/tools/securityadmin.sh;chmod
          +x $ADMIN; count=0; until $ADMIN -cacert /certs/ca.crt -cert /certs/tls.crt -key /certs/tls.key
          -cd /usr/share/opensearch/config/opensearch-security -icl -nhnv -h my-first-cluster.default.svc.cluster.local
          -p 9200 || (( count++ >= 20 )); do  sleep 20; done
        command:
        - /bin/bash
        - -c
        image: docker.io/opensearchproject/opensearch:2.0.1
        imagePullPolicy: IfNotPresent
        name: updater
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /usr/share/opensearch/config/tls-transport
          name: transport-cert
        - mountPath: /usr/share/opensearch/config/tls-http
          name: http-cert
        - mountPath: /usr/share/opensearch/config/opensearch-security/action_groups.yml
          name: securityconfig
          readOnly: true
          subPath: action_groups.yml
        - mountPath: /usr/share/opensearch/config/opensearch-security/config.yml
          name: securityconfig
          readOnly: true
          subPath: config.yml
        - mountPath: /usr/share/opensearch/config/opensearch-security/internal_users.yml
          name: securityconfig
          readOnly: true
          subPath: internal_users.yml
        - mountPath: /usr/share/opensearch/config/opensearch-security/nodes_dn.yml
          name: securityconfig
          readOnly: true
          subPath: nodes_dn.yml
        - mountPath: /usr/share/opensearch/config/opensearch-security/roles.yml
          name: securityconfig
          readOnly: true
          subPath: roles.yml
        - mountPath: /usr/share/opensearch/config/opensearch-security/roles_mapping.yml
          name: securityconfig
          readOnly: true
          subPath: roles_mapping.yml
        - mountPath: /usr/share/opensearch/config/opensearch-security/tenants.yml
          name: securityconfig
          readOnly: true
          subPath: tenants.yml
        - mountPath: /usr/share/opensearch/config/opensearch-security/whitelist.yml
          name: securityconfig
          readOnly: true
          subPath: whitelist.yml
        - mountPath: /certs
          name: admin-cert
      dnsPolicy: ClusterFirst
      restartPolicy: Never
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 5
      volumes:
      - name: transport-cert
        secret:
          defaultMode: 420
          secretName: my-first-cluster-transport-cert
      - name: http-cert
        secret:
          defaultMode: 420
          secretName: my-first-cluster-http-cert
      - name: securityconfig
        secret:
          defaultMode: 420
          secretName: securityconfig-secret
      - name: admin-cert
        secret:
          defaultMode: 420
          secretName: my-first-cluster-admin-cert
  2. The cluster uses a backend persistence layer (StorageClass); the backend can be AWS EBS or any cloud provider's storage.

  3. While the cluster is being initialized, invoke securityadmin.sh from the above Job as:

/usr/share/opensearch/plugins/opensearch-security/tools/securityadmin.sh -cacert /certs/ca.crt -cert /certs/tls.crt -key /certs/tls.key -cd /usr/share/opensearch/config/opensearch-security -icl -nhnv -h my-first-cluster.default.svc.cluster.local -p 9200
  4. The logs of the Job show Will connect to my-first-cluster.default.svc.cluster.local:9300 ... done, then it errors out as follows:
.opendistro_security index does not exists, attempt to create it ... ERR: An unexpected SocketTimeoutException occured: 30,000 milliseconds timeout on connection http-outgoing-5 [ACTIVE]
Trace:
java.net.SocketTimeoutException: 30,000 milliseconds timeout on connection http-outgoing-5 [ACTIVE]
	at org.opensearch.client.RestClient.extractAndWrapCause(RestClient.java:905)
	at org.opensearch.client.RestClient.performRequest(RestClient.java:307)
	at org.opensearch.client.RestClient.performRequest(RestClient.java:295)
	at org.opensearch.client.RestHighLevelClient.internalPerformRequest(RestHighLevelClient.java:1762)
	at org.opensearch.client.RestHighLevelClient.performRequest(RestHighLevelClient.java:1745)
	at org.opensearch.client.RestHighLevelClient.performRequestAndParseEntity(RestHighLevelClient.java:1709)
	at org.opensearch.client.IndicesClient.create(IndicesClient.java:159)
	at org.opensearch.security.tools.SecurityAdmin.createConfigIndex(SecurityAdmin.java:1171)
	at org.opensearch.security.tools.SecurityAdmin.execute(SecurityAdmin.java:677)
	at org.opensearch.security.tools.SecurityAdmin.main(SecurityAdmin.java:161)
Caused by: java.net.SocketTimeoutException: 30,000 milliseconds timeout on connection http-outgoing-5 [ACTIVE]

After a few minutes, upon retry:

FAIL: 2 nodes reported failures. Failure is /{"_nodes":{"total":2,"successful":0,"failed":2,"failures":[{"type":"failed_node_exception","reason":"Failed node [tFlDyOnUTv2jmIxX7ZT3Gw]","node_id":"tFlDyOnUTv2jmIxX7ZT3Gw","caused_by":{"type":"static_resource_exception","reason":"static_resource_exception: Unable to load static tenants"}},{"type":"failed_node_exception","reason":"Failed node [-5cpR334Sm-4GGtL7f46yQ]","node_id":"-5cpR334Sm-4GGtL7f46yQ","caused_by":{"type":"static_resource_exception","reason":"Unable to load static tenants"}}]},"cluster_name":"my-first-cluster","configupdate_response":{"nodes":{},"node_size":0,"has_failures":true,"failures_size":2}}

Meanwhile the cluster logs show a catch-22: the cluster won't start because the security config is not initialized, but the Job responsible for running securityadmin fails with java.net.SocketTimeoutException: 30,000. Logs: [2022-06-28T19:51:24,580][ERROR][o.o.s.a.BackendRegistry ] [my-first-cluster-bootstrap-0] Not yet initialized (you may need to run securityadmin)

prudhvigodithi avatar Jun 28 '22 20:06 prudhvigodithi

@prudhvigodithi The scenario that you have does not seem to be exactly the same as the one originally filed. Looking at the issue description, I see the following in the logs. That indicates that there was an existing .opendistro_security index in the cluster, so there is more than one use case to investigate here.

.opendistro_security index already exists, so we do not need to create one.
Legacy index '.opendistro_security' (ES 6) detected (or forced). You should migrate the configuration!

cliu123 avatar Jun 28 '22 20:06 cliu123

Hey @cliu123, it starts with java.net.SocketTimeoutException: 30,000, and then upon retry it says .opendistro_security index already exists and errors out with FAIL: 2 nodes reported failures. Failure is /{"_nodes":{"total":2,"successful":0,"failed":2,"failures":[{"type":"failed_node_exception","reason":"Failed node, which is what I raised this issue about.

I suspect the .opendistro_security gets created even with the error java.net.SocketTimeoutException: 30,000.
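One way to check that suspicion is to ask the cluster whether the index exists after the timed-out run. A minimal sketch; `index_exists` is a hypothetical helper, and note that a cluster whose security config is not yet initialized may reject the request entirely:

```shell
#!/usr/bin/env bash
# Hypothetical helper: return success iff the named index exists (HTTP 200;
# a missing index returns 404).
index_exists() {
  local base_url=$1 index=$2 code
  code=$(curl -sk -o /dev/null -w '%{http_code}' "${base_url}/${index}")
  [ "$code" = "200" ]
}

# Example (host taken from this issue; add --cert/--key for client auth):
# if index_exists "https://my-first-cluster.default.svc.cluster.local:9200" \
#      ".opendistro_security"; then
#   echo ".opendistro_security was created despite the timeout"
# fi
```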

Full Error log

k logs my-first-cluster-securityconfig-update--1-9z55w -f
Waiting to connect to the cluster
OpenSearch Security not initialized.**************************************************************************
** This tool will be deprecated in the next major release of OpenSearch **
** https://github.com/opensearch-project/security/issues/1755           **
**************************************************************************
Security Admin v7
Will connect to my-first-cluster.default.svc.cluster.local:9400 ... done
Connected as "CN=admin,OU=my-first-cluster"
OpenSearch Version: 2.0.1
Contacting opensearch cluster 'opensearch' and wait for YELLOW clusterstate ...
Clustername: my-first-cluster
Clusterstate: GREEN
Number of nodes: 1
Number of data nodes: 0
.opendistro_security index does not exists, attempt to create it ... ERR: An unexpected SocketTimeoutException occured: 30,000 milliseconds timeout on connection http-outgoing-5 [ACTIVE]
Trace:
java.net.SocketTimeoutException: 30,000 milliseconds timeout on connection http-outgoing-5 [ACTIVE]
	at org.opensearch.client.RestClient.extractAndWrapCause(RestClient.java:905)
	at org.opensearch.client.RestClient.performRequest(RestClient.java:307)
	at org.opensearch.client.RestClient.performRequest(RestClient.java:295)
	at org.opensearch.client.RestHighLevelClient.internalPerformRequest(RestHighLevelClient.java:1762)
	at org.opensearch.client.RestHighLevelClient.performRequest(RestHighLevelClient.java:1745)
	at org.opensearch.client.RestHighLevelClient.performRequestAndParseEntity(RestHighLevelClient.java:1709)
	at org.opensearch.client.IndicesClient.create(IndicesClient.java:159)
	at org.opensearch.security.tools.SecurityAdmin.createConfigIndex(SecurityAdmin.java:1171)
	at org.opensearch.security.tools.SecurityAdmin.execute(SecurityAdmin.java:677)
	at org.opensearch.security.tools.SecurityAdmin.main(SecurityAdmin.java:161)
Caused by: java.net.SocketTimeoutException: 30,000 milliseconds timeout on connection http-outgoing-5 [ACTIVE]
	at org.apache.http.nio.protocol.HttpAsyncRequestExecutor.timeout(HttpAsyncRequestExecutor.java:387)
	at org.apache.http.impl.nio.client.InternalIODispatch.onTimeout(InternalIODispatch.java:92)
	at org.apache.http.impl.nio.client.InternalIODispatch.onTimeout(InternalIODispatch.java:39)
	at org.apache.http.impl.nio.reactor.AbstractIODispatch.timeout(AbstractIODispatch.java:175)
	at org.apache.http.impl.nio.reactor.BaseIOReactor.sessionTimedOut(BaseIOReactor.java:261)
	at org.apache.http.impl.nio.reactor.AbstractIOReactor.timeoutCheck(AbstractIOReactor.java:502)
	at org.apache.http.impl.nio.reactor.BaseIOReactor.validate(BaseIOReactor.java:211)
	at org.apache.http.impl.nio.reactor.AbstractIOReactor.execute(AbstractIOReactor.java:280)
	at org.apache.http.impl.nio.reactor.BaseIOReactor.execute(BaseIOReactor.java:104)
	at org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor$Worker.run(AbstractMultiworkerIOReactor.java:591)
	at java.base/java.lang.Thread.run(Thread.java:833)


**************************************************************************
** This tool will be deprecated in the next major release of OpenSearch **
** https://github.com/opensearch-project/security/issues/1755           **
**************************************************************************
Security Admin v7
Will connect to my-first-cluster.default.svc.cluster.local:9400 ... done
Connected as "CN=admin,OU=my-first-cluster"
OpenSearch Version: 2.0.1
Contacting opensearch cluster 'opensearch' and wait for YELLOW clusterstate ...
Clustername: my-first-cluster
Clusterstate: GREEN
Number of nodes: 2
Number of data nodes: 1
.opendistro_security index already exists, so we do not need to create one.
Legacy index '.opendistro_security' (ES 6) detected (or forced). You should migrate the configuration!
Populate config from /usr/share/opensearch/config/opensearch-security/
Will update '/config' with /usr/share/opensearch/config/opensearch-security/config.yml (legacy mode)
   SUCC: Configuration for 'config' created or updated
Will update '/roles' with /usr/share/opensearch/config/opensearch-security/roles.yml (legacy mode)
   SUCC: Configuration for 'roles' created or updated
Will update '/rolesmapping' with /usr/share/opensearch/config/opensearch-security/roles_mapping.yml (legacy mode)
   SUCC: Configuration for 'rolesmapping' created or updated
Will update '/internalusers' with /usr/share/opensearch/config/opensearch-security/internal_users.yml (legacy mode)
   SUCC: Configuration for 'internalusers' created or updated
Will update '/actiongroups' with /usr/share/opensearch/config/opensearch-security/action_groups.yml (legacy mode)
   SUCC: Configuration for 'actiongroups' created or updated
Will update '/nodesdn' with /usr/share/opensearch/config/opensearch-security/nodes_dn.yml (legacy mode)
   SUCC: Configuration for 'nodesdn' created or updated
Will update '/whitelist' with /usr/share/opensearch/config/opensearch-security/whitelist.yml (legacy mode)
   SUCC: Configuration for 'whitelist' created or updated
Will update '/audit' with /usr/share/opensearch/config/opensearch-security/audit.yml (legacy mode)
   SUCC: Configuration for 'audit' created or updated
FAIL: 2 nodes reported failures. Failure is /{"_nodes":{"total":2,"successful":0,"failed":2,"failures":[{"type":"failed_node_exception","reason":"Failed node [tFlDyOnUTv2jmIxX7ZT3Gw]","node_id":"tFlDyOnUTv2jmIxX7ZT3Gw","caused_by":{"type":"static_resource_exception","reason":"static_resource_exception: Unable to load static tenants"}},{"type":"failed_node_exception","reason":"Failed node [-5cpR334Sm-4GGtL7f46yQ]","node_id":"-5cpR334Sm-4GGtL7f46yQ","caused_by":{"type":"static_resource_exception","reason":"Unable to load static tenants"}}]},"cluster_name":"my-first-cluster","configupdate_response":{"nodes":{},"node_size":0,"has_failures":true,"failures_size":2}}
FAIL: Expected 2 nodes to return response, but got 0
Done with failures
**************************************************************************
** This tool will be deprecated in the next major release of OpenSearch **
** https://github.com/opensearch-project/security/issues/1755           **
**************************************************************************
Security Admin v7
Will connect to my-first-cluster.default.svc.cluster.local:9400 ... done
Connected as "CN=admin,OU=my-first-cluster"
OpenSearch Version: 2.0.1
Contacting opensearch cluster 'opensearch' and wait for YELLOW clusterstate ...
Clustername: my-first-cluster
Clusterstate: GREEN
Number of nodes: 2
Number of data nodes: 1
.opendistro_security index already exists, so we do not need to create one.
Legacy index '.opendistro_security' (ES 6) detected (or forced). You should migrate the configuration!
Populate config from /usr/share/opensearch/config/opensearch-security/
Will update '/config' with /usr/share/opensearch/config/opensearch-security/config.yml (legacy mode)
   SUCC: Configuration for 'config' created or updated
Will update '/roles' with /usr/share/opensearch/config/opensearch-security/roles.yml (legacy mode)
   SUCC: Configuration for 'roles' created or updated
Will update '/rolesmapping' with /usr/share/opensearch/config/opensearch-security/roles_mapping.yml (legacy mode)
   SUCC: Configuration for 'rolesmapping' created or updated
Will update '/internalusers' with /usr/share/opensearch/config/opensearch-security/internal_users.yml (legacy mode)
   SUCC: Configuration for 'internalusers' created or updated
Will update '/actiongroups' with /usr/share/opensearch/config/opensearch-security/action_groups.yml (legacy mode)
   SUCC: Configuration for 'actiongroups' created or updated
Will update '/nodesdn' with /usr/share/opensearch/config/opensearch-security/nodes_dn.yml (legacy mode)
   SUCC: Configuration for 'nodesdn' created or updated
Will update '/whitelist' with /usr/share/opensearch/config/opensearch-security/whitelist.yml (legacy mode)
   SUCC: Configuration for 'whitelist' created or updated
Will update '/audit' with /usr/share/opensearch/config/opensearch-security/audit.yml (legacy mode)
   SUCC: Configuration for 'audit' created or updated
FAIL: 2 nodes reported failures. Failure is /{"_nodes":{"total":2,"successful":0,"failed":2,"failures":[{"type":"failed_node_exception","reason":"Failed node [tFlDyOnUTv2jmIxX7ZT3Gw]","node_id":"tFlDyOnUTv2jmIxX7ZT3Gw","caused_by":{"type":"static_resource_exception","reason":"static_resource_exception: Unable to load static tenants"}},{"type":"failed_node_exception","reason":"Failed node [-5cpR334Sm-4GGtL7f46yQ]","node_id":"-5cpR334Sm-4GGtL7f46yQ","caused_by":{"type":"static_resource_exception","reason":"Unable to load static tenants"}}]},"cluster_name":"my-first-cluster","configupdate_response":{"nodes":{},"node_size":0,"has_failures":true,"failures_size":2}}
FAIL: Expected 2 nodes to return response, but got 0
Done with failures
**************************************************************************
** This tool will be deprecated in the next major release of OpenSearch **
** https://github.com/opensearch-project/security/issues/1755           **
**************************************************************************
Security Admin v7
Will connect to my-first-cluster.default.svc.cluster.local:9400 ... done
Connected as "CN=admin,OU=my-first-cluster"
OpenSearch Version: 2.0.1
Contacting opensearch cluster 'opensearch' and wait for YELLOW clusterstate ...
Clustername: my-first-cluster
Clusterstate: GREEN
Number of nodes: 2
Number of data nodes: 1
.opendistro_security index already exists, so we do not need to create one.
Legacy index '.opendistro_security' (ES 6) detected (or forced). You should migrate the configuration!
Populate config from /usr/share/opensearch/config/opensearch-security/
Will update '/config' with /usr/share/opensearch/config/opensearch-security/config.yml (legacy mode)
   SUCC: Configuration for 'config' created or updated
Will update '/roles' with /usr/share/opensearch/config/opensearch-security/roles.yml (legacy mode)
   SUCC: Configuration for 'roles' created or updated
Will update '/rolesmapping' with /usr/share/opensearch/config/opensearch-security/roles_mapping.yml (legacy mode)
   SUCC: Configuration for 'rolesmapping' created or updated
Will update '/internalusers' with /usr/share/opensearch/config/opensearch-security/internal_users.yml (legacy mode)
   SUCC: Configuration for 'internalusers' created or updated
Will update '/actiongroups' with /usr/share/opensearch/config/opensearch-security/action_groups.yml (legacy mode)
   SUCC: Configuration for 'actiongroups' created or updated
Will update '/nodesdn' with /usr/share/opensearch/config/opensearch-security/nodes_dn.yml (legacy mode)
   SUCC: Configuration for 'nodesdn' created or updated
Will update '/whitelist' with /usr/share/opensearch/config/opensearch-security/whitelist.yml (legacy mode)
   SUCC: Configuration for 'whitelist' created or updated
Will update '/audit' with /usr/share/opensearch/config/opensearch-security/audit.yml (legacy mode)
   SUCC: Configuration for 'audit' created or updated
FAIL: 2 nodes reported failures. Failure is /{"_nodes":{"total":2,"successful":0,"failed":2,"failures":[{"type":"failed_node_exception","reason":"Failed node [tFlDyOnUTv2jmIxX7ZT3Gw]","node_id":"tFlDyOnUTv2jmIxX7ZT3Gw","caused_by":{"type":"static_resource_exception","reason":"static_resource_exception: Unable to load static tenants"}},{"type":"failed_node_exception","reason":"Failed node [-5cpR334Sm-4GGtL7f46yQ]","node_id":"-5cpR334Sm-4GGtL7f46yQ","caused_by":{"type":"static_resource_exception","reason":"Unable to load static tenants"}}]},"cluster_name":"my-first-cluster","configupdate_response":{"nodes":{},"node_size":0,"has_failures":true,"failures_size":2}}
FAIL: Expected 2 nodes to return response, but got 0

prudhvigodithi avatar Jun 28 '22 20:06 prudhvigodithi

Hey @prudhvigodithi, would you mind providing more details on how to reproduce this? It seems like a migration issue from an older version. What steps are needed to reproduce the issue independently, and which previous version were you using?

My issue was also encountered during a from-scratch installation, not an upgrade.

daxxog avatar Jun 30 '22 16:06 daxxog

I have done some research and have findings on this: the securityadmin client should call .setSocketTimeout on the RestHighLevelClient. Initially I assumed it was net.ipv4.tcp_keepalive_time dropping the connections, but that's not the case; I see the same behavior even after passing this sysctl via PodSecurityContext. So the socket timeout should be set during client creation.

prudhvigodithi avatar Jul 07 '22 13:07 prudhvigodithi

[Triage] @cliu123 could you look into this and let us know what your findings are?

peternied avatar Jul 11 '22 19:07 peternied

The error does not happen when using emptyDir, as @prudhvigodithi mentioned.

cliu123 avatar Jul 12 '22 15:07 cliu123

I have some findings on the securityadmin error: the fix has to be on the client end (in our case, securityadmin) by calling .setSocketTimeout on the RestHighLevelClient. The socket timeout should be set during client creation, something like .setSocketTimeout(OpenSearchConfig().getClientSocketTimeout());
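To illustrate the mechanism behind the proposed fix, here is a minimal, self-contained sketch using plain java.net: a client socket with setSoTimeout aborts a blocked read with SocketTimeoutException, which is the same class of failure as the 30,000-millisecond timeouts reported later in this thread. The class name, port, and timeout values here are arbitrary illustrations; in the actual REST client the equivalent setting would presumably be applied at builder time (e.g. via a request-config callback), which is an assumption about where the fix would land, not a confirmed patch.

```java
import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

public class SocketTimeoutDemo {
    // Connects to a local server that accepts the connection but never writes,
    // then blocks on read() with an explicit socket timeout. Returns true if
    // the read aborts with SocketTimeoutException after roughly timeoutMillis.
    public static boolean readTimesOut(int timeoutMillis) throws IOException {
        try (ServerSocket silentServer = new ServerSocket(0); // bind to any free port
             Socket client = new Socket("localhost", silentServer.getLocalPort())) {
            // Analogous to setting a socket timeout on the REST client at creation time.
            client.setSoTimeout(timeoutMillis);
            try {
                client.getInputStream().read(); // no data will ever arrive
                return false;                   // unreachable in practice
            } catch (SocketTimeoutException expected) {
                return true; // the read was aborted instead of hanging forever
            }
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("read timed out: " + readTimesOut(200));
    }
}
```

Without an explicit timeout, the same read would block indefinitely on a stalled connection, which matches the hang behavior discussed above; with a timeout configured at client creation, the failure surfaces promptly and can be retried or reported.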

prudhvigodithi avatar Jul 12 '22 15:07 prudhvigodithi

@prudhvigodithi Good finding! If you'd like to PR the fix, we'd love to review. Thanks!

cliu123 avatar Jul 12 '22 16:07 cliu123

Hi, any updates? Thank you!

elkh510 avatar Jul 19 '22 15:07 elkh510

I'm having a similar problem. I'm trying to bootstrap a 2.2.0 cluster from scratch, and when running securityadmin I get:

OPENSEARCH_JAVA_HOME=/opt/opensearch/opensearch-2.2.0/jdk/ bash -x /opt/opensearch/opensearch-2.2.0/plugins/opensearch-security/tools/securityadmin.sh -cd /etc/opensearch/opensearch-security/ -icl -nhnv -cacert /etc/opensearch-pki/root-ca.pem -cert /etc/opensearch-pki/client-CLUSTERADMIN.pem -key /etc/opensearch-pki/client-CLUSTERADMIN-key.pem
+ echo '**************************************************************************'
**************************************************************************
+ echo '** This tool will be deprecated in the next major release of OpenSearch **'
** This tool will be deprecated in the next major release of OpenSearch **
+ echo '** https://github.com/opensearch-project/security/issues/1755           **'
** https://github.com/opensearch-project/security/issues/1755           **
+ echo '**************************************************************************'
**************************************************************************
+ SCRIPT_PATH=/opt/opensearch/opensearch-2.2.0/plugins/opensearch-security/tools/securityadmin.sh
++ command -v realpath
+ '[' -x /usr/bin/realpath ']'
++++ realpath /opt/opensearch/opensearch-2.2.0/plugins/opensearch-security/tools/securityadmin.sh
+++ dirname /opt/opensearch/opensearch-2.2.0/plugins/opensearch-security/tools/securityadmin.sh
++ cd /opt/opensearch/opensearch-2.2.0/plugins/opensearch-security/tools
++ pwd -P
+ DIR=/opt/opensearch/opensearch-2.2.0/plugins/opensearch-security/tools
+ BIN_PATH=java
+ '[' '!' -z /opt/opensearch/opensearch-2.2.0/jdk/ ']'
+ BIN_PATH=/opt/opensearch/opensearch-2.2.0/jdk//bin/java
+ /opt/opensearch/opensearch-2.2.0/jdk//bin/java -Dorg.apache.logging.log4j.simplelog.StatusLogger.level=OFF -cp '/opt/opensearch/opensearch-2.2.0/plugins/opensearch-security/tools/../*:/opt/opensearch/opensearch-2.2.0/plugins/opensearch-security/tools/../../../lib/*:/opt/opensearch/opensearch-2.2.0/plugins/opensearch-security/tools/../deps/*' org.opensearch.security.tools.SecurityAdmin -cd /etc/opensearch/opensearch-security/ -icl -nhnv -cacert /etc/opensearch-pki/root-ca.pem -cert /etc/opensearch-pki/client-CLUSTERADMIN.pem -key /etc/opensearch-pki/client-CLUSTERADMIN-key.pem
Security Admin v7
Will connect to localhost:9200 ... done
Connected as "CN=CLIENT-CLUSTERADMIN"
OpenSearch Version: 2.2.0
Contacting opensearch cluster 'opensearch' and wait for YELLOW clusterstate ...
Clustername: foobar-ote
Clusterstate: GREEN
Number of nodes: 1
Number of data nodes: 1
.opendistro_security index already exists, so we do not need to create one.
Legacy index '.opendistro_security' (ES 6) detected (or forced). You should migrate the configuration!
Populate config from /etc/opensearch/opensearch-security/
Will update '/config' with /etc/opensearch/opensearch-security/config.yml (legacy mode)
   SUCC: Configuration for 'config' created or updated
Will update '/roles' with /etc/opensearch/opensearch-security/roles.yml (legacy mode)
   SUCC: Configuration for 'roles' created or updated
Will update '/rolesmapping' with /etc/opensearch/opensearch-security/roles_mapping.yml (legacy mode)
   SUCC: Configuration for 'rolesmapping' created or updated
Will update '/internalusers' with /etc/opensearch/opensearch-security/internal_users.yml (legacy mode)
   SUCC: Configuration for 'internalusers' created or updated
Will update '/actiongroups' with /etc/opensearch/opensearch-security/action_groups.yml (legacy mode)
   SUCC: Configuration for 'actiongroups' created or updated
Will update '/nodesdn' with /etc/opensearch/opensearch-security/nodes_dn.yml (legacy mode)
   SUCC: Configuration for 'nodesdn' created or updated
Will update '/whitelist' with /etc/opensearch/opensearch-security/whitelist.yml (legacy mode)
   SUCC: Configuration for 'whitelist' created or updated
Will update '/audit' with /etc/opensearch/opensearch-security/audit.yml (legacy mode)
   SUCC: Configuration for 'audit' created or updated
FAIL: 1 nodes reported failures. Failure is /{"_nodes":{"total":1,"successful":0,"failed":1,"failures":[{"type":"failed_node_exception","reason":"Failed node [MerZFlm7TM-pyIlvp2QKwA]","node_id":"MerZFlm7TM-pyIlvp2QKwA","caused_by":{"type":"static_resource_exception","reason":"Unable to load static tenants"}}]},"cluster_name":"foobar-ote","configupdate_response":{"nodes":{},"node_size":0,"has_failures":true,"failures_size":1}}
FAIL: Expected 1 nodes to return response, but got 0
Done with failures
sh -x foo  3.35s user 0.17s system 156% cpu 2.252 total

The script finishes in ~3 seconds, so I'm not sure if it's related to a timeout. In my case it's run on Ubuntu, without k8s or anything. Is there an obvious fix I'm missing here?

Freeaqingme avatar Aug 17 '22 14:08 Freeaqingme

@Freeaqingme Have you examined the local opensearch log? Exceptions/errors should be visible to help you troubleshoot what is at issue.

peternied avatar Aug 17 '22 17:08 peternied

Actually, I was about to update this. It's possible that this specific node was installed with 2.1.0 and upgraded to 2.2.0 right after. I did follow the logs, but the only exception I noticed was "Unable to load static tenants". Having said that, I removed its data directory and afterwards everything ran just fine. So in this case, that was one way of solving the problem...

Freeaqingme avatar Aug 17 '22 19:08 Freeaqingme

For me, this occurred on a fresh installation itself, so it has nothing to do with older-version nodes ;). Just following up to see if there is any update on this issue? @peternied @cliu123 Thank you!

prudhvigodithi avatar Sep 16 '22 15:09 prudhvigodithi

Hello All, I have tried to explain my scenario in points.

  1. Installed opensearch and opensearch dashboards - version 2.2.1 using yum
  2. Was able to start both opensearch and opensearch dashboards without any issues
  3. Accessed OS dashboards using Chrome and even data ingestion from metric beat was successful
  4. As it was single node cluster, tried building a cluster formation with one cluster-master node and two data nodes.
  5. After making required changes as listed in the documentation, I was able to start the cluster-master.
  6. But when running a curl command against the cluster, I was getting "OpenSearch Security not initialized"
  7. So, per my research, I followed the steps below to sort out the securityadmin issue

I request the group's help to confirm whether this is a known issue and whether any workaround is in place.

Execution of securityadmin leads to the following errors: ./securityadmin.sh -cd /etc/opensearch/opensearch-security -rev -icl -nhnv -cacert ../../../config/root-ca.pem -cert ../../../config/kirk.pem -key ../../../config/kirk-key.pem -h <private-IP> -p 9200 --accept-red-cluster

Logs from security admin execution:

Security Admin v7
Will connect to 10.1.4.117:9200 ... done
Connected as "CN=kirk,OU=client,O=client,L=test,C=de"
OpenSearch Version: 2.2.1
Contacting opensearch cluster 'opensearch' ...
Clustername: opensearch-cluster
Clusterstate: RED
Number of nodes: 1
Number of data nodes: 0
.opendistro_security index already exists, so we do not need to create one.
ERR: .opendistro_security index state is RED.
Legacy index '.opendistro_security' (ES 6) detected (or forced). You should migrate the configuration!
Populate config from /etc/opensearch/opensearch-security/
Will update '/config' with /etc/opensearch/opensearch-security/config.yml (legacy mode)
   FAIL: Configuration for 'config' failed because of java.net.SocketTimeoutException: 30,000 milliseconds timeout on connection http-outgoing-6 [ACTIVE]
Will update '/roles' with /etc/opensearch/opensearch-security/roles.yml (legacy mode)
   FAIL: Configuration for 'roles' failed because of java.net.SocketTimeoutException: 30,000 milliseconds timeout on connection http-outgoing-7 [ACTIVE]
Will update '/rolesmapping' with /etc/opensearch/opensearch-security/roles_mapping.yml (legacy mode)
   FAIL: Configuration for 'rolesmapping' failed because of java.net.SocketTimeoutException: 30,000 milliseconds timeout on connection http-outgoing-8 [ACTIVE]
Will update '/internalusers' with /etc/opensearch/opensearch-security/internal_users.yml (legacy mode)
   FAIL: Configuration for 'internalusers' failed because of java.net.SocketTimeoutException: 30,000 milliseconds timeout on connection http-outgoing-9 [ACTIVE]
Will update '/actiongroups' with /etc/opensearch/opensearch-security/action_groups.yml (legacy mode)
   FAIL: Configuration for 'actiongroups' failed because of java.net.SocketTimeoutException: 30,000 milliseconds timeout on connection http-outgoing-10 [ACTIVE]
Will update '/nodesdn' with /etc/opensearch/opensearch-security/nodes_dn.yml (legacy mode)
   FAIL: Configuration for 'nodesdn' failed because of java.net.SocketTimeoutException: 30,000 milliseconds timeout on connection http-outgoing-11 [ACTIVE]
Will update '/whitelist' with /etc/opensearch/opensearch-security/whitelist.yml (legacy mode)
   FAIL: Configuration for 'whitelist' failed because of java.net.SocketTimeoutException: 30,000 milliseconds timeout on connection http-outgoing-12 [ACTIVE]
Will update '/audit' with /etc/opensearch/opensearch-security/audit.yml (legacy mode)
   FAIL: Configuration for 'audit' failed because of java.net.SocketTimeoutException: 30,000 milliseconds timeout on connection http-outgoing-13 [ACTIVE]
ERR: cannot upload configuration, see errors above

kcharikrish avatar Sep 16 '22 15:09 kcharikrish

[TRIAGE] @peternied, can you follow up on this issue to confirm it remains? Thank you.

stephen-crawford avatar Oct 17 '22 19:10 stephen-crawford

@kcharikrish To capture the discussion about your issue, I've created https://github.com/opensearch-project/security/issues/2173 so we can close out this older issue.

peternied avatar Oct 17 '22 20:10 peternied

Hey @peternied, I would suggest keeping this issue open and tracking progress here, as there are a lot of good points/discussions from multiple users. Also, the concern raised by @kcharikrish is the same one raised earlier by myself and others; the only difference I see is that the installation is via yum rather than Docker or k8s, but the error is the same. Long story short, let's keep this issue open until we have a fix. Thank you!

prudhvigodithi avatar Oct 17 '22 20:10 prudhvigodithi

@prudhvigodithi There is a lot of discussion on this issue, making it hard to understand which problems need resolution, which were incidental, and what remains. What scenario is broken that you think needs to be resolved?

peternied avatar Oct 17 '22 21:10 peternied

@peternied The issue is still the same as stated in https://github.com/opensearch-project/security/issues/1898#issue-1279185390, https://github.com/opensearch-project/security/issues/1898#issuecomment-1169191535, https://github.com/opensearch-project/security/issues/1898#issuecomment-1218092220, and https://github.com/opensearch-project/security/issues/1898#issuecomment-1249530839. I have proposed a solution here https://github.com/opensearch-project/security/issues/1898#issuecomment-1181936007, which could help fix the issue.

There is a lot of discussion on this issue making it hard to understand

In one way, it's good to have a lot of discussion: it helps validate the scenarios once we have a solution. Ultimately, I'm just trying to keep this issue open so everyone facing this problem is aware once it is fixed. I'm fine with however you want to proceed. Thanks!

prudhvigodithi avatar Oct 18 '22 15:10 prudhvigodithi