
[BUG] OpenSearch operator panics and crashes when adding an OpenSearchISMPolicy

Open · nilushancosta opened this issue 9 months ago · 1 comment

What is the bug?

When an OpenSearchISMPolicy is added while the OpenSearch cluster is still being created, the controller panics, resulting in a container crash:

2024-05-06T18:19:54.202Z	INFO	Reconciling OpensearchISMPolicy	{"controller": "opensearchismpolicy", "controllerGroup": "opensearch.opster.io", "controllerKind": "OpenSearchISMPolicy", "OpenSearchISMPolicy": {"name":"sample-policy","namespace":"test"}, "namespace": "test", "name": "sample-policy", "reconcileID": "adc1b967-662a-42d0-9c17-95e048ad0ad6", "tenant": {"name":"sample-policy","namespace":"test"}}
2024-05-06T18:19:54.279Z	DEBUG	events	error creating opensearch client	{"type": "Warning", "object": {"kind":"OpenSearchISMPolicy","namespace":"test","name":"sample-policy","uid":"abab26b9-2ca0-4882-a167-4cf37994dcb9","apiVersion":"opensearch.opster.io/v1","resourceVersion":"463314"}, "reason": "OpensearchError"}
2024-05-06T18:19:54.284Z	INFO	Observed a panic in reconciler: runtime error: invalid memory address or nil pointer dereference	{"controller": "opensearchismpolicy", "controllerGroup": "opensearch.opster.io", "controllerKind": "OpenSearchISMPolicy", "OpenSearchISMPolicy": {"name":"sample-policy","namespace":"test"}, "namespace": "test", "name": "sample-policy", "reconcileID": "adc1b967-662a-42d0-9c17-95e048ad0ad6"}
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
	panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x10 pc=0x11f2d64]

goroutine 442 [running]:
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile.func1()
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:115 +0x1a4
panic({0x141dec0?, 0x27073d0?})
	/usr/local/go/src/runtime/panic.go:770 +0x124
github.com/Opster/opensearch-k8s-operator/opensearch-operator/opensearch-gateway/services.(*OsClusterClient).GetISMConfig(0x0, {0x18fcd30, 0x4000e77dd0}, {0x4000c5a410?, 0x0?})
	/workspace/opensearch-gateway/services/os_client.go:314 +0x44
github.com/Opster/opensearch-k8s-operator/opensearch-operator/opensearch-gateway/services.PolicyExists({0x18fcd30?, 0x4000e77dd0?}, 0x4001436700?, {0x4000c5a410?, 0x7?})
	/workspace/opensearch-gateway/services/os_ism_service.go:31 +0x4c
github.com/Opster/opensearch-k8s-operator/opensearch-operator/pkg/reconcilers.(*IsmPolicyReconciler).Reconcile(0x40008d5d00)
	/workspace/pkg/reconcilers/ismpolicy.go:159 +0x72c
github.com/Opster/opensearch-k8s-operator/opensearch-operator/controllers.(*OpensearchISMPolicyReconciler).Reconcile(0x400051abe0, {0x18fcd30, 0x4000e77dd0}, {{{0x4001558638, 0x4}, {0x4001558640, 0xd}}})
	/workspace/controllers/opensearchism_controller.go:53 +0x2ec
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile(0x18fcd30?, {0x18fcd30?, 0x4000e77dd0?}, {{{0x4001558638?, 0x1348fc0?}, {0x4001558640?, 0x4000677e08?}}})
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:118 +0x8c
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0x400028a640, {0x18fcd68, 0x400051b630}, {0x149b600, 0x40002689e0})
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:314 +0x294
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0x400028a640, {0x18fcd68, 0x400051b630})
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:265 +0x198
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2()
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:226 +0x74
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2 in goroutine 129
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:222 +0x404

The operator pod will crash several times and then continue running.
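The stack trace points at the likely cause: the "error creating opensearch client" event is emitted, but the reconciler then calls `GetISMConfig` on a nil `*OsClusterClient` (note the `0x0` receiver in the trace), which dereferences a nil pointer. The following is a minimal, self-contained sketch of that failure pattern and the obvious guard; all type and function names here are illustrative stand-ins, not the operator's actual code.

```go
// Sketch of the failure pattern suggested by the stack trace: client
// construction fails while the cluster is booting, but the nil client is
// still used. Names (osClusterClient, newClient, reconcile) are hypothetical.
package main

import (
	"errors"
	"fmt"
)

// osClusterClient stands in for services.OsClusterClient.
type osClusterClient struct{ endpoint string }

// GetISMConfig dereferences its receiver; calling it on a nil
// *osClusterClient is what produces the SIGSEGV seen in the trace.
func (c *osClusterClient) GetISMConfig(policyID string) string {
	return fmt.Sprintf("config for %s from %s", policyID, c.endpoint)
}

// newClient simulates client creation failing while the cluster is unreachable.
func newClient(clusterReady bool) (*osClusterClient, error) {
	if !clusterReady {
		return nil, errors.New("error creating opensearch client")
	}
	return &osClusterClient{endpoint: "https://my-first-cluster:9200"}, nil
}

// reconcile shows the guard: bail out (letting controller-runtime requeue)
// instead of proceeding with a nil client.
func reconcile(clusterReady bool) error {
	client, err := newClient(clusterReady)
	if err != nil {
		return fmt.Errorf("cluster not reachable yet, requeueing: %w", err)
	}
	fmt.Println(client.GetISMConfig("sample-policy"))
	return nil
}

func main() {
	if err := reconcile(false); err != nil {
		fmt.Println("handled:", err) // no panic: the error is surfaced instead
	}
	_ = reconcile(true)
}
```

With the guard in place, a temporarily unreachable cluster yields a returned error (and a requeue) rather than a crash loop.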

How can one reproduce the bug?

  1. Install the operator:
helm install opensearch-operator opensearch-operator/opensearch-operator --version 2.6.0 -n test
  2. Create an OpenSearch cluster using kubectl apply. This is the cluster definition I used:
apiVersion: opensearch.opster.io/v1
kind: OpenSearchCluster
metadata:
  name: my-first-cluster
  namespace: test
spec:
  general:
    serviceName: my-first-cluster
    version: 2.11.1
  dashboards:
    enable: false
    version: 2.11.1
    replicas: 0
  nodePools:
    - component: nodes
      replicas: 3
      diskSize: "5Gi"
      nodeSelector:
      resources:
         requests:
            memory: "1Gi"
            cpu: "500m"
         limits:
            memory: "1Gi"
            cpu: "500m"
      roles:
        - "cluster_manager"
        - "data"
  3. Apply the following ISM policy using kubectl apply:
apiVersion: opensearch.opster.io/v1
kind: OpenSearchISMPolicy
metadata:
   name: sample-policy
   namespace: test
spec:
   opensearchCluster:
      name: my-first-cluster
   description: Sample policy
   policyId: sample-policy
   defaultState: hot
   states:
      - name: hot
        actions:
           - replicaCount:
                numberOfReplicas: 4
        transitions:
           - stateName: warm
             conditions:
                minIndexAge: "10d"
      - name: warm
        actions:
           - replicaCount:
                numberOfReplicas: 2
        transitions:
           - stateName: delete
             conditions:
                minIndexAge: "30d"
      - name: delete
        actions:
           - delete: {}

At this point, the operator pod exits with the panic shown above.
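The crash loop can be observed with standard kubectl commands; the label selector below is illustrative and may need adjusting to match the operator's actual pod labels in your installation:

```shell
# Watch the operator pod restart count climb (CrashLoopBackOff)
kubectl -n test get pods -w

# Inspect the panic from the previous (crashed) container instance
kubectl -n test logs --previous deploy/opensearch-operator-controller-manager
```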

What is the expected behavior?

Expected the ISM policy to be added without any issue.

What is your host/environment?

Kubernetes 1.25, OpenSearch 2.11.1, OpenSearch operator 2.6.0

Do you have any screenshots?

If applicable, add screenshots to help explain your problem.

Do you have any additional context?

If I perform step 2 above, wait for the OpenSearch cluster to finish being created (i.e. the 3 nodes reach a running state and the cluster health is green), and then perform step 3 (adding the ISM policy), the panic does not happen. But if I perform step 3 immediately after step 2, the operator panics and crashes several times.

However, when using deployment pipelines, we cannot control the delay between applying resources.
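Until the operator handles this itself, a pipeline can approximate the manual delay by gating the policy on the cluster pods becoming Ready. This is only a workaround sketch; the label selector and manifest filename below are assumptions, so check the labels the operator actually sets on your pods:

```shell
# Wait (up to 10 minutes) for all pods belonging to the cluster to be Ready.
# The label selector is an assumption; verify with: kubectl -n test get pods --show-labels
kubectl -n test wait pod \
  --selector=opster.io/opensearch-cluster=my-first-cluster \
  --for=condition=Ready --timeout=600s

# Only then apply the ISM policy (filename is illustrative)
kubectl -n test apply -f ism-policy.yaml
```

Note that pod readiness is a weaker condition than green cluster health, but it is usually enough to make the operator's client creation succeed.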

nilushancosta avatar May 06 '24 08:05 nilushancosta

Hi @nilushancosta. Thanks for reporting this. This is clearly a bug and the operator should just wait if the cluster is not yet correctly reachable.

swoehrl-mw avatar May 07 '24 12:05 swoehrl-mw