opensearch-k8s-operator
[BUG] OpenSearch operator panics and crashes when adding an OpenSearchISMPolicy
What is the bug?
When an OpenSearchISMPolicy is added while the OpenSearch cluster is still being created, the controller panics, resulting in a container crash:
2024-05-06T18:19:54.202Z INFO Reconciling OpensearchISMPolicy {"controller": "opensearchismpolicy", "controllerGroup": "opensearch.opster.io", "controllerKind": "OpenSearchISMPolicy", "OpenSearchISMPolicy": {"name":"sample-policy","namespace":"test"}, "namespace": "test", "name": "sample-policy", "reconcileID": "adc1b967-662a-42d0-9c17-95e048ad0ad6", "tenant": {"name":"sample-policy","namespace":"test"}}
2024-05-06T18:19:54.279Z DEBUG events error creating opensearch client {"type": "Warning", "object": {"kind":"OpenSearchISMPolicy","namespace":"test","name":"sample-policy","uid":"abab26b9-2ca0-4882-a167-4cf37994dcb9","apiVersion":"opensearch.opster.io/v1","resourceVersion":"463314"}, "reason": "OpensearchError"}
2024-05-06T18:19:54.284Z INFO Observed a panic in reconciler: runtime error: invalid memory address or nil pointer dereference {"controller": "opensearchismpolicy", "controllerGroup": "opensearch.opster.io", "controllerKind": "OpenSearchISMPolicy", "OpenSearchISMPolicy": {"name":"sample-policy","namespace":"test"}, "namespace": "test", "name": "sample-policy", "reconcileID": "adc1b967-662a-42d0-9c17-95e048ad0ad6"}
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x10 pc=0x11f2d64]
goroutine 442 [running]:
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile.func1()
/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:115 +0x1a4
panic({0x141dec0?, 0x27073d0?})
/usr/local/go/src/runtime/panic.go:770 +0x124
github.com/Opster/opensearch-k8s-operator/opensearch-operator/opensearch-gateway/services.(*OsClusterClient).GetISMConfig(0x0, {0x18fcd30, 0x4000e77dd0}, {0x4000c5a410?, 0x0?})
/workspace/opensearch-gateway/services/os_client.go:314 +0x44
github.com/Opster/opensearch-k8s-operator/opensearch-operator/opensearch-gateway/services.PolicyExists({0x18fcd30?, 0x4000e77dd0?}, 0x4001436700?, {0x4000c5a410?, 0x7?})
/workspace/opensearch-gateway/services/os_ism_service.go:31 +0x4c
github.com/Opster/opensearch-k8s-operator/opensearch-operator/pkg/reconcilers.(*IsmPolicyReconciler).Reconcile(0x40008d5d00)
/workspace/pkg/reconcilers/ismpolicy.go:159 +0x72c
github.com/Opster/opensearch-k8s-operator/opensearch-operator/controllers.(*OpensearchISMPolicyReconciler).Reconcile(0x400051abe0, {0x18fcd30, 0x4000e77dd0}, {{{0x4001558638, 0x4}, {0x4001558640, 0xd}}})
/workspace/controllers/opensearchism_controller.go:53 +0x2ec
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile(0x18fcd30?, {0x18fcd30?, 0x4000e77dd0?}, {{{0x4001558638?, 0x1348fc0?}, {0x4001558640?, 0x4000677e08?}}})
/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:118 +0x8c
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0x400028a640, {0x18fcd68, 0x400051b630}, {0x149b600, 0x40002689e0})
/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:314 +0x294
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0x400028a640, {0x18fcd68, 0x400051b630})
/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:265 +0x198
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2()
/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:226 +0x74
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2 in goroutine 129
/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:222 +0x404
The operator pod will crash several times and then continue running.
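The stack trace suggests what happened: creating the OpenSearch client failed (the "error creating opensearch client" event), but the reconciler went on to call GetISMConfig on the resulting nil *OsClusterClient, and dereferencing a field on the nil receiver produced the SIGSEGV. Below is a minimal, self-contained Go sketch of that failure pattern; the type and function names are illustrative, not the operator's actual code.

package main

import "fmt"

type clusterClient struct {
	baseURL string
}

func (c *clusterClient) getISMConfig(policyID string) string {
	// c is nil when client creation failed upstream; reading c.baseURL
	// dereferences a nil pointer, mirroring the panic at os_client.go:314.
	return c.baseURL + "/_plugins/_ism/policies/" + policyID
}

func newClient(clusterReady bool) (*clusterClient, error) {
	if !clusterReady {
		return nil, fmt.Errorf("cluster not reachable")
	}
	return &clusterClient{baseURL: "https://my-first-cluster:9200"}, nil
}

func main() {
	client, err := newClient(false)
	if err != nil {
		// The error is logged (the "error creating opensearch client" event)
		// but processing continues with the nil client anyway.
		fmt.Println("error creating opensearch client:", err)
	}
	fmt.Println(client.getISMConfig("sample-policy")) // panics: nil pointer dereference
}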
How can one reproduce the bug?
1. Install the operator:
helm install opensearch-operator opensearch-operator/opensearch-operator --version 2.6.0 -n test
2. Create an OpenSearch cluster using kubectl apply. This is the cluster definition I used:
apiVersion: opensearch.opster.io/v1
kind: OpenSearchCluster
metadata:
  name: my-first-cluster
  namespace: test
spec:
  general:
    serviceName: my-first-cluster
    version: 2.11.1
  dashboards:
    enable: false
    version: 2.11.1
    replicas: 0
  nodePools:
    - component: nodes
      replicas: 3
      diskSize: "5Gi"
      nodeSelector:
      resources:
        requests:
          memory: "1Gi"
          cpu: "500m"
        limits:
          memory: "1Gi"
          cpu: "500m"
      roles:
        - "cluster_manager"
        - "data"
3. Apply the following ISM policy using kubectl apply:
apiVersion: opensearch.opster.io/v1
kind: OpenSearchISMPolicy
metadata:
  name: sample-policy
  namespace: test
spec:
  opensearchCluster:
    name: my-first-cluster
  description: Sample policy
  policyId: sample-policy
  defaultState: hot
  states:
    - name: hot
      actions:
        - replicaCount:
            numberOfReplicas: 4
      transitions:
        - stateName: warm
          conditions:
            minIndexAge: "10d"
    - name: warm
      actions:
        - replicaCount:
            numberOfReplicas: 2
      transitions:
        - stateName: delete
          conditions:
            minIndexAge: "30d"
    - name: delete
      actions:
        - delete: {}
At this point, the operator pod exits with the panic shown above.
What is the expected behavior?
Expected the ISM policy to be added without an issue.
What is your host/environment?
- Kubernetes 1.25
- OpenSearch 2.11.1
- OpenSearch operator 2.6.0
Do you have any additional context?
If I do step 2 above, wait for the OpenSearch cluster to finish being created (i.e. all 3 nodes reach a running state and the cluster health is green), and then do step 3 (add the ISM policy), the panic does not happen. But if I do step 3 immediately after step 2, the operator panics and crashes several times. However, when resources are applied through a deployment pipeline, we cannot control the delay between them.
Hi @nilushancosta. Thanks for reporting this. This is clearly a bug; the operator should simply wait if the cluster is not yet reachable.
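For illustration, one way to read "just wait" in controller-runtime terms is to requeue the reconcile when the client cannot be created, instead of proceeding with a nil client. This is a hedged sketch with hypothetical names (createClientForCluster, reconcileISMPolicy), not a patch against the operator's actual API:

package reconcilers

import (
	"context"
	"errors"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

// OsClusterClient stands in for the operator's OpenSearch client type.
type OsClusterClient struct{}

// createClientForCluster stands in for however the operator builds the
// client; here it fails while the cluster is still coming up.
func createClientForCluster(ctx context.Context) (*OsClusterClient, error) {
	return nil, errors.New("cluster not reachable yet")
}

// reconcileISMPolicy guards against the nil client before touching the ISM
// API, requeueing so the reconcile is retried once the cluster is up.
func reconcileISMPolicy(ctx context.Context) (ctrl.Result, error) {
	osClient, err := createClientForCluster(ctx)
	if err != nil || osClient == nil {
		// Instead of calling GetISMConfig on a nil client (today's panic),
		// back off and let controller-runtime retry the reconcile later.
		return ctrl.Result{RequeueAfter: 30 * time.Second}, nil
	}
	// ... proceed with PolicyExists / create / update using osClient ...
	return ctrl.Result{}, nil
}

With a guard like this, applying the OpenSearchISMPolicy immediately after the cluster manifest would simply delay the policy creation until the cluster is reachable, rather than crashing the operator pod.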