opensearch-k8s-operator
Failure to downsize nodepools
Let's say a cluster is up and running with 1 master node. Now I want to increase the cluster to 3 master nodes, so I set the CRD node pool's replicas to 3. The operator will try to add one master node at a time and will stay stuck forever, because the second master node will never become ready due to a split-brain issue.
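For reference, here is roughly what that nodePool change looks like in the cluster spec (a minimal sketch; the cluster name, namespace and disk size are illustrative assumptions, only replicas is the field being changed):

# Hypothetical excerpt of an OpenSearchCluster resource. Only the replicas
# value of the master node pool changes from 1 to 3; the operator then adds
# one master pod at a time.
apiVersion: opensearch.opster.io/v1
kind: OpenSearchCluster
metadata:
  name: my-cluster          # hypothetical name
  namespace: my-namespace   # hypothetical namespace
spec:
  nodePools:
    - component: masters
      replicas: 3           # previously 1
      roles:
        - master
        - data
      diskSize: "30Gi"      # illustrative value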
As we do not (yet) officially support single-node clusters, I think this is more an enhancement request than a bug. And personally I think this is an edge case that we should not support, as IMO we also should not support single-node clusters.
I just tried increasing from 3 to 4 masters, and it worked.
[2022-08-19T08:42:07,177][INFO ][o.o.c.s.ClusterApplierService] [testos2-masters-3] master node changed {previous [], current [{testos2-masters-0}{XG6XWKeBQ76eCKVXcu9r4w}{NTmNue6dTsGJuiogVsDcHw}{testos2-masters-0}{10.92.4.209:9300}{dm}{shard_indexing_pressure_enabled=true}]}, added {{testos2-masters-2}{H7E9eVHWQHGIQyIPahCxfA}{QTZ0QJyuSCm1OlNuQR3uXQ}{testos2-masters-2}{10.92.0.69:9300}{dm}{shard_indexing_pressure_enabled=true},{testos2-masters-1}{tGEMPaRkSbG7VVnGdy22SQ}{1_ZtUToGTXyYJ1jmMc2Ihw}{testos2-masters-1}{10.92.5.26:9300}{dm}{shard_indexing_pressure_enabled=true},{testos2-masters-0}{XG6XWKeBQ76eCKVXcu9r4w}{NTmNue6dTsGJuiogVsDcHw}{testos2-masters-0}{10.92.4.209:9300}{dm}{shard_indexing_pressure_enabled=true}}, term: 3, version: 42, reason: ApplyCommitRequest{term=3, version=42, sourceNode={testos2-masters-0}{XG6XWKeBQ76eCKVXcu9r4w}{NTmNue6dTsGJuiogVsDcHw}{testos2-masters-0}{10.92.4.209:9300}{dm}{shard_indexing_pressure_enabled=true}}
[2022-08-19T08:42:07,279][INFO ][o.o.c.s.ClusterSettings ] [testos2-masters-3] updating [plugins.index_state_management.metadata_migration.status] from [0] to [1]
[2022-08-19T08:42:07,279][INFO ][o.o.c.s.ClusterSettings ] [testos2-masters-3] updating [plugins.index_state_management.template_migration.control] from [0] to [-1]
[2022-08-19T08:42:07,372][INFO ][o.o.a.c.HashRing ] [testos2-masters-3] Node added: [yQCCGJQoRbyhFUQpezoqPg, tGEMPaRkSbG7VVnGdy22SQ, H7E9eVHWQHGIQyIPahCxfA, XG6XWKeBQ76eCKVXcu9r4w]
[2022-08-19T08:42:07,378][INFO ][o.o.a.c.ADClusterEventListener] [testos2-masters-3] Cluster node changed, node removed: false, node added: true
[2022-08-19T08:42:07,379][INFO ][o.o.a.c.HashRing ] [testos2-masters-3] AD version hash ring change is in progress. Can't build hash ring for node delta event.
[2022-08-19T08:42:07,379][INFO ][o.o.a.c.ADClusterEventListener] [testos2-masters-3] Hash ring build result: false
[2022-08-19T08:42:07,573][INFO ][o.o.h.AbstractHttpServerTransport] [testos2-masters-3] publish_address {testos2-masters-3/10.92.1.4:9200}, bound_addresses {0.0.0.0:9200}
[2022-08-19T08:42:07,573][INFO ][o.o.n.Node ] [testos2-masters-3] started
Conversely, is it possible to decrease the number of nodes? I tried applying a decreased number of nodes, but the statefulset is not updated, thus Kubernetes will not reduce the number of nodes.
Hi @dickescheid:
I tried applying a decreased number of nodes, but the statefulset is not updated, thus kubernetes will not reduce the number of nodes.
Could you please check the logs of the operator itself for any reported errors when doing the scale down?
I tried to recreate your problem by creating a 4-node cluster and then resizing it to 3, and got an error about "failed to create os client". It would be interesting to see if that is only my local test setup or if that could also be the source of your problem.
I tried scaling down.
TL;DR: yes
I'm scaling down from 3 to 2 master/data nodes. The statefulset also does not reflect my wishes and commands:
NAME READY AGE
opensearch-masters 3/3 27h
opensearch-nodes 0/0 27h
NAME READY STATUS RESTARTS AGE
opensearch-dashboards-667fbdc95c-d2hsb 1/1 Running 1 (27h ago) 27h
opensearch-masters-0 1/1 Running 0 25h
opensearch-masters-1 1/1 Running 0 11h
opensearch-masters-2 1/1 Running 0 4h11m
opensearch-securityconfig-update-xhs96 0/1 Completed 0 27h
Edit: if you are wondering about the different lifetimes, they are running on preemptible nodes.
These are the logs from the operator.
2022-08-25T16:20:18+02:00 0 repository-s3
2022-08-25T16:20:18+02:00 1 https://github.com/aiven/prometheus-exporter-plugin-for-opensearch/releases/download/2.2.0.0/prometheus-exporter-2.2.0.0.zip
2022-08-25T16:20:18+02:00 1.6614372182985618e+09 DEBUG controller.opensearchcluster resource is in sync {"reconciler group": "opensearch.opster.io", "reconciler kind": "OpenSearchCluster", "name": "opensearch", "namespace": "opensearch", "reconciler": "cluster", "name": "opensearch-nodes", "namespace": "opensearch", "apiVersion": "v1", "kind": "Service"}
2022-08-25T16:20:18+02:00 1.6614372182987397e+09 INFO controller.opensearchcluster The existing statefulset VolumeClaimTemplate disk size is: 30Gi {"reconciler group": "opensearch.opster.io", "reconciler kind": "OpenSearchCluster", "name": "opensearch", "namespace": "opensearch"}
2022-08-25T16:20:18+02:00 1.661437218298763e+09 INFO controller.opensearchcluster The cluster definition nodePool disk size is: 30Gi {"reconciler group": "opensearch.opster.io", "reconciler kind": "OpenSearchCluster", "name": "opensearch", "namespace": "opensearch"}
2022-08-25T16:20:18+02:00 1.661437218298767e+09 INFO controller.opensearchcluster The existing disk size 30Gi is same as passed in disk size 30Gi {"reconciler group": "opensearch.opster.io", "reconciler kind": "OpenSearchCluster", "name": "opensearch", "namespace": "opensearch"}
2022-08-25T16:20:18+02:00 1.6614372183051503e+09 DEBUG controller.opensearchcluster resource is in sync {"reconciler group": "opensearch.opster.io", "reconciler kind": "OpenSearchCluster", "name": "opensearch", "namespace": "opensearch", "reconciler": "cluster", "name": "opensearch-nodes", "namespace": "opensearch", "apiVersion": "apps/v1", "kind": "StatefulSet"}
2022-08-25T16:20:18+02:00 1.6614372183053544e+09 DEBUG events Normal {"object": {"kind":"OpenSearchCluster","namespace":"opensearch","name":"opensearch","uid":"aa21dd67-d031-4d26-af11-cd76daaf067c","apiVersion":"opensearch.opster.io/v1","resourceVersion":"207791264"}, "reason": "Scaler", "message": "Starting to scaling"}
2022-08-25T16:20:18+02:00 1.6614372183267283e+09 INFO controller.opensearchcluster service created successfully {"reconciler group": "opensearch.opster.io", "reconciler kind": "OpenSearchCluster", "name": "opensearch", "namespace": "opensearch"}
2022-08-25T16:20:18+02:00 1.66143721832778e+09 ERROR controller.opensearchcluster failed to create os client {"reconciler group": "opensearch.opster.io", "reconciler kind": "OpenSearchCluster", "name": "opensearch", "namespace": "opensearch", "error": "dial tcp 127.0.0.1:31613: connect: connection refused"}
2022-08-25T16:20:18+02:00 opensearch.opster.io/pkg/reconcilers.(*ScalerReconciler).reconcileNodePool
2022-08-25T16:20:18+02:00 /workspace/pkg/reconcilers/scaler.go:98
2022-08-25T16:20:18+02:00 opensearch.opster.io/pkg/reconcilers.(*ScalerReconciler).Reconcile
2022-08-25T16:20:18+02:00 /workspace/pkg/reconcilers/scaler.go:57
2022-08-25T16:20:18+02:00 opensearch.opster.io/controllers.(*OpenSearchClusterReconciler).reconcilePhaseRunning
2022-08-25T16:20:18+02:00 /workspace/controllers/opensearchController.go:326
2022-08-25T16:20:18+02:00 opensearch.opster.io/controllers.(*OpenSearchClusterReconciler).Reconcile
2022-08-25T16:20:18+02:00 /workspace/controllers/opensearchController.go:141
2022-08-25T16:20:18+02:00 sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
2022-08-25T16:20:18+02:00 /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:114
2022-08-25T16:20:18+02:00 sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
2022-08-25T16:20:18+02:00 /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:311
2022-08-25T16:20:18+02:00 sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
2022-08-25T16:20:18+02:00 /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:266
2022-08-25T16:20:18+02:00 sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
2022-08-25T16:20:18+02:00 /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:227
2022-08-25T16:20:18+02:00 1.661437218327896e+09 DEBUG events Warning {"object": {"kind":"OpenSearchCluster","namespace":"opensearch","name":"opensearch","uid":"aa21dd67-d031-4d26-af11-cd76daaf067c","apiVersion":"opensearch.opster.io/v1","resourceVersion":"207791264"}, "reason": "Scaler", "message": "Failed to create os client for scaling"}
2022-08-25T16:20:18+02:00 1.661437218370882e+09 ERROR controller.opensearchcluster Reconciler error {"reconciler group": "opensearch.opster.io", "reconciler kind": "OpenSearchCluster", "name": "opensearch", "namespace": "opensearch", "error": "dial tcp 127.0.0.1:31613: connect: connection refused"}
2022-08-25T16:20:18+02:00 sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
2022-08-25T16:20:18+02:00 /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:266
2022-08-25T16:20:18+02:00 sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
2022-08-25T16:20:18+02:00 /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:227
2022-08-25T16:20:18+02:00 1.6614372183709502e+09 INFO controller.opensearchcluster Reconciling OpenSearchCluster {"reconciler group": "opensearch.opster.io", "reconciler kind": "OpenSearchCluster", "name": "opensearch", "namespace": "opensearch", "cluster": "opensearch/opensearch"}
2022-08-25T16:20:18+02:00 1.6614372183804903e+09 INFO controller.opensearchcluster Generating certificates {"reconciler group": "opensearch.opster.io", "reconciler kind": "OpenSearchCluster", "name": "opensearch", "namespace": "opensearch", "interface": "transport"}
2022-08-25T16:20:18+02:00 1.6614372183805695e+09 INFO controller.opensearchcluster Generating certificates {"reconciler group": "opensearch.opster.io", "reconciler kind": "OpenSearchCluster", "name": "opensearch", "namespace": "opensearch", "interface": "http"}
2022-08-25T16:20:18+02:00 1.6614372183812106e+09 DEBUG controller.opensearchcluster resource is in sync {"reconciler group": "opensearch.opster.io", "reconciler kind": "OpenSearchCluster", "name": "opensearch", "namespace": "opensearch", "reconciler": "configuration", "name": "opensearch-config", "namespace": "opensearch", "apiVersion": "v1", "kind": "ConfigMap"}
2022-08-25T16:20:18+02:00 1.6614372183821986e+09 DEBUG controller.opensearchcluster resource is in sync {"reconciler group": "opensearch.opster.io", "reconciler kind": "OpenSearchCluster", "name": "opensearch", "namespace": "opensearch", "reconciler": "cluster", "name": "opensearch", "namespace": "opensearch", "apiVersion": "v1", "kind": "Service"}
2022-08-25T16:20:18+02:00 1.6614372183830593e+09 DEBUG controller.opensearchcluster resource is in sync {"reconciler group": "opensearch.opster.io", "reconciler kind": "OpenSearchCluster", "name": "opensearch", "namespace": "opensearch", "reconciler": "cluster", "name": "opensearch-discovery", "namespace": "opensearch", "apiVersion": "v1", "kind": "Service"}
2022-08-25T16:20:18+02:00 1.6614372183836598e+09 DEBUG controller.opensearchcluster resource diff {"reconciler group": "opensearch.opster.io", "reconciler kind": "OpenSearchCluster", "name": "opensearch", "namespace": "opensearch", "reconciler": "cluster", "name": "opensearch-admin-password", "namespace": "opensearch", "apiVersion": "v1", "kind": "Secret"}
2022-08-25T16:20:18+02:00 1.6614372183839364e+09 DEBUG controller.opensearchcluster updating resource {"reconciler group": "opensearch.opster.io", "reconciler kind": "OpenSearchCluster", "name": "opensearch", "namespace": "opensearch", "reconciler": "cluster", "name": "opensearch-admin-password", "namespace": "opensearch", "apiVersion": "v1", "kind": "Secret"}
2022-08-25T16:20:18+02:00 1.6614372183899963e+09 DEBUG controller.opensearchcluster resource updated {"reconciler group": "opensearch.opster.io", "reconciler kind": "OpenSearchCluster", "name": "opensearch", "namespace": "opensearch", "reconciler": "cluster", "name": "opensearch-admin-password", "namespace": "opensearch", "apiVersion": "v1", "kind": "Secret"}
2022-08-25T16:20:18+02:00 0 repository-s3
Thanks for testing @dickescheid. This proves my theory. When downsizing the cluster, the operator first drains and excludes the OpenSearch node about to be removed, and for this it uses the OpenSearch REST API. Currently the code is a bit of a mix: some calls connect to OpenSearch using the cluster-internal service DNS name, while others create a NodePort service and connect through that, and it is those calls that fail. I'll open a PR to move all calls to use the cluster DNS name; that should fix the problem.
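To illustrate, the cluster-internal Service the operator already reconciles (the opensearch Service visible in your logs) looks roughly like the sketch below; the selector label is an assumption. With the change, the operator's REST calls would target https://opensearch.opensearch.svc.cluster.local:9200 via that Service instead of a dynamically created NodePort (the refused connection to 127.0.0.1:31613 in your logs).

# Minimal sketch of the cluster-internal HTTP Service, for illustration only.
apiVersion: v1
kind: Service
metadata:
  name: opensearch
  namespace: opensearch
spec:
  selector:
    opster.io/opensearch-cluster: opensearch   # assumed pod label
  ports:
    - name: http
      port: 9200          # OpenSearch HTTP port
      targetPort: 9200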
nice, thanks!
Should be fixed with 2.1.0. Closing as completed.