opensearch-k8s-operator

Operator > 2.0.0 fails to create Bootstrap Pod

Open edwardsmit opened this issue 2 years ago • 7 comments

Using the following ResourceDefinition

---
apiVersion: opensearch.opster.io/v1
kind: OpenSearchCluster
metadata:
  name: our-cluster
  namespace: our-namespace
spec:
  general:
    serviceName: open-search
    version: 2.2.0
    setVMMaxMapCount: true
  security:
    tls:
      transport:
        generate: true
        perNode: true
      http:
        generate: true
  dashboards:
    enable: true
    tls:
      enable: true
      generate: true
    version: 2.2.0
    replicas: 1
    resources:
      requests:
        memory: "512Mi"
        cpu: "200m"
      limits:
        memory: "512Mi"
        cpu: "200m"
  nodePools:
    - component: open-search-nodes
      replicas: 3
      diskSize: "5Gi"
      nodeSelector:
        cloud.google.com/gke-spot: "true"
      tolerations:
        - key: cloud.google.com/gke-spot
          operator: Equal
          value: "true"
          effect: NoSchedule
      resources:
        requests:
          memory: "2Gi"
          cpu: "500m"
        limits:
          memory: "2Gi"
          cpu: "500m"
      persistence:
        pvc:
          storageClass: standard
          accessModes:
            - ReadWriteOnce
      roles:
        - "data"
        - "master"

and operator versions 2.0.1, 2.0.2 or 2.0.3, the bootstrap pod isn't created and the operator fails to create a working OpenSearch cluster. With the 2.0.0 operator, the bootstrap pod is created successfully; however, the 2.0.0 operator does not support OpenSearch 2.x.

edwardsmit, Aug 17 '22 14:08

This is fixed in the latest release; please give it a try.

idanl21, Aug 18 '22 14:08

I can confirm that the bootstrap pod is now created as expected. Thanks!

edwardsmit, Aug 18 '22 15:08

Hi @idanl21 / @edwardsmit: I am trying to run the same manifest (without the tolerations/nodeSelector) locally in my kind cluster. I updated the operator Helm chart to version 2.0.4, but I still do not see the bootstrap pod. Could you please help me understand if I am missing something here?

madhukarmmallia-plivo, Aug 19 '22 06:08

@madhukarmmallia-plivo You're right. I was deploying a 1.3.2 version of OpenSearch (replace both 2.2.0 values with 1.3.2 in the above manifest). Deploying a brand new 2.2.0 cluster with the above manifest still does not work. Deploying a 1.3.2 cluster first and then upgrading it to 2.2.0 does work.

edwardsmit, Aug 19 '22 07:08

Thanks for confirming, @edwardsmit. I tried some debugging from my end. It looks like OpenSearch v2+ is not picking up the 'cluster_manager' node role, which is assigned to the bootstrap instance of OpenSearch.

@idanl21 Can we reopen this issue?

madhukarmmallia-plivo, Aug 19 '22 08:08

@madhukarmmallia-plivo I also did some checking. The bootstrap pod is only created if the cluster status is not initialized (https://github.com/Opster/opensearch-k8s-operator/blob/main/opensearch-operator/pkg/reconcilers/cluster.go#L81), and that is determined by whether all master pods are ready (https://github.com/Opster/opensearch-k8s-operator/blob/main/opensearch-operator/controllers/opensearchController.go#L249). To determine which pods are master pods, the operator checks the roles of the pods, and there is a switch in the operator code based on the version: for 1.x it uses master, for 2.x it uses cluster_manager. Thus, if you deploy a 2.x cluster but use the role master, the code directly sets Initialized=true and the bootstrap pod is never started.

I'm not sure if the role master is still valid for OpenSearch 2.x. If it is, we should extend the operator to treat both roles as indicating a master node; if not, we need to fix the example YAML and maybe even have the operator warn the user (e.g. in the status or via an event) that the master role cannot be used for a 2.x cluster.

swoehrl-mw, Sep 03 '22 09:09

I've just lost half a day because of this. Luckily I came across this ticket. Using the example and then replacing the role 'master' with 'cluster_manager' worked. Inclusion can be hard sometimes...

martin-schaefer, Sep 05 '22 10:09

The solution here is to change the master role to the cluster_manager role in the OpenSearch cluster YAML file. A fix PR and a 2.0 cluster example YAML file were uploaded and merged.

idanl21, Nov 07 '22 13:11
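
For anyone landing here later: the workaround reported in this thread comes down to a one-line change in the node pool roles. Below is a minimal sketch of the relevant excerpt from the manifest in the original report, with only the role name changed. It reflects what commenters here reported as working for a 2.x cluster, not an authoritative reference for the operator; all other fields of the original manifest are assumed to stay as they were.

```yaml
# Excerpt of spec.nodePools from the manifest in the original report.
# Only the second role differs; fields not shown here are unchanged.
nodePools:
  - component: open-search-nodes
    replicas: 3
    diskSize: "5Gi"
    roles:
      - "data"
      # Was "master". Per this thread, the operator looks for
      # "cluster_manager" on 2.x clusters when deciding whether
      # the bootstrap pod is needed.
      - "cluster_manager"
```

As noted earlier in the thread, the manifest with the master role did work for a 1.3.2 cluster, so this change is only relevant when `general.version` is 2.x.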