opensearch-k8s-operator

OpenSearch upgrade to v2.x stuck midway

Open ghiya-arpit opened this issue 2 years ago • 2 comments

Hi Team

I upgraded the OpenSearch cluster from v1.3.1 to v2.2.1. It looks like the operator upgraded the data nodes and the dashboard, but the masters are stuck at v1.3.1.

  • The operator relaunched both the master and data pods, along with the dashboard pod

Here are some more details:

Statefulsets:

❯ kubectl get sts dev-opensearch-logging-cluster-masters -oyaml | egrep -A2 -i 'image|upgradeStrategy'
        image: docker.io/opensearchproject/opensearch:1.3.1
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 10
--
        image: public.ecr.aws/opsterio/busybox:1.27.2-buildx
        imagePullPolicy: IfNotPresent
        name: init
        resources: {}
❯ kubectl get sts dev-opensearch-logging-cluster-nodes -oyaml | egrep -A2 -i 'image|upgradeStrategy'
        image: docker.io/opensearchproject/opensearch:2.2.1
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 10
--
        image: public.ecr.aws/opsterio/busybox:1.27.2-buildx
        imagePullPolicy: IfNotPresent
        name: init
        resources: {}
~ ❯
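
A more targeted way to read the same field is a JSONPath query instead of `egrep`. The following is a minimal sketch, not part of the original report: the helper name `sts_image` is hypothetical, and the container name `opensearch` is taken from the operator's patch log in this issue.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the image configured for one container
# of a StatefulSet.
# Arguments: namespace, StatefulSet name, container name.
sts_image() {
  local ns="$1" sts="$2" container="$3"
  # The JSONPath filter selects the container by name and prints its image.
  kubectl -n "$ns" get sts "$sts" \
    -o jsonpath="{.spec.template.spec.containers[?(@.name==\"${container}\")].image}"
}
```

For example, `sts_image dev-opensearch dev-opensearch-logging-cluster-masters opensearch` and the same call with `dev-opensearch-logging-cluster-nodes` allow a side-by-side comparison of the two node pools.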

OpenSearchCluster resource (seems to be stuck upgrading):

Name:         dev-opensearch-logging-cluster
Namespace:    dev-opensearch
Labels:       app=dev-opensearch-logging-cluster
              app.kubernetes.io/managed-by=Helm
              app.kubernetes.io/version=2.1.0
              helm.sh/chart=opensearch-cluster-1.0.0
Annotations:  meta.helm.sh/release-name: dev-opensearch
              meta.helm.sh/release-namespace: default
API Version:  opensearch.opster.io/v1
Kind:         OpenSearchCluster
Metadata:
  Creation Timestamp:  2022-09-05T05:07:28Z
  Finalizers:
    Opster
  Generation:  13
  Managed Fields:
    API Version:  opensearch.opster.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:finalizers:
          .:
          v:"Opster":
      f:spec:
        f:bootstrap:
          .:
          f:resources:
        f:dashboards:
          f:opensearchCredentialsSecret:
          f:tls:
            f:caSecret:
            f:secret:
        f:security:
          f:tls:
            f:http:
              f:caSecret:
              f:secret:
            f:transport:
              f:caSecret:
              f:secret:
    Manager:      manager
    Operation:    Update
    Time:         2022-09-05T05:07:28Z
    API Version:  opensearch.opster.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:status:
        .:
        f:componentsStatus:
        f:initialized:
        f:phase:
        f:version:
    Manager:      manager
    Operation:    Update
    Subresource:  status
    Time:         2022-09-05T05:15:46Z
    API Version:  opensearch.opster.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .:
          f:meta.helm.sh/release-name:
          f:meta.helm.sh/release-namespace:
        f:labels:
          .:
          f:app:
          f:app.kubernetes.io/managed-by:
          f:app.kubernetes.io/version:
          f:helm.sh/chart:
      f:spec:
        .:
        f:confMgmt:
          .:
          f:smartScaler:
        f:dashboards:
          .:
          f:enable:
          f:replicas:
          f:resources:
            .:
            f:limits:
              .:
              f:cpu:
              f:memory:
            f:requests:
              .:
              f:cpu:
              f:memory:
          f:tls:
            .:
            f:enable:
            f:generate:
          f:version:
        f:general:
          .:
          f:httpPort:
          f:pluginsList:
          f:serviceName:
          f:vendor:
          f:version:
        f:nodePools:
        f:security:
          .:
          f:tls:
            .:
            f:http:
              .:
              f:generate:
            f:transport:
              .:
              f:generate:
              f:perNode:
    Manager:         helm
    Operation:       Update
    Time:            2022-09-07T11:10:50Z
  Resource Version:  8122795
  UID:               62b8456d-abec-4625-bb85-3e9f4a99b9fa
Spec:
  Bootstrap:
    Resources:
  Conf Mgmt:
    Smart Scaler:  true
  Dashboards:
    Enable:  true
    Opensearch Credentials Secret:
    Replicas:  1
    Resources:
      Limits:
        Cpu:     500m
        Memory:  2Gi
      Requests:
        Cpu:     500m
        Memory:  2Gi
    Tls:
      Ca Secret:
      Enable:    true
      Generate:  true
      Secret:
    Version:  2.2.1
  General:
    Http Port:  9200
    Plugins List:
      repository-s3
    Service Name:  dev-opensearch-logging-cluster
    Vendor:        opensearch
    Version:       2.2.1
  Node Pools:
    Component:  masters
    Disk Size:  3Gi
    Persistence:
      Pvc:
        Access Modes:
          ReadWriteOnce
        Storage Class:  aws-ebs-standard-persistent
    Replicas:           3
    Resources:
      Limits:
        Cpu:     1000m
        Memory:  2Gi
      Requests:
        Cpu:     500m
        Memory:  2Gi
    Roles:
      master
    Component:  nodes
    Disk Size:  20Gi
    Persistence:
      Pvc:
        Access Modes:
          ReadWriteOnce
        Storage Class:  aws-ebs-standard-persistent
    Replicas:           3
    Resources:
      Limits:
        Cpu:     500m
        Memory:  2Gi
      Requests:
        Cpu:     500m
        Memory:  2Gi
    Roles:
      data
  Security:
    Tls:
      Http:
        Ca Secret:
        Generate:  true
        Secret:
      Transport:
        Ca Secret:
        Generate:  true
        Per Node:  true
        Secret:
Status:
  Components Status:
    Component:    Upgrader
    Description:  nodes
    Status:       Upgrading
    Component:    Upgrader
    Description:  nodes
    Status:       Upgrading
  Initialized:    true
  Phase:          RUNNING
  Version:        1.3.1
Events:           <none>

I then tried to manually update the StatefulSet to set the image tag to v2.2.1, but the operator's controller manager seems to revert the change and sync it back to the state it holds.

1.6631431563239067e+09	DEBUG	controller.opensearchcluster	resource diff	{"reconciler group": "opensearch.opster.io", "reconciler kind": "OpenSearchCluster", "name": "dev-opensearch-logging-cluster", "namespace": "dev-opensearch", "reconciler": "cluster", "name": "dev-opensearch-logging-cluster-masters", "namespace": "dev-opensearch", "apiVersion": "apps/v1", "kind": "StatefulSet", "patch": "{\"spec\":{\"template\":{\"spec\":{\"$setElementOrder/containers\":[{\"name\":\"opensearch\"}],\"containers\":[{\"image\":\"docker.io/opensearchproject/opensearch:1.3.1\",\"name\":\"opensearch\"}]}}}}"}
1.6631431563244636e+09	DEBUG	controller.opensearchcluster	updating resource	{"reconciler group": "opensearch.opster.io", "reconciler kind": "OpenSearchCluster", "name": "dev-opensearch-logging-cluster", "namespace": "dev-opensearch", "reconciler": "cluster", "name": "dev-opensearch-logging-cluster-masters", "namespace": "dev-opensearch", "apiVersion": "apps/v1", "kind": "StatefulSet"}
0 repository-s3
1.663143156337469e+09	DEBUG	controller.opensearchcluster	resource updated	{"reconciler group": "opensearch.opster.io", "reconciler kind": "OpenSearchCluster", "name": "dev-opensearch-logging-cluster", "namespace": "dev-opensearch", "reconciler": "cluster", "name": "dev-opensearch-logging-cluster-masters", "namespace": "dev-opensearch", "apiVersion": "apps/v1", "kind": "StatefulSet"}

I'm out of options on how to get this sorted out. Any help would be appreciated.

FYI: I initially started the upgrade to v2.2.0, and when it got stuck I tried pushing v2.2.1 to see if that moved anything.

ghiya-arpit Sep 14 '22 08:09

Hi @ghiya-arpit. I'm not the most knowledgeable about the upgrade component of the operator, but it looks like the operator is still waiting for the data nodes to finish their upgrade. Can you check the status of the data nodes StatefulSet and see whether updatedReplicas is set to 3?
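
One way to run that check is sketched below. It is a rough sketch under assumptions: the function name `sts_rollout_done` is made up for illustration, and it only compares `.status.updatedReplicas` with `.spec.replicas`. It does not reproduce whatever condition the operator itself evaluates.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report updated vs. desired replicas of a StatefulSet.
# Returns 0 only when every desired replica runs the updated pod template.
# Arguments: namespace, StatefulSet name.
sts_rollout_done() {
  local ns="$1" sts="$2" desired updated
  desired=$(kubectl -n "$ns" get sts "$sts" -o jsonpath='{.spec.replicas}')
  updated=$(kubectl -n "$ns" get sts "$sts" -o jsonpath='{.status.updatedReplicas}')
  # updatedReplicas is omitted from the status when it is zero, so default to 0.
  echo "updatedReplicas=${updated:-0} desired=${desired:-unknown}"
  [ -n "$desired" ] && [ "${updated:-0}" -eq "$desired" ]
}
```

With the names from this issue, the call would be `sts_rollout_done dev-opensearch dev-opensearch-logging-cluster-nodes`; a non-zero exit status means the data nodes have not all been moved to the new revision yet.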

I then tried to manually update the StatefulSet to set the image tag to v2.2.1, but the operator's controller manager seems to revert the change and sync it back to the state it holds.

That is correct. Manual changes to the objects are not possible; the operator will always overwrite them with the state configured via the custom resource.
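
Since the StatefulSet is derived from the custom resource, a version change has to be made there. The sketch below shows one possible way with a merge patch. The function names are hypothetical, the field paths `spec.general.version` and `spec.dashboards.version` are taken from the resource dump in this issue, and the plural resource name `opensearchclusters.opensearch.opster.io` is an assumption. For a Helm-managed install like this one, changing the chart values would be the more durable route, because a later Helm upgrade would overwrite a manual patch.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build a merge patch that sets the OpenSearch and
# Dashboards versions on the custom resource.
version_patch() {
  printf '{"spec":{"general":{"version":"%s"},"dashboards":{"version":"%s"}}}' "$1" "$1"
}

# Hypothetical helper: apply the patch to an OpenSearchCluster resource.
# Arguments: namespace, cluster name, target version.
set_cluster_version() {
  local ns="$1" name="$2" version="$3"
  # Resource name assumed to be the CRD's plural form plus its API group.
  kubectl -n "$ns" patch opensearchclusters.opensearch.opster.io "$name" \
    --type merge -p "$(version_patch "$version")"
}
```

A call would look like `set_cluster_version dev-opensearch dev-opensearch-logging-cluster 2.2.1`.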

I tried to reproduce your problem with a local cluster, but couldn't. I started a cluster on 1.3.1, added sample data, then updated the versions to 2.2.1. After a few minutes all pods were recreated and had the new image.

swoehrl-mw Sep 27 '22 09:09

@ghiya-arpit One more thing: Can you try while also changing the role from "master" to "cluster_manager" during the upgrade?

swoehrl-mw Sep 27 '22 09:09

Closing, as there was no further response from the issue reporter.

swoehrl-mw Dec 06 '22 12:12