moco icon indicating copy to clipboard operation
moco copied to clipboard

Endless loop of reconciliation

Open rassie opened this issue 10 months ago • 3 comments

Describe the bug

After getting a cluster running at long last (OpenShift woes), it seems to be stuck in some kind of endless reconciliation loop:

{
  "level": "info",
  "ts": "2025-02-20T19:41:42Z",
  "logger": "cluster-manager.myapp/myapp",
  "msg": "start operation",
  "operationId": "op-wbdjt",
  "origin": "81293e20-0258-4ddf-83e0-d81e2c3dd4ed"
}
{
  "level": "info",
  "ts": "2025-02-20T19:41:42Z",
  "msg": "Updated partition",
  "controller": "statefulset",
  "controllerGroup": "apps",
  "controllerKind": "StatefulSet",
  "StatefulSet": {
    "name": "moco-myapp",
    "namespace": "myapp"
  },
  "namespace": "myapp",
  "name": "moco-myapp",
  "reconcileID": "32721860-8378-4a82-a05d-24e19fe78c18",
  "newPartition": 2,
  "oldPartition": 3
}
{
  "level": "info",
  "ts": "2025-02-20T19:41:42Z",
  "msg": "partition is updated",
  "controller": "statefulset",
  "controllerGroup": "apps",
  "controllerKind": "StatefulSet",
  "StatefulSet": {
    "name": "moco-myapp",
    "namespace": "myapp"
  },
  "namespace": "myapp",
  "name": "moco-myapp",
  "reconcileID": "32721860-8378-4a82-a05d-24e19fe78c18"
}
{
  "level": "info",
  "ts": "2025-02-20T19:41:42Z",
  "msg": "Updated partition",
  "controller": "statefulset",
  "controllerGroup": "apps",
  "controllerKind": "StatefulSet",
  "StatefulSet": {
    "name": "moco-myapp",
    "namespace": "myapp"
  },
  "namespace": "myapp",
  "name": "moco-myapp",
  "reconcileID": "6b9c0377-f70e-456d-82f4-22e7e414de73",
  "newPartition": 1,
  "oldPartition": 2
}
{
  "level": "info",
  "ts": "2025-02-20T19:41:42Z",
  "msg": "partition is updated",
  "controller": "statefulset",
  "controllerGroup": "apps",
  "controllerKind": "StatefulSet",
  "StatefulSet": {
    "name": "moco-myapp",
    "namespace": "myapp"
  },
  "namespace": "myapp",
  "name": "moco-myapp",
  "reconcileID": "6b9c0377-f70e-456d-82f4-22e7e414de73"
}
{
  "level": "info",
  "ts": "2025-02-20T19:41:42Z",
  "msg": "Updated partition",
  "controller": "statefulset",
  "controllerGroup": "apps",
  "controllerKind": "StatefulSet",
  "StatefulSet": {
    "name": "moco-myapp",
    "namespace": "myapp"
  },
  "namespace": "myapp",
  "name": "moco-myapp",
  "reconcileID": "a41c7399-b9f8-4a53-9e62-e030e9d1ea4e",
  "newPartition": 0,
  "oldPartition": 1
}
{
  "level": "info",
  "ts": "2025-02-20T19:41:42Z",
  "msg": "partition is updated",
  "controller": "statefulset",
  "controllerGroup": "apps",
  "controllerKind": "StatefulSet",
  "StatefulSet": {
    "name": "moco-myapp",
    "namespace": "myapp"
  },
  "namespace": "myapp",
  "name": "moco-myapp",
  "reconcileID": "a41c7399-b9f8-4a53-9e62-e030e9d1ea4e"
}
{
  "level": "info",
  "ts": "2025-02-20T19:41:42Z",
  "logger": "cluster-manager.myapp/myapp",
  "msg": "cluster state is Healthy",
  "operationId": "op-wbdjt"
}
{
  "level": "info",
  "ts": "2025-02-20T19:41:42Z",
  "logger": "cluster-manager.myapp/myapp",
  "msg": "finish",
  "operationId": "op-wbdjt",
  "duration": 0.114975796
}
{
  "level": "info",
  "ts": "2025-02-20T19:41:42Z",
  "msg": "reconciled StatefulSet",
  "controller": "mysqlcluster",
  "controllerGroup": "moco.cybozu.com",
  "controllerKind": "MySQLCluster",
  "MySQLCluster": {
    "name": "myapp",
    "namespace": "myapp"
  },
  "namespace": "myapp",
  "name": "myapp",
  "reconcileID": "d3640c7a-d4bf-4408-9a8c-e0aad62e7758",
  "statefulSetName": "moco-myapp"
}
{
  "level": "info",
  "ts": "2025-02-20T19:41:42Z",
  "msg": "reconciled CronJob for backup",
  "controller": "mysqlcluster",
  "controllerGroup": "moco.cybozu.com",
  "controllerKind": "MySQLCluster",
  "MySQLCluster": {
    "name": "myapp",
    "namespace": "myapp"
  },
  "namespace": "myapp",
  "name": "myapp",
  "reconcileID": "d3640c7a-d4bf-4408-9a8c-e0aad62e7758",
  "cronJobName": "moco-backup-myapp"
}
{
  "level": "info",
  "ts": "2025-02-20T19:41:42Z",
  "logger": "cluster-manager.myapp/myapp",
  "msg": "start operation",
  "operationId": "op-fn89n",
  "origin": "d3640c7a-d4bf-4408-9a8c-e0aad62e7758"
}
{
  "level": "info",
  "ts": "2025-02-20T19:41:42Z",
  "msg": "update status successfully",
  "controller": "mysqlcluster",
  "controllerGroup": "moco.cybozu.com",
  "controllerKind": "MySQLCluster",
  "MySQLCluster": {
    "name": "myapp",
    "namespace": "myapp"
  },
  "namespace": "myapp",
  "name": "myapp",
  "reconcileID": "d3640c7a-d4bf-4408-9a8c-e0aad62e7758"
}
{
  "level": "info",
  "ts": "2025-02-20T19:41:42Z",
  "msg": "Updated partition",
  "controller": "statefulset",
  "controllerGroup": "apps",
  "controllerKind": "StatefulSet",
  "StatefulSet": {
    "name": "moco-myapp",
    "namespace": "myapp"
  },
  "namespace": "myapp",
  "name": "moco-myapp",
  "reconcileID": "bbab67e9-2011-4481-bd7c-ecc0a433d2e8",
  "newPartition": 2,
  "oldPartition": 3
}
{
  "level": "info",
  "ts": "2025-02-20T19:41:42Z",
  "msg": "partition is updated",
  "controller": "statefulset",
  "controllerGroup": "apps",
  "controllerKind": "StatefulSet",
  "StatefulSet": {
    "name": "moco-myapp",
    "namespace": "myapp"
  },
  "namespace": "myapp",
  "name": "moco-myapp",
  "reconcileID": "bbab67e9-2011-4481-bd7c-ecc0a433d2e8"
}
{
  "level": "info",
  "ts": "2025-02-20T19:41:42Z",
  "msg": "Updated partition",
  "controller": "statefulset",
  "controllerGroup": "apps",
  "controllerKind": "StatefulSet",
  "StatefulSet": {
    "name": "moco-myapp",
    "namespace": "myapp"
  },
  "namespace": "myapp",
  "name": "moco-myapp",
  "reconcileID": "ff21a7d5-0c90-4ddf-b98e-0b3996c3c5a2",
  "newPartition": 1,
  "oldPartition": 2
}
{
  "level": "info",
  "ts": "2025-02-20T19:41:42Z",
  "msg": "partition is updated",
  "controller": "statefulset",
  "controllerGroup": "apps",
  "controllerKind": "StatefulSet",
  "StatefulSet": {
    "name": "moco-myapp",
    "namespace": "myapp"
  },
  "namespace": "myapp",
  "name": "moco-myapp",
  "reconcileID": "ff21a7d5-0c90-4ddf-b98e-0b3996c3c5a2"
}
{
  "level": "info",
  "ts": "2025-02-20T19:41:43Z",
  "msg": "Updated partition",
  "controller": "statefulset",
  "controllerGroup": "apps",
  "controllerKind": "StatefulSet",
  "StatefulSet": {
    "name": "moco-myapp",
    "namespace": "myapp"
  },
  "namespace": "myapp",
  "name": "moco-myapp",
  "reconcileID": "487bdf78-2a05-489c-9473-578ea26ee907",
  "newPartition": 0,
  "oldPartition": 1
}
{
  "level": "info",
  "ts": "2025-02-20T19:41:43Z",
  "msg": "partition is updated",
  "controller": "statefulset",
  "controllerGroup": "apps",
  "controllerKind": "StatefulSet",
  "StatefulSet": {
    "name": "moco-myapp",
    "namespace": "myapp"
  },
  "namespace": "myapp",
  "name": "moco-myapp",
  "reconcileID": "487bdf78-2a05-489c-9473-578ea26ee907"
}
{
  "level": "info",
  "ts": "2025-02-20T19:41:43Z",
  "logger": "cluster-manager.myapp/myapp",
  "msg": "cluster state is Healthy",
  "operationId": "op-fn89n"
}
{
  "level": "info",
  "ts": "2025-02-20T19:41:43Z",
  "logger": "cluster-manager.myapp/myapp",
  "msg": "finish",
  "operationId": "op-fn89n",
  "duration": 0.113314412
}

I have no idea (yet) what could be causing this. An extremely wild hunch is that maybe https://github.com/cybozu-go/moco/pull/781#issuecomment-2667916155 is a fix for this, but I have currently no way to check that.

Every now and then the logs also contain this:

{
  "level": "error",
  "ts": "2025-02-20T19:41:43Z",
  "msg": "failed to update status",
  "controller": "mysqlcluster",
  "controllerGroup": "moco.cybozu.com",
  "controllerKind": "MySQLCluster",
  "MySQLCluster": {
    "name": "myapp",
    "namespace": "myapp"
  },
  "namespace": "myapp",
  "name": "myapp",
  "reconcileID": "e6f09dd0-b294-4a46-b12a-81879497244a",
  "error": "Operation cannot be fulfilled on mysqlclusters.moco.cybozu.com \"myapp\": the object has been modified; please apply your changes to the latest version and try again",
  "stacktrace": "github.com/cybozu-go/moco/controllers.(*MySQLClusterReconciler).reconcileV1.func1\n\t/work/controllers/mysqlcluster_controller.go:260\ngithub.com/cybozu-go/moco/controllers.(*MySQLClusterReconciler).reconcileV1\n\t/work/controllers/mysqlcluster_controller.go:328\ngithub.com/cybozu-go/moco/controllers.(*MySQLClusterReconciler).Reconcile\n\t/work/controllers/mysqlcluster_controller.go:208\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:116\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:303\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:263\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:224"
}
{
  "level": "error",
  "ts": "2025-02-20T19:41:43Z",
  "msg": "Reconciler error",
  "controller": "mysqlcluster",
  "controllerGroup": "moco.cybozu.com",
  "controllerKind": "MySQLCluster",
  "MySQLCluster": {
    "name": "myapp",
    "namespace": "myapp"
  },
  "namespace": "myapp",
  "name": "myapp",
  "reconcileID": "e6f09dd0-b294-4a46-b12a-81879497244a",
  "error": "Operation cannot be fulfilled on mysqlclusters.moco.cybozu.com \"myapp\": the object has been modified; please apply your changes to the latest version and try again",
  "stacktrace": "sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:316\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:263\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:224"
}

but then again, I'm pretty sure I didn't change the object (fully synced with ArgoCD).

Environments

  • Version: 0.25.1

To Reproduce Steps to reproduce the behavior:

  1. Create a new cluster
  2. Wait for the cluster to finish setting up
  3. Look into moco-controller's logs and see the above logs repeating

Expected behavior After initial reconciliation, it should happen again only when something has changed.

rassie avatar Feb 21 '25 09:02 rassie

@rassie We've released moco v0.26.0 including #781, so please try again with this version.

yamatcha avatar Mar 05 '25 01:03 yamatcha

Unfortunately, this still happens with 0.26.0.

rassie avatar Mar 05 '25 09:03 rassie

@rassie These logs seem to occur during the construction of MySQLCluster. Do these logs stop appearing once the construction of MySQLCluster myapp is completed? Or does the construction fail to complete?

shunki-fujita avatar Mar 10 '25 05:03 shunki-fujita

I'm also having the same problem Getting alot of logs in controller container logs and PVC are not getting expanded (But statefulset is recreated correctly)

Getting

{"level":"error","ts":"2025-09-14T12:16:24Z","msg":"Reconciler error","controller":"mysqlcluster","controllerGroup":"moco.cybozu.com","controllerKind":"MySQLCluster","MySQLCluster":{"name":"moco-test","namespace":"default"},"namespace":"default","name":"moco-test","reconcileID":"1826bd2a-ba6f-4b06-875b-727f7d0d1c20","error":"Operation cannot be fulfilled on mysqlclusters.moco.cybozu.com \"moco-test\": the object has been modified; please apply your changes to the latest version and try again","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:332\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:279\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:240"}

in controller logs continuously

mhkarimi1383 avatar Sep 14 '25 12:09 mhkarimi1383

I think due to this problem my volume resizes are not working...

mhkarimi1383 avatar Sep 14 '25 21:09 mhkarimi1383

@mhkarimi1383 I tried to reproduce this locally. I do see the same log:

{
  "level": "error",
  "ts": "2025-09-16T05:48:33.979Z",
  "msg": "failed to update status",
  "controller": "mysqlcluster",
  "controllerGroup": "moco.cybozu.com",
  "controllerKind": "MySQLCluster",
  "MySQLCluster": {
    "name": "test",
    "namespace": "test"
  },
  "namespace": "test",
  "name": "test",
  "reconcileID": "69d10043-b841-4eea-8c9d-7de4b44f7675",
  "error": "Operation cannot be fulfilled on mysqlclusters.moco.cybozu.com \"test\
   the object has been modified; please apply your changes to the latest version and try again",
  "stacktrace": "github.com/cybozu-go/moco/controllers.(*MySQLClusterReconciler).reconcileV1.func1\n\t/work/controllers/mysqlcluster_controller.go:262\ngithub.com/cybozu-go/moco/controllers.(*MySQLClusterReconciler).reconcileV1\n\t/work
/controllers/mysqlcluster_controller.go:330\ngithub.com/cybozu-go/moco/controllers.(*MySQLClusterReconciler).Reconcile\n\t/work/controllers/mysqlcluster_controller.go:210\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Control
ler[...]).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:118\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/c
[email protected]/pkg/internal/controller/controller.go:319\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/contro
ller/controller.go:279\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:240"
}

But it only appears a few times and then stops; the PVC expansion finishes successfully.

I changed the storage size from 100Gi to 101Gi.:

$ kubectl get MySQLCluster -n test -ojson test | jq .spec.volumeClaimTemplates
[
  {
    "metadata": {
      "name": "mysql-data"
    },
    "spec": {
      "accessModes": [
        "ReadWriteOnce"
      ],
      "resources": {
        "requests": {
          "storage": "101Gi"
        }
      },
      "storageClassName": "topolvm-flash1-xfs"
    }
  }
]

$ kubectl get pvc -n test
NAME                      STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS         VOLUMEATTRIBUTESCLASS   AGE
mysql-data-moco-test-0   Bound    pvc-e3e90df9-4f5a-4a9e-8ae6-1a872f9347e0   101Gi      RWO            topolvm-flash1-xfs   <unset>                 103d
mysql-data-moco-test-1   Bound    pvc-9df5a5ea-5b9c-4f85-bac6-4297b8117ed7   101Gi      RWO            topolvm-flash1-xfs   <unset>                 103d
mysql-data-moco-test-2   Bound    pvc-fb57262e-2ac0-4d81-8777-2ebc5987cd84   101Gi      RWO            topolvm-flash1-xfs   <unset>                 103d

To dig in further, could you share a bit more about your environment and exact steps?

shunki-fujita avatar Sep 16 '25 06:09 shunki-fujita

@shunki-fujita I have a two node minikube setup (with KVM2 driver) and OpenEBS Rawfile-localpv as storage engine

Installed moco using manifests

I will share my exact commands right now

mhkarimi1383 avatar Sep 16 '25 06:09 mhkarimi1383

@shunki-fujita

#!/usr/bin/env bash

set -exu -o pipefail

if ! minikube status -o json | jq -e 'length > 0 and all(.[]; .Host == "Running" and .Kubelet == "Running")'; then
    minikube start -d kvm2 -n 2 --cpus='4' --memory='4GiB' --disk-size='50000mb'
fi
if [[ "$(kubectl config current-context)" != "minikube" ]]; then
    kubectl config use-context minikube
fi
helm repo add rawfile-localpv https://openebs.github.io/rawfile-localpv || true # Add the repo if not already added
helm repo update rawfile-localpv
helm upgrade \
    --install \
    rawfile-localpv \
    rawfile-localpv/rawfile-localpv \
    --namespace kube-system \
    --create-namespace \
    --set node.dataDirPath=/mnt/vda1/rawfile

kubectl apply --server-side -f \
    https://github.com/jetstack/cert-manager/releases/latest/download/cert-manager.yaml

kubectl apply --server-side -f \
    https://github.com/cybozu-go/moco/releases/latest/download/moco.yaml

I have ran this script, then applied my manifest file

mhkarimi1383 avatar Sep 16 '25 06:09 mhkarimi1383

apiVersion: moco.cybozu.com/v1beta2
kind: MySQLCluster
metadata:
  namespace: default
  name: moco-test
spec:
  serverIDBase: 831841464
  replicas: 3
  podTemplate:
    spec:
      containers:
        - name: mysqld
          image: ghcr.io/cybozu-go/moco/mysql:8.4.6
          resources: {}
  volumeClaimTemplates:
    - metadata:
        name: mysql-data
      spec:
        storageClassName: rawfile-localpv
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 5Gi

Here is my manifest

Also I had to manually set serverIDBase when performing kubectl replace -f ...

mhkarimi1383 avatar Sep 16 '25 06:09 mhkarimi1383

It would be great if we could have randomly generated serverIDBase stored at k8s resource

mhkarimi1383 avatar Sep 16 '25 06:09 mhkarimi1383

Something to notice is that

At first it was 1Gi then I filled it up (100% usage) after that I have tried to change it to 5Gi

mhkarimi1383 avatar Sep 16 '25 06:09 mhkarimi1383

@shunki-fujita Am I doing something wrong?

mhkarimi1383 avatar Sep 16 '25 07:09 mhkarimi1383

@mhkarimi1383 I was debugging why increasing MySQLCluster.spec.volumeClaimTemplates[].resources.requests.storage (e.g., 5Gi → 10Gi) didn’t raise the PVC spec.resources.requests.storage. It turned out to be caused by putting an empty resources map on the mysqld container.

Because nil vs {} are treated as different, the controller keeps seeing a diff in the Pod template and reconciles the StatefulSet repeatedly, never reaching the PVC resize logic.

Please try removing resources: {} and the PVC resize should trigger.

shunki-fujita avatar Sep 16 '25 09:09 shunki-fujita

Thanks, I will test is today

mhkarimi1383 avatar Sep 16 '25 09:09 mhkarimi1383

@shunki-fujita

Storage resize worked after removing resources key

But I have also tested that when I completely make my storage usage to be at 100%, Storage extend will not work and also master changes somehow

It will not extend volumes when my nodes are not healthy, So When this happens I have to manually extend that

and replace cluster manifest when cluster become healthy

I will ignore this if that's expected behavior

mhkarimi1383 avatar Sep 16 '25 13:09 mhkarimi1383

@mhkarimi1383 This is expected behavior. MOCO prioritizes StatefulSet reconciliation (resolving Pod template diffs / rolling updates). As a result, PVC reconciliation only runs once all Pods are Healthy/Ready. If nodes/Pods aren’t healthy (e.g., disk full), the controller won’t proceed with the PVC step.

shunki-fujita avatar Sep 17 '25 04:09 shunki-fujita

@mhkarimi1383 This is expected behavior. MOCO prioritizes StatefulSet reconciliation (resolving Pod template diffs / rolling updates). As a result, PVC reconciliation only runs once all Pods are Healthy/Ready. If nodes/Pods aren’t healthy (e.g., disk full), the controller won’t proceed with the PVC step.

Nice I will continue testing operator :) Thanks

mhkarimi1383 avatar Sep 17 '25 08:09 mhkarimi1383