
Status of CR not updated, despite cluster being started

Open ahus1 opened this issue 2 years ago • 4 comments

After the cluster has been initialized, the status in the resource remains clusterStatus: Failed, even though the database is fully usable. Because of this, the resource status can't be used to monitor cluster initialization.

Setup:

  • Minikube running Kubernetes 1.24.7
  • CRD and Operator deployed v2.8.0
  • CR deployed in Namespace keycloak - see full output below including status
  • The operator complains about a concurrent modification and apparently doesn't retry - see log below.

The cluster is fully usable, yet the status in the resource is never updated. Restarting the operator neither fixed the status nor triggered any further reconciliation.

CR with status
apiVersion: crdb.cockroachlabs.com/v1alpha1
kind: CrdbCluster
metadata:
  annotations:
    crdb.io/certexpiration: "2027-11-14T07:57:21Z"
    crdb.io/containerimage: cockroachdb/cockroach:v22.1.10
    crdb.io/version: v22.1.10
    meta.helm.sh/release-name: keycloak
    meta.helm.sh/release-namespace: default
  creationTimestamp: "2022-11-10T11:44:22Z"
  generation: 1
  labels:
    app.kubernetes.io/managed-by: Helm
  name: cockroach
  namespace: keycloak
  resourceVersion: "112311"
  uid: 88e847e4-e321-420b-a425-3ddf19c53226
spec:
  dataStore:
    pvc:
      source:
        claimName: ""
      spec:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 20Gi
        volumeMode: Filesystem
    supportsAutoResize: false
  grpcPort: 26258
  httpPort: 8080
  image:
    name: cockroachdb/cockroach:v22.1.10
    pullPolicy: IfNotPresent
  maxUnavailable: 1
  nodes: 3
  resources: {}
  sqlPort: 26257
  tlsEnabled: true
status:
  clusterStatus: Failed
  conditions:
  - lastTransitionTime: "2022-11-10T11:44:34Z"
    status: "True"
    type: Initialized
  - lastTransitionTime: "2022-11-10T11:44:26Z"
    status: "True"
    type: CrdbVersionChecked
  - lastTransitionTime: "2022-11-10T11:44:27Z"
    status: "True"
    type: CertificateGenerated
  crdbcontainerimage: cockroachdb/cockroach:v22.1.10
  operatorActions:
  - lastTransitionTime: "2022-11-10T11:44:22Z"
    message: job changed
    status: Failed
    type: VersionCheckerAction
  - lastTransitionTime: "2022-11-10T11:44:27Z"
    message: pod is not running
    status: Failed
    type: Initialize
  version: v22.1.10
operator log

{"level":"warn","ts":1668080674.5940526,"logger":"controller.CrdbCluster","msg":"completed initializing database","CrdbCluster":"keycloak/cockroach","ReconcileId":"LC8CQLDR9WNGuQxK93EqWB"}
{"level":"error","ts":1668080674.9439986,"logger":"controller.CrdbCluster","msg":"failed to update cluster status","CrdbCluster":"keycloak/cockroach","ReconcileId":"LC8CQLDR9WNGuQxK93EqWB","error":"Operation cannot be fulfilled on crdbclusters.crdb.cockroachlabs.com \"cockroach\": the object has been modified; please apply your changes to the latest version and try again","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\texternal/com_github_go_logr_zapr/zapr.go:132\ngithub.com/cockroachdb/cockroach-operator/pkg/controller.(*ClusterReconciler).Reconcile\n\tpkg/controller/cluster_controller.go:196\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\texternal/io_k8s_sigs_controller_runtime/pkg/internal/controller/controller.go:297\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\texternal/io_k8s_sigs_controller_runtime/pkg/internal/controller/controller.go:252\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1.2\n\texternal/io_k8s_sigs_controller_runtime/pkg/internal/controller/controller.go:215\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:155\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:156\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:133\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.UntilWithContext\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:99"}
{"level":"error","ts":1668080674.9446912,"logger":"controller-runtime.manager.controller.crdbcluster","msg":"Reconciler error","reconciler group":"crdb.cockroachlabs.com","reconciler kind":"CrdbCluster","name":"cockroach","namespace":"keycloak","error":"Operation cannot be fulfilled on crdbclusters.crdb.cockroachlabs.com \"cockroach\": the object has been modified; please apply your changes to the latest version and try again","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\texternal/com_github_go_logr_zapr/zapr.go:132\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\texternal/io_k8s_sigs_controller_runtime/pkg/internal/controller/controller.go:301\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\texternal/io_k8s_sigs_controller_runtime/pkg/internal/controller/controller.go:252\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1.2\n\texternal/io_k8s_sigs_controller_runtime/pkg/internal/controller/controller.go:215\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:155\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:156\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:133\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.UntilWithContext\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:99"}
{"level":"info","ts":1668080674.9448378,"logger":"controller.CrdbCluster","msg":"reconciling CockroachDB cluster","CrdbCluster":"keycloak/cockroach","ReconcileId":"NhLUCi9szZJ9LNKQbpdt59"}
{"level":"info","ts":1668080674.9448888,"logger":"webhooks","msg":"default","name":"cockroach"}

ahus1 avatar Nov 10 '22 12:11 ahus1

This might be related to #592 - unfortunately I see this all the time in my setup, not only occasionally.

ahus1 avatar Nov 10 '22 14:11 ahus1

Looks like it might be related to https://github.com/cockroachdb/cockroach-operator/issues/906.

neurodrone avatar Nov 10 '22 18:11 neurodrone

Hi @neurodrone - as I only see the error message

Operation cannot be fulfilled on crdbclusters.crdb.cockroachlabs.com "cockroach": the object has been modified; please apply your changes to the latest version and try again

and no other error message around connecting to the database, I assume this is not related to #906. Reading the other issue, it might still affect me once I try to update the cluster, or once it passes the stage I'm currently in.

ahus1 avatar Nov 11 '22 11:11 ahus1

I'm also encountering this on both 2.8 and 2.7.

I also had to monkey-patch in a fix to allow the VersionChecker to succeed, since it would attempt to grab the container's output before the version had even been printed and the container had entered its sleep.

I have yet to be able to provision a functional CrdbCluster with this operator project.

Name:         testcrdb
Namespace:    app-ns
Labels:       app.kubernetes.io/managed-by=pulumi
Annotations:  crdb.io/certexpiration: 2027-11-30T23:58:17Z
API Version:  crdb.cockroachlabs.com/v1alpha1
Kind:         CrdbCluster
Metadata:
  Creation Timestamp:  2022-11-29T05:04:05Z
  Generation:          1
    Manager:      cockroach-operator
    Operation:    Update
    Time:         2022-11-29T05:04:06Z
    API Version:  crdb.cockroachlabs.com/v1alpha1
    Manager:         cockroach-operator
    Operation:       Update
    Subresource:     status
    Time:            2022-11-29T05:04:07Z
  Resource Version:  11066027
  UID:               24b56ac2-5f9a-41d6-900b-ae834679e189
Spec:
  Additional Labels:
    App:                  cockroachdb
    Cockroach - Cluster:  testcrdb
  Cockroach DB Version:   v22.1.2
  Data Store:
    Pvc:
      Source:
        Claim Name:
      Spec:
        Access Modes:
          ReadWriteOnce
        Resources:
          Requests:
            Storage:         10Gi
        Storage Class Name:  premium-rwo
        Volume Mode:         Filesystem
    Supports Auto Resize:    false
  Grpc Port:                 26258
  Http Port:                 8080
  Max Unavailable:           1
  Nodes:                     3
  Resources:
  Sql Port:     26257
  Tls Enabled:  true
Status:
  Cluster Status:  Failed
  Conditions:
    Last Transition Time:  2022-11-29T05:04:37Z
    Status:                True
    Type:                  Initialized
    Last Transition Time:  2022-11-29T05:04:05Z
    Status:                True
    Type:                  CrdbVersionChecked
    Last Transition Time:  2022-11-29T05:04:06Z
    Status:                True
    Type:                  CertificateGenerated
  Crdbcontainerimage:      cockroachdb/cockroach:v22.1.2
  Operator Actions:
    Last Transition Time:  2022-11-29T05:04:32Z
    Message:               pod is not running
    Status:                Failed
    Type:                  Initialize
  Version:                 v22.1.2
Events:                    <none>

I've also deployed the workaround described in #906 for good measure and encounter the same problem. It seems that the Initialize actor updates the status object, at which point we return to the main Reconcile loop and immediately try to update the cluster status again with a now-stale copy. This fails with

{"level":"info","ts":1669698277.3480625,"logger":"controller.CrdbCluster","msg":"Running action with name: Initialize","CrdbCluster":"app-ns/testcrdb","ReconcileId":"acqUPmaepZhbPisLqUVm2Z"}
{"level":"warn","ts":1669698277.348095,"logger":"controller.CrdbCluster","msg":"initializing CockroachDB","CrdbCluster":"app-ns/testcrdb","ReconcileId":"acqUPmaepZhbPisLqUVm2Z"}
{"level":"warn","ts":1669698277.355068,"logger":"controller.CrdbCluster","msg":"Pod is ready","CrdbCluster":"app-ns/testcrdb","ReconcileId":"acqUPmaepZhbPisLqUVm2Z"}
{"level":"warn","ts":1669698277.355116,"logger":"controller.CrdbCluster","msg":"Executing init in pod testcrdb-0 with phase Running","CrdbCluster":"app-ns/testcrdb","ReconcileId":"acqUPmaepZhbPisLqUVm2Z"}
{"level":"warn","ts":1669698277.8978968,"logger":"controller.CrdbCluster","msg":"Executed init in pod","CrdbCluster":"app-ns/testcrdb","ReconcileId":"acqUPmaepZhbPisLqUVm2Z"}
{"level":"info","ts":1669698277.89797,"logger":"webhooks","msg":"default","name":"testcrdb"}
{"level":"warn","ts":1669698277.9101963,"logger":"controller.CrdbCluster","msg":"completed initializing database","CrdbCluster":"app-ns/testcrdb","ReconcileId":"acqUPmaepZhbPisLqUVm2Z"}
{"level":"error","ts":1669698278.2586818,"logger":"controller.CrdbCluster","msg":"failed to update cluster status","CrdbCluster":"app-ns/testcrdb","ReconcileId":"acqUPmaepZhbPisLqUVm2Z","error":"Operation cannot be fulfilled on crdbclusters.crdb.cockroachlabs.com \"testcrdb\": the object has been modified; please apply your changes to the latest version and try again","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\texternal/com_github_go_logr_zapr/zapr.go:132\ngithub.com/cockroachdb/cockroach-operator/pkg/controller.(*ClusterReconciler).Reconcile\n\tpkg/controller/cluster_controller.go:196\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\texternal/io_k8s_sigs_controller_runtime/pkg/internal/controller/controller.go:297\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\texternal/io_k8s_sigs_controller_runtime/pkg/internal/controller/controller.go:252\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1.2\n\texternal/io_k8s_sigs_controller_runtime/pkg/internal/controller/controller.go:215\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:155\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:156\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:133\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.UntilWithContext\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:99"}
{"level":"error","ts":1669698278.2593884,"logger":"controller-runtime.manager.controller.crdbcluster","msg":"Reconciler error","reconciler group":"crdb.cockroachlabs.com","reconciler kind":"CrdbCluster","name":"testcrdb","namespace":"app-ns","error":"Operation cannot be fulfilled on crdbclusters.crdb.cockroachlabs.com \"testcrdb\": the object has been modified; please apply your changes to the latest version and try again","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\texternal/com_github_go_logr_zapr/zapr.go:132\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\texternal/io_k8s_sigs_controller_runtime/pkg/internal/controller/controller.go:301\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\texternal/io_k8s_sigs_controller_runtime/pkg/internal/controller/controller.go:252\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1.2\n\texternal/io_k8s_sigs_controller_runtime/pkg/internal/controller/controller.go:215\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:155\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:156\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:133\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.UntilWithContext\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:99"}
{"level":"info","ts":1669698278.2595415,"logger":"controller.CrdbCluster","msg":"reconciling CockroachDB cluster","CrdbCluster":"app-ns/testcrdb","ReconcileId":"ApnFDWosB4MGhzuoSR6JVW"}
{"level":"info","ts":1669698278.2596006,"logger":"webhooks","msg":"default","name":"testcrdb"}
{"level":"info","ts":1669698278.2819178,"logger":"controller.CrdbCluster","msg":"No actor to run; not requeueing","CrdbCluster":"app-ns/testcrdb","ReconcileId":"ApnFDWosB4MGhzuoSR6JVW"}
{"level":"info","ts":1669698278.2820442,"logger":"controller.CrdbCluster","msg":"reconciling CockroachDB cluster","CrdbCluster":"app-ns/testcrdb","ReconcileId":"W5ykg2NfQY2WcspxBQ7VZ"}
{"level":"info","ts":1669698278.282086,"logger":"webhooks","msg":"default","name":"testcrdb"}
{"level":"info","ts":1669698278.305721,"logger":"controller.CrdbCluster","msg":"No actor to run; not requeueing","CrdbCluster":"app-ns/testcrdb","ReconcileId":"W5ykg2NfQY2WcspxBQ7VZ"}
<and more of these>

And we never try again. The pods all report healthy and the cluster did in fact initialize, yet the operator seemingly just gives up on ever re-updating the cluster status to running.

For reference this is happening on a private GKE cluster (with NAT based internet access available) running 1.24.5-gke.600 on the control plane.

Yes, the operator and the CrdbCluster are in different namespaces, but there don't appear to be any issues initializing the cluster - only with updating the status afterwards.

For good measure I've also tried the example.yaml included in the repo (to take Pulumi out of the equation) and the same thing happens. Subsequent attempts to modify the cluster can't get it out of the Failed state.

davefinster avatar Nov 29 '22 05:11 davefinster