cockroach-operator
Status of CR not updated, despite cluster being started
After the cluster has been initialized, the status in the resource still shows clusterStatus: Failed, although the database is fully usable. Because of this, the status in the resource cannot be used to monitor cluster initialization.
Setup:
- Minikube running Kubernetes 1.24.7
- CRD and operator deployed, v2.8.0
- CR deployed in namespace keycloak - see the full output below, including status
- Operator complains about a concurrent modification; apparently it didn't retry? See the log below.
The cluster is fully usable, yet the status in the resource is not updated. Restarting the operator neither fixed the status nor triggered any reconciliation.
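For illustration, the conditions in the status do flip to True, so waiting on the Initialized condition works as a stop-gap, while polling clusterStatus itself just keeps returning Failed. These commands are only a sketch based on the status dump below:

kubectl wait crdbcluster/cockroach -n keycloak --for=condition=Initialized --timeout=10m
kubectl get crdbcluster/cockroach -n keycloak -o jsonpath='{.status.clusterStatus}'   # still prints "Failed"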
CR with status
kind: CrdbCluster
metadata:
  annotations:
    crdb.io/certexpiration: "2027-11-14T07:57:21Z"
    crdb.io/containerimage: cockroachdb/cockroach:v22.1.10
    crdb.io/version: v22.1.10
    meta.helm.sh/release-name: keycloak
    meta.helm.sh/release-namespace: default
  creationTimestamp: "2022-11-10T11:44:22Z"
  generation: 1
  labels:
    app.kubernetes.io/managed-by: Helm
  name: cockroach
  namespace: keycloak
  resourceVersion: "112311"
  uid: 88e847e4-e321-420b-a425-3ddf19c53226
spec:
  dataStore:
    pvc:
      source:
        claimName: ""
      spec:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 20Gi
        volumeMode: Filesystem
    supportsAutoResize: false
  grpcPort: 26258
  httpPort: 8080
  image:
    name: cockroachdb/cockroach:v22.1.10
    pullPolicy: IfNotPresent
  maxUnavailable: 1
  nodes: 3
  resources: {}
  sqlPort: 26257
  tlsEnabled: true
status:
  clusterStatus: Failed
  conditions:
  - lastTransitionTime: "2022-11-10T11:44:34Z"
    status: "True"
    type: Initialized
  - lastTransitionTime: "2022-11-10T11:44:26Z"
    status: "True"
    type: CrdbVersionChecked
  - lastTransitionTime: "2022-11-10T11:44:27Z"
    status: "True"
    type: CertificateGenerated
  crdbcontainerimage: cockroachdb/cockroach:v22.1.10
  operatorActions:
  - lastTransitionTime: "2022-11-10T11:44:22Z"
    message: job changed
    status: Failed
    type: VersionCheckerAction
  - lastTransitionTime: "2022-11-10T11:44:27Z"
    message: pod is not running
    status: Failed
    type: Initialize
  version: v22.1.10
operator log
{"level":"warn","ts":1668080674.5940526,"logger":"controller.CrdbCluster","msg":"completed initializing database","CrdbCluster":"keycloak/cockroach","ReconcileId":"LC8CQLDR9WNGuQxK93EqWB"} {"level":"error","ts":1668080674.9439986,"logger":"controller.CrdbCluster","msg":"failed to update cluster status","CrdbCluster":"keycloak/cockroach","ReconcileId":"LC8CQLDR9WNGuQxK93EqWB","error":"Operation cannot be fulfilled on crdbclusters.crdb.cockroachlabs.com "cockroach": the object has been modified; please apply your changes to the latest version and try again","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\texternal/com_github_go_logr_zapr/zapr.go:132\ngithub.com/cockroachdb/cockroach-operator/pkg/controller.(*ClusterReconciler).Reconcile\n\tpkg/controller/cluster_controller.go:196\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\texternal/io_k8s_sigs_controller_runtime/pkg/internal/controller/controller.go:297\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\texternal/io_k8s_sigs_controller_runtime/pkg/internal/controller/controller.go:252\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1.2\n\texternal/io_k8s_sigs_controller_runtime/pkg/internal/controller/controller.go:215\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:155\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:156\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:133\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.UntilWithContext\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:99"} {"level":"error","ts":1668080674.9446912,"logger":"controller-runtime.manager.controller.crdbcluster","msg":"Reconciler error","reconciler group":"crdb.cockroachlabs.com","reconciler kind":"CrdbCluster","name":"cockroach","namespace":"keycloak","error":"Operation cannot be fulfilled on crdbclusters.crdb.cockroachlabs.com "cockroach": the object has been modified; please apply your changes to the latest version and try 
again","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\texternal/com_github_go_logr_zapr/zapr.go:132\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\texternal/io_k8s_sigs_controller_runtime/pkg/internal/controller/controller.go:301\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\texternal/io_k8s_sigs_controller_runtime/pkg/internal/controller/controller.go:252\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1.2\n\texternal/io_k8s_sigs_controller_runtime/pkg/internal/controller/controller.go:215\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:155\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:156\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:133\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.UntilWithContext\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:99"} {"level":"info","ts":1668080674.9448378,"logger":"controller.CrdbCluster","msg":"reconciling CockroachDB cluster","CrdbCluster":"keycloak/cockroach","ReconcileId":"NhLUCi9szZJ9LNKQbpdt59"} {"level":"info","ts":1668080674.9448888,"logger":"webhooks","msg":"default","name":"cockroach"}
This might be related to #592; unfortunately I see it all the time in my setup, not only occasionally.
Looks like it might be related to https://github.com/cockroachdb/cockroach-operator/issues/906.
Hi @neurodrone - as I only see the error message
Operation cannot be fulfilled on crdbclusters.crdb.cockroachlabs.com "cockroach": the object has been modified; please apply your changes to the latest version and try again
and no other error messages related to connecting to the database, I assume this is not related to #906. Reading the other issue, it might still affect me once I try to update the cluster, or once it gets past the stage I'm currently stuck at.
I'm also encountering this on both 2.8 and 2.7.
I also had to monkey-patch in a fix to allow the VersionChecker to succeed, since it would attempt to grab the container's output before the version had even been printed and the container had entered its sleep.
I have yet to be able to provision a functional CR cluster with this operator.
Name:         testcrdb
Namespace:    app-ns
Labels:       app.kubernetes.io/managed-by=pulumi
Annotations:  crdb.io/certexpiration: 2027-11-30T23:58:17Z
API Version:  crdb.cockroachlabs.com/v1alpha1
Kind:         CrdbCluster
Metadata:
  Creation Timestamp:  2022-11-29T05:04:05Z
  Generation:          1
    Manager:       cockroach-operator
    Operation:     Update
    Time:          2022-11-29T05:04:06Z
    API Version:   crdb.cockroachlabs.com/v1alpha1
    Manager:       cockroach-operator
    Operation:     Update
    Subresource:   status
    Time:          2022-11-29T05:04:07Z
  Resource Version:  11066027
  UID:               24b56ac2-5f9a-41d6-900b-ae834679e189
Spec:
  Additional Labels:
    App:                  cockroachdb
    Cockroach - Cluster:  testcrdb
  Cockroach DB Version:   v22.1.2
  Data Store:
    Pvc:
      Source:
        Claim Name:
      Spec:
        Access Modes:
          ReadWriteOnce
        Resources:
          Requests:
            Storage:  10Gi
        Storage Class Name:  premium-rwo
        Volume Mode:         Filesystem
    Supports Auto Resize:  false
  Grpc Port:        26258
  Http Port:        8080
  Max Unavailable:  1
  Nodes:            3
  Resources:
  Sql Port:     26257
  Tls Enabled:  true
Status:
  Cluster Status:  Failed
  Conditions:
    Last Transition Time:  2022-11-29T05:04:37Z
    Status:                True
    Type:                  Initialized
    Last Transition Time:  2022-11-29T05:04:05Z
    Status:                True
    Type:                  CrdbVersionChecked
    Last Transition Time:  2022-11-29T05:04:06Z
    Status:                True
    Type:                  CertificateGenerated
  Crdbcontainerimage:  cockroachdb/cockroach:v22.1.2
  Operator Actions:
    Last Transition Time:  2022-11-29T05:04:32Z
    Message:               pod is not running
    Status:                Failed
    Type:                  Initialize
  Version:  v22.1.2
Events:     <none>
I've also deployed the workaround described in #906 for good measure and I encounter the same problem. It seems that the Initialize actor updates the status object, at which point we return to the main Reconcile loop and immediately try to update the cluster status again. This fails with:
{"level":"info","ts":1669698277.3480625,"logger":"controller.CrdbCluster","msg":"Running action with name: Initialize","CrdbCluster":"app-ns/testcrdb","ReconcileId":"acqUPmaepZhbPisLqUVm2Z"}
{"level":"warn","ts":1669698277.348095,"logger":"controller.CrdbCluster","msg":"initializing CockroachDB","CrdbCluster":"app-ns/testcrdb","ReconcileId":"acqUPmaepZhbPisLqUVm2Z"}
{"level":"warn","ts":1669698277.355068,"logger":"controller.CrdbCluster","msg":"Pod is ready","CrdbCluster":"app-ns/testcrdb","ReconcileId":"acqUPmaepZhbPisLqUVm2Z"}
{"level":"warn","ts":1669698277.355116,"logger":"controller.CrdbCluster","msg":"Executing init in pod testcrdb-0 with phase Running","CrdbCluster":"app-ns/testcrdb","ReconcileId":"acqUPmaepZhbPisLqUVm2Z"}
{"level":"warn","ts":1669698277.8978968,"logger":"controller.CrdbCluster","msg":"Executed init in pod","CrdbCluster":"app-ns/testcrdb","ReconcileId":"acqUPmaepZhbPisLqUVm2Z"}
{"level":"info","ts":1669698277.89797,"logger":"webhooks","msg":"default","name":"testcrdb"}
{"level":"warn","ts":1669698277.9101963,"logger":"controller.CrdbCluster","msg":"completed initializing database","CrdbCluster":"app-ns/testcrdb","ReconcileId":"acqUPmaepZhbPisLqUVm2Z"}
{"level":"error","ts":1669698278.2586818,"logger":"controller.CrdbCluster","msg":"failed to update cluster status","CrdbCluster":"app-ns/testcrdb","ReconcileId":"acqUPmaepZhbPisLqUVm2Z","error":"Operation cannot be fulfilled on crdbclusters.crdb.cockroachlabs.com \"testcrdb\": the object has been modified; please apply your changes to the latest version and try again","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\texternal/com_github_go_logr_zapr/zapr.go:132\ngithub.com/cockroachdb/cockroach-operator/pkg/controller.(*ClusterReconciler).Reconcile\n\tpkg/controller/cluster_controller.go:196\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\texternal/io_k8s_sigs_controller_runtime/pkg/internal/controller/controller.go:297\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\texternal/io_k8s_sigs_controller_runtime/pkg/internal/controller/controller.go:252\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1.2\n\texternal/io_k8s_sigs_controller_runtime/pkg/internal/controller/controller.go:215\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:155\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:156\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:133\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.UntilWithContext\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:99"}
{"level":"error","ts":1669698278.2593884,"logger":"controller-runtime.manager.controller.crdbcluster","msg":"Reconciler error","reconciler group":"crdb.cockroachlabs.com","reconciler kind":"CrdbCluster","name":"testcrdb","namespace":"app-ns","error":"Operation cannot be fulfilled on crdbclusters.crdb.cockroachlabs.com \"testcrdb\": the object has been modified; please apply your changes to the latest version and try again","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\texternal/com_github_go_logr_zapr/zapr.go:132\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\texternal/io_k8s_sigs_controller_runtime/pkg/internal/controller/controller.go:301\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\texternal/io_k8s_sigs_controller_runtime/pkg/internal/controller/controller.go:252\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1.2\n\texternal/io_k8s_sigs_controller_runtime/pkg/internal/controller/controller.go:215\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:155\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:156\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:133\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.UntilWithContext\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:99"}
{"level":"info","ts":1669698278.2595415,"logger":"controller.CrdbCluster","msg":"reconciling CockroachDB cluster","CrdbCluster":"app-ns/testcrdb","ReconcileId":"ApnFDWosB4MGhzuoSR6JVW"}
{"level":"info","ts":1669698278.2596006,"logger":"webhooks","msg":"default","name":"testcrdb"}
{"level":"info","ts":1669698278.2819178,"logger":"controller.CrdbCluster","msg":"No actor to run; not requeueing","CrdbCluster":"app-ns/testcrdb","ReconcileId":"ApnFDWosB4MGhzuoSR6JVW"}
{"level":"info","ts":1669698278.2820442,"logger":"controller.CrdbCluster","msg":"reconciling CockroachDB cluster","CrdbCluster":"app-ns/testcrdb","ReconcileId":"W5ykg2NfQY2WcspxBQ7VZ"}
{"level":"info","ts":1669698278.282086,"logger":"webhooks","msg":"default","name":"testcrdb"}
{"level":"info","ts":1669698278.305721,"logger":"controller.CrdbCluster","msg":"No actor to run; not requeueing","CrdbCluster":"app-ns/testcrdb","ReconcileId":"W5ykg2NfQY2WcspxBQ7VZ"}
<and more of these>
And we never try again. The pods all report healthy and the cluster did in fact initialize; however, we seemingly just give up on ever re-updating the cluster status to running?
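For illustration, this is the kind of conflict retry I would have expected around the status write. It's only a minimal sketch assuming a controller-runtime client; the helper name and the api import path are my own guesses, not the operator's actual code:

package controller

import (
	"context"

	api "github.com/cockroachdb/cockroach-operator/apis/v1alpha1" // assumed import path for the CrdbCluster type
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/util/retry"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// updateClusterStatus is a hypothetical helper: on a 409 conflict it re-fetches
// the CrdbCluster and retries the status update instead of giving up after the
// first "the object has been modified" error.
func updateClusterStatus(ctx context.Context, c client.Client, key types.NamespacedName, mutate func(*api.CrdbCluster)) error {
	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
		var cluster api.CrdbCluster
		// Work from the latest resourceVersion on every attempt so the API
		// server's optimistic-concurrency check can pass.
		if err := c.Get(ctx, key, &cluster); err != nil {
			return err
		}
		mutate(&cluster)
		return c.Status().Update(ctx, &cluster)
	})
}

This is just the standard client-go retry pattern; I don't know where in the reconcile loop the real fix would need to go, but the expected effect is that a transient conflict right after Initialize gets retried instead of leaving clusterStatus stuck at Failed.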
For reference, this is happening on a private GKE cluster (with NAT-based internet access) running 1.24.5-gke.600 on the control plane.
Yes, the operator and the CR cluster are in different namespaces, but we don't appear to have any issues initializing the cluster, only updating the status afterwards.
For good measure I've also tried the example.yaml included in the repo (to remove Pulumi from the equation) and the same thing happens. Subsequent attempts to modify the cluster aren't able to get it out of the Failed state.