[BUG] Cluster creation & OpsRequest Reconfiguring races when PVC provisioning delays first Pod start (MySQL)

Open elderapo opened this issue 1 month ago • 8 comments

Describe the bug
Applying a MySQL Cluster and an OpsRequest (type: Reconfiguring with at least one restart-required parameter) in the same apply for new clusters leads to a crashloop/broken cluster when PVC provisioning delays the first Pod start. The OpsRequest is queued and processed by the operator before the MySQL cluster has completed its first boot. When the volume is finally provisioned and the Pod starts, the already-processed OpsRequest immediately triggers the restart-required reconfigure (e.g., innodb_buffer_pool_instances), and the component fails to complete initial bootstrap reliably.

To Reproduce

  1. Apply the following at once (single kubectl apply -f), using a storage class that takes a few seconds to provision a PVC:

    ---
    kind: Namespace
    apiVersion: v1
    metadata:
      name: kubeblocks-test
    ---
    apiVersion: apps.kubeblocks.io/v1
    kind: Cluster
    metadata:
      name: cluster1
      namespace: kubeblocks-test
    spec:
      clusterDef: mysql
      topology: semisync
      terminationPolicy: Delete
      componentSpecs:
        - name: mysql
          componentDef: "mysql-8.0"
          serviceVersion: 8.0.33
          replicas: 1
          volumeClaimTemplates:
            - name: data
              spec:
                accessModes: ["ReadWriteOnce"]
                resources:
                  requests:
                    storage: 10Gi
    ---
    apiVersion: operations.kubeblocks.io/v1alpha1
    kind: OpsRequest
    metadata:
      name: mysql-reconfiguring
      namespace: kubeblocks-test
    spec:
      clusterName: cluster1
      force: false
      reconfigures:
        - componentName: mysql
          parameters:
            - key: innodb_buffer_pool_instances
              value: "5"
      preConditionDeadlineSeconds: 60
      type: Reconfiguring
    
  2. Observe: PVC provisioning keeps the Pod at Pending; the OpsRequest is processed and ready to execute before the Pod exists.

  3. When the Pod finally starts, the restart-required reconfigure is executed immediately (before first-boot completes), and the component fails to finish initialization / enters restart loops.

Expected behavior
The OpsRequest should not be processed until the MySQL Pod is running and all init containers have completed; applying Cluster + OpsRequest together for new clusters should be safe for GitOps workflows even when PVC provisioning is slow.

Additional context

  • Kubernetes: 1.33.5+k3s1
  • KubeBlocks: v1.0.1
  • MySQL add-on: 1.0.3
  • Storage class / CSI: hetzner-csi

Does not happen if

  • The OpsRequest is applied after the Cluster successfully bootstraps (all init containers successfully exit).
  • The Cluster has no volumeClaimTemplates (Pod starts quickly).
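The first condition suggests a client-side workaround: hold the OpsRequest back until a readiness check passes. A minimal sketch of such a gate follows; `wait_for` and the `condition` callable are hypothetical names, and how the condition is evaluated (for example, by inspecting Pod status) is left to the caller.

```python
import time
from typing import Callable


def wait_for(
    condition: Callable[[], bool],
    timeout_s: float,
    interval_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `condition` until it returns True or `timeout_s` elapses.

    Hypothetical helper: the caller decides what "bootstrapped" means
    and applies the OpsRequest only if this returns True.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)
```

A deployment script could call `wait_for(...)` after applying the Cluster and apply the OpsRequest only when it returns `True`. This does not fit a single-apply GitOps flow, which is the case the issue asks to make safe.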

elderapo avatar Oct 08 '25 19:10 elderapo

Hi @elderapo

Reconfiguration is a special operation that can be executed when the cluster is running/updating/abnormal/failed. OpsRequests are designed as one-time actions.

shanshanying avatar Oct 09 '25 03:10 shanshanying

Hi @elderapo, you can also set preConditionDeadlineSeconds to delay the execution of the operation until the cluster is running.

wangyelei avatar Oct 09 '25 06:10 wangyelei

Hi @shanshanying, I understand, but an OpsRequest with type: Reconfiguring should wait until the cluster is in a state that can accept the reconfiguration. If it is applied while the cluster is still being created, it breaks the cluster (I believe because it interrupts the initial init-container setup, and that process never recovers); the cluster ends up unusable, in an infinite crash loop.

Hi @wangyelei, in the above example, I've already used preConditionDeadlineSeconds. Without it, the ops fails right after being applied. Setting it to 60 makes the ops wait for the mysql Pod to become running, but it then interrupts the Pod's init containers (which do some bootstrapping work, I believe). The result is a crashed cluster that, after the restart caused by the OpsRequest, ends up in a restart loop.

elderapo avatar Oct 09 '25 08:10 elderapo

These are the four cluster phases in which a Reconfiguration Ops can be applied: running/updating/abnormal/failed. (Did the cluster status go from creating to updating? Otherwise the ops will still wait in Pending.) It is recommended that the application layer control when to apply the reconfiguration.
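The phase gating described here can be sketched as a small check. This is an illustration only: the phase names follow this comment, their capitalization is an assumption, and `reconfigure_may_start` is a hypothetical helper, not KubeBlocks code.

```python
# Phases in which, per the comment above, a Reconfiguring OpsRequest may run.
# The capitalized spellings are assumed, not taken from the KubeBlocks source.
RECONFIGURE_PHASES = frozenset({"Running", "Updating", "Abnormal", "Failed"})


def reconfigure_may_start(cluster_phase: str) -> bool:
    """Return True if the ops may run; otherwise it keeps waiting in Pending."""
    return cluster_phase in RECONFIGURE_PHASES
```

Under this model a cluster that is still in the creating phase keeps the ops pending, so the reported failure would depend on when the cluster leaves that phase.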

shanshanying avatar Oct 09 '25 10:10 shanshanying

It seems that the cluster goes from Creating => Running right after the containers in the Pod start, but before the init containers finish. Because of that, while the init containers are still running, the OpsRequest causes the Pod to restart, interrupting the init jobs. I think it would be fixed if the Cluster transitioned from Creating => Running only when:

  • The Pod has started
  • The init containers in the Pod have finished their work
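The two conditions can be expressed as a predicate over a Pod object as returned by `kubectl get pod -o json`. This is a minimal sketch, not the operator's logic: `pod_bootstrapped` is a hypothetical name, the extra container readiness check is an added assumption, and the sketch assumes the Pod has no restartable (sidecar-style) init containers, which never reach a terminated state.

```python
from typing import Any, Mapping


def pod_bootstrapped(pod: Mapping[str, Any]) -> bool:
    """True when the Pod is Running, every init container exited with
    code 0, and every regular container reports ready."""
    status = pod.get("status", {})
    if status.get("phase") != "Running":
        return False
    # Assumption: all init containers are run-to-completion containers.
    for init in status.get("initContainerStatuses", []):
        terminated = init.get("state", {}).get("terminated")
        if terminated is None or terminated.get("exitCode") != 0:
            return False
    containers = status.get("containerStatuses", [])
    return bool(containers) and all(c.get("ready", False) for c in containers)
```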

elderapo avatar Oct 09 '25 10:10 elderapo

I failed to reproduce the case. In KB, a cluster is running only when all pods are running (and, for MySQL clusters, the pods must also report their roles). It would be helpful if you could describe in detail how to reproduce the case where the pods are not initialized but the cluster is running.

shanshanying avatar Oct 09 '25 11:10 shanshanying

What CSI did you use to provision the PVC? In my case, it's Hetzner CSI, which takes like 5-15 seconds to provision and bind the volume.

elderapo avatar Oct 09 '25 12:10 elderapo

I used ebd, and the volume binding mode is WaitForFirstConsumer.

shanshanying avatar Oct 17 '25 04:10 shanshanying

This issue has been marked as stale because it has been open for 30 days with no activity

github-actions[bot] avatar Nov 17 '25 00:11 github-actions[bot]