percona-server-mysql-operator

Readiness Probe Causes Infinite Restart Loop During Group Replication Recovery

Open abh opened this issue 3 months ago • 3 comments

Report

Note: this was written up with the help of AI, but based on experience with several crashes that the operator couldn't recover from. I either had to restore from backup, or disable the operator and "hand recover" (and afterwards I'd reset from scratch to get an "operator operated" cluster again).

Bug Description

The Percona MySQL Operator's readiness probe treats RECOVERING nodes as unhealthy, causing Kubernetes to restart pods that are actively recovering data through MySQL Group Replication. This creates an infinite loop where nodes can never complete recovery because they are killed mid-process.

More about the problem

The Problem

When a MySQL node is in RECOVERING state (legitimately catching up with group replication), the readiness probe fails and Kubernetes restarts the pod. This prevents the node from ever completing recovery, creating a deadlock situation.

Current Readiness Probe Behavior

kubectl exec ntpdb-mysql-1 -n ntpdb -- /opt/percona/healthcheck readiness
# Output: 2025/09/27 20:46:18 readiness check failed: Member state: RECOVERING
# Exit code: 1

Group Replication Status

SELECT MEMBER_HOST, MEMBER_STATE, MEMBER_ROLE FROM performance_schema.replication_group_members;
+----------------------------------+--------------+-------------+
| MEMBER_HOST                      | MEMBER_STATE | MEMBER_ROLE |
+----------------------------------+--------------+-------------+
| ntpdb-mysql-0.ntpdb-mysql.ntpdb  | ONLINE       | PRIMARY     |
| ntpdb-mysql-1.ntpdb-mysql.ntpdb  | RECOVERING   | SECONDARY   |
+----------------------------------+--------------+-------------+

Steps to reproduce

The Deadlock Cycle

  1. Node starts recovery: MySQL node joins group replication in RECOVERING state
  2. Readiness probe fails: /opt/percona/healthcheck readiness returns exit code 1 for RECOVERING nodes
  3. Kubernetes restarts pod: After the readiness probe failure threshold is reached, the pod is marked unready and restarted
  4. Recovery interruption: Node loses recovery progress and must start over
  5. StatefulSet blocking: Next pod (mysql-2) cannot start until mysql-1 is Ready
  6. Operator deadlock: Cannot complete initialization without all pods healthy
  7. Infinite loop: Process repeats indefinitely

Versions

  • Operator Version: v0.12.0
  • MySQL Version: percona/percona-server:8.0.43-34
  • Kubernetes Version: v1.32.6
  • Cluster Configuration: 3-node MySQL Group Replication with StatefulSet

Anything else?

StatefulSet Cascade Failure

  • Pod ordering: StatefulSets require pod N to be Ready before creating pod N+1
  • Blocked scaling: mysql-2 never gets created because mysql-1 never becomes Ready
  • Cluster incomplete: Operator cannot complete initialization with missing pods

Operational Impact

  • Extended outages: Clusters stuck in initialization for hours/days
  • Data consistency issues: Repeated recovery interruptions
  • Resource waste: Continuous pod restarts consume CPU/memory
  • Manual intervention required: No automatic recovery possible

Expected Behavior

Readiness Probe Should Accept RECOVERING

The readiness probe should distinguish between:

  1. Healthy States (Ready=True):

    • ONLINE - Node is fully operational
    • RECOVERING - Node is actively syncing data (healthy progress)
  2. Unhealthy States (Ready=False):

    • ERROR - Node has encountered an error
    • OFFLINE - Node is not participating in group
    • UNREACHABLE - Network connectivity issues

Rationale

  • RECOVERING is expected: During cluster recovery, nodes legitimately spend time catching up
  • Progress is healthy: RECOVERING means the node is actively syncing data
  • Time is required: Large datasets may take hours to sync
  • Interruption is harmful: Restarting during recovery loses progress

Proposed Solution

1. Modify Readiness Check Logic

# Current logic (problematic):
if [ "$MEMBER_STATE" != "ONLINE" ]; then
    echo "readiness check failed: Member state: $MEMBER_STATE"
    exit 1
fi

# Proposed logic:
case "$MEMBER_STATE" in
    "ONLINE"|"RECOVERING")
        exit 0  # Ready
        ;;
    "ERROR"|"OFFLINE"|"UNREACHABLE"|"")
        echo "readiness check failed: Member state: $MEMBER_STATE"
        exit 1  # Not ready
        ;;
esac
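
For reference, the actual readiness check is implemented in the operator's Go healthcheck binary rather than a shell script. Purely as an illustrative sketch (not the operator's real code; isReady and the environment-variable lookup are hypothetical stand-ins for the real state query), the proposed classification could look roughly like this in Go:

// Hypothetical sketch only: not the operator's actual healthcheck code.
// It illustrates the proposed classification: ONLINE and RECOVERING count
// as ready, everything else (including unknown states) does not.
package main

import (
    "fmt"
    "os"
)

// isReady maps a Group Replication MEMBER_STATE to a readiness verdict.
func isReady(memberState string) bool {
    switch memberState {
    case "ONLINE", "RECOVERING":
        return true
    default: // ERROR, OFFLINE, UNREACHABLE, empty, or unknown states
        return false
    }
}

func main() {
    // In the real healthcheck the state comes from
    // performance_schema.replication_group_members; an environment
    // variable is used here purely for illustration.
    state := os.Getenv("MEMBER_STATE")
    if !isReady(state) {
        fmt.Printf("readiness check failed: Member state: %s\n", state)
        os.Exit(1)
    }
}

The important detail is the default branch: anything other than ONLINE or RECOVERING, including states that do not exist today, is treated as not ready.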

2. Add Recovery Progress Monitoring

For RECOVERING nodes, optionally check whether progress is being made (see the sketch after this list):

  • Monitor GTID_EXECUTED advancement
  • Allow reasonable timeout for large recoveries
  • Only fail if recovery is truly stuck (no progress for extended period)
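
Below is a minimal sketch of that idea, assuming the probe keeps a small cache file between invocations; the MONITOR_DSN environment variable, the stateFile path, and the one-hour timeout are all assumptions made for illustration, not existing operator behavior:

// Rough sketch of "fail only if recovery is stuck". Not operator code;
// the DSN, state file path, and timeout are assumptions.
package main

import (
    "database/sql"
    "fmt"
    "os"
    "strings"
    "time"

    _ "github.com/go-sql-driver/mysql"
)

const (
    stateFile    = "/tmp/last_gtid_executed" // hypothetical cache location
    stuckTimeout = 1 * time.Hour             // no progress for this long counts as stuck
)

func main() {
    db, err := sql.Open("mysql", os.Getenv("MONITOR_DSN")) // e.g. "monitor:pass@tcp(127.0.0.1:3306)/"
    if err != nil {
        fmt.Println("readiness check failed:", err)
        os.Exit(1)
    }
    defer db.Close()

    var gtidExecuted string
    if err := db.QueryRow("SELECT @@GLOBAL.gtid_executed").Scan(&gtidExecuted); err != nil {
        fmt.Println("readiness check failed:", err)
        os.Exit(1)
    }

    // If the GTID set changed since the last probe, recovery is making
    // progress: refresh the cache (and its mtime) and report ready.
    prev, _ := os.ReadFile(stateFile)
    if strings.TrimSpace(string(prev)) != gtidExecuted {
        _ = os.WriteFile(stateFile, []byte(gtidExecuted), 0o644)
        return
    }

    // No progress since the last probe; only fail once the cache file is
    // older than the allowed timeout.
    if info, err := os.Stat(stateFile); err == nil && time.Since(info.ModTime()) > stuckTimeout {
        fmt.Println("readiness check failed: recovery made no progress for", stuckTimeout)
        os.Exit(1)
    }
}

Because gtid_executed only grows as the recovery channel applies transactions, comparing it between probe runs is a cheap way to tell "slow but progressing" apart from "genuinely stuck".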

3. Configuration Options

Add operator configuration to control readiness behavior:

spec:
  mysql:
    readinessProbe:
      allowRecovering: true  # Default: true
      recoveryTimeout: 3600  # Seconds to allow recovery before failing
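
If something along these lines were implemented, the corresponding spec fields might look roughly like the following Go sketch (hypothetical; these fields and names are not part of the current CRD):

// Hypothetical CR fields mirroring the YAML above; they do not exist in the
// operator's CRD today and the names are only illustrative.
package apis

type ReadinessProbeSpec struct {
    // AllowRecovering makes the readiness probe treat RECOVERING members as ready.
    AllowRecovering bool `json:"allowRecovering,omitempty"`
    // RecoveryTimeout is how long (in seconds) a member may stay in RECOVERING
    // without observable progress before the probe reports it as not ready.
    RecoveryTimeout int32 `json:"recoveryTimeout,omitempty"`
}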

Reproduction Steps

  1. Create 3-node cluster with some data
  2. Simulate node failure (delete 2 pods, leaving 1 with most recent data)
  3. Start recovery: Let operator attempt to recover cluster
  4. Observe deadlock:
    • Node 1 reaches RECOVERING state
    • Readiness probe fails repeatedly
    • Pod restarts interrupt recovery
    • Node 2 never starts due to StatefulSet ordering
    • Cluster never completes initialization

Current Workaround

None available: the issue is fundamental to the readiness probe logic and cannot be worked around without code changes.

Evidence

Pod Restart Pattern

kubectl get pods -n ntpdb | grep mysql-1
# ntpdb-mysql-1    1/2     Running     4 (2m ago)     8m
# Shows continuous restarts due to readiness failures

StatefulSet Status

kubectl get statefulset ntpdb-mysql -n ntpdb
# Shows replicas: 2/3 because mysql-2 cannot start

Operator Status

kubectl get perconaservermysql ntpdb -n ntpdb -o jsonpath='{.status.state}'
# Shows: "initializing" (stuck indefinitely)

Related Issues

This issue compounds with the broader operator reconciliation problems:

  • Status update conflicts prevent progress
  • Crash recovery logic is overly aggressive
  • Initialization state never completes

As best I can tell, this issue:

  • Prevents cluster recovery in common failure scenarios
  • Requires manual operator intervention/restart
  • Can cause extended production outages
  • Has no viable workaround

abh · Sep 27 '25 20:09

@abh I think this is about the liveness probe, not the readiness probe :) We will recheck it tomorrow. Thanks for the report.

hors · Sep 28 '25 07:09

@abh if we are talking about readiness, the behaviour is correct because a pod can't be ready while it is syncing data.

hors · Sep 28 '25 08:09

@abh I tried to reproduce your issue but couldn't.

I created a fresh cluster and filled it with 22GB of data. Then I deleted the PVC of a secondary and restarted the pod. It re-joined the cluster and started incremental recovery.

It performed the recovery for ~14 minutes:

2025-10-31T14:55:05.715007Z 41 [Note] [MY-010581] [Repl] Replica SQL thread for channel 'group_replication_recovery' initialized, starting replication in log 'FIRST' at position 0, relay log './cluster1-mysql-2-relay-bin-group_replication_recovery.000001' position: 4
2025-10-31T14:55:05.719178Z 40 [System] [MY-014002] [Repl] Replica receiver thread for channel 'group_replication_recovery': connected to source 'mysql_innodb_cluster_38088222@cluster1-mysql-0.cluster1-mysql.ps-6348:3306' with server_uuid=89c9cc35-b665-11f0-8bfc-561239c76a31, server_id=38088220. Starting GTID-based replication.
2025-10-31T14:55:06.280450Z 15 [System] [MY-014010] [Repl] Plugin group_replication reported: 'Plugin 'group_replication' has been started.'
2025-10-31T14:57:05.129266Z 41 [Note] [MY-010559] [Repl] Multi-threaded replica statistics for channel 'group_replication_recovery': seconds elapsed = 120; events assigned = 172211; worker queues filled over overrun level = 0; waited due a Worker queue full = 0; waited due the total size = 0; waited at clock conflicts = 103232851200 waited (count) when Workers occupied = 73 waited when Workers occupied = 5586529300
2025-10-31T14:59:05.281746Z 41 [Note] [MY-010559] [Repl] Multi-threaded replica statistics for channel 'group_replication_recovery': seconds elapsed = 120; events assigned = 335197; worker queues filled over overrun level = 0; waited due a Worker queue full = 0; waited due the total size = 0; waited at clock conflicts = 203214933800 waited (count) when Workers occupied = 136 waited when Workers occupied = 11051319400
2025-10-31T15:01:05.324772Z 41 [Note] [MY-010559] [Repl] Multi-threaded replica statistics for channel 'group_replication_recovery': seconds elapsed = 120; events assigned = 496138; worker queues filled over overrun level = 0; waited due a Worker queue full = 0; waited due the total size = 0; waited at clock conflicts = 302758974500 waited (count) when Workers occupied = 196 waited when Workers occupied = 15384357200
2025-10-31T15:03:05.582740Z 41 [Note] [MY-010559] [Repl] Multi-threaded replica statistics for channel 'group_replication_recovery': seconds elapsed = 120; events assigned = 667319; worker queues filled over overrun level = 0; waited due a Worker queue full = 0; waited due the total size = 0; waited at clock conflicts = 398635310700 waited (count) when Workers occupied = 261 waited when Workers occupied = 22328535800
2025-10-31T15:05:05.333573Z 41 [Note] [MY-010559] [Repl] Multi-threaded replica statistics for channel 'group_replication_recovery': seconds elapsed = 120; events assigned = 862082; worker queues filled over overrun level = 0; waited due a Worker queue full = 0; waited due the total size = 0; waited at clock conflicts = 494669954800 waited (count) when Workers occupied = 342 waited when Workers occupied = 28439384000
2025-10-31T15:07:05.371152Z 41 [Note] [MY-010559] [Repl] Multi-threaded replica statistics for channel 'group_replication_recovery': seconds elapsed = 120; events assigned = 1057875; worker queues filled over overrun level = 0; waited due a Worker queue full = 0; waited due the total size = 0; waited at clock conflicts = 592575830200 waited (count) when Workers occupied = 403 waited when Workers occupied = 33465058300
2025-10-31T15:08:31.625975Z 41 [System] [MY-010379] [Repl] Replica SQL thread stopped because it reached UNTIL SQL_AFTER_GTIDS 89c9cc35-b665-11f0-8bfc-561239c76a31:1-4, 9bec0270-b665-11f0-a3ec-561239c76a31:1-13928

The pod stayed unready (as it should) during recovery and there were no restarts:

cluster1-mysql-2                                 2/2     Terminating   0          26m
cluster1-mysql-2                                 0/2     Completed     0          26m
cluster1-mysql-2                                 0/2     Completed     0          26m
cluster1-mysql-2                                 0/2     Completed     0          26m
cluster1-mysql-2                                 0/2     Pending       0          0s
cluster1-mysql-2                                 0/2     Pending       0          4s
cluster1-mysql-2                                 0/2     Init:0/1      0          4s
cluster1-mysql-2                                 0/2     Init:0/1      0          14s
cluster1-mysql-2                                 0/2     PodInitializing   0          15s
cluster1-mysql-2                                 1/2     Running           0          16s
cluster1-mysql-2                                 1/2     Running           0          14m
cluster1-mysql-2                                 2/2     Running           0          14m

egegunes · Oct 31 '25 15:10