percona-server-mysql-operator
K8SPG-859 [POC] Percona Server MySQL Hibernation Feature
CHANGE DESCRIPTION
Problem: This PR implements a hibernation feature for Percona Server MySQL clusters that allows automatic pausing and unpausing based on cron schedules. This is particularly useful for development environments, test clusters, or any scenario where you want to automatically stop MySQL clusters during off-hours to save resources.
🎯 Key Features
✅ Core Hibernation Functionality
- Automatic Pause/Unpause: Schedule-based hibernation using cron expressions
- Manual Override: Manual pause/unpause via the `spec.pause` field
- State Synchronization: Hibernation state automatically syncs with cluster state
- Health Checks: Only allows hibernation when the cluster is in the `Ready` state
- Backup/Restore Awareness: Prevents hibernation during active backups or restores
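The health-check and backup/restore rules above can be read as a single gate that a scheduled pause must pass. The following is a minimal sketch of that gate, not the controller's actual code: `ClusterState`, `clusterView`, and `canHibernate` are hypothetical names chosen for illustration, and the real reconciler would derive these inputs from the PS, backup, and restore objects.

```go
package main

import "fmt"

// ClusterState mirrors the cluster states named in this PR description.
// The type and constant names are illustrative, not the operator's API.
type ClusterState string

const (
	StateInitializing ClusterState = "Initializing"
	StateReady        ClusterState = "Ready"
	StatePaused       ClusterState = "Paused"
	StateStopping     ClusterState = "Stopping"
	StateError        ClusterState = "Error"
)

// clusterView is a hypothetical snapshot of what the reconciler observes.
type clusterView struct {
	State         ClusterState
	ActiveBackup  bool
	ActiveRestore bool
}

// canHibernate reports whether a scheduled pause may proceed and,
// if it may not, a human-readable reason.
func canHibernate(c clusterView) (bool, string) {
	switch {
	case c.State != StateReady:
		return false, fmt.Sprintf("cluster is %s, not Ready", c.State)
	case c.ActiveBackup:
		return false, "backup in progress"
	case c.ActiveRestore:
		return false, "restore in progress"
	}
	return true, ""
}

func main() {
	ok, reason := canHibernate(clusterView{State: StateReady, ActiveBackup: true})
	fmt.Println(ok, reason) // prints "false backup in progress"
}
```

Returning the reason alongside the boolean lets the caller surface it in the hibernation status instead of only logging it.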
✅ Smart Scheduling Logic
- Next Window Scheduling: If cluster is unhealthy during scheduled time, automatically schedules for next window
- Schedule Change Detection: Automatically updates next pause/unpause times when schedules change
- First-time Evaluation: Handles initial hibernation setup correctly
- Proactive Scheduling: Prevents immediate pausing when cluster becomes ready after being unready
✅ Robust Error Handling
- Invalid Schedule Handling: Gracefully handles invalid cron expressions
- Cluster State Management: Proper handling of `Initializing`, `Error`, `Stopping`, `Paused`, and `Ready` states
- Race Condition Prevention: Prevents state flipping during cluster startup/recovery
🏗️ Architecture
New Controller: PerconaServerMySQLHibernationReconciler
- Dedicated controller for hibernation logic
- Registered in `cmd/manager/main.go`
- RBAC permissions for PS objects and backup/restore resources
Enhanced CRD Fields
spec:
  hibernation:
    enabled: true
    schedule:
      pause: "0 18 * * 1-5"    # 6 PM Mon-Fri
      unpause: "0 8 * * 1-5"   # 8 AM Mon-Fri
  pause: false                 # Manual override
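For readers who think in Go types, the spec fields above could map to structs along these lines. This is a sketch under assumptions: the type names (`HibernationSchedule`, `HibernationSpec`, `ClusterSpec`) and the `decodeSpec` helper are hypothetical, and the JSON tags are inferred from the YAML keys rather than taken from the PR's API package.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// HibernationSchedule holds the two cron expressions (hypothetical type name).
type HibernationSchedule struct {
	Pause   string `json:"pause,omitempty"`
	Unpause string `json:"unpause,omitempty"`
}

// HibernationSpec corresponds to spec.hibernation (hypothetical type name).
type HibernationSpec struct {
	Enabled  bool                `json:"enabled"`
	Schedule HibernationSchedule `json:"schedule"`
}

// ClusterSpec shows only the fields relevant to hibernation.
type ClusterSpec struct {
	Hibernation *HibernationSpec `json:"hibernation,omitempty"`
	Pause       bool             `json:"pause,omitempty"` // manual override
}

// decodeSpec parses a JSON-encoded spec, the form the API server stores.
func decodeSpec(raw []byte) (ClusterSpec, error) {
	var s ClusterSpec
	err := json.Unmarshal(raw, &s)
	return s, err
}

func main() {
	raw := []byte(`{"hibernation":{"enabled":true,"schedule":{"pause":"0 18 * * 1-5","unpause":"0 8 * * 1-5"}},"pause":false}`)
	spec, err := decodeSpec(raw)
	if err != nil {
		fmt.Println("decode error:", err)
		return
	}
	fmt.Println(spec.Hibernation.Enabled, spec.Hibernation.Schedule.Pause, spec.Pause)
}
```

Making `Hibernation` a pointer keeps clusters without the block distinguishable from clusters that set `enabled: false` explicitly.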
Status Fields
status:
  hibernation:
    state: "Active"   # Active, Paused, Scheduled, Blocked, Disabled
    nextPauseTime: "2025-09-24T18:00:00Z"
    nextUnpauseTime: "2025-09-25T08:00:00Z"
    lastPauseTime: "2025-09-23T18:00:00Z"
    lastUnpauseTime: "2025-09-24T08:00:00Z"
    reason: "Cluster not ready during scheduled time"
CHECKLIST
Jira
- [ ] Is the Jira ticket created and referenced properly?
- [ ] Does the Jira ticket have the proper statuses for documentation (`Needs Doc`) and QA (`Needs QA`)?
- [ ] Does the Jira ticket link to the proper milestone (Fix Version field)?
Tests
- [ ] Is an E2E test/test case added for the new feature/change?
- [ ] Are unit tests added where appropriate?
Config/Logging/Testability
- [ ] Are all needed new/changed options added to default YAML files?
- [ ] Are all needed new/changed options added to the Helm Chart?
- [ ] Did we add proper logging messages for operator actions?
- [ ] Did we ensure compatibility with the previous version or cluster upgrade process?
- [ ] Does the change support oldest and newest supported PS version?
- [ ] Does the change support oldest and newest supported Kubernetes version?
| Test Name | Result | Time |
|---|---|---|
| async-ignore-annotations-8-4 | passed | 00:06:21 |
| async-global-metadata-8-4 | passed | 00:14:48 |
| async-upgrade-8-0 | passed | 00:12:34 |
| async-upgrade-8-4 | passed | 00:12:19 |
| auto-config-8-4 | passed | 00:24:35 |
| config-8-4 | passed | 00:16:25 |
| config-router-8-0 | passed | 00:07:30 |
| config-router-8-4 | passed | 00:07:19 |
| demand-backup-minio-8-0 | passed | 00:20:09 |
| demand-backup-minio-8-4 | passed | 00:19:52 |
| demand-backup-cloud-8-4 | passed | 00:20:55 |
| demand-backup-retry-8-4 | passed | 00:14:58 |
| async-data-at-rest-encryption-8-0 | passed | 00:12:58 |
| async-data-at-rest-encryption-8-4 | passed | 00:13:18 |
| gr-global-metadata-8-4 | failure | 00:14:20 |
| gr-data-at-rest-encryption-8-0 | failure | 00:17:38 |
| gr-data-at-rest-encryption-8-4 | failure | 00:17:43 |
| gr-demand-backup-minio-8-4 | failure | 00:13:28 |
| gr-demand-backup-cloud-8-4 | failure | 00:13:03 |
| gr-demand-backup-haproxy-8-4 | passed | 00:10:09 |
| gr-finalizer-8-4 | passed | 00:06:56 |
| gr-haproxy-8-0 | passed | 00:04:17 |
| gr-haproxy-8-4 | passed | 00:04:09 |
| gr-ignore-annotations-8-4 | passed | 00:04:56 |
| gr-init-deploy-8-0 | passed | 00:09:06 |
| gr-init-deploy-8-4 | passed | 00:09:33 |
| gr-one-pod-8-4 | failure | 00:09:30 |
| gr-recreate-8-4 | failure | 00:06:27 |
| gr-scaling-8-4 | passed | 00:07:39 |
| gr-scheduled-backup-8-4 | passed | 00:17:07 |
| gr-security-context-8-4 | passed | 00:09:52 |
| gr-self-healing-8-4 | passed | 00:21:56 |
| gr-tls-cert-manager-8-4 | passed | 00:10:44 |
| gr-users-8-4 | passed | 00:05:38 |
| gr-upgrade-8-0 | passed | 00:08:23 |
| gr-upgrade-8-4 | passed | 00:10:30 |
| haproxy-8-0 | passed | 00:08:53 |
| haproxy-8-4 | passed | 00:09:47 |
| init-deploy-8-0 | passed | 00:05:48 |
| init-deploy-8-4 | passed | 00:07:18 |
| limits-8-4 | passed | 00:05:31 |
| monitoring-8-4 | passed | 00:19:19 |
| one-pod-8-0 | passed | 00:06:56 |
| one-pod-8-4 | passed | 00:06:16 |
| operator-self-healing-8-4 | passed | 00:12:13 |
| pvc-resize-8-4 | passed | 00:09:28 |
| recreate-8-4 | passed | 00:13:11 |
| scaling-8-4 | passed | 00:10:25 |
| scheduled-backup-8-0 | passed | 00:16:24 |
| scheduled-backup-8-4 | failure | 00:21:49 |
| service-per-pod-8-4 | passed | 00:07:53 |
| sidecars-8-4 | passed | 00:06:09 |
| smart-update-8-4 | passed | 00:08:57 |
| storage-8-4 | passed | 00:03:53 |
| telemetry-8-4 | passed | 00:06:19 |
| tls-cert-manager-8-4 | passed | 00:10:21 |
| users-8-0 | passed | 00:08:19 |
| users-8-4 | passed | 00:07:38 |
| version-service-8-4 | passed | 00:19:21 |

| Summary | Value |
|---|---|
| Tests Run | 59/59 |
| Job Duration | 02:16:58 |
| Total Test Time | 11:09:41 |

commit: https://github.com/percona/percona-server-mysql-operator/pull/1092/commits/94907b7a0f4db5fe90fa1a52d3bed5c82a276c9e
image: perconalab/percona-server-mysql-operator:PR-1092-94907b7a