K8SPG-859 [POC] Percona Server MySQL Hibernation Feature

Open • hors opened this issue 2 months ago • 1 comment

CHANGE DESCRIPTION

Problem: This PR implements a hibernation feature for Percona Server MySQL clusters that allows automatic pausing and unpausing based on cron schedules. This is particularly useful for development environments, test clusters, or any scenario where you want to automatically stop MySQL clusters during off-hours to save resources.

🎯 Key Features

Core Hibernation Functionality

  • Automatic Pause/Unpause: Schedule-based hibernation using cron expressions
  • Manual Override: Manual pause/unpause via spec.pause field
  • State Synchronization: Hibernation state automatically syncs with cluster state
  • Health Checks: Only allows hibernation when cluster is in Ready state
  • Backup/Restore Awareness: Prevents hibernation during active backups or restores
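
The health and backup/restore checks listed above amount to a guard evaluated before any scheduled pause. The following is only a minimal sketch of such a guard, not the code from this PR: the `ClusterState` type, its constant values, the `canHibernate` function, and the idea of passing in counts of running backups and restores are all assumptions made for illustration.

```go
package main

import "fmt"

// ClusterState is a hypothetical type; the values follow the state names
// used in this description, not necessarily the operator's real constants.
type ClusterState string

const (
	StateInitializing ClusterState = "Initializing"
	StateReady        ClusterState = "Ready"
	StatePaused       ClusterState = "Paused"
	StateStopping     ClusterState = "Stopping"
	StateError        ClusterState = "Error"
)

// canHibernate reports whether a scheduled pause may proceed.
// When it may not, the second return value explains why.
func canHibernate(state ClusterState, activeBackups, activeRestores int) (bool, string) {
	switch {
	case state != StateReady:
		return false, fmt.Sprintf("cluster state is %s, not Ready", state)
	case activeBackups > 0:
		return false, "backup in progress"
	case activeRestores > 0:
		return false, "restore in progress"
	}
	return true, ""
}

func main() {
	// A Ready cluster with one running backup must not be paused.
	ok, reason := canHibernate(StateReady, 1, 0)
	fmt.Println(ok, reason)
}
```

Returning the reason alongside the boolean lets a caller copy it into a status field such as status.hibernation.reason.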

Smart Scheduling Logic

  • Next Window Scheduling: If the cluster is unhealthy at the scheduled time, the pause is automatically rescheduled for the next window
  • Schedule Change Detection: Automatically updates next pause/unpause times when schedules change
  • First-time Evaluation: Handles initial hibernation setup correctly
  • Proactive Scheduling: Prevents immediate pausing when cluster becomes ready after being unready
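
To illustrate next-window scheduling, here is a rough, standard-library-only sketch that finds the first time strictly after a given instant matching a 5-field cron expression. It is not the PR's implementation, which may well rely on a cron library. The function names (`parseField`, `nextRun`, `nextRunRFC3339`) are hypothetical, and the parser is deliberately simplified: it accepts only `*`, single values, ranges, and comma lists, treats day-of-month and day-of-week as a logical AND, and evaluates everything in UTC.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
	"time"
)

// parseField expands one cron field ("*", "5", "1-5", "1,3,5") into the set of
// values it allows. Step syntax ("*/5") and names ("MON") are not supported.
func parseField(field string, minV, maxV int) (map[int]bool, error) {
	allowed := make(map[int]bool)
	for _, part := range strings.Split(field, ",") {
		lo, hi := minV, maxV
		if part != "*" {
			bounds := strings.SplitN(part, "-", 2)
			var err error
			if lo, err = strconv.Atoi(bounds[0]); err != nil {
				return nil, fmt.Errorf("bad cron value %q", part)
			}
			hi = lo
			if len(bounds) == 2 {
				if hi, err = strconv.Atoi(bounds[1]); err != nil {
					return nil, fmt.Errorf("bad cron value %q", part)
				}
			}
		}
		if lo < minV || hi > maxV || lo > hi {
			return nil, fmt.Errorf("cron value %q outside %d-%d", part, minV, maxV)
		}
		for v := lo; v <= hi; v++ {
			allowed[v] = true
		}
	}
	return allowed, nil
}

// nextRun returns the first whole minute strictly after `after` that matches
// expr (minute hour day-of-month month day-of-week, with 0 = Sunday).
func nextRun(expr string, after time.Time) (time.Time, error) {
	fields := strings.Fields(expr)
	if len(fields) != 5 {
		return time.Time{}, fmt.Errorf("expected 5 cron fields, got %d", len(fields))
	}
	limits := [5][2]int{{0, 59}, {0, 23}, {1, 31}, {1, 12}, {0, 6}}
	var sets [5]map[int]bool
	for i, f := range fields {
		s, err := parseField(f, limits[i][0], limits[i][1])
		if err != nil {
			return time.Time{}, err
		}
		sets[i] = s
	}
	t := after.Truncate(time.Minute).Add(time.Minute)
	// Brute-force scan, minute by minute, bounded to roughly one year.
	for i := 0; i < 366*24*60; i++ {
		if sets[0][t.Minute()] && sets[1][t.Hour()] && sets[2][t.Day()] &&
			sets[3][int(t.Month())] && sets[4][int(t.Weekday())] {
			return t, nil
		}
		t = t.Add(time.Minute)
	}
	return time.Time{}, fmt.Errorf("no match within a year for %q", expr)
}

// nextRunRFC3339 is a string-in, string-out convenience wrapper around nextRun.
func nextRunRFC3339(expr, after string) (string, error) {
	t, err := time.Parse(time.RFC3339, after)
	if err != nil {
		return "", err
	}
	next, err := nextRun(expr, t.UTC())
	if err != nil {
		return "", err
	}
	return next.Format(time.RFC3339), nil
}

func main() {
	// The Wednesday 18:00 window was skipped, so ask for the window after it.
	next, err := nextRunRFC3339("0 18 * * 1-5", "2025-09-24T18:00:00Z")
	fmt.Println(next, err)
}
```

Because `nextRun` only ever looks forward from the supplied instant, calling it with the current time when a cluster returns to Ready yields a future pause time rather than one that has already passed, which is one way to avoid the immediate pause described under Proactive Scheduling. An invalid expression surfaces as an error that the caller can report instead of crashing.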

Robust Error Handling

  • Invalid Schedule Handling: Gracefully handles invalid cron expressions
  • Cluster State Management: Proper handling of Initializing, Error, Stopping, Paused, and Ready states
  • Race Condition Prevention: Prevents state flipping during cluster startup/recovery
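
One way to read the state handling above is as a small decision table: transitional states are left alone so the hibernation logic does not fight a cluster that is still starting or stopping. The sketch below is an assumption about how such a table could look, not the PR's actual logic. The `Action` type, its values, and the `actionFor` function are hypothetical names.

```go
package main

import "fmt"

// Action is a hypothetical description of what a hibernation reconcile
// would do next for a given cluster state.
type Action string

const (
	ActionWait            Action = "wait"             // do nothing this round
	ActionReschedule      Action = "reschedule"       // move to the next window
	ActionEvaluatePause   Action = "evaluate-pause"   // check the pause schedule
	ActionEvaluateUnpause Action = "evaluate-unpause" // check the unpause schedule
)

// actionFor maps a cluster state name (as written in this description)
// to the next hibernation action. Unknown states are treated as "wait".
func actionFor(state string) Action {
	switch state {
	case "Ready":
		return ActionEvaluatePause
	case "Paused":
		return ActionEvaluateUnpause
	case "Error":
		return ActionReschedule
	case "Initializing", "Stopping":
		// Transitional states: acting here could flip the cluster back and forth.
		return ActionWait
	default:
		return ActionWait
	}
}

func main() {
	for _, s := range []string{"Initializing", "Ready", "Paused", "Error", "Stopping"} {
		fmt.Printf("%s -> %s\n", s, actionFor(s))
	}
}
```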

🏗️ Architecture

New Controller: PerconaServerMySQLHibernationReconciler

  • Dedicated controller for hibernation logic
  • Registered in cmd/manager/main.go
  • RBAC permissions for PS objects and backup/restore resources

Enhanced CRD Fields

spec:
  hibernation:
    enabled: true
    schedule:
      pause: "0 18 * * 1-5"    # 6 PM Mon-Fri
      unpause: "0 8 * * 1-5"   # 8 AM Mon-Fri
  pause: false  # Manual override

Status Fields

status:
  hibernation:
    state: "Active"  # Active, Paused, Scheduled, Blocked, Disabled
    nextPauseTime: "2025-09-24T18:00:00Z"
    nextUnpauseTime: "2025-09-25T08:00:00Z"
    lastPauseTime: "2025-09-23T18:00:00Z"
    lastUnpauseTime: "2025-09-24T08:00:00Z"
    reason: "Cluster not ready during scheduled time"
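
For readers who think in Go types, the spec and status fields above could map to structs along the following lines. This is a sketch under assumptions: the type names, field names, and the use of `*time.Time` (an operator would more likely use a Kubernetes time type) are illustrative and not taken from the PR's API package. The `decodeStatus` helper is hypothetical and exists only to show the JSON mapping.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// HibernationSchedule holds the two cron expressions under spec.hibernation.schedule.
type HibernationSchedule struct {
	Pause   string `json:"pause,omitempty"`
	Unpause string `json:"unpause,omitempty"`
}

// HibernationSpec mirrors spec.hibernation.
type HibernationSpec struct {
	Enabled  bool                `json:"enabled"`
	Schedule HibernationSchedule `json:"schedule,omitempty"`
}

// HibernationStatus mirrors status.hibernation. Timestamps are pointers so
// that fields which have never been set stay absent instead of zero-valued.
type HibernationStatus struct {
	State           string     `json:"state,omitempty"`
	NextPauseTime   *time.Time `json:"nextPauseTime,omitempty"`
	NextUnpauseTime *time.Time `json:"nextUnpauseTime,omitempty"`
	LastPauseTime   *time.Time `json:"lastPauseTime,omitempty"`
	LastUnpauseTime *time.Time `json:"lastUnpauseTime,omitempty"`
	Reason          string     `json:"reason,omitempty"`
}

// decodeStatus parses a JSON-encoded status.hibernation object.
func decodeStatus(raw string) (HibernationStatus, error) {
	var st HibernationStatus
	err := json.Unmarshal([]byte(raw), &st)
	return st, err
}

func main() {
	st, err := decodeStatus(`{"state":"Active","nextPauseTime":"2025-09-24T18:00:00Z"}`)
	if err != nil {
		panic(err)
	}
	fmt.Println(st.State, st.NextPauseTime.Format(time.RFC3339))
}
```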

CHECKLIST

Jira

  • [ ] Is the Jira ticket created and referenced properly?
  • [ ] Does the Jira ticket have the proper statuses for documentation (Needs Doc) and QA (Needs QA)?
  • [ ] Does the Jira ticket link to the proper milestone (Fix Version field)?

Tests

  • [ ] Is an E2E test/test case added for the new feature/change?
  • [ ] Are unit tests added where appropriate?

Config/Logging/Testability

  • [ ] Are all needed new/changed options added to default YAML files?
  • [ ] Are all needed new/changed options added to the Helm Chart?
  • [ ] Did we add proper logging messages for operator actions?
  • [ ] Did we ensure compatibility with the previous version or cluster upgrade process?
  • [ ] Does the change support oldest and newest supported PS version?
  • [ ] Does the change support oldest and newest supported Kubernetes version?

hors • Sep 23 '25 12:09

Test Name Result Time
async-ignore-annotations-8-4 passed 00:06:21
async-global-metadata-8-4 passed 00:14:48
async-upgrade-8-0 passed 00:12:34
async-upgrade-8-4 passed 00:12:19
auto-config-8-4 passed 00:24:35
config-8-4 passed 00:16:25
config-router-8-0 passed 00:07:30
config-router-8-4 passed 00:07:19
demand-backup-minio-8-0 passed 00:20:09
demand-backup-minio-8-4 passed 00:19:52
demand-backup-cloud-8-4 passed 00:20:55
demand-backup-retry-8-4 passed 00:14:58
async-data-at-rest-encryption-8-0 passed 00:12:58
async-data-at-rest-encryption-8-4 passed 00:13:18
gr-global-metadata-8-4 failure 00:14:20
gr-data-at-rest-encryption-8-0 failure 00:17:38
gr-data-at-rest-encryption-8-4 failure 00:17:43
gr-demand-backup-minio-8-4 failure 00:13:28
gr-demand-backup-cloud-8-4 failure 00:13:03
gr-demand-backup-haproxy-8-4 passed 00:10:09
gr-finalizer-8-4 passed 00:06:56
gr-haproxy-8-0 passed 00:04:17
gr-haproxy-8-4 passed 00:04:09
gr-ignore-annotations-8-4 passed 00:04:56
gr-init-deploy-8-0 passed 00:09:06
gr-init-deploy-8-4 passed 00:09:33
gr-one-pod-8-4 failure 00:09:30
gr-recreate-8-4 failure 00:06:27
gr-scaling-8-4 passed 00:07:39
gr-scheduled-backup-8-4 passed 00:17:07
gr-security-context-8-4 passed 00:09:52
gr-self-healing-8-4 passed 00:21:56
gr-tls-cert-manager-8-4 passed 00:10:44
gr-users-8-4 passed 00:05:38
gr-upgrade-8-0 passed 00:08:23
gr-upgrade-8-4 passed 00:10:30
haproxy-8-0 passed 00:08:53
haproxy-8-4 passed 00:09:47
init-deploy-8-0 passed 00:05:48
init-deploy-8-4 passed 00:07:18
limits-8-4 passed 00:05:31
monitoring-8-4 passed 00:19:19
one-pod-8-0 passed 00:06:56
one-pod-8-4 passed 00:06:16
operator-self-healing-8-4 passed 00:12:13
pvc-resize-8-4 passed 00:09:28
recreate-8-4 passed 00:13:11
scaling-8-4 passed 00:10:25
scheduled-backup-8-0 passed 00:16:24
scheduled-backup-8-4 failure 00:21:49
service-per-pod-8-4 passed 00:07:53
sidecars-8-4 passed 00:06:09
smart-update-8-4 passed 00:08:57
storage-8-4 passed 00:03:53
telemetry-8-4 passed 00:06:19
tls-cert-manager-8-4 passed 00:10:21
users-8-0 passed 00:08:19
users-8-4 passed 00:07:38
version-service-8-4 passed 00:19:21

Summary Value
Tests Run 59/59
Job Duration 02:16:58
Total Test Time 11:09:41

commit: https://github.com/percona/percona-server-mysql-operator/pull/1092/commits/94907b7a0f4db5fe90fa1a52d3bed5c82a276c9e
image: perconalab/percona-server-mysql-operator:PR-1092-94907b7a

JNKPercona • Nov 04 '25 12:11