Automatic Recovery from Releases Stuck in PendingUpgrade

Open godhanipayal opened this issue 8 months ago • 0 comments

Abstract

This proposal introduces an enhancement to Helm upgrade behavior. When a release becomes stuck in the PENDING_UPGRADE state, Helm currently blocks further operations on that release and offers no built-in recovery path. This HIP proposes adding a --force-rollback-on-pending-upgrade flag (or similar) that automatically and safely rolls back to the last successful revision when a release is detected in PENDING_UPGRADE, unblocking operations without manual intervention.

Motivation

Currently, when a helm upgrade fails or is interrupted unexpectedly (e.g., crash, timeout), the Helm release may become stuck in a PENDING_UPGRADE state. Once a release is in PENDING_UPGRADE:

  • Any subsequent helm upgrade, helm rollback, or helm uninstall on the same release fails.
  • Users receive an error like:

    "Another operation (upgrade/rollback) is in progress for release"

  • Users must manually delete or modify Helm storage (Secrets) to recover, which is a risky and error-prone operation.
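The stuck state described above can be illustrated with a minimal Go sketch. It does not use Helm's code or SDK: the `store` type, the status constants, and the `beginUpgrade`/`finishUpgrade` helpers are hypothetical names introduced only for illustration, and the error text is modeled on the message quoted above. The sketch assumes the pending marker is persisted when an upgrade starts and is cleared only when the upgrade finishes.

```go
package main

import (
	"errors"
	"fmt"
)

// Status is a hypothetical stand-in for a release status, using the
// PENDING_UPGRADE spelling from this proposal.
type Status string

const (
	StatusDeployed       Status = "DEPLOYED"
	StatusFailed         Status = "FAILED"
	StatusPendingUpgrade Status = "PENDING_UPGRADE"
)

// ErrPending is illustrative; its wording follows the error quoted above.
var ErrPending = errors.New("another operation (upgrade/rollback) is in progress")

// store is a hypothetical in-memory substitute for persisted release records.
type store map[string]Status

// beginUpgrade refuses to start while a previous operation is still marked
// pending; otherwise it persists the pending marker.
func beginUpgrade(s store, name string) error {
	if s[name] == StatusPendingUpgrade {
		return ErrPending
	}
	s[name] = StatusPendingUpgrade
	return nil
}

// finishUpgrade records the final outcome. If the client crashes before
// this runs, the pending marker stays in the store.
func finishUpgrade(s store, name string, ok bool) {
	if ok {
		s[name] = StatusDeployed
		return
	}
	s[name] = StatusFailed
}

func main() {
	s := store{"web": StatusDeployed}

	// First upgrade starts, then the client is interrupted:
	// finishUpgrade is never called.
	fmt.Println("first attempt:", beginUpgrade(s, "web"))

	// Every later attempt is rejected because the marker is still set.
	fmt.Println("second attempt:", beginUpgrade(s, "web"))
}
```

In this model nothing except a completed operation or a manual edit of the stored record clears the marker, which is why an unattended pipeline cannot recover on its own.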

This behavior breaks CI/CD pipelines, which are designed for automatic, unattended deployments.

When manual intervention is required:

  • Pipelines fail unexpectedly.
  • Automated rollout and recovery processes are halted.
  • Human operators must step in, leading to delays, production risks, and increased operational burden.

A native, safe Helm mechanism to auto-recover releases stuck in PENDING_UPGRADE would significantly improve reliability, automation, and user experience, especially in large-scale environments. As the major cloud providers roll out Kubernetes services across many regions, solving this problem becomes increasingly important for a smooth deployment experience.

Proposal

Introduce a new flag for helm upgrade:

--force-rollback-on-pending-upgrade

Behavior when this flag is used:

  • Before starting an upgrade, Helm checks the current release's status.
  • If the release status is PENDING_UPGRADE, Helm:
      • performs an automatic rollback to the last successful revision;
      • logs a clear message: "Release was in PENDING_UPGRADE state. Rolling back to revision <N> before proceeding.";
      • after the rollback, proceeds with the requested upgrade.
  • If no previous successful revision is available, Helm should fail gracefully with a clear error message like: "No successful revision found to rollback for release <release-name>. Manual intervention required."
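The decision flow above could be sketched roughly as follows. This is an illustration in plain Go, not Helm's implementation: `Revision`, `rollbackTarget`, and `planUpgrade` are hypothetical names, the history is assumed to be ordered from oldest to newest, and counting both DEPLOYED and SUPERSEDED revisions as "successful" is an assumption of this sketch rather than something the proposal specifies.

```go
package main

import (
	"errors"
	"fmt"
)

// Revision is a hypothetical, simplified view of one entry in a release's
// history. Status values use the uppercase spelling from this proposal.
type Revision struct {
	Number int
	Status string // e.g. "DEPLOYED", "SUPERSEDED", "FAILED", "PENDING_UPGRADE"
}

var (
	// Illustrative errors; the wording is not taken from Helm.
	ErrPending              = errors.New("another operation (upgrade/rollback) is in progress")
	ErrNoSuccessfulRevision = errors.New("no successful revision found; manual intervention required")
)

// rollbackTarget scans the history from newest to oldest and returns the
// number of the first revision that counts as successful (assumption:
// DEPLOYED or SUPERSEDED).
func rollbackTarget(history []Revision) (int, error) {
	for i := len(history) - 1; i >= 0; i-- {
		switch history[i].Status {
		case "DEPLOYED", "SUPERSEDED":
			return history[i].Number, nil
		}
	}
	return 0, ErrNoSuccessfulRevision
}

// planUpgrade returns the ordered steps an upgrade would take.
// forceRollback models the proposed --force-rollback-on-pending-upgrade flag.
func planUpgrade(history []Revision, forceRollback bool) ([]string, error) {
	if len(history) == 0 {
		return nil, errors.New("release has no history")
	}
	latest := history[len(history)-1]
	if latest.Status != "PENDING_UPGRADE" {
		return []string{"upgrade"}, nil // nothing to recover
	}
	if !forceRollback {
		return nil, ErrPending // current behavior: the release stays blocked
	}
	target, err := rollbackTarget(history)
	if err != nil {
		return nil, err // fail gracefully, nothing safe to roll back to
	}
	return []string{fmt.Sprintf("rollback to revision %d", target), "upgrade"}, nil
}

func main() {
	history := []Revision{{1, "SUPERSEDED"}, {2, "DEPLOYED"}, {3, "PENDING_UPGRADE"}}
	steps, err := planUpgrade(history, true)
	fmt.Println(steps, err)

	// A release whose only revision is pending has no rollback target.
	_, err = planUpgrade([]Revision{{1, "PENDING_UPGRADE"}}, true)
	fmt.Println(err)
}
```

Keeping the flag opt-in, as in `planUpgrade`, leaves the existing blocking behavior unchanged for users who do not pass it.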

godhanipayal, Apr 27 '25 20:04