chore: improve addon controller job management to prevent blocking updates

Open ldming opened this issue 2 months ago • 1 comments

Problem Description

The addon controller had issues with stuck jobs that could prevent addon updates from being applied:

  1. No timeout mechanism: Jobs without activeDeadlineSeconds could run indefinitely if their pods failed to start (e.g., due to image pull errors)
  2. Missing generation tracking: When an addon was updated, jobs from the previous generation might still be running, but the controller did not actively clean them up
  3. Blocking behavior: New addon updates would wait indefinitely for stuck jobs to complete

Root Cause

When addon updates occur (e.g., configuration changes), the addon's generation increases. However, if there are existing jobs that are stuck (due to ImagePullBackOff or other pod startup issues), the controller would wait for these jobs indefinitely, preventing new updates from being applied.

Solution

1. Job Timeout Configuration

  • Added activeDeadlineSeconds to all helm jobs (default: 5 minutes)
  • Configurable via environment variable KUBEBLOCKS_ADDON_JOB_TIMEOUT
  • Prevents jobs from running indefinitely

2. Generation Tracking

  • Added generation annotation addon.kubeblocks.io/generation to jobs
  • Tracks which addon generation each job belongs to
  • Enables automatic cleanup of outdated jobs
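
As an illustration, stamping the generation onto a job's annotations might look like the sketch below. The annotation key is the one named above. The helper `withGenerationAnnotation` and the use of a plain `map[string]string` in place of the job's object metadata are assumptions made so the example stays self-contained.

```go
package main

import "strconv"

// Annotation key as given in the PR description.
const addonGenerationKey = "addon.kubeblocks.io/generation"

// withGenerationAnnotation is a hypothetical helper. It returns a copy of the
// given annotations with the addon generation recorded as a decimal string,
// leaving the input map unmodified.
func withGenerationAnnotation(annotations map[string]string, generation int64) map[string]string {
	out := make(map[string]string, len(annotations)+1)
	for k, v := range annotations {
		out[k] = v
	}
	out[addonGenerationKey] = strconv.FormatInt(generation, 10)
	return out
}
```

Annotation values are strings, so the generation is stored in decimal form and has to be parsed again wherever it is compared.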

3. Outdated Job Detection and Cleanup

  • Added isJobOutdated() function to check whether a job belongs to an older addon generation
  • Outdated jobs are deleted automatically when the addon is updated
  • Allows new jobs to be created for the current generation
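
The outdated check can be sketched as follows. The signature of isJobOutdated() is not shown in this description, so the function below is named `isJobOutdatedSketch` and takes the job's annotations and the addon's current generation directly. Treating a missing or unparsable annotation as "not outdated" is also an assumption of this sketch; such jobs would then be bounded only by the job timeout.

```go
package main

import "strconv"

// Annotation key as given in the PR description.
const addonGenerationKey = "addon.kubeblocks.io/generation"

// isJobOutdatedSketch is a hypothetical stand-in for isJobOutdated(). It
// reports whether the generation recorded on the job is older than the
// addon's current generation.
func isJobOutdatedSketch(jobAnnotations map[string]string, addonGeneration int64) bool {
	raw, ok := jobAnnotations[addonGenerationKey]
	if !ok {
		// Assumption: jobs without the annotation are left alone.
		return false
	}
	jobGeneration, err := strconv.ParseInt(raw, 10, 64)
	if err != nil {
		// Assumption: an unparsable value is not treated as outdated.
		return false
	}
	return jobGeneration < addonGeneration
}
```

A controller using a check like this would delete the job when it returns true and then create a fresh job annotated with the current generation.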

Changes Made

  • controllers/extensions/addon_controller_stages.go:

    • Add job timeout configuration (5 minutes default)
    • Add generation annotation to all helm jobs
    • Implement outdated job detection and cleanup logic
    • Add isJobOutdated() helper function
  • controllers/extensions/const.go:

    • Add AddonGeneration constant for annotation key
  • controllers/extensions/addon_controller_test.go:

    • Add comprehensive test case for job cleanup scenarios
    • Verify generation tracking and timeout configuration
  • Configuration files:

    • Add support for KUBEBLOCKS_ADDON_JOB_TIMEOUT environment variable

Testing

  • Added test case "should cleanup outdated jobs when addon is updated"
  • Verifies that outdated jobs are properly deleted when addon generation changes
  • Ensures new jobs are created with correct generation annotation and timeout

Benefits

  1. Prevents blocking: New addon updates can proceed without waiting for stuck jobs
  2. Resource cleanup: Outdated jobs are automatically cleaned up
  3. Configurable timeout: Administrators can adjust job timeout based on their environment
  4. Better reliability: Reduces the chance of addon updates getting stuck indefinitely

This change ensures that addon updates are more reliable and responsive, especially in environments where image pull issues or other pod startup problems might occur.

ldming, Oct 14 '25 01:10

Auto Cherry-pick Instructions

Usage:
  - /nopick: Do not auto cherry-pick when the PR is merged.
  - /pick: release-x.x [release-x.x]: Auto cherry-pick to the specified branch(es) when the PR is merged.

Example:
  - /nopick
  - /pick release-1.0

apecloud-bot, Oct 14 '25 01:10

This PR is stale because it has been open 45 days with no activity. Remove stale label or comment

github-actions[bot], Dec 01 '25 00:12