chore: improve addon controller job management to prevent blocking updates
Problem Description
The addon controller had issues with stuck jobs that could prevent addon updates from being applied:
- No timeout mechanism: Jobs without activeDeadlineSeconds could run indefinitely if pods failed to start (e.g., due to image pull errors)
- Missing generation tracking: When an addon was updated, old jobs might still be running, but the controller didn't actively clean them up
- Blocking behavior: New addon updates would wait indefinitely for stuck jobs to complete
Root Cause
When addon updates occur (e.g., configuration changes), the addon's generation increases. However, if there are existing jobs that are stuck (due to ImagePullBackOff or other pod startup issues), the controller would wait for these jobs indefinitely, preventing new updates from being applied.
Solution
1. Job Timeout Configuration
- Added activeDeadlineSeconds to all helm jobs (default: 5 minutes)
- Configurable via environment variable KUBEBLOCKS_ADDON_JOB_TIMEOUT
- Prevents jobs from running indefinitely
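For reviewers, here is a minimal sketch of how the timeout could be resolved, not the code in this PR. It assumes the environment variable holds a Go duration string such as "10m" (the actual value format is not stated above), and the helper name activeDeadlineSeconds and the constant names are hypothetical. The result is a pointer because the Kubernetes Job spec field activeDeadlineSeconds is an optional int64.

```go
package main

import (
	"fmt"
	"os"
	"time"
)

// jobTimeoutEnv is the variable named in this PR. Treating its value as a
// Go duration string (e.g. "10m") is an assumption made for this sketch.
const jobTimeoutEnv = "KUBEBLOCKS_ADDON_JOB_TIMEOUT"

// defaultJobTimeout mirrors the 5 minute default described above.
const defaultJobTimeout = 5 * time.Minute

// activeDeadlineSeconds (hypothetical helper) converts the raw setting into
// the value for a Job's activeDeadlineSeconds field. Empty, unparsable, or
// sub-second values fall back to the default so the deadline stays positive.
func activeDeadlineSeconds(raw string) *int64 {
	timeout := defaultJobTimeout
	if d, err := time.ParseDuration(raw); err == nil && d >= time.Second {
		timeout = d
	}
	secs := int64(timeout / time.Second)
	return &secs
}

func main() {
	deadline := activeDeadlineSeconds(os.Getenv(jobTimeoutEnv))
	fmt.Printf("activeDeadlineSeconds=%d\n", *deadline)
}
```

Falling back to the default on a bad value, rather than failing, keeps a misconfigured environment from producing jobs with no deadline at all.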
2. Generation Tracking
- Added generation annotation addon.kubeblocks.io/generation to jobs
- Tracks which addon generation each job belongs to
- Enables automatic cleanup of outdated jobs
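The stamping step might look roughly like the sketch below. It works on a plain annotation map rather than a real Job object so that it stays self-contained. The helper name withGeneration and the constant name are hypothetical, and storing the generation as a base-10 string is an assumption (annotation values must be strings).

```go
package main

import (
	"fmt"
	"strconv"
)

// addonGenerationAnnotation is the annotation key named in this PR.
const addonGenerationAnnotation = "addon.kubeblocks.io/generation"

// withGeneration (hypothetical helper) returns a copy of the given
// annotations with the addon's generation recorded under the key above.
// The input map is not modified.
func withGeneration(annotations map[string]string, generation int64) map[string]string {
	out := make(map[string]string, len(annotations)+1)
	for k, v := range annotations {
		out[k] = v
	}
	out[addonGenerationAnnotation] = strconv.FormatInt(generation, 10)
	return out
}

func main() {
	// The "app" annotation is illustrative only.
	jobAnnotations := withGeneration(map[string]string{"app": "demo"}, 3)
	fmt.Println(jobAnnotations[addonGenerationAnnotation])
}
```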
3. Outdated Job Detection and Cleanup
- Added isJobOutdated() function to check if job belongs to older generation
- Automatically delete outdated jobs when addon is updated
- Allows new jobs to be created for the current generation
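The outdated check can be sketched as follows. This is an illustration under stated assumptions, not the isJobOutdated() implementation from this PR. jobIsOutdated and outdatedJobs are hypothetical names, jobs are modeled as a map from job name to annotations, and jobs with a missing or unparsable annotation are treated as not outdated, which is an assumption about the intended behavior.

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
)

// generationKey is the annotation key named in this PR.
const generationKey = "addon.kubeblocks.io/generation"

// jobIsOutdated (hypothetical stand-in for isJobOutdated) reports whether a
// job was created for an older addon generation than the current one.
// Assumption: jobs without a usable annotation are left alone.
func jobIsOutdated(jobAnnotations map[string]string, addonGeneration int64) bool {
	raw, ok := jobAnnotations[generationKey]
	if !ok {
		return false
	}
	jobGeneration, err := strconv.ParseInt(raw, 10, 64)
	if err != nil {
		return false
	}
	return jobGeneration < addonGeneration
}

// outdatedJobs (hypothetical helper) returns the sorted names of jobs that
// the controller would delete before creating a job for the current generation.
func outdatedJobs(jobs map[string]map[string]string, addonGeneration int64) []string {
	var names []string
	for name, annotations := range jobs {
		if jobIsOutdated(annotations, addonGeneration) {
			names = append(names, name)
		}
	}
	sort.Strings(names)
	return names
}

func main() {
	// Job names are illustrative only.
	jobs := map[string]map[string]string{
		"install-old": {generationKey: "1"},
		"install-new": {generationKey: "2"},
		"legacy":      {},
	}
	fmt.Println(outdatedJobs(jobs, 2))
}
```

Comparing with a strict less-than means a job stamped with the current generation is never deleted, so the controller does not remove the job it has just created.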
Changes Made
- controllers/extensions/addon_controller_stages.go:
  - Add job timeout configuration (5 minutes default)
  - Add generation annotation to all helm jobs
  - Implement outdated job detection and cleanup logic
  - Add isJobOutdated() helper function
- controllers/extensions/const.go:
  - Add AddonGeneration constant for annotation key
- controllers/extensions/addon_controller_test.go:
  - Add comprehensive test case for job cleanup scenarios
  - Verify generation tracking and timeout configuration
- Configuration files:
  - Add support for KUBEBLOCKS_ADDON_JOB_TIMEOUT environment variable
Testing
- Added test case "should cleanup outdated jobs when addon is updated"
- Verifies that outdated jobs are properly deleted when addon generation changes
- Ensures new jobs are created with correct generation annotation and timeout
Benefits
- Prevents blocking: New addon updates can proceed without waiting for stuck jobs
- Resource cleanup: Outdated jobs are automatically cleaned up
- Configurable timeout: Administrators can adjust job timeout based on their environment
- Better reliability: Reduces the chance of addon updates getting stuck indefinitely
This change ensures that addon updates are more reliable and responsive, especially in environments where image pull issues or other pod startup problems might occur.
Auto Cherry-pick Instructions
Usage:
- /nopick: Do not auto cherry-pick when the PR is merged.
- /pick: release-x.x [release-x.x]: Auto cherry-pick to the specified branch(es) when the PR is merged.
Example:
- /nopick
- /pick release-1.0
This PR is stale because it has been open 45 days with no activity. Remove stale label or comment