operator-controller

:bug: add PDB to make sure at least 1 pod is always available during upgrade

Open jianzhangbjz opened this issue 6 days ago • 4 comments

Description

This PR addresses OCPBUGS-62517. Currently, operator-controller lacks a PodDisruptionBudget configuration. During node drain operations or cluster upgrades, all controller pods can be evicted simultaneously, causing the operator to report Available=False, which violates the OpenShift API contract:

"A component must not report Available=False during the course of a normal upgrade." — OpenShift API Contract

Add PodDisruptionBudget resources with minAvailable: 1 for both controllers to ensure at least one pod remains available during:

  • Rolling updates
  • Node drain operations
  • Cluster upgrades
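
For illustration, a PodDisruptionBudget of the kind described above might look like the following. This is a minimal sketch, not the manifest from this PR: the resource name, namespace, and label selector are hypothetical placeholders and would need to match the labels on the actual controller Deployment. A second PDB for the other controller would follow the same shape with its own selector.

```yaml
# Sketch only: name, namespace, and labels below are assumed for illustration.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: example-controller-pdb        # hypothetical name
  namespace: example-namespace        # hypothetical namespace
spec:
  # Refuse voluntary evictions that would leave fewer than one ready pod.
  minAvailable: 1
  selector:
    matchLabels:
      app: example-controller         # hypothetical label; must match the pod template labels
```

Two caveats apply to a budget like this. A PDB constrains voluntary evictions made through the Eviction API, such as those issued during a node drain. Pod availability during a Deployment's own rolling update is governed by the Deployment's update strategy rather than by the PDB. Also, with `minAvailable: 1` and a single replica, a drain of the node hosting that pod would be blocked until another replica is ready elsewhere, so the replica count is worth checking alongside the budget.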

Reviewer Checklist

  • [ ] API Go Documentation
  • [ ] Tests: Unit Tests (and E2E Tests, if appropriate)
  • [x] Comprehensive Commit Messages
  • [ ] Links to related GitHub Issue(s)

Assisted-by: Claude code

jianzhangbjz avatar Nov 26 '25 02:11 jianzhangbjz

Deploy Preview for olmv1 ready!

| Name | Link |
|---|---|
| Latest commit | c217a17418679753874760abe4379e7913185346 |
| Latest deploy log | https://app.netlify.com/projects/olmv1/deploys/6927a06932c198000845676b |
| Deploy Preview | https://deploy-preview-2362--olmv1.netlify.app |

netlify[bot] avatar Nov 26 '25 02:11 netlify[bot]

Codecov Report

:white_check_mark: All modified and coverable lines are covered by tests. :white_check_mark: Project coverage is 74.39%. Comparing base (0fecf3f) to head (c217a17). :warning: Report is 2 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2362      +/-   ##
==========================================
+ Coverage   70.50%   74.39%   +3.89%     
==========================================
  Files          93       93              
  Lines        7300     7300              
==========================================
+ Hits         5147     5431     +284     
+ Misses       1719     1435     -284     
  Partials      434      434              
| Flag | Coverage Δ |
|---|---|
| e2e | 44.51% <ø> (ø) |
| experimental-e2e | 48.72% <ø> (?) |
| unit | 58.47% <ø> (ø) |

codecov[bot] avatar Nov 26 '25 02:11 codecov[bot]

/approve It's needed, just may need a few tweaks.

tmshort avatar Nov 26 '25 14:11 tmshort

/retest e2e / experimental-e2e (pull_request)

jianzhangbjz avatar Nov 27 '25 00:11 jianzhangbjz

@jianzhangbjz: No presubmit jobs available for operator-framework/operator-controller@main

In response to this:

/retest e2e / experimental-e2e (pull_request)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

openshift-ci[bot] avatar Nov 27 '25 00:11 openshift-ci[bot]

/retest experimental-e2e

jianzhangbjz avatar Nov 27 '25 00:11 jianzhangbjz

@jianzhangbjz: No presubmit jobs available for operator-framework/operator-controller@main

In response to this:

/retest experimental-e2e

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

openshift-ci[bot] avatar Nov 27 '25 00:11 openshift-ci[bot]

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: tmshort

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment.
Approvers can cancel approval by writing /approve cancel in a comment.

openshift-ci[bot] avatar Nov 27 '25 00:11 openshift-ci[bot]

Updated hack/test/install-prometheus.sh to add a timeout, addressing the error below:

Waiting for Prometheus Operator pod to become ready...
error: no matching resources found
Cleaning up /tmp/tmp.Gv71SsiUrG
make: *** [Makefile:295: prometheus] Error 1
Error: Process completed with exit code 2.

jianzhangbjz avatar Nov 27 '25 00:11 jianzhangbjz

/lgtm

dtfranz avatar Nov 27 '25 01:11 dtfranz