OSD-25179 - Avoid deadlock when lockfile is presetn
What type of PR is this?
bug
What this PR does / why we need it?
MUO can sometimes get stuck in a deadlock if a pre-existing lockfile has ownerReference mismatch. This PR pre-emtively removes the lock file to avoid such scenario.
Which Jira/Github issue(s) this PR fixes?
Special notes for your reviewer:
Pre-checks (if applicable):
- [x] Tested latest changes against a cluster
Codecov Report
Attention: Patch coverage is 0% with 44 lines in your changes missing coverage. Please review.
Project coverage is 53.86%. Comparing base (
b887c87) to head (6d67af1). Report is 7 commits behind head on master.
| Files with missing lines | Patch % | Lines |
|---|---|---|
| main.go | 0.00% | 44 Missing :warning: |
Additional details and impacted files
@@ Coverage Diff @@
## master #476 +/- ##
==========================================
- Coverage 54.22% 53.86% -0.37%
==========================================
Files 122 122
Lines 5953 5993 +40
==========================================
Hits 3228 3228
- Misses 2523 2563 +40
Partials 202 202
| Flag | Coverage Δ | |
|---|---|---|
? |
Flags with carried forward coverage won't be shown. Click here to find out more.
| Files with missing lines | Coverage Δ | |
|---|---|---|
| main.go | 0.00% <0.00%> (ø) |
Tests conducted
Scenario #1: No pre-existing lock file
ts=2024-10-04T04:36:51.642269253Z level=info logger=cmd msg="Version of operator-sdk: v1.21.0"
ts=2024-10-04T04:36:51.643078303Z level=debug logger=k8sutil msg="Found namespace" Namespace=test-managed-upgrade-operator
ts=2024-10-04T04:36:51.654621698Z level=error logger=setup msg="Could not find a lock with name managed-upgrade-operator-lock in namespace test-managed-upgrade-operator" error="configmaps \"managed-upgrade-operator-lock\" not found" stacktrace="main.cleanupLockForLeader\n\t/workdir/main.go:130\nmain.main\n\t/workdir/main.go:205\nruntime.main\n\t/usr/lib/golang/src/runtime/proc.go:271"
ts=2024-10-04T04:36:51.654672297Z level=info logger=leader msg="Trying to become the leader."
ts=2024-10-04T04:36:51.654979545Z level=debug logger=leader msg="Found podname" Pod.Name=managed-upgrade-operator-845b4486dc-crgs9
ts=2024-10-04T04:36:51.740944065Z level=debug logger=leader msg="Found Pod" Pod.Namespace=test-managed-upgrade-operator Pod.Name=managed-upgrade-operator-845b4486dc-crgs9
ts=2024-10-04T04:36:51.744960788Z level=info logger=leader msg="No pre-existing lock was found."
ts=2024-10-04T04:36:51.750775161Z level=info logger=leader msg="Became the leader."
Scenario #2: Lock file exists already
ts=2024-10-04T04:38:14.342253208Z level=info logger=cmd msg="Version of operator-sdk: v1.21.0"
ts=2024-10-04T04:38:14.342967346Z level=debug logger=k8sutil msg="Found namespace" Namespace=test-managed-upgrade-operator
ts=2024-10-04T04:38:14.355400982Z level=info logger=setup msg="Lock already exists with name managed-upgrade-operator-lock in namespace test-managed-upgrade-operator, attempting to delete"
ts=2024-10-04T04:38:14.367709954Z level=info logger=setup msg="Lock deleted with name managed-upgrade-operator-lock in namespace test-managed-upgrade-operator"
ts=2024-10-04T04:38:14.367722464Z level=info logger=leader msg="Trying to become the leader."
ts=2024-10-04T04:38:14.368020958Z level=debug logger=leader msg="Found podname" Pod.Name=managed-upgrade-operator-845b4486dc-p92gg
ts=2024-10-04T04:38:14.372204868Z level=debug logger=leader msg="Found Pod" Pod.Namespace=test-managed-upgrade-operator Pod.Name=managed-upgrade-operator-845b4486dc-p92gg
ts=2024-10-04T04:38:14.374331127Z level=info logger=leader msg="No pre-existing lock was found."
ts=2024-10-04T04:38:14.440535791Z level=info logger=leader msg="Became the leader."
Another scenario where lock initially had different owner. But when the owner got deleted, the existing Pod took leadership.
ts=2024-10-11T08:30:35.112964071Z level=info logger=cmd msg="Operator Version: 1.0.0"
ts=2024-10-11T08:30:35.112996891Z level=info logger=cmd msg="Go Version: go1.22.4 (Red Hat 1.22.4-1.module+el8.10.0+21981+729b5bf8) X:boringcrypto,strictfipsruntime"
ts=2024-10-11T08:30:35.113000995Z level=info logger=cmd msg="Go OS/Arch: linux/amd64"
ts=2024-10-11T08:30:35.113004066Z level=info logger=cmd msg="Version of operator-sdk: v1.21.0"
ts=2024-10-11T08:30:35.113945252Z level=debug logger=k8sutil msg="Found namespace" Namespace=test-managed-upgrade-operator
ts=2024-10-11T08:30:35.310814746Z level=info logger=setup msg="Lock already exists with name managed-upgrade-operator-lock in namespace test-managed-upgrade-operator"
Looking in test-managed-upgrade-operator
ts=2024-10-11T08:30:35.313873601Z level=info logger=setup msg="Lock already has existing owner managed-upgrade-operator-5c477fcfd-fx22m, skipping deletion"
ts=2024-10-11T08:30:35.313887197Z level=info logger=leader msg="Trying to become the leader."
ts=2024-10-11T08:30:35.314202447Z level=debug logger=leader msg="Found podname" Pod.Name=managed-upgrade-operator-5c477fcfd-52nr2
ts=2024-10-11T08:30:35.317899721Z level=debug logger=leader msg="Found Pod" Pod.Namespace=test-managed-upgrade-operator Pod.Name=managed-upgrade-operator-5c477fcfd-52n
r2
ts=2024-10-11T08:30:35.319785785Z level=info logger=leader msg="Found existing lock" LockOwner=managed-upgrade-operator-5c477fcfd-fx22m
ts=2024-10-11T08:30:35.334807308Z level=info logger=leader msg="Not the leader. Waiting."
ts=2024-10-11T08:30:36.393479566Z level=info logger=leader msg="Not the leader. Waiting."
ts=2024-10-11T08:30:38.654971987Z level=info logger=leader msg="Not the leader. Waiting."
ts=2024-10-11T08:30:43.061261542Z level=info logger=leader msg="Not the leader. Waiting."
ts=2024-10-11T08:30:52.420026904Z level=info logger=leader msg="Not the leader. Waiting."
ts=2024-10-11T08:31:10.910719652Z level=info logger=leader msg="Not the leader. Waiting."
ts=2024-10-11T08:31:27.496689267Z level=info logger=leader msg="Not the leader. Waiting."
ts=2024-10-11T08:31:45.082753841Z level=info logger=leader msg="Not the leader. Waiting."
ts=2024-10-11T08:32:03.419040735Z level=info logger=leader msg="Not the leader. Waiting."
ts=2024-10-11T08:32:21.773196599Z level=info logger=leader msg="Became the leader."
ts=2024-10-11T08:32:21.773605014Z level=debug logger=k8sutil msg="Found namespace" Namespace=test-managed-upgrade-operator
While the other pod stayed waiting to become leader
ts=2024-10-11T08:30:37.694627426Z level=info logger=cmd msg="Version of operator-sdk: v1.21.0"
ts=2024-10-11T08:30:37.695468461Z level=debug logger=k8sutil msg="Found namespace" Namespace=test-managed-upgrade-operator
ts=2024-10-11T08:30:37.707641854Z level=info logger=setup msg="Lock already exists with name managed-upgrade-operator-lock in namespace test-managed-upgrade-operator"
Looking in test-managed-upgrade-operator
ts=2024-10-11T08:30:37.79454308Z level=info logger=setup msg="Lock already has existing owner managed-upgrade-operator-5c477fcfd-fx22m, skipping deletion"
ts=2024-10-11T08:30:37.794646038Z level=info logger=leader msg="Trying to become the leader."
ts=2024-10-11T08:30:37.893467151Z level=debug logger=leader msg="Found podname" Pod.Name=managed-upgrade-operator-5c477fcfd-gt5v5
ts=2024-10-11T08:30:37.996775964Z level=debug logger=leader msg="Found Pod" Pod.Namespace=test-managed-upgrade-operator Pod.Name=managed-upgrade-operator-5c477fcfd-gt5v5
ts=2024-10-11T08:30:37.998917257Z level=info logger=leader msg="Found existing lock" LockOwner=managed-upgrade-operator-5c477fcfd-fx22m
ts=2024-10-11T08:30:38.013081751Z level=info logger=leader msg="Not the leader. Waiting."
ts=2024-10-11T08:30:39.093839453Z level=info logger=leader msg="Not the leader. Waiting."
ts=2024-10-11T08:30:41.171560483Z level=info logger=leader msg="Not the leader. Waiting."
ts=2024-10-11T08:30:45.826541009Z level=info logger=leader msg="Not the leader. Waiting."
ts=2024-10-11T08:30:55.342271994Z level=info logger=leader msg="Not the leader. Waiting."
t
/lgtm
@Tafhim: all tests passed!
Full PR test history. Your PR dashboard.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
/lgtm
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: devppratik, rendhalver, Tafhim Once this PR has been reviewed and has the lgtm label, please assign ravitri for approval. For more information see the Kubernetes Code Review Process.
The full list of commands accepted by this bot can be found here.
Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
/hold
Thanks for reviews so far! Adding temporary hold on this PR till it's further reviewed. I did have some comments
/close
Based on discussion with @Tafhim, we will go ahead with PR https://github.com/openshift/managed-upgrade-operator/pull/483 changes for now and observe if it helps ahead. We believe that the operator-lib bump would help address the issue we were facing (atleast partially). Incase it still continues to impact us, we will open a new PR based on this PR's work (might need to align the solution a bit based on what exact problem we are trying to address).
However, this PR has helped us revise some of the implementation details for the lockfile which was great and will serve good reference for future too. Thanks for your help and work on this Tafhim :+1:
Closing this PR for now.
PR needs rebase.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
@ravitri: Closed this PR.
In response to this:
/close
Based on discussion with @Tafhim, we will go ahead with PR https://github.com/openshift/managed-upgrade-operator/pull/483 changes for now and observe if it helps ahead. We believe that the operator-lib bump would help address the issue we were facing (atleast partially). Incase it still continues to impact us, we will open a new PR based on this PR's work (might need to align the solution a bit based on what exact problem we are trying to address).
However, this PR has helped us revise some of the implementation details for the lockfile which was great and will serve good reference for future too. Thanks for your help and work on this Tafhim :+1:
Closing this PR for now.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.