managed-upgrade-operator icon indicating copy to clipboard operation
managed-upgrade-operator copied to clipboard

OSD-25179 - Avoid deadlock when lockfile is presetn

Open Tafhim opened this issue 1 year ago • 5 comments

What type of PR is this?

bug

What this PR does / why we need it?

MUO can sometimes get stuck in a deadlock if a pre-existing lockfile has ownerReference mismatch. This PR pre-emtively removes the lock file to avoid such scenario.

Which Jira/Github issue(s) this PR fixes?

OSD-25179

Special notes for your reviewer:

Pre-checks (if applicable):

  • [x] Tested latest changes against a cluster

Tafhim avatar Oct 04 '24 04:10 Tafhim

Codecov Report

Attention: Patch coverage is 0% with 44 lines in your changes missing coverage. Please review.

Project coverage is 53.86%. Comparing base (b887c87) to head (6d67af1). Report is 7 commits behind head on master.

Files with missing lines Patch % Lines
main.go 0.00% 44 Missing :warning:
Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #476      +/-   ##
==========================================
- Coverage   54.22%   53.86%   -0.37%     
==========================================
  Files         122      122              
  Lines        5953     5993      +40     
==========================================
  Hits         3228     3228              
- Misses       2523     2563      +40     
  Partials      202      202              
Flag Coverage Δ
?

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
main.go 0.00% <0.00%> (ø)

codecov-commenter avatar Oct 04 '24 04:10 codecov-commenter

Tests conducted

Scenario #1: No pre-existing lock file

ts=2024-10-04T04:36:51.642269253Z level=info logger=cmd msg="Version of operator-sdk: v1.21.0"
ts=2024-10-04T04:36:51.643078303Z level=debug logger=k8sutil msg="Found namespace" Namespace=test-managed-upgrade-operator
ts=2024-10-04T04:36:51.654621698Z level=error logger=setup msg="Could not find a lock with name managed-upgrade-operator-lock in namespace test-managed-upgrade-operator" error="configmaps \"managed-upgrade-operator-lock\" not found" stacktrace="main.cleanupLockForLeader\n\t/workdir/main.go:130\nmain.main\n\t/workdir/main.go:205\nruntime.main\n\t/usr/lib/golang/src/runtime/proc.go:271"
ts=2024-10-04T04:36:51.654672297Z level=info logger=leader msg="Trying to become the leader."
ts=2024-10-04T04:36:51.654979545Z level=debug logger=leader msg="Found podname" Pod.Name=managed-upgrade-operator-845b4486dc-crgs9
ts=2024-10-04T04:36:51.740944065Z level=debug logger=leader msg="Found Pod" Pod.Namespace=test-managed-upgrade-operator Pod.Name=managed-upgrade-operator-845b4486dc-crgs9
ts=2024-10-04T04:36:51.744960788Z level=info logger=leader msg="No pre-existing lock was found."
ts=2024-10-04T04:36:51.750775161Z level=info logger=leader msg="Became the leader."

Scenario #2: Lock file exists already

ts=2024-10-04T04:38:14.342253208Z level=info logger=cmd msg="Version of operator-sdk: v1.21.0"
ts=2024-10-04T04:38:14.342967346Z level=debug logger=k8sutil msg="Found namespace" Namespace=test-managed-upgrade-operator
ts=2024-10-04T04:38:14.355400982Z level=info logger=setup msg="Lock already exists with name managed-upgrade-operator-lock in namespace test-managed-upgrade-operator, attempting to delete"
ts=2024-10-04T04:38:14.367709954Z level=info logger=setup msg="Lock deleted with name managed-upgrade-operator-lock in namespace test-managed-upgrade-operator"
ts=2024-10-04T04:38:14.367722464Z level=info logger=leader msg="Trying to become the leader."
ts=2024-10-04T04:38:14.368020958Z level=debug logger=leader msg="Found podname" Pod.Name=managed-upgrade-operator-845b4486dc-p92gg
ts=2024-10-04T04:38:14.372204868Z level=debug logger=leader msg="Found Pod" Pod.Namespace=test-managed-upgrade-operator Pod.Name=managed-upgrade-operator-845b4486dc-p92gg
ts=2024-10-04T04:38:14.374331127Z level=info logger=leader msg="No pre-existing lock was found."
ts=2024-10-04T04:38:14.440535791Z level=info logger=leader msg="Became the leader."

Tafhim avatar Oct 04 '24 04:10 Tafhim

Another scenario where lock initially had different owner. But when the owner got deleted, the existing Pod took leadership.

ts=2024-10-11T08:30:35.112964071Z level=info logger=cmd msg="Operator Version: 1.0.0"                                                                                  
ts=2024-10-11T08:30:35.112996891Z level=info logger=cmd msg="Go Version: go1.22.4 (Red Hat 1.22.4-1.module+el8.10.0+21981+729b5bf8) X:boringcrypto,strictfipsruntime"  
ts=2024-10-11T08:30:35.113000995Z level=info logger=cmd msg="Go OS/Arch: linux/amd64"                                                                                  
ts=2024-10-11T08:30:35.113004066Z level=info logger=cmd msg="Version of operator-sdk: v1.21.0"                                                                         
ts=2024-10-11T08:30:35.113945252Z level=debug logger=k8sutil msg="Found namespace" Namespace=test-managed-upgrade-operator                                             
ts=2024-10-11T08:30:35.310814746Z level=info logger=setup msg="Lock already exists with name managed-upgrade-operator-lock in namespace test-managed-upgrade-operator" 
Looking in test-managed-upgrade-operator                                                                                                                               
ts=2024-10-11T08:30:35.313873601Z level=info logger=setup msg="Lock already has existing owner managed-upgrade-operator-5c477fcfd-fx22m, skipping deletion"            
ts=2024-10-11T08:30:35.313887197Z level=info logger=leader msg="Trying to become the leader."                                                                          
ts=2024-10-11T08:30:35.314202447Z level=debug logger=leader msg="Found podname" Pod.Name=managed-upgrade-operator-5c477fcfd-52nr2                                      
ts=2024-10-11T08:30:35.317899721Z level=debug logger=leader msg="Found Pod" Pod.Namespace=test-managed-upgrade-operator Pod.Name=managed-upgrade-operator-5c477fcfd-52n
r2                                                                                                                                                                     
ts=2024-10-11T08:30:35.319785785Z level=info logger=leader msg="Found existing lock" LockOwner=managed-upgrade-operator-5c477fcfd-fx22m                                
ts=2024-10-11T08:30:35.334807308Z level=info logger=leader msg="Not the leader. Waiting."                                                                              
ts=2024-10-11T08:30:36.393479566Z level=info logger=leader msg="Not the leader. Waiting."                                                                              
ts=2024-10-11T08:30:38.654971987Z level=info logger=leader msg="Not the leader. Waiting."                                                                              
ts=2024-10-11T08:30:43.061261542Z level=info logger=leader msg="Not the leader. Waiting."                                                                              
ts=2024-10-11T08:30:52.420026904Z level=info logger=leader msg="Not the leader. Waiting."                                                                              
ts=2024-10-11T08:31:10.910719652Z level=info logger=leader msg="Not the leader. Waiting."                                                                              
ts=2024-10-11T08:31:27.496689267Z level=info logger=leader msg="Not the leader. Waiting."                                                                              
ts=2024-10-11T08:31:45.082753841Z level=info logger=leader msg="Not the leader. Waiting."                                                                              
ts=2024-10-11T08:32:03.419040735Z level=info logger=leader msg="Not the leader. Waiting."
ts=2024-10-11T08:32:21.773196599Z level=info logger=leader msg="Became the leader." 
ts=2024-10-11T08:32:21.773605014Z level=debug logger=k8sutil msg="Found namespace" Namespace=test-managed-upgrade-operator

While the other pod stayed waiting to become leader

ts=2024-10-11T08:30:37.694627426Z level=info logger=cmd msg="Version of operator-sdk: v1.21.0"
ts=2024-10-11T08:30:37.695468461Z level=debug logger=k8sutil msg="Found namespace" Namespace=test-managed-upgrade-operator
ts=2024-10-11T08:30:37.707641854Z level=info logger=setup msg="Lock already exists with name managed-upgrade-operator-lock in namespace test-managed-upgrade-operator"
Looking in test-managed-upgrade-operator
ts=2024-10-11T08:30:37.79454308Z level=info logger=setup msg="Lock already has existing owner managed-upgrade-operator-5c477fcfd-fx22m, skipping deletion"
ts=2024-10-11T08:30:37.794646038Z level=info logger=leader msg="Trying to become the leader."
ts=2024-10-11T08:30:37.893467151Z level=debug logger=leader msg="Found podname" Pod.Name=managed-upgrade-operator-5c477fcfd-gt5v5
ts=2024-10-11T08:30:37.996775964Z level=debug logger=leader msg="Found Pod" Pod.Namespace=test-managed-upgrade-operator Pod.Name=managed-upgrade-operator-5c477fcfd-gt5v5
ts=2024-10-11T08:30:37.998917257Z level=info logger=leader msg="Found existing lock" LockOwner=managed-upgrade-operator-5c477fcfd-fx22m
ts=2024-10-11T08:30:38.013081751Z level=info logger=leader msg="Not the leader. Waiting."
ts=2024-10-11T08:30:39.093839453Z level=info logger=leader msg="Not the leader. Waiting."
ts=2024-10-11T08:30:41.171560483Z level=info logger=leader msg="Not the leader. Waiting."
ts=2024-10-11T08:30:45.826541009Z level=info logger=leader msg="Not the leader. Waiting."
ts=2024-10-11T08:30:55.342271994Z level=info logger=leader msg="Not the leader. Waiting."
t

Tafhim avatar Oct 11 '24 08:10 Tafhim

/lgtm

devppratik avatar Oct 18 '24 07:10 devppratik

@Tafhim: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

openshift-ci[bot] avatar Nov 01 '24 06:11 openshift-ci[bot]

/lgtm

rendhalver avatar Nov 04 '24 21:11 rendhalver

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: devppratik, rendhalver, Tafhim Once this PR has been reviewed and has the lgtm label, please assign ravitri for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci[bot] avatar Nov 04 '24 21:11 openshift-ci[bot]

/hold

Thanks for reviews so far! Adding temporary hold on this PR till it's further reviewed. I did have some comments

ravitri avatar Nov 05 '24 02:11 ravitri

/close

Based on discussion with @Tafhim, we will go ahead with PR https://github.com/openshift/managed-upgrade-operator/pull/483 changes for now and observe if it helps ahead. We believe that the operator-lib bump would help address the issue we were facing (atleast partially). Incase it still continues to impact us, we will open a new PR based on this PR's work (might need to align the solution a bit based on what exact problem we are trying to address).

However, this PR has helped us revise some of the implementation details for the lockfile which was great and will serve good reference for future too. Thanks for your help and work on this Tafhim :+1:

Closing this PR for now.

ravitri avatar Nov 08 '24 08:11 ravitri

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

openshift-merge-robot avatar Nov 08 '24 08:11 openshift-merge-robot

@ravitri: Closed this PR.

In response to this:

/close

Based on discussion with @Tafhim, we will go ahead with PR https://github.com/openshift/managed-upgrade-operator/pull/483 changes for now and observe if it helps ahead. We believe that the operator-lib bump would help address the issue we were facing (atleast partially). Incase it still continues to impact us, we will open a new PR based on this PR's work (might need to align the solution a bit based on what exact problem we are trying to address).

However, this PR has helped us revise some of the implementation details for the lockfile which was great and will serve good reference for future too. Thanks for your help and work on this Tafhim :+1:

Closing this PR for now.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

openshift-ci[bot] avatar Nov 08 '24 08:11 openshift-ci[bot]