addon-manager
BUG: Delete and Install Workflows both running during delete operation
Is this a BUG REPORT or FEATURE REQUEST?:
BUG
What happened:
A delete workflow and an install workflow were both started while the addon was being deleted.
An install wf should never run during a delete.
kubectl get pods -n addon-manager-system
NAME                                                READY  STATUS             RESTARTS  AGE  IP             NODE                                          NOMINATED NODE  READINESS GATES
addon-manager-argo-server-55bdcf5698-x98bs          0/1    Running            0         7s   10.214.74.145  ip-10-214-113-27.us-west-2.compute.internal   <none>          <none>
addon-manager-controller-799585bbdd-fbq9w           2/2    Running            0         7s   10.214.91.83   ip-10-214-114-211.us-west-2.compute.internal  <none>          <none>
addon-manager-workflow-controller-5c67d9b569-jv62s  1/1    Running            0         7s   10.214.71.186  ip-10-214-113-27.us-west-2.compute.internal   <none>          <none>
fluentd-delete-9b92a37e-wf-1944323580               0/2    ContainerCreating  0         1s   <none>         ip-10-214-117-208.us-west-2.compute.internal  <none>          <none>
fluentd-install-9520a37d-wf-1356763799              0/2    ContainerCreating  0         1s   <none>         ip-10-214-114-211.us-west-2.compute.internal  <none>          <none>
fluentd-install-9b92a37e-wf-2882905720              0/2    ContainerCreating  0         1s   <none>         ip-10-214-114-211.us-west-2.compute.internal  <none>          <none>
fluentd-install-a626a37f-wf-3816956954              0/2    ContainerCreating  0         1s   <none>         ip-10-214-117-208.us-west-2.compute.internal  <none>          <none>
What you expected to happen:
How to reproduce it (as minimally and precisely as possible):
Anything else we need to know?:
Environment:
- Addon Manager version
- Kubernetes version :
$ kubectl version -o yaml
Other debugging information (if applicable):
- Addon status:
$ kubectl describe addon <addon-name>
- controller logs:
$ kubectl logs <addon-manager-pod>
Old workflows are hanging around whether or not they are in a completed state. On some occasions the workflow-controller
is stuck waiting for the Workflow Pods to finish executing, because a Pod is stuck in ContainerCreating or in some other Pending state. This is potentially a bug in the Argo workflow-controller, as the workflows were never marked failed even with activeDeadlineSeconds
being injected.
Restarting the workflow-controller can trigger reconciliation of old workflows that were stuck. This appears to be the case here: the checksums of the 3 fluentd-install-*
workflows are all different, yet all of them started at the same time. Then, as part of deleting the Addon resource, the delete wf for the current checksum is launched. We see 1 install wf and 1 delete wf with matching checksums, so this is most likely what happened.
Addon-manager should actively clean up old workflows in this case.
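To make the checksum reasoning above concrete, here is a rough sketch of how the pods from the listing could be classified. It is not addon-manager code: the function name `unexpected_workflow_pods` is hypothetical, and the name pattern `<addon>-<lifecycle>-<checksum>-wf-<suffix>` is only inferred from the pod names in this issue.

```python
import re

# Assumed naming pattern, inferred from the pod listing in this issue:
#   <addon>-<lifecycle>-<checksum>-wf-<numeric pod suffix>
POD_NAME = re.compile(
    r"^(?P<addon>.+)-(?P<lifecycle>install|delete)-(?P<checksum>[0-9a-f]+)-wf-\d+$"
)


def unexpected_workflow_pods(pod_names, addon, current_checksum, deleting):
    """Return {pod name: reason} for workflow pods that should not be running.

    Hypothetical helper: flags pods whose checksum differs from the addon's
    current checksum, and install pods seen while the addon is being deleted.
    """
    flagged = {}
    for name in pod_names:
        match = POD_NAME.match(name)
        if match is None or match.group("addon") != addon:
            continue  # not a workflow pod of this addon
        if match.group("checksum") != current_checksum:
            flagged[name] = "stale checksum"
        elif deleting and match.group("lifecycle") != "delete":
            flagged[name] = "install during delete"
    return flagged


# Pod names copied from the listing in this issue.
pods = [
    "addon-manager-argo-server-55bdcf5698-x98bs",
    "addon-manager-controller-799585bbdd-fbq9w",
    "addon-manager-workflow-controller-5c67d9b569-jv62s",
    "fluentd-delete-9b92a37e-wf-1944323580",
    "fluentd-install-9520a37d-wf-1356763799",
    "fluentd-install-9b92a37e-wf-2882905720",
    "fluentd-install-a626a37f-wf-3816956954",
]

# 9b92a37e is taken as the current checksum because the delete wf carries it.
flagged = unexpected_workflow_pods(pods, "fluentd", "9b92a37e", deleting=True)
for name, reason in flagged.items():
    print(f"{name}: {reason}")
```

Under these assumptions, the two install pods with checksums 9520a37d and a626a37f are flagged as stale, and the install pod with the current checksum is flagged because it runs during a delete. Only the delete pod is left unflagged.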
This issue will be addressed by feature #174
Should now be addressed, closing.