
BUG: Delete and Install Workflows both running during delete operation

Open kevdowney opened this issue 4 years ago • 2 comments

Is this a BUG REPORT or FEATURE REQUEST?: BUG

What happened: The delete workflow was started, and install workflows were started alongside it. During a delete, the install workflow should never run.

$ kubectl get pods -n addon-manager-system
NAME                                                  READY   STATUS              RESTARTS   AGE   IP              NODE                                           NOMINATED NODE   READINESS GATES
addon-manager-argo-server-55bdcf5698-x98bs            0/1     Running             0          7s    10.214.74.145   ip-10-214-113-27.us-west-2.compute.internal    <none>           <none>
addon-manager-controller-799585bbdd-fbq9w             2/2     Running             0          7s    10.214.91.83    ip-10-214-114-211.us-west-2.compute.internal   <none>           <none>
addon-manager-workflow-controller-5c67d9b569-jv62s    1/1     Running             0          7s    10.214.71.186   ip-10-214-113-27.us-west-2.compute.internal    <none>           <none>
fluentd-delete-9b92a37e-wf-1944323580                 0/2     ContainerCreating   0          1s    <none>          ip-10-214-117-208.us-west-2.compute.internal   <none>           <none>
fluentd-install-9520a37d-wf-1356763799                0/2     ContainerCreating   0          1s    <none>          ip-10-214-114-211.us-west-2.compute.internal   <none>           <none>
fluentd-install-9b92a37e-wf-2882905720                0/2     ContainerCreating   0          1s    <none>          ip-10-214-114-211.us-west-2.compute.internal   <none>           <none>
fluentd-install-a626a37f-wf-3816956954                0/2     ContainerCreating   0          1s    <none>          ip-10-214-117-208.us-west-2.compute.internal   <none>           <none>

What you expected to happen:

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

Environment:

  • Addon Manager version
  • Kubernetes version :
$ kubectl version -o yaml

Other debugging information (if applicable):

  • Addon status:
$ kubectl describe addon <addon-name>
  • controller logs:
$ kubectl logs <addon-manager-pod>

kevdowney avatar Sep 18 '20 23:09 kevdowney

Old workflows hang around whether or not they are in a completed state. On some occasions the workflow-controller is stuck waiting for the Workflow Pods to finish executing, because a Pod is stuck in ContainerCreating or some other Pending state. This is potentially a bug in the Argo workflow-controller, as the workflows were never marked failed even with activeDeadlineSeconds being injected.

Restarting the workflow-controller can trigger reconciliation of old workflows that were stuck. That appears to be what happened here: the checksums of all 3 fluentd-install-* workflows are different, yet they all started at the same time. Then, as part of deleting the Addon resource, the delete workflow for the current checksum is launched. We see 1 install workflow and 1 delete workflow with matching checksums, so this is most likely the case.

Addon-manager should actively clean up old workflows in this case.
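
To make the checksum comparison concrete, here is a minimal Python sketch that picks out the pods belonging to old workflows in the listing above. It is not addon-manager code. It assumes pod names follow the layout `<addon>-<lifecycle>-<checksum>-wf-<suffix>`, which is inferred only from the names in this issue, and that the current checksum is the one on the delete workflow (`9b92a37e`). The names `parse_pod_name` and `stale_pods` are hypothetical helpers.

```python
import re
from typing import List, NamedTuple, Optional

# Assumed pod-name layout, inferred only from the listing in this issue:
#   <addon>-<lifecycle>-<checksum>-wf-<numeric suffix>
# Only the lifecycle steps seen in the listing (install, delete) are matched.
POD_NAME = re.compile(
    r"^(?P<addon>.+)-(?P<lifecycle>install|delete)-(?P<checksum>[0-9a-f]{8})-wf-\d+$"
)


class WorkflowPod(NamedTuple):
    addon: str
    lifecycle: str
    checksum: str


def parse_pod_name(name: str) -> Optional[WorkflowPod]:
    """Split a workflow pod name into its parts, or return None if it does not match."""
    m = POD_NAME.match(name)
    if m is None:
        return None
    return WorkflowPod(m["addon"], m["lifecycle"], m["checksum"])


def stale_pods(names: List[str], current_checksum: str) -> List[str]:
    """Return the workflow pods whose checksum differs from the current one."""
    stale = []
    for name in names:
        pod = parse_pod_name(name)
        if pod is not None and pod.checksum != current_checksum:
            stale.append(name)
    return stale


# Pod names copied from the kubectl output in this issue.
pods = [
    "addon-manager-controller-799585bbdd-fbq9w",
    "fluentd-delete-9b92a37e-wf-1944323580",
    "fluentd-install-9520a37d-wf-1356763799",
    "fluentd-install-9b92a37e-wf-2882905720",
    "fluentd-install-a626a37f-wf-3816956954",
]

# Assumption: the delete workflow carries the current checksum.
old = stale_pods(pods, "9b92a37e")
```

With the listing from this issue, `old` contains the two install pods with checksums `9520a37d` and `a626a37f`. The install pod that shares the current checksum is not flagged by this check, so blocking install while a delete is in progress would need a separate rule.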

kevdowney avatar Sep 22 '20 20:09 kevdowney

This issue will be addressed by feature #174

kevdowney avatar Mar 10 '23 21:03 kevdowney

Should now be addressed, closing.

kevdowney avatar Jun 04 '24 17:06 kevdowney