
Workflow controller panics when stopping a workflow that has a lifecycle hook with a plugin

Open roofurmston opened this issue 2 years ago • 13 comments

Checklist

  • [x] Double-checked my configuration.
  • [x] Tested using the latest version.
  • [x] Used the Emissary executor.

Summary

What happened/what you expected to happen?

We have a workflow with a workflow-level lifecycle hook that calls a plugin. (The plugin is a Slack plugin that we wrote internally for notifications.) When the workflow completes (or fails), the lifecycle hook works as expected and calls the plugin successfully.

However, when we stop the workflow from the UI, (1) the lifecycle hook stays in a pending state indefinitely and (2) the workflow controller starts to panic.

Note: We have tested the same workflow with a lifecycle hook that does not call a plugin and this bug is not present, i.e., the lifecycle hook runs properly in that case.

What version are you running?

3.3.2

Diagnostics

Paste the smallest workflow that reproduces the bug. We must be able to run the workflow.

Note: I've not included the plugin. I believe this issue is present for plugins in general, so one of the example ones in Argo Workflows should suffice. I can make a minimal working example of a plugin, if that is required.

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: slack-example-
spec:
  entrypoint: heads
  hooks:
    exit:
      template: slack-alert

  templates:
    - name: slack-alert
      plugin:
        plugin-slack:
          channel: achannel
          message: "{{workflow.name}} finished!"

    - name: heads
      container:
        image: alpine:3.6
        command: [sh, -c]
        args: ["sleep 30; echo \"it was heads\""]
# Logs from the workflow controller:
kubectl logs -n argo deploy/workflow-controller | grep ${workflow} 

time="2022-06-15T10:36:09Z" level=info msg="index config" indexWorkflowSemaphoreKeys=true
time="2022-06-15T10:36:09Z" level=info msg="cron config" cronSyncPeriod=10s
time="2022-06-15T10:36:09Z" level=info msg="Memoization caches will be garbage-collected if they have not been hit after" gcAfterNotHitDuration=30s
time="2022-06-15T10:36:09.701Z" level=info msg="not enabling pprof debug endpoints"
time="2022-06-15T10:36:09.702Z" level=info msg="config map" name=workflow-controller-configmap
time="2022-06-15T10:36:09.717Z" level=info msg="Get configmaps 200"
time="2022-06-15T10:36:09.721Z" level=info msg="Configuration:\nartifactRepository:\n  archiveLogs: false\n  s3:\n    accessKeySecret:\n      key: s3-access-key\n      name: s3-access-key\n    bucket: tfurm76-testbucket\n    endpoint: s3.amazonaws.com\n    keyFormat: logs/{{workflow.creationTimestamp.Y}}/{{workflow.creationTimestamp.m}}/{{workflow.creationTimestamp.d}}/{{workflow.name}}/{{pod.name}}\n    secretKeySecret:\n      key: s3-secret-key\n      name: s3-secret-key\ninitialDelay: 0s\nmetricsConfig: {}\nnodeEvents: {}\npodSpecLogStrategy: {}\ntelemetryConfig: {}\nworkflowDefaults:\n  metadata:\n    creationTimestamp: null\n  spec:\n    arguments: {}\n    ttlStrategy:\n      secondsAfterCompletion: 2592000\n      secondsAfterFailure: 2592000\n      secondsAfterSuccess: 2592000\n  status:\n    finishedAt: null\n    startedAt: null\n"
time="2022-06-15T10:36:09.721Z" level=info msg="Persistence configuration disabled"
I0615 10:36:09.722061       1 leaderelection.go:248] attempting to acquire leader lease argo/workflow-controller...
time="2022-06-15T10:36:09.724Z" level=info msg="Get leases 200"
time="2022-06-15T10:36:09.727Z" level=info msg="Update leases 200"
I0615 10:36:09.728914       1 leaderelection.go:258] successfully acquired lease argo/workflow-controller
time="2022-06-15T10:36:09.729Z" level=info msg="new leader" leader=workflow-controller-6bd7478cb8-qb5lw
time="2022-06-15T10:36:09.729Z" level=info msg="Starting Workflow Controller" defaultRequeueTime=10s version=v3.3.2
time="2022-06-15T10:36:09.729Z" level=info msg="Current Worker Numbers" podCleanup=4 workflow=32 workflowTtl=4
time="2022-06-15T10:36:09.729Z" level=info msg="Watching task results" labelSelector="!workflows.argoproj.io/controller-instanceid,workflows.argoproj.io/workflow"
time="2022-06-15T10:36:09.729Z" level=info msg=Plugins executorPlugins=true
time="2022-06-15T10:36:09.734Z" level=info msg="Create events 201"
time="2022-06-15T10:36:09.734Z" level=info msg="Get leases 200"
time="2022-06-15T10:36:09.734Z" level=info msg="List workflows 200"
time="2022-06-15T10:36:09.738Z" level=info msg="Update leases 200"
time="2022-06-15T10:36:09.741Z" level=info msg="Manager initialized successfully"
time="2022-06-15T10:36:09.743Z" level=info msg="Watch configmaps 200"
time="2022-06-15T10:36:09.744Z" level=info msg="List configmaps 200"
time="2022-06-15T10:36:09.744Z" level=info msg="List workflowtemplates 200"
time="2022-06-15T10:36:09.744Z" level=info msg="List workflowtasksets 200"
time="2022-06-15T10:36:09.745Z" level=info msg="List pods 200"
time="2022-06-15T10:36:09.745Z" level=info msg="List workflows 200"
time="2022-06-15T10:36:09.745Z" level=info msg="List workflowtaskresults 200"
time="2022-06-15T10:36:09.745Z" level=info msg="List configmaps 200"
time="2022-06-15T10:36:09.747Z" level=info msg="Watch workflowtemplates 200"
time="2022-06-15T10:36:09.747Z" level=info msg="Watch configmaps 200"
time="2022-06-15T10:36:09.748Z" level=info msg="Executor plugin added" name=mlplatform-plugin-slack-executor-plugin namespace=argo
time="2022-06-15T10:36:09.747Z" level=info msg="Watch configmaps 200"
time="2022-06-15T10:36:09.747Z" level=info msg="Watch workflowtaskresults 200"
time="2022-06-15T10:36:09.747Z" level=info msg="Watch pods 200"
time="2022-06-15T10:36:09.749Z" level=info msg="Watch workflows 200"
time="2022-06-15T10:36:09.749Z" level=info msg="Watch workflowtasksets 200"
time="2022-06-15T10:36:09.849Z" level=info msg="Create selfsubjectaccessreviews 201"
time="2022-06-15T10:36:09.853Z" level=info msg="Create selfsubjectaccessreviews 201"
time="2022-06-15T10:36:09.856Z" level=info msg="Create selfsubjectaccessreviews 201"
time="2022-06-15T10:36:09.858Z" level=info msg="Performing periodic GC" periodicity=5m0s
time="2022-06-15T10:36:09.862Z" level=info msg="Starting prometheus metrics server at localhost:9090/metrics"
time="2022-06-15T10:36:09.862Z" level=info msg="Persistence disabled - so archived workflow GC disabled - you must restart the controller if you enable this"
time="2022-06-15T10:36:09.862Z" level=info msg="Starting workflow garbage collector controller (retentionWorkers 4)"
time="2022-06-15T10:36:09.864Z" level=info msg="Started workflow garbage collection"
time="2022-06-15T10:36:09.862Z" level=info msg="List clusterworkflowtemplates 200"
time="2022-06-15T10:36:09.863Z" level=info msg="Starting CronWorkflow controller"
W0615 10:36:09.866191       1 shared_informer.go:372] The sharedIndexInformer has started, run more than once is not allowed
time="2022-06-15T10:36:09.869Z" level=info msg="Watch clusterworkflowtemplates 200"
time="2022-06-15T10:36:09.869Z" level=info msg="List workflows 200"
time="2022-06-15T10:36:09.869Z" level=info msg="List cronworkflows 200"
time="2022-06-15T10:36:09.871Z" level=info msg="Watch workflows 200"
time="2022-06-15T10:36:09.872Z" level=info msg="Watch cronworkflows 200"
time="2022-06-15T10:36:09.883Z" level=info msg="Queueing Succeeded workflow argo/slack-example-qlxx2 for delete in 717h57m40s due to TTL"
time="2022-06-15T10:36:09.884Z" level=info msg="Queueing Failed workflow argo/slack-example-47vq8 for delete in 717h59m44s due to TTL"
time="2022-06-15T10:36:09.884Z" level=info msg="Queueing Succeeded workflow argo/slack-example-bvhmv for delete in 719h57m14s due to TTL"
time="2022-06-15T10:36:09.885Z" level=info msg="Processing workflow" namespace=argo workflow=slack-example-p6mh5
time="2022-06-15T10:36:09.886Z" level=info msg="Task-result reconciliation" namespace=argo numObjs=0 workflow=slack-example-p6mh5
time="2022-06-15T10:36:09.886Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=slack-example-p6mh5
time="2022-06-15T10:36:09.886Z" level=info msg=reconcileAgentPod namespace=argo workflow=slack-example-p6mh5
time="2022-06-15T10:36:09.886Z" level=info msg="Running OnExit handler: " namespace=argo workflow=slack-example-p6mh5
time="2022-06-15T10:36:09.886Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=slack-example-p6mh5
time="2022-06-15T10:36:09.886Z" level=info msg="Creating TaskSet" namespace=argo workflow=slack-example-p6mh5
time="2022-06-15T10:36:09.892Z" level=info msg="Create workflowtasksets 409"
time="2022-06-15T10:36:09.897Z" level=info msg="Patch workflowtasksets 200"
time="2022-06-15T10:36:09.897Z" level=info msg=reconcileAgentPod namespace=argo workflow=slack-example-p6mh5
time="2022-06-15T10:36:09.899Z" level=info msg="Get secrets 404"
time="2022-06-15T10:36:09.901Z" level=info msg="Get serviceaccounts 200"
time="2022-06-15T10:36:09.910Z" level=info msg="Create pods 201"
time="2022-06-15T10:36:09.911Z" level=info msg="Created Agent pod" namespace=argo podName=slack-example-p6mh5-1340600742-agent workflow=slack-example-p6mh5
time="2022-06-15T10:36:09.911Z" level=info msg=updateAgentPodStatus namespace=argo workflow=slack-example-p6mh5
time="2022-06-15T10:36:09.911Z" level=info msg=assessAgentPodStatus namespace=argo podName=slack-example-p6mh5-1340600742-agent
time="2022-06-15T10:36:14.724Z" level=info msg="Get leases 200"
time="2022-06-15T10:36:14.731Z" level=info msg="Update leases 200"
time="2022-06-15T10:36:19.749Z" level=info msg="Get leases 200"
time="2022-06-15T10:36:19.755Z" level=info msg="Update leases 200"
time="2022-06-15T10:36:19.892Z" level=info msg="Processing workflow" namespace=argo workflow=slack-example-p6mh5
time="2022-06-15T10:36:19.894Z" level=info msg="Task-result reconciliation" namespace=argo numObjs=0 workflow=slack-example-p6mh5
time="2022-06-15T10:36:19.895Z" level=info msg=updateAgentPodStatus namespace=argo workflow=slack-example-p6mh5
time="2022-06-15T10:36:19.895Z" level=info msg=assessAgentPodStatus namespace=argo podName=slack-example-p6mh5-1340600742-agent
time="2022-06-15T10:36:19.895Z" level=info msg="Deleting Pending pod argo/slack-example-p6mh5-1340600742-agent as part of workflow shutdown with strategy: Stop" namespace=argo workflow=slack-example-p6mh5
time="2022-06-15T10:36:19.906Z" level=info msg="Delete pods 200"
panic: workflow 'slack-example-p6mh5' node '' uninitialized when marking as Failed: [workflow shutdown with strategy:  Stop]

goroutine 268 [running]:
github.com/argoproj/argo-workflows/v3/workflow/controller.(*wfOperationCtx).markNodePhase(0xc000cfe180, {0x0, 0x0}, {0x1f8a3be, 0xc0004cee40}, {0xc000791680, 0x1, 0x1})
	/go/src/github.com/argoproj/argo-workflows/workflow/controller/operator.go:2225 +0xae5
github.com/argoproj/argo-workflows/v3/workflow/controller.(*wfOperationCtx).handleExecutionControlError(0xc000cfe180, {0xc000446f80, 0x1e}, 0xc00088cc78, {0xc0004cefc0, 0x26})
	/go/src/github.com/argoproj/argo-workflows/workflow/controller/exec_control.go:82 +0x187
github.com/argoproj/argo-workflows/v3/workflow/controller.(*wfOperationCtx).applyExecutionControl(0xc000cfe180, {0x22adfe8, 0xc000128000}, 0xc00021f000, 0x992577)
	/go/src/github.com/argoproj/argo-workflows/workflow/controller/exec_control.go:44 +0x6ea
github.com/argoproj/argo-workflows/v3/workflow/controller.(*wfOperationCtx).podReconciliation.func2(0x0)
	/go/src/github.com/argoproj/argo-workflows/workflow/controller/operator.go:1038 +0xb0
created by github.com/argoproj/argo-workflows/v3/workflow/controller.(*wfOperationCtx).podReconciliation
	/go/src/github.com/argoproj/argo-workflows/workflow/controller/operator.go:1035 +0x21a

# If the workflow's pods have not been created, you can skip the rest of the diagnostics.

# The workflow's pods that are problematic:
kubectl get pod -o yaml -l workflows.argoproj.io/workflow=${workflow},workflow.argoproj.io/phase!=Succeeded

# Logs from in your workflow's wait container, something like:
kubectl logs -c wait -l workflows.argoproj.io/workflow=${workflow},workflow.argoproj.io/phase!=Succeeded

Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.

roofurmston avatar Jun 15 '22 14:06 roofurmston

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If this is a mentoring request, please provide an update here. Thank you for your contributions.

stale[bot] avatar Jul 10 '22 06:07 stale[bot]

As far as I am aware this issue still exists. Still waiting for confirmation from the Argo team.

roofurmston avatar Jul 11 '22 08:07 roofurmston

Hi, we have the same issue, and it's really painful. Could anyone have a look at this? @sarabala1979

whybeyoung avatar Jul 14 '22 02:07 whybeyoung

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If this is a mentoring request, please provide an update here. Thank you for your contributions.

stale[bot] avatar Jul 31 '22 04:07 stale[bot]

This is still not stale as far as I am aware.

It's a really bad issue. Can someone look into it please?

roofurmston avatar Jul 31 '22 07:07 roofurmston

We have very limited resources on the Argo core team, and we are busy with other issues. Let me add it to the backlog and prioritize.


sarabala1979 avatar Jul 31 '22 13:07 sarabala1979

@roofurmston I am not able to reproduce on latest and 3.3.8. Can you try with 3.3.8?

sarabala1979 avatar Jul 31 '22 18:07 sarabala1979

@berlinsaint

sarabala1979 avatar Jul 31 '22 18:07 sarabala1979

Sure, no problem. I will try with 3.3.8 this week.

Out of interest, what plugin did you use with your test?

roofurmston avatar Aug 01 '22 06:08 roofurmston

This issue has been closed due to inactivity. Feel free to re-open if you still encounter this issue.

stale[bot] avatar Aug 12 '22 12:08 stale[bot]

@alexc It looks like the plugin node is not initialized, so this looks like a bug in the plugin handling. Can you take a look?


sarabala1979 avatar Oct 11 '22 09:10 sarabala1979

@sarabala1979 Sorry, you have the wrong guy, as people often do with this project. You need @alexec ...

AlexC avatar Oct 11 '22 09:10 AlexC

The controller rarely panics unless there is a bug in the code. Unfortunately, panics tend to hide the real bug.

To fix, change this line of code:

https://github.com/argoproj/argo-workflows/blob/cd43bba6c87d185bd1530c03c99b874eeceba966/workflow/controller/operator.go#L2304

To

	if node == nil {
		node = &wfv1.NodeStatus{}
	}

Could someone please submit a PR?
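
For context, here is a small self-contained illustration of what that guard changes (using a stand-in `NodeStatus` type rather than the real `wfv1` package, so this is only a sketch of the behaviour, not the controller code): instead of panicking when the hook's node was never initialized, `markNodePhase` would fall back to an empty node and still record the Failed phase and shutdown message.

```go
package main

import "fmt"

// NodeStatus is a stand-in for wfv1.NodeStatus, just enough to show the guard.
type NodeStatus struct {
	Phase   string
	Message string
}

// markNodePhase mimics the shape of the controller helper in the stack trace
// above: previously a nil node triggered the panic ("node '' uninitialized
// when marking as Failed"); with the guard it degrades to an empty node.
func markNodePhase(node *NodeStatus, phase, message string) *NodeStatus {
	if node == nil {
		node = &NodeStatus{}
	}
	node.Phase = phase
	node.Message = message
	return node
}

func main() {
	// The plugin hook node was never created before shutdown, so the lookup
	// returns nil, which is the situation from the panic above.
	var missing *NodeStatus
	fmt.Printf("%+v\n", markNodePhase(missing, "Failed", "workflow shutdown with strategy: Stop"))
}
```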

alexec avatar Oct 11 '22 14:10 alexec