
Controller sends many pod delete requests that result in 404 responses

Open • tooptoop4 opened this issue 1 year ago • 6 comments

Pre-requisites

  • [X] I have double-checked my configuration
  • [X] I can confirm the issue exists when I tested with :latest
  • [X] I have searched existing issues and could not find a match for this bug
  • [ ] I'd like to contribute the fix myself (see contributing guide)

What happened/what did you expect to happen?

Although this seems to have no effect on the functioning of Argo Workflows, it could become a stability, performance, or Kubernetes log-cost issue at scale. It must also be filling up the cleanup queue, which delays cleanup of pods that really do exist.

From checking the Kubernetes API server, the workflow controller seems to be sending a delete pod request for <podname from a step>-agent and getting a not-found response. I am using standard workflows like the whalesay example. I'm not sure what the significance of the agent suffix is (I did see https://github.com/argoproj/argo-workflows/blob/66680f1c9bca8b47c40ce918b5d16714058647cb/workflow/controller/agent.go#L25).

I'm seeing thousands of these. It seems that for every pod run, the controller sends this unnecessary delete request for a pod with the -agent suffix?

Version

3.4.11

Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

n/a

Logs from the workflow controller

kubectl logs -n argo deploy/workflow-controller | grep ${workflow}
n/a

Logs from your workflow's wait container

kubectl logs -n argo -c wait -l workflows.argoproj.io/workflow=${workflow},workflow.argoproj.io/phase!=Succeeded
n/a

tooptoop4 • Feb 12 '24 07:02

For reference, this was previously noted in Slack

agilgur5 • Feb 12 '24 17:02

> not sure what the significance of the agent suffix is

As far as I understand, the "agent" is a piece of the Executor that runs for certain built-in, non-container template types (e.g. resource, http, data, script, etc). Anyone else please correct me if I'm wrong; the historical references to agent didn't quite have a standard definition.

> seeing 1000s of these, seems for every pod run it's sending this unrequired delete request for a pod with the -agent suffix?

That sounds like it might be accidentally assuming that each Pod has an agent, when there are only certain types that do 🤔

agilgur5 • Feb 12 '24 17:02

https://github.com/argoproj/argo-workflows/blob/66680f1c9bca8b47c40ce918b5d16714058647cb/workflow/controller/operator.go#L2369 seems to be the line assuming each pod has an agent.

tooptoop4 • Feb 12 '24 18:02

The agent pod is only created if the taskSet is not empty, and each workflow can have at most one agent pod: https://github.com/argoproj/argo-workflows/blob/5c8062e55d975aab117e37c7592c0a648a9e9860/workflow/controller/agent.go#L32-L46

Only http and plugin templates are put into the taskSet right now.

jswxstw • Feb 21 '24 08:02
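
To make the rule from the comment above concrete, here is a minimal sketch in Go. It is not the controller's code: the TemplateType constants and the usesTaskSet and needsAgentPod helpers are hypothetical names chosen for illustration. The only fact taken from the thread is that http and plugin templates are the ones placed into the taskSet, and that an agent pod exists only when the taskSet is non-empty.

```go
package main

import "fmt"

// TemplateType is a hypothetical stand-in for the template kinds named in this thread.
type TemplateType string

const (
	Container TemplateType = "container"
	Script    TemplateType = "script"
	Resource  TemplateType = "resource"
	HTTP      TemplateType = "http"
	Plugin    TemplateType = "plugin"
)

// usesTaskSet reports whether a template type is handed to the agent through the taskSet.
// Assumption taken from the thread: only http and plugin templates qualify.
func usesTaskSet(t TemplateType) bool {
	return t == HTTP || t == Plugin
}

// needsAgentPod reports whether a workflow made of the given templates would
// have an agent pod at all, i.e. whether its taskSet would be non-empty.
func needsAgentPod(templates []TemplateType) bool {
	for _, t := range templates {
		if usesTaskSet(t) {
			return true
		}
	}
	return false
}

func main() {
	whalesay := []TemplateType{Container}  // a plain container workflow
	mixed := []TemplateType{Container, HTTP} // one step goes through the agent

	fmt.Println("whalesay needs agent pod:", needsAgentPod(whalesay))
	fmt.Println("mixed needs agent pod:", needsAgentPod(mixed))
}
```

Under these assumptions a whalesay-style workflow never has an agent pod, so a delete request for its -agent pod can only come back as not found.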

@jswxstw do you want to create a PR?

tooptoop4 • Mar 15 '24 23:03

> @jswxstw do you want to create a PR?

I see you have created a PR, and it basically looks good to me. The only issue is that you could use the existing function woc.hasTaskSetNodes() to determine whether deleting the agent pod is necessary, rather than creating a new function woc.hasNodeWithAgentPod().

jswxstw • Mar 16 '24 15:03