argo-workflows
Emissary Executor Zombie Processes
Checklist
- [X] Double-checked my configuration.
- [X] Tested using the latest version.
- [X] Used the Emissary executor.
Summary
What happened/what you expected to happen?
When a service is restarted inside a container run by the Emissary executor, the old process does not seem to be cleaned up properly. A zombie process remains, shown as <defunct> in the ps output.
What version are you running?
Originally v3.3.6 but also tried with v3.3.9 and v3.4.0-rc2
Diagnostics
Paste the smallest workflow that reproduces the bug. We must be able to run the workflow.
Workflow to replicate:
metadata:
  generateName: emiss-test-
spec:
  entrypoint: first-step
  templates:
    - name: first-step
      metadata:
      container:
        name: main
        image: ubuntu:22.04
        command: ["bash","-c"]
        args:
          - |
            ps aux
            apt-get update
            apt-get install -y psmisc
            apt-get purge openssh-server openssh-client
            apt-get install -y openssh-server openssh-client
            service ssh start
            ps aux
            pstree
            service ssh restart
            ps aux
            pstree
        resources:
          limits:
            cpu: 200m
            memory: 200Mi
Output:
* Starting OpenBSD Secure Shell server sshd
...done.
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 0.1 0.0 760096 38320 ? Ssl 20:38 0:00 /var/run/argo/argoexec emissary -- bash -c ps aux apt-get update apt-get install -y psmisc apt-get purge openssh-server openssh-client apt-get install -y openssh-server openssh-client service ssh start ps aux pstree service ssh restart ps aux pstree
root 28 0.0 0.0 3980 2952 ? S 20:38 0:00 /usr/bin/bash -c ps aux apt-get update apt-get install -y psmisc apt-get purge openssh-server openssh-client apt-get install -y openssh-server openssh-client service ssh start ps aux pstree service ssh restart ps aux pstree
root 3970 0.0 0.0 12176 3036 ? Ss 20:40 0:00 sshd: /usr/sbin/sshd [listener] 0 of 10-100 startups
root 3971 0.0 0.0 5896 3004 ? R 20:40 0:00 ps aux
argoexec-+-bash---pstree
         |-sshd
         `-18*[{argoexec}]
* Restarting OpenBSD Secure Shell server sshd
...done.
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 0.1 0.0 760096 38320 ? Ssl 20:38 0:00 /var/run/argo/argoexec emissary -- bash -c ps aux apt-get update apt-get install -y psmisc apt-get purge openssh-server openssh-client apt-get install -y openssh-server openssh-client service ssh start ps aux pstree service ssh restart ps aux pstree
root 28 0.0 0.0 3980 2952 ? S 20:38 0:00 /usr/bin/bash -c ps aux apt-get update apt-get install -y psmisc apt-get purge openssh-server openssh-client apt-get install -y openssh-server openssh-client service ssh start ps aux pstree service ssh restart ps aux pstree
root 3970 0.0 0.0 0 0 ? Zs 20:40 0:00 [sshd] <defunct>
root 3983 0.0 0.0 12176 3052 ? Ss 20:40 0:00 sshd: /usr/sbin/sshd [listener] 0 of 10-100 startups
root 3984 0.0 0.0 5896 2900 ? R 20:40 0:00 ps aux
argoexec-+-pstree
         |-2*[sshd]
         `-18*[{argoexec}]
Note: after the service restart there is a zombie process (marked <defunct> in the ps output), and pstree shows two sshd entries (2*[sshd]).
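For longer process listings, the zombie rows can be picked out by filtering on the STAT column. This is a minimal sketch, assuming `ps aux`-style columns as in the output above (STAT is column 8, COMMAND starts at column 11). `list_zombies` is a hypothetical helper name, not part of Argo or procps.

```shell
# list_zombies (hypothetical helper): reads `ps aux`-style text on stdin and
# prints "PID COMMAND" for every row whose STAT column starts with Z.
list_zombies() {
  awk 'NR > 1 && $8 ~ /^Z/ {
    cmd = $11
    for (i = 12; i <= NF; i++) cmd = cmd " " $i
    print $2, cmd
  }'
}
```

Inside the container this could be run as `ps aux | list_zombies`.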
Same workflow with docker executor:
metadata:
  generateName: docker-test-
  labels:
    workflows.argoproj.io/container-runtime-executor: docker
spec:
  entrypoint: first-step
  templates:
    - name: first-step
      metadata:
      container:
        name: main
        image: ubuntu:22.04
        command: ["bash","-c"]
        args:
          - |
            ps aux
            apt-get update
            apt-get install -y psmisc
            apt-get purge openssh-server openssh-client
            apt-get install -y openssh-server openssh-client
            service ssh start
            ps aux
            pstree
            service ssh restart
            ps aux
            pstree
        resources:
          limits:
            cpu: 200m
            memory: 200Mi
Output:
* Starting OpenBSD Secure Shell server sshd
...done.
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 0.0 0.0 3980 3024 ? Ss 20:38 0:00 bash -c ps aux apt-get update apt-get install -y psmisc apt-get purge openssh-server openssh-client apt-get install -y openssh-server openssh-client service ssh start ps aux pstree service ssh restart ps aux pstree
root 3948 0.0 0.0 12176 3016 ? Ss 20:40 0:00 sshd: /usr/sbin/sshd [listener] 0 of 10-100 startups
root 3949 0.0 0.0 5896 2868 ? R 20:40 0:00 ps aux
bash-+-pstree
     `-sshd
* Restarting OpenBSD Secure Shell server sshd
...done.
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 0.0 0.0 3980 3024 ? Ss 20:38 0:00 bash -c ps aux apt-get update apt-get install -y psmisc apt-get purge openssh-server openssh-client apt-get install -y openssh-server openssh-client service ssh start ps aux pstree service ssh restart ps aux pstree
root 3961 0.0 0.0 12176 3108 ? Ss 20:40 0:00 sshd: /usr/sbin/sshd [listener] 0 of 10-100 startups
root 3962 0.0 0.0 5896 2860 ? R 20:40 0:00 ps aux
pstree---sshd
Note: no zombie processes, and pstree shows only one instance of sshd.
# Logs from the workflow controller:
kubectl logs -n argo deploy/workflow-controller | grep ${workflow}
time="2022-08-25T21:30:30.804Z" level=info msg="Processing workflow" namespace=argo workflow=emiss-test-npwx5
time="2022-08-25T21:30:30.808Z" level=info msg="Updated phase -> Running" namespace=argo workflow=emiss-test-npwx5
time="2022-08-25T21:30:30.809Z" level=info msg="Pod node emiss-test-npwx5 initialized Pending" namespace=argo workflow=emiss-test-npwx5
time="2022-08-25T21:30:30.832Z" level=info msg="Created pod: emiss-test-npwx5 (emiss-test-npwx5)" namespace=argo workflow=emiss-test-npwx5
time="2022-08-25T21:30:30.832Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=emiss-test-npwx5
time="2022-08-25T21:30:30.832Z" level=info msg=reconcileAgentPod namespace=argo workflow=emiss-test-npwx5
time="2022-08-25T21:30:30.841Z" level=info msg="Workflow update successful" namespace=argo phase=Running resourceVersion=11382 workflow=emiss-test-npwx5
time="2022-08-25T21:30:40.836Z" level=info msg="Processing workflow" namespace=argo workflow=emiss-test-npwx5
time="2022-08-25T21:30:40.836Z" level=info msg="Task-result reconciliation" namespace=argo numObjs=0 workflow=emiss-test-npwx5
time="2022-08-25T21:30:40.836Z" level=info msg="node changed" namespace=argo new.message= new.phase=Running new.progress=0/1 nodeID=emiss-test-npwx5 old.message= old.phase=Pending old.progress=0/1 workflow=emiss-test-npwx5
time="2022-08-25T21:30:40.836Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=emiss-test-npwx5
time="2022-08-25T21:30:40.836Z" level=info msg=reconcileAgentPod namespace=argo workflow=emiss-test-npwx5
time="2022-08-25T21:30:40.849Z" level=info msg="Workflow update successful" namespace=argo phase=Running resourceVersion=11406 workflow=emiss-test-npwx5
time="2022-08-25T21:30:50.829Z" level=info msg="Processing workflow" namespace=argo workflow=emiss-test-npwx5
time="2022-08-25T21:30:50.829Z" level=info msg="Task-result reconciliation" namespace=argo numObjs=0 workflow=emiss-test-npwx5
time="2022-08-25T21:30:50.829Z" level=info msg="node unchanged" namespace=argo nodeID=emiss-test-npwx5 workflow=emiss-test-npwx5
time="2022-08-25T21:30:50.829Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=emiss-test-npwx5
time="2022-08-25T21:30:50.829Z" level=info msg=reconcileAgentPod namespace=argo workflow=emiss-test-npwx5
Seemingly related GitHub issues:
- https://github.com/argoproj/argo-workflows/issues/8680
- https://github.com/argoproj/argo-workflows/issues/8246
- https://github.com/argoproj/argo-workflows/issues/7259
Message from the maintainers:
Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.
@alexec Would you like to take a look at this?
I'm afraid I do not have bandwidth.
Question - what is the negative impact of these zombies?
One (possibly narrow) example: a process is being restarted, and something watches the PID of the original process to see that it has stopped before starting it up again. The zombie still holds that PID, so the process won't restart.
Or more simply, if anything waits for the original process to stop before performing some action X, X may be blocked.
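To illustrate the PID-monitoring case: an existence check such as `kill -0 "$pid"` still succeeds while the process is a zombie, because its process-table entry remains until it is reaped. A monitor can avoid being blocked by also looking at the process state. This is a minimal sketch, assuming a Linux `/proc` filesystem. `pid_is_alive` and the `PROC_ROOT` override are hypothetical names introduced here for illustration.

```shell
# pid_is_alive PID (hypothetical helper): succeeds only if PID exists and is
# not a zombie (Z) or dead (X) process.
# PROC_ROOT is an assumed override for testing; it defaults to /proc.
pid_is_alive() {
  stat_file="${PROC_ROOT:-/proc}/$1/stat"
  [ -r "$stat_file" ] || return 1
  stat_line=$(cat "$stat_file") || return 1
  # The command name is wrapped in parentheses and may contain spaces, so
  # take the first field after the last ") " as the state.
  rest=${stat_line##*) }
  state=${rest%% *}
  [ "$state" != "Z" ] && [ "$state" != "X" ]
}
```

A restart script could then wait with `while pid_is_alive "$old_pid"; do sleep 1; done` rather than testing only for the PID's existence.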
There is a loop that reaps 1 zombie per second. Maybe it does reap them, but not soon enough?
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If this is a mentoring request, please provide an update here. Thank you for your contributions.
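@sarabala1979 this can be closed. I have confirmed that the zombie processes do eventually get reaped. The following workflow, which sleeps for 120 seconds after the restart before running ps aux again, can be used to confirm: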
metadata:
  generateName: emiss-test-
spec:
  entrypoint: first-step
  templates:
    - name: first-step
      metadata:
      container:
        name: main
        image: ubuntu:22.04
        command: ["bash","-c"]
        args:
          - |
            ps aux
            apt-get update
            apt-get install -y psmisc
            apt-get purge openssh-server openssh-client
            apt-get install -y openssh-server openssh-client
            service ssh start
            ps aux
            pstree
            service ssh restart
            sleep 120
            ps aux
            pstree
        resources:
          limits:
            cpu: 200m
            memory: 200Mi
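To measure how long reaping actually takes rather than waiting a fixed 120 seconds, the script could poll the process table instead. This is a rough sketch, assuming the procps `ps` shipped in the ubuntu image (`-eo stat=` prints only the state column, without a header). `wait_until_reaped` is a hypothetical helper name.

```shell
# wait_until_reaped TIMEOUT_SECONDS (hypothetical helper): polls once per
# second until no process is in state Z; fails if zombies remain after the
# timeout has elapsed.
wait_until_reaped() {
  remaining=$1
  while :; do
    # Succeed as soon as no state string starts with Z.
    if ! ps -eo stat= | grep -q '^ *Z'; then
      return 0
    fi
    [ "$remaining" -gt 0 ] || return 1
    sleep 1
    remaining=$((remaining - 1))
  done
}
```

In the workflow script, `sleep 120` could then be replaced with something like `wait_until_reaped 120 && echo "zombies reaped"`.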