Emissary Executor Zombie Processes

Open BhavikaSharma opened this issue 2 years ago • 6 comments

Checklist

  • [X] Double-checked my configuration.
  • [X] Tested using the latest version.
  • [X] Used the Emissary executor.

Summary

What happened/what you expected to happen?

When a service is restarted inside a container run by the Emissary executor, the old process does not appear to be cleaned up properly. It remains as a zombie process, shown as <defunct> in the ps output.

What version are you running?

Originally v3.3.6, but I also tried v3.3.9 and v3.4.0-rc2.

Diagnostics

Paste the smallest workflow that reproduces the bug. We must be able to run the workflow.

Workflow to replicate:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: emiss-test-
spec:
  entrypoint: first-step
  templates:
    - name: first-step
      metadata:
      container:
        name: main
        image: ubuntu:22.04 
        command: ["bash","-c"]
        args:
           - |
            ps aux
            apt-get update
            apt-get install -y psmisc
            apt-get purge openssh-server openssh-client
            apt-get install -y openssh-server openssh-client
            service ssh start
            ps aux 
            pstree
            service ssh restart
            ps aux 
            pstree
        resources:
          limits:
            cpu: 200m
            memory: 200Mi

Output:

 * Starting OpenBSD Secure Shell server sshd
   ...done.
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         1  0.1  0.0 760096 38320 ?        Ssl  20:38   0:00 /var/run/argo/argoexec emissary -- bash -c ps aux apt-get update apt-get install -y psmisc apt-get purge openssh-ser
ver openssh-client apt-get install -y openssh-server openssh-client service ssh start ps aux  pstree service ssh restart ps aux  pstree
root        28  0.0  0.0   3980  2952 ?        S    20:38   0:00 /usr/bin/bash -c ps aux apt-get update apt-get install -y psmisc apt-get purge openssh-server openssh-client apt-get
 install -y openssh-server openssh-client service ssh start ps aux  pstree service ssh restart ps aux  pstree
root      3970  0.0  0.0  12176  3036 ?        Ss   20:40   0:00 sshd: /usr/sbin/sshd [listener] 0 of 10-100 startups
root      3971  0.0  0.0   5896  3004 ?        R    20:40   0:00 ps aux
argoexec-+-bash---pstree
         |-sshd
         `-18*[{argoexec}]

 * Restarting OpenBSD Secure Shell server sshd
   ...done.

USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         1  0.1  0.0 760096 38320 ?        Ssl  20:38   0:00 /var/run/argo/argoexec emissary -- bash -c ps aux apt-get update apt-get install -y psmisc apt-get purge openssh-ser
ver openssh-client apt-get install -y openssh-server openssh-client service ssh start ps aux  pstree service ssh restart ps aux  pstree
root        28  0.0  0.0   3980  2952 ?        S    20:38   0:00 /usr/bin/bash -c ps aux apt-get update apt-get install -y psmisc apt-get purge openssh-server openssh-client apt-get
 install -y openssh-server openssh-client service ssh start ps aux  pstree service ssh restart ps aux  pstree
root      3970  0.0  0.0      0     0 ?        Zs   20:40   0:00 [sshd] <defunct>
root      3983  0.0  0.0  12176  3052 ?        Ss   20:40   0:00 sshd: /usr/sbin/sshd [listener] 0 of 10-100 startups
root      3984  0.0  0.0   5896  2900 ?        R    20:40   0:00 ps aux
argoexec-+-pstree
         |-2*[sshd]
         `-18*[{argoexec}]

Note: after the service restarts there is a zombie process (marked <defunct> in the ps output), and pstree shows two sshd instances (2*[sshd]).
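For anyone who wants to check a captured log for this automatically, here is a minimal sketch that scans `ps aux` output for rows whose STAT column starts with `Z`. The helper name `find_zombies` and the `SAMPLE` constant are hypothetical and not part of Argo; the two sample rows are copied from the Emissary output above.

```python
from typing import List, Tuple


def find_zombies(ps_aux_output: str) -> List[Tuple[int, str]]:
    """Return (pid, command) for each `ps aux` row whose STAT starts with 'Z'."""
    zombies = []
    for line in ps_aux_output.splitlines():
        # Columns: USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
        fields = line.split(None, 10)
        if len(fields) < 11 or not fields[1].isdigit():
            continue  # header, wrapped continuation line, or pstree output
        if fields[7].startswith("Z"):
            zombies.append((int(fields[1]), fields[10]))
    return zombies


# Header plus two rows copied from the Emissary output above.
SAMPLE = """\
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root      3970  0.0  0.0      0     0 ?        Zs   20:40   0:00 [sshd] <defunct>
root      3983  0.0  0.0  12176  3052 ?        Ss   20:40   0:00 sshd: /usr/sbin/sshd [listener] 0 of 10-100 startups
"""

if __name__ == "__main__":
    print(find_zombies(SAMPLE))  # [(3970, '[sshd] <defunct>')]
```

Rows are skipped unless the second field is numeric, so the wrapped command lines and the pstree output in the logs above are ignored rather than misparsed.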

Same workflow with docker executor:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: docker-test-
  labels:
    workflows.argoproj.io/container-runtime-executor: docker
spec:
  entrypoint: first-step
  templates:
    - name: first-step
      metadata:
      container:
        name: main
        image: ubuntu:22.04 
        command: ["bash","-c"]
        args:
           - |
            ps aux
            apt-get update
            apt-get install -y psmisc
            apt-get purge openssh-server openssh-client
            apt-get install -y openssh-server openssh-client
            service ssh start
            ps aux 
            pstree
            service ssh restart
            ps aux 
            pstree
        resources:
          limits:
            cpu: 200m
            memory: 200Mi

Output:

 * Starting OpenBSD Secure Shell server sshd
   ...done.
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         1  0.0  0.0   3980  3024 ?        Ss   20:38   0:00 bash -c ps aux apt-get update apt-get install -y psmisc apt-get purge openssh-server openssh-client apt-get install
-y openssh-server openssh-client service ssh start ps aux  pstree service ssh restart ps aux  pstree
root      3948  0.0  0.0  12176  3016 ?        Ss   20:40   0:00 sshd: /usr/sbin/sshd [listener] 0 of 10-100 startups
root      3949  0.0  0.0   5896  2868 ?        R    20:40   0:00 ps aux
bash-+-pstree
     `-sshd

 * Restarting OpenBSD Secure Shell server sshd
   ...done.

USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         1  0.0  0.0   3980  3024 ?        Ss   20:38   0:00 bash -c ps aux apt-get update apt-get install -y psmisc apt-get purge openssh-server openssh-client apt-get install
-y openssh-server openssh-client service ssh start ps aux  pstree service ssh restart ps aux  pstree
root      3961  0.0  0.0  12176  3108 ?        Ss   20:40   0:00 sshd: /usr/sbin/sshd [listener] 0 of 10-100 startups
root      3962  0.0  0.0   5896  2860 ?        R    20:40   0:00 ps aux
pstree---sshd

Note: there are no zombie processes, and pstree shows only one sshd instance.

# Logs from the workflow controller:
kubectl logs -n argo deploy/workflow-controller | grep ${workflow} 

time="2022-08-25T21:30:30.804Z" level=info msg="Processing workflow" namespace=argo workflow=emiss-test-npwx5
time="2022-08-25T21:30:30.808Z" level=info msg="Updated phase  -> Running" namespace=argo workflow=emiss-test-npwx5
time="2022-08-25T21:30:30.809Z" level=info msg="Pod node emiss-test-npwx5 initialized Pending" namespace=argo workflow=emiss-test-npwx5
time="2022-08-25T21:30:30.832Z" level=info msg="Created pod: emiss-test-npwx5 (emiss-test-npwx5)" namespace=argo workflow=emiss-test-npwx5
time="2022-08-25T21:30:30.832Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=emiss-test-npwx5
time="2022-08-25T21:30:30.832Z" level=info msg=reconcileAgentPod namespace=argo workflow=emiss-test-npwx5
time="2022-08-25T21:30:30.841Z" level=info msg="Workflow update successful" namespace=argo phase=Running resourceVersion=11382 workflow=emiss-test-npwx5
time="2022-08-25T21:30:40.836Z" level=info msg="Processing workflow" namespace=argo workflow=emiss-test-npwx5
time="2022-08-25T21:30:40.836Z" level=info msg="Task-result reconciliation" namespace=argo numObjs=0 workflow=emiss-test-npwx5
time="2022-08-25T21:30:40.836Z" level=info msg="node changed" namespace=argo new.message= new.phase=Running new.progress=0/1 nodeID=emiss-test-npwx5 old.message= old.phase=Pending old.progress=0/1 workflow=emiss-test-npwx5
time="2022-08-25T21:30:40.836Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=emiss-test-npwx5
time="2022-08-25T21:30:40.836Z" level=info msg=reconcileAgentPod namespace=argo workflow=emiss-test-npwx5
time="2022-08-25T21:30:40.849Z" level=info msg="Workflow update successful" namespace=argo phase=Running resourceVersion=11406 workflow=emiss-test-npwx5
time="2022-08-25T21:30:50.829Z" level=info msg="Processing workflow" namespace=argo workflow=emiss-test-npwx5
time="2022-08-25T21:30:50.829Z" level=info msg="Task-result reconciliation" namespace=argo numObjs=0 workflow=emiss-test-npwx5
time="2022-08-25T21:30:50.829Z" level=info msg="node unchanged" namespace=argo nodeID=emiss-test-npwx5 workflow=emiss-test-npwx5
time="2022-08-25T21:30:50.829Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=emiss-test-npwx5
time="2022-08-25T21:30:50.829Z" level=info msg=reconcileAgentPod namespace=argo workflow=emiss-test-npwx5

Seemingly related GitHub issues:

https://github.com/argoproj/argo-workflows/issues/8680
https://github.com/argoproj/argo-workflows/issues/8246
https://github.com/argoproj/argo-workflows/issues/7259


Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.

BhavikaSharma avatar Aug 25 '22 21:08 BhavikaSharma

@alexec Would you like to take a look at this?

terrytangyuan avatar Aug 25 '22 22:08 terrytangyuan

I'm afraid I do not have bandwidth.

alexec avatar Aug 28 '22 23:08 alexec

Question - what is the negative impact of these zombies?

alexec avatar Sep 05 '22 21:09 alexec

One (possibly narrow) example: a process is being restarted, and something monitors the PID of the original process to confirm it has stopped before starting it again. Because the zombie still holds that PID, the process is never restarted.

Or, more simply: if some action X waits for the original process to stop, X may be blocked.
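To make the scenario concrete, here is a minimal sketch of a hypothetical monitor that probes a PID with signal 0, which is one common way to check whether a process has gone away. It is an assumption about how such a monitor might work, not something taken from sshd or Argo, and the names `pid_exists` and `zombie_keeps_pid` are illustrative. It needs a Unix-like system because it uses `os.fork`.

```python
import os
import time
from typing import Tuple


def pid_exists(pid: int) -> bool:
    """Probe a PID with signal 0, which checks existence without sending a signal."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the PID exists but belongs to another user
    return True


def zombie_keeps_pid() -> Tuple[bool, bool]:
    """Return pid_exists() for a dead child before and after it is reaped."""
    pid = os.fork()
    if pid == 0:
        os._exit(0)  # child: exit immediately
    time.sleep(0.2)  # parent: let the child exit, but do not wait() on it yet
    before = pid_exists(pid)  # child is a zombie, its PID is still allocated
    os.waitpid(pid, 0)  # reap the zombie, which releases the PID
    after = pid_exists(pid)
    return before, after


if __name__ == "__main__":
    print(zombie_keeps_pid())
```

On Linux the probe should report the PID as present until the parent reaps the child, so a monitor built this way would keep waiting for as long as the zombie exists.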

BhavikaSharma avatar Sep 06 '22 06:09 BhavikaSharma

There is a loop that reaps 1 zombie per second. Maybe it does reap them, but not soon enough?
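As a rough illustration of what a loop like that could look like, here is a sketch built on a non-blocking `waitpid`. It is not the argoexec implementation (which is written in Go). The names `reap_one` and `reap_loop` and the `should_stop` callback are hypothetical, and the one-second interval is taken only from the comment above.

```python
import os
import time
from typing import Callable, Optional


def reap_one() -> Optional[int]:
    """Reap at most one exited child without blocking; return its PID, or None."""
    try:
        pid, _status = os.waitpid(-1, os.WNOHANG)
    except ChildProcessError:
        return None  # this process has no children at all
    return pid if pid != 0 else None  # 0 means children exist but none has exited


def reap_loop(should_stop: Callable[[], bool], interval: float = 1.0) -> None:
    """Call reap_one() once per interval until should_stop() returns True."""
    while not should_stop():
        reap_one()
        time.sleep(interval)
```

With this shape, a zombie can stay visible for up to one interval, and N zombies take roughly N intervals to clear. A variant that drains every pending zombie on each tick (`while reap_one() is not None: pass`) would shorten that window.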

alexec avatar Sep 06 '22 14:09 alexec

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If this is a mentoring request, please provide an update here. Thank you for your contributions.

stale[bot] avatar Oct 01 '22 17:10 stale[bot]

@sarabala1979 this can be closed. I have confirmed that the zombie processes do eventually get reaped. To confirm, run this workflow, which adds a sleep 120 after the restart:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: emiss-test-
spec:
  entrypoint: first-step
  templates:
    - name: first-step
      metadata:
      container:
        name: main
        image: ubuntu:22.04 
        command: ["bash","-c"]
        args:
           - |
            ps aux
            apt-get update
            apt-get install -y psmisc
            apt-get purge openssh-server openssh-client
            apt-get install -y openssh-server openssh-client
            service ssh start
            ps aux 
            pstree
            service ssh restart
            sleep 120
            ps aux 
            pstree
        resources:
          limits:
            cpu: 200m
            memory: 200Mi

isubasinghe avatar Oct 26 '22 23:10 isubasinghe