
Wait container cannot exit if main container is OOMKilled

Open · tianshimoyi opened this issue 2 years ago • 18 comments

What happened/what you expected to happen?

The wait container cannot exit if the main container is OOMKilled.

Use case: start a workflow, let the main container OOM, and observe that the wait container does not exit.

[Screenshot 2022-06-30 10:47 AM]

[Screenshot 2022-06-30 10:48 AM]

Reason: I read the source code and found that the wait container decides whether to exit by checking whether the /var/run/argo/ctr/main/exitcode file exists, and the main container's emissary creates that file in a defer in the NewEmissaryCommand function. But when the main container is OOMKilled, its process dies immediately, the defer function is never called, and the wait container never exits.
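To illustrate the handshake, here is a minimal sketch (not the actual argo code; only the file path above is real, the helper names are mine):

package main

import (
	"fmt"
	"os"
	"os/exec"
	"path/filepath"
	"time"
)

// Path taken from this issue; in a real pod it sits on a volume shared by the
// main and wait containers.
const exitCodePath = "/var/run/argo/ctr/main/exitcode"

// runMain mimics the emissary in the main container: run the user command and
// write the exit code from a defer. If this process is OOMKilled, the defer
// never runs and the file is never written.
func runMain(cmd *exec.Cmd) error {
	exitCode := "1"
	defer func() { _ = os.WriteFile(exitCodePath, []byte(exitCode), 0o644) }()
	err := cmd.Run()
	if cmd.ProcessState != nil {
		exitCode = fmt.Sprint(cmd.ProcessState.ExitCode())
	}
	return err
}

// waitForMain mimics the wait container: it only returns once the file appears.
func waitForMain() string {
	for {
		if data, err := os.ReadFile(exitCodePath); err == nil {
			return string(data)
		}
		time.Sleep(time.Second)
	}
}

func main() {
	_ = os.MkdirAll(filepath.Dir(exitCodePath), 0o755) // needs write access to /var/run/argo
	go func() { _ = runMain(exec.Command("sh", "-c", "exit 3")) }()
	fmt.Println("main exit code:", waitForMain())
}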

What version are you running? v3.3.5

Diagnostics

Paste the smallest workflow that reproduces the bug. We must be able to run the workflow.

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  name: dag-test2
  namespace: icdc-test
spec:
  arguments: {}
  entrypoint: scheduler-alpha-entrypoint
  podSpecPatch: '{"terminationGracePeriodSeconds": 0, "enableServiceLinks": false}'
  serviceAccountName: argo-workflow
  templates:
    - name: A
      container:
        image: ubuntu
        resources:
          requests:
            cpu: 1
            memory: 10Ki
          limits:
            cpu: 1
            memory: 50Mi
        command: ["/bin/bash"]
        args: ["-c","for x in {1..200}; do echo 'Round $x'; bash -c 'for b in {0..99999999}; do a=$b$a; done'; done"]
    - name: scheduler-alpha-entrypoint
      dag:
        tasks:
          - arguments: { }
            name: C
            template: A
      inputs: {}
      metadata: {}
      outputs: {}

# Logs from the workflow controller:
kubectl logs -n argo deploy/workflow-controller | grep ${workflow} 

# If the workflow's pods have not been created, you can skip the rest of the diagnostics.

# The workflow's pods that are problematic:
kubectl get pod -o yaml -l workflows.argoproj.io/workflow=${workflow},workflow.argoproj.io/phase!=Succeeded

# Logs from in your workflow's wait container, something like:
kubectl logs -c wait -l workflows.argoproj.io/workflow=${workflow},workflow.argoproj.io/phase!=Succeeded

Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.

tianshimoyi, Jun 30 '22

@alexec Could you take a look?

terrytangyuan, Jun 30 '22

This should not happen in v3.4. If the controller sees that the main container has exited, then it sends a SIGTERM to the wait container and it should exit.

Can you try with latest?

If latest works, then we could look at backporting the fixes. My worry would be that the fixes are involved, so backporting might be tricky.

I'm not sure when the team plans to release v3.4.

alexec, Jun 30 '22

@alexec Sorry, I tried the latest version and it still doesn't work!

The code for this part is the same in v3.3.5 and v3.3.8, and I don't think it handles this case; I tried the latest version and it does not work.

[Screenshot 2022-06-30 2:26 PM]

tianshimoyi, Jun 30 '22

@sarabala1979 this is a blocking bug for v3.4. You can't release until this is fixed.

alexec, Jun 30 '22

@alexec Are you going to fix it?

sarabala1979, Jun 30 '22

No. I'm about to go OOO for two weeks, and I also don't have a mandate for bug fixes now. The core team will need to address them. In the meantime, the team can make sure they can repro the issue. If you can do that today or tomorrow, then we can get on a 30-minute Zoom debug session and see if we can diagnose the issue.

alexec, Jun 30 '22

@alexec will this code fix this issue? https://github.com/argoproj/argo-workflows/blob/0b2fe2face955f31941496c1a99c13e1b12909d9/workflow/controller/operator.go#L1188-L1192

sarabala1979, Jun 30 '22

I am not able to reproduce this issue on my local.

controller | ******* Running
controller | ***** checking cleanupPod dag-test2-a-4233667491
controller | ***** main &ContainerStateTerminated{ExitCode:137,Signal:0,Reason:OOMKilled,Message:,StartedAt:2022-06-30 16:28:16 -0700 PDT,FinishedAt:2022-06-30 16:28:17 -0700 PDT,ContainerID:containerd://d8942963603d4e01a1e489432a4361f9b629dbb706d32b63872ed49736800927,}
controller | ***** wait nil
controller | ***** cleanupPod dag-test2-a-4233667491

The code is calling cleanupPod to terminate the running container.

[screenshot]
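Roughly, the check being discussed has this shape (a sketch only, not the real operator.go code; the pod types are from k8s.io/api/core/v1 and the cleanup call is a placeholder):

package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// mainExitedButWaitRunning reports whether the main container has terminated
// while the wait container has not, which is exactly the stuck state in this issue.
func mainExitedButWaitRunning(pod *corev1.Pod) bool {
	var mainDone, waitRunning bool
	for _, s := range pod.Status.ContainerStatuses {
		switch s.Name {
		case "main":
			mainDone = s.State.Terminated != nil
		case "wait":
			waitRunning = s.State.Terminated == nil
		}
	}
	return mainDone && waitRunning
}

func main() {
	// Fake pod mirroring the log above: main OOMKilled (137), wait still running.
	pod := &corev1.Pod{
		Status: corev1.PodStatus{
			ContainerStatuses: []corev1.ContainerStatus{
				{Name: "main", State: corev1.ContainerState{Terminated: &corev1.ContainerStateTerminated{ExitCode: 137, Reason: "OOMKilled"}}},
				{Name: "wait", State: corev1.ContainerState{Running: &corev1.ContainerStateRunning{}}},
			},
		},
	}
	if mainExitedButWaitRunning(pod) {
		fmt.Println("would call cleanupPod / signal the wait container here") // placeholder action
	}
}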

sarabala1979, Jun 30 '22

@tianshimoyi are you seeing "cleaning up pod" in your log once the main container exits?

sarabala1979, Jul 01 '22

Sorry, I didn't see this log. You can make limits.memory smaller to reproduce it. And I think in the following code, if we detect that the main container has terminated, we should clean up the pod. [Screenshot 2022-07-01 12:34 PM]

tianshimoyi, Jul 01 '22

@sarabala1979 What is your ContainerRuntimeExecutor? Mine is the default emissary.

tianshimoyi, Jul 01 '22

I am also using emissary

sarabala1979, Jul 07 '22

@sarabala1979 @alexec I have found the problem. When oom_kill is triggered and the argoexec process has a larger RSS than its child process, the argoexec process is the one that gets killed, so the defer function is never called, the /var/run/argo/ctr/main/exitcode file is never created, the wait container cannot exit, and the pod stays in the Running state.

With enough memory allocated, the child process is the one killed: [Screenshot 2022-07-06 4:11 PM]

With less memory allocated, the parent process is the one killed: [Screenshot 2022-07-06 4:11 PM]

tianshimoyi, Jul 08 '22

Ok. So the controller presumably does not catch this. I have a solution that might work.

Every second, a Goroutine in the emissary in the main container touches a file named /var/run/argo/ctr/main/alive to indicate it is alive. In the same wait-container loop that checks for the exitcode file, add a check to see if the alive file has been touched recently (e.g. in the last 10s); if not, we assume that the main container is no longer alive, and exit.
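A rough sketch of what I mean (hypothetical code, nothing here exists in argo today; the path and the 10s window come from the description above):

package main

import (
	"fmt"
	"os"
	"time"
)

const alivePath = "/var/run/argo/ctr/main/alive"

// touchAlive would run as a goroutine in the emissary (main container),
// refreshing the file's mtime every second while the process is alive.
func touchAlive(stop <-chan struct{}) {
	ticker := time.NewTicker(time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-stop:
			return
		case <-ticker.C:
			_ = os.WriteFile(alivePath, nil, 0o644) // create/refresh; errors ignored in this sketch
		}
	}
}

// mainLooksDead would be called from the wait container's existing polling loop:
// if the alive file exists but has not been touched for 10s, assume the main
// container died without writing the exitcode file.
func mainLooksDead() bool {
	info, err := os.Stat(alivePath)
	if err != nil {
		return false // not created yet: main may simply not have started
	}
	return time.Since(info.ModTime()) > 10*time.Second
}

func main() {
	stop := make(chan struct{})
	go touchAlive(stop)
	time.Sleep(3 * time.Second)
	close(stop)
	fmt.Println("main looks dead:", mainLooksDead())
}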

I think this will be quite robust.

Would anyone like to volunteer to fix this?

alexec, Jul 08 '22

I don't think this method solves it, because you can't control when the main container starts. What if it takes an hour for the main container to start? I think the best fix is in the controller: if the main container has failed but the wait container has not exited, send a signal to the wait container, and the wait container runs its exit logic when it receives that signal. What do you think?


tianshimoyi, Jul 08 '22

That should already be the logic on :latest. You shared a block of code 7 days ago, but the block you shared was from v3, not from master.

alexec, Jul 08 '22

This is really cool, I'll check out the master branch

tianshimoyi, Jul 08 '22

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If this is a mentoring request, please provide an update here. Thank you for your contributions.

stale[bot], Jul 31 '22

I am experiencing the same on v3.3.8

michaelmohamed, Aug 08 '22

This is really cool, I'll check out the master branch

@tianshimoyi Did the master branch resolve this or no?

juliev0, Sep 06 '22

I've been too busy recently; the company has other work to push forward. I'll find time to test it soon. If you test it and get results, please let me know as soon as possible.


tianshimoyi, Sep 06 '22

@alexec @tianshimoyi If I'm understanding this correctly, the logic in 3.4 checks to see if the main container has exited. But will the main container not have exited if the argoexec process has died but the child process is still alive? Is this the issue?

juliev0, Sep 06 '22

@juliev0 The problem is that the parent process in the main container is killed by the OOM killer, so the defer function is never called and the file is never created. The wait container only exits once it detects that file, so the main container has exited but the wait container has not, and the pod is still Running.

tianshimoyi, Sep 07 '22

@juliev0 The problem is that the parent process in the main container is killed by the OOM killer, so the defer function is never called and the file is never created. The wait container only exits once it detects that file, so the main container has exited but the wait container has not, and the pod is still Running.

Right, sorry, I was trying to figure out why Alex had written "So the controller presumably does not catch this" despite his having added this controller check to latest (which has been incorporated into the 3.4 RCs): "If the controller sees that the main container has exited, then it sends a SIGTERM to the wait container and it should exit." But it sounds like we're thinking that latest/3.4 RC hasn't yet been tested for this, right? So, hopefully it does work.

juliev0, Sep 07 '22

I'm unable to reproduce.

I've run the workflows attached to this issue on Docker Desktop. I've also run this workflow:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: oom-kill-
spec:
  entrypoint: fork-bomb
  templates:
  - name: fork-bomb
    container:
      image: debian:9.5-slim
      command: [bash, -c]
      args: [":(){ : $@$@;};: :"]
      resources:
        limits:
          memory: 8M

The pod exits with OOMKilled. The wait container is terminated.

NAME                           READY   STATUS      RESTARTS   AGE
dag-test2-9lw6r-a-3497416876   0/2     OOMKilled   0          52s

alexec, Sep 07 '22

@alexec To reproduce this problem reliably, you need to ensure that when oom_kill is triggered, the RSS of the argoexec process is larger than that of the child process.

tianshimoyi, Sep 07 '22

@alexec To reproduce this problem reliably, you need to ensure that when oom_kill is triggered, the RSS of the argoexec process is larger than that of the child process.

So, what you're saying is that when testing you need to ensure that the argoexec process dies first and cannot write the file. And you suspect that in Alex's test the subprocess may have run out of memory first, enabling argoexec to write the file before it ran out of memory itself? (Or we don't know either way.)

What is the recommended way to reproduce this such that the rss of the argoexec process is more than that of the child process? Is that what your example Workflow represents?

In any case, it sounds like Alex has enabled the Controller to detect OOM and kill the Pod, so hopefully testing with latest will show that that works in the case of argoexec dying before the subprocess.

juliev0, Sep 07 '22

@alexec @juliev0 Please test with v3.3.8. Here are my test YAML and my test program:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  name: dag-test-1
  namespace: icdc-test
spec:
  arguments: {}
  entrypoint: scheduler-alpha-entrypoint
  podSpecPatch: '{"terminationGracePeriodSeconds": 0, "enableServiceLinks": false}'
  serviceAccountName: argo-workflow
  templates:
    - name: A
      container:
        env:
          - name: "Name"
            value: "lc"
        image: tianshimoyi/oom-test:v1
        imagePullPolicy: Always
        resources:
          requests:
            cpu: 1
            memory: 10Ki
          limits:
            cpu: 1
            memory: 50Mi
        command: ["/bin/bash"]
        args: ["-c","/oom 2000"]
    - name: scheduler-alpha-entrypoint
      dag:
        tasks:
          - arguments: { }
            name: C
            template: A
      inputs: {}
      metadata: {}
      outputs: {}

#include <stdio.h>
#include <malloc.h>
#include <string.h>
#include <unistd.h>
#include <stdlib.h>

#define BLOCK_SIZE (10*1024*1024)

int main(int argc, char **argv)
{

        int thr, i;
        char *p1;

        if (argc != 2) {
                printf("Usage: mem_alloc <num (MB)>\n");
                exit(0);
        }

        thr = atoi(argv[1]);

        printf("Allocating," "set to %d Mbytes\n", thr);
        sleep(10);
        for (i = 0; i < thr; i++) {
                p1 = malloc(BLOCK_SIZE);
                memset(p1, 0x00, BLOCK_SIZE);
        }

        sleep(600);

        return 0;
}

tianshimoyi, Sep 08 '22

Okay, I guess if we can reproduce it with 3.3.8, then hopefully we can show it goes away in 3.4 RC 3.

juliev0, Sep 08 '22

Good news. I think we can close this.

I was trying to reproduce your issue in 3.3.8 and show that it goes away with this change made in latest/3.4 release candidates. But I wasn't actually able to reproduce it with your docker image. When I ran your docker image I think the subprocess must have run out of memory first, since the main container did in fact write the file, which was read by the wait container.

So, assuming that your theory was that the argoexec process needs to die first, I decided to modify argoexec to fill up memory itself rather than even starting the subprocess. I did this both with 3.3.8 and latest. In the former, the wait container stayed in a "Running" state. In the latter the Controller terminated the wait container after noticing the main container was done.
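Something along these lines (a hypothetical sketch, not the exact patch) run early in argoexec, before the subprocess is spawned, is enough to make argoexec itself the OOM victim:

package main

func main() {
	// Hypothetical sketch only: allocate and touch memory until the kernel
	// OOM-kills this process, making argoexec (not the subprocess) the victim.
	var held [][]byte
	for {
		b := make([]byte, 10<<20) // 10 MiB per step
		for i := range b {
			b[i] = 1 // touch every byte so RSS really grows
		}
		held = append(held, b)
	}
}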

juliev0, Sep 08 '22