Improper container state, stale exec sessions and PID wrapping

Open · mpolden opened this issue 3 years ago · 18 comments

Is this a BUG REPORT or FEATURE REQUEST? (leave only one on its own line)

/kind bug

Description

Unable to stop or remove a Podman container because of stale exec sessions. The PID associated with the exec session is reused due to PID wrapping in the root PID namespace, causing Podman to believe the exec session is still alive.

Steps to reproduce the issue:

  1. We have a running Podman container.

  2. We execute podman exec periodically in the container, approximately once every minute. Our use-case for this is collecting metrics.

  3. When running podman exec, Podman stores an "exec session" containing metadata (e.g. PID of the spawned process) in its state database.

  4. In some cases we can end up with stale exec sessions, e.g. podman exec can be killed due to a timeout and is then unable to clean up the exec session.

  5. The container then accumulates stale exec sessions in its database (the ExecIDs array in podman inspect <ctr> grows), referring to processes that are no longer running.

  6. When stopping or removing a container, Podman checks whether the PIDs of any exec sessions still refer to running processes. It retrieves the PIDs from its state database in /data/containers/graph/libpod/bolt_state.db.

  7. However, the persisted PID may now refer to a different running process, because PIDs can wrap around after reaching the limit specified in /proc/sys/kernel/pid_max.

  8. When the PID persisted in the exec session is reused by another process, Podman believes the exec session is still active and the container can no longer be stopped or removed. (A minimal sketch of this failure mode follows this list.)

  9. Max current PID and limit on a system where we're triggering this bug:

    $ ps aux | awk '{print $2}' | sort -rn | head -1
    399578
    $ cat /proc/sys/kernel/pid_max
    409600
    
  10. But why are we reaching the limit so quickly? (We see the "improper state" issue every other day)

    My guess is that this is due to PID namespaces. A container has its own PID namespace which maps into the root PID namespace, and the number of possible PIDs in this namespace is smaller than the total number of possible PIDs on the host.

    The PID in the exec session is the PID in the root namespace.
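
To make the failure mode concrete, below is a minimal, generic C sketch (not Podman's actual code) of a liveness check that relies only on a stored PID. Once pid_max wraps, kill(pid, 0) succeeds for whichever process currently owns that PID, so a stale exec session looks alive:

```c
/* Generic sketch, not Podman's code: a PID-only liveness check cannot tell
 * the original process from an unrelated one that received the same PID
 * after pid_max wrapped around. */
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

static int pid_in_use(pid_t pid)
{
    /* Signal 0 delivers nothing; it only checks existence/permissions. */
    if (kill(pid, 0) == 0)
        return 1;              /* some process owns this PID right now */
    return errno != ESRCH;     /* EPERM also means the PID is in use   */
}

int main(int argc, char **argv)
{
    /* 399578 is just an illustrative PID, like the one in the report. */
    pid_t pid = (argc > 1) ? (pid_t)atoi(argv[1]) : 399578;

    if (pid_in_use(pid))
        printf("PID %d is in use - but not necessarily by the exec session\n",
               (int)pid);
    else
        printf("PID %d is free\n", (int)pid);
    return 0;
}
```

A check that also records and compares the process start time (e.g. the starttime field in /proc/<pid>/stat) would disambiguate a reused PID from the original exec process.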

Describe the results you received:

Depending on when the PID reuse happens, either stop or rm fails.

podman stop <ctr> fails with:

Error: container 5c5925673e244190340d1af86cb2bb2d9438691e9a48e883d77fedf09d87222a has active exec sessions, refusing to clean up: container state improper

podman rm <ctr> fails with:

Error: cannot remove container 86795917878f6131ca98b45a5e7a87b32fdb9121a4359547b6b007199d115b99 as it has active exec sessions: container state improper

Describe the results you expected:

That podman stop and podman rm succeed.

Additional information you deem important (e.g. issue happens only occasionally):

If the container cannot be removed, the only solution is to restart it first with podman restart <ctr> && podman stop <ctr> && podman rm <ctr>.

Output of podman version:

Version:      2.2.1
API Version:  2
Go Version:   go1.14.7
Built:        Mon Feb  8 21:19:06 2021
OS/Arch:      linux/amd64

Output of podman info --debug:

host:
  arch: amd64
  buildahVersion: 1.18.0
  cgroupManager: systemd
  cgroupVersion: v1
  conmon:
    package: conmon-2.0.22-3.module+el8.3.1+9857+68fb1526.x86_64
    path: /usr/bin/conmon
    version: 'conmon version 2.0.22, commit: a40e3092dbe499ea1d85ab339caea023b74829b9'
  cpus: 24
  distribution:
    distribution: '"rhel"'
    version: "8.3"
  eventLogger: file
  hostname: <snip>
  idMappings:
    gidmap: null
    uidmap: null
  kernel: 4.18.0-240.15.1.el8_3.x86_64
  linkmode: dynamic
  memFree: 267927552
  memTotal: 25018028032
  ociRuntime:
    name: runc
    package: runc-1.0.0-70.rc92.module+el8.3.1+9857+68fb1526.x86_64
    path: /usr/bin/runc
    version: 'runc version spec: 1.0.2-dev'
  os: linux
  remoteSocket:
    path: /run/podman/podman.sock
  rootless: false
  slirp4netns:
    executable: ""
    package: ""
    version: ""
  swapFree: 0
  swapTotal: 0
  uptime: 366h 4m 16.74s (Approximately 15.25 days)
registries:
  search:
  - registry.access.redhat.com
  - registry.redhat.io
  - docker.io
store:
  configFile: /etc/containers/storage.conf
  containerStore:
    number: 1
    paused: 0
    running: 1
    stopped: 0
  graphDriverName: overlay
  graphOptions:
    overlay.mountopt: nodev,metacopy=on
  graphRoot: /data/containers/graph
  graphStatus:
    Backing Filesystem: xfs
    Native Overlay Diff: "false"
    Supports d_type: "true"
    Using metacopy: "true"
  imageStore:
    number: 1
  runRoot: /data/containers/run
  volumePath: /data/containers/graph/volumes
version:
  APIVersion: "2"
  Built: 1612819146
  BuiltTime: Mon Feb  8 21:19:06 2021
  GitCommit: ""
  GoVersion: go1.14.7
  OsArch: linux/amd64
  Version: 2.2.1

Package info (e.g. output of rpm -q podman or apt list podman):

podman-2.2.1-7.module+el8.3.1+9857+68fb1526.x86_64

Have you tested with the latest version of Podman and have you checked the Podman Troubleshooting Guide? (https://github.com/containers/podman/blob/master/troubleshooting.md)

Yes. (We're running the latest version available in RHEL 8.3)

Additional environment details (AWS, VirtualBox, physical, etc.):

Physical.

mpolden avatar May 12 '21 13:05 mpolden

I'm a lot more concerned with how you're getting these stale sessions in the first place. Each exec session should be spawning a cleanup process on exit, which takes care of updating the state and cleaning up residual files.

I don't see how those could be killed by timeouts, since they spawn immediately after exit. Is something possibly killing the exec session's Conmon process, preventing the spawn of the cleanup session? That would also be a serious issue.

Solving this seems to be a lot easier than rewriting session tracking to allow us to determine if PIDs wrapped.

mheon avatar May 12 '21 13:05 mheon

Yes, we'll look into that. Our host management system calls podman exec with a timeout, where the process is terminated after a given duration.

It's terminated gracefully (SIGTERM), but podman exec still doesn't seem to properly clean up?

I did a quick test to verify this. I have a container called test running:

$ sudo podman exec --user root test sleep 1800 &
$ sudo podman inspect test | jq '.[].ExecIDs'
[
  "f1a5df8bc81665faa0a40b77d6acd35291d8a83e3389937242b4bf5f629ead2f"
]
$ ps aux|grep 'podman exec'
root     204407  0.0  0.0 1477452 23536 pts/0   Sl+  13:51   0:00 sudo podman exec --user root test sleep 1800
root     204414  0.1  0.2 2131868 67408 pts/0   Sl+  13:51   0:00 podman exec --user root test sleep 1800
mpolden  204906  0.0  0.0 221904  1096 pts/1    S+   13:52   0:00 grep --color=auto podman exec
$ sudo kill 204414
$ ps aux|grep 'podman exec'
mpolden  205992  0.0  0.0 221904   992 pts/1    S+   13:54   0:00 grep --color=auto podman exe
$ sudo podman inspect test | jq '.[].ExecIDs'
[
  "f1a5df8bc81665faa0a40b77d6acd35291d8a83e3389937242b4bf5f629ead2f"
]

My expectation is that there would be a signal handler that does the necessary cleanup on TERM.

mpolden avatar May 12 '21 13:05 mpolden

Also, I seem to recall from reading through the code that stop or rm -f may end up sending TERM or KILL to "active" exec sessions, which seems dangerous if the PIDs have been reused...

mpolden avatar May 12 '21 14:05 mpolden

You're just killing the Podman process there - the actual exec session has double-forked and is running independently. I suspect you'll see that there is still a sleep running inside the container in the case you describe. Once an exec session is running, the Podman frontend is just performing attach work, printing to the terminal and reading input. The actual exec process and its Conmon parent are independent.
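
If it helps, a minimal, generic double-fork sketch (not the actual Podman/conmon code) shows why killing the original client leaves the worker running - the grandchild is reparented and is no longer tied to the process you killed:

```c
/* Generic double-fork sketch, not Podman/conmon code: the grandchild is
 * reparented (to init/systemd), so killing the original client afterwards
 * does not touch it - analogous to killing `podman exec` while the exec'd
 * process keeps running under its conmon parent. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    pid_t child = fork();
    if (child < 0) {
        perror("fork");
        return 1;
    }
    if (child == 0) {                  /* first child */
        setsid();                      /* detach from the caller's session */
        pid_t grandchild = fork();
        if (grandchild < 0)
            _exit(1);
        if (grandchild == 0) {         /* detached worker */
            execlp("sleep", "sleep", "600", (char *)NULL);
            _exit(127);                /* only reached if exec fails */
        }
        _exit(0);                      /* first child exits immediately */
    }
    waitpid(child, NULL, 0);           /* reap the first child */
    printf("the sleep now runs independently of this process\n");
    return 0;
}
```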

mheon avatar May 12 '21 14:05 mheon

Once an exec session is running, the Podman frontend is just performing attach work, printing to the terminal and reading input. The actual exec process and its Conmon parent are independent.

Right, so the podman exec process does the cleanup? This does not seem to happen if the process receives TERM or INT. I assume this is because the signal is not forwarded to conmon, so any spawned processes may still be running.

Our integration expects that podman exec terminates any spawned processes gracefully on TERM (I think this is what docker exec does), but maybe calling runc exec directly is better for our use-case.

I'm not entirely clear on the purpose of exec sessions. Do they exist just to keep track of the processes that might need to be killed when the container is stopped/removed?

mpolden avatar May 12 '21 15:05 mpolden

The Podman process does not do the cleanup - Conmon spawns a separate Podman process on completion of the exec session (when the first PID in it exits), which handles cleanup.

The expectation that podman exec cleans up the exec process on termination is definitely not accurate today. I'll need to investigate this when I can get access to a proper computer again (tomorrow) to see if Docker does so - if so, this is a Podman bug we'll have to fix.

The exec sessions are an artifact of how the Docker API, which we now support, works - each session tracks a single exec'd process, providing a handle by which it can be managed (stopped, inspected, etc). We don't strictly need them to kill the container (eliminating the first process in the PID namespace is enough to let the kernel do the rest for us), though I believe we go through and stop them all first anyways.
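
For reference, each session corresponds to an exec handle in that API. A rough libcurl sketch, under the assumption that Podman's Docker-compatible API service is listening on the podman.sock path shown in the report and a container named test exists, might create such a handle like this:

```c
/* Rough sketch under assumptions: Podman's Docker-compatible API service is
 * running on /run/podman/podman.sock and a container named "test" exists.
 * POST /containers/{name}/exec creates an exec session; the JSON response
 * (containing its Id) is written to stdout by libcurl's default handler. */
#include <curl/curl.h>
#include <stdio.h>

int main(void)
{
    CURL *curl = curl_easy_init();
    if (!curl)
        return 1;

    struct curl_slist *hdrs =
        curl_slist_append(NULL, "Content-Type: application/json");
    const char *body = "{\"Cmd\": [\"sleep\", \"600\"]}";

    curl_easy_setopt(curl, CURLOPT_UNIX_SOCKET_PATH, "/run/podman/podman.sock");
    curl_easy_setopt(curl, CURLOPT_URL, "http://localhost/containers/test/exec");
    curl_easy_setopt(curl, CURLOPT_HTTPHEADER, hdrs);
    curl_easy_setopt(curl, CURLOPT_POSTFIELDS, body);

    CURLcode rc = curl_easy_perform(curl);
    if (rc != CURLE_OK)
        fprintf(stderr, "request failed: %s\n", curl_easy_strerror(rc));

    curl_slist_free_all(hdrs);
    curl_easy_cleanup(curl);
    return rc == CURLE_OK ? 0 : 1;
}
```

The returned Id is the handle the session represents; it can later be started or inspected through the corresponding /exec/<id> endpoints.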

mheon avatar May 12 '21 15:05 mheon

I did some more testing. The conmon process doesn't appear to clean up the session when killed either:

Start container and an exec session:

$ podman start test
test
$ podman inspect test | jq '.[].ExecIDs'
[]
$ podman exec test sleep 600
(C-c)
$ podman inspect test | jq '.[].ExecIDs'
[
  "9cd47a971a0b8cf73a3ba98099598e0212b7ab3c2bbd2cb523e782ec1de53f33"
]

Identify and kill conmon:

$ ps -ef |grep 9cd47a971a0b8cf73a3ba98099598e0212b7ab3c2bbd2cb523e782ec1de53f33
root     128711      1  0 08:55 ?        00:00:00 /usr/bin/conmon --api-version 1 -c d1769fe40619a4d2eca6f35a74a28f4138c3663c6a4c6e51128fb25a8e499122 -u 9cd47a971a0b8cf73a3ba98099598e0212b7ab3c2bbd2cb523e782ec1de53f33 -r /usr/bin/runc -b /data/containers/graph/overlay-containers/d1769fe40619a4d2eca6f35a74a28f4138c3663c6a4c6e51128fb25a8e499122/userdata/9cd47a971a0b8cf73a3ba98099598e0212b7ab3c2bbd2cb523e782ec1de53f33 -p /data/containers/graph/overlay-containers/d1769fe40619a4d2eca6f35a74a28f4138c3663c6a4c6e51128fb25a8e499122/userdata/9cd47a971a0b8cf73a3ba98099598e0212b7ab3c2bbd2cb523e782ec1de53f33/exec_pid -n test --exit-dir /data/containers/graph/overlay-containers/d1769fe40619a4d2eca6f35a74a28f4138c3663c6a4c6e51128fb25a8e499122/userdata/9cd47a971a0b8cf73a3ba98099598e0212b7ab3c2bbd2cb523e782ec1de53f33/exit --socket-dir-path /var/run/libpod/socket -s -l none --log-level error --runtime-arg --log-format=json --runtime-arg --log --runtime-arg=/data/containers/graph/overlay-containers/d1769fe40619a4d2eca6f35a74a28f4138c3663c6a4c6e51128fb25a8e499122/userdata/9cd47a971a0b8cf73a3ba98099598e0212b7ab3c2bbd2cb523e782ec1de53f33/oci-log -e --exec-attach --exec-process-spec /data/containers/graph/overlay-containers/d1769fe40619a4d2eca6f35a74a28f4138c3663c6a4c6e51128fb25a8e499122/userdata/9cd47a971a0b8cf73a3ba98099598e0212b7ab3c2bbd2cb523e782ec1de53f33/exec-process-613457679
$ ps -ef | grep 128711
root     128722 128711  0 08:55 ?        00:00:00 sleep 600
$ kill 128711
$ ps -ef |grep 'sleep 600'

Exec session remains:

$ podman inspect test | jq '.[].ExecIDs'
[
  "9cd47a971a0b8cf73a3ba98099598e0212b7ab3c2bbd2cb523e782ec1de53f33"
]

mpolden avatar May 14 '21 09:05 mpolden

Please open this as an issue with conmon.

rhatdan avatar May 14 '21 13:05 rhatdan

@haircommander @giuseppe @mrunalp WDYT?

rhatdan avatar May 14 '21 13:05 rhatdan

I don't think Conmon spawns the cleanup process when it's hit by SIGTERM. This sounds like something we should fix.

mheon avatar May 14 '21 13:05 mheon

@mpolden I have transferred this to the conmon repo and attempted a fix in https://github.com/containers/conmon/pull/261 - can you give it a try?

haircommander avatar May 14 '21 18:05 haircommander

Sure, is there a build I can try?

mpolden avatar May 18 '21 07:05 mpolden

Thanks! You can download the binary from https://cirrus-ci.com/task/5425328676929536

haircommander avatar May 18 '21 13:05 haircommander

I still get the same behaviour as in https://github.com/containers/conmon/issues/260#issuecomment-841115969 when using the conmon binary from the linked build:

$ sha256sum /usr/bin/conmon
3d28413ff3f909781386882cd881c31283e3f0fea6965759d04bb8b23e6c5aff  /usr/bin/conmon

As far as I can tell, the only handler registered with atexit by default is reap_children.
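
That fits the behaviour observed above: atexit handlers only run on normal return or an explicit exit(), not when a fatal signal's default action kills the process. A generic sketch of the difference (not conmon's code):

```c
/* Generic sketch, not conmon's code: atexit() handlers run on normal exit
 * or on exit(), but not when a fatal signal's default action terminates
 * the process. Converting the signal into exit() makes the cleanup run.
 * (Calling exit() from a handler is not strictly async-signal-safe; real
 * code usually sets a flag or does minimal work before _exit().) */
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static void cleanup(void)
{
    /* e.g. reap children or run an exit command */
    write(STDERR_FILENO, "cleanup ran\n", 12);
}

static void on_term(int sig)
{
    (void)sig;
    exit(128 + SIGTERM);      /* route the signal through the atexit path */
}

int main(void)
{
    atexit(cleanup);
    signal(SIGTERM, on_term); /* without this, SIGTERM skips cleanup() */
    pause();                  /* wait until a signal arrives */
    return 0;
}
```

In other words, unless the signal is converted into the normal exit path, neither reap_children nor any exit command will run.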

I'm not sure how the Podman cleanup of the exec session happens, but I see that do_exit_command is only registered if --exit-command is given. However, conmon is not spawned with --exit-command:

/usr/bin/conmon --api-version 1 -c bd61a8708c95fad0724857746833c691a2195f7af7362d0ffcf5b4d7cf072d29 -u ba6318906d400fd4c4c382e7cd1aa4ad9e1820bda69912c90ec2a58008f0a861 -r /usr/bin/runc -b /data/containers/graph/overlay-containers/bd61a8708c95fad0724857746833c691a2195f7af7362d0ffcf5b4d7cf072d29/userdata/ba6318906d400fd4c4c382e7cd1aa4ad9e1820bda69912c90ec2a58008f0a861 -p /data/containers/graph/overlay-containers/bd61a8708c95fad0724857746833c691a2195f7af7362d0ffcf5b4d7cf072d29/userdata/ba6318906d400fd4c4c382e7cd1aa4ad9e1820bda69912c90ec2a58008f0a861/exec_pid -n test --exit-dir /data/containers/graph/overlay-containers/bd61a8708c95fad0724857746833c691a2195f7af7362d0ffcf5b4d7cf072d29/userdata/ba6318906d400fd4c4c382e7cd1aa4ad9e1820bda69912c90ec2a58008f0a861/exit --socket-dir-path /var/run/libpod/socket -s -l none --log-level error --runtime-arg --log-format=json --runtime-arg --log --runtime-arg=/data/containers/graph/overlay-containers/bd61a8708c95fad0724857746833c691a2195f7af7362d0ffcf5b4d7cf072d29/userdata/ba6318906d400fd4c4c382e7cd1aa4ad9e1820bda69912c90ec2a58008f0a861/oci-log -e --exec-attach --exec-process-spec /data/containers/graph/overlay-containers/bd61a8708c95fad0724857746833c691a2195f7af7362d0ffcf5b4d7cf072d29/userdata/ba6318906d400fd4c4c382e7cd1aa4ad9e1820bda69912c90ec2a58008f0a861/exec-process-538401836

mpolden avatar May 19 '21 07:05 mpolden

Hm, interesting. @mheon shouldn't there be an exit command to clean up the exec session?

haircommander avatar May 19 '21 14:05 haircommander

Hm. We may only be doing it for detached exec sessions. Finally have time to dig into this one, I'll go looking.

mheon avatar May 19 '21 15:05 mheon

I was right - we are, in fact, only doing exit commands for detached exec. Fortunately, this should be an easy fix.

mheon avatar May 19 '21 18:05 mheon

https://github.com/containers/podman/pull/10405 should fix that bit.

mheon avatar May 19 '21 18:05 mheon