User service appears to crash, but is still running

Open BrentonPoke opened this issue 2 years ago • 5 comments

Cockpit version: 260
Cockpit-podman version: 39
Podman version: 3.4.4
OS: Fedora 35

Steps to reproduce

  1. Start a podman pod group through the UI.
  2. Bring it down with podman-compose down or the UI.
  3. Repeat steps 1-2 until bored.
  4. The user service appears to crash.

I'm not sure whose fault it is, but sometimes the service is actually still running. Other times, I have to restart the system. Cockpit-Podman looks like this right now, though the user service is still running according to systemctl:

[Screenshot from 2022-01-17 17-48-22]
[Screenshot from 2022-01-17 17-13-24]

BrentonPoke, Jan 17 '22 22:01

I managed to reproduce it with, for example:

podman play kube wordpress.yml

wordpress.yml

apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: "2021-11-24T16:41:57Z"
  labels:
    app: project
  name: project
spec:
  containers:
  - args:
    - mariadbd
    image: docker.io/library/mariadb:latest
    name: wordpress-db
    ports:
    - containerPort: 80
      hostPort: 8080
    resources: {}
    securityContext:
      capabilities:
        drop:
        - CAP_MKNOD
        - CAP_NET_RAW
        - CAP_AUDIT_WRITE
    volumeMounts:
    - mountPath: /var/lib/mysql
      name: b3acddf0b1f450b08f4b9a022b00700f2123aa8e0ac1a303eb5519e80c1a91ce-pvc
  - args:
    - apache2-foreground
    image: docker.io/library/wordpress:latest
    name: wordpress-web
    resources: {}
    securityContext:
      capabilities:
        drop:
        - CAP_MKNOD
        - CAP_NET_RAW
        - CAP_AUDIT_WRITE
    volumeMounts:
    - mountPath: /var/www/html
      name: 57ed032976161677987ff2e49bf7536383ab225aa47a418f4aa37ed68d25bb0a-pvc
  restartPolicy: Always
  volumes:
  - name: b3acddf0b1f450b08f4b9a022b00700f2123aa8e0ac1a303eb5519e80c1a91ce-pvc
    persistentVolumeClaim:
      claimName: b3acddf0b1f450b08f4b9a022b00700f2123aa8e0ac1a303eb5519e80c1a91ce
  - name: 57ed032976161677987ff2e49bf7536383ab225aa47a418f4aa37ed68d25bb0a-pvc
    persistentVolumeClaim:
      claimName: 57ed032976161677987ff2e49bf7536383ab225aa47a418f4aa37ed68d25bb0a
status: {}
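
The reproduction above could be turned into a loop so that the hang is caught as soon as it appears. This is only a sketch under assumptions: the function name reproduce_hang is made up for illustration, "project" is the pod name from wordpress.yml, pod removal via podman pod rm -f stands in for bringing the pod down through the UI, and the 50 iterations and 20-second timeout are arbitrary.

```shell
#!/usr/bin/env bash
# Reproducer loop sketch: create and remove the pod repeatedly and check
# after each cycle whether `podman ps` still answers.
# Assumptions: wordpress.yml is the file shown above, "project" is its pod
# name, and the iteration count and timeout are arbitrary defaults.

reproduce_hang() {
    local kube_file="${1:-wordpress.yml}"
    local pod_name="${2:-project}"
    local iterations="${3:-50}"
    local i

    for ((i = 1; i <= iterations; i++)); do
        # Step 1: start the pod. Step 2: bring it down again.
        podman play kube "$kube_file" >/dev/null \
            || echo "cycle $i: play kube failed" >&2
        podman pod rm -f "$pod_name" >/dev/null \
            || echo "cycle $i: pod rm failed" >&2

        # timeout(1) kills the command and exits non-zero if it runs too long.
        if ! timeout 20 podman ps >/dev/null; then
            echo "podman ps failed or hung after cycle $i"
            return 1
        fi
    done
    echo "no hang after $iterations cycles"
}
```

Calling reproduce_hang wordpress.yml project 100 would run up to 100 cycles and stop at the first one after which podman ps no longer returns in time.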

For me podman ps also hangs, is it the same for you?

jelly, Jan 19 '22 15:01

> For me podman ps also hangs, is it the same for you?

Never mind, this is caused by podman.socket: stopping the unit makes podman ps work again, so for some reason podman.socket is "stuck".

This hangs:

[admin@fedora-34-127-0-0-2-2201 ~]$ curl -X GET -s -g --no-buffer --unix-socket /run/user/1000/podman/podman.sock http://localhost/v1.12/libpod/info

A different, invalid request works:

[admin@fedora-34-127-0-0-2-2201 ~]$ curl -X GET -s -g --no-buffer --unix-socket /run/user/1000/podman/podman.sock http://localhost/v1.12/libpod/images
Not Found
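
To tell a hung socket apart from a responsive one without blocking the terminal, the same request can be given a time limit. A minimal sketch, assuming the function name probe_podman_socket (made up here), a default socket path derived from XDG_RUNTIME_DIR instead of the hard-coded /run/user/1000, and an arbitrary 5-second limit; curl's --max-time option and its exit code 28 for a timeout are standard curl behavior.

```shell
#!/usr/bin/env bash
# Probe the rootless podman API socket without blocking forever.
# The URL is the one from the hanging request above; the default socket
# path and the 5-second limit are assumptions.

probe_podman_socket() {
    local socket="${1:-${XDG_RUNTIME_DIR:-/run/user/$(id -u)}/podman/podman.sock}"
    local limit="${2:-5}"
    local rc=0

    curl -s -g --max-time "$limit" --unix-socket "$socket" \
        http://localhost/v1.12/libpod/info >/dev/null || rc=$?

    case "$rc" in
        0)  echo "responsive" ;;
        28) echo "hung" ;;                 # curl exit code 28: timed out
        *)  echo "error (curl exit $rc)" ;;
    esac
    return "$rc"
}
```

Running probe_podman_socket with no arguments checks the current user's socket; a result of "hung" corresponds to the request above that never returns.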

An strace shows:

futex(0x55e61129c978, FUTEX_WAKE_PRIVATE, 1) = 1
newfstatat(AT_FDCWD, "/run/user/1000", {st_mode=S_IFDIR|0700, st_size=180, ...}, AT_SYMLINK_NOFOLLOW) = 0
newfstatat(AT_FDCWD, "/run/user/1000/containers", {st_mode=S_IFDIR|0700, st_size=120, ...}, AT_SYMLINK_NOFOLLOW) = 0
newfstatat(AT_FDCWD, "/home", {st_mode=S_IFDIR|0755, st_size=4096, ...}, AT_SYMLINK_NOFOLLOW) = 0
newfstatat(AT_FDCWD, "/home/admin", {st_mode=S_IFDIR|0700, st_size=4096, ...}, AT_SYMLINK_NOFOLLOW) = 0
newfstatat(AT_FDCWD, "/home/admin/.local", {st_mode=S_IFDIR|0700, st_size=4096, ...}, AT_SYMLINK_NOFOLLOW) = 0
newfstatat(AT_FDCWD, "/home/admin/.local/share", {st_mode=S_IFDIR|0700, st_size=4096, ...}, AT_SYMLINK_NOFOLLOW) = 0
newfstatat(AT_FDCWD, "/home/admin/.local/share/containers", {st_mode=S_IFDIR|0700, st_size=4096, ...}, AT_SYMLINK_NOFOLLOW) = 0
newfstatat(AT_FDCWD, "/home/admin/.local/share/containers/storage", {st_mode=S_IFDIR|0700, st_size=4096, ...}, AT_SYMLINK_NOFOLLOW) = 0
rt_sigprocmask(SIG_SETMASK, ~[HUP INT QUIT ILL TRAP ABRT BUS FPE SEGV TERM STKFLT CHLD PROF SYS RTMIN RT_1 RT_2], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, ~[HUP INT QUIT ILL TRAP ABRT BUS FPE SEGV TERM STKFLT CHLD PROF SYS RTMIN RT_1 RT_2], NULL, 8) = 0
futex(0xc000060950, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x55e61129d8d0, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
rt_sigprocmask(SIG_SETMASK, ~[HUP INT QUIT ILL TRAP ABRT BUS FPE SEGV TERM STKFLT CHLD PROF SYS RTMIN RT_1 RT_2], NULL, 8) = 0
futex(0x55e6112cf6b8, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x55e6112cf6a0, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x55e61129d8d0, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0xc000060d50, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x55e61129d8d0, FUTEX_WAIT_PRIVATE, 0, NULL

[admin@fedora-34-127-0-0-2-2201 ~]$ podman --version
podman version 3.4.2

jelly, Jan 19 '22 15:01

This is a podman race condition, which requires a good reproducer so that upstream can look into fixing it.

jelly, Jan 21 '22 08:01

> This is a podman race condition, which requires a good reproducer so that upstream can look into fixing it.

I'm not sure where to go from here. What would be a "good reproducer"?

BrentonPoke, Apr 07 '22 20:04

It does happen a lot to me, and I'm sure I could find a way to reliably trigger it, so if someone could tell me what to report, I would love to help.
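
As a starting point for a report, the state of the user units at the moment of the hang could be captured in one file. This is a sketch of what might be useful, not a list requested by upstream: the function name collect_podman_report and the output file name are made up, and the choice of commands is an assumption based on what was checked earlier in this thread.

```shell
#!/usr/bin/env bash
# Collect basic state for an upstream podman report into a single file.
# Which details upstream actually needs is an assumption; the commands
# mirror what was inspected earlier in this thread.

collect_podman_report() {
    local out="${1:-podman-report.txt}"
    {
        echo "== podman version =="
        podman --version 2>&1 || true
        echo "== systemd user units =="
        # status exits non-zero for inactive units, so ignore the exit code.
        systemctl --user status podman.socket podman.service --no-pager 2>&1 || true
        echo "== podman.service journal (current boot) =="
        journalctl --user-unit podman.service -b --no-pager 2>&1 || true
    } >"$out"
    echo "wrote $out"
}
```

Running collect_podman_report right after the UI shows the service as crashed, together with the exact sequence of commands that triggered it, would give a report something concrete to attach.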

fischer-felix, Jul 20 '22 21:07