docuum icon indicating copy to clipboard operation
docuum copied to clipboard

Exits on `docker events terminated`

Open schnittchen opened this issue 1 year ago • 17 comments

Description

I use docker compose with multiple services, and compose builds the needed images itself. Occasionally, compose aborts with an error because it cannot find the image it has just built.

I assume this happens when both of these conditions are true:

  • the built image already existed, so its LRU is old enough, because all build steps were cached, leading to the newly built image being an existing one
  • docuum crashed and (in my case) was automatically started, performing the more aggressive startup cleaning

The crash looks like this:

Nov 18 11:42:09 sonnenschein-gh-runner docker[1733577]: [2024-11-18 11:42:09 +00:00 DEBUG] Waking up…
Nov 18 11:42:09 sonnenschein-gh-runner docker[1733577]: [2024-11-18 11:42:09 +00:00 DEBUG] Updating last-used timestamp for image `sha256:a5ae6b5aac27c27b6e4b1f8e81ff0af7fa071e17f85a3e2c767129e1814426e7`…
Nov 18 11:42:09 sonnenschein-gh-runner docker[1733577]: [2024-11-18 11:42:09 +00:00 DEBUG] Going back to sleep…
Nov 18 11:43:38 sonnenschein-gh-runner systemd[846]: docuum.service: Main process exited, code=exited, status=1/FAILURE
Nov 18 11:43:38 sonnenschein-gh-runner systemd[846]: docuum.service: Failed with result 'exit-code'.
Nov 18 11:43:39 sonnenschein-gh-runner systemd[846]: docuum.service: Scheduled restart job, restart counter is at 99.
Nov 18 11:43:39 sonnenschein-gh-runner systemd[846]: Stopped Docuum (docker image vacuumer).
Nov 18 11:43:39 sonnenschein-gh-runner systemd[846]: Started Docuum (docker image vacuumer).
Nov 18 11:43:39 sonnenschein-gh-runner docker[1822919]: [2024-11-18 11:43:39 +00:00 INFO] Performing an initial vacuum on startup…
Nov 18 11:43:40 sonnenschein-gh-runner docker[1822919]: [2024-11-18 11:43:40 +00:00 DEBUG] Ignored image `sha256:a5ae6b5aac27c27b6e4b1f8e81ff0af7fa071e17f85a3e2c767129e1814426e7` due to the `--min-age` flag.

As you can see, all crash related information I have is the exit status, at least this time. Last time around, I saw "docker events terminated" which made docuum exit.

Instructions to reproduce the bug

It happens only sporadically, so that remains to be seen.

Environment information:

  • 0.25.0
  • Docker version 27.3.1, build ce12230
  • Ubuntu 22.04.5 LTS

Additional context

I'm using rootless docker, and I'm starting docuum like this:

ExecStart=/usr/bin/docker run \
  --rm \
  --init \
  --name docuum \
  --mount type=bind,src=/run/user/%U/docker.sock,dst=/var/run/docker.sock \
  --mount type=volume,source=docuum,target=/root \
  stephanmisc/docuum --threshold '10 GB' --min-age '5min'

Help is very much appreciated for getting more information on this issue.

schnittchen avatar Nov 18 '24 11:11 schnittchen

Hi @schnittchen, thanks for filing this issue! Could you try running Docuum with LOG_LEVEL=trace to see if you get any more info from the logs?

docuum crashed and (in my case) was automatically started, performing the more aggressive startup cleaning

Docuum always tries to keep the image disk usage below the threshold, whether on startup or not—although on startup there might be more work to do. There's only one thing that is different about startup:

  • Normally, when Docuum encounters an image it hasn't seen before, it takes the current time as the "last used time" for that image. This usually happens because a new image was created since Docuum started.
  • However, on startup when Docuum encounters an image it hasn't seen before (according to its persisted state), it takes the image creation time as the "last used time" for that image. The reason for this is that on the first run, Docuum hasn't seen any images before, and it wouldn't be very useful to assign the current time as the "last used time" for all the images (then the deletion order would be unpredictable and probably not what one wants).

stepchowfun avatar Nov 18 '24 19:11 stepchowfun

I added the LOG_LEVEL and I'm currently investigating.

The second time now I've seen this:

Nov 19 09:19:13 sonnenschein-gh-runner docker[320972]: [2024-11-19 09:19:13 +00:00 DEBUG] Going back to sleep…
Nov 19 09:28:00 sonnenschein-gh-runner docker[320972]: [2024-11-19 09:28:00 +00:00 ERROR] `docker events` terminated.
Nov 19 09:28:01 sonnenschein-gh-runner systemd[843]: docuum.service: Main process exited, code=exited, status=1/FAILURE
Nov 19 09:28:01 sonnenschein-gh-runner systemd[843]: docuum.service: Failed with result 'exit-code'.

Relevant docker service logs at that time:

Nov 19 09:19:13 sonnenschein-gh-runner dockerd-rootless.sh[971]: time="2024-11-19T09:19:13.355083168Z" level=info msg="cleaning up dead shim" namespace=moby
Nov 19 09:28:01 sonnenschein-gh-runner dockerd-rootless.sh[920]: time="2024-11-19T09:28:01.014840625Z" level=info msg="ignoring event" container=5ab5a9a6fa7d6a7f70ccae4a9389afd60b6296e5313c26608b437f5de3320022 module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Nov 19 09:28:01 sonnenschein-gh-runner dockerd-rootless.sh[971]: time="2024-11-19T09:28:01.015714228Z" level=info msg="shim disconnected" id=5ab5a9a6fa7d6a7f70ccae4a9389afd60b6296e5313c26608b437f5de3320022 namespace=moby
Nov 19 09:28:01 sonnenschein-gh-runner dockerd-rootless.sh[971]: time="2024-11-19T09:28:01.015804295Z" level=warning msg="cleaning up after shim disconnected" id=5ab5a9a6fa7d6a7f70ccae4a9389afd60b6296e5313c26608b437f5de3320022 namespace=moby
Nov 19 09:28:01 sonnenschein-gh-runner dockerd-rootless.sh[971]: time="2024-11-19T09:28:01.015817667Z" level=info msg="cleaning up dead shim" namespace=moby
Nov 19 09:28:02 sonnenschein-gh-runner dockerd-rootless.sh[971]: time="2024-11-19T09:28:02.365678559Z" level=info msg="loading plugin \"io.containerd.event.v1.publisher\"..." runtime=io.containerd.runc.v2 type=io.containerd.event.v1
Nov 19 09:28:02 sonnenschein-gh-runner dockerd-rootless.sh[971]: time="2024-11-19T09:28:02.365755852Z" level=info msg="loading plugin \"io.containerd.internal.v1.shutdown\"..." runtime=io.containerd.runc.v2 type=io.containerd.internal.v1
Nov 19 09:28:02 sonnenschein-gh-runner dockerd-rootless.sh[971]: time="2024-11-19T09:28:02.365772040Z" level=info msg="loading plugin \"io.containerd.ttrpc.v1.task\"..." runtime=io.containerd.runc.v2 type=io.containerd.ttrpc.v1
Nov 19 09:28:02 sonnenschein-gh-runner dockerd-rootless.sh[971]: time="2024-11-19T09:28:02.367406562Z" level=info msg="loading plugin \"io.containerd.ttrpc.v1.pause\"..." runtime=io.containerd.runc.v2 type=io.containerd.ttrpc.v1
Nov 19 10:11:42 sonnenschein-gh-runner dockerd-rootless.sh[971]: time="2024-11-19T10:11:42.313428599Z" level=info msg="loading plugin \"io.containerd.event.v1.publisher\"..." runtime=io.containerd.runc.v2 type=io.containerd.event.v1

schnittchen avatar Nov 19 '24 10:11 schnittchen

A weird thing about this is that Docuum runs in an infinite loop with a top-level error handler. So it generally doesn't crash under normal circumstances, regardless of what Docker does.

I wonder if Docuum is receiving a signal like SIGTERM or SIGKILL from something (Docker Compose?), or maybe the system doesn't have enough memory so things are getting OOM killed (Docuum itself doesn't use much memory, but maybe something else does).

stepchowfun avatar Nov 19 '24 11:11 stepchowfun

If it was a Signal, I think systemd would show that in the logs? It's definitely not memory pressure.

schnittchen avatar Nov 19 '24 11:11 schnittchen

I notice the exit code is 1 (is that right?)—the only place that Docuum explicitly exits with that code after it has started running is here in the handler for SIGINT/SIGTERM/SIGHUP, so that's why I'm suspicious that it's a signal. We could potentially add some logging there to confirm. Would you be able to add something like error!("Signal recieved."); right before that exit(1); and build Docuum locally (should just be cargo build if you have Rust installed)?

stepchowfun avatar Nov 19 '24 18:11 stepchowfun

That would be quite some effort, since I'm running it on a server and I'd need to set up all the tooling...

Maybe it's possible to wrap docuum with a little shell script that traps and logs signals. I'm not so much of a shell expert though.

My initial problem was probably due to my own cleanup script running in parallel and taking care of containers, networks and volumes, disrupting the startup of compose. I have hopefully fixed that and will know in a few days if it's working.

The mysterious docker events terminated probably remains though.

schnittchen avatar Nov 19 '24 19:11 schnittchen

It's a mystery indeed, because docker events terminating shouldn't cause Docuum to crash. Docuum will just spawn a new docker events.

stepchowfun avatar Nov 19 '24 19:11 stepchowfun

As I mentioned, besides docuum I have a service running which periodically tries to get rid of old container, volumes and networks. I disabled it(*) so it does not interfere with observing this issue.

(*) it still stops long-running containers because that seems to work well enough.

As I'm looking at that periodic script, I am looking at shell code trying to parse go-formatted output of docker ps and my head is getting all dizzy. I guess what I want to express is that these are problems that would make sense for docuum to solve as well. Even a heuristic solution would be better than none (or my horrible scripting).

schnittchen avatar Nov 19 '24 21:11 schnittchen

I agree that it would be nice for Docuum to clean up more than just images. This is a common feature request (e.g., https://github.com/stepchowfun/docuum/issues/68), and I don't have any major concerns about it—it just needs to be built.

stepchowfun avatar Nov 19 '24 22:11 stepchowfun

For the record, this is what I'm doing right now regarding other resources, from a minutely scheduled script:

# Clear docker build cache when free disk space is running low.
free_percent=`stat -f -c '%a * 100 / %b' / | bc`
min_free_percent=30
if [ $free_percent -lt $min_free_percent ]; then
  docker builder prune -f
fi

if date | grep ' 04:00:' >/dev/null 2>&1; then                                                                              
  docker container prune --force                                                                                            
  docker volume prune --all --force                                                                                               
  docker network prune --force                                                                                              
fi                                                                                                                          

I can do this only because it's not a production machine, and there's a time window where this doesn't cause any harm.

[Edited: added --all to volume pruning] [Edited: added build cache pruning]

schnittchen avatar Nov 29 '24 11:11 schnittchen

Here's logs from the last time docuum terminated:

Nov 29 10:19:42 sonnenschein-gh-runner docker[2402942]: [2024-11-29 10:19:42 +00:00 DEBUG] Docker images are using `9.68 GB`, which is within the limit of `10.00 GB`.
Nov 29 10:19:42 sonnenschein-gh-runner docker[2402942]: [2024-11-29 10:19:42 +00:00 INFO] Listening for Docker events…
Nov 29 10:20:17 sonnenschein-gh-runner docker[2402942]: [2024-11-29 10:20:17 +00:00 DEBUG] Waking up…
Nov 29 10:20:17 sonnenschein-gh-runner docker[2402942]: [2024-11-29 10:20:17 +00:00 DEBUG] Updating last-used timestamp for image `sha256:11229070f1e2ac463af222c4a912d29a1f1dbed27610c6de2ad5f0a55d43f23e`…
Nov 29 10:20:17 sonnenschein-gh-runner docker[2402942]: [2024-11-29 10:20:17 +00:00 DEBUG] Going back to sleep…
Nov 29 10:20:17 sonnenschein-gh-runner docker[2402942]: [2024-11-29 10:20:17 +00:00 DEBUG] Waking up…
Nov 29 10:20:17 sonnenschein-gh-runner docker[2402942]: [2024-11-29 10:20:17 +00:00 DEBUG] Updating last-used timestamp for image `sha256:40ddf29a3b2f81f8084970b142dc59edbaa2737edbfce3230d920671e2ee9833`…
Nov 29 10:20:17 sonnenschein-gh-runner docker[2402942]: [2024-11-29 10:20:17 +00:00 DEBUG] Going back to sleep…
Nov 29 10:20:27 sonnenschein-gh-runner docker[2402942]: [2024-11-29 10:20:27 +00:00 DEBUG] Waking up…
Nov 29 10:20:27 sonnenschein-gh-runner docker[2402942]: [2024-11-29 10:20:27 +00:00 DEBUG] Updating last-used timestamp for image `sha256:3b278b9780b6599272f368873357ec3d53b8b99bd09c24dd7e4aa4b13f5d4936`…
Nov 29 10:20:27 sonnenschein-gh-runner docker[2402942]: [2024-11-29 10:20:27 +00:00 DEBUG] Going back to sleep…
Nov 29 10:20:27 sonnenschein-gh-runner docker[2402942]: [2024-11-29 10:20:27 +00:00 DEBUG] Waking up…
Nov 29 10:20:27 sonnenschein-gh-runner docker[2402942]: [2024-11-29 10:20:27 +00:00 DEBUG] Updating last-used timestamp for image `sha256:bb7a102cb30b7030de182d8eb79403348fc02922920186589fd5f784a7bd1dbd`…
Nov 29 10:20:27 sonnenschein-gh-runner docker[2402942]: [2024-11-29 10:20:27 +00:00 DEBUG] Going back to sleep…
Nov 29 10:20:27 sonnenschein-gh-runner docker[2402942]: [2024-11-29 10:20:27 +00:00 DEBUG] Waking up…
Nov 29 10:20:27 sonnenschein-gh-runner docker[2402942]: [2024-11-29 10:20:27 +00:00 DEBUG] Updating last-used timestamp for image `sha256:fc9156a9e8b840766922b608e7ea181f379894a5bb67d60f526b5e4a0040112a`…
Nov 29 10:20:27 sonnenschein-gh-runner docker[2402942]: [2024-11-29 10:20:27 +00:00 DEBUG] Going back to sleep…
Nov 29 10:53:11 sonnenschein-gh-runner docker[2402942]: [2024-11-29 10:53:11 +00:00 ERROR] `docker events` terminated.
Nov 29 10:53:11 sonnenschein-gh-runner docker[2402942]: [2024-11-29 10:53:11 +00:00 INFO] Retrying in 5 seconds…
Nov 29 10:53:11 sonnenschein-gh-runner systemd[853]: docuum.service: Main process exited, code=exited, status=1/FAILURE
Nov 29 10:53:11 sonnenschein-gh-runner systemd[853]: docuum.service: Failed with result 'exit-code'.
Nov 29 10:53:12 sonnenschein-gh-runner systemd[853]: docuum.service: Scheduled restart job, restart counter is at 127.
Nov 29 10:53:12 sonnenschein-gh-runner systemd[853]: Stopped Docuum (docker image vacuumer).
Nov 29 10:53:12 sonnenschein-gh-runner systemd[853]: Started Docuum (docker image vacuumer).

schnittchen avatar Nov 29 '24 11:11 schnittchen

I misconfigured the log level. Hope I did it right this time.

Will post logs when I see the crash again.

schnittchen avatar Nov 29 '24 12:11 schnittchen

The error has not occurred again within the past five days. I suspect my previously aggressive cleanup script for running containers, which also pruned networks and volumes, somehow triggered it.

I'll continue to keep an eye on this. I'd close this when the problem does not re-appear within some time though, so we can move on.

schnittchen avatar Dec 04 '24 12:12 schnittchen

Not seen again.

schnittchen avatar Dec 12 '24 13:12 schnittchen

Just happened again (after I had removed the change to increase the log level, ofc)

schnittchen avatar Dec 13 '24 14:12 schnittchen

Now with trace... but nothing weird apart from the "docker events terminated" itself:

Dec 16 09:13:13 sonnenschein-gh-runner docker[1087831]: [2024-12-16 09:13:13 +00:00 TRACE] Skipping due to: missing field `id` at line 1 column 168
Dec 16 09:13:13 sonnenschein-gh-runner docker[1087831]: [2024-12-16 09:13:13 +00:00 TRACE] Incoming event: `{"Type":"volume","Action":"prune","Actor":{"ID":"","Attributes":{"reclaimed":"0"}},"scope":"local","time":1734340393,"timeNano":1734340393133640370}`
Dec 16 09:13:13 sonnenschein-gh-runner docker[1087831]: [2024-12-16 09:13:13 +00:00 TRACE] Skipping due to: missing field `id` at line 1 column 148
Dec 16 09:13:13 sonnenschein-gh-runner docker[1087831]: [2024-12-16 09:13:13 +00:00 TRACE] Incoming event: `{"Type":"network","Action":"prune","Actor":{"ID":"","Attributes":{"reclaimed":"0"}},"scope":"local","time":1734340393,"timeNano":1734340393159163237}`
Dec 16 09:13:13 sonnenschein-gh-runner docker[1087831]: [2024-12-16 09:13:13 +00:00 TRACE] Skipping due to: missing field `id` at line 1 column 149
Dec 16 09:14:13 sonnenschein-gh-runner docker[1087831]: [2024-12-16 09:14:13 +00:00 ERROR] `docker events` terminated.
Dec 16 09:14:13 sonnenschein-gh-runner systemd[851]: docuum.service: Main process exited, code=exited, status=1/FAILURE
Dec 16 09:14:13 sonnenschein-gh-runner systemd[851]: docuum.service: Failed with result 'exit-code'.

schnittchen avatar Dec 16 '24 09:12 schnittchen

Same today.

Captured some docker-rootless logging which happened at the same time.

Jan 03 11:13:27 sonnenschein-gh-runner systemd[852]: Started Remove other docker resources.
Jan 03 11:13:27 sonnenschein-gh-runner docker[2318777]: [2025-01-03 11:13:27 +00:00 ERROR] `docker events` terminated.
Jan 03 11:13:27 sonnenschein-gh-runner systemd[852]: docker-ecdb3767cb6b1d75d913889cbf7f4c8016ad42a97aa91cda262d96c5222e9999.scope: Consumed 1.267s CPU time.
Jan 03 11:13:27 sonnenschein-gh-runner dockerd-rootless.sh[983]: time="2025-01-03T11:13:27.204928362Z" level=info msg="shim disconnected" id=ecdb3767cb6b1d75d913889cbf7f4c8016ad42a97aa91cda262d96c5222e9999 name>
Jan 03 11:13:27 sonnenschein-gh-runner dockerd-rootless.sh[924]: time="2025-01-03T11:13:27.204947374Z" level=info msg="ignoring event" container=ecdb3767cb6b1d75d913889cbf7f4c8016ad42a97aa91cda262d96c5222e9999 >
Jan 03 11:13:27 sonnenschein-gh-runner dockerd-rootless.sh[983]: time="2025-01-03T11:13:27.205791673Z" level=warning msg="cleaning up after shim disconnected" id=ecdb3767cb6b1d75d913889cbf7f4c8016ad42a97aa91cda>
Jan 03 11:13:27 sonnenschein-gh-runner dockerd-rootless.sh[983]: time="2025-01-03T11:13:27.205904020Z" level=info msg="cleaning up dead shim" namespace=moby
Jan 03 11:13:27 sonnenschein-gh-runner remove-other-docker-resources.bash[2329168]: ecdb3767cb6b
Jan 03 11:13:27 sonnenschein-gh-runner systemd[852]: docuum.service: Main process exited, code=exited, status=1/FAILURE
Jan 03 11:13:27 sonnenschein-gh-runner systemd[852]: docuum.service: Failed with result 'exit-code'.

schnittchen avatar Jan 03 '25 12:01 schnittchen

Experiencing the exact same issue.

punkpeye avatar Aug 23 '25 02:08 punkpeye