after upgrade from v26 to v27 dind fails to start: Unexpected error in sigtimedwait: 'Function not implemented'

Open mrclrchtr opened this issue 1 year ago • 9 comments

I tried to upgrade from v26 to v27.

I want to use docker dind in a github actions runner scale set with the following config:

image: docker:27.0.2-dind
name: dind
securityContext:
  privileged: true
env:
  - name: DOCKER_GROUP_GID
    value: "123"
resources:
  requests:
    cpu: 300m
    memory: 500Mi
  limits:
    cpu: 300m
    memory: 500Mi
args:
  - dockerd
  - --host=unix:///var/run/docker.sock
  - --group=$(DOCKER_GROUP_GID)

This is the complete log I can get:

cat: can't open '/proc/net/arp_tables_names': No such file or directory
iptables v1.8.10 (nf_tables)
[FATAL tini (1)] Unexpected error in sigtimedwait: 'Function not implemented'

The underlying OS is Talos v1.7.4.

Do you have any idea what's happening?

mrclrchtr avatar Jun 27 '24 14:06 mrclrchtr

Interesting -- why is tini involved here? :thinking:

Do you have something configured on your system that would be putting tini inside that container automatically (for example, on dockerd there's a --init flag that would do so)?

(That being said, I can't reproduce the issue even using docker run --init to force tini to be the parent of my dockerd process, so that doesn't really help much, it's just the only meaningful thread I can see to pull on :sob:)

tianon avatar Jun 27 '24 17:06 tianon

Not that I know of... there is an earlier container that unpacks "dind-externals" from the github runner image and provides it via a volume mount for dind. But that shouldn't lead to a different startup behavior, should it?

This is the log of the v26 image:

cat: can't open '/proc/net/arp_tables_names': No such file or directory
iptables v1.8.10 (nf_tables)
time="2024-06-27T17:34:14.706370867Z" level=info msg="Starting up"
time="2024-06-27T17:34:14.711383174Z" level=info msg="containerd not running, starting managed containerd"
time="2024-06-27T17:34:14.797946949Z" level=info msg="started new containerd process" address=/var/run/docker/containerd/containerd.sock module=libcontainerd pid=346
time="2024-06-27T17:34:14.903422623Z" level=info msg="starting containerd" revision=ae71819c4f5e67bb4d5ae76a6b735f29cc25774e version=v1.7.18
...
...

I'll see if Talos has anything to do with it.

mrclrchtr avatar Jun 27 '24 17:06 mrclrchtr

I found this: https://github.com/docker-library/docker/blob/c0963f96ace4f48d13385cbf20356ae605edcb8b/27/dind/dockerd-entrypoint.sh#L143C2-L144C28

# XXX inject "docker-init" (tini) as pid1 to workaround https://github.com/docker-library/docker/issues/318 (zombie container-shim processes)
set -- docker-init -- "$@"

mrclrchtr avatar Jun 27 '24 17:06 mrclrchtr

Oh lol, good catch -- I forgot all about that. :sob:

However, that doesn't really help give us more threads to pull because it works fine here, so my only guess is something in the Talos environment or kernel or something? Maybe something about how Kubernetes is creating the container?

Is there any way you could get lower level on the affected system and debug/test more directly with simpler container run commands like docker run to help narrow down?

tianon avatar Jun 27 '24 18:06 tianon

> However, that doesn't really help give us more threads to pull because it works fine here, so my only guess is something in the Talos environment or kernel or something? Maybe something about how Kubernetes is creating the container?

Yes, I also think it has to do with Talos. The question is whether the error message means that sigtimedwait is not present?

And I wonder what change to the image makes this function necessary now.

> Is there any way you could get lower level on the affected system and debug/test more directly with simpler container run commands like docker run to help narrow down?

No, unfortunately not. Talos is built in such a way that you can't even set up an SSH tunnel to the machine.

But I could build a very simple Kubernetes deployment with just the image. That's a good idea and helps to isolate the error.
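One way such an isolated test could look: run a small probe inside the pod that calls sigtimedwait directly and reports the errno, if any. This is only a sketch. It assumes a test image with Python 3 on Linux, which is not part of the setup described above, and probe_sigtimedwait is a hypothetical helper name. Python goes through its own libc, so the result may differ from what tini sees.

```python
import errno
import signal


def probe_sigtimedwait():
    """Return 'ok' if sigtimedwait works, otherwise a short failure label."""
    # Not every platform exposes this call through Python's signal module.
    if not hasattr(signal, "sigtimedwait"):
        return "unavailable"
    try:
        # Timeout 0 makes this a poll: it returns None when no signal is pending.
        signal.sigtimedwait({signal.SIGUSR1}, 0)
    except OSError as exc:
        # For example 'ENOSYS' if the kernel or sandbox rejects the syscall.
        return errno.errorcode.get(exc.errno, str(exc.errno))
    return "ok"


if __name__ == "__main__":
    print("sigtimedwait probe:", probe_sigtimedwait())
```

If the probe reports ENOSYS on the affected node but "ok" elsewhere, that would point at the node environment rather than the image.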

Thank you very much for your help. I'll get back to you as soon as I have more information.

mrclrchtr avatar Jun 28 '24 08:06 mrclrchtr

Today I tried version 27.1.1 (without any further changes) and it works. Unfortunately, I still don't know what was going on in the meantime. Thanks again for your support!

mrclrchtr avatar Jul 30 '24 13:07 mrclrchtr

With the upgrade to 27.1.2 the problem is present again 😖🧐

mrclrchtr avatar Aug 15 '24 16:08 mrclrchtr

OK, this is really weird... it works in 27.2.0-dind, but no longer works in 27.2.1-dind.

I will continue to monitor it. Perhaps a pattern will emerge at some point or you can look at the history to see what has changed.

mrclrchtr avatar Sep 10 '24 10:09 mrclrchtr

I'm not familiar with this, but looking at the error handling here: https://github.com/krallin/tini/blob/0b44d3665869e46ccbac7414241b8256d6234dc4/src/tini.c#L505-L512 and the spec here: https://pubs.opengroup.org/onlinepubs/9699919799/functions/sigtimedwait.html, there is an error code that is not handled (EINVAL), and I am wondering if the error message could be misleading.
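For reference, "Function not implemented" is the usual strerror() text for ENOSYS on Linux, while EINVAL normally reads "Invalid argument". The sketch below prints the platform's messages for the errno values discussed here so they can be compared with the log. It assumes tini builds its message from strerror(errno), which the log format suggests but which is not verified here. The helper name errno_messages is made up for illustration.

```python
import errno
import os

# errno names relevant to this thread. Which of them tini treats as
# non-fatal is an assumption based on the linked source, not checked here.
CODES = ("EAGAIN", "EINTR", "EINVAL", "ENOSYS")


def errno_messages(names=CODES):
    """Map each errno name to the platform's strerror() text."""
    return {name: os.strerror(getattr(errno, name)) for name in names}


if __name__ == "__main__":
    for name, message in errno_messages().items():
        print(f"{name}: {message}")
```

If that assumption holds, the logged text would correspond to ENOSYS rather than to an unhandled EINVAL.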

LaurentGoderre avatar Dec 19 '24 15:12 LaurentGoderre