cli cli hangs on dead connection

Description

Steps to reproduce the issue:

Create a socket file with a dead listener by running:

nc -lkU /home/nick/test.sock

Run any docker command against the socket

DOCKER_HOST="unix:///home/nick/test.sock" docker ps

Describe the results you received:

The CLI should eventually time out and/or error

Describe the results you expected:

The CLI hangs indefinitely.

Additional information you deem important (e.g. issue happens only occasionally):

This is distilled from a longer user report and stack trace upstream: https://github.com/tilt-dev/tilt/issues/5841

Output of docker version:

docker version
Client: Docker Engine - Community
 Cloud integration: v1.0.24
 Version:           20.10.16
 API version:       1.41
 Go version:        go1.17.10
 Git commit:        aa7e414
 Built:             Thu May 12 09:17:23 2022
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true

Additional environment details (AWS, VirtualBox, physical, etc.): Linux

Jun 03 '22 15:06 nicks

hmmmm....this is still reproducible for me, but i thought we fixed it

i see a stack trace like:

	/usr/local/go/src/net/http/client.go:590
github.com/docker/cli/vendor/github.com/docker/docker/client.(*Client).doRequest(0xc000258b40, 0xc000258b40?)
	/go/src/github.com/docker/cli/vendor/github.com/docker/docker/client/request.go:140 +0x7c fp=0xc00063f0b8 sp=0xc00063efd0 pc=0x5609e8a2a55c
github.com/docker/cli/vendor/github.com/docker/docker/client.(*Client).Ping(0xc000258b40, {0x5609e9506940, 0x5609e9d51840})
	/go/src/github.com/docker/cli/vendor/github.com/docker/docker/client/ping.go:28 +0x145 fp=0xc00063f390 sp=0xc00063f0b8 pc=0x5609e8a253a5
github.com/docker/cli/vendor/github.com/docker/docker/client.(*Client).NegotiateAPIVersion(0xc000258b40, {0x5609e9506940?, 0x5609e9d51840?})
	/go/src/github.com/docker/cli/vendor/github.com/docker/docker/client/client.go:310 +0x45 fp=0xc00063f468 sp=0xc00063f390 pc=0x5609e8a0a505
github.com/docker/cli/vendor/github.com/docker/docker/client.(*Client).checkVersion(...)

Feb 15 '24 13:02 nicks

I think the fix was in https://github.com/docker/cli/pull/3722

But of course that's CLI code, so could the client have been initialised through another path? (I think there's multiple paths to initialise things, which were added for different implementations, making things ... messy)

Feb 15 '24 13:02 thaJeztah

Wondering if the Ping itself should automatically / always add a timeout if none is set (or perhaps always?)

Feb 15 '24 13:02 thaJeztah

still digging into it, but i think there's multiple pings, and i think it's getting stuck on another one deep in the client code

Feb 15 '24 13:02 nicks

Yeah, I wonder if this is due to the effort to not connect until we need a connection, and that code-path not hitting the one we added the default timeout to 🤔

(hence thinking: should we make ping always have some timeout, separate from the context as a whole)

Feb 15 '24 14:02 thaJeztah

i don't really understand why we continue trying to make additional client requests at all if the initial ping negotiation fails.

like, if the ping fails, the protocol negotiation is going to fallback to the v1.24 api, which is kind of bad even if the daemon is working.

Feb 15 '24 14:02 nicks

OK, here's some good threads:

https://github.com/docker/cli/issues/149
https://github.com/docker/cli/pull/546
https://github.com/moby/moby/pull/39032

to see if i understand them --

in 2017, we have this thing called docker datacenter. _ping was for establishing the health of the datacenter as a whole, NOT whether the daemon was alive. So we changed the behavior so that if the ping returned a 500, we would try to keep going and make some more requests.

in 2019, we changed the client to do its own layer of negotation based on Ping(). we largely subverted the Ping() protocol negotation behavior we added in 2017. But we never went back and fixed the CLI code, so now the CLI and client code have different protocol negotiation protocols :dizzy_face:

Feb 15 '24 14:02 nicks