deployment to remote hosts breaks after some time
Description
docker compose is unable to deploy stack(s) to remote hosts, the hostname appears to be hardcoded in docker(?)
bartek@czajniczek ~/Desktop [18]> docker compose -p "project_name" pull
[+] Running 0/0
⠿ <<service_name_obfuscated>> Error 0.0s
⠿ <<service_name_obfuscated>> Error 0.0s
⠿ <<service_name_obfuscated>> Error 0.0s
⠿ <<service_name_obfuscated>> Error 0.0s
⠿ <<service_name_obfuscated>> Error 0.0s
error during connect: Post "http://docker.example.com/v1.40/images/create?fromImage=postgres&tag=11-alpine": command [ssh -l testuser -- <<internal_server_ipv4_address_obfuscated>> docker system dial-stdio] has exited with signal: killed, please make sure the URL is valid, and Docker 18.09 or later is installed on the remote host: stderr=
bartek@czajniczek ~/Desktop [18]> docker context ls
NAME DESCRIPTION DOCKER ENDPOINT KUBERNETES ENDPOINT ORCHESTRATOR
company_name-machine * ssh://testuser@<<internal_server_ipv4_address_obfuscated>>
default Current DOCKER_HOST based configuration unix:///var/run/docker.sock swarm
bartek@czajniczek ~/Desktop>
Steps to reproduce the issue:
1. docker context create ssh-box --docker "host=ssh://user@my-box"
2. docker context use ssh-box - start using this context to do normal docker things, deploy a stack a few times, tear it down, etc.
3. docker context ls
4. docker compose up
Describe the results you received:
After some time the connection breaks(?) and docker decides that I wanted to call example.com address instead of whatever address I defined in the context.
Describe the results you expected:
Seamless communication with the remote docker daemon
Additional information you deem important (e.g. issue happens only occasionally):
issue happens only occasionally
At first it works; after some time something breaks. I can't tell whether it's the remote daemon that does something or the client side.
Output of docker compose version:
Docker Compose version 2.10.2
Output of docker info:
Client:
Context: <<company_name>>-machine
Debug Mode: false
Plugins:
compose: Docker Compose (Docker Inc., 2.10.2)
Server:
Containers: 60
Running: 48
Paused: 0
Stopped: 12
Images: 116
Server Version: 19.03.8
Storage Driver: overlay2
Backing Filesystem: <unknown>
Supports d_type: true
Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: bridge host ipvlan macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 7ad184331fa3e55e52b890ea95e65ba581ae3429
runc version: dc9208a3303feef5b3839f4323d9beb36df0a9dd
init version: fec3683
Security Options:
seccomp
Profile: default
Kernel Version: 3.10.0-1062.9.1.el7.x86_64
Operating System: CentOS Linux 7 (Core)
OSType: linux
Architecture: x86_64
CPUs: 4
Total Memory: 31.42GiB
Name: <<internal_hostname>>
ID: <<internal>>
Docker Root Dir: /var/lib/docker
Debug Mode: false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
<<internal_registry>>
<<internal_registry>>
127.0.0.0/8
Live Restore Enabled: false
Additional environment details:
The local (client) OS is Arch Linux 64-bit; the issue also happens on Debian 11 64-bit hosts with the official docker package (the one installed from the official docker apt repository).
Is there anything that makes you think this is related to compose specifically and not any docker command? Are you able to docker run or docker ps against the remote host after it breaks?
Could be related to #9448?
Is there anything that makes you think this is related to compose specifically and not any docker command? Are you able to docker run or docker ps against the remote host after it breaks?
Yes. I ran into this issue a few minutes ago: docker ps works, but docker compose pull doesn't work at all (it reports that it can't connect to docker.example.com), and docker compose down -v deletes one volume at a time (I have to run the command 4 times to delete 4 containers because it fails after deleting the first one).
Could be related to #9448?
I think this is related. I even followed the advice there about increasing the max sessions and retries in the ssh daemon settings on one of the hosts, and the problem still persists.
I haven't had time to test whether downgrading docker compose on the client machine fixes anything. I'll probably try this later today or tomorrow (that is, if I find any time to fiddle with it this weekend).
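To check which connection limits the remote SSH daemon actually applies, something like the following sketch may help. It assumes OpenSSH's sshd on the remote host; the helper name show_ssh_limits is made up for illustration. MaxSessions caps sessions per connection, while MaxStartups throttles concurrent unauthenticated connections, which may be the limit a burst of new connections runs into.

```shell
#!/bin/sh
# Hypothetical helper: reduce `sshd -T` style output to the two
# connection-limit settings discussed in this thread.
# `sshd -T` prints the effective configuration as lowercase "key value" lines.
show_ssh_limits() {
    grep -Ei '^(maxstartups|maxsessions) '
}

# On the remote host (needs root):
#   sudo sshd -T | show_ssh_limits
```

If the values are raised in sshd_config, the daemon has to be reloaded before they take effect.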
Update: I've downgraded docker compose package and everything works now
More information: https://github.com/docker/compose/issues/9448#issuecomment-1265437932
Downgrading to 2.3.3 seems to fix it for me as well. But I'm also getting the warnings described in https://github.com/docker/compose/issues/8544 now.
I'm able to reproduce this in compose 2.16.0. Same symptoms: docker compose ps works one time, then returns a similar error. I checked the SSH logs, and the connection was terminated by the client. On the other hand, docker ps shows all the right containers. This is definitely a compose bug.
[22:25:38]❯ docker compose ps
error during connect: Get "http://docker.example.com/v1.42/containers/7a78a665df790636a1de787d8f84debd0e80fa4c17a7a1b6e5c1a84afdb2a61d/json": command [ssh -l <user> -- <server ip> docker system dial-stdio] has exited with exit status 255, please make sure the URL is valid, and Docker 18.09 or later is installed on the remote host: stderr=kex_exchange_identification: read: Connection reset by peer
Connection reset by <server ip> port 22
My docker contexts:
[22:28:52]❯ docker context ls
NAME TYPE DESCRIPTION DOCKER ENDPOINT KUBERNETES ENDPOINT ORCHESTRATOR
default moby Current DOCKER_HOST based configuration unix:///var/run/docker.sock
<context name> * moby ssh://<user>@<server ip>
Even more interesting, when I run the command with the debug flag:
[22:39:23]❯ docker --debug compose ps
DEBU[0000] commandconn: starting ssh with [-l <user> -- <server ip> docker system dial-stdio]
DEBU[0001] commandconn: starting ssh with [-l <user> -- <server ip> docker system dial-stdio]
DEBU[0001] commandconn: starting ssh with [-l <user> -- <server ip> docker system dial-stdio]
DEBU[0001] commandconn: starting ssh with [-l <user> -- <server ip> docker system dial-stdio]
DEBU[0001] commandconn: starting ssh with [-l <user> -- <server ip> docker system dial-stdio]
DEBU[0001] commandconn: starting ssh with [-l <user> -- <server ip> docker system dial-stdio]
DEBU[0001] commandconn: starting ssh with [-l <user> -- <server ip> docker system dial-stdio]
DEBU[0001] commandconn: starting ssh with [-l <user> -- <server ip> docker system dial-stdio]
DEBU[0001] commandconn: starting ssh with [-l <user> -- <server ip> docker system dial-stdio]
DEBU[0001] commandconn: starting ssh with [-l <user> -- <server ip> docker system dial-stdio]
DEBU[0001] commandconn: starting ssh with [-l <user> -- <server ip> docker system dial-stdio]
DEBU[0001] commandconn: starting ssh with [-l <user> -- <server ip> docker system dial-stdio]
DEBU[0001] commandconn: starting ssh with [-l <user> -- <server ip> docker system dial-stdio]
DEBU[0001] commandconn: starting ssh with [-l <user> -- <server ip> docker system dial-stdio]
DEBU[0001] commandconn: starting ssh with [-l <user> -- <server ip> docker system dial-stdio]
DEBU[0001] commandconn: starting ssh with [-l <user> -- <server ip> docker system dial-stdio]
DEBU[0001] commandconn: starting ssh with [-l <user> -- <server ip> docker system dial-stdio]
DEBU[0001] commandconn: starting ssh with [-l <user> -- <server ip> docker system dial-stdio]
DEBU[0001] commandconn: starting ssh with [-l <user> -- <server ip> docker system dial-stdio]
DEBU[0001] commandconn: starting ssh with [-l <user> -- <server ip> docker system dial-stdio]
DEBU[0001] commandconn: starting ssh with [-l <user> -- <server ip> docker system dial-stdio]
DEBU[0001] commandconn (ssh):kex_exchange_identification: Connection closed by remote host
onnection closed by <server ip> port 22
error during connect: Get "http://docker.example.com/v1.42/containers/36da8834000fe7bd9eb2873e5389e8a3fef7371817cc7dab292f017e80b77853/json": command [ssh -l <user> -- <server ip> docker system dial-stdio] has exited with exit status 255, please make sure the URL is valid, and Docker 18.09 or later is installed on the remote host: stderr=kex_exchange_identification: Connection closed by remote host
Connection closed by <server ip> port 22
Those connections are all made very rapidly. I've already added my key to the SSH agent on my machine but I also modified my SSH config so that all connections to the server IP use my username and identity file. I think I've ruled out configuration at this point but I'm willing to troubleshoot this further. Also of note, Docker-Compose v1 works just fine with this context.
@matt0x6F could you please try to reproduce by directly executing the docker-compose CLI plugin?
.docker/cli-plugins/docker-compose --debug ps
(or /Applications/Docker.app/Contents/Resources/bin/docker-compose --debug ps when using Docker Desktop Mac)
docker compose uses the exact same Docker client set by the docker CLI, just like docker ps does, but maybe this issue happens due to the docker CLI plugin architecture.
@ndeloof same output
DEBU[0000] commandconn: starting ssh with [-l <user> -- <server ip> docker system dial-stdio]
DEBU[0002] commandconn: starting ssh with [-l <user> -- <server ip> docker system dial-stdio]
DEBU[0002] commandconn: starting ssh with [-l <user> -- <server ip> docker system dial-stdio]
DEBU[0002] commandconn: starting ssh with [-l <user> -- <server ip> docker system dial-stdio]
DEBU[0002] commandconn: starting ssh with [-l <user> -- <server ip> docker system dial-stdio]
DEBU[0002] commandconn: starting ssh with [-l <user> -- <server ip> docker system dial-stdio]
DEBU[0002] commandconn: starting ssh with [-l <user> -- <server ip> docker system dial-stdio]
DEBU[0002] commandconn: starting ssh with [-l <user> -- <server ip> docker system dial-stdio]
DEBU[0002] commandconn: starting ssh with [-l <user> -- <server ip> docker system dial-stdio]
DEBU[0002] commandconn: starting ssh with [-l <user> -- <server ip> docker system dial-stdio]
DEBU[0002] commandconn: starting ssh with [-l <user> -- <server ip> docker system dial-stdio]
DEBU[0002] commandconn: starting ssh with [-l <user> -- <server ip> docker system dial-stdio]
DEBU[0002] commandconn: starting ssh with [-l <user> -- <server ip> docker system dial-stdio]
DEBU[0002] commandconn: starting ssh with [-l <user> -- <server ip> docker system dial-stdio]
DEBU[0002] commandconn: starting ssh with [-l <user> -- <server ip> docker system dial-stdio]
DEBU[0002] commandconn: starting ssh with [-l <user> -- <server ip> docker system dial-stdio]
DEBU[0002] commandconn: starting ssh with [-l <user> -- <server ip> docker system dial-stdio]
DEBU[0002] commandconn: starting ssh with [-l <user> -- <server ip> docker system dial-stdio]
DEBU[0002] commandconn: starting ssh with [-l <user> -- <server ip> docker system dial-stdio]
DEBU[0002] commandconn: starting ssh with [-l <user> -- <server ip> docker system dial-stdio]
DEBU[0002] commandconn: starting ssh with [-l <user> -- <server ip> docker system dial-stdio]
DEBU[0002] commandconn (ssh):kex_exchange_identification: read: Connection reset by peer
onnection reset by <server ip> port 22
error during connect: Get "http://docker.example.com/v1.42/containers/6105f0b03deaaba6f0aad2f3e9ae2c2d482a18767573691ad32c311756f8cef2/json": command [ssh -l <user> -- <server ip> docker system dial-stdio] has exited with exit status 255, please make sure the URL is valid, and Docker 18.09 or later is installed on the remote host: stderr=kex_exchange_identification: read: Connection reset by peer
Connection reset by <server ip> port 22
It's worth mentioning that ssh -l <user> -- <server ip> echo test does what I expect it to do. I'm very curious about where http://docker.example.com/v1.42/containers/6105f0b03deaaba6f0aad2f3e9ae2c2d482a18767573691ad32c311756f8cef2/json comes from, because the command after it is correct (the ssh command), which means it's picking up the context correctly.
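A single ssh call succeeding does not exercise the same pattern as the debug log, where about twenty connections start within a second. A rough way to imitate that burst outside of docker is sketched below; the function name burst is hypothetical, and whether it reproduces the failure depends on the server's limits.

```shell
#!/bin/sh
# Run the same command N times in parallel and report how many runs failed.
# Usage: burst N command [args...]
burst() {
    n="$1"
    shift
    failed=0
    pids=""
    i=0
    # Start all runs in the background without waiting in between.
    while [ "$i" -lt "$n" ]; do
        "$@" >/dev/null 2>&1 &
        pids="$pids $!"
        i=$((i + 1))
    done
    # Collect the exit status of every run.
    for pid in $pids; do
        wait "$pid" || failed=$((failed + 1))
    done
    echo "$failed of $n failed"
}
```

For example, burst 20 ssh -o BatchMode=yes -l <user> -- <server ip> true starts twenty connections at once; a non-zero failure count would point at the server or a firewall rejecting the burst rather than at the context configuration.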
This fixed it for me!
The solution found by @politician works in a non-Windows client environment, but on Windows with Microsoft's OpenSSH implementation you won't get very far: ControlMaster is not supported by Microsoft's implementation, which means Windows users pretty much cannot use SSH authentication to the Docker API.
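For reference, OpenSSH connection multiplexing of the kind mentioned above is usually configured with a stanza like the one this sketch prints. The comment containing the original fix is not reproduced here, so the host alias, user name, socket path and persist time are all assumptions, and the helper name ssh_mux_stanza is made up.

```shell
#!/bin/sh
# Print an ssh_config stanza that lets later ssh invocations reuse one
# already-open connection instead of opening a new one each time.
# Usage: ssh_mux_stanza HOST USER
ssh_mux_stanza() {
    host="$1"
    user="$2"
    cat <<EOF
Host $host
    User $user
    ControlMaster auto
    ControlPath ~/.ssh/control-%C
    ControlPersist 10m
EOF
}

# Review the output, then append it to the client's SSH config:
#   ssh_mux_stanza my-box user >> ~/.ssh/config
```

With ControlMaster auto, the first ssh process becomes the master and subsequent ones attach to its socket, so the server sees one TCP connection rather than one per docker API call.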
The underlying issue seems to be that compose is spamming separate connections for each docker command, which triggers either connection limiting on the remote host or an intermediate firewall (hence the connection reset).
I notice this behavior consistently when e.g. running docker compose up -d for around 10-20 services, but never when doing the same for a single service.
This is still a problem as of compose 2.23.0, but may be an issue with the docker CLI itself.