deployment to remote hosts breaks after some time

Open lUNuXl opened this issue 3 years ago • 11 comments
Description

docker compose is unable to deploy stack(s) to remote hosts; the hostname appears to be hardcoded in docker(?)

bartek@czajniczek ~/Desktop [18]> docker compose -p "project_name" pull
[+] Running 0/0
 ⠿ <<service_name_obfuscated>> Error                                                                                      0.0s
 ⠿ <<service_name_obfuscated>> Error                                                                                 0.0s
 ⠿ <<service_name_obfuscated>> Error                                                                    0.0s
 ⠿ <<service_name_obfuscated>> Error                                                                             0.0s
 ⠿ <<service_name_obfuscated>> Error                                                                                       0.0s
error during connect: Post "http://docker.example.com/v1.40/images/create?fromImage=postgres&tag=11-alpine": command [ssh -l testuser -- <<internal_server_ipv4_address_obfuscated>> docker system dial-stdio] has exited with signal: killed, please make sure the URL is valid, and Docker 18.09 or later is installed on the remote host: stderr=
bartek@czajniczek ~/Desktop [18]> docker context ls
NAME               DESCRIPTION                               DOCKER ENDPOINT               KUBERNETES ENDPOINT   ORCHESTRATOR
company_name-machine *                                             ssh://testuser@<<internal_server_ipv4_address_obfuscated>>                          
default            Current DOCKER_HOST based configuration   unix:///var/run/docker.sock                         swarm
bartek@czajniczek ~/Desktop> 

Steps to reproduce the issue:

  1. docker context create ssh-box --docker "host=ssh://user@my-box"
  2. docker context use ssh-box
  3. start using this context to do normal docker things, deploy a stack a few times, tear it down, etc.
  4. docker context ls
  5. docker compose up

Describe the results you received:

After some time the connection breaks(?) and docker decides that I wanted to call example.com address instead of whatever address I defined in the context.

Describe the results you expected:

Seamless communication with the remote docker daemon

Additional information you deem important (e.g. issue happens only occasionally):

issue happens only occasionally

At first it works, then after some time something breaks. I can't tell whether the remote daemon does something or whether it's the client side.

Output of docker compose version:

Docker Compose version 2.10.2

Output of docker info:

Client:
 Context:    <<company_name>>-machine
 Debug Mode: false
 Plugins:
  compose: Docker Compose (Docker Inc., 2.10.2)

Server:
 Containers: 60
  Running: 48
  Paused: 0
  Stopped: 12
 Images: 116
 Server Version: 19.03.8
 Storage Driver: overlay2
  Backing Filesystem: <unknown>
  Supports d_type: true
  Native Overlay Diff: true
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 7ad184331fa3e55e52b890ea95e65ba581ae3429
 runc version: dc9208a3303feef5b3839f4323d9beb36df0a9dd
 init version: fec3683
 Security Options:
  seccomp
   Profile: default
 Kernel Version: 3.10.0-1062.9.1.el7.x86_64
 Operating System: CentOS Linux 7 (Core)
 OSType: linux
 Architecture: x86_64
 CPUs: 4
 Total Memory: 31.42GiB
 Name: <<internal_hostname>>
 ID: <<internal>>
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  <<internal_registry>>
  <<<<internal_registry>>
  127.0.0.0/8
 Live Restore Enabled: false

Additional environment details:

The local [client] OS is Arch Linux 64-bit. The issue also happens on Debian 11 64-bit hosts with the official docker package (the one installed from the official docker apt repository).

lUNuXl avatar Sep 23 '22 12:09 lUNuXl

Is there anything that makes you think this is related to compose specifically and not any docker command? Are you able to docker run or docker ps against the remote host after it breaks?

nicksieger avatar Sep 28 '22 18:09 nicksieger

Could be related to #9448?

nicksieger avatar Sep 28 '22 18:09 nicksieger

Is there anything that makes you think this is related to compose specifically and not any docker command? Are you able to docker run or docker ps against the remote host after it breaks?

Yes, I ran into this issue a few minutes ago: docker ps works, but docker compose pull doesn't work at all (it says it can't connect to docker.example.com), and docker compose down -v deletes one volume at a time (I have to run the command 4 times to delete 4 containers because it fails after deleting the first one).

lUNuXl avatar Sep 30 '22 16:09 lUNuXl

Could be related to #9448?

I think this is related. I even followed the advice there about increasing the max sessions and retries in the ssh daemon settings on one of the hosts, and the problem still persists.
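To confirm which limits are actually in effect on the remote host after such a change, the effective sshd configuration can be dumped and filtered. This is a minimal sketch, assuming OpenSSH on the server and root access. `show_ssh_limits` is a hypothetical helper name, and including MaxStartups alongside MaxSessions is an assumption about which settings are relevant here:

```shell
# Hypothetical helper: keep only the connection-limit settings from
# `sshd -T` output, which prints the effective settings with lowercase keys.
show_ssh_limits() {
    grep -Ei '^(maxsessions|maxstartups) '
}

# Usage on the remote host (sshd -T normally needs root):
#   sudo sshd -T | show_ssh_limits
```

If the printed values do not match what was written to the sshd configuration file, the daemon may not have been reloaded or a later directive may be overriding the change.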

I haven't had time to test whether downgrading docker compose on the client machine fixes anything. I'll probably try that later today or tomorrow (that is, if I find time to fiddle with it at all this weekend).

lUNuXl avatar Oct 01 '22 12:10 lUNuXl

Update: I've downgraded docker compose package and everything works now

More information: https://github.com/docker/compose/issues/9448#issuecomment-1265437932

lUNuXl avatar Oct 03 '22 13:10 lUNuXl

Downgrading to 2.3.3 seems to fix it for me as well. But I'm also getting the warnings described in https://github.com/docker/compose/issues/8544 now.

AlexZeitler avatar Jan 09 '23 12:01 AlexZeitler

I'm able to reproduce this in compose 2.16.0. Same symptoms: docker compose ps works one time, then it returns a similar error. I checked the SSH logs, and the connection was terminated by the client. On the other hand, docker ps shows all the right containers. This is definitely a compose bug.

[22:25:38]❯ docker compose ps
error during connect: Get "http://docker.example.com/v1.42/containers/7a78a665df790636a1de787d8f84debd0e80fa4c17a7a1b6e5c1a84afdb2a61d/json": command [ssh -l <user> -- <server ip> docker system dial-stdio] has exited with exit status 255, please make sure the URL is valid, and Docker 18.09 or later is installed on the remote host: stderr=kex_exchange_identification: read: Connection reset by peer
Connection reset by <server ip> port 22

My docker contexts:

[22:28:52]❯ docker context ls
NAME                TYPE                DESCRIPTION                               DOCKER ENDPOINT                                 KUBERNETES ENDPOINT   ORCHESTRATOR
default             moby                Current DOCKER_HOST based configuration   unix:///var/run/docker.sock                                                                  
<context name> *            moby                                                          ssh://<user>@<server ip>

Even more interesting, when I run the command with the debug flag:

[22:39:23]❯ docker --debug compose ps
DEBU[0000] commandconn: starting ssh with [-l <user> -- <server ip> docker system dial-stdio] 
DEBU[0001] commandconn: starting ssh with [-l <user> -- <server ip> docker system dial-stdio] 
DEBU[0001] commandconn: starting ssh with [-l <user> -- <server ip> docker system dial-stdio] 
DEBU[0001] commandconn: starting ssh with [-l <user> -- <server ip> docker system dial-stdio] 
DEBU[0001] commandconn: starting ssh with [-l <user> -- <server ip> docker system dial-stdio] 
DEBU[0001] commandconn: starting ssh with [-l <user> -- <server ip> docker system dial-stdio] 
DEBU[0001] commandconn: starting ssh with [-l <user> -- <server ip> docker system dial-stdio] 
DEBU[0001] commandconn: starting ssh with [-l <user> -- <server ip> docker system dial-stdio] 
DEBU[0001] commandconn: starting ssh with [-l <user> -- <server ip> docker system dial-stdio] 
DEBU[0001] commandconn: starting ssh with [-l <user> -- <server ip> docker system dial-stdio] 
DEBU[0001] commandconn: starting ssh with [-l <user> -- <server ip> docker system dial-stdio] 
DEBU[0001] commandconn: starting ssh with [-l <user> -- <server ip> docker system dial-stdio] 
DEBU[0001] commandconn: starting ssh with [-l <user> -- <server ip> docker system dial-stdio] 
DEBU[0001] commandconn: starting ssh with [-l <user> -- <server ip> docker system dial-stdio] 
DEBU[0001] commandconn: starting ssh with [-l <user> -- <server ip> docker system dial-stdio] 
DEBU[0001] commandconn: starting ssh with [-l <user> -- <server ip> docker system dial-stdio] 
DEBU[0001] commandconn: starting ssh with [-l <user> -- <server ip> docker system dial-stdio] 
DEBU[0001] commandconn: starting ssh with [-l <user> -- <server ip> docker system dial-stdio] 
DEBU[0001] commandconn: starting ssh with [-l <user> -- <server ip> docker system dial-stdio] 
DEBU[0001] commandconn: starting ssh with [-l <user> -- <server ip> docker system dial-stdio] 
DEBU[0001] commandconn: starting ssh with [-l <user> -- <server ip> docker system dial-stdio] 
DEBU[0001] commandconn (ssh):kex_exchange_identification: Connection closed by remote host
 onnection closed by <server ip> port 22
error during connect: Get "http://docker.example.com/v1.42/containers/36da8834000fe7bd9eb2873e5389e8a3fef7371817cc7dab292f017e80b77853/json": command [ssh -l <user> -- <server ip> docker system dial-stdio] has exited with exit status 255, please make sure the URL is valid, and Docker 18.09 or later is installed on the remote host: stderr=kex_exchange_identification: Connection closed by remote host
Connection closed by <server ip> port 22

Those connections are all made very rapidly. I've already added my key to the SSH agent on my machine but I also modified my SSH config so that all connections to the server IP use my username and identity file. I think I've ruled out configuration at this point but I'm willing to troubleshoot this further. Also of note, Docker-Compose v1 works just fine with this context.
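To put a number on how many SSH connections a single command opens, the debug output can be piped through a small counter. This is a sketch only: `count_ssh_dials` is a hypothetical helper, and the match string is copied from the `commandconn: starting ssh` log lines shown above.

```shell
# Hypothetical helper: count "starting ssh" debug lines read from stdin.
# grep -c exits non-zero when the count is 0, so `|| true` keeps the
# function usable under `set -e` while still printing "0".
count_ssh_dials() {
    grep -c 'commandconn: starting ssh' || true
}

# Usage (assumes the remote SSH context is currently selected):
#   docker --debug compose ps 2>&1 | count_ssh_dials
```

Running the same pipeline with `docker --debug ps` in place of `docker --debug compose ps` gives a baseline for comparison.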

matt0x6F avatar Feb 27 '23 06:02 matt0x6F

@matt0x6F could you please try to reproduce by directly executing the docker-compose CLI plugin? .docker/cli-plugins/docker-compose --debug ps (or /Applications/Docker.app/Contents/Resources/bin/docker-compose --debug ps when using Docker Desktop Mac)

docker compose uses the exact same Docker client set by the docker CLI, just like docker ps does, but maybe this issue happens due to the docker CLI plugin architecture.

ndeloof avatar Feb 27 '23 12:02 ndeloof

@ndeloof same output

DEBU[0000] commandconn: starting ssh with [-l <user> -- <server ip> docker system dial-stdio] 
DEBU[0002] commandconn: starting ssh with [-l <user> -- <server ip> docker system dial-stdio] 
DEBU[0002] commandconn: starting ssh with [-l <user> -- <server ip> docker system dial-stdio] 
DEBU[0002] commandconn: starting ssh with [-l <user> -- <server ip> docker system dial-stdio] 
DEBU[0002] commandconn: starting ssh with [-l <user> -- <server ip> docker system dial-stdio] 
DEBU[0002] commandconn: starting ssh with [-l <user> -- <server ip> docker system dial-stdio] 
DEBU[0002] commandconn: starting ssh with [-l <user> -- <server ip> docker system dial-stdio] 
DEBU[0002] commandconn: starting ssh with [-l <user> -- <server ip> docker system dial-stdio] 
DEBU[0002] commandconn: starting ssh with [-l <user> -- <server ip> docker system dial-stdio] 
DEBU[0002] commandconn: starting ssh with [-l <user> -- <server ip> docker system dial-stdio] 
DEBU[0002] commandconn: starting ssh with [-l <user> -- <server ip> docker system dial-stdio] 
DEBU[0002] commandconn: starting ssh with [-l <user> -- <server ip> docker system dial-stdio] 
DEBU[0002] commandconn: starting ssh with [-l <user> -- <server ip> docker system dial-stdio] 
DEBU[0002] commandconn: starting ssh with [-l <user> -- <server ip> docker system dial-stdio] 
DEBU[0002] commandconn: starting ssh with [-l <user> -- <server ip> docker system dial-stdio] 
DEBU[0002] commandconn: starting ssh with [-l <user> -- <server ip> docker system dial-stdio] 
DEBU[0002] commandconn: starting ssh with [-l <user> -- <server ip> docker system dial-stdio] 
DEBU[0002] commandconn: starting ssh with [-l <user> -- <server ip> docker system dial-stdio] 
DEBU[0002] commandconn: starting ssh with [-l <user> -- <server ip> docker system dial-stdio] 
DEBU[0002] commandconn: starting ssh with [-l <user> -- <server ip> docker system dial-stdio] 
DEBU[0002] commandconn: starting ssh with [-l <user> -- <server ip> docker system dial-stdio] 
DEBU[0002] commandconn (ssh):kex_exchange_identification: read: Connection reset by peer
 onnection reset by <server ip> port 22
error during connect: Get "http://docker.example.com/v1.42/containers/6105f0b03deaaba6f0aad2f3e9ae2c2d482a18767573691ad32c311756f8cef2/json": command [ssh -l <user> -- <server ip> docker system dial-stdio] has exited with exit status 255, please make sure the URL is valid, and Docker 18.09 or later is installed on the remote host: stderr=kex_exchange_identification: read: Connection reset by peer
Connection reset by <server ip> port 22

It's worth mentioning that ssh -l <user> -- <server ip> echo test does what I expect it to do. I'm very curious about where http://docker.example.com/v1.42/containers/6105f0b03deaaba6f0aad2f3e9ae2c2d482a18767573691ad32c311756f8cef2/json comes from, because the command after it is correct (the ssh command), which means it's picking up the context correctly.

matt0x6F avatar Feb 27 '23 20:02 matt0x6F

This fixed it for me!

politician avatar Sep 23 '23 21:09 politician

The solution found by @politician works in a non-Windows client environment, but on Windows with Microsoft's OpenSSH implementation you won't get very far: ControlMaster is not supported by Microsoft's implementation, which means Windows users pretty much cannot use SSH authentication to the Docker API.

The underlying issue seems to be that compose is spamming separate connections for each docker command, which triggers either the remote host connection limiting or an intermediate firewall (thus connection reset).

I notice this behavior consistently when e.g. running docker compose up -d for around 10-20 services, but never when doing the same for a single service.

This is still a problem as of compose 2.23.0, but may be an issue with the docker CLI itself.

LaXiS96 avatar Nov 06 '23 09:11 LaXiS96
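
For readers on a non-Windows client, the ControlMaster approach mentioned above reuses one SSH connection instead of opening a new one per request. The comment with the exact fix is not preserved in this thread, so the following is only a sketch of what such a setup usually looks like. `print_mux_stanza` is a hypothetical helper and `my-box` is the placeholder host from the reproduction steps; `ControlMaster`, `ControlPath`, and `ControlPersist` are standard OpenSSH client options.

```shell
# Hypothetical helper: print an ssh_config stanza that enables connection
# multiplexing for the given host alias. The tilde is left literal on
# purpose; ssh expands it when it reads the config.
print_mux_stanza() {
    cat <<EOF
Host $1
    ControlMaster auto
    ControlPath ~/.ssh/control-%C
    ControlPersist yes
EOF
}

# Usage: review the output first, then append it to the client config:
#   print_mux_stanza my-box >> ~/.ssh/config
```

The `Host` value has to match the host part of the `ssh://` endpoint used in the docker context, otherwise the stanza never applies.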