compose icon indicating copy to clipboard operation
compose copied to clipboard

Regression: 1.28 fails with "kex_exchange_identification: Connection closed by remote host"

Open jamshid opened this issue 4 years ago • 12 comments

Description of the issue

A docker-compose up -d fails with a kex_exchange_identification error when DOCKER_HOST=ssh:// points to a mac. I do have ssh keys provisioned in ~/.ssh/config for password-less login. This worked with the previous 1.27 version.

Context information (for bug reports)

Output of docker-compose version

docker-compose version 1.28.5, build c4eb3a1f

Output of docker version

Client: Docker Engine - Community
 Cloud integration: 1.0.9
 Version:           20.10.5
 API version:       1.41
 Go version:        go1.13.15
 Git commit:        55c4c88
 Built:             Tue Mar  2 20:13:00 2021
 OS/Arch:           darwin/amd64
 Context:           default
 Experimental:      true

Server: Docker Engine - Community
 Engine:
  Version:          20.10.5
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.13.15
  Git commit:       363e9a8
  Built:            Tue Mar  2 20:15:47 2021
  OS/Arch:          linux/amd64
  Experimental:     true
 containerd:
  Version:          1.4.3
  GitCommit:        269548fa27e0089a8b8278fc4fc781d7f65a939b
 runc:
  Version:          1.0.0-rc92
  GitCommit:        ff819c7e9184c13b7c2607fe6c30ae19403a7aff
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0
 Kubernetes:
  Version:          v1.19.3
  StackAPI:         Unknown

Output of docker-compose config This is my stripped down compose file that still reproduces the error:

# Stripped down docker-compose.yml where up fails due to error:
#     kex_exchange_identification: Connection closed by remote host
#
# % docker-compose --log-level DEBUG up -d
# time="2021-03-10T14:41:14-06:00" level=error msg="error while CloseRead (stream to stdout): close unix ->docker.sock: shutdown: socket is not connected"
# time="2021-03-10T14:41:14-06:00" level=error msg="error while CloseRead (stream to stdout): close unix ->docker.sock: shutdown: socket is not connected"
# WARNING: The Docker Engine you're using is running in swarm mode.
# 
# Compose does not use swarm mode to deploy services to multiple nodes in a swarm. All containers will be scheduled on the current node.
# 
# To deploy your application across the swarm, use `docker stack deploy`.
# 
# Creating network "test_default" with driver "bridge"
# Creating volume "test_esdata" with default driver
# ...
# Creating volume "test_syslog-logs" with default driver
# 
# kex_exchange_identification: Connection closed by remote host
# 
# time="2021-03-10T14:41:24-06:00" level=error msg="error while CloseRead (stream to stdout): close unix ->docker.sock: shutdown: socket is not connected"
# time="2021-03-10T14:41:24-06:00" level=error msg="error while CloseRead (stream to stdout): close unix ->docker.sock: shutdown: socket is not connected"
# ...
#

networks:
  default:
    attachable: true
    driver: bridge
services:
  dnsmasq:
    cap_add:
    - NET_ADMIN
    command: --log-facility=- --interface=*
    deploy:
      resources:
        limits:
          memory: 256m
    expose:
    - 53/tcp
    image: andyshinn/dnsmasq:2.76
    ports:
    - protocol: tcp
      published: 4253
      target: 53
    - protocol: udp
      published: 4253
      target: 53
    restart: "no"
  etcd:
    command: /usr/local/bin/etcd --data-dir=/etcd-data --name node1 --initial-advertise-peer-urls
      "http://etcd:2380" --listen-peer-urls http://0.0.0.0:2380 --advertise-client-urls
      "http://etcd:2379" --listen-client-urls http://0.0.0.0:2379 --initial-cluster
      node1="http://etcd:2380"
    deploy:
      resources:
        limits:
          memory: 256m
    image: quay.io/coreos/etcd:v3.3.18
    ports:
    - protocol: tcp
      target: 2379
    - protocol: tcp
      target: 2380
    restart: "no"
    volumes:
    - etcd-data:/etcd-data:rw
  loki:
    command: -config.file=/etc/loki/local-config.yaml
    deploy:
      resources:
        limits:
          memory: 200m
    image: grafana/loki:1.6.1
    ports:
    - published: 4231
      target: 3100
    restart: "no"
    volumes:
    - loki-data:/loki:rw
  prometheus:
    deploy:
      resources:
        limits:
          memory: 1g
    expose:
    - 9090/tcp
    image: prom/prometheus:v2.23.0
    ports:
    - protocol: tcp
      published: 4209
      target: 9090
    restart: "no"
    volumes:
    - prometheus-etc:/etc/prometheus:rw
    - prometheus-data:/prometheus:rw
  promtail:
    depends_on:
      syslog:
        condition: service_started
    deploy:
      resources:
        limits:
          memory: 100m
    image: grafana/promtail:1.6.1
    restart: "no"
    volumes:
    - syslog-logs:/var/log:rw
  socat_etcd:
    command: TCP-LISTEN:80,fork,reuseaddr TCP:etcd:2379
    depends_on:
      etcd:
        condition: service_started
    deploy:
      resources:
        limits:
          memory: 25m
    image: alpine/socat:1.0.5
    ports:
    - published: 4227
      target: 80
  syslog:
    deploy:
      resources:
        limits:
          memory: 256m
    environment:
      FOO_DOMAIN: foo
      FOO_HOST: foo
      FOO_PORT: '80'
    expose:
    - 514/tcp
    - 514/udp
    - '8888'
    image: balabit/syslog-ng
    ports:
    - published: 4222
      target: 22
    - protocol: tcp
      published: 4251
      target: 514
    - protocol: udp
      published: 4251
      target: 514
    - published: 4290
      target: 8090
    - published: 4288
      target: 8888
    restart: "no"
version: '3.9'
volumes:
  esdata: {}
  esleaderdata: {}
  etcd-data: {}
  grafana-data: {}
  grafana-etc: {}
  loki-data: {}
  prometheus-data: {}
  prometheus-etc: {}
  syslog-logs: {}

Steps to reproduce the issue

  1. docker-compose --log-level DEBUG up -d
  2. See a lot of harmless "error while CloseRead errors (https://github.com/docker/compose/issues/8190)
  3. See the operation fails due to kex_exchange_identification: Connection closed by remote host

Observed result

kex_exchange_identification: Connection closed by remote host

Expected result

No error, this worked with previous compose (and macOS)

Stacktrace / full error message

kex_exchange_identification: Connection closed by remote host

Additional information

OS version / distribution, docker-compose install method, etc.

This broke when I upgraded my client mac and server mac to Docker Desktop 3.2.1. It was working with previous release. I also just upgraded both macOS to latest patch 11.2.3, I guess that could have introduced the problem too.

You might have to increase the server's sshd MaxSessions -- I've had to do this ever since switching from DOCKER_HOST=tcp://... to DOCKER_HOST=ssh://.

echo 'MaxSessions 500' | sudo tee -a /etc/ssh/sshd_config && sudo launchctl stop com.openssh.sshd

jamshid avatar Mar 10 '21 21:03 jamshid

Here is the successful run after downloading docker-compose 1.27.4. The output sure is different. Don't know why it says ssh://localhost:None, but I do have DOCKER_HOST set and exported correctly and it is using my remote docker server.

% echo $DOCKER_HOST
ssh://192.168.1.NN
% /Users/jamshid/Downloads/docker-compose-Darwin-x86_64-2 --version
docker-compose version 1.27.4, build 40524192
% ~/Downloads/docker-compose-Darwin-x86_64-2 --log-level DEBUG up -d 
Using configuration files: ./docker-compose.yml
Trying paths: ['/Users/jamshid/.docker/config.json', '/Users/jamshid/.dockercfg']
Found file at path: /Users/jamshid/.docker/config.json
Found 'auths' section
Auth data for https://index.docker.io/v1/ is absent. Client might be using a credentials store instead.
...
Found 'credsStore' section
ssh://localhost:None "GET /version HTTP/1.1" 200 None
ssh://localhost:None "GET /v1.41/networks/test_default HTTP/1.1" 200 None
ssh://localhost:None "GET /v1.41/volumes/test_etcd-data HTTP/1.1" 200 None
ssh://localhost:None "GET /v1.41/volumes/test_loki-data HTTP/1.1" 200 None
ssh://localhost:None "GET /v1.41/volumes/test_prometheus-etc HTTP/1.1" 200 None
ssh://localhost:None "GET /v1.41/volumes/test_prometheus-data HTTP/1.1" 200 None
ssh://localhost:None "GET /v1.41/volumes/test_syslog-logs HTTP/1.1" 200 None
ssh://localhost:None "GET /v1.41/info HTTP/1.1" 200 None
WARNING: The Docker Engine you're using is running in swarm mode.

Compose does not use swarm mode to deploy services to multiple nodes in a swarm. All containers will be scheduled on the current node.

To deploy your application across the swarm, use `docker stack deploy`.

ssh://localhost:None "GET /v1.41/networks/test_default HTTP/1.1" 200 None
ssh://localhost:None "GET /v1.41/volumes/test_esdata HTTP/1.1" 200 None
...
ssh://localhost:None "GET /v1.41/volumes/test_syslog-logs HTTP/1.1" 200 None
ssh://localhost:None "GET /v1.41/containers/json?limit=-1&all=0&size=0&trunc_cmd=0&filters=%7B%22label%22%3A+%5B%22com.docker.compose.project%3Dtest%22%2C+%22com.docker.compose.oneoff%3DFalse%22%5D%7D HTTP/1.1" 200 None
...
ssh://localhost:None "GET /v1.41/containers/json?limit=-1&all=1&size=0&trunc_cmd=0&filters=%7B%22label%22%3A+%5B%22com.docker.compose.project%3Dtest%22%2C+%22com.docker.compose.service%3Dpromtail%22%2C+%22com.docker.compose.oneoff%3DFalse%22%5D%7D HTTP/1.1" 200 None
ssh://localhost:None "GET /v1.41/images/andyshinn/dnsmasq:2.76/json HTTP/1.1" 200 None
ssh://localhost:None "GET /v1.41/images/quay.io/coreos/etcd:v3.3.18/json HTTP/1.1" 200 None
ssh://localhost:None "GET /v1.41/images/grafana/loki:1.6.1/json HTTP/1.1" 200 None
ssh://localhost:None "GET /v1.41/images/prom/prometheus:v2.23.0/json HTTP/1.1" 200 None
ssh://localhost:None "GET /v1.41/images/alpine/socat:1.0.5/json HTTP/1.1" 200 None
ssh://localhost:None "GET /v1.41/images/balabit/syslog-ng/json HTTP/1.1" 200 None
ssh://localhost:None "GET /v1.41/images/grafana/promtail:1.6.1/json HTTP/1.1" 200 None
ssh://localhost:None "GET /v1.41/containers/json?limit=-1&all=1&size=0&trunc_cmd=0&filters=%7B%22label%22%3A+%5B%22com.docker.compose.project%3Dtest%22%2C+%22com.docker.compose.service%3Ddnsmasq%22%2C+%22com.docker.compose.oneoff%3DFalse%22%5D%7D HTTP/1.1" 200 None
...
Pending: {<Service: socat_etcd>, <Service: loki>, <Service: dnsmasq>, <Service: prometheus>, <Service: promtail>, <Service: syslog>, <Service: etcd>}
Starting producer thread for <Service: loki>
Starting producer thread for <Service: dnsmasq>
Starting producer thread for <Service: prometheus>
Starting producer thread for <Service: syslog>
Starting producer thread for <Service: etcd>
ssh://localhost:None "GET /v1.41/containers/json?limit=-1&all=1&size=0&trunc_cmd=0&filters=%7B%22label%22%3A+%5B%22com.docker.compose.project%3Dtest%22%2C+%22com.docker.compose.service%3Dprometheus%22%2C+%22com.docker.compose.oneoff%3DFalse%22%5D%7D HTTP/1.1" 200 None
...
Starting producer thread for ServiceName(project='test', service='prometheus', number=1)
ssh://localhost:None "GET /v1.41/containers/json?limit=-1&all=1&size=0&trunc_cmd=0&filters=%7B%22label%22%3A+%5B%22com.docker.compose.project%3Dtest%22%2C+%22com.docker.compose.service%3Dsyslog%22%2C+%22com.docker.compose.oneoff%3DFalse%22%5D%7D HTTP/1.1" 200 None
Creating test_syslog_1     ... 
Pending: {ServiceName(project='test', service='syslog', number=1)}
Starting producer thread for ServiceName(project='test', service='syslog', number=1)
Pending: {<Service: socat_etcd>, <Service: promtail>}
Pending: set()
Pending: set()
ssh://localhost:None "GET /v1.41/containers/json?limit=-1&all=1&size=0&trunc_cmd=0&filters=%7B%22label%22%3A+%5B%22com.docker.compose.project%3Dtest%22%2C+%22com.docker.compose.service%3Detcd%22%2C+%22com.docker.compose.oneoff%3DFalse%22%5D%7D HTTP/1.1" 200 None
...
Creating test_etcd_1       ... 
...
ssh://localhost:None "POST /v1.41/containers/f9b8d98dceb0299f53e33ce340ff46007d73d7d13ff45de15b5c7c4df96eed2f/start HTTP/1.1" 204 0
Pending: set()
Finished processing: <Service: promtail>
Pending: set()
Pending: set()
Finished processing: <Service: socat_etcd>
Pending: set()

jamshid avatar Mar 10 '21 21:03 jamshid

I seem to be running into this as well after updating to 1.28.5

$ docker-compose version
docker-compose version 1.28.5, build c4eb3a1f
docker-py version: 4.4.4
CPython version: 3.9.0
OpenSSL version: OpenSSL 1.1.1h  22 Sep 2020
$ docker version
Client: Docker Engine - Community
 Cloud integration: 1.0.9
 Version:           20.10.5
 API version:       1.41
 Go version:        go1.13.15
 Git commit:        55c4c88
 Built:             Tue Mar  2 20:13:00 2021
 OS/Arch:           darwin/amd64
 Context:           default
 Experimental:      true

Server: Docker Engine - Community
 Engine:
  Version:          20.10.5
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.13.15
  Git commit:       363e9a8
  Built:            Tue Mar  2 20:15:47 2021
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.4.3
  GitCommit:        269548fa27e0089a8b8278fc4fc781d7f65a939b
 runc:
  Version:          1.0.0-rc92
  GitCommit:        ff819c7e9184c13b7c2607fe6c30ae19403a7aff
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0
$ 

scbash avatar Apr 02 '21 00:04 scbash

Still seeing this regression with docker-compose 1.29.0.

jamshid avatar Apr 14 '21 17:04 jamshid

For the record COMPOSE_PARAMIKO_SSH=1 environment variable appears to work around this problem (and significantly speeds up the operation...).

scbash avatar Apr 14 '21 18:04 scbash

Thanks, is that its older ssh implementation? I get this error, probably because I use non-default key names in my ~/.ssh/ (https://github.com/docker/compose/issues/7514).

Traceback (most recent call last):
  File "docker-compose", line 3, in <module>
  File "compose/cli/main.py", line 81, in main
  File "compose/cli/main.py", line 199, in perform_command
  File "compose/cli/command.py", line 60, in project_from_options
  File "compose/cli/command.py", line 152, in get_project
  File "compose/cli/docker_client.py", line 41, in get_client
  File "compose/cli/docker_client.py", line 170, in docker_client
  File "docker/api/client.py", line 171, in __init__
  File "docker/transport/sshconn.py", line 180, in __init__
  File "docker/transport/sshconn.py", line 225, in _connect
  File "paramiko/client.py", line 435, in connect
  File "paramiko/client.py", line 765, in _auth
paramiko.ssh_exception.SSHException: No authentication methods available
[13042] Failed to execute script docker-compose

jamshid avatar Apr 14 '21 18:04 jamshid

Yes, COMPOSE_PARAMIKO_SSH forces docker-compose back to the 1.27 and earlier SSH implementation. Honestly for me it's SO much faster (30 seconds vs 4+ minutes) I'm not sure I want to move forward... (but that should probably be a different ticket)

scbash avatar Apr 14 '21 22:04 scbash

The problem does still reproduce with our use-case with version 1.29.2. We are still unsure how to reliably re-produce the error - some systems do have the problems, other don't. It appears to be related to versions of involved system libraries.

Edit: We do experience the problem when running docker-compose on a Linux machine to deploy to another Linux machine. So Mac on its own does not seem to be the problem here.

arnegroskurth avatar Sep 10 '21 15:09 arnegroskurth

Seeing the same problem compose 1.29.2 in windows docker desktop 3.5.1 talking to Docker running on UBUNTU-20.04 Setting COMPOSE_PARAMIKO_SSH=1 seems to fix this.

markddrake avatar Sep 12 '21 18:09 markddrake

I am having the same issue linux to linux -- docker-compose 1.29.2 (both machines) Local OS - Pop os 21.10 Remote OS - Ubuntu 20.04

Trying to start quite a few containers on a remote server with docker-compose and docker context (ssh) when running into the same problem - the peer drops the connection about 80% of the way through starting or building.

Not sure if I am on the right path but I figured I'd share after sitting here for a few hours....

On the remote server, I ran tail -f /var/log/auth.log to view the auth log. On my host I ran docker-compose --context remote up to start my docker services on the remote server.

I noticed hundreds of ssh login sessions. It consistently closes the ssh connection when it hits about 150 connections. I tested this 3 times. Are there supposed to be so many SSH sessions starting when running the docker-compose up command with context?

Either way - easy solution but not sure if it is the right way to fix this -- I updated /etc/ssh/sshd_config (on the remote server) to have MaxStartups 500 (some crazy high number). Restart the ssh service with service sshd restart and run the up command again from the local machine. This time it starts just fine!

nickisyourfan avatar Jan 27 '22 00:01 nickisyourfan

Ah interesting... Thank you @NickisyourFan! I had run into an issue with MaxSessions pre-1.28 (the Paramiko SSH implementation), so I experimented with increasing that, but I didn't try MaxStartups. Since there doesn't seem to be much motivation to fix this bug, I'll experiment with those settings as a potential work around.

scbash avatar Jan 27 '22 16:01 scbash

Why is this workaround needed? I had this issue with a remote docker engine on CentOS and a local WSL ubuntu docker-compose 1.29.2 client and would like to see this work without workarounds. Setting COMPOSE_PARAMIKO_SSH=1 does fix it though

austinsasko avatar Mar 02 '22 02:03 austinsasko

I can confirm that increasing MaxSessions worked for me too. 👍

Too bad I didn't get here sooner, I had to find out the hard way 🥲

ondrados avatar Aug 09 '22 15:08 ondrados

FWIW I'm not sure MaxSessions (or MaxStartups) on the remote server actually helps. I was still getting this error using the docker-compose V2.15 client on ubuntu 20 to DOCKER_HOST=ssh://ubuntu22-server, latest docker 23.0 installed.

Run with docker --debug compose --timeout 120 ... to see the ssh connections being failed or denied. Btw note COMPOSE_HTTP_TIMEOUT no longer seems to be used, docs say it's deprecated.

Adding this to the user's ~/.ssh/config does seem to finally fix the stubborn, intermittent kex_exchange_identification failures.

The settings are documented in https://docs.docker.com/engine/security/protect-access/ but I restrict it to known docker hosts to avoid it causing ssh problems for other servers and git.

# Avoids `kex_exchange_identification: Connection closed by remote host` failures
# on docker-compose V2 operations when using 
# DOCKER_HOST=ssh://remote-docker.example.com style connections.
Host remote-docker.example.com
   ControlMaster     auto
   ControlPath       ~/.ssh/control-%C
   ControlPersist    yes

jamshid avatar Feb 28 '23 18:02 jamshid

I'm closing this issue as obsolete: Docker Compose V1 has reached end-of-life and we are not accepting any more changes. Please try and reproduce your issue with Compose V2 and create a new issue or PR with the relevant Compose V2 information.

ndeloof avatar Aug 07 '23 08:08 ndeloof