Contiv installer - Intermittent Install failures seen w/ latest 1.1.7 installer bits
Description
v2plugin installation failed multiple times on 2 different setups. The error messages differ between Contiv master nodes and Contiv worker nodes.
Expected Behavior
Contiv install should succeed on all Master/Worker Nodes w/o any errors.
Observed Behavior
The issue is intermittent, but one pattern is consistent: after a complete clean-up of Contiv bits from the Docker Swarm cluster, the first installation attempt fails, and a subsequent retry eventually succeeds in installing Contiv. This behaviour is seen only with the latest code changes made about 20 days ago on the 1.1.7 release. We did not see this issue during the CVD validation cycle, up to the CVD release on Dec 18th, 2017.
##Master Node install failures -
TASK [contiv_network : install v2plugin on master nodes] ***********************
fatal: [node2]: FAILED! => {"changed": true, "cmd": "/usr/bin/docker plugin install --grant-all-permissions contiv/v2plugin:1.1.7 ctrl_ip=10.65.122.63 control_url=10.65.122.63:9999 vxlan_port=8472 iflist=eno6 plugin_name=contiv/v2plugin:1.1.7 cluster_store=etcd://localhost:2379 plugin_role=master fwd_mode=bridge", "delta": "0:06:11.601524", "end": "2018-01-22 15:11:25.034534", "failed": true, "rc": 1, "start": "2018-01-22 15:05:13.433010", "stderr": "Error response from daemon: dial unix /run/docker/plugins/330e5e6cb7025e7c40805912541ff706fad4d35eb4bb34b877ea5004dfcf8511/netplugin.sock: connect: connection refused", "stderr_lines": ["Error response from daemon: dial unix /run/docker/plugins/330e5e6cb7025e7c40805912541ff706fad4d35eb4bb34b877ea5004dfcf8511/netplugin.sock: connect: connection refused"], "stdout": "1.1.7: Pulling from contiv/v2plugin\n1ba3fc0d8c93: Verifying Checksum\n1ba3fc0d8c93: Download complete\nDigest: sha256:2b610546b385bcc46ca6c76a9be7fd859a3abf4b37f529ba9df41a4dc3853c30\nStatus: Downloaded newer image for contiv/v2plugin:1.1.7", "stdout_lines": ["1.1.7: Pulling from contiv/v2plugin", "1ba3fc0d8c93: Verifying Checksum", "1ba3fc0d8c93: Download complete", "Digest: sha256:2b610546b385bcc46ca6c76a9be7fd859a3abf4b37f529ba9df41a4dc3853c30", "Status: Downloaded newer image for contiv/v2plugin:1.1.7"]}
fatal: [node1]: FAILED! => {"changed": true, "cmd": "/usr/bin/docker plugin install --grant-all-permissions contiv/v2plugin:1.1.7 ctrl_ip=10.65.122.61 control_url=10.65.122.61:9999 vxlan_port=8472 iflist=eno6 plugin_name=contiv/v2plugin:1.1.7 cluster_store=etcd://localhost:2379 plugin_role=master fwd_mode=bridge", "delta": "0:06:12.083192", "end": "2018-01-22 15:11:25.836960", "failed": true, "rc": 1, "start": "2018-01-22 15:05:13.753768", "stderr": "Error response from daemon: dial unix /run/docker/plugins/6f11c1b2fea19a72d9aa2ef95c0e85c224891f982826f815ff8a556dc640e48c/netplugin.sock: connect: no such file or directory", "stderr_lines": ["Error response from daemon: dial unix /run/docker/plugins/6f11c1b2fea19a72d9aa2ef95c0e85c224891f982826f815ff8a556dc640e48c/netplugin.sock: connect: no such file or directory"], "stdout": "1.1.7: Pulling from contiv/v2plugin\n1ba3fc0d8c93: Verifying Checksum\n1ba3fc0d8c93: Download complete\nDigest: sha256:2b610546b385bcc46ca6c76a9be7fd859a3abf4b37f529ba9df41a4dc3853c30\nStatus: Downloaded newer image for contiv/v2plugin:1.1.7", "stdout_lines": ["1.1.7: Pulling from contiv/v2plugin", "1ba3fc0d8c93: Verifying Checksum", "1ba3fc0d8c93: Download complete", "Digest: sha256:2b610546b385bcc46ca6c76a9be7fd859a3abf4b37f529ba9df41a4dc3853c30", "Status: Downloaded newer image for contiv/v2plugin:1.1.7"]}
fatal: [node3]: FAILED! => {"changed": true, "cmd": "/usr/bin/docker plugin install --grant-all-permissions contiv/v2plugin:1.1.7 ctrl_ip=10.65.122.62 control_url=10.65.122.62:9999 vxlan_port=8472 iflist=eno6 plugin_name=contiv/v2plugin:1.1.7 cluster_store=etcd://localhost:2379 plugin_role=master fwd_mode=bridge", "delta": "0:06:12.404043", "end": "2018-01-22 15:11:25.136644", "failed": true, "rc": 1, "start": "2018-01-22 15:05:12.732601", "stderr": "Error response from daemon: dial unix /run/docker/plugins/9c15133fdbe9ee55f4054b0f3af7fbd9be9ae8efc0bfd72d70b791f3ecfb27fd/netplugin.sock: connect: no such file or directory", "stderr_lines": ["Error response from daemon: dial unix /run/docker/plugins/9c15133fdbe9ee55f4054b0f3af7fbd9be9ae8efc0bfd72d70b791f3ecfb27fd/netplugin.sock: connect: no such file or directory"], "stdout": "1.1.7: Pulling from contiv/v2plugin\n1ba3fc0d8c93: Verifying Checksum\n1ba3fc0d8c93: Download complete\nDigest: sha256:2b610546b385bcc46ca6c76a9be7fd859a3abf4b37f529ba9df41a4dc3853c30\nStatus: Downloaded newer image for contiv/v2plugin:1.1.7", "stdout_lines": ["1.1.7: Pulling from contiv/v2plugin", "1ba3fc0d8c93: Verifying Checksum", "1ba3fc0d8c93: Download complete", "Digest: sha256:2b610546b385bcc46ca6c76a9be7fd859a3abf4b37f529ba9df41a4dc3853c30", "Status: Downloaded newer image for contiv/v2plugin:1.1.7"]}
to retry, use: --limit @/ansible/install_plays.retry
PLAY RECAP *********************************************************************
node1 : ok=17 changed=9 unreachable=0 failed=1
node2 : ok=17 changed=9 unreachable=0 failed=1
node3 : ok=17 changed=9 unreachable=0 failed=1
node4 : ok=9 changed=4 unreachable=0 failed=0
node5 : ok=9 changed=4 unreachable=0 failed=0
node6 : ok=9 changed=4 unreachable=0 failed=0
node7 : ok=9 changed=4 unreachable=0 failed=0
node8 : ok=9 changed=4 unreachable=0 failed=0
node9 : ok=9 changed=4 unreachable=0 failed=0
##Worker Node install failures -
TASK [contiv_network : install v2plugin on worker nodes] ***********************
fatal: [node6]: FAILED! => {"changed": true, "cmd": "/usr/bin/docker plugin install --grant-all-permissions contiv/v2plugin:1.1.7 ctrl_ip=10.65.121.140 control_url=10.65.121.140:9999 vxlan_port=8472 iflist=ens192 plugin_name=contiv/v2plugin:1.1.7 cluster_store=etcd://localhost:2379 plugin_role=worker fwd_mode=bridge", "delta": "0:04:51.934836", "end": "2018-01-25 11:38:37.231374", "failed": true, "rc": 1, "start": "2018-01-25 11:33:45.296538", "stderr": "failed to download: unexpected EOF", "stderr_lines": ["failed to download: unexpected EOF"], "stdout": "1.1.7: Pulling from contiv/v2plugin", "stdout_lines": ["1.1.7: Pulling from contiv/v2plugin"]}
fatal: [node7]: FAILED! => {"changed": true, "cmd": "/usr/bin/docker plugin install --grant-all-permissions contiv/v2plugin:1.1.7 ctrl_ip=10.65.121.141 control_url=10.65.121.141:9999 vxlan_port=8472 iflist=ens192 plugin_name=contiv/v2plugin:1.1.7 cluster_store=etcd://localhost:2379 plugin_role=worker fwd_mode=bridge", "delta": "0:04:52.343379", "end": "2018-01-25 11:38:44.770569", "failed": true, "rc": 1, "start": "2018-01-25 11:33:52.427190", "stderr": "failed to download: unexpected EOF", "stderr_lines": ["failed to download: unexpected EOF"], "stdout": "1.1.7: Pulling from contiv/v2plugin", "stdout_lines": ["1.1.7: Pulling from contiv/v2plugin"]}
fatal: [node4]: FAILED! => {"changed": true, "cmd": "/usr/bin/docker plugin install --grant-all-permissions contiv/v2plugin:1.1.7 ctrl_ip=10.65.121.142 control_url=10.65.121.142:9999 vxlan_port=8472 iflist=ens192 plugin_name=contiv/v2plugin:1.1.7 cluster_store=etcd://localhost:2379 plugin_role=worker fwd_mode=bridge", "delta": "0:04:52.475222", "end": "2018-01-25 11:38:46.382501", "failed": true, "rc": 1, "start": "2018-01-25 11:33:53.907279", "stderr": "failed to download: unexpected EOF", "stderr_lines": ["failed to download: unexpected EOF"], "stdout": "1.1.7: Pulling from contiv/v2plugin", "stdout_lines": ["1.1.7: Pulling from contiv/v2plugin"]}
fatal: [node8]: FAILED! => {"changed": true, "cmd": "/usr/bin/docker plugin install --grant-all-permissions contiv/v2plugin:1.1.7 ctrl_ip=10.65.121.130 control_url=10.65.121.130:9999 vxlan_port=8472 iflist=ens192 plugin_name=contiv/v2plugin:1.1.7 cluster_store=etcd://localhost:2379 plugin_role=worker fwd_mode=bridge", "delta": "0:04:54.685860", "end": "2018-01-25 11:38:48.099427", "failed": true, "rc": 1, "start": "2018-01-25 11:33:53.413567", "stderr": "failed to download: unexpected EOF", "stderr_lines": ["failed to download: unexpected EOF"], "stdout": "1.1.7: Pulling from contiv/v2plugin", "stdout_lines": ["1.1.7: Pulling from contiv/v2plugin"]}
fatal: [node5]: FAILED! => {"changed": true, "cmd": "/usr/bin/docker plugin install --grant-all-permissions contiv/v2plugin:1.1.7 ctrl_ip=10.65.121.143 control_url=10.65.121.143:9999 vxlan_port=8472 iflist=ens192 plugin_name=contiv/v2plugin:1.1.7 cluster_store=etcd://localhost:2379 plugin_role=worker fwd_mode=bridge", "delta": "0:04:55.817107", "end": "2018-01-25 11:38:49.210135", "failed": true, "rc": 1, "start": "2018-01-25 11:33:53.393028", "stderr": "failed to download: unexpected EOF", "stderr_lines": ["failed to download: unexpected EOF"], "stdout": "1.1.7: Pulling from contiv/v2plugin", "stdout_lines": ["1.1.7: Pulling from contiv/v2plugin"]}
fatal: [node12]: FAILED! => {"changed": true, "cmd": "/usr/bin/docker plugin install --grant-all-permissions contiv/v2plugin:1.1.7 ctrl_ip=10.65.121.129 control_url=10.65.121.129:9999 vxlan_port=8472 iflist=ens192 plugin_name=contiv/v2plugin:1.1.7 cluster_store=etcd://localhost:2379 plugin_role=worker fwd_mode=bridge", "delta": "0:01:54.202116", "end": "2018-01-25 11:40:35.330632", "failed": true, "rc": 1, "start": "2018-01-25 11:38:41.128516", "stderr": "failed to download: unexpected EOF", "stderr_lines": ["failed to download: unexpected EOF"], "stdout": "1.1.7: Pulling from contiv/v2plugin", "stdout_lines": ["1.1.7: Pulling from contiv/v2plugin"]}
fatal: [node11]: FAILED! => {"changed": true, "cmd": "/usr/bin/docker plugin install --grant-all-permissions contiv/v2plugin:1.1.7 ctrl_ip=10.65.121.128 control_url=10.65.121.128:9999 vxlan_port=8472 iflist=ens192 plugin_name=contiv/v2plugin:1.1.7 cluster_store=etcd://localhost:2379 plugin_role=worker fwd_mode=bridge", "delta": "0:01:56.424311", "end": "2018-01-25 11:40:43.263658", "failed": true, "rc": 1, "start": "2018-01-25 11:38:46.839347", "stderr": "failed to download: unexpected EOF", "stderr_lines": ["failed to download: unexpected EOF"], "stdout": "1.1.7: Pulling from contiv/v2plugin", "stdout_lines": ["1.1.7: Pulling from contiv/v2plugin"]}
fatal: [node9]: FAILED! => {"changed": true, "cmd": "/usr/bin/docker plugin install --grant-all-permissions contiv/v2plugin:1.1.7 ctrl_ip=10.65.121.124 control_url=10.65.121.124:9999 vxlan_port=8472 iflist=eno6 plugin_name=contiv/v2plugin:1.1.7 cluster_store=etcd://localhost:2379 plugin_role=worker fwd_mode=bridge", "delta": "0:02:54.790835", "end": "2018-01-25 11:41:46.656811", "failed": true, "rc": 1, "start": "2018-01-25 11:38:51.865976", "stderr": "failed to download: unexpected EOF", "stderr_lines": ["failed to download: unexpected EOF"], "stdout": "1.1.7: Pulling from contiv/v2plugin", "stdout_lines": ["1.1.7: Pulling from contiv/v2plugin"]}
changed: [node10]
##Worker node failure key error message -
"failed": true, "rc": 1, "start": "2018-01-25 11:33:45.296538", "stderr": "failed to download: unexpected EOF", "stderr_lines": ["failed to download: unexpected EOF"], "stdout": "1.1.7: Pulling from contiv/v2plugin", "stdout_lines": ["1.1.7: Pulling from contiv/v2plugin"]}
##Master node failure key error message -
"stderr": "Error response from daemon: dial unix /run/docker/plugins/330e5e6cb7025e7c40805912541ff706fad4d35eb4bb34b877ea5004dfcf8511/netplugin.sock: connect: connection refused", "stderr_lines": ["Error response from daemon: dial unix /run/docker/plugins/330e5e6cb7025e7c40805912541ff706fad4d35eb4bb34b877ea5004dfcf8511/netplugin.sock: connect: connection refused"], "stdout": "1.1.7: Pulling from contiv/v2plugin\n1ba3fc0d8c93: Verifying Checksum\n1ba3fc0d8c93: Download complete\nDigest: sha256:2b610546b385bcc46ca6c76a9be7fd859a3abf4b37f529ba9df41a4dc3853c30\nStatus: Downloaded newer image for contiv/v2plugin:1.1.7", "stdout_lines": ["1.1.7: Pulling from contiv/v2plugin", "1ba3fc0d8c93: Verifying Checksum", "1ba3fc0d8c93: Download complete", "Digest: sha256:2b610546b385bcc46ca6c76a9be7fd859a3abf4b37f529ba9df41a4dc3853c30", "Status: Downloaded newer image for contiv/v2plugin:1.1.7"]}
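To compare the key error per node without reading the full task output, the `fatal:` records above can be reduced to a host-to-stderr map. The following is a minimal sketch, not part of the Contiv installer; the function name `summarize_failures` is hypothetical, and the sample records are trimmed copies of the output above (only `changed`, `rc`, and `stderr` are kept).

```python
import json
import re

# Matches the prefix of an Ansible failure record, e.g. "fatal: [node6]: FAILED! => ".
# \s* also tolerates a line break between "FAILED!" and "=>".
HEADER_RE = re.compile(r"fatal: \[([^\]]+)\]: FAILED!\s*=>\s*")
_DECODER = json.JSONDecoder()


def summarize_failures(log_text):
    """Return {host: stderr} for every 'fatal' task result found in log_text."""
    failures = {}
    for header in HEADER_RE.finditer(log_text):
        try:
            # raw_decode parses exactly one JSON object starting at the given
            # offset, so several records on one physical line are handled.
            result, _ = _DECODER.raw_decode(log_text, header.end())
        except json.JSONDecodeError:
            continue  # truncated or non-JSON record: skip it
        failures[header.group(1)] = result.get("stderr", "")
    return failures


# Trimmed sample records (assumption: fields other than stderr are not needed here).
sample = (
    'fatal: [node6]: FAILED! => {"changed": true, "rc": 1, '
    '"stderr": "failed to download: unexpected EOF"} '
    'fatal: [node2]: FAILED! => {"changed": true, "rc": 1, '
    '"stderr": "Error response from daemon: dial unix /run/docker/plugins/'
    '330e5e6cb7025e7c40805912541ff706fad4d35eb4bb34b877ea5004dfcf8511/'
    'netplugin.sock: connect: connection refused"}'
)

for host, stderr in sorted(summarize_failures(sample).items()):
    print(host, "->", stderr)
```

Run against the attached install logs, this should make it easy to see which nodes hit the download error and which hit the `netplugin.sock` error.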
Steps to Reproduce (for bugs)
- Create a DEE swarm mode cluster setup with 3 master nodes and a couple of worker nodes
- Download the latest Contiv installer bits (version 1.1.7, full install) from the Contiv GitHub install release location
- Modify cfg.yml and env.json to suit your cluster environment
- Issue command for installation -
./install/ansible/install_swarm.sh -f install/ansible/cfg.yml -u root -e ~/.ssh/id_rsa -p
Your Environment
- netctl version - 1.1.7/v2Plugin
- Orchestrator version (e.g. kubernetes, mesos, swarm): Swarm/UCP2.2.4/Docker Engine17.06.2-ee-6
- Operating System and version: RHEL7.3
##Installation logs are attached herewith - contiv_install_01-22-2018.09-34-14.UTC.log contiv_install_01-25-2018.05-56-47.UTC.log
Looking at the attached logs contiv_install_01-22-2018.09-34-14.UTC.log and contiv_install_01-25-2018.05-56-47.UTC.log, I see failures when the contiv docker v2plugin was installed.
The following command failed on both master and worker nodes in the logs:
/usr/bin/docker plugin install --grant-all-permissions contiv/v2plugin:1.1.7 ctrl_ip=<IP> control_url=<IP>:9999 vxlan_port=8472 iflist=<interface> plugin_name=contiv/v2plugin:1.1.7 cluster_store=etcd://localhost:2379 plugin_role=[master|worker] fwd_mode=bridge
Can you send the logs in /var/log/contiv/ and /var/log/contiv*.log from the master and worker nodes that saw this issue?
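One way to gather the requested files on each node is sketched below. It is only an illustration: the function name `collect_contiv_logs` and the archive name are hypothetical, and it assumes the logs live under /var/log as described above.

```python
import glob
import os
import tarfile


def collect_contiv_logs(archive_path, log_root="/var/log"):
    """Bundle <log_root>/contiv/ and <log_root>/contiv*.log into a .tar.gz.

    Returns the list of paths that were found and added.
    """
    patterns = [
        os.path.join(log_root, "contiv"),       # the log directory, added recursively
        os.path.join(log_root, "contiv*.log"),  # top-level installer/plugin logs
    ]
    found = []
    for pattern in patterns:
        found.extend(sorted(glob.glob(pattern)))

    with tarfile.open(archive_path, "w:gz") as archive:
        for path in found:
            # Store paths relative to log_root so the archive unpacks cleanly.
            archive.add(path, arcname=os.path.relpath(path, log_root))
    return found
```

For example, `collect_contiv_logs("/tmp/contiv-logs.tar.gz")` run as root on a node would produce one archive to attach; an empty return value means the node has no Contiv logs at all.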
Worker node install failures - the worker nodes don't have a /var/log/contiv/ folder or any other Contiv logs, so I am attaching logs from the corresponding master nodes in the same cluster - contiv-master-logs-workerfailure.tar.gz
Master node install failures (as observed on the 2nd cluster) - contiv-master-node-logs.tar.gz
In this case the master nodes don't have netctl installed, though netplugin booted up cleanly -
[root@DEE-Ctrl-1 contiv]# cat plugin_bootup.log
2018-01-22T09:41:03Z|00001|vlog|INFO|opened log file /var/log/contiv/ovs-db.log
2018-01-22T09:41:03Z|00001|vlog|INFO|opened log file /var/log/contiv/ovs-vswitchd.log
Waiting for netmaster to be ready for connections
Netmaster ready for connections, setting forward mode to bridge
Forward mode is set
n-if=eno6 -cluster-store=etcd://localhost:2379 -ctrl-ip=10.65.122.61
/netmaster -plugin-name=contiv/v2plugin:1.1.7 -cluster-mode=swarm-mode -cluster-store=etcd://localhost:2379 -control-url=10.65.122.61:9999
Also, docker plugin ls doesn't list Contiv -
[root@DEE-Ctrl-1 contiv]# docker plugin ls
ID             NAME                                          DESCRIPTION                   ENABLED
631d379403b4   docker/telemetry:1.0.0.linux-x86_64-stable    Docker Inc. metrics exporter  false
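The same check can be scripted across nodes by parsing the default table output of `docker plugin ls`. This is a rough sketch; `find_plugin` is a hypothetical helper, and it assumes the default layout shown above, where the plugin name is the second whitespace-separated field and the enabled flag is the last.

```python
def find_plugin(ls_output, name_prefix="contiv/v2plugin"):
    """Return (name, enabled) for the first plugin whose name starts with
    name_prefix, or None if no such plugin is listed."""
    lines = ls_output.strip().splitlines()
    for line in lines[1:]:  # skip the "ID NAME DESCRIPTION ENABLED" header
        fields = line.split()
        if len(fields) < 3:
            continue  # blank or malformed row
        name, enabled = fields[1], fields[-1]
        if name.startswith(name_prefix):
            return name, enabled == "true"
    return None


# Output captured on DEE-Ctrl-1, copied from the listing above.
observed = """\
ID             NAME                                          DESCRIPTION                   ENABLED
631d379403b4   docker/telemetry:1.0.0.linux-x86_64-stable    Docker Inc. metrics exporter  false
"""

print(find_plugin(observed))  # Contiv is not listed, so this prints None
```

A `None` result on a master node matches what is reported here: the plugin process booted, but Docker does not consider the plugin installed.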
@rkharya: Have you reproduced this on CentOS or on another distribution?
@unclejack: Reproducible on RHEL7.3 environments - BareMetal and BareMetal with VMs