
3.11 deployment issue

Open ryannix123 opened this issue 6 years ago • 23 comments

TASK [openshift_control_plane : Wait for all control plane pods to become ready] *********************************************************
FAILED - RETRYING: Wait for all control plane pods to become ready (60 retries left).
FAILED - RETRYING: Wait for all control plane pods to become ready (59 retries left).
FAILED - RETRYING: Wait for all control plane pods to become ready (58 retries left).
FAILED - RETRYING: Wait for all control plane pods to become ready (57 retries left).
FAILED - RETRYING: Wait for all control plane pods to become ready (56 retries left).
FAILED - RETRYING: Wait for all control plane pods to become ready (55 retries left).
FAILED - RETRYING: Wait for all control plane pods to become ready (54 retries left).
FAILED - RETRYING: Wait for all control plane pods to become ready (53 retries left).
FAILED - RETRYING: Wait for all control plane pods to become ready (52 retries left).
FAILED - RETRYING: Wait for all control plane pods to become ready (51 retries left).
FAILED - RETRYING: Wait for all control plane pods to become ready (50 retries left).
ok: [10.0.1.31] => (item=etcd)
FAILED - RETRYING: Wait for all control plane pods to become ready (60 retries left).
ok: [10.0.1.31] => (item=api)
FAILED - RETRYING: Wait for all control plane pods to become ready (60 retries left).
FAILED - RETRYING: Wait for all control plane pods to become ready (59 retries left).
FAILED - RETRYING: Wait for all control plane pods to become ready (58 retries left).
FAILED - RETRYING: Wait for all control plane pods to become ready (57 retries left).
FAILED - RETRYING: Wait for all control plane pods to become ready (56 retries left).
FAILED - RETRYING: Wait for all control plane pods to become ready (55 retries left).
FAILED - RETRYING: Wait for all control plane pods to become ready (54 retries left).
FAILED - RETRYING: Wait for all control plane pods to become ready (53 retries left).
FAILED - RETRYING: Wait for all control plane pods to become ready (52 retries left).

TASK [openshift_node_group : Wait for the sync daemonset to become ready and available] **************************************************
FAILED - RETRYING: Wait for the sync daemonset to become ready and available (60 retries left).
FAILED - RETRYING: Wait for the sync daemonset to become ready and available (59 retries left).
FAILED - RETRYING: Wait for the sync daemonset to become ready and available (58 retries left).
FAILED - RETRYING: Wait for the sync daemonset to become ready and available (57 retries left).
FAILED - RETRYING: Wait for the sync daemonset to become ready and available (56 retries left).
FAILED - RETRYING: Wait for the sync daemonset to become ready and available (55 retries left).
FAILED - RETRYING: Wait for the sync daemonset to become ready and available (54 retries left).
FAILED - RETRYING: Wait for the sync daemonset to become ready and available (53 retries left).
FAILED - RETRYING: Wait for the sync daemonset to become ready and available (52 retries left).
FAILED - RETRYING: Wait for the sync daemonset to become ready and available (51 retries left).
FAILED - RETRYING: Wait for the sync daemonset to become ready and available (50 retries left).
FAILED - RETRYING: Wait for the sync daemonset to become ready and available (49 retries left).
FAILED - RETRYING: Wait for the sync daemonset to become ready and available (48 retries left).
FAILED - RETRYING: Wait for the sync daemonset to become ready and available (47 retries left).

ryannix123 avatar Oct 21 '18 01:10 ryannix123

Any chance you have Ansible 2.7?

marekjelen avatar Oct 21 '18 05:10 marekjelen

@gshipley, the dreaded "wait for all control plane pods to become ready" error has appeared for me too. When I run journalctl -flu docker.service in another SSH session I get:

Oct 21 08:39:44 optung.vm.local oci-umount[59912]: umounthook : prestart container_id:3501626da860 rootfs:/var/lib/docker/overlay2/d1c0efea2c3ec01638c000b736c49744ded80645d6c63c2cc7e77e011fc8fa30/merged
Oct 21 08:39:45 optung.vm.local dockerd-current[43275]: time="2018-10-21T08:39:45.074482088-04:00" level=error msg="containerd: deleting container" error="exit status 1: "container 3501626da8607e40433476414cc19237102900d1b5e50f2236c0e305eb75a623 does not exist\none or more of the container deletions failed\n""
Oct 21 08:39:45 optung.vm.local dockerd-current[43275]: time="2018-10-21T08:39:45.082623052-04:00" level=warning msg="3501626da8607e40433476414cc19237102900d1b5e50f2236c0e305eb75a623 cleanup: failed to unmount secrets: invalid argument"

It keeps repeating the block above; the only difference is the container ID in the level=warning msg="xxx" cleanup line (where "xxx" is the ID). Also, when it reaches the last retry it shows the following message before starting all 60 retries again:

failed: [10.84.51.10] (item=etcd) => {"attempts": 60, "changed": false, "item": "etcd", "msg": {"cmd": "/usr/bin/oc get pod master-etcd-optung.vm.local -o json -n kube-system", "results": [{}], "returncode": 1, "stderr": "The connection to the server optung.vm.local:8443 was refused - did you specify the right host or port?\n", "stdout": ""}}

The VM was created with 8 cores (Core i7), 16 GB RAM, and a 300 GB SSD. The Ansible version is the one from the script, and I touched nothing in the scripts. Are you able to help?
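The "connection to the server optung.vm.local:8443 was refused" part of that error usually just means the master API container never came up. A minimal diagnostic sketch one could run on the master while the playbook is retrying; the hostname and port are taken from the error above, everything else is a generic check rather than anything from the original script:

```bash
# Is the API static pod's container running at all?
docker ps --filter "name=api" --format "{{.ID}}  {{.Names}}  {{.Status}}"

# Is anything listening on the master port the installer is probing?
ss -tlnp | grep 8443

# Does the API answer its health endpoint? (-k because the cert is self-signed)
curl -k https://optung.vm.local:8443/healthz
```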

fclaudiopalmeira avatar Oct 21 '18 12:10 fclaudiopalmeira

Can you check the logs to see whether the system complains about not being able to create certificates?
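For reference, a quick way to grep for that kind of complaint on the master host; this is a rough sketch that assumes the control plane runs as containers under Docker (as this installer sets up), and the unit and container names are the usual OKD 3.11 defaults rather than anything confirmed in this thread:

```bash
# Certificate-related errors reported by the container runtime
journalctl -u docker.service --since "1 hour ago" | grep -i certificate

# Certificate errors from any control plane container that did start
for c in $(docker ps -a --format '{{.Names}}' | grep -E 'api|controllers|etcd'); do
  echo "== $c =="
  docker logs "$c" 2>&1 | grep -i certificate | tail -n 5
done
```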

marekjelen avatar Oct 21 '18 13:10 marekjelen

Looks like it's the correct version, 2.6.5.

Installing : ansible-2.6.5-1.el7.ans.noarch 6/6

ryannix123 avatar Oct 21 '18 14:10 ryannix123

Hey guys, I found out my problem. For some reason, during the installation Ansible was being updated to version 2.7, which doesn't make any sense given these two lines in the script:

curl -o ansible.rpm https://releases.ansible.com/ansible/rpm/release/epel-7-x86_64/ansible-2.6.5-1.el7.ans.noarch.rpm
yum -y --enablerepo=epel install ansible.rpm

At first I thought I had installed Ansible on the system before running the script, so I went drastic and installed a CentOS 7.5 minimal from scratch... and it happened again. What I did to solve it was to add the line yum remove ansible before those two install lines, and it is now working as intended. Weird stuff, though. Do any of you happen to know whether OpenContrail/Tungsten Fabric support is officially added to Origin/OKD?
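In other words, the workaround is to purge any preinstalled Ansible before the script pins 2.6.5. A sketch of the patched section of the install script; the yum remove line is the addition, and the two lines below it are quoted from the comment above:

```bash
# Remove whatever Ansible release EPEL may have already pulled in (e.g. 2.7)
yum -y remove ansible

# Pin the known-good release, exactly as in the original script
curl -o ansible.rpm https://releases.ansible.com/ansible/rpm/release/epel-7-x86_64/ansible-2.6.5-1.el7.ans.noarch.rpm
yum -y --enablerepo=epel install ansible.rpm
```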

fclaudiopalmeira avatar Oct 21 '18 14:10 fclaudiopalmeira

Post-install, mine is still 2.6.5.

ansible --version
ansible 2.6.5
  config file = /etc/ansible/ansible.cfg
  configured module search path = [u'/root/.ansible/plugins/modules', u'/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/lib/python2.7/site-packages/ansible
  executable location = /usr/bin/ansible
  python version = 2.7.5 (default, Jul 13 2018, 13:06:57) [GCC 4.8.5 20150623 (Red Hat 4.8.5-28)]

ryannix123 avatar Oct 21 '18 14:10 ryannix123

I'd happily send the logs, but it seems like the logging location changes with each version of OpenShift, so I'm not sure where to look and Google isn't helping.
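A few places one might look on a 3.11 all-in-one host; this is a sketch only, assuming the control plane runs as static pods under Docker (which is what this installer sets up), and none of these paths or unit names come from the thread itself:

```bash
# Installer / node level
journalctl -u docker.service -f            # container runtime events
journalctl -u origin-node -f               # node service (OKD 3.10+)

# Control plane containers, once they exist
docker ps --format '{{.Names}}' | grep -E 'api|controllers|etcd'
docker logs --tail 100 <container-name>    # replace with a name from the list above
```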

ryannix123 avatar Oct 21 '18 14:10 ryannix123

@ryannix123 Why do lumberjacks get frustrated with OpenShift?

Answer: Because they can never find the logs.

Okay, okay - a Dad joke for sure. We are working on the logging situation and much improvement will happen in the 4.0 release.

gshipley avatar Oct 21 '18 17:10 gshipley

@fclaudiopalmeira so far the only reason I have encountered for the control plane failing with these messages is incorrect certificates caused by Ansible 2.7.
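If you want to rule the certificates in or out, a quick check along these lines may help; this is a sketch, and /etc/origin/master is the standard 3.x master config directory rather than a path stated anywhere in this thread:

```bash
# Does the master serving certificate exist, and is it still valid?
openssl x509 -in /etc/origin/master/master.server.crt -noout -subject -dates

# Does its SAN list cover the hostname the installer polls on :8443?
openssl x509 -in /etc/origin/master/master.server.crt -noout -text | grep -A1 "Subject Alternative Name"
```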

marekjelen avatar Oct 21 '18 18:10 marekjelen

@marekjelen My certificates were OK; the Ansible version, however, was not. I am inclined to believe that whenever you have Ansible 2.7 installed, weird stuff will happen! Luckily I got past that error, and now I'm dealing with another one, related to Git. When I try to create an app I'm getting:

error: fatal: unable to access 'https://github.com/gshipley/simplephp/': The requested URL returned error: 503

That started happening after I set the GIT_SSL_NO_VERIFY=true environment variable (if I don't, it gives me "the Peer's certificate issuer has been marked as not trusted by the user"). So far I have had no luck finding a solution.
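For what it's worth, the usual way to put that variable on the build itself (rather than in your shell) is something like the following; a sketch only, and the build config name simplephp is a guess based on the repo name, so it may differ in your project:

```bash
# Tell the build's git clone to skip TLS verification, then retry the build
oc set env bc/simplephp GIT_SSL_NO_VERIFY=true
oc start-build simplephp --follow
```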

fclaudiopalmeira avatar Oct 21 '18 18:10 fclaudiopalmeira

Well... no luck at all with this certificate stuff. Could anyone help?

fclaudiopalmeira avatar Oct 22 '18 01:10 fclaudiopalmeira

@ryannix123 I reran the setup script and all the control plane pods came up just fine. Can you go down to the Docker level (docker ps, docker logs), check which containers are failing, and extract some logs?
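Something along these lines would capture what is being asked for here; these are generic Docker commands, not specific to this installer:

```bash
# List containers that have died, with their exit status
docker ps -a --filter "status=exited" --format "{{.ID}}  {{.Names}}  {{.Status}}"

# Pull the last lines of logs from one of the failing containers
docker logs --tail 100 <container-id>
```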

marekjelen avatar Oct 22 '18 08:10 marekjelen

@fclaudiopalmeira can you provide more info on how you are trying to deploy the app?

I have tried to clone the repo on the machine

(screenshot, 2018-10-22 at 10:28:33)

as well as deploy the app on OpenShift

(screenshot, 2018-10-22 at 10:27:06)

and both seem to work ...
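For comparison, the CLI equivalent of those two screenshots would be roughly the following; a sketch assuming the stock PHP S2I builder image, with an arbitrary app name:

```bash
# Clone the sample repo directly on the host
git clone https://github.com/gshipley/simplephp.git

# Deploy it on OpenShift from source (S2I build with the php builder image)
oc new-app php~https://github.com/gshipley/simplephp.git --name=simplephp
oc logs -f bc/simplephp
```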

marekjelen avatar Oct 22 '18 08:10 marekjelen

Hey @marekjelen, I was trying to deploy it by following the YouTube video exactly (from the OpenShift dashboard).

fclaudiopalmeira avatar Oct 22 '18 22:10 fclaudiopalmeira

Hmm, that is the second picture, @fclaudiopalmeira, and it worked fine on a cluster I have just provisioned.

marekjelen avatar Oct 24 '18 15:10 marekjelen

You can alter the Ansible version in the installation script from 2.6.x to 2.7.1.1 as a temporary workaround.

javabeanz avatar Nov 08 '18 20:11 javabeanz

Please attach the inventory and the output of ansible-playbook -vvv.

The sync daemonset might fail if some nodes haven't applied their configuration, so the output of oc describe nodes would be handy too.
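Concretely, that would be something like the following; a sketch where the inventory path is a placeholder to replace with your own, and the playbook path is the one used later in this thread:

```bash
# Re-run the installer with verbose output and keep a copy to attach to the issue
ansible-playbook -vvv -i /path/to/your/inventory \
  /usr/share/ansible/openshift-ansible/playbooks/deploy_cluster.yml | tee deploy_cluster-vvv.log

# Node state, to see whether the sync daemonset's configuration was applied
oc describe nodes > nodes-describe.txt
```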

vrutkovs avatar Nov 08 '18 20:11 vrutkovs

I have fixed this by doing the following steps.

  1. yum remove atomic-openshift* (on all nodes)
  2. yum install atomic-openshift* (on all nodes)
  3. mv /etc/origin /etc/origin.old
  4. mv /etc/kubernetes /etc/kubernetes.old
  5. mv ~/.kube/config /tmp/kube_config_backup
  6. Re-run the deploy playbook: ansible-playbook -i /tmp/test /usr/share/ansible/openshift-ansible/playbooks/deploy_cluster.yml

Please let me know if that works for you.
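The same sequence as a single script, for convenience; a sketch under the assumptions in the list above (RPM-based atomic-openshift packages and /tmp/test as the inventory path), moving the old configuration aside rather than deleting it so it can be restored:

```bash
#!/bin/bash

# 1-2. Reinstall the OpenShift packages (run on every node)
yum -y remove "atomic-openshift*"
yum -y install "atomic-openshift*"

# 3-5. Move the old configuration and kubeconfig out of the way (on the master)
mv /etc/origin /etc/origin.old
mv /etc/kubernetes /etc/kubernetes.old
mv ~/.kube/config /tmp/kube_config_backup

# 6. Re-run the deployment playbook against the same inventory
ansible-playbook -i /tmp/test /usr/share/ansible/openshift-ansible/playbooks/deploy_cluster.yml
```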

choudharirahul avatar Dec 09 '18 03:12 choudharirahul

If the above steps don't work, edit /usr/share/ansible/openshift-ansible/roles/openshift_control_plane/tasks/main.yml and replace this line:

- "{{ 'etcd' if inventory_hostname in groups['oo_etcd_to_config'] else omit }}"

with:

- "{{ 'etcd' if (inventory_hostname in groups['oo_etcd_to_config'] and inventory_hostname in groups['oo_masters_to_config']) else '' }}"

choudharirahul avatar Dec 09 '18 03:12 choudharirahul

Still no luck, same issue

sivalanka avatar Dec 09 '18 08:12 sivalanka

Can you paste the exact error? And have you tried both ways?

rahulchoudhari avatar Dec 10 '18 00:12 rahulchoudhari

Looks like these deployments are going to radically change in OpenShift 4: https://www.youtube.com/watch?v=-xJIvBpvEeE

ryannix123 avatar Dec 17 '18 16:12 ryannix123

> Well... no luck at all with this certificate stuff. Could anyone help?

@fclaudiopalmeira - have you found a solution to the certificate issue?

dennislabajo avatar Apr 13 '19 06:04 dennislabajo