
3.11 deployment issue

Open ryannix123 opened this issue 6 years ago • 23 comments

TASK [openshift_control_plane : Wait for all control plane pods to become ready] *********************************************************
FAILED - RETRYING: Wait for all control plane pods to become ready (60 retries left).
FAILED - RETRYING: Wait for all control plane pods to become ready (59 retries left).
FAILED - RETRYING: Wait for all control plane pods to become ready (58 retries left).
FAILED - RETRYING: Wait for all control plane pods to become ready (57 retries left).
FAILED - RETRYING: Wait for all control plane pods to become ready (56 retries left).
FAILED - RETRYING: Wait for all control plane pods to become ready (55 retries left).
FAILED - RETRYING: Wait for all control plane pods to become ready (54 retries left).
FAILED - RETRYING: Wait for all control plane pods to become ready (53 retries left).
FAILED - RETRYING: Wait for all control plane pods to become ready (52 retries left).
FAILED - RETRYING: Wait for all control plane pods to become ready (51 retries left).
FAILED - RETRYING: Wait for all control plane pods to become ready (50 retries left).
ok: [10.0.1.31] => (item=etcd)
FAILED - RETRYING: Wait for all control plane pods to become ready (60 retries left).
ok: [10.0.1.31] => (item=api)
FAILED - RETRYING: Wait for all control plane pods to become ready (60 retries left).
FAILED - RETRYING: Wait for all control plane pods to become ready (59 retries left).
FAILED - RETRYING: Wait for all control plane pods to become ready (58 retries left).
FAILED - RETRYING: Wait for all control plane pods to become ready (57 retries left).
FAILED - RETRYING: Wait for all control plane pods to become ready (56 retries left).
FAILED - RETRYING: Wait for all control plane pods to become ready (55 retries left).
FAILED - RETRYING: Wait for all control plane pods to become ready (54 retries left).
FAILED - RETRYING: Wait for all control plane pods to become ready (53 retries left).
FAILED - RETRYING: Wait for all control plane pods to become ready (52 retries left).

TASK [openshift_node_group : Wait for the sync daemonset to become ready and available] **************************************************
FAILED - RETRYING: Wait for the sync daemonset to become ready and available (60 retries left).
FAILED - RETRYING: Wait for the sync daemonset to become ready and available (59 retries left).
FAILED - RETRYING: Wait for the sync daemonset to become ready and available (58 retries left).
FAILED - RETRYING: Wait for the sync daemonset to become ready and available (57 retries left).
FAILED - RETRYING: Wait for the sync daemonset to become ready and available (56 retries left).
FAILED - RETRYING: Wait for the sync daemonset to become ready and available (55 retries left).
FAILED - RETRYING: Wait for the sync daemonset to become ready and available (54 retries left).
FAILED - RETRYING: Wait for the sync daemonset to become ready and available (53 retries left).
FAILED - RETRYING: Wait for the sync daemonset to become ready and available (52 retries left).
FAILED - RETRYING: Wait for the sync daemonset to become ready and available (51 retries left).
FAILED - RETRYING: Wait for the sync daemonset to become ready and available (50 retries left).
FAILED - RETRYING: Wait for the sync daemonset to become ready and available (49 retries left).
FAILED - RETRYING: Wait for the sync daemonset to become ready and available (48 retries left).
FAILED - RETRYING: Wait for the sync daemonset to become ready and available (47 retries left).

ryannix123 avatar Oct 21 '18 01:10 ryannix123

Any chance you have Ansible 2.7?

marekjelen avatar Oct 21 '18 05:10 marekjelen

@gshipley, the dreaded "wait for all control plane pods to become ready" error has appeared for me too. When I run journalctl -flu docker.service in another SSH session I get:

Oct 21 08:39:44 optung.vm.local oci-umount[59912]: umounthook : prestart container_id:3501626da860 rootfs:/var/lib/docker/overlay2/d1c0efea2c3ec01638c000b736c49744ded80645d6c63c2cc7e77e011fc8fa30/merged
Oct 21 08:39:45 optung.vm.local dockerd-current[43275]: time="2018-10-21T08:39:45.074482088-04:00" level=error msg="containerd: deleting container" error="exit status 1: "container 3501626da8607e40433476414cc19237102900d1b5e50f2236c0e305eb75a623 does not exist\none or more of the container deletions failed\n""
Oct 21 08:39:45 optung.vm.local dockerd-current[43275]: time="2018-10-21T08:39:45.082623052-04:00" level=warning msg="3501626da8607e40433476414cc19237102900d1b5e50f2236c0e305eb75a623 cleanup: failed to unmount secrets: invalid argument"

It keeps repeating the block above; the only difference is the container ID in the level=warning msg="xxx" cleanup line (where "xxx" is the ID). Also, when it reaches the last retry it shows the following message before starting all 60 retries again:

failed: [10.84.51.10] (item=etcd) => {"attempts": 60, "changed": false, "item": "etcd", "msg": {"cmd": "/usr/bin/oc get pod master-etcd-optung.vm.local -o json -n kube-system", "results": [{}], "returncode": 1, "stderr": "The connection to the server optung.vm.local:8443 was refused - did you specify the right host or port?\n", "stdout": ""}}

The VM was created with 8 cores (Core i7), 16 GB RAM, and a 300 GB SSD. The Ansible version is the one from the script, and I touched nothing in the scripts. Are you able to help?
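The "connection to the server optung.vm.local:8443 was refused" part of that error usually just means the master API container never came up. A minimal diagnostic sketch one could run on the master while the playbook is retrying; the hostname and port are taken from the error above, everything else is a generic check rather than anything from the original script:

```bash
# Is the API static pod's container running at all?
docker ps --filter "name=api" --format "{{.ID}}  {{.Names}}  {{.Status}}"

# Is anything listening on the master port the installer is probing?
ss -tlnp | grep 8443

# Does the API answer its health endpoint? (-k because the cert is self-signed)
curl -k https://optung.vm.local:8443/healthz
```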

fclaudiopalmeira avatar Oct 21 '18 12:10 fclaudiopalmeira

Can you check the logs to see whether the system complains about not being able to create certificates?
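For reference, a quick way to grep for that kind of complaint on the master host; this is a rough sketch that assumes the control plane runs as containers under Docker (as this installer sets up), and the unit and container names are the usual OKD 3.11 defaults rather than anything confirmed in this thread:

```bash
# Certificate-related errors reported by the container runtime
journalctl -u docker.service --since "1 hour ago" | grep -i certificate

# Certificate errors from any control plane container that did start
for c in $(docker ps -a --format '{{.Names}}' | grep -E 'api|controllers|etcd'); do
  echo "== $c =="
  docker logs "$c" 2>&1 | grep -i certificate | tail -n 5
done
```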

marekjelen avatar Oct 21 '18 13:10 marekjelen

Looks like it's the correct version, 2.6.5.

Installing : ansible-2.6.5-1.el7.ans.noarch 6/6

ryannix123 avatar Oct 21 '18 14:10 ryannix123

Hey guys, I found out my problem. For some reason, during the installation Ansible was being updated to version 2.7, which doesn't make any sense given these two lines in the script:

curl -o ansible.rpm https://releases.ansible.com/ansible/rpm/release/epel-7-x86_64/ansible-2.6.5-1.el7.ans.noarch.rpm
yum -y --enablerepo=epel install ansible.rpm

At first I thought I had installed Ansible on the system before running the script, so I went drastic and installed a CentOS 7.5 minimal from scratch... and it happened again. What I did to solve it was to add the line yum remove ansible before those two install lines, and it is now working as intended. Weird stuff, though. Do any of you happen to know whether OpenContrail/Tungsten Fabric support is officially added to Origin/OKD?
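In other words, the workaround is to purge any preinstalled Ansible before the script pins 2.6.5. A sketch of the patched section of the install script; the yum remove line is the addition, and the two lines below it are quoted from the comment above:

```bash
# Remove whatever Ansible release EPEL may have already pulled in (e.g. 2.7)
yum -y remove ansible

# Pin the known-good release, exactly as in the original script
curl -o ansible.rpm https://releases.ansible.com/ansible/rpm/release/epel-7-x86_64/ansible-2.6.5-1.el7.ans.noarch.rpm
yum -y --enablerepo=epel install ansible.rpm
```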

fclaudiopalmeira avatar Oct 21 '18 14:10 fclaudiopalmeira

Post-install, mine is still 2.6.5.

ansible --version
ansible 2.6.5
  config file = /etc/ansible/ansible.cfg
  configured module search path = [u'/root/.ansible/plugins/modules', u'/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/lib/python2.7/site-packages/ansible
  executable location = /usr/bin/ansible
  python version = 2.7.5 (default, Jul 13 2018, 13:06:57) [GCC 4.8.5 20150623 (Red Hat 4.8.5-28)]

ryannix123 avatar Oct 21 '18 14:10 ryannix123

I'd happily send the logs, but it seems like the logging location changes with each version of OpenShift, so I'm not sure where to look and Google isn't helping.
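A few places one might look on a 3.11 all-in-one host; this is a sketch only, assuming the control plane runs as static pods under Docker (which is what this installer sets up), and none of these paths or unit names come from the thread itself:

```bash
# Installer / node level
journalctl -u docker.service -f            # container runtime events
journalctl -u origin-node -f               # node service (OKD 3.10+)

# Control plane containers, once they exist
docker ps --format '{{.Names}}' | grep -E 'api|controllers|etcd'
docker logs --tail 100 <container-name>    # replace with a name from the list above
```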

ryannix123 avatar Oct 21 '18 14:10 ryannix123

@ryannix123 Why do lumberjacks get frustrated with OpenShift?

Answer: Because they can never find the logs.

Okay, okay - a Dad joke for sure. We are working on the logging situation and much improvement will happen in the 4.0 release.

gshipley avatar Oct 21 '18 17:10 gshipley

@fclaudiopalmeira so far the only reason I have encountered for the control plane failing with these messages is incorrect certificates caused by Ansible 2.7.
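If you want to rule the certificates in or out, a quick check along these lines may help; this is a sketch, and /etc/origin/master is the standard 3.x master config directory rather than a path stated anywhere in this thread:

```bash
# Does the master serving certificate exist, and is it still valid?
openssl x509 -in /etc/origin/master/master.server.crt -noout -subject -dates

# Does its SAN list cover the hostname the installer polls on :8443?
openssl x509 -in /etc/origin/master/master.server.crt -noout -text | grep -A1 "Subject Alternative Name"
```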

marekjelen avatar Oct 21 '18 18:10 marekjelen

@marekjelen My certificates were OK; the Ansible version, however, was not. I am inclined to believe that whenever you have Ansible 2.7 installed, weird stuff will happen! Luckily I got past that error, and now I'm dealing with another one, related to Git. When I try to create an app I'm getting:

error: fatal: unable to access 'https://github.com/gshipley/simplephp/': The requested URL returned error: 503

That started happening after I set the GIT_SSL_NO_VERIFY=true environment variable (if I don't, it gives me "the Peer's certificate issuer has been marked as not trusted by the user"). So far I have had no luck finding a solution.
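For what it's worth, the usual way to put that variable on the build itself (rather than in your shell) is something like the following; a sketch only, and the build config name simplephp is a guess based on the repo name, so it may differ in your project:

```bash
# Tell the build's git clone to skip TLS verification, then retry the build
oc set env bc/simplephp GIT_SSL_NO_VERIFY=true
oc start-build simplephp --follow
```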

fclaudiopalmeira avatar Oct 21 '18 18:10 fclaudiopalmeira

Well... no luck at all with this certificate stuff. Could anyone help?

fclaudiopalmeira avatar Oct 22 '18 01:10 fclaudiopalmeira

@ryannix123 I reran the setup script and all the control plane pods came up just fine. Can you go down to the Docker level (docker ps, docker logs), check which containers are failing, and extract some logs?
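Something along these lines would capture what is being asked for here; these are generic Docker commands, not specific to this installer:

```bash
# List containers that have died, with their exit status
docker ps -a --filter "status=exited" --format "{{.ID}}  {{.Names}}  {{.Status}}"

# Pull the last lines of logs from one of the failing containers
docker logs --tail 100 <container-id>
```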

marekjelen avatar Oct 22 '18 08:10 marekjelen

@fclaudiopalmeira can you provide more info on how you are trying to deploy the app?

I have tried to clone the repo on the machine

(screenshot, 2018-10-22 at 10:28:33)

as well as deploy the app on OpenShift

(screenshot, 2018-10-22 at 10:27:06)

and both seem to work ...
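For comparison, the CLI equivalent of those two screenshots would be roughly the following; a sketch assuming the stock PHP S2I builder image, with an arbitrary app name:

```bash
# Clone the sample repo directly on the host
git clone https://github.com/gshipley/simplephp.git

# Deploy it on OpenShift from source (S2I build with the php builder image)
oc new-app php~https://github.com/gshipley/simplephp.git --name=simplephp
oc logs -f bc/simplephp
```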

marekjelen avatar Oct 22 '18 08:10 marekjelen

Hey @marekjelen, I was trying to deploy it by following the YouTube video exactly (from the OpenShift dashboard).

fclaudiopalmeira avatar Oct 22 '18 22:10 fclaudiopalmeira

Hmm, that is the second picture, @fclaudiopalmeira, and it worked fine on a cluster I have just provisioned.

marekjelen avatar Oct 24 '18 15:10 marekjelen

You can alter the Ansible version in the installation script from 2.6.x to 2.7.1.1 as a temporary workaround.

javabeanz avatar Nov 08 '18 20:11 javabeanz

Please attach the inventory and the output of ansible-playbook -vvv.

The sync daemonset might fail if some nodes haven't applied their configuration, so the output of oc describe nodes would be handy too.
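Concretely, that would be something like the following; a sketch where the inventory path is a placeholder to replace with your own, and the playbook path is the one used later in this thread:

```bash
# Re-run the installer with verbose output and keep a copy to attach to the issue
ansible-playbook -vvv -i /path/to/your/inventory \
  /usr/share/ansible/openshift-ansible/playbooks/deploy_cluster.yml | tee deploy_cluster-vvv.log

# Node state, to see whether the sync daemonset's configuration was applied
oc describe nodes > nodes-describe.txt
```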

vrutkovs avatar Nov 08 '18 20:11 vrutkovs

I have fixed this by doing the following steps.

  1. yum remove atomic-openshift* (on all nodes)
  2. yum install atomic-openshift* (on all nodes)
  3. mv /etc/origin /etc/origin.old
  4. mv /etc/kubernetes /etc/kubernetes.old
  5. mv ~/.kube/config /tmp/kube_config_backup
  6. Re-run the deploy playbook: ansible-playbook -i /tmp/test /usr/share/ansible/openshift-ansible/playbooks/deploy_cluster.yml

Please let me know if that works for you.
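The same sequence as a single script, for convenience; a sketch under the assumptions in the list above (RPM-based atomic-openshift packages and /tmp/test as the inventory path), moving the old configuration aside rather than deleting it so it can be restored:

```bash
#!/bin/bash

# 1-2. Reinstall the OpenShift packages (run on every node)
yum -y remove "atomic-openshift*"
yum -y install "atomic-openshift*"

# 3-5. Move the old configuration and kubeconfig out of the way (on the master)
mv /etc/origin /etc/origin.old
mv /etc/kubernetes /etc/kubernetes.old
mv ~/.kube/config /tmp/kube_config_backup

# 6. Re-run the deployment playbook against the same inventory
ansible-playbook -i /tmp/test /usr/share/ansible/openshift-ansible/playbooks/deploy_cluster.yml
```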

choudharirahul avatar Dec 09 '18 03:12 choudharirahul

If the above steps don't work, edit /usr/share/ansible/openshift-ansible/roles/openshift_control_plane/tasks/main.yml and replace this line:

- "{{ 'etcd' if inventory_hostname in groups['oo_etcd_to_config'] else omit }}"

with:

- "{{ 'etcd' if (inventory_hostname in groups['oo_etcd_to_config'] and inventory_hostname in groups['oo_masters_to_config']) else '' }}"

choudharirahul avatar Dec 09 '18 03:12 choudharirahul

Still no luck, same issue

sivalanka avatar Dec 09 '18 08:12 sivalanka

Can you paste the exact error? And have you tried both ways?

rahulchoudhari avatar Dec 10 '18 00:12 rahulchoudhari

Looks like these deployments are going to radically change in OpenShift 4: https://www.youtube.com/watch?v=-xJIvBpvEeE

ryannix123 avatar Dec 17 '18 16:12 ryannix123

> Well... no luck at all with this certificate stuff. Could anyone help?

@fclaudiopalmeira - have you found a solution to the certificate issue?

dennislabajo avatar Apr 13 '19 06:04 dennislabajo