agnosticd
For OCP3 installs, ansible 2.6.x fails to reboot nodes after kernel update
Describe the bug
When launching an OCP3 cluster using Ansible 2.6.18, nodes do not reboot at `TASK [common : Reboot all VMs after updating to the latest packages (Ansible 2.6 version)]`, although the subsequent `Wait for VMs to come back up` task reports success.
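For context, Ansible 2.6-era playbooks typically implement a reboot as an async shell command followed by a separate wait task. The sketch below is a hedged illustration of that pattern (task names follow the log in this report, but the actual `roles/common/tasks/reboot_26.yml` may differ in details):

```yaml
# Sketch of the common pre-2.7 reboot pattern (hypothetical; the real
# reboot_26.yml in agnosticd may differ).
- name: Reboot all VMs after updating to the latest release
  shell: sleep 2 && shutdown -r now "Ansible updates triggered"
  async: 1
  poll: 0
  ignore_errors: true

- name: Wait for VMs to come back up
  wait_for_connection:
    delay: 30
    timeout: 300
```

Because the reboot and the wait are two independent tasks, the wait can succeed even when the shutdown never actually took effect.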
To Reproduce
Using Ansible 2.6.18, steps to reproduce the behavior:
ansible-playbook $AGNOSTICD_CHECKOUT_LOCATION/ansible/main.yml -e @my_vars.yml -e @ocp3_vars.yml -e @../secret.yml
my_vars.yml:
- sets email/guid/output_dir/subdomain_base_suffix/HostedZoneId, key_name, cloud_tags.owner to local values

```yaml
cloud_provider: ec2
install_glusterfs: true
aws_region: us-east-2
node_instance_count: 3
```
ocp3_vars.yml:

```yaml
env_type: "ocp-workshop"
repo_version: 3.11
osrelease: 3.11.104
software_to_deploy: "openshift"
course_name: "ocp-workshop"
platform: "aws"
install_k8s_modules: true
bastion_instance_type: "t2.large"
master_instance_type: "m5.large"
infranode_instance_type: "m5.large"
node_instance_type: "m5.large"
support_instance_type: "m4.large"
support_instance_public_dns: true
nfs_server_address: "support1.{{ guid }}{{ subdomain_base_suffix }}"
```
Expected behavior
Expected the OCP3 cluster to be up and running, including the gluster pods. All nodes should be running the latest installed kernel version after a successful reboot.
Screenshots / logs
Log snippet:

```
TASK [common : Update all packages] ********************************************
Wednesday 24 July 2019 18:41:49 +0000 (0:00:01.514) 0:00:42.283 ********
ok: [node3.aclewettocp3.internal]
ok: [node2.aclewettocp3.internal]
ok: [node1.aclewettocp3.internal]
ok: [infranode1.aclewettocp3.internal]
ok: [master1.aclewettocp3.internal]
ok: [support2.aclewettocp3.internal]
ok: [support1.aclewettocp3.internal]
ok: [support3.aclewettocp3.internal]
changed: [bastion.aclewettocp3.internal]
TASK [common : Determine if reboot is needed] **********************************
Wednesday 24 July 2019 18:42:46 +0000 (0:00:57.060) 0:01:39.344 ********
ok: [support2.aclewettocp3.internal]
ok: [node3.aclewettocp3.internal]
ok: [node1.aclewettocp3.internal]
ok: [master1.aclewettocp3.internal]
ok: [infranode1.aclewettocp3.internal]
ok: [support1.aclewettocp3.internal]
ok: [bastion.aclewettocp3.internal]
ok: [support3.aclewettocp3.internal]
ok: [node2.aclewettocp3.internal]
TASK [common : Reboot all VMs after updating to the latest packages (Ansible 2.6 version)] ***
Wednesday 24 July 2019 18:42:47 +0000 (0:00:01.144) 0:01:40.488 ********
skipping: [support2.aclewettocp3.internal]
skipping: [support3.aclewettocp3.internal]
skipping: [support1.aclewettocp3.internal]
included: /home/ec2-user/agnosticd/ansible/roles/common/tasks/reboot_26.yml for node3.aclewettocp3.internal, node1.aclewettocp3.internal, node2.aclewettocp3.internal, bastion.aclewettocp3.internal, infranode1.aclewettocp3.internal, master1.aclewettocp3.internal
TASK [common : Reboot all VMs after updating to the latest release] ************
Wednesday 24 July 2019 18:42:47 +0000 (0:00:00.321) 0:01:40.809 ********
changed: [bastion.aclewettocp3.internal]
changed: [node3.aclewettocp3.internal]
changed: [node2.aclewettocp3.internal]
changed: [node1.aclewettocp3.internal]
changed: [master1.aclewettocp3.internal]
changed: [infranode1.aclewettocp3.internal]
TASK [common : Wait for VMs to come back up] ***********************************
Wednesday 24 July 2019 18:42:49 +0000 (0:00:01.850) 0:01:42.660 ********
ok: [node3.aclewettocp3.internal]
ok: [node1.aclewettocp3.internal]
ok: [node2.aclewettocp3.internal]
ok: [bastion.aclewettocp3.internal]
ok: [infranode1.aclewettocp3.internal]
ok: [master1.aclewettocp3.internal]
TASK [common : Reboot all VMs after updating to the latest packages (Ansible 2.7+ version)] ***
Wednesday 24 July 2019 18:43:20 +0000 (0:00:30.922) 0:02:13.582 ********
skipping: [support2.aclewettocp3.internal]
skipping: [support1.aclewettocp3.internal]
skipping: [node3.aclewettocp3.internal]
skipping: [support3.aclewettocp3.internal]
skipping: [bastion.aclewettocp3.internal]
skipping: [node1.aclewettocp3.internal]
skipping: [node2.aclewettocp3.internal]
skipping: [infranode1.aclewettocp3.internal]
skipping: [master1.aclewettocp3.internal]
TASK [common : Update network facts after reboot] ******************************
Wednesday 24 July 2019 18:43:21 +0000 (0:00:00.287) 0:02:13.870 ********
ok: [node2.aclewettocp3.internal]
ok: [infranode1.aclewettocp3.internal]
ok: [bastion.aclewettocp3.internal]
ok: [node3.aclewettocp3.internal]
ok: [master1.aclewettocp3.internal]
ok: [node1.aclewettocp3.internal]
ok: [support1.aclewettocp3.internal]
ok: [support2.aclewettocp3.internal]
ok: [support3.aclewettocp3.internal]
Versions (please complete the following information):
- RHEL7
- AAD version:
a64dcb6b9a9a19d19793c73b10dfde5a952650eb Added Operator Group to Kube Fed Workload (no operator group in kube-federation
0abe2405192ac7a8da3c85b4e9202de0c70574a6 Updated terminal to 4.2.0 and using the proper terminal template to use OpenShi
- Ansible: 2.6.18
- cloud provider CLI:
aws --version
aws-cli/1.16.113 Python/2.7.5 Linux/3.10.0-957.el7.x86_64 botocore/1.12.184
Additional context
After noticing the failed gluster pods, the failure message indicated that the minimum kernel version requirement wasn't met. We discovered that the nodes were still running the old kernel version, meaning they still needed a reboot. Manually rebooting the support nodes resulted in the gluster pods starting successfully. We also verified that master1 and infranode1 didn't reboot:
# uname -r
3.10.0-862.el7.x86_64
# rpm -qa | grep kernel
kernel-3.10.0-957.21.3.el7.x86_64
kernel-tools-3.10.0-957.21.3.el7.x86_64
kernel-tools-libs-3.10.0-957.21.3.el7.x86_64
kernel-3.10.0-862.el7.x86_64
```
If you look at the log snippet above, there seem to be two things going on:
1) "Reboot all VMs after updating to the latest packages (Ansible 2.6 version)" indicates that it's skipping the support nodes, even though they needed a reboot too.
2) "Reboot all VMs after updating to the latest release" and "Wait for VMs to come back up" indicate that the other nodes *did* get rebooted as requested, but the output above (taken from master1) shows that either the reboot never actually happened or the kernel update landed after the reboot.
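A plausible explanation for (2) is a race in the 2.6-style reboot pattern: if the controller reconnects before `shutdown -r now` actually drops SSH (or the fire-and-forget async task is reaped before it runs), the wait task succeeds against the still-running host and the play continues with no reboot having occurred. One way to catch this would be a post-reboot guard that asserts the running kernel matches the newest installed one. This is a hedged sketch, not code from agnosticd, and the task and variable names are hypothetical:

```yaml
# Hypothetical post-reboot check (illustrative, not from the agnosticd role):
# fail fast if the running kernel is not the newest installed kernel.
- name: Determine newest installed kernel
  shell: rpm -q kernel --qf '%{VERSION}-%{RELEASE}.%{ARCH}\n' | sort -V | tail -1
  register: newest_kernel
  changed_when: false

- name: Verify the reboot actually picked up the new kernel
  assert:
    that:
      - ansible_kernel == newest_kernel.stdout
    fail_msg: "Running {{ ansible_kernel }}, expected {{ newest_kernel.stdout }}"
```

With such a check, the symptom reported here (the `uname -r` / `rpm -qa` mismatch on master1) would have failed the play instead of surfacing later as broken gluster pods.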
After all of this, we upgraded Ansible to 2.8.2 and, without changing anything else, re-ran the install; all nodes rebooted as needed and gluster came up properly on the first try. In separate installs, we've also verified that Ansible 2.8.2 and 2.7.6 work correctly.
I'm not sure whether the resolution is to document that Ansible 2.6.x is no longer supported, or to fix the Ansible roles/tasks that determine when to reboot (and make the reboot actually happen). But since Ansible 2.6-specific tasks are being run, I'm guessing we're trying to support 2.6 right now.
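If 2.6 support is dropped, one possible fix would be to replace the split reboot/wait tasks with the `reboot` module introduced in Ansible 2.7, which verifies the host actually went down (by comparing boot time) before waiting for it to return. A minimal sketch, assuming the role can require Ansible >= 2.7:

```yaml
# Ansible >= 2.7 only: the reboot module confirms the host rebooted and
# waits until test_command succeeds before continuing the play.
- name: Reboot all VMs after updating to the latest packages
  reboot:
    reboot_timeout: 600
    test_command: uname -r
```

This would also remove the need for the version-specific task branches seen in the log above.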