agnosticd
For OCP3 installs, ansible 2.6.x fails to reboot nodes after kernel update
Describe the bug
When launching an OCP3 cluster using Ansible 2.6.18, nodes do not reboot at `TASK [common : Reboot all VMs after updating to the latest packages (Ansible 2.6 version)]`, although the subsequent `Wait for VMs to come back up` task reports success.
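For context, Ansible 2.6-era playbooks typically implement a reboot as an async shell command followed by a separate wait task. The sketch below is a hedged illustration of that pattern (task names follow the log in this report, but the actual `roles/common/tasks/reboot_26.yml` may differ in details):

```yaml
# Sketch of the common pre-2.7 reboot pattern (hypothetical; the real
# reboot_26.yml in agnosticd may differ).
- name: Reboot all VMs after updating to the latest release
  shell: sleep 2 && shutdown -r now "Ansible updates triggered"
  async: 1
  poll: 0
  ignore_errors: true

- name: Wait for VMs to come back up
  wait_for_connection:
    delay: 30
    timeout: 300
```

Because the reboot and the wait are two independent tasks, the wait can succeed even when the shutdown never actually took effect.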
To Reproduce
Using Ansible 2.6.18, steps to reproduce the behavior:
ansible-playbook $AGNOSTICD_CHECKOUT_LOCATION/ansible/main.yml -e @my_vars.yml -e @ocp3_vars.yml -e @../secret.yml
my_vars.yml:
- sets email/guid/output_dir/subdomain_base_suffix/HostedZoneId, key_name, cloud_tags.owner to local values

```yaml
cloud_provider: ec2
install_glusterfs: true
aws_region: us-east-2
node_instance_count: 3
```
ocp3_vars.yml:

```yaml
env_type: "ocp-workshop"
repo_version: 3.11
osrelease: 3.11.104
software_to_deploy: "openshift"
course_name: "ocp-workshop"
platform: "aws"
install_k8s_modules: true
bastion_instance_type: "t2.large"
master_instance_type: "m5.large"
infranode_instance_type: "m5.large"
node_instance_type: "m5.large"
support_instance_type: "m4.large"
support_instance_public_dns: true
nfs_server_address: "support1.{{ guid }}{{ subdomain_base_suffix }}"
```
Expected behavior
Expected the OCP3 cluster to be up and running, including the gluster pods. All nodes should be running the latest installed kernel version after a successful reboot.
Screenshots / logs
Log snippet:

```
TASK [common : Update all packages] ********************************************
Wednesday 24 July 2019 18:41:49 +0000 (0:00:01.514) 0:00:42.283 ********
ok: [node3.aclewettocp3.internal]
ok: [node2.aclewettocp3.internal]
ok: [node1.aclewettocp3.internal]
ok: [infranode1.aclewettocp3.internal]
ok: [master1.aclewettocp3.internal]
ok: [support2.aclewettocp3.internal]
ok: [support1.aclewettocp3.internal]
ok: [support3.aclewettocp3.internal]
changed: [bastion.aclewettocp3.internal]
TASK [common : Determine if reboot is needed] **********************************
Wednesday 24 July 2019 18:42:46 +0000 (0:00:57.060) 0:01:39.344 ********
ok: [support2.aclewettocp3.internal]
ok: [node3.aclewettocp3.internal]
ok: [node1.aclewettocp3.internal]
ok: [master1.aclewettocp3.internal]
ok: [infranode1.aclewettocp3.internal]
ok: [support1.aclewettocp3.internal]
ok: [bastion.aclewettocp3.internal]
ok: [support3.aclewettocp3.internal]
ok: [node2.aclewettocp3.internal]
TASK [common : Reboot all VMs after updating to the latest packages (Ansible 2.6 version)] ***
Wednesday 24 July 2019 18:42:47 +0000 (0:00:01.144) 0:01:40.488 ********
skipping: [support2.aclewettocp3.internal]
skipping: [support3.aclewettocp3.internal]
skipping: [support1.aclewettocp3.internal]
included: /home/ec2-user/agnosticd/ansible/roles/common/tasks/reboot_26.yml for node3.aclewettocp3.internal, node1.aclewettocp3.internal, node2.aclewettocp3.internal, bastion.aclewettocp3.internal, infranode1.aclewettocp3.internal, master1.aclewettocp3.internal
TASK [common : Reboot all VMs after updating to the latest release] ************
Wednesday 24 July 2019 18:42:47 +0000 (0:00:00.321) 0:01:40.809 ********
changed: [bastion.aclewettocp3.internal]
changed: [node3.aclewettocp3.internal]
changed: [node2.aclewettocp3.internal]
changed: [node1.aclewettocp3.internal]
changed: [master1.aclewettocp3.internal]
changed: [infranode1.aclewettocp3.internal]
TASK [common : Wait for VMs to come back up] ***********************************
Wednesday 24 July 2019 18:42:49 +0000 (0:00:01.850) 0:01:42.660 ********
ok: [node3.aclewettocp3.internal]
ok: [node1.aclewettocp3.internal]
ok: [node2.aclewettocp3.internal]
ok: [bastion.aclewettocp3.internal]
ok: [infranode1.aclewettocp3.internal]
ok: [master1.aclewettocp3.internal]
TASK [common : Reboot all VMs after updating to the latest packages (Ansible 2.7+ version)] ***
Wednesday 24 July 2019 18:43:20 +0000 (0:00:30.922) 0:02:13.582 ********
skipping: [support2.aclewettocp3.internal]
skipping: [support1.aclewettocp3.internal]
skipping: [node3.aclewettocp3.internal]
skipping: [support3.aclewettocp3.internal]
skipping: [bastion.aclewettocp3.internal]
skipping: [node1.aclewettocp3.internal]
skipping: [node2.aclewettocp3.internal]
skipping: [infranode1.aclewettocp3.internal]
skipping: [master1.aclewettocp3.internal]
TASK [common : Update network facts after reboot] ******************************
Wednesday 24 July 2019 18:43:21 +0000 (0:00:00.287) 0:02:13.870 ********
ok: [node2.aclewettocp3.internal]
ok: [infranode1.aclewettocp3.internal]
ok: [bastion.aclewettocp3.internal]
ok: [node3.aclewettocp3.internal]
ok: [master1.aclewettocp3.internal]
ok: [node1.aclewettocp3.internal]
ok: [support1.aclewettocp3.internal]
ok: [support2.aclewettocp3.internal]
ok: [support3.aclewettocp3.internal]
Versions (please complete the following information):
- RHEL7
- AAD version:
a64dcb6b9a9a19d19793c73b10dfde5a952650eb Added Operator Group to Kube Fed Workload (no operator group in kube-federation
0abe2405192ac7a8da3c85b4e9202de0c70574a6 Updated terminal to 4.2.0 and using the proper terminal template to use OpenShi
- Ansible: 2.6.18
- cloud provider CLI:
aws --version
aws-cli/1.16.113 Python/2.7.5 Linux/3.10.0-957.el7.x86_64 botocore/1.12.184
Additional context
After noticing the failed gluster pods, the failure message indicated that the minimum kernel version requirement wasn't met. We discovered that the nodes were still running the old kernel version, meaning they still needed a reboot. Manually rebooting the support nodes resulted in the gluster pods starting successfully. We also verified that master1 and infranode1 didn't reboot:
# uname -r
3.10.0-862.el7.x86_64
# rpm -qa | grep kernel
kernel-3.10.0-957.21.3.el7.x86_64
kernel-tools-3.10.0-957.21.3.el7.x86_64
kernel-tools-libs-3.10.0-957.21.3.el7.x86_64
kernel-3.10.0-862.el7.x86_64
```
If you look at the log snippet above, there seem to be two things going on:
1) "Reboot all VMs after updating to the latest packages (Ansible 2.6 version)" indicates that it's skipping the support nodes, even though they needed a reboot too.
2) "Reboot all VMs after updating to the latest release" and "Wait for VMs to come back up" indicate that the other nodes *did* get rebooted as requested, but the output above (taken from master1) shows that either the reboot never actually happened or the kernel update landed after the reboot.
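A plausible explanation for (2) is a race in the 2.6-style reboot pattern: if the controller reconnects before `shutdown -r now` actually drops SSH (or the fire-and-forget async task is reaped before it runs), the wait task succeeds against the still-running host and the play continues with no reboot having occurred. One way to catch this would be a post-reboot guard that asserts the running kernel matches the newest installed one. This is a hedged sketch, not code from agnosticd, and the task and variable names are hypothetical:

```yaml
# Hypothetical post-reboot check (illustrative, not from the agnosticd role):
# fail fast if the running kernel is not the newest installed kernel.
- name: Determine newest installed kernel
  shell: rpm -q kernel --qf '%{VERSION}-%{RELEASE}.%{ARCH}\n' | sort -V | tail -1
  register: newest_kernel
  changed_when: false

- name: Verify the reboot actually picked up the new kernel
  assert:
    that:
      - ansible_kernel == newest_kernel.stdout
    fail_msg: "Running {{ ansible_kernel }}, expected {{ newest_kernel.stdout }}"
```

With such a check, the symptom reported here (the `uname -r` / `rpm -qa` mismatch on master1) would have failed the play instead of surfacing later as broken gluster pods.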
After all of this, we upgraded Ansible to 2.8.2 and, without changing anything else, re-ran the install; all nodes rebooted as needed and gluster came up properly on the first try. In separate installs, we've also verified that Ansible 2.8.2 and 2.7.6 work correctly.
I'm not sure whether the resolution is to document that Ansible 2.6.x is no longer supported, or to fix the Ansible roles/tasks that determine when to reboot (and make the reboot actually happen). But since Ansible 2.6-specific tasks are being run, I'm guessing we're trying to support 2.6 right now.
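If 2.6 support is dropped, one possible fix would be to replace the split reboot/wait tasks with the `reboot` module introduced in Ansible 2.7, which verifies the host actually went down (by comparing boot time) before waiting for it to return. A minimal sketch, assuming the role can require Ansible >= 2.7:

```yaml
# Ansible >= 2.7 only: the reboot module confirms the host rebooted and
# waits until test_command succeeds before continuing the play.
- name: Reboot all VMs after updating to the latest packages
  reboot:
    reboot_timeout: 600
    test_command: uname -r
```

This would also remove the need for the version-specific task branches seen in the log above.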