                        fallback_ips: Gather facts from reachable hosts only
What type of PR is this?
/kind bug
What this PR does / why we need it:
When working with a large fleet of nodes, there are inevitably going to be some unreachable nodes. The linked issue describes how, when kubespray-defaults runs and an unreachable node exists, the play unexpectedly exits early.
This PR fixes fallback_ips.yml so that the rest of the role can finish even when one or more nodes are unreachable.
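At a high level, the role now determines which hosts are reachable first and then loops the setup task only over those hosts. A simplified sketch of the idea follows (not the literal diff; the registered variable name and the reject/map filter chain are illustrative):
- name: Determine reachable hosts
  ping:
  delegate_to: "{{ item }}"
  loop: "{{ (groups['k8s_cluster'] | default([]) + groups['etcd'] | default([]) + groups['calico_rr'] | default([])) | unique }}"
  register: ping_results  # illustrative variable name
  run_once: yes
  ignore_unreachable: true
  ignore_errors: true
  tags: always
- name: Gather ansible_default_ipv4 from reachable hosts
  setup:
    gather_subset: '!all,network'
    filter: "ansible_default_ipv4"
  delegate_to: "{{ item }}"
  delegate_facts: yes
  when: hostvars[item].ansible_default_ipv4 is not defined
  # Only loop over hosts whose ping result is not marked unreachable
  loop: "{{ ping_results.results | rejectattr('unreachable', 'defined') | map(attribute='item') | list }}"
  run_once: yes
  tags: always
Unreachable hosts then simply fall through to the existing 127.0.0.1 default in fallback_ips_base.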
Which issue(s) this PR fixes:
Fixes #10993
Special notes for your reviewer:
Local setup (inventory and test playbook):
Inventory
[k8s_cluster:children]
kube_control_plane
kube_node
calico_rr
[kube_control_plane]
k8s1.local
[etcd]
k8s1.local
[kube_node]
k8s3.local  # unreachable
k8s2.local
[calico_rr]
Playbook
---
- name: Prepare nodes for upgrade
  hosts: k8s_cluster:etcd:calico_rr
  gather_facts: False
  any_errors_fatal: "{{ any_errors_fatal | default(true) }}"
  environment: "{{ proxy_disable_env }}"
  roles:
    - { role: kubespray-defaults }
  post_tasks:
    - name: Task that runs after
      debug:
        var: fallback_ips
Original output before applying my PR:
PLAY [Prepare nodes for upgrade] ********************************************************************************************************************************************************************************
TASK [kubespray-defaults : Gather ansible_default_ipv4 from all hosts] ******************************************************************************************************************************************
ok: [k8s1.local] => (item=k8s1.local)
[WARNING]: Unhandled error in Python interpreter discovery for host k8s1.local: Failed to connect to the host via ssh: ssh: connect to host k8s3.local port 22: Connection timed out
failed: [k8s1.local -> k8s3.local] (item=k8s3.local) => {"ansible_loop_var": "item", "item": "k8s3.local", "msg": "Data could not be sent to remote host \"k8s3.local\". Make sure this host can be reached over ssh: ssh: connect to host k8s3.local port 22: Connection timed out\r\n", "unreachable": true}
ok: [k8s1.local -> k8s2.local] => (item=k8s2.local)
fatal: [k8s1.local -> {{ item }}]: UNREACHABLE! => {"changed": false, "msg": "All items completed", "results": [{"ansible_facts": {"ansible_default_ipv4": {"address": "10.88.111.29", "alias": "eth0", "broadcast": "10.88.111.255", "gateway": "10.88.111.254", "interface": "eth0", "macaddress": "bc:24:11:41:88:12", "mtu": 1500, "netmask": "255.255.252.0", "network": "10.88.108.0", "prefix": "22", "type": "ether"}, "discovered_interpreter_python": "/usr/bin/python3"}, "ansible_loop_var": "item", "changed": false, "failed": false, "invocation": {"module_args": {"fact_path": "/etc/ansible/facts.d", "filter": ["ansible_default_ipv4"], "gather_subset": ["!all", "network"], "gather_timeout": 10}}, "item": "k8s1.local"}, {"ansible_loop_var": "item", "item": "k8s3.local", "msg": "Data could not be sent to remote host \"k8s3.local\". Make sure this host can be reached over ssh: ssh: connect to host k8s3.local port 22: Connection timed out\r\n", "unreachable": true}, {"ansible_facts": {"ansible_default_ipv4": {"address": "10.88.111.30", "alias": "eth0", "broadcast": "10.88.111.255", "gateway": "10.88.111.254", "interface": "eth0", "macaddress": "bc:24:11:be:42:a6", "mtu": 1500, "netmask": "255.255.252.0", "network": "10.88.108.0", "prefix": "22", "type": "ether"}, "discovered_interpreter_python": "/usr/bin/python3"}, "ansible_loop_var": "item", "changed": false, "failed": false, "invocation": {"module_args": {"fact_path": "/etc/ansible/facts.d", "filter": ["ansible_default_ipv4"], "gather_subset": ["!all", "network"], "gather_timeout": 10}}, "item": "k8s2.local"}]}
...ignoring
NO MORE HOSTS LEFT **********************************************************************************************************************************************************************************************
PLAY RECAP ******************************************************************************************************************************************************************************************************
k8s1.local                 : ok=1    changed=0    unreachable=0    failed=0    skipped=0    rescued=0    ignored=1
After my PR:
PLAY [Prepare nodes for upgrade] ********************************************************************************************************************************************************************************
TASK [kubespray-defaults : Determine reachable hosts] ***********************************************************************************************************************************************************
ok: [k8s1.local] => (item=k8s1.local)
[WARNING]: Unhandled error in Python interpreter discovery for host k8s1.local: Failed to connect to the host via ssh: ssh: connect to host k8s3.local port 22: Connection timed out
failed: [k8s1.local -> k8s3.local] (item=k8s3.local) => {"ansible_loop_var": "item", "item": "k8s3.local", "msg": "Data could not be sent to remote host \"k8s3.local\". Make sure this host can be reached over ssh: ssh: connect to host k8s3.local port 22: Connection timed out\r\n", "unreachable": true}
ok: [k8s1.local -> k8s2.local] => (item=k8s2.local)
fatal: [k8s1.local -> {{ item }}]: UNREACHABLE! => {"changed": false, "msg": "All items completed", "results": [{"ansible_loop_var": "item", "changed": false, "failed": false, "invocation": {"module_args": {"data": "pong"}}, "item": "k8s1.local", "ping": "pong"}, {"ansible_loop_var": "item", "item": "k8s3.local", "msg": "Data could not be sent to remote host \"k8s3.local\". Make sure this host can be reached over ssh: ssh: connect to host k8s3.local port 22: Connection timed out\r\n", "unreachable": true}, {"ansible_loop_var": "item", "changed": false, "failed": false, "invocation": {"module_args": {"data": "pong"}}, "item": "k8s2.local", "ping": "pong"}]}
...ignoring
TASK [kubespray-defaults : Gather ansible_default_ipv4 from reachable hosts] ************************************************************************************************************************************
ok: [k8s1.local] => (item=k8s1.local)
ok: [k8s1.local -> k8s2.local] => (item=k8s2.local)
TASK [kubespray-defaults : Create fallback_ips_base] ************************************************************************************************************************************************************
ok: [k8s1.local -> localhost]
TASK [kubespray-defaults : Set fallback_ips] ********************************************************************************************************************************************************************
ok: [k8s1.local]
ok: [k8s2.local]
ok: [k8s3.local]
TASK [Task that runs after] *************************************************************************************************************************************************************************************
ok: [k8s1.local] => {
    "fallback_ips": {
        "k8s1.local": "10.88.111.29",
        "k8s2.local": "10.88.111.30",
        "k8s3.local": "127.0.0.1"
    }
}
ok: [k8s3.local] => {
    "fallback_ips": {
        "k8s1.local": "10.88.111.29",
        "k8s2.local": "10.88.111.30",
        "k8s3.local": "127.0.0.1"
    }
}
ok: [k8s2.local] => {
    "fallback_ips": {
        "k8s1.local": "10.88.111.29",
        "k8s2.local": "10.88.111.30",
        "k8s3.local": "127.0.0.1"
    }
}
PLAY RECAP ******************************************************************************************************************************************************************************************************
k8s1.local                 : ok=5    changed=0    unreachable=0    failed=0    skipped=3    rescued=0    ignored=1
k8s2.local                 : ok=2    changed=0    unreachable=0    failed=0    skipped=2    rescued=0    ignored=0
k8s3.local                 : ok=2    changed=0    unreachable=0    failed=0    skipped=2    rescued=0    ignored=0
As a side note, I originally tried fixing this with the change below, adding ignore_errors: true to the existing task:
# roles/kubespray-defaults/tasks/fallback_ips.yml
 ---
 # Set 127.0.0.1 as fallback IP if we do not have host facts for host
 # ansible_default_ipv4 isn't what you think.
 # Thanks https://medium.com/opsops/ansible-default-ipv4-is-not-what-you-think-edb8ab154b10
 - name: Gather ansible_default_ipv4 from all hosts
   setup:
     gather_subset: '!all,network'
     filter: "ansible_default_ipv4"
   delegate_to: "{{ item }}"
   delegate_facts: yes
   when: hostvars[item].ansible_default_ipv4 is not defined
   loop: "{{ (groups['k8s_cluster'] | default([]) + groups['etcd'] | default([]) + groups['calico_rr'] | default([])) | unique }}"
   run_once: yes
   ignore_unreachable: true
+  ignore_errors: true
   tags: always
 - name: Create fallback_ips_base
   set_fact:
     fallback_ips_base: |
       ---
       {% for item in (groups['k8s_cluster'] | default([]) + groups['etcd'] | default([]) + groups['calico_rr'] | default([])) | unique %}
-      {% set found = hostvars[item].get('ansible_default_ipv4') %}
+      {% set found = hostvars[item].get('ansible_default_ipv4', {}) %}
       {{ item }}: "{{ found.get('address', '127.0.0.1') }}"
       {% endfor %}
   delegate_to: localhost
   connection: local
   delegate_facts: yes
   become: no
   run_once: yes
 - name: Set fallback_ips
   set_fact:
     fallback_ips: "{{ hostvars.localhost.fallback_ips_base | from_yaml }}"
PLAY [Prepare nodes for upgrade] ********************************************************************************************************************************************************************************
TASK [kubespray-defaults : Gather ansible_default_ipv4 from all hosts] ******************************************************************************************************************************************
ok: [k8s1.local] => (item=k8s1.local)
[WARNING]: Unhandled error in Python interpreter discovery for host k8s1.local: Failed to connect to the host via ssh: ssh: connect to host k8s3.local port 22: Connection timed out
failed: [k8s1.local -> k8s3.local] (item=k8s3.local) => {"ansible_loop_var": "item", "item": "k8s3.local", "msg": "Data could not be sent to remote host \"k8s3.local\". Make sure this host can be reached over ssh: ssh: connect to host k8s3.local port 22: Connection timed out\r\n", "unreachable": true}
ok: [k8s1.local -> k8s2.local] => (item=k8s2.local)
fatal: [k8s1.local -> {{ item }}]: UNREACHABLE! => {"changed": false, "msg": "All items completed", "results": [{"ansible_facts": {"ansible_default_ipv4": {"address": "10.88.111.29", "alias": "eth0", "broadcast": "10.88.111.255", "gateway": "10.88.111.254", "interface": "eth0", "macaddress": "bc:24:11:41:88:12", "mtu": 1500, "netmask": "255.255.252.0", "network": "10.88.108.0", "prefix": "22", "type": "ether"}, "discovered_interpreter_python": "/usr/bin/python3"}, "ansible_loop_var": "item", "changed": false, "failed": false, "invocation": {"module_args": {"fact_path": "/etc/ansible/facts.d", "filter": ["ansible_default_ipv4"], "gather_subset": ["!all", "network"], "gather_timeout": 10}}, "item": "k8s1.local"}, {"ansible_loop_var": "item", "item": "k8s3.local", "msg": "Data could not be sent to remote host \"k8s3.local\". Make sure this host can be reached over ssh: ssh: connect to host k8s3.local port 22: Connection timed out\r\n", "unreachable": true}, {"ansible_facts": {"ansible_default_ipv4": {"address": "10.88.111.30", "alias": "eth0", "broadcast": "10.88.111.255", "gateway": "10.88.111.254", "interface": "eth0", "macaddress": "bc:24:11:be:42:a6", "mtu": 1500, "netmask": "255.255.252.0", "network": "10.88.108.0", "prefix": "22", "type": "ether"}, "discovered_interpreter_python": "/usr/bin/python3"}, "ansible_loop_var": "item", "changed": false, "failed": false, "invocation": {"module_args": {"fact_path": "/etc/ansible/facts.d", "filter": ["ansible_default_ipv4"], "gather_subset": ["!all", "network"], "gather_timeout": 10}}, "item": "k8s2.local"}]}
...ignoring
TASK [kubespray-defaults : Create fallback_ips_base] ************************************************************************************************************************************************************
ok: [k8s1.local -> localhost]
TASK [kubespray-defaults : Set fallback_ips] ********************************************************************************************************************************************************************
ok: [k8s1.local]
ok: [k8s3.local]
ok: [k8s2.local]
TASK [Task that runs after] *************************************************************************************************************************************************************************************
ok: [k8s1.local] => {
    "fallback_ips": {
        "k8s1.local": "127.0.0.1",
        "k8s2.local": "127.0.0.1",
        "k8s3.local": "127.0.0.1"
    }
}
ok: [k8s3.local] => {
    "fallback_ips": {
        "k8s1.local": "127.0.0.1",
        "k8s2.local": "127.0.0.1",
        "k8s3.local": "127.0.0.1"
    }
}
ok: [k8s2.local] => {
    "fallback_ips": {
        "k8s1.local": "127.0.0.1",
        "k8s2.local": "127.0.0.1",
        "k8s3.local": "127.0.0.1"
    }
}
PLAY RECAP ******************************************************************************************************************************************************************************************************
k8s1.local                 : ok=4    changed=0    unreachable=0    failed=0    skipped=3    rescued=0    ignored=1
k8s2.local                 : ok=2    changed=0    unreachable=0    failed=0    skipped=2    rescued=0    ignored=0
k8s3.local                 : ok=2    changed=0    unreachable=0    failed=0    skipped=2    rescued=0    ignored=0
But as you can see, while that change gets past the early exit, the discovered facts aren't saved, so the resulting fallback_ips are all wrong (every host falls back to 127.0.0.1). That's why I decided to first filter out the unreachable hosts in a separate task before running setup.
Does this PR introduce a user-facing change?:
None
Hi @Rickkwa. Thanks for your PR.
I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.
Once the patch is verified, the new status will be reflected by the ok-to-test label.
I understand the commands that are listed here.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/ok-to-test
Do you need any more info from me to move this forward? I'm hoping this can get into the 2.25 release, whenever that is.
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: Rickkwa. Once this PR has been reviewed and has the lgtm label, please assign mzaian for approval. For more information see the Kubernetes Code Review Process.
The full list of commands accepted by this bot can be found here.
Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
Keywords which can automatically close issues and at(@) or hashtag(#) mentions are not allowed in commit messages.
The list of commits with invalid commit messages:
- 77cc749 Bump tox from 4.11.3 to 4.15.0 (#11133)
- 7d0e887 Bump jinja2 from 3.1.2 to 3.1.3 (#11119)
- ad95947 Bump molecule-plugins[vagrant] from 23.5.0 to 23.5.3 (#11120)
- 688a0c7 Bump tzdata from 2023.3 to 2024.1 (#11121)
- 74405ce Bump netaddr from 0.9.0 to 1.2.1 (#11148)
- cc245e6 Bump ansible from 9.3.0 to 9.5.1 (#11157)
- 1d0a09f Bump ruamel-yaml from 0.18.5 to 0.18.6 (#11147)
- 89ff14f Bump jinja2 from 3.1.3 to 3.1.4 (#11166)
- c59e9ca Bump ansible-lint from 6.22.2 to 24.2.3 (#11151)
- e3145c0 Bump pytest-testinfra from 9.0.0 to 10.1.0 (#11149)
- b5e8b78 Bump molecule from 6.0.2 to 24.2.1 (#11150)
- 997d1b7 Bump markupsafe from 2.1.3 to 2.1.5 (#11176)
- 4e02645 Bump yamllint from 1.32.0 to 1.35.1 (#11177)
- bc1085c Bump cryptography from 41.0.4 to 42.0.7 (#11187)
- d8d1850 Bump ara[server] from 1.7.0 to 1.7.1 (#11178)
- 7f0b927 Bump pbr from 5.11.1 to 6.0.0 (#11188)
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
Gah, I'm bad at rebasing. Let me re-open a new one.
@Rickkwa: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:
| Test name | Commit | Details | Required | Rerun command | 
|---|---|---|---|---|
| pull-kubespray-yamllint | 499b395f801c4f62eeea4ef1437015387e19f7a3 | link | true | /test pull-kubespray-yamllint | 
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.