
fallback_ips: Gather facts from reachable hosts only

Open · Rickkwa opened this pull request 1 year ago • 4 comments

What type of PR is this?

/kind bug

What this PR does / why we need it:

When working with a large fleet of nodes, some nodes will inevitably be unreachable. The linked issue describes how, when kubespray-defaults runs and an unreachable node exists, the play unexpectedly exits early.

This PR fixes fallback_ips.yml so that the rest of the role can finish when unreachable node(s) exist.
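
Roughly, the fix splits the single fact-gathering loop into two steps: first determine which hosts answer a ping, then gather ansible_default_ipv4 only from those. A simplified sketch of the idea (not the exact change in this PR; the reachability_check variable name is illustrative):

- name: Determine reachable hosts
  ping:
  delegate_to: "{{ item }}"
  loop: "{{ (groups['k8s_cluster'] | default([]) + groups['etcd'] | default([]) + groups['calico_rr'] | default([])) | unique }}"
  run_once: yes
  register: reachability_check   # illustrative name for the registered ping results
  ignore_unreachable: true
  ignore_errors: true
  tags: always

- name: Gather ansible_default_ipv4 from reachable hosts
  setup:
    gather_subset: '!all,network'
    filter: "ansible_default_ipv4"
  delegate_to: "{{ item }}"
  delegate_facts: yes
  when: hostvars[item].ansible_default_ipv4 is not defined
  # keep only loop items whose ping result was not marked unreachable
  loop: "{{ reachability_check.results | rejectattr('unreachable', 'defined') | map(attribute='item') | list }}"
  run_once: yes
  tags: always

Hosts filtered out here never get real facts, so the existing fallback_ips_base template falls back to 127.0.0.1 for them, as the output below shows.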

Which issue(s) this PR fixes:

Fixes #10993

Special notes for your reviewer:

Local setup (inventory and test playbook):

Inventory
[k8s_cluster:children]
kube_control_plane
kube_node
calico_rr

[kube_control_plane]
k8s1.local

[etcd]
k8s1.local

[kube_node]
k8s3.local  # unreachable
k8s2.local

[calico_rr]

Playbook

---
- name: Prepare nodes for upgrade
  hosts: k8s_cluster:etcd:calico_rr
  gather_facts: False
  any_errors_fatal: "{{ any_errors_fatal | default(true) }}"
  environment: "{{ proxy_disable_env }}"
  roles:
    - { role: kubespray-defaults }

  post_tasks:
    - name: Task that runs after
      debug:
        var: fallback_ips

Showing the original output before applying my PR:

PLAY [Prepare nodes for upgrade] ********************************************************************************************************************************************************************************

TASK [kubespray-defaults : Gather ansible_default_ipv4 from all hosts] ******************************************************************************************************************************************
ok: [k8s1.local] => (item=k8s1.local)
[WARNING]: Unhandled error in Python interpreter discovery for host k8s1.local: Failed to connect to the host via ssh: ssh: connect to host k8s3.local port 22: Connection timed out
failed: [k8s1.local -> k8s3.local] (item=k8s3.local) => {"ansible_loop_var": "item", "item": "k8s3.local", "msg": "Data could not be sent to remote host \"k8s3.local\". Make sure this host can be reached over ssh: ssh: connect to host k8s3.local port 22: Connection timed out\r\n", "unreachable": true}
ok: [k8s1.local -> k8s2.local] => (item=k8s2.local)
fatal: [k8s1.local -> {{ item }}]: UNREACHABLE! => {"changed": false, "msg": "All items completed", "results": [{"ansible_facts": {"ansible_default_ipv4": {"address": "10.88.111.29", "alias": "eth0", "broadcast": "10.88.111.255", "gateway": "10.88.111.254", "interface": "eth0", "macaddress": "bc:24:11:41:88:12", "mtu": 1500, "netmask": "255.255.252.0", "network": "10.88.108.0", "prefix": "22", "type": "ether"}, "discovered_interpreter_python": "/usr/bin/python3"}, "ansible_loop_var": "item", "changed": false, "failed": false, "invocation": {"module_args": {"fact_path": "/etc/ansible/facts.d", "filter": ["ansible_default_ipv4"], "gather_subset": ["!all", "network"], "gather_timeout": 10}}, "item": "k8s1.local"}, {"ansible_loop_var": "item", "item": "k8s3.local", "msg": "Data could not be sent to remote host \"k8s3.local\". Make sure this host can be reached over ssh: ssh: connect to host k8s3.local port 22: Connection timed out\r\n", "unreachable": true}, {"ansible_facts": {"ansible_default_ipv4": {"address": "10.88.111.30", "alias": "eth0", "broadcast": "10.88.111.255", "gateway": "10.88.111.254", "interface": "eth0", "macaddress": "bc:24:11:be:42:a6", "mtu": 1500, "netmask": "255.255.252.0", "network": "10.88.108.0", "prefix": "22", "type": "ether"}, "discovered_interpreter_python": "/usr/bin/python3"}, "ansible_loop_var": "item", "changed": false, "failed": false, "invocation": {"module_args": {"fact_path": "/etc/ansible/facts.d", "filter": ["ansible_default_ipv4"], "gather_subset": ["!all", "network"], "gather_timeout": 10}}, "item": "k8s2.local"}]}
...ignoring

NO MORE HOSTS LEFT **********************************************************************************************************************************************************************************************

PLAY RECAP ******************************************************************************************************************************************************************************************************
k8s1.local                 : ok=1    changed=0    unreachable=0    failed=0    skipped=0    rescued=0    ignored=1

After my PR:

PLAY [Prepare nodes for upgrade] ********************************************************************************************************************************************************************************

TASK [kubespray-defaults : Determine reachable hosts] ***********************************************************************************************************************************************************
ok: [k8s1.local] => (item=k8s1.local)
[WARNING]: Unhandled error in Python interpreter discovery for host k8s1.local: Failed to connect to the host via ssh: ssh: connect to host k8s3.local port 22: Connection timed out
failed: [k8s1.local -> k8s3.local] (item=k8s3.local) => {"ansible_loop_var": "item", "item": "k8s3.local", "msg": "Data could not be sent to remote host \"k8s3.local\". Make sure this host can be reached over ssh: ssh: connect to host k8s3.local port 22: Connection timed out\r\n", "unreachable": true}
ok: [k8s1.local -> k8s2.local] => (item=k8s2.local)
fatal: [k8s1.local -> {{ item }}]: UNREACHABLE! => {"changed": false, "msg": "All items completed", "results": [{"ansible_loop_var": "item", "changed": false, "failed": false, "invocation": {"module_args": {"data": "pong"}}, "item": "k8s1.local", "ping": "pong"}, {"ansible_loop_var": "item", "item": "k8s3.local", "msg": "Data could not be sent to remote host \"k8s3.local\". Make sure this host can be reached over ssh: ssh: connect to host k8s3.local port 22: Connection timed out\r\n", "unreachable": true}, {"ansible_loop_var": "item", "changed": false, "failed": false, "invocation": {"module_args": {"data": "pong"}}, "item": "k8s2.local", "ping": "pong"}]}
...ignoring

TASK [kubespray-defaults : Gather ansible_default_ipv4 from reachable hosts] ************************************************************************************************************************************
ok: [k8s1.local] => (item=k8s1.local)
ok: [k8s1.local -> k8s2.local] => (item=k8s2.local)

TASK [kubespray-defaults : Create fallback_ips_base] ************************************************************************************************************************************************************
ok: [k8s1.local -> localhost]

TASK [kubespray-defaults : Set fallback_ips] ********************************************************************************************************************************************************************
ok: [k8s1.local]
ok: [k8s2.local]
ok: [k8s3.local]

TASK [Task that runs after] *************************************************************************************************************************************************************************************
ok: [k8s1.local] => {
    "fallback_ips": {
        "k8s1.local": "10.88.111.29",
        "k8s2.local": "10.88.111.30",
        "k8s3.local": "127.0.0.1"
    }
}
ok: [k8s3.local] => {
    "fallback_ips": {
        "k8s1.local": "10.88.111.29",
        "k8s2.local": "10.88.111.30",
        "k8s3.local": "127.0.0.1"
    }
}
ok: [k8s2.local] => {
    "fallback_ips": {
        "k8s1.local": "10.88.111.29",
        "k8s2.local": "10.88.111.30",
        "k8s3.local": "127.0.0.1"
    }
}

PLAY RECAP ******************************************************************************************************************************************************************************************************
k8s1.local                 : ok=5    changed=0    unreachable=0    failed=0    skipped=3    rescued=0    ignored=1
k8s2.local                 : ok=2    changed=0    unreachable=0    failed=0    skipped=2    rescued=0    ignored=0
k8s3.local                 : ok=2    changed=0    unreachable=0    failed=0    skipped=2    rescued=0    ignored=0

Also to note: I originally tried fixing this by just adding ignore_errors: true to the existing task:

# roles/kubespray-defaults/tasks/fallback_ips.yml
 ---
 # Set 127.0.0.1 as fallback IP if we do not have host facts for host
 # ansible_default_ipv4 isn't what you think.
 # Thanks https://medium.com/opsops/ansible-default-ipv4-is-not-what-you-think-edb8ab154b10

 - name: Gather ansible_default_ipv4 from all hosts
   setup:
     gather_subset: '!all,network'
     filter: "ansible_default_ipv4"
   delegate_to: "{{ item }}"
   delegate_facts: yes
   when: hostvars[item].ansible_default_ipv4 is not defined
   loop: "{{ (groups['k8s_cluster'] | default([]) + groups['etcd'] | default([]) + groups['calico_rr'] | default([])) | unique }}"
   run_once: yes
   ignore_unreachable: true
+  ignore_errors: true
   tags: always

 - name: Create fallback_ips_base
   set_fact:
     fallback_ips_base: |
       ---
       {% for item in (groups['k8s_cluster'] | default([]) + groups['etcd'] | default([]) + groups['calico_rr'] | default([])) | unique %}
-      {% set found = hostvars[item].get('ansible_default_ipv4') %}
+      {% set found = hostvars[item].get('ansible_default_ipv4', {}) %}
       {{ item }}: "{{ found.get('address', '127.0.0.1') }}"
       {% endfor %}
   delegate_to: localhost
   connection: local
   delegate_facts: yes
   become: no
   run_once: yes

 - name: Set fallback_ips
   set_fact:
     fallback_ips: "{{ hostvars.localhost.fallback_ips_base | from_yaml }}"

With ignore_errors: true added, the output was:

PLAY [Prepare nodes for upgrade] ********************************************************************************************************************************************************************************

TASK [kubespray-defaults : Gather ansible_default_ipv4 from all hosts] ******************************************************************************************************************************************
ok: [k8s1.local] => (item=k8s1.local)
[WARNING]: Unhandled error in Python interpreter discovery for host k8s1.local: Failed to connect to the host via ssh: ssh: connect to host k8s3.local port 22: Connection timed out
failed: [k8s1.local -> k8s3.local] (item=k8s3.local) => {"ansible_loop_var": "item", "item": "k8s3.local", "msg": "Data could not be sent to remote host \"k8s3.local\". Make sure this host can be reached over ssh: ssh: connect to host k8s3.local port 22: Connection timed out\r\n", "unreachable": true}
ok: [k8s1.local -> k8s2.local] => (item=k8s2.local)
fatal: [k8s1.local -> {{ item }}]: UNREACHABLE! => {"changed": false, "msg": "All items completed", "results": [{"ansible_facts": {"ansible_default_ipv4": {"address": "10.88.111.29", "alias": "eth0", "broadcast": "10.88.111.255", "gateway": "10.88.111.254", "interface": "eth0", "macaddress": "bc:24:11:41:88:12", "mtu": 1500, "netmask": "255.255.252.0", "network": "10.88.108.0", "prefix": "22", "type": "ether"}, "discovered_interpreter_python": "/usr/bin/python3"}, "ansible_loop_var": "item", "changed": false, "failed": false, "invocation": {"module_args": {"fact_path": "/etc/ansible/facts.d", "filter": ["ansible_default_ipv4"], "gather_subset": ["!all", "network"], "gather_timeout": 10}}, "item": "k8s1.local"}, {"ansible_loop_var": "item", "item": "k8s3.local", "msg": "Data could not be sent to remote host \"k8s3.local\". Make sure this host can be reached over ssh: ssh: connect to host k8s3.local port 22: Connection timed out\r\n", "unreachable": true}, {"ansible_facts": {"ansible_default_ipv4": {"address": "10.88.111.30", "alias": "eth0", "broadcast": "10.88.111.255", "gateway": "10.88.111.254", "interface": "eth0", "macaddress": "bc:24:11:be:42:a6", "mtu": 1500, "netmask": "255.255.252.0", "network": "10.88.108.0", "prefix": "22", "type": "ether"}, "discovered_interpreter_python": "/usr/bin/python3"}, "ansible_loop_var": "item", "changed": false, "failed": false, "invocation": {"module_args": {"fact_path": "/etc/ansible/facts.d", "filter": ["ansible_default_ipv4"], "gather_subset": ["!all", "network"], "gather_timeout": 10}}, "item": "k8s2.local"}]}
...ignoring

TASK [kubespray-defaults : Create fallback_ips_base] ************************************************************************************************************************************************************
ok: [k8s1.local -> localhost]

TASK [kubespray-defaults : Set fallback_ips] ********************************************************************************************************************************************************************
ok: [k8s1.local]
ok: [k8s3.local]
ok: [k8s2.local]

TASK [Task that runs after] *************************************************************************************************************************************************************************************
ok: [k8s1.local] => {
    "fallback_ips": {
        "k8s1.local": "127.0.0.1",
        "k8s2.local": "127.0.0.1",
        "k8s3.local": "127.0.0.1"
    }
}
ok: [k8s3.local] => {
    "fallback_ips": {
        "k8s1.local": "127.0.0.1",
        "k8s2.local": "127.0.0.1",
        "k8s3.local": "127.0.0.1"
    }
}
ok: [k8s2.local] => {
    "fallback_ips": {
        "k8s1.local": "127.0.0.1",
        "k8s2.local": "127.0.0.1",
        "k8s3.local": "127.0.0.1"
    }
}

PLAY RECAP ******************************************************************************************************************************************************************************************************
k8s1.local                 : ok=4    changed=0    unreachable=0    failed=0    skipped=3    rescued=0    ignored=1
k8s2.local                 : ok=2    changed=0    unreachable=0    failed=0    skipped=2    rescued=0    ignored=0
k8s3.local                 : ok=2    changed=0    unreachable=0    failed=0    skipped=2    rescued=0    ignored=0

But as you can see, while it gets past the early exit, the discovered facts are not saved, so the resulting fallback_ips are all wrong: every host falls back to 127.0.0.1. That's why I decided to first filter out the unreachable hosts in a separate task before running setup.
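
A quick way to confirm whether the delegated facts actually get saved (the crux of why the ignore_errors-only approach fails) is a debug task over hostvars; this is purely illustrative and not part of the PR:

- name: Check whether delegated facts were saved (illustrative)
  debug:
    msg: "{{ item }}: {{ hostvars[item].ansible_default_ipv4 | default('NOT SAVED') }}"
  loop: "{{ (groups['k8s_cluster'] | default([]) + groups['etcd'] | default([]) + groups['calico_rr'] | default([])) | unique }}"
  run_once: yes

With the ignore_errors variant, this would be expected to print NOT SAVED for every host, matching the all-127.0.0.1 fallback_ips shown above.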

Does this PR introduce a user-facing change?:

None

Rickkwa avatar Mar 13 '24 00:03 Rickkwa

Hi @Rickkwa. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Mar 13 '24 00:03 k8s-ci-robot

/ok-to-test

cyclinder avatar Mar 13 '24 01:03 cyclinder

Do you need any more info from me to move this forward? I'm hoping this can get into the 2.25 release, whenever that is.

Rickkwa avatar Apr 12 '24 21:04 Rickkwa

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: Rickkwa. Once this PR has been reviewed and has the lgtm label, please assign mzaian for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment. Approvers can cancel approval by writing /approve cancel in a comment.

k8s-ci-robot avatar Jun 26 '24 13:06 k8s-ci-robot

Keywords which can automatically close issues and at(@) or hashtag(#) mentions are not allowed in commit messages.

The list of commits with invalid commit messages:

  • 77cc749 Bump tox from 4.11.3 to 4.15.0 (#11133)
  • 7d0e887 Bump jinja2 from 3.1.2 to 3.1.3 (#11119)
  • ad95947 Bump molecule-plugins[vagrant] from 23.5.0 to 23.5.3 (#11120)
  • 688a0c7 Bump tzdata from 2023.3 to 2024.1 (#11121)
  • 74405ce Bump netaddr from 0.9.0 to 1.2.1 (#11148)
  • cc245e6 Bump ansible from 9.3.0 to 9.5.1 (#11157)
  • 1d0a09f Bump ruamel-yaml from 0.18.5 to 0.18.6 (#11147)
  • 89ff14f Bump jinja2 from 3.1.3 to 3.1.4 (#11166)
  • c59e9ca Bump ansible-lint from 6.22.2 to 24.2.3 (#11151)
  • e3145c0 Bump pytest-testinfra from 9.0.0 to 10.1.0 (#11149)
  • b5e8b78 Bump molecule from 6.0.2 to 24.2.1 (#11150)
  • 997d1b7 Bump markupsafe from 2.1.3 to 2.1.5 (#11176)
  • 4e02645 Bump yamllint from 1.32.0 to 1.35.1 (#11177)
  • bc1085c Bump cryptography from 41.0.4 to 42.0.7 (#11187)
  • d8d1850 Bump ara[server] from 1.7.0 to 1.7.1 (#11178)
  • 7f0b927 Bump pbr from 5.11.1 to 6.0.0 (#11188)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

k8s-ci-robot avatar Jun 26 '24 22:06 k8s-ci-robot

Gah, I'm bad at rebasing. Let me re-open a new one.

Rickkwa avatar Jun 26 '24 22:06 Rickkwa

@Rickkwa: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name:     pull-kubespray-yamllint
Commit:        499b395f801c4f62eeea4ef1437015387e19f7a3
Details:       link
Required:      true
Rerun command: /test pull-kubespray-yamllint

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

k8s-ci-robot avatar Jun 26 '24 22:06 k8s-ci-robot