delegate_to with unreachable hosts breaks playbook

Open NZSmartie opened this issue 7 years ago • 9 comments

ISSUE TYPE
  • Bug Report
COMPONENT NAME

delegate_to

ANSIBLE VERSION
ansible 2.4.2.0
  config file = /etc/ansible/ansible.cfg
  configured module search path = [u'/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/lib/python2.7/dist-packages/ansible
  executable location = /usr/bin/ansible
  python version = 2.7.12 (default, Nov 20 2017, 18:23:56) [GCC 5.4.0 20160609]
OS / ENVIRONMENT

Ubuntu 16.04.3 LTS

SUMMARY

When using delegate_to (a good example can be found in the documentation: http://docs.ansible.com/ansible/latest/playbooks_delegation.html#delegated-facts), if a host passed to delegate_to is unreachable for whatever reason, the host that is executing the task is also marked as unreachable. A large error is then printed (output from all hosts in the delegate_to?) and the playbook ends with something like:

PLAY RECAP *************************************************************************************************************************************************
foo.example.com    : ok=0    changed=0    unreachable=1    failed=0
STEPS TO REPRODUCE

playbook.yml:

---
- hosts: app_servers
  tasks:
    - name: gather facts from db servers
      setup:
      delegate_to: "{{ item }}"
      delegate_facts: True
      with_items: "{{groups['dbservers']}}"
    - name: test
      debug:
        msg: "test"

inventory.yml:

---
all:
  children:
    app_servers:
      hosts:
        foo.example.com:
    dbservers:
      hosts:
        bar.example.com:
EXPECTED RESULTS
PLAY [app_servers] ***********************************************************************************************************************************

TASK [gather facts from db servers] ******************************************************************************************************************
failed: [foo.example.com] (item=bar.example.com) => {"item": "bar.example.com", "msg": "Failed to connect to the host via ssh: ssh: connect to host bar.example.com port 22: Connection timed out\r\n", "unreachable": true}

TASK [test] ************************************************************************************************************************************************
ok: [foo.example.com] => {
    "msg": "test"
}
ACTUAL RESULTS
PLAY [app_servers] ***********************************************************************************************************************************

TASK [gather facts from db servers] ******************************************************************************************************************
failed: [foo.example.com] (item=bar.example.com) => {"item": "bar.example.com", "msg": "Failed to connect to the host via ssh: ssh: connect to host bar.example.com port 22: Connection timed out\r\n", "unreachable": true}
fatal: [foo.example.com]: UNREACHABLE! => {"changed": false, "msg": "All items completed", "...

NZSmartie avatar Jan 18 '18 04:01 NZSmartie

Updated the issue, as delegate_to alone is the cause. Omitting delegate_facts has no effect on the bug.

NZSmartie avatar Jan 18 '18 05:01 NZSmartie

Digging into this further: when a task is executed using delegate_to, its results are stored against the executing host without any context of which host was actually unreachable. This happens in TaskExecutor.run(self) (lib/ansible/executor/task_executor.py).

When those results are passed back to WorkerProcess.run(self) (lib/ansible/executor/process/worker.py), the TaskResult object stores the results from the task and provides an is_unreachable(self) method to check whether the host is no longer reachable. Since there is no context of which host was unreachable apart from the loop's item property, it returns True, which marks the executing host as unreachable.

I propose a few ideas for fixing this bug:

  • Include a flag or metadata in the result to signify which host was unreachable.
  • Flag that the result came from a delegate_to based task and thus should be ignored/skipped.
  • A task with delegate_to should not check whether the executing host is unreachable.

NZSmartie avatar Jan 19 '18 05:01 NZSmartie

I'll be happy to provide a PR to resolve this, provided that feedback is given on the proposals above.

NZSmartie avatar Jan 22 '18 20:01 NZSmartie

Hello?

NZSmartie avatar Jan 25 '18 21:01 NZSmartie

Can I please get some sort of confirmation from anyone who works on core development so that I may start to plan out a PR to resolve this bug?

NZSmartie avatar Feb 12 '18 13:02 NZSmartie

I'm not sure it is needed; both block/rescue and meta can be used to clear such an error.

Bring this up in one of the IRC core meetings (https://github.com/ansible/community/issues?q=is%3Aopen+is%3Aissue+label%3Acore) to get quorum and a decision on this.

Also this seems to merit a proposal more than an issue: https://github.com/ansible/proposals

bcoca avatar Feb 12 '18 18:02 bcoca

In Ansible 2.6.0, the rescue section is not entered when the delegated host is unreachable. A workaround is needed.

puriadeb avatar Jul 23 '18 08:07 puriadeb

After discussing this with the core team, we think the best approach is to mark the delegated host as 'unreachable', but still fail and mark the inventory_hostname as 'failed'. This should correct the status and still let the play flow continue as expected (and enable rescue/ignore_errors to work).

bcoca avatar Mar 13 '20 15:03 bcoca

In the meantime, is there any way to achieve this?

Edit: figured out that setting ignore_unreachable: yes on the block works, i.e., the task with delegation that fails due to a connection timeout is shown as failed, but the play continues and the rescue gets triggered.
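
A minimal sketch of that workaround, applied to the playbook from the issue description. The block name, the rescue task and its message are illustrative assumptions, and the behavior is as reported in this thread, so it may differ between Ansible versions:

```yaml
---
- hosts: app_servers
  tasks:
    # ignore_unreachable is set on the block, so it applies to every task inside it
    - name: gather facts from db servers, tolerating unreachable ones
      ignore_unreachable: yes
      block:
        - name: gather facts from db servers
          setup:
          delegate_to: "{{ item }}"
          delegate_facts: True
          with_items: "{{ groups['dbservers'] }}"
      rescue:
        # Hypothetical handler: runs when the delegated task is reported as failed
        - name: report unreachable db servers
          debug:
            msg: "at least one db server could not be reached, continuing"

    - name: test
      debug:
        msg: "test"
```

With this layout the final test task should still run on foo.example.com even when bar.example.com times out.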

tumbl3w33d avatar Sep 16 '22 09:09 tumbl3w33d

In Ansible 2.13.5 the problem is still present. ignore_unreachable: yes works, but it still prints "failed" and "fatal" messages. failed_when: false does not help.

fixed77 avatar Oct 31 '22 12:10 fixed77

Any workaround? It's interesting that such an issue is still not resolved in 2023!

shqear93 avatar Jan 03 '24 13:01 shqear93

I have changed my playbook to wait and make sure the host is reachable before executing tasks on it:

        - name: Wait for the instance to become reachable and SSH becomes accessible
          delegate_to: "{{ item.private_ip_address }}"
          loop: "{{ ec2.instances }}"
          wait_for_connection:
            delay: 5
            timeout: 60

If you still need to handle the unreachable error on every step you can manually fail the task:

        - name: Check if Python is installed
          raw: python --version
          register: python_check
          delegate_to: "{{ ec2.instances[0].private_ip_address }}"
          ignore_unreachable: yes
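        - name: Fail when unreachable
          fail:
          # the unreachable key may be absent when the host was reached, so default it to false
          when: python_check.unreachable | default(false)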
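
For a looped delegation, the same idea could be sketched as follows. This assumes the registered loop result carries a per-item unreachable key, as the item output in the issue description suggests; the variable name unreachable_items is made up for illustration:

```yaml
- name: Check if Python is installed on every instance
  raw: python --version
  register: python_check
  delegate_to: "{{ item.private_ip_address }}"
  loop: "{{ ec2.instances }}"
  ignore_unreachable: yes

- name: Fail when any instance was unreachable
  fail:
    msg: "Unreachable: {{ unreachable_items | map(attribute='item.private_ip_address') | join(', ') }}"
  vars:
    # Keep only the loop results that define unreachable and have it set to true
    unreachable_items: >-
      {{ python_check.results
         | selectattr('unreachable', 'defined')
         | selectattr('unreachable')
         | list }}
  when: unreachable_items | length > 0
```

Collecting the unreachable items first keeps the failure message specific about which delegated hosts were down.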

shqear93 avatar Jan 08 '24 12:01 shqear93

This makes it very difficult to work with clusters where some nodes may be down.

Edit: in 2.16.5 even rescue: meta: clear_host_errors does nothing; the delegated task still hard fails, regardless of whether the task and/or block have ignore_unreachable.

Edit 2: actually, it is just the debugger that is incorrectly invoked if it is enabled on failure. Without fail debugging enabled, it seems to work.

bluikko avatar Apr 05 '24 09:04 bluikko

@bluikko it does do something; the problem is that it only affects hosts in the current play, which does not include the delegate_to ones.

bcoca avatar Apr 05 '24 13:04 bcoca