cisco.nxos.nxos_install_os fails with issu parameter set to 'yes'

Open boleslawlucjanek opened this issue 3 years ago • 5 comments

SUMMARY

The module cisco.nxos.nxos_install_os fails when the issu parameter is set to 'yes' (with check_mode set to 'no') and returns the following messages: "raw_data": [ "timeout value 600 seconds reached while trying to send command: b'install all nxos nxos.9.3.9.bin non-disruptive'"], "msg": "Failed to upgrade device using command: ['terminal dont-ask', 'install all nxos nxos.9.3.9.bin non-disruptive']".

The module works as expected with issu set to 'yes' in check mode (check_mode set to 'yes'). It also works as expected with issu set to 'no' (check_mode set to 'no'). With issu set to 'yes' and check_mode set to 'no', the task fails, yet the command 'install all nxos nxos.9.3.9.bin non-disruptive' is sent to the Nexus switch successfully. The task is reported as failed, but the switch starts upgrading via the non-disruptive method.

ISSUE TYPE
  • Bug Report
COMPONENT NAME

module: nxos_install_os

ANSIBLE VERSION
  ansible [core 2.13.1]
  config file = /etc/ansible/ansible.cfg
  python version = 3.8.13 (default, Apr  5 2022, 17:15:15) [GCC 9.1.1 20190605 (Red Hat 9.1.1-2)]
  jinja version = 3.1.2
  libyaml = True
COLLECTION VERSION
Collection Version
---------- -------
cisco.nxos 3.1.0
CONFIGURATION

OS / ENVIRONMENT

Nexus switch model: N9K-C93180YC-FX
OS version: 9.3(7)
Target OS version: 9.3(9)

STEPS TO REPRODUCE
- name: ISSU non-disruptive OS upgrade on N9k
  check_mode: no
  cisco.nxos.nxos_install_os:
    system_image_file: nxos.9.3.9.bin
    issu: yes
  register: show_install_output

- name: Print show install output
  debug:
    var: show_install_output

EXPECTED RESULTS

The task should not fail. It should succeed and register the return value below (for key 'install_state'):

"show_install_output.install_state": [
    "Compatibility check is done:",
    "Module bootable Impact Install-type Reason",
    "------ -------- -------------- ------------ ------",
    " 1 yes non-disruptive reset ",
    "Images will be upgraded according to following table:",
    "Module Image Running-Version(pri:alt) New-Version Upg-Required",
    "------ ---------- ---------------------------------------- -------------------- ------------",
    " 1 nxos 9.3(7) 9.3(9) yes",
    " 1 bios v05.45(07/05/2021):v05.28(01/18/2018) v05.45(07/05/2021) no",
    "--------------------------------------"
]

ACTUAL RESULTS

Task fails: fatal: [N9K-C93180YC-FX]: FAILED! =>

{
    "raw_data": [
        "timeout value 600 seconds reached while trying to send command: b'install all nxos nxos.9.3.9.bin non-disruptive'"
    ],
    "msg": "Failed to upgrade device using command: ['terminal dont-ask', 'install all nxos nxos.9.3.9.bin non-disruptive']",
    "invocation": {
        "module_args": {
            "system_image_file": "nxos.9.3.9.bin",
            "issu": "yes",
            "kickstart_image_file": null,
            "provider": null
        }
    },
    "_ansible_no_log": false,
    "changed": false
}

boleslawlucjanek avatar Sep 01 '22 06:09 boleslawlucjanek

Hello there,

I think I am facing the same behaviour on both N9K-C93180YC-EX and FX. So far I have tried the following upgrades (on both EX and FX):

  • from 7.0(3)I7(8) to 9.3(7),
  • from 7.0(3)I7(8) to 9.3(9),
  • from 9.3(7) to 9.3(9).

The upgrade itself completes on the switch, but for some reason the Ansible task times out, so the rest of the playbook does not run. I have a console connection to the switch, so I can see the upgrade happening and the switch reloading and becoming ready again in less than 600 seconds. The Ansible playbook does not notice this and simply waits for the 600-second timer to expire before failing.

What bothers me the most is that everything works fine on N3K-C3048TP-1GE and N3K-C3548P-10G. In that case, I can see the upgrade happening from the console, and the Ansible playbook resumes right after the switch is reachable over SSH again.

I can provide logs and try fixes if needed. I also have access to a lab with other N9K and N3K models, if that can help troubleshoot this.

Here is a minimal environment where the issue happens.

Note: see the very end of this post for the successful output of the same playbook run against a N3K-C3048TP-1GE.

Ansible collection list (truncated) :

# /home/ansible/.ansible/collections/ansible_collections
Collection        Version
----------------- -------
ansible.netcommon 3.1.0  
ansible.utils     2.6.1  
cisco.nxos        3.1.0  

Inventory file (filename : hosts) :

---
nxos:
  vars:
    ansible_connection: ansible.netcommon.network_cli
    ansible_network_os: cisco.nxos.nxos
  hosts:
    N93180EX_1:
      ansible_host: 10.1.1.1
...

Playbook (filename : upgrade.yaml) :

- name: Simple upgrade
  gather_facts: false
  hosts: nxos
  tasks:
    - name: Upgrade NXOS
      cisco.nxos.nxos_install_os:
        system_image_file: nxos.9.3.9.bin
        issu: desired
      register: debugvar

    - name: Debug upgrade
      debug:
        var: debugvar

Command : ansible-playbook upgrade.yaml -i hosts -u admin -k -vvvvv

ansible.cfg :

[persistent_connection]
connect_timeout = 600
command_timeout = 600
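
For reference, the same two timeouts can also be expressed as inventory variables instead of ansible.cfg entries. This is only a sketch of an equivalent setup, assuming the `ansible_command_timeout` and `ansible_connect_timeout` connection variables of `ansible.netcommon.network_cli`; it was not part of the tested environment, and since the switch is back well before 600 seconds, raising these values is not expected to fix the failure by itself.

```yaml
---
# Hypothetical variant of the inventory used in this post: the timeouts move
# from ansible.cfg into group variables. Values mirror the ansible.cfg above.
nxos:
  vars:
    ansible_connection: ansible.netcommon.network_cli
    ansible_network_os: cisco.nxos.nxos
    ansible_command_timeout: 600   # same meaning as command_timeout
    ansible_connect_timeout: 600   # same meaning as connect_timeout
  hosts:
    N93180EX_1:
      ansible_host: 10.1.1.1
...
```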

Output :

ansible-playbook [core 2.13.4]
  config file = /home/ansible/testDir/ansible.cfg
  configured module search path = ['/home/ansible/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
  ansible python module location = /home/ansible/.local/lib/python3.9/site-packages/ansible
  ansible collection location = /home/ansible/.ansible/collections:/usr/share/ansible/collections
  executable location = /home/ansible/.local/bin/ansible-playbook
  python version = 3.9.2 (default, Feb 28 2021, 17:03:44) [GCC 10.2.1 20210110]
  jinja version = 3.1.2
  libyaml = True
Using /home/ansible/testDir/ansible.cfg as config file
SSH password: 
setting up inventory plugins
host_list declined parsing /home/ansible/testDir/hosts as it did not pass its verify_file() method
script declined parsing /home/ansible/testDir/hosts as it did not pass its verify_file() method
auto declined parsing /home/ansible/testDir/hosts as it did not pass its verify_file() method
Parsed /home/ansible/testDir/hosts inventory source with yaml plugin
Loading collection cisco.nxos from /home/ansible/.ansible/collections/ansible_collections/cisco/nxos
redirecting (type: action) cisco.nxos.nxos_install_os to cisco.nxos.nxos
Loading callback plugin default of type stdout, v2.0 from /home/ansible/.local/lib/python3.9/site-packages/ansible/plugins/callback/default.py
Attempting to use 'default' callback.
Skipping callback 'default', as we already have a stdout callback.
Attempting to use 'junit' callback.
Attempting to use 'minimal' callback.
Skipping callback 'minimal', as we already have a stdout callback.
Attempting to use 'oneline' callback.
Skipping callback 'oneline', as we already have a stdout callback.
Attempting to use 'tree' callback.

PLAYBOOK: upgrade.yaml *********************************************************************************
Positional arguments: upgrade.yaml
verbosity: 5
remote_user: admin
connection: smart
timeout: 10
ask_pass: True
become_method: sudo
tags: ('all',)
inventory: ('/home/ansible/testDir/hosts',)
forks: 5
1 plays in upgrade.yaml

PLAY [Simple upgrade] **********************************************************************************
META: ran handlers
redirecting (type: action) cisco.nxos.nxos_install_os to cisco.nxos.nxos
Loading collection ansible.netcommon from /home/ansible/.ansible/collections/ansible_collections/ansible/netcommon

TASK [Upgrade NXOS] ************************************************************************************
task path: /home/ansible/testDir/upgrade.yaml:5
<10.1.1.1> attempting to start connection
<10.1.1.1> using connection plugin ansible.netcommon.network_cli
Found ansible-connection at path /home/ansible/.local/bin/ansible-connection
<10.1.1.1> local domain socket does not exist, starting it
<10.1.1.1> control socket path is /home/ansible/.ansible/pc/dd6bec9c00
<10.1.1.1> Loading collection ansible.netcommon from /home/ansible/.ansible/collections/ansible_collections/ansible/netcommon
<10.1.1.1> Loading collection cisco.nxos from /home/ansible/.ansible/collections/ansible_collections/cisco/nxos
<10.1.1.1> local domain socket listeners started successfully
<10.1.1.1> loaded cliconf plugin ansible_collections.cisco.nxos.plugins.cliconf.nxos from path /home/ansible/.ansible/collections/ansible_collections/cisco/nxos/plugins/cliconf/nxos.py for network_os cisco.nxos.nxos
<10.1.1.1> ssh type is set to auto
<10.1.1.1> autodetecting ssh_type
[WARNING]: ansible-pylibssh not installed, falling back to paramiko
<10.1.1.1> ssh type is now set to paramiko
<10.1.1.1> 
<10.1.1.1> local domain socket path is /home/ansible/.ansible/pc/dd6bec9c00
redirecting (type: action) cisco.nxos.nxos_install_os to cisco.nxos.nxos
redirecting (type: action) cisco.nxos.nxos_install_os to cisco.nxos.nxos
<10.1.1.1> PERSISTENT_COMMAND_TIMEOUT is 600
<10.1.1.1> PERSISTENT_CONNECT_TIMEOUT is 600
<10.1.1.1> ANSIBLE_NETWORK_IMPORT_MODULES: enabled
<10.1.1.1> ANSIBLE_NETWORK_IMPORT_MODULES: found cisco.nxos.nxos_install_os  at /home/ansible/.ansible/collections/ansible_collections/cisco/nxos/plugins/modules/nxos_install_os.py
<10.1.1.1> ANSIBLE_NETWORK_IMPORT_MODULES: running cisco.nxos.nxos_install_os
<10.1.1.1> ANSIBLE_NETWORK_IMPORT_MODULES: complete
<10.1.1.1> ANSIBLE_NETWORK_IMPORT_MODULES: Result: {'raw_data': ["timeout value 600 seconds reached while trying to send command: b'install all nxos nxos.9.3.9.bin non-disruptive'"], 'failed': True, 'msg': "Failed to upgrade device using command: ['terminal dont-ask', 'install all nxos nxos.9.3.9.bin non-disruptive']", 'invocation': {'module_args': {'system_image_file': 'nxos.9.3.9.bin', 'issu': 'desired', 'kickstart_image_file': None, 'provider': None}}, '_ansible_parsed': True}
fatal: [N93180EX_1]: FAILED! => {
    "changed": false,
    "invocation": {
        "module_args": {
            "issu": "desired",
            "kickstart_image_file": null,
            "provider": null,
            "system_image_file": "nxos.9.3.9.bin"
        }
    },
    "msg": "Failed to upgrade device using command: ['terminal dont-ask', 'install all nxos nxos.9.3.9.bin non-disruptive']",
    "raw_data": [
        "timeout value 600 seconds reached while trying to send command: b'install all nxos nxos.9.3.9.bin non-disruptive'"
    ]
}

PLAY RECAP *********************************************************************************************
N93180EX_1                 : ok=0    changed=0    unreachable=0    failed=1    skipped=0    rescued=0    ignored=0   

Successful output of the same playbook when upgrading a N3K-C3048TP-1GE from 7.0(3)I7(8) to 9.3(9) (using compacted images, if that matters):

ansible-playbook [core 2.13.4]
  config file = /home/ansible/testDir/ansible.cfg
  configured module search path = ['/home/ansible/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
  ansible python module location = /home/ansible/.local/lib/python3.9/site-packages/ansible
  ansible collection location = /home/ansible/.ansible/collections:/usr/share/ansible/collections
  executable location = /home/ansible/.local/bin/ansible-playbook
  python version = 3.9.2 (default, Feb 28 2021, 17:03:44) [GCC 10.2.1 20210110]
  jinja version = 3.1.2
  libyaml = True
Using /home/ansible/testDir/ansible.cfg as config file
SSH password: 
setting up inventory plugins
host_list declined parsing /home/ansible/testDir/hosts as it did not pass its verify_file() method
script declined parsing /home/ansible/testDir/hosts as it did not pass its verify_file() method
auto declined parsing /home/ansible/testDir/hosts as it did not pass its verify_file() method
Parsed /home/ansible/testDir/hosts inventory source with yaml plugin
Loading collection cisco.nxos from /home/ansible/.ansible/collections/ansible_collections/cisco/nxos
redirecting (type: action) cisco.nxos.nxos_install_os to cisco.nxos.nxos
Loading callback plugin default of type stdout, v2.0 from /home/ansible/.local/lib/python3.9/site-packages/ansible/plugins/callback/default.py
Attempting to use 'default' callback.
Skipping callback 'default', as we already have a stdout callback.
Attempting to use 'junit' callback.
Attempting to use 'minimal' callback.
Skipping callback 'minimal', as we already have a stdout callback.
Attempting to use 'oneline' callback.
Skipping callback 'oneline', as we already have a stdout callback.
Attempting to use 'tree' callback.

PLAYBOOK: upgrade.yaml **********************************************************************************************************************************************************
Positional arguments: upgrade.yaml
verbosity: 5
remote_user: admin
connection: smart
timeout: 10
ask_pass: True
become_method: sudo
tags: ('all',)
inventory: ('/home/ansible/testDir/hosts',)
forks: 5
1 plays in upgrade.yaml

PLAY [Simple upgrade] ***********************************************************************************************************************************************************
META: ran handlers
redirecting (type: action) cisco.nxos.nxos_install_os to cisco.nxos.nxos
Loading collection ansible.netcommon from /home/ansible/.ansible/collections/ansible_collections/ansible/netcommon

TASK [Upgrade NXOS] *************************************************************************************************************************************************************
task path: /home/ansible/testDir/upgrade.yaml:5
<10.1.1.2> attempting to start connection
<10.1.1.2> using connection plugin ansible.netcommon.network_cli
Found ansible-connection at path /home/ansible/.local/bin/ansible-connection
<10.1.1.2> local domain socket does not exist, starting it
<10.1.1.2> control socket path is /home/ansible/.ansible/pc/c3161504c3
<10.1.1.2> Loading collection ansible.netcommon from /home/ansible/.ansible/collections/ansible_collections/ansible/netcommon
<10.1.1.2> Loading collection cisco.nxos from /home/ansible/.ansible/collections/ansible_collections/cisco/nxos
<10.1.1.2> local domain socket listeners started successfully
<10.1.1.2> loaded cliconf plugin ansible_collections.cisco.nxos.plugins.cliconf.nxos from path /home/ansible/.ansible/collections/ansible_collections/cisco/nxos/plugins/cliconf/nxos.py for network_os cisco.nxos.nxos
<10.1.1.2> ssh type is set to auto
<10.1.1.2> autodetecting ssh_type
[WARNING]: ansible-pylibssh not installed, falling back to paramiko
<10.1.1.2> ssh type is now set to paramiko
<10.1.1.2> 
<10.1.1.2> local domain socket path is /home/ansible/.ansible/pc/c3161504c3
redirecting (type: action) cisco.nxos.nxos_install_os to cisco.nxos.nxos
redirecting (type: action) cisco.nxos.nxos_install_os to cisco.nxos.nxos
<10.1.1.2> PERSISTENT_COMMAND_TIMEOUT is 600
<10.1.1.2> PERSISTENT_CONNECT_TIMEOUT is 600
<10.1.1.2> ANSIBLE_NETWORK_IMPORT_MODULES: enabled
<10.1.1.2> ANSIBLE_NETWORK_IMPORT_MODULES: found cisco.nxos.nxos_install_os  at /home/ansible/.ansible/collections/ansible_collections/cisco/nxos/plugins/modules/nxos_install_os.py
<10.1.1.2> ANSIBLE_NETWORK_IMPORT_MODULES: running cisco.nxos.nxos_install_os
<10.1.1.2> ANSIBLE_NETWORK_IMPORT_MODULES: complete
<10.1.1.2> ANSIBLE_NETWORK_IMPORT_MODULES: Result: {'changed': True, 'install_state': ['Compatibility check is done:', 'Module  bootable          Impact  Install-type  Reason', '------  --------  --------------  ------------  ------', '     1       yes      disruptive         reset  default upgrade is not hitless', 'Images will be upgraded according to following table:', 'Module       Image                  Running-Version(pri:alt)           New-Version  Upg-Required', '------  ----------  ----------------------------------------  --------------------  ------------', '     1        nxos                               7.0(3)I7(8)                9.3(9)           yes', '     1        bios                        v5.0.0(06/06/2018)    v5.0.0(06/06/2018)            no', '     1   power-seq                                       5.5                   5.5            no', 'Module 1: Refreshing compact flash and upgrading bios/loader/bootrom.'], 'invocation': {'module_args': {'system_image_file': 'n3000-compact.9.3.9.bin', 'issu': 'desired', 'kickstart_image_file': None, 'provider': None}}, '_ansible_parsed': True}
changed: [N3048_1] => {
    "changed": true,
    "install_state": [
        "Compatibility check is done:",
        "Module  bootable          Impact  Install-type  Reason",
        "------  --------  --------------  ------------  ------",
        "     1       yes      disruptive         reset  default upgrade is not hitless",
        "Images will be upgraded according to following table:",
        "Module       Image                  Running-Version(pri:alt)           New-Version  Upg-Required",
        "------  ----------  ----------------------------------------  --------------------  ------------",
        "     1        nxos                               7.0(3)I7(8)                9.3(9)           yes",
        "     1        bios                        v5.0.0(06/06/2018)    v5.0.0(06/06/2018)            no",
        "     1   power-seq                                       5.5                   5.5            no",
        "Module 1: Refreshing compact flash and upgrading bios/loader/bootrom."
    ],
    "invocation": {
        "module_args": {
            "issu": "desired",
            "kickstart_image_file": null,
            "provider": null,
            "system_image_file": "n3000-compact.9.3.9.bin"
        }
    }
}

TASK [Debug upgrade] ************************************************************************************************************************************************************
task path: /home/ansible/testDir/upgrade.yaml:11
<10.1.1.2> attempting to start connection
<10.1.1.2> using connection plugin ansible.netcommon.network_cli
Found ansible-connection at path /home/ansible/.local/bin/ansible-connection
<10.1.1.2> found existing local domain socket, using it!
<10.1.1.2> invoked shell using ssh_type: paramiko
<10.1.1.2> ssh connection done, setting terminal
<10.1.1.2> loaded terminal plugin for network_os cisco.nxos.nxos
<10.1.1.2> Response received, triggered 'persistent_buffer_read_timeout' timer of 0.1 seconds
<10.1.1.2> firing event: on_open_shell()
<10.1.1.2> Response received, triggered 'persistent_buffer_read_timeout' timer of 0.1 seconds
<10.1.1.2> Response received, triggered 'persistent_buffer_read_timeout' timer of 0.1 seconds
<10.1.1.2> ssh connection has completed successfully
<10.1.1.2> Response received, triggered 'persistent_buffer_read_timeout' timer of 0.1 seconds
<10.1.1.2> Response received, triggered 'persistent_buffer_read_timeout' timer of 0.1 seconds
<10.1.1.2> Response received, triggered 'persistent_buffer_read_timeout' timer of 0.1 seconds
<10.1.1.2> Response received, triggered 'persistent_buffer_read_timeout' timer of 0.1 seconds
<10.1.1.2> Response received, triggered 'persistent_buffer_read_timeout' timer of 0.1 seconds
<10.1.1.2> Response received, triggered 'persistent_buffer_read_timeout' timer of 0.1 seconds
<10.1.1.2> Response received, triggered 'persistent_buffer_read_timeout' timer of 0.1 seconds
<10.1.1.2> Response received, triggered 'persistent_buffer_read_timeout' timer of 0.1 seconds
<10.1.1.2> Response received, triggered 'persistent_buffer_read_timeout' timer of 0.1 seconds
<10.1.1.2> Response received, triggered 'persistent_buffer_read_timeout' timer of 0.1 seconds
<10.1.1.2> Response received, triggered 'persistent_buffer_read_timeout' timer of 0.1 seconds
<10.1.1.2> Response received, triggered 'persistent_buffer_read_timeout' timer of 0.1 seconds
<10.1.1.2> Response received, triggered 'persistent_buffer_read_timeout' timer of 0.1 seconds
<10.1.1.2> Response received, triggered 'persistent_buffer_read_timeout' timer of 0.1 seconds
<10.1.1.2> updating play_context for connection
<10.1.1.2> 
<10.1.1.2> local domain socket path is /home/ansible/.ansible/pc/c3161504c3
ok: [N3048_1] => {
    "debugvar": {
        "changed": true,
        "failed": false,
        "install_state": [
            "Compatibility check is done:",
            "Module  bootable          Impact  Install-type  Reason",
            "------  --------  --------------  ------------  ------",
            "     1       yes      disruptive         reset  default upgrade is not hitless",
            "Images will be upgraded according to following table:",
            "Module       Image                  Running-Version(pri:alt)           New-Version  Upg-Required",
            "------  ----------  ----------------------------------------  --------------------  ------------",
            "     1        nxos                               7.0(3)I7(8)                9.3(9)           yes",
            "     1        bios                        v5.0.0(06/06/2018)    v5.0.0(06/06/2018)            no",
            "     1   power-seq                                       5.5                   5.5            no",
            "Module 1: Refreshing compact flash and upgrading bios/loader/bootrom."
        ]
    }
}
META: ran handlers
META: ran handlers

PLAY RECAP **********************************************************************************************************************************************************************
N3048_1                    : ok=2    changed=1    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   

Edit: This also works fine on a N3K-C3548P-10G.

YooBZH avatar Sep 14 '22 13:09 YooBZH

cc @mikewiebe @praveenramoorthy - Could you please help us out with this? Thank you.

NilashishC avatar Nov 10 '22 11:11 NilashishC

@boleslawlucjanek @YooBZH - I was able to repro this issue on N9K-C93180YC-EX. Checking on this. Thanks.

praveenramoorthy avatar Nov 14 '22 14:11 praveenramoorthy

We are continuing to debug this. So far, we have found that whenever the switchover happens as part of the upgrade, connectivity is momentarily lost, and that stalls the Ansible playbook run.

NilashishC avatar Mar 01 '23 12:03 NilashishC

Any news on this one? I managed to work around the issue with a lot of "block & rescue", timers, and checks, but at the cost that my playbooks can now take much longer to run when the issue is hit.

YooBZH avatar Jul 21 '23 10:07 YooBZH
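
The "block & rescue" workaround mentioned in the last comment is not shown in the thread. The following is a minimal sketch of what such a workaround might look like, not the commenter's actual playbook. The variable names `target_image` and `target_version`, the delay and timeout values, and the choice to verify the result through `show version` are all assumptions made for illustration.

```yaml
- name: Upgrade NX-OS and tolerate the timeout described in this issue (sketch)
  hosts: nxos
  gather_facts: false
  vars:
    target_image: nxos.9.3.9.bin   # image name taken from this issue
    target_version: "9.3(9)"       # hypothetical: string expected in 'show version'
  tasks:
    - name: Run the install, falling back to manual checks on failure
      block:
        - name: Upgrade NXOS
          cisco.nxos.nxos_install_os:
            system_image_file: "{{ target_image }}"
            issu: desired
      rescue:
        # Reached when the install task fails, for example with the
        # 600-second timeout reported above.
        - name: Wait until the switch accepts SSH connections again
          ansible.builtin.wait_for:
            host: "{{ ansible_host }}"
            port: 22
            delay: 30       # assumed grace period before the first probe
            timeout: 1200   # assumed upper bound for the reload
          delegate_to: localhost

        - name: Read the running version
          cisco.nxos.nxos_command:
            commands:
              - show version
          register: version_output

        - name: Fail only if the target version is not running
          ansible.builtin.assert:
            that:
              - target_version in version_output.stdout[0]
            fail_msg: "Upgrade to {{ target_version }} could not be confirmed"
```

With this layout the play continues when the install task times out but the switch turns out to be running the target version, and still fails when the version check does not match. As the comment notes, the cost is extra run time whenever the rescue path is taken.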