ansible-role-interfaces
Configuring an IP over IB (`ipoib`) interface fails with "Interface ib0 is not active"
I'm trying to set up an InfiniBand interface on a Mellanox ConnectX-6 with OFED driver version 5.5-1.0.3.2 on Rocky 8.5.
The drivers are installed and the interfaces can be brought up manually.
I'm calling the role like this because the role has already been called earlier to set up the real Ethernet interfaces:
---
- name: Configure Infiniband interfaces
  hosts: infiniband
  tasks:
    - name: Configure Infiniband interfaces
      import_role:
        name: michaelrigart.interfaces
      vars:
        interfaces_pause_time: 120
        interfaces_ether_interfaces:
          - device: "{{ infiniband_interface }}"
            bootproto: static
            address: "{{ ib_ip }}"
            netmask: "{{ infiniband_netmask }}"
            type: ipoib
      become: true
I've added `interfaces_pause_time: 120` because I assumed the interfaces were just taking time to become active after being bounced.
However, when I execute the playbook, it ends with:
RUNNING HANDLER [michaelrigart.interfaces : Check active Ethernet interface state] *********************************************
failed: [ib-host11] (item={'device': 'ib0', 'bootproto': 'static', 'address': '10.10.10.11', 'netmask': '255.255.252.0', 'type': 'ipoib'}) => {"ansible_loop_var": "item", "changed": false, "item": {"address": "10.10.10.11", "bootproto": "static", "device": "ib0", "netmask": "255.255.252.0", "type": "ipoib"}, "msg": "Interface ib0 is not active"}
I've checked for other `ipoib` issues: #76 and #58 look like they've been resolved, and they don't seem to help with this one.
Hi @Aethylred. You can see where that error is generated here. It means that the Ansible fact for the interface has marked it as not active.
You could check the actual interface status, to see if it is up. You could also check the generated ifcfg file, to see if it is as you would expect.
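As a rough sketch, both checks could also be run from Ansible instead of logging in by hand. The play below assumes the device is `ib0` (taken from the error message) and that the generated file lives under `/etc/sysconfig/network-scripts/`, which is the usual location on RHEL-family hosts; `ib_device` is a hypothetical variable introduced only for this example.

```yaml
---
- name: Inspect the IPoIB interface after a failed run
  hosts: infiniband
  become: true
  gather_facts: false
  vars:
    ib_device: ib0  # assumption: device name from the error message
  tasks:
    - name: Show the link state as the kernel reports it
      ansible.builtin.command: ip -br link show {{ ib_device }}
      register: ib_link
      changed_when: false

    - name: Read the generated ifcfg file
      ansible.builtin.slurp:
        # assumption: RHEL-style network-scripts location
        src: /etc/sysconfig/network-scripts/ifcfg-{{ ib_device }}
      register: ib_ifcfg

    - name: Print both so they can be compared
      ansible.builtin.debug:
        msg:
          - "{{ ib_link.stdout }}"
          - "{{ ib_ifcfg.content | b64decode }}"
```

Running this straight after the failure shows whether the link is still down at that moment or whether only the gathered fact is stale.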
After the playbook fails, I logged into the host: `ifcfg-ib0` looks good and `ifup ib0` works.
If I extend the interface pause to `interfaces_pause_time: 300`, then it succeeds.
I think there may be a delay while the interface and our subnet manager sort themselves out.
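One way to test that theory, sketched here under assumptions: the kernel exposes each InfiniBand port's logical state in sysfs, and a port normally sits in `INIT` until a subnet manager has configured it, after which it reports `ACTIVE`. The task below just prints those state files. The glob is an assumption that covers whatever HCA and port names exist on the host, and the task will fail if no such files are present.

```yaml
- name: Show the logical state of every InfiniBand port
  # grep -H prefixes each line with the file it came from,
  # so the HCA and port number are visible in the output.
  ansible.builtin.shell: grep -H . /sys/class/infiniband/*/ports/*/state
  register: ib_port_state
  changed_when: false

- name: Print the port states
  ansible.builtin.debug:
    var: ib_port_state.stdout_lines
```

If the ports still show `INIT` when the handler fails and `ACTIVE` a few minutes later, that would support the subnet manager explanation.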
Interesting. Is there anything we need to change here?
Not sure. I think it would be better if the role polled for the interface being 'ready' or 'active', rather than refreshing the facts after a fixed pause to get the interface state.
Ideally the polling would have a retry limit and a timeout.
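A minimal sketch of what such polling might look like, using Ansible's standard `until`/`retries`/`delay` keywords. This is not what the role does today. It assumes `/sys/class/net/<device>/operstate` reads `up` once the IPoIB link is usable, and `ib_device` is a hypothetical variable standing in for the device name.

```yaml
- name: Wait for the IPoIB interface to become active
  ansible.builtin.command: cat /sys/class/net/{{ ib_device }}/operstate
  register: ib_operstate
  changed_when: false
  # Re-run the check until the kernel reports the link as up.
  until: ib_operstate.stdout == "up"
  retries: 30  # retry limit
  delay: 10    # seconds between attempts, so roughly 300 s in total
```

The task returns as soon as the interface comes up, so hosts that are ready quickly are not held for the full wait, and the product of `retries` and `delay` acts as the timeout. The values here were picked only to match the 300-second pause that worked above.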