xcat-core
Some nodes fail to provision on xCAT 2.16.4
Hi everyone,
We have a cluster of around 2100 nodes. xCAT on the main management node was recently upgraded to 2.16.4 to support upgrading the compute nodes to RHEL 8.5. The compute nodes are all diskful and were previously on RHEL 7.9; the xCAT management node is on RHEL 7.7. The compute nodes are Intel S2600BP blades that boot via PXE, and I have set the PXE boot order correctly.
So far I have upgraded 246 nodes on the cluster with no issue, but 11 nodes fail to install. When I run rinstall on one of them, the node goes into a PXE loop and never installs. Please see the attached picture for the error.
The nodes also no longer discover as they used to. By "discover" I mean that if I replace a blade, or remove the MAC address of a blade, and then boot it using the correct PXE boot method, it fails to be rediscovered, which was not the case before. I then have to enter the MAC address manually. All blades have the same firmware and the same BIOS settings.
Any clue why this is happening? Please see attached.
Regards,
Wasim
@Wasimwani1
I'm not sure what the problem is, but here are a few things you could pursue.
1.) Check to see if you are getting consistent DHCP behavior between a good node and bad node using xcatprobe detect_dhcpd:
```
xcatprobe detect_dhcpd -i MN_NODE_INTERFACE_NAME -m CN_MAC_ADDRESS
```
Example:
```
[root@mgmt01 ~]# lsdef cnf99p06 -i mac
Object name: cnf99p06
    mac=70:e2:84:14:28:99
[root@mgmt01 ~]# tabdump site | grep dhcpinterfaces
"dhcpinterfaces","enP1p9s0f2,enP1p9s0f3:noboot",,
[root@mgmt01 ~]# xcatprobe detect_dhcpd -i enP1p9s0f2 -m 70:e2:84:14:28:99
Start to detect DHCP, please wait 10 seconds [INFO]
++++++++++++++++++++++++++++++++++ [INFO]
There are 1 servers replied to dhcp discover. [INFO]
Server:10.7.0.225 assign IP [10.7.99.6]. The next server is [10.7.0.225]! [INFO]
++++++++++++++++++++++++++++++++++
```
This could help you determine whether the DHCP information for the bad compute nodes needs to be added to your DHCP configuration using makedhcp nodename, or whether something strange is going on with your hierarchical DHCP configuration.
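To make the good-node versus bad-node comparison easier to eyeball, here is a rough sketch of a helper that boils the detect_dhcpd output down to one line per DHCP reply. The function name `summarize_dhcp_replies` is made up for illustration, and the pattern is based only on the sample output in the example above, so it may need adjusting if your xcatprobe version prints the reply line differently.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of xCAT): reads `xcatprobe detect_dhcpd`
# output on stdin and prints "server assigned_ip next_server" for each reply.
# Assumes reply lines look like the sample:
#   Server:10.7.0.225 assign IP [10.7.99.6]. The next server is [10.7.0.225]! [INFO]
summarize_dhcp_replies() {
    sed -n 's/^Server:\([^ ]*\) assign IP \[\([^]]*\)\]\. The next server is \[\([^]]*\)\]!.*$/\1 \2 \3/p'
}

# Usage sketch on the management node (interface and MACs are placeholders):
#   xcatprobe detect_dhcpd -i MN_NODE_INTERFACE_NAME -m GOOD_NODE_MAC | summarize_dhcp_replies
#   xcatprobe detect_dhcpd -i MN_NODE_INTERFACE_NAME -m BAD_NODE_MAC  | summarize_dhcp_replies
```

If the bad node produces no line at all, or a line with a different server or next server than the good node, that would point toward the DHCP configuration rather than the blade itself.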
2.) Based on the screenshot above, there is some additional information available here: https://ipxe.org/err/040ee1
One of the suggestions on that page is to try the latest version of iPXE, in case there is already a bug fix for the issue you are hitting. Judging by your screenshot, you are not using the latest version of xNBA (which is based on iPXE). You could try upgrading xNBA on your management node to see if it helps.
Latest version is available here: https://xcat.org/files/xcat/repos/yum/latest/xcat-dep/xnba-undi-1.21.1-1.noarch.rpm
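Before upgrading, you could check what is currently installed. The sketch below assumes the installed package is named xnba-undi, as in the RPM file name above, and takes 1.21.1 from that same file name. `version_older` is a hypothetical helper that relies on GNU `sort -V`.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds if version $1 sorts strictly before version $2.
# Relies on GNU sort -V for version ordering.
version_older() {
    [ "$1" != "$2" ] && \
        [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

latest="1.21.1"   # taken from the RPM file name in the link above

# Usage sketch on the management node (assumes the package name xnba-undi):
#   installed=$(rpm -q --queryformat '%{VERSION}' xnba-undi)
#   if version_older "$installed" "$latest"; then
#       echo "xnba-undi $installed is older than $latest"
#   fi
```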