xcat-core
Nodestat reporting an incorrect osimage when net booting
The system consists of stateless compute nodes. The nodes boot to the correct osimage, but the xCAT nodestat command reports the wrong osimage. This happens on multiple nodes. Prior to this run, the osimage ending in _t0 was removed from the xCAT tables, but nodestat continues to report the wrong netboot image, as shown below.
nodeset xyz6cn-024 osimage=abc_xyz_compute_prod
xyz6cn-024: netboot rhels8.3.0-x86_64-abc_xyz_compute_prod
xyz6cn-024: netboot rhels8.3.0-x86_64-abc_xyz_compute_prod
After a cold boot of the node, nodestat reports netboot on an image that is no longer defined:
[root@xyz6mgt-001 2022/07/27 19:30:47]/install/abc/backup-2022-07-19_1906# nodestat xyz6cn-024
xyz6cn-024: netboot rhels8.3.0-x86_64-abc_xyz_compute_t0
[root@xyz6mgt-001 2022/07/27 19:28:13]/install/abc/backup-2022-07-19_1906# nodestat xyz6cn-024 -m
xyz6cn-024: nrpe,pbs,ssh
Upon validation, the node is booted to the correct osimage and not the _t0 image listed above.
@Emohseni
- What version of xCAT are you running?
- Can you show the output of lsdef xyz6cn-024?
- Does the output xyz6cn-024: netboot rhels8.3.0-x86_64-abc_xyz_compute_t0 get displayed while the node is in the process of booting, or after the node has finished booting?
Version 2.16.1.
[root@xyz6sn-001 ~]# lsdef xyz6cn-024
Object name: xyz6cn-024
    addkcmdline=nomodeset intel_pstate=passive clocksource=tsc tsc=reliable
    arch=x86_64
    bmc=xyz6cn-024-rm
    chain=runcmd=bmcsetup,shell
    chassis=xyz6smm-002
    cons=ipmi
    currstate=netboot rhels8.3.0-x86_64-abc_xyz_compute_prod
    groups=r1,r2cn,cn,all,compute,sd650v2,ipmi
    ip=
    mac=
    mgt=ipmi
    netboot=xnba
    nicextraparams.ib0=GATEWAY=
    nichostnamesuffixes.ib0=-ib
    nicips.ib0=
    nicnetworks.ib0=ib
    nictypes.ib0=Infiniband
    nodetype=osi
    ondiscover=nodediscover
    os=rhels8.3.0
    otherinterfaces=-rm:
    postbootscripts=otherpkgs
    postscripts=syslog,remoteshell,setupntp,syncfiles,confignetwork -s,abc/postinst.sh
    profile=abc_xyz_compute_prod
    provmethod=abc_xyz_compute_prod
    serialport=0
    serialspeed=115200
    servicenode=xyz6sn-001,xyz6sn-002
    slot=12
    status=failed
    statustime=07-27-2022 19:33:47
    updatestatus=syncing
    updatestatustime=07-07-2022 23:32:33
    xcatmaster=xyz6sn-001-cn
The output is shown after the node has finished the boot process.
I think netboot <osimage name> should only be reported while the node is booting. Once the boot process is finished, nodestat should display sshd. Perhaps your node did not finish the boot process cleanly? I noticed status=failed.
Check /var/log/xcat/xcat.log on the compute node to see if any errors were logged.
The reported error is that nodestat reports the wrong osimage during netboot. When the node boots up successfully, it does report sshd and other services. Rerunning nodeset does not change the output of nodestat, even though the _t0 osimage definition was removed from the cluster and is no longer shown by lsdef -t osimage.
In short, nodestat does not report the current netboot osimage state of the node.
@Emohseni
It looks like, while a diskless node is booting, nodestat gets the node status by calling nodeset <node> stat.
Can you run nodeset xyz6cn-024 stat while the node xyz6cn-024 is booting and nodestat xyz6cn-024 reports the incorrect osimage name?
[root@xyz6sn-001 ~]# nodeset xyz6cn-024 stat
xyz6cn-024: netboot rhels8.3.0-x86_64-ssc_xyz_compute_prod
xyz6cn-024: netboot rhels8.3.0-x86_64-ssc_xyz_compute_prod
xyz6cn-024: netboot rhels8.3.0-x86_64-ssc_xyz_compute_t0
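The output above lists the node three times with two different images. A minimal sketch for flagging that kind of disagreement is shown here; the function names `distinct_states` and `states_agree` are hypothetical helpers, not part of xCAT, and they only post-process whatever text `nodeset <node> stat` prints.

```shell
#!/bin/sh
# Hypothetical helpers (not part of xCAT): post-process "nodeset <node> stat" output.

# Print each distinct status line read from stdin exactly once.
distinct_states() {
    sort -u
}

# Exit 0 when stdin contains at most one distinct line, 1 otherwise.
states_agree() {
    awk '!seen[$0]++ { n++ } END { exit (n > 1) }'
}
```

One possible use is `nodeset xyz6cn-024 stat | states_agree || echo "conflicting netboot images reported"`, followed by `nodeset xyz6cn-024 stat | distinct_states` to see which images are involved.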
How about ls -l /tftpboot/xcat/xnba/nodes/xyz6cn-024* and
grep "netboot rhel" /tftpboot/xcat/xnba/nodes/xyz6cn-024*
Service node:
[root@xyz6sn-001 ~]# ls -l /tftpboot/xcat/xnba/nodes/xyz6cn-024*
-rw-r--r-- 1 root root 584 Jul 27 19:28 /tftpboot/xcat/xnba/nodes/xyz6cn-024
-rw-r--r-- 1 root root 550 Jul 27 19:28 /tftpboot/xcat/xnba/nodes/xyz6cn-024.uefi
[root@xyz6sn-001 ~]# ls -l /tftpboot/xcat/xnba/nodes/xyz6cn-024* |grep "netboot rhel" /tftpboot/xcat/xnba/nodes/xyz6cn-024*
/tftpboot/xcat/xnba/nodes/xyz6cn-024:#netboot rhels8.3.0-x86_64-ssc_xyz_compute_prod
Management node:
[root@xyz6mgt-001 2022/07/28 18:53:08]/var/log/xcat# ls -l /tftpboot/xcat/xnba/nodes/xyz6cn-024* |grep "netboot rhel" /tftpboot/xcat/xnba/nodes/xyz6cn-024*
/tftpboot/xcat/xnba/nodes/xyz6cn-024:#netboot rhels8.3.0-x86_64-ssc_xyz_compute_t0
It seems /tftpboot is not being updated or cleaned up on the main management node. What is the action plan? Why is the management node not being updated by nodeset osimage?
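The grep output from the two hosts shows that each xnba node file records its image in a `#netboot <osimage>` comment line. As a rough sketch, assuming that comment format holds, the comparison can be scripted; `netboot_image` and `same_image` are hypothetical helper names, and the second file is assumed to be a copy fetched from the other host by whatever means is convenient.

```shell
#!/bin/sh
# Hypothetical helpers (not part of xCAT).

# Print the osimage named in the "#netboot <osimage>" comment of the file $1.
# Prints nothing when the file has no such comment line.
netboot_image() {
    sed -n 's/^#netboot[[:space:]]\{1,\}//p' "$1" | head -n 1
}

# Exit 0 when both files record the same, non-empty image name.
same_image() {
    a=$(netboot_image "$1")
    b=$(netboot_image "$2")
    [ -n "$a" ] && [ "$a" = "$b" ]
}
```

For example, `netboot_image /tftpboot/xcat/xnba/nodes/xyz6cn-024` could be run on the management node and on the service node, and the two results compared by eye or with `same_image` on local copies of both files.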
What are your sharedtftp and tftpdir settings in the site table on the management node?
Management node
[root@xyz6mgt-001 ]~# lsdef -t site clustersite | grep shared
sharedtftp=0
[root@xyz6mgt-001 ]~# lsdef -t site clustersite | grep tft
sharedtftp=0
tftpdir=/tftpboot
Servicenode
[root@xyz6sn-001 ~]# lsdef -t site clustersite | grep tft
sharedtftp=0
tftpdir=/tftpboot
[root@xyz6sn-001 ~]# lsdef -t site clustersite | grep shared
sharedtftp=0
By default, sharedtftp=1. With that setting, /tftpboot on the service node is mounted from the management node.
If sharedtftp=0, you need to update /tftpboot on the service node manually.
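A small sketch of turning that rule into a check is shown here. It only parses the `lsdef -t site clustersite` output in the form shown earlier in this thread; `sharedtftp_value` and `tftp_advice` are hypothetical helper names, and the messages simply restate the guidance above.

```shell
#!/bin/sh
# Hypothetical helpers (not part of xCAT).

# Read "lsdef -t site clustersite" output on stdin and print the sharedtftp value.
sharedtftp_value() {
    sed -n 's/^[[:space:]]*sharedtftp=//p' | head -n 1
}

# Map a sharedtftp value ($1) to the guidance given in this thread.
tftp_advice() {
    case "$1" in
        0) echo "sharedtftp=0: update /tftpboot on the service node manually" ;;
        1) echo "sharedtftp=1: /tftpboot on the service node is mounted from the management node" ;;
        *) echo "sharedtftp is not set explicitly; the default is 1" ;;
    esac
}
```

A possible invocation is `tftp_advice "$(lsdef -t site clustersite | sharedtftp_value)"`.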