Nodestat reporting an incorrect osimage when net booting

Open Emohseni opened this issue 3 years ago • 13 comments

The system consists of stateless compute nodes. The nodes boot to the correct osimage, but the xCAT nodestat command reports the wrong osimage. This happens on multiple nodes. Prior to this run, the osimage ending in _t0 was removed from the xCAT tables, but nodestat continues to report the wrong netboot image, as shown below.

nodeset xyz6cn-024 osimage=abc_xyz_compute_prod
xyz6cn-024: netboot rhels8.3.0-x86_64-abc_xyz_compute_prod
xyz6cn-024: netboot rhels8.3.0-x86_64-abc_xyz_compute_prod

After a cold boot of the node, nodestat reports a netboot of an image that is no longer defined:

[root@xyz6mgt-001 2022/07/27 19:30:47]/install/abc/backup-2022-07-19_1906# nodestat xyz6cn-024
xyz6cn-024: netboot rhels8.3.0-x86_64-abc_xyz_compute_t0

[root@xyz6mgt-001 2022/07/27 19:28:13]/install/abc/backup-2022-07-19_1906# nodestat xyz6cn-024 -m
xyz6cn-024: nrpe,pbs,ssh

Upon validation, the node is booted to the correct osimage and not the _t0 image listed above.

Emohseni avatar Jul 27 '22 19:07 Emohseni

@Emohseni

  • What version of xCAT are you running?
  • Can you show the output of lsdef xyz6cn-024?
  • Does the output xyz6cn-024: netboot rhels8.3.0-x86_64-abc_xyz_compute_t0 get displayed while the node is in the process of booting, or after the node has finished booting?

gurevichmark avatar Jul 27 '22 20:07 gurevichmark

Version 2.16.1.

[root@xyz6sn-001 ~]# lsdef xyz6cn-024
Object name: xyz6cn-024
    addkcmdline=nomodeset intel_pstate=passive clocksource=tsc tsc=reliable
    arch=x86_64
    bmc=xyz6cn-024-rm
    chain=runcmd=bmcsetup,shell
    chassis=xyz6smm-002
    cons=ipmi
    currstate=netboot rhels8.3.0-x86_64-abc_xyz_compute_prod
    groups=r1,r2cn,cn,all,compute,sd650v2,ipmi
    ip=
    mac=
    mgt=ipmi
    netboot=xnba
    nicextraparams.ib0=GATEWAY=
    nichostnamesuffixes.ib0=-ib
    nicips.ib0=
    nicnetworks.ib0=ib
    nictypes.ib0=Infiniband
    nodetype=osi
    ondiscover=nodediscover
    os=rhels8.3.0
    otherinterfaces=-rm:
    postbootscripts=otherpkgs
    postscripts=syslog,remoteshell,setupntp,syncfiles,confignetwork -s,abc/postinst.sh
    profile=abc_xyz_compute_prod
    provmethod=abc_xyz_compute_prod
    serialport=0
    serialspeed=115200
    servicenode=xyz6sn-001,xyz6sn-002
    slot=12
    status=failed
    statustime=07-27-2022 19:33:47
    updatestatus=syncing
    updatestatustime=07-07-2022 23:32:33
    xcatmaster=xyz6sn-001-cn

The output is shown after the node has finished the boot process.

Emohseni avatar Jul 27 '22 20:07 Emohseni

I think netboot <osimage name> should only be reported while the node is booting. Once the boot process is finished, nodestat should display sshd. Perhaps your node did not cleanly finish the boot process? I noticed status=failed. Check /var/log/xcat/xcat.log on the compute node to see if any errors were logged.

gurevichmark avatar Jul 28 '22 15:07 gurevichmark

The reported error is that nodestat shows the wrong osimage during netboot. When the node boots up successfully, it does report sshd and the other services. Rerunning nodeset does not change the output of nodestat, even though the _t0 osimage definition was removed from the cluster and is no longer shown by lsdef -t osimage.

Emohseni avatar Jul 28 '22 16:07 Emohseni

The error is that nodestat does not report the current netboot osimage state of the node.

Emohseni avatar Jul 28 '22 16:07 Emohseni

@Emohseni

It looks like, while a diskless node is booting, nodestat gets the node status by calling nodeset <node> stat. Can you run nodeset xyz6cn-024 stat while the node xyz6cn-024 is booting and nodestat xyz6cn-024 reports the incorrect osimage name?

gurevichmark avatar Jul 28 '22 18:07 gurevichmark

[root@xyz6sn-001 ~]# nodeset xyz6cn-024 stat
xyz6cn-024: netboot rhels8.3.0-x86_64-ssc_xyz_compute_prod
xyz6cn-024: netboot rhels8.3.0-x86_64-ssc_xyz_compute_prod
xyz6cn-024: netboot rhels8.3.0-x86_64-ssc_xyz_compute_t0
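
The three lines above disagree about the osimage for the same node. A minimal sketch of how such output could be checked automatically is shown here. It assumes only the "<node>: netboot <osimage>" line format seen in this thread; the function names are hypothetical and not part of xCAT.

```python
import re

# Matches lines such as "xyz6cn-024: netboot rhels8.3.0-x86_64-ssc_xyz_compute_prod".
# The format is taken from the output in this thread, not from xCAT documentation.
STAT_LINE = re.compile(r"^(?P<node>\S+):\s+netboot\s+(?P<image>\S+)\s*$")


def netboot_images(stat_output):
    """Map each node to the distinct osimage names reported for it, in first-seen order."""
    images = {}
    for line in stat_output.splitlines():
        match = STAT_LINE.match(line.strip())
        if not match:
            continue  # skip lines that are not "netboot <osimage>" reports
        seen = images.setdefault(match.group("node"), [])
        if match.group("image") not in seen:
            seen.append(match.group("image"))
    return images


def conflicting_nodes(stat_output):
    """Return only the nodes for which more than one osimage name was reported."""
    return {
        node: names
        for node, names in netboot_images(stat_output).items()
        if len(names) > 1
    }


# Sample copied from the `nodeset xyz6cn-024 stat` output in this thread.
sample = """\
xyz6cn-024: netboot rhels8.3.0-x86_64-ssc_xyz_compute_prod
xyz6cn-024: netboot rhels8.3.0-x86_64-ssc_xyz_compute_prod
xyz6cn-024: netboot rhels8.3.0-x86_64-ssc_xyz_compute_t0
"""

if __name__ == "__main__":
    for node, names in conflicting_nodes(sample).items():
        print(f"{node}: conflicting osimages: {', '.join(names)}")
```

In practice the text passed to conflicting_nodes would be the captured output of nodeset <node> stat; a node that appears in the result is one for which different responders hold different boot state.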

Emohseni avatar Jul 28 '22 18:07 Emohseni

How about ls -l /tftpboot/xcat/xnba/nodes/xyz6cn-024* and grep "netboot rhel" /tftpboot/xcat/xnba/nodes/xyz6cn-024*

gurevichmark avatar Jul 28 '22 18:07 gurevichmark

Service node:

[root@xyz6sn-001 ~]# ls -l /tftpboot/xcat/xnba/nodes/xyz6cn-024*
-rw-r--r-- 1 root root 584 Jul 27 19:28 /tftpboot/xcat/xnba/nodes/xyz6cn-024
-rw-r--r-- 1 root root 550 Jul 27 19:28 /tftpboot/xcat/xnba/nodes/xyz6cn-024.uefi

[root@xyz6sn-001 ~]# ls -l /tftpboot/xcat/xnba/nodes/xyz6cn-024* |grep "netboot rhel" /tftpboot/xcat/xnba/nodes/xyz6cn-024*
/tftpboot/xcat/xnba/nodes/xyz6cn-024:#netboot rhels8.3.0-x86_64-ssc_xyz_compute_prod

Management node:

[root@xyz6mgt-001 2022/07/28 18:53:08]/var/log/xcat#  ls -l /tftpboot/xcat/xnba/nodes/xyz6cn-024* |grep "netboot rhel" /tftpboot/xcat/xnba/nodes/xyz6cn-024*
/tftpboot/xcat/xnba/nodes/xyz6cn-024:#netboot rhels8.3.0-x86_64-ssc_xyz_compute_t0
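
The comparison above (service node says _prod, management node says _t0) can be scripted. The following is a rough sketch that assumes the node file contains a comment line of the form "#netboot <osimage>", as the grep output suggests. The rest of the file's contents is not shown in this thread, so the sample strings hold only that one line, and the helper names are hypothetical.

```python
import re

# The grep output in this thread shows a comment line "#netboot <osimage>" in the
# xnba node file. Nothing else about the file layout is assumed here.
NETBOOT_COMMENT = re.compile(r"^#netboot\s+(\S+)", re.MULTILINE)


def netboot_comment(file_text):
    """Return the osimage named in the '#netboot' comment, or None if absent."""
    match = NETBOOT_COMMENT.search(file_text)
    return match.group(1) if match else None


def stale_hosts(expected_image, files_by_host):
    """Return the hosts whose node file names a different osimage (or none at all)."""
    return sorted(
        host
        for host, text in files_by_host.items()
        if netboot_comment(text) != expected_image
    )


# Illustrative input: one entry per host, holding the text of
# /tftpboot/xcat/xnba/nodes/xyz6cn-024 as read on that host.
files_by_host = {
    "xyz6sn-001": "#netboot rhels8.3.0-x86_64-ssc_xyz_compute_prod\n",
    "xyz6mgt-001": "#netboot rhels8.3.0-x86_64-ssc_xyz_compute_t0\n",
}

if __name__ == "__main__":
    expected = "rhels8.3.0-x86_64-ssc_xyz_compute_prod"
    for host in stale_hosts(expected, files_by_host):
        print(f"{host}: node file does not reference {expected}")
```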

Emohseni avatar Jul 28 '22 23:07 Emohseni

It seems that /tftpboot on the main management node is not being updated or cleaned up. What is the action plan? Why is the management node not being updated by nodeset osimage?

Emohseni avatar Jul 28 '22 23:07 Emohseni

What are your sharedtftp and tftpdir settings in the site table on the management node?

gurevichmark avatar Jul 29 '22 13:07 gurevichmark

Management node

[root@xyz6mgt-001 ]~#  lsdef -t site clustersite | grep shared
    sharedtftp=0
[root@xyz6mgt-001 ]~#  lsdef -t site clustersite | grep tft
    sharedtftp=0
    tftpdir=/tftpboot

Servicenode

[root@xyz6sn-001 ~]# lsdef -t site clustersite | grep tft
    sharedtftp=0
    tftpdir=/tftpboot
[root@xyz6sn-001 ~]# lsdef -t site clustersite | grep shared
    sharedtftp=0

Emohseni avatar Aug 01 '22 16:08 Emohseni

By default, sharedtftp=1. With that setting, /tftpboot is mounted on the service node from the management node. If sharedtftp=0, you need to manually update /tftpboot on the service node.
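
As a rough illustration of that rule, the sketch below parses the key=value lines printed by lsdef -t site clustersite and flags the sharedtftp=0 case. The default of 1 comes from the comment above; the function names are hypothetical, and only the literal value 0 is treated as "not shared", which is an assumption.

```python
def parse_attrs(lsdef_output):
    """Parse 'key=value' lines (as printed by lsdef) into a dict; other lines are ignored."""
    attrs = {}
    for line in lsdef_output.splitlines():
        key, sep, value = line.strip().partition("=")
        if sep and key:
            attrs[key] = value
    return attrs


def manual_tftp_update_needed(attrs):
    """True when sharedtftp is 0, i.e. /tftpboot is not mounted from the management node.

    A missing sharedtftp attribute is treated as the default of 1 mentioned in this thread.
    """
    return attrs.get("sharedtftp", "1") == "0"


# Sample copied from the site table output posted in this thread.
sample = """\
    sharedtftp=0
    tftpdir=/tftpboot
"""

if __name__ == "__main__":
    site = parse_attrs(sample)
    if manual_tftp_update_needed(site):
        print(f"sharedtftp=0: {site.get('tftpdir', 'tftpdir')} must be kept in sync manually")
```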

gurevichmark avatar Aug 01 '22 19:08 gurevichmark